Wrangle OpenStreetMap Data: Zurich, Switzerland

Project 3, Udacity Data Analyst Nanodegree

Contents:

Repository contents
Source data
- Parsing challenges
Data overview
Zurich exploration
Looking ahead
- Payments
- Limits of Zurich-only filter
Resources

Repository contents

Folder/file	Description
`extracts/`	contains osm and json sample data
`images/`	contains images used in this README
`clean_data.py`	master file for OSM XML sanitization
`insert_data.py`	master file for inserting sanitized json data into MongoDB
`process_sample.py`	generate `zurich_larger_extract.osm` for sanitization testing
`report.pdf`	pdf version of this README (though not as nicely formatted)
`user_viz.py`	generate donut chart to visualize top user activity
`zurich_cities.py`	strategy file for auditing cities in the data

Note: complete data files (both source .osm and sanitized .json) have been added to .gitignore to prevent slowness and bloating on GitHub.

Source data

The OpenStreetMap (OSM) extract for Zurich, Switzerland covers the city center and surrounding suburbs, and contains data on establishments, transportation systems, points of interest, and more for the area. The OSM XML dataset from MapZen is 614.7 MB.

In accordance with the OSM XML API, the data consists of three main types of elements:

Nodes
Ways
Relations

In the sanitization process, all three types of elements are accounted for.

XML parsing challenges

Domain & language knowledge

I chose to parse Zurich data because of an interest in the city, not a personal connection. Having never visited or lived in Zurich and not knowing German, I was initially lacking knowledge to audit tag values.

Because of special characters in German (e.g. umlaut), I needed to familiarize myself with Python's handling of unicode.

City names

Because of the umlaut in Zürich and the presence of city/district modifiers, the city names needed to be audited.

Using the fuzzywuzzy string approximation library, here are the top 20 fuzzy matches to Zurich before sanitization. Within each tuple, the string represents the "addr:city" value, and the integer its match ratio: how good of a match it is to the root term, "Zurich."

[('Zurich', 100), ('Zuerich', 92), ('zuerich', 92), ('Zürich', 91), ('zürich', 91), ('Egg bei Zürich', 82), ('Zürich-Flughafen', 78), ('Zürich 50 Oerlikon', 78), ('Zürich-Altstetten', 78), ('Zürich Gockhausen', 78), ('Muri', 68), ('Höri', 60), ('Forch', 55), ('Buchs', 55), ('Embrach', 46), ('Zumikon', 46), ('Zufikon', 46), ('Neerach', 46), ('Uerikon', 46), ('Seuzach', 46)]

Any city value with a score below 78 does not contain Zurich. Thus, to filter down to only Zurich (or city-unspecified) records, I used that ratio cutoff to cull documents that do not directly reference Zurich in "addr:city".

To add complexity, the dataset's "addr:city" values contained a handful of integers. These integers represent Zurich's districts, which are often referenced by number. To standardize city names and keep only Zurich records, I decided to convert any integers I saw in the "addr:city" field into "Zurich".

Tag key separators

OSM records level of specificity in the key values in tags. In the json conversion process, it made the most sense to nest these attributes together, but the approach had to differ based on the tag. For the purposes of this exercise, I only processed two-tiered keys (one separator), and disregarded three-tiered keys.

Two-tier key types & conversions:

<tag k="wheelchair" v="limited"/>
<tag k="wheelchair:description" v="rund 50% der Fahrzeuge verkehren mit Niederflur-Einstieg"/>

# converted to:
"wheelchair": {
  "description": "rund 50% der Fahrzeuge verkehren mit Niederflur-Einstieg",
  "base_value": "limited"
}

<tag k="maxspeed" v="100"/>
<tag k="source:maxspeed" v="sign"/>

# converted to:
"maxspeed": {
  "source": "sign",
  "base_value": "100"
}

Also, some separators departed from the standard : separator, and had ., like surface.material.

Tag value separators

I had to conduct research on the data to ensure that I was handling the data correctly. Take the following XML:

<tag k="destination" v="Bern;Chur;Luzern;Flughafen;Nordring-Zürich"/>
<tag k="source:maxspeed" v="sign"/>
<tag k="destination:symbol" v="airport"/>
...
<tag k="name" v="Salomon-Bleuler-Weg"/>
<tag k="highway" v="residential"/>
<tag k="maxspeed" v="30"/>
...
<tag k="bus:lanes" v="no|designated" />

;, -, and | all potentially act as separators.

The first portion with "Bern;Chur;Luzern;Flughafen;Nordring-Zürich" is meant to be separated, as all of those are towns/cities in Switzerland. The third is also a clear separator. Values are converted to ["Bern","Chur","Luzern","Flughafen","Nordring-Zürich"] and ["no", "designated"].

Searches for "Salomon-Bleuler-Weg" on other maps, however, confirm that the value is indeed the full name of the street.

Data overview

Note: the data is filtered down to Zurich proper, meeting the following conditions:

City name approximately "Zurich"
An integer representing a Zurich district number
No explicit city mentioned (i.e. no "addr:city" field in the document)

File attributes

File	Type	Size
`zurich_switzerland.osm`	Source	614.7 MB
`just_zurich.json`	Sanitized	702.2 MB

Counts

Type	Count	Query
All	3146959	`db.just_zurich.count()`
Nodes	2706650	`db.just_zurich.find({"type":"node"}).count()`
Ways	432670	`db.just_zurich.find({"type":"way"}).count()`
Relations	134	`db.just_zurich.find({"type":"relation"}).count()`

Dates

Type	Result	Query
Oldest record	"2006-05-05T16:19:04Z"	`db.just_zurich.find().sort({"created.timestamp": 1}).limit(1)`
Newest record	"2017-03-11T13:48:07Z"	`db.just_zurich.find().sort({"created.timestamp": -1}).limit(1)`

Users

Unique users

> db.just_zurich.distinct("created.uid").length

Note: "created.user" is not a viable field for uniqueness in this dataset -- four usernames share the same id.

> db.just_zurich.aggregate([{$group: {_id: {uid:"$created.uid"}, users: {$addToSet: "$created.user"}}}, {$project: {_id: 1,users: 1,num_users: { $size: "$users" }}}, {$match: {num_users: {$gt: 1}}}])

{ "_id" : { "uid" : "5351349" }, "users" : [ "Jan:", "Jan Huber" ], "num_users" : 2 }
{ "_id" : { "uid" : "5007203" }, "users" : [ "someone12345678", "ManuDroid94" ], "num_users" : 2 }

Top user

> db.just_zurich.aggregate([{$group: {_id:{"user": "$created.user","uid": "$created.uid"},count:{$sum:1}}},{"$sort": {"count": -1}},{"$limit": 1}])

{ "_id" : { "user" : "mdk", "uid" : "178186" },
  "count" : 566235 }

Top five users vs the rest

> db.just_zurich.aggregate([{$group: {_id:{"user": "$created.user","uid": "$created.uid"},count:{$sum:1}}},{"$sort": {"count": -1}},{"$limit": 5}])

{ "_id" : { "user" : "mdk", "uid" : "178186" }, "count" : 566235 }
{ "_id" : { "user" : "SimonPoole", "uid" : "92387" }, "count" : 334879 }
{ "_id" : { "user" : "Sarob", "uid" : "1218134" }, "count" : 146217 }
{ "_id" : { "user" : "hecktor", "uid" : "465052" }, "count" : 117316 }
{ "_id" : { "user" : "feuerstein", "uid" : "194843" }, "count" : 102162 }

Zurich exploration

For the purposes of this exercise, I'm only considering tags that explicitly list Zurich as a city as within Zurich.

Number of records

> db.just_zurich.aggregate([{"$match": {"addr.city": {"$exists": 1}}}, {$group:{_id:null,count:{$sum:1}}}])

What are the top 5 amenities?

> db.just_zurich.aggregate([ {"$match": {$and: [{"addr.city": {"$exists": 1}}, {"amenity": {"$exists": 1}}]}}, {$group: {_id: "$amenity",count:{$sum:1}}}, {"$sort": {"count": -1}}, {"$limit": 5}])

{ "_id" : "restaurant", "count" : 327 }
{ "_id" : "car_sharing", "count" : 212 }
{ "_id" : "school", "count" : 67 }
{ "_id" : "place_of_worship", "count" : 64 }
{ "_id" : "cafe", "count" : 52 }

Which areas of the city have the most diverse cuisine?

> db.just_zurich.aggregate([{$unwind: "$cuisine"}, {$match:{$and: [{"addr.city": {"$exists": 1}}, {"amenity":"restaurant"}]}}, {$group:{_id: "$addr.postcode", cuisines:{"$addToSet":"$cuisine"}}},{$project: {_id: 1, cuisines: 1,num_cuisines: { $size: "$cuisines" }}}, {$sort: {num_cuisines: -1}}, {$limit: 3}])

{ "_id" : "8004",
  "cuisines" : [ "american", "vegetarian", "Bier, Bar", "japanese", "vietnamese", "international", "tea", "asian", "spanish", "kebab", "indian", "italian", "kosher", "coffee_shop", "lebanese", "cake", "burger", "regional", "vegan", "tapas", "chinese" ],
  "num_cuisines" : 21 }
{ "_id" : "8001",
  "cuisines" : [ "thai", "greek", "japanese", "sushi", "brazilian", "fish", "spanish", "indian", "asian", "bistro", "regional", "italian", "pizza" ],
  "num_cuisines" : 13 }
{ "_id" : "8005",
  "cuisines" : [ "burger", "greek", "turkish", "indian", "lebanese", "pizza", "italian", "regional", "international", "thai", "asian", "vegetarian", "american" ],
  "num_cuisines" : 13 }

Looking ahead

Payments

Currently, the payments schema looks like the following, where keys are descriptive and values are standardized:

"payment" : {
		"notes" : "yes",
		"coins" : "yes",
		"cash" : "yes",
		"american_express" : "yes",
		"visa_electron" : "yes",
		"visa" : "yes",
		"mastercard" : "yes",
		"maestro" : "yes"
  }

With this schema, it is difficult to perform aggregations in the MongoDB framework. One can really only query the payments sub-document for counts, rather than form more complex queries.

> db.just_zurich.aggregate([{"$match":{$and: [{"addr.city": {"$exists": 1}},{"payment": {"$exists": 1} } ] } }, {$group:{_id: null, "num_payments": {"$sum": 1} } }, {"$sort": {num_payments: -1}}, {"$limit": 10}])

Returns { "_id" : null, "num_payments" : 25 }

Whereas:

> db.just_zurich.aggregate([{"$match":{$and: [{"addr.city": {"$exists": 1}},{"payment": {"$exists": 1} } ] } }, {$group:{_id: "$payment", "num_payments": {"$sum": 1} } }, {"$sort": {num_cuisines: -1}}, {"$limit": 10}])

Returns "_id" : { "coins" : "yes", "bitcoin" : "yes" }, rather than payment types as intended.

In the future, I'd want to sanitize the payments field so that the structure for a given document follows this schema: {"payments": ["notes", "visa", "bitcoin"]}.

Limits of "addr:city"

My method for extracting Zurich-only data needs improvement. At the moment, the only filter I apply is through the "addr:city" key -- if the value doesn't fuzzy match to "Zurich", then I skip the record.

I loaded a non-filtered version of the dataset into MongoDB, and compared the counts of both collections:

Dataset	Count	Query
Filtered	3146959	`db.just_zurich.count()`
Unfiltered	6084959	`db.all_zurich.count()`

Despite culling around 3 million records, I don't do any checks for location other than "addr:city". In the future, I may consider using records' lat and lon keys to determine whether a document falls within what I consider to be the city boundaries. However, I anticipate that this approach would be computationally intensive, especially over 6 million records.

alemosie / zurich-osm-munging Goto Github PK

zurich-osm-munging's Introduction