osm-parquetizer's Introduction

OpenStreetMap Parquetizer

The project provides a way to get OpenStreetMap data into Parquet, a Big Data friendly format.

Currently, any PBF file is converted into three Parquet files, one for each entity type in the original PBF (Nodes, Ways and Relations).

In order to get started:

git clone https://github.com/adrianulbona/osm-parquetizer.git
cd osm-parquetizer
mvn clean package
java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar path_to_your.pbf

For example, by running:

java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar romania-latest.osm.pbf

In a few seconds (on a decent laptop) you should get the following files:

-rw-r--r--  1 adrianbona  adrianbona   145M Apr  3 19:57 romania-latest.osm.pbf
-rw-r--r--  1 adrianbona  adrianbona   372M Apr  3 19:58 romania-latest.osm.pbf.node.parquet
-rw-r--r--  1 adrianbona  adrianbona   1.1M Apr  3 19:58 romania-latest.osm.pbf.relation.parquet
-rw-r--r--  1 adrianbona  adrianbona   123M Apr  3 19:58 romania-latest.osm.pbf.way.parquet

The Parquet files have the following schemas:

node
 |-- id: long
 |-- version: integer
 |-- timestamp: long
 |-- changeset: long
 |-- uid: integer
 |-- user_sid: string
 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string
 |-- latitude: double
 |-- longitude: double

way
 |-- id: long
 |-- version: integer
 |-- timestamp: long
 |-- changeset: long
 |-- uid: integer
 |-- user_sid: string
 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string
 |-- nodes: array
 |    |-- element: struct
 |    |    |-- index: integer
 |    |    |-- nodeId: long

relation
 |-- id: long
 |-- version: integer
 |-- timestamp: long
 |-- changeset: long
 |-- uid: integer
 |-- user_sid: string
 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string
 |-- members: array
 |    |-- element: struct
 |    |    |-- id: long
 |    |    |-- role: string
 |    |    |-- type: string

osm-parquetizer's People

Contributors

adrianulbona, bogdans2-telenav, igor-suhorukov, mihaic-telenav

osm-parquetizer's Issues

Error: org.openstreetmap.osmosis.core.OsmosisRuntimeException

  • Running on macOS Ventura 13.2.1, MacBook Air M1.
  • apache-maven-3.9.1
  • OpenJDK 20

Got the following error:

➜ java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar massachusetts.pbf

2023-03-24 00:27:16 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-24 00:27:16 INFO  CodecPool:153 - Got brand-new compressor [.snappy]
2023-03-24 00:27:16 INFO  CodecPool:153 - Got brand-new compressor [.snappy]
2023-03-24 00:27:16 INFO  CodecPool:153 - Got brand-new compressor [.snappy]
Exception in thread "main" org.openstreetmap.osmosis.core.OsmosisRuntimeException: A PBF decoding worker thread failed, aborting.
	at org.openstreetmap.osmosis.pbf2.v0_6.impl.PbfDecoder.sendResultsToSink(PbfDecoder.java:97)
	at org.openstreetmap.osmosis.pbf2.v0_6.impl.PbfDecoder.processBlobs(PbfDecoder.java:163)
	at org.openstreetmap.osmosis.pbf2.v0_6.impl.PbfDecoder.run(PbfDecoder.java:175)
	at org.openstreetmap.osmosis.pbf2.v0_6.PbfReader.run(PbfReader.java:115)
	at io.github.adrianulbona.osm.parquet.App.main(App.java:38)

Strings are in base64 encoding after conversion.

  • What's the issue
    In the Parquet files generated by the conversion, strings are encoded as base64. This affects all string fields, which may diverge from users' intentions.
    Take RelationWriteSupport.java as an example:
    memberRoleType = new PrimitiveType(REQUIRED, BINARY, "role");
    By calling this constructor of PrimitiveType we are actually setting its logicalTypeAnnotation to null. The Parquet converter therefore knows nothing about the field's actual type and falls back to its default handling for BINARY, which renders the value as base64.

  • How to fix
    Set the logicalTypeAnnotation parameter to stringType. Since the tags are actually strings, this should be safe; the Parquet converter will then know the field is a string and encode it as UTF-8 instead of base64 (see the sketch below).
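
    A minimal sketch of that fix using the parquet-mr Types builder (shown in Scala for illustration; it assumes a Parquet version that exposes LogicalTypeAnnotation, and the equivalent Java change would go into RelationWriteSupport.java):

    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY
    import org.apache.parquet.schema.{LogicalTypeAnnotation, Types}

    // annotating the BINARY field as a UTF-8 string lets readers decode it as text
    val memberRoleType = Types.required(BINARY)
      .as(LogicalTypeAnnotation.stringType())
      .named("role")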

Release the latest update?

Hey, could you release the latest update as v1.0.1 or v2.0.0 from the current HEAD? Thanks! I am also willing to help if possible.

Converting dataframe tempView to correct schema

Hey,

This is more of a support request, since the real issue is my unfamiliarity with Scala, but I really hope you'll humor me.

I followed this tutorial: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4082562773728035/2089274675795739/3712305628257488/latest.html

But I get binary values where I am expecting strings.

%spark2.spark
val nodesDF = spark.read.parquet("/tmp/geo/bremen-filtered.osm.pbf.node.parquet")
nodesDF.createOrReplaceTempView("nodes")
nodesDF.show()
---result----
+-----+-------+-------------+---------+-------+--------------------+--------------------+--------------------+
|   id|version|    timestamp|changeset|    uid|            user_sid|                tags|             members|
+-----+-------+-------------+---------+-------+--------------------+--------------------+--------------------+
| 2952|     26|1488545617000| 46548312| 339581|[6E 79 75 72 69 6...|[[[B@2343598a,[B@...|[[293249683,[B@75...|
| 3800|      3|1491746258000| 47593689|  44217|       [6B 6A 6F 6E]|[[[B@16219edd,[B@...|[[4287246,[B@3111...|
%spark2.spark
nodesDF.printSchema()
nodesDF.select($"tags".getItem(1)).head().get(0)
---result---
root
 |-- id: long (nullable = true)
 |-- version: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- changeset: long (nullable = true)
 |-- uid: integer (nullable = true)
 |-- user_sid: binary (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
res11: Any = [[B@5d543cb8,[B@ce6535f]

The schema shows all the string fields, i.e. user_sid as well as the key and value of each tag element, as binary.
How do I convert them back to strings, or better yet, apply the correct schema at the start?

Thanks!
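
One way to apply the correct schema from the start (a sketch, not from the thread): Spark's spark.sql.parquet.binaryAsString option tells the Parquet reader to decode unannotated BINARY columns as UTF-8 strings:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.parquet.binaryAsString", "true") // decode unannotated BINARY as UTF-8
  .getOrCreate()

val nodesDF = spark.read.parquet("/tmp/geo/bremen-filtered.osm.pbf.node.parquet")
nodesDF.printSchema() // user_sid and the tag key/value fields now show as string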

Tag Array outputs to Binary

I tried the parquetizer with data from Japan: http://download.geofabrik.de/asia/japan.html For all regions, the array values in the tags are output as binaries instead of strings. Transforming the nested dataframes is very tricky.

Is there a way to supply a working Python / Scala schema for creating dataframes from the parquet files, or to fix the issue at the source?
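
One workaround sketch (assuming Spark 2.4+ for the transform higher-order function, with df being a hypothetical DataFrame read from one of the generated Parquet files) casts the binary fields after loading; the binaryAsString option shown in the previous issue avoids the casts entirely:

import org.apache.spark.sql.functions.{col, expr}

// cast the top-level binary column, then rebuild each tag struct with string fields
val fixed = df
  .withColumn("user_sid", col("user_sid").cast("string"))
  .withColumn("tags", expr(
    "transform(tags, t -> named_struct('key', cast(t.key as string), 'value', cast(t.value as string)))"))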

Ignore visible info when converted to Parquet file

Though the parquetizer follows the v0_6 schema from osmosis (which does not seem to contain visible), the output files drop the visible info even when the original PBF file contains it.

Is there a fast way to also include it in the output?

Member Type changed

The relation member types "node", "way", "relation" are converted to "Node", "Way", "Relation" in the Parquet output.

Could we avoid this kind of change? In theory, the conversion should not change any field values. I haven't checked the values in tags, but if they are processed in the same way, some elements may be changed to upper case unexpectedly.
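
Until the converter itself is changed, a read-time workaround sketch (relationsDF is a hypothetical Spark DataFrame loaded from the relation file; role is cast as well, since it is stored as BINARY):

import org.apache.spark.sql.functions.expr

// rebuild each member struct, lower-casing the type field back to "node"/"way"/"relation"
val normalized = relationsDF.withColumn("members", expr(
  "transform(members, m -> named_struct('id', m.id, 'role', cast(m.role as string), 'type', lower(cast(m.type as string))))"))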

Incorrect schema

In the README.md file there is the following piece of the schema for node:

 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string

whereas, after converting the PBF to Parquet, loading it into Spark and printing the schema, it is:

 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)

The difference is in the type of the key and value fields: string vs binary. Is this an inconsistency, or am I somehow mistaken?

Also, if these fields do have binary type, can I open a PR changing them to string? I have no idea when these fields would be useful in binary format.
