osm-parquetizer's Introduction

OpenStreetMap Parquetizer

The project provides a way to get OpenStreetMap data into Parquet, a Big Data friendly format.

Currently, any PBF file is converted into three Parquet files, one for each entity type in the original PBF (Nodes, Ways and Relations).

In order to get started:

git clone https://github.com/adrianulbona/osm-parquetizer.git
cd osm-parquetizer
mvn clean package
java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar path_to_your.pbf

For example, by running:

java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar romania-latest.osm.pbf

In a few seconds (on a decent laptop) you should get the following files:

-rw-r--r--  1 adrianbona  adrianbona   145M Apr  3 19:57 romania-latest.osm.pbf
-rw-r--r--  1 adrianbona  adrianbona   372M Apr  3 19:58 romania-latest.osm.pbf.node.parquet
-rw-r--r--  1 adrianbona  adrianbona   1.1M Apr  3 19:58 romania-latest.osm.pbf.relation.parquet
-rw-r--r--  1 adrianbona  adrianbona   123M Apr  3 19:58 romania-latest.osm.pbf.way.parquet

The Parquet files have the following schemas:

node
 |-- id: long
 |-- version: integer
 |-- timestamp: long
 |-- changeset: long
 |-- uid: integer
 |-- user_sid: string
 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string
 |-- latitude: double
 |-- longitude: double

way
 |-- id: long
 |-- version: integer
 |-- timestamp: long
 |-- changeset: long
 |-- uid: integer
 |-- user_sid: string
 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string
 |-- nodes: array
 |    |-- element: struct
 |    |    |-- index: integer
 |    |    |-- nodeId: long

relation
 |-- id: long
 |-- version: integer
 |-- timestamp: long
 |-- changeset: long
 |-- uid: integer
 |-- user_sid: string
 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string
 |-- members: array
 |    |-- element: struct
 |    |    |-- id: long
 |    |    |-- role: string
 |    |    |-- type: string

osm-parquetizer's People

Contributors

adrianulbona, bogdans2-telenav, igor-suhorukov, mihaic-telenav

osm-parquetizer's Issues

Error: org.openstreetmap.osmosis.core.OsmosisRuntimeException

  • Running on macOS Ventura 13.2.1, MacBook Air M1.
  • apache-maven-3.9.1
  • OpenJDK 20

Got the following error:

➜ java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar massachusetts.pbf

2023-03-24 00:27:16 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-24 00:27:16 INFO  CodecPool:153 - Got brand-new compressor [.snappy]
2023-03-24 00:27:16 INFO  CodecPool:153 - Got brand-new compressor [.snappy]
2023-03-24 00:27:16 INFO  CodecPool:153 - Got brand-new compressor [.snappy]
Exception in thread "main" org.openstreetmap.osmosis.core.OsmosisRuntimeException: A PBF decoding worker thread failed, aborting.
	at org.openstreetmap.osmosis.pbf2.v0_6.impl.PbfDecoder.sendResultsToSink(PbfDecoder.java:97)
	at org.openstreetmap.osmosis.pbf2.v0_6.impl.PbfDecoder.processBlobs(PbfDecoder.java:163)
	at org.openstreetmap.osmosis.pbf2.v0_6.impl.PbfDecoder.run(PbfDecoder.java:175)
	at org.openstreetmap.osmosis.pbf2.v0_6.PbfReader.run(PbfReader.java:115)
	at io.github.adrianulbona.osm.parquet.App.main(App.java:38)

Strings are in base64 encoding after conversion.

  • What's the issue
    In the Parquet files generated by the conversion, strings are encoded as base64. This affects all string fields, which may diverge from users' intentions.
    Take RelationWriteSupport.java as an example:
    memberRoleType = new PrimitiveType(REQUIRED, BINARY, "role");
    By calling this constructor of PrimitiveType we are actually setting its logicalTypeAnnotation to null. The Parquet converter therefore knows nothing about the field's actual type and falls back to its default handling for BINARY, which renders the value as base64.

  • How to fix
    Set the logicalTypeAnnotation parameter to stringType. Since the tags are actually strings, this should be safe; the Parquet converter will then know the field is a string and encode it as UTF-8 instead of base64 (see the sketch below).
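
    A minimal sketch of that fix using the parquet-mr Types builder (shown in Scala for illustration; it assumes a Parquet version that exposes LogicalTypeAnnotation, and the equivalent Java change would go into RelationWriteSupport.java):

    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY
    import org.apache.parquet.schema.{LogicalTypeAnnotation, Types}

    // annotating the BINARY field as a UTF-8 string lets readers decode it as text
    val memberRoleType = Types.required(BINARY)
      .as(LogicalTypeAnnotation.stringType())
      .named("role")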

Release the latest update?

Hey, could you release the latest update as v1.0.1 or v2.0.0 from the current HEAD? Thanks! I am also willing to help if possible.

Converting dataframe tempView to correct schema

Hey,

This is more of a support request, since the real issue is my unfamiliarity with Scala, but I really hope you'll humor me.

I followed this tutorial: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4082562773728035/2089274675795739/3712305628257488/latest.html

But I get binary values where I am expecting strings.

%spark2.spark
val nodesDF = spark.read.parquet("/tmp/geo/bremen-filtered.osm.pbf.node.parquet")
nodesDF.createOrReplaceTempView("nodes")
nodesDF.show()
---result----
+-----+-------+-------------+---------+-------+--------------------+--------------------+--------------------+
|   id|version|    timestamp|changeset|    uid|            user_sid|                tags|             members|
+-----+-------+-------------+---------+-------+--------------------+--------------------+--------------------+
| 2952|     26|1488545617000| 46548312| 339581|[6E 79 75 72 69 6...|[[[B@2343598a,[B@...|[[293249683,[B@75...|
| 3800|      3|1491746258000| 47593689|  44217|       [6B 6A 6F 6E]|[[[B@16219edd,[B@...|[[4287246,[B@3111...|
%spark2.spark
nodesDF.printSchema()
nodesDF.select($"tags".getItem(1)).head().get(0)
---result---
root
 |-- id: long (nullable = true)
 |-- version: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- changeset: long (nullable = true)
 |-- uid: integer (nullable = true)
 |-- user_sid: binary (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
res11: Any = [[B@5d543cb8,[B@ce6535f]

The schema shows all the string fields, i.e. user_sid as well as the key and value of each tag element, as binary.
How do I convert them back to strings, or better yet, apply the correct schema at the start?

Thanks!
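
One way to apply the correct schema from the start (a sketch, not from the thread): Spark's spark.sql.parquet.binaryAsString option tells the Parquet reader to decode unannotated BINARY columns as UTF-8 strings:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.parquet.binaryAsString", "true") // decode unannotated BINARY as UTF-8
  .getOrCreate()

val nodesDF = spark.read.parquet("/tmp/geo/bremen-filtered.osm.pbf.node.parquet")
nodesDF.printSchema() // user_sid and the tag key/value fields now show as string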

Tag Array outputs to Binary

I tried the parquetizer with data from Japan: http://download.geofabrik.de/asia/japan.html For all regions, the array values in the tags are output as binaries instead of strings. Transforming the nested dataframes is very tricky.

Is there a way to supply a working Python / Scala schema for creating dataframes from the parquet files, or to fix the issue at the source?
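
One workaround sketch (assuming Spark 2.4+ for the transform higher-order function, with df being a hypothetical DataFrame read from one of the generated Parquet files) casts the binary fields after loading; the binaryAsString option shown in the previous issue avoids the casts entirely:

import org.apache.spark.sql.functions.{col, expr}

// cast the top-level binary column, then rebuild each tag struct with string fields
val fixed = df
  .withColumn("user_sid", col("user_sid").cast("string"))
  .withColumn("tags", expr(
    "transform(tags, t -> named_struct('key', cast(t.key as string), 'value', cast(t.value as string)))"))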

Ignore visible info when converted to Parquet file

Though the parquetizer follows the v0_6 schema from osmosis (which does not seem to contain visible), the output files drop the visible info even when the original PBF file contains it.

Is there a fast way to also include it in the output?

Member Type changed

The relation member types "node", "way", "relation" are converted to "Node", "Way", "Relation" in the Parquet output.

Could we avoid this kind of change? In theory, the conversion should not change any field values. I haven't checked the values in tags, but if they are processed in the same way, some elements may be changed to upper case unexpectedly.
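
Until the converter itself is changed, a read-time workaround sketch (relationsDF is a hypothetical Spark DataFrame loaded from the relation file; role is cast as well, since it is stored as BINARY):

import org.apache.spark.sql.functions.expr

// rebuild each member struct, lower-casing the type field back to "node"/"way"/"relation"
val normalized = relationsDF.withColumn("members", expr(
  "transform(members, m -> named_struct('id', m.id, 'role', cast(m.role as string), 'type', lower(cast(m.type as string))))"))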

Incorrect schema

In the README.md file there is the following piece of the schema for node:

 |-- tags: array
 |    |-- element: struct
 |    |    |-- key: string
 |    |    |-- value: string

whereas, after converting the PBF to Parquet, loading it into Spark and printing the schema, it is:

 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)

The difference is in the type of the key and value fields: string vs binary. Is this an inconsistency, or am I somehow mistaken?

Also, if these fields do have binary type, can I open a PR changing them to string? I have no idea when these fields would be useful in binary format.
