pygeobuf's Introduction

Geobuf

Geobuf is a compact binary geospatial format for lossless compression of GeoJSON and TopoJSON data.

Note well: this project has been transferred by Mapbox to the new pygeobuf organization.

Advantages over using GeoJSON and TopoJSON directly (in this revised version):

  • Very compact: typically makes GeoJSON 6-8 times smaller and TopoJSON 2-3 times smaller.
  • Smaller even when comparing gzipped sizes: 2-2.5x compression for GeoJSON and 20-30% for TopoJSON.
  • Easy incremental parsing: you can get features out as you read them, without needing to build an in-memory representation of the whole dataset.
  • Partial reads — you can read only the parts you actually need, skipping the rest.
  • Trivial concatenation: you can concatenate many Geobuf files together and they will form a valid combined Geobuf file.
  • Potentially faster encoding/decoding compared to native JSON implementations (e.g. in web browsers).
  • Can still accommodate any GeoJSON and TopoJSON data, including extensions with arbitrary properties.

Think of this as an attempt to design a simple, modern Shapefile successor that works seamlessly with GeoJSON and TopoJSON.

Unlike Mapbox Vector Tiles, it aims for lossless compression of datasets — without tiling, projecting coordinates, flattening geometries or stripping properties.

pygeobuf

This repository is the first encoding/decoding implementation of this new major version of Geobuf (in Python). It serves as a prototyping playground, with faster implementations in JS and C++ coming in the future.

Sample compression sizes

File                 Normal       Gzipped
us-zips.json         101.85 MB    26.67 MB
us-zips.pbf          12.24 MB     10.48 MB
us-zips.topo.json    15.02 MB     3.19 MB
us-zips.topo.pbf     4.85 MB      2.72 MB
idaho.json           10.92 MB     2.57 MB
idaho.pbf            1.37 MB      1.17 MB
idaho.topo.json      1.9 MB       612 KB
idaho.topo.pbf       567 KB       479 KB

Usage

Installation:

pip install geobuf

Command line:

geobuf encode < example.json > example.pbf
geobuf decode < example.pbf > example.pbf.json

As a module:

import geobuf

pbf = geobuf.encode(my_json) # GeoJSON or TopoJSON -> Geobuf string
my_json = geobuf.decode(pbf) # Geobuf string -> GeoJSON or TopoJSON

The encode function accepts a dict-like object, for example the result of json.loads(json_str).

Both encode.py and geobuf.encode accept two optional arguments:

  • precision — max number of digits after the decimal point in coordinates, 6 by default.
  • dimensions — number of dimensions in coordinates, 2 by default.

Tests

py.test -v

The tests run through all .json files in the fixtures directory, comparing each original GeoJSON with an encoded/decoded one.

pygeobuf's Issues

pip install pygeobuf?

My apologies if I'm missing something dead obvious, but I didn't see how to install the pygeobuf module anywhere in the README.

Trying !pip install pygeobuf resulted in:

ERROR: Could not find a version that satisfies the requirement pygeobuf (from versions: none)
ERROR: No matching distribution found for pygeobuf
WARNING: You are using pip version 19.3.1; however, version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Could you advise me?
Thank you

Autodetect dimensions and precision

Encoding script should automatically detect how many dimensions the GeoJSON has, and what precision coordinates have (for optimal delta-encoding compression) instead of having to pass them as encode arguments.
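
A minimal sketch of what such autodetection might look like, assuming coordinates arrive as nested Python lists of numbers. The helper names (`iter_points`, `decimal_places`, `detect_dim_and_precision`) and the `max_precision` cap are hypothetical and not part of pygeobuf.

```python
from decimal import Decimal


def iter_points(coords):
    """Yield every point (a flat list of numbers) from nested coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from iter_points(part)


def decimal_places(value):
    """Count digits after the decimal point in the shortest repr of a float."""
    exponent = Decimal(repr(float(value))).normalize().as_tuple().exponent
    return max(-exponent, 0)


def detect_dim_and_precision(coords, max_precision=6):
    """Return (dimensions, precision) covering every point in coords."""
    dim, precision = 2, 0
    for point in iter_points(coords):
        dim = max(dim, len(point))
        for value in point:
            precision = max(precision, decimal_places(value))
    return dim, min(precision, max_precision)


line = [[-85.431413, 34.124869], [-85.4315, 34.125, 12.5]]
print(detect_dim_and_precision(line))  # (3, 6)
```

The sketch ignores null or non-finite values; a real detector would have to decide how to treat them.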

Handle missing data in coordinates

Coordinate data can be really crappy: [[100, 100], [100, 100, 5], [100, 100, null, 4], ...] — different dimensions, nulls, etc.

But we can store the indexes to missing data in linestrings:

message LineString {
    repeated sint32 values = 1 [packed = true];
    repeated sint32 missing = 2 [packed = true];
}

Then we just pad everything with zeroes: [[100, 100, 0, 0], [100, 100, 5, 0], [100, 100, 0, 4], ...] and store missing value indexes as point_index/coord_index pairs: [0, 2, 0, 3, 1, 3, 2, 2].

Repeated fields are optional so they don't take space, so this won't affect all normal datasets. Only crappy ones will get some more bytes to fill in the blanks, but this won't be critical at all.
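
The padding scheme described above can be sketched in plain Python. This is only an illustration of the idea, not pygeobuf code; `pad_points` and `restore_points` are hypothetical names.

```python
def pad_points(points, dim):
    """Pad points to `dim` values and record which slots were missing.

    Returns (padded, missing), where `missing` is a flat list of
    point_index/coord_index pairs.
    """
    padded, missing = [], []
    for i, point in enumerate(points):
        row = []
        for j in range(dim):
            value = point[j] if j < len(point) else None
            if value is None:
                missing.extend([i, j])
                value = 0
            row.append(value)
        padded.append(row)
    return padded, missing


def restore_points(padded, missing):
    """Put None back into every slot recorded in `missing`."""
    restored = [list(row) for row in padded]
    for k in range(0, len(missing), 2):
        restored[missing[k]][missing[k + 1]] = None
    return restored


points = [[100, 100], [100, 100, 5], [100, 100, None, 4]]
padded, missing = pad_points(points, dim=4)
print(padded)   # [[100, 100, 0, 0], [100, 100, 5, 0], [100, 100, 0, 4]]
print(missing)  # [0, 2, 0, 3, 1, 3, 2, 2]
```

Note that restoring from these pairs alone cannot tell a short point such as `[100, 100]` apart from one with explicit trailing nulls; both come back as `[100, 100, None, None]`.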

cc @mick @tmcw @morganherlocker

Geobuf Roundtrip to Filesystem

Nice module, converted all 2018 US Census tracts geojson files from 139.8MB to 26.7MB.

The following is not an issue, but is provided for anyone looking to roundtrip to the filesystem:

import json
from pathlib import Path
from typing import Union  # needed for the load_geobuf return annotation

import geobuf
import geopandas as gpd


def geojson_to_geobuf(dir_source: Path, dir_target: Path) -> None:
    """
    Given a source directory of geojson, provides geojson as compressed geo/protobuf
    files in the target directory. GeoJSON will be 6-8x smaller, TopoJSON 2-3x.

    Usage
    ----------
    dir_source = Path.cwd() / 'data' / 'us_census_tracts_2018_original'
    dir_target = Path.cwd() / 'data' / 'us_census_tracts_2018'

    # convert
    geojson_to_geobuf(dir_source, dir_target)

    Parameters
    ----------
    :param dir_source: Path: The Path to the geojson files to be converted.
    :param dir_target: Path: The Path to where the geobuf files should be saved.
    :return: None, takes internal actions converting and saving file.
    """

    # get files in the source directory
    sources = dir_source.iterdir()

    for item in sources:

        # filter on json/geojson files only
        if item.suffix in {'.json', '.geojson', '.gjson'}:

            # open file to read
            with open(item, "r") as file:

                # load geojson file
                data = json.load(file)

                # encode to geobuf
                gbf = geobuf.encode(data)

                # save to target directory
                with open(dir_target / (item.stem + ".pbf"), "wb") as new_file:
                    new_file.write(gbf)


def load_geobuf(filepath: Path, as_type: str = 'dict', crs="epsg:4326") -> Union[dict, str, gpd.GeoDataFrame]:
    """
    Loads specified geobuf file and convert to dict, geojson or GeoDataFrame.

    Usage
    ----------
    # designate file path
    filepath = Path.cwd() / 'data' / 'us_census_tracts_2018' / 'cb_2018_01_tract_500k.pbf'

    # load as dict
    data = load_geobuf(filepath)

    # load as geojson
    data = load_geobuf(filepath, as_type="gjson")

    # load as GeoDataFrame
    data = load_geobuf(filepath, as_type="gdf")

    Parameters
    ----------
    :param filepath: Path: The path to the file to be loaded.
    :param as_type: The data type to decode the geobuf file to; Python Dictionary, GeoJSON or GeoDataFrame.
    Designate 'dict', 'gjson', or 'gdf' respectively.
    :param crs: str: The Coordinate Reference System to use if creating a GeoDataFrame.
    :return: Union[dict, str, gpd.GeoDataFrame]: The converted data.
    """

    # check that the file has the expected extension
    assert filepath.suffix == '.pbf', "The file must be geobuf/protobuf format ending in '.pbf.'"

    # read file
    with open(filepath, "rb") as file:

        # ensure start of file
        file.seek(0)

        # decode geobuf binary file
        decoded = geobuf.decode(file.read())

        # as geojson
        if as_type in {'json', 'gjson', 'geojson'}:

            # dump as json since hierarchy was originally geojson it will be retained
            return json.dumps(decoded)

        # as python dict
        elif as_type == 'dict':
            return decoded

        # as Geopandas GeoDataFrame
        elif as_type == 'gdf':

            # generate from features
            gdf = gpd.GeoDataFrame.from_features(decoded["features"], crs=crs)

            # ensure geometry is last column
            columns = list(gdf.columns)
            columns.remove("geometry")
            columns.append("geometry")
            return gdf[columns]

1.1 release

Adding Python 3 support.

  • note changes
  • tag
  • release to PyPI

Proper Python packaging

My way of getting to know more about the format. Will be interesting to get this on PyPI as well.

Flatten nested coordinates with single ring

Flattening nested coords with single ring should nicely improve compression ratio on many data cases (e.g. lots of polygons that look like [[coords]]). Especially relevant for TopoJSON - [[1]] could be flattened to just 1, saving ~4 bytes for each feature.

Option to double-delta-encode coordinates

Delta-encoding the deltas makes sizes smaller (up to 10%) for some data cases, especially simplified or not overly detailed datasets. So we can add it as an option.
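
A rough illustration of the idea on a single coordinate axis, using made-up integer values. The helper names are hypothetical, and this is not how pygeobuf stores coordinates today.

```python
def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    previous, deltas = 0, []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the deltas."""
    total, values = 0, []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Integer x-coordinates of nearly evenly spaced vertices (made-up sample data).
xs = [1000, 1010, 1020, 1030, 1041]
single = delta_encode(xs)      # [1000, 10, 10, 10, 11]
double = delta_encode(single)  # [1000, -990, 0, 0, 1]
print(single, double)
```

The first two entries stay large, but the rest shrink toward zero, and smaller magnitudes take fewer bytes in protobuf's variable-length integer encoding. On irregular geometry the second pass can just as easily make values larger, which fits the suggestion to keep it optional.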

failed to convert geojson to geobuf

On Windows, I have tried every way I could think of, but it still fails.

In the Windows command line, after starting the Python interpreter with python:

import geobuf
geobuf.encode("city_streets.geojson", "c.pbf")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\PYTHON27\lib\site-packages\geobuf-1.1.0-py2.7.egg\geobuf\__init__.py", line 8, in encode
    return Encoder().encode(*args)
  File "C:\PYTHON27\lib\site-packages\geobuf-1.1.0-py2.7.egg\geobuf\encode.py", line 30, in encode
    self.e = pow(10, precision) # multiplier for converting coordinates into integers
TypeError: unsupported operand type(s) for ** or pow(): 'int' and 'str'

Can you show me where I went wrong?
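
Judging from the traceback, the second positional argument (`"c.pbf"`) ended up as `precision`, and per the README `encode` expects a dict-like object rather than a file name. A sketch of a file-to-file helper under those assumptions follows. `encode_file` is a hypothetical name, and the encoder is passed in as a parameter so the helper itself runs without pygeobuf.

```python
import json


def encode_file(src_path, dst_path, encode, precision=6, dim=2):
    """Read GeoJSON from src_path, encode it, and write the bytes to dst_path.

    `encode` is any callable shaped like geobuf.encode(obj, precision, dim).
    """
    with open(src_path, "r") as src:
        data = json.load(src)          # parse the file into a dict first
    encoded = encode(data, precision, dim)
    with open(dst_path, "wb") as dst:  # binary mode matters on Windows
        dst.write(encoded)
    return len(encoded)
```

With pygeobuf installed, this would presumably be called as `encode_file("city_streets.geojson", "c.pbf", geobuf.encode)`.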

Not decoding pbf to json

I used geobuf to decode a pbf file I have to JSON:

geobuf decode < 3138.pbf > 3138.json

but the resulting json file only had null as a value. I decoded the same pbf file using an online decoder, and got the data I was expecting. I installed geobuf using pip:

pip install geobuf

I noticed that the pip install also installed google protobuf in the python library site-packages directory:

lib/python3.5/site-packages/google/protobuf

However, does pygeobuf also need the Protocol Compiler (protoc) libraries?

Thanks
Jim

Maintainership of pygeobuf

The pygeobuf project began as a prototype. Today https://github.com/mapbox/geobuf is the reference implementation of the geobuf encoding and Mapbox currently has no users of pygeobuf. We (Mapbox) would like to transfer the pygeobuf project to developers interested in actively maintaining it and making releases to the Python package index.

In #35 I proposed to transfer pygeobuf to the mysidewalk org, but I had overlooked the earlier offer by @jlaine in #32 (comment). My apologies for that oversight, @jlaine.

@jlaine @jguthmiller: can I ask the two of you to discuss joint maintainership (here or in other channels) and where you'd like to do it?

Coordinates along LineStrings (and others) become increasingly inaccurate

If you encode the same LineString with pygeobuf in the reverse order (i.e. flipped line), the coordinates do not match. This is because of this section of code in encode.py:

    def add_line(self, coords, points, is_closed=False):
        sum = [0] * self.dim
        r = range(0, len(points) - int(is_closed))
        for i in r:
            for j in range(0, self.dim):
                n = int(round(points[i][j] * self.e) - sum[j])
                coords.append(n)
                sum[j] += n

Edit: I inaccurately blamed sum for the issue, though I'm not sure that's why this issue exists.

I think this issue also impacts MultiLineStrings, Polygons, and MultiPolygons, though I haven't tested them.
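
For anyone trying to reproduce this, here is a standalone harness that runs the quoted loop in both directions. `add_line` mirrors the snippet above for open lines; `read_line` is an assumed inverse written for this sketch, not pygeobuf's decoder.

```python
def add_line(points, dim, e):
    """Delta-encode points the way the quoted loop does (open lines only)."""
    coords, running = [], [0] * dim
    for point in points:
        for j in range(dim):
            n = int(round(point[j] * e) - running[j])
            coords.append(n)
            running[j] += n
    return coords


def read_line(coords, dim, e):
    """Assumed inverse: running sum of the deltas, scaled back by e."""
    points, running = [], [0] * dim
    for i in range(0, len(coords), dim):
        for j in range(dim):
            running[j] += coords[i + j]
        points.append([value / e for value in running])
    return points


e = 10 ** 6
line = [[-85.431413, 34.124869], [-85.431999, 34.125001], [-85.4325, 34.1262]]
forward = read_line(add_line(line, 2, e), 2, e)
backward = read_line(add_line(line[::-1], 2, e), 2, e)[::-1]
print(forward == backward)
```

In this isolated form the running sum always equals the rounded coordinate, so the result should not depend on direction. That suggests any drift comes from elsewhere, for example a different version of the loop or the decoder, though this is an inference rather than a confirmed diagnosis.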

encode should return the geobuf Data instance

encode should return the geobuf Data instance and leave stringification up to the user with data.SerializeToString() if needed.

The reasoning here is that I need/want to write a protobuf that has a field to hold the geobuf feature for the record. There shouldn't be a need to re-encode string with

data = geobuf.geobuf_pb2.Data()
data.ParseFromString(datastring)

--Karl

Python 3.x support

I'm using python 3.5, so I wonder if it's possible to use this package with python 3.x.

Rounding issues with coordinates

After encoding and then decoding, coordinates are sometimes rounded in a weird way:

[-85.431413, 34.124869] // original
[-85.431414, 34.12487] // encoded/decoded

Encode arbitrary properties of each object?

GeoJSON/TopoJSON spec allows adding any amount of custom properties to any object. A lot of libraries that deal with GeoJSON ignore this, but it can be useful in some apps.

There's a way to encode this info by adding a custom_properties field to every object of the schema. The drawback is that it will make the schema more verbose and encoding/decoding more complex.

Using argument dim > 2 during encoding implies weird output during the decoding

If you use the dim argument of the Encoder.encode method and then decode, the output again contains just 2 dimensions, but with extra points at the end that have weird values. This problem appears in version 1.1.1.

Quick example:
import geojson
import geobuf
l = [(1,2,3),(4,5,6)]
geobuf.decode(geobuf.encode(geojson.LineString(l),6,3))

The output of this encode - decode is:

OrderedDict([('type', 'LineString'), ('coordinates', [[1.0, 2.0], [4.0, 5.0], [7.0, 8.0]])])

It also outputs a weird result if dim is 4 or 5.
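
The reported output can be reproduced in isolation by delta-encoding with three values per point and then decoding as if there were two. The sketch below uses hypothetical helpers (`encode_line`, `decode_line`), not pygeobuf internals, so it hints at a dimension mismatch between encoder and decoder without proving that this is the cause.

```python
def encode_line(points, dim):
    """Delta-encode integer points into one flat list, dim values per point."""
    flat, previous = [], [0] * dim
    for point in points:
        for j in range(dim):
            flat.append(point[j] - previous[j])
            previous[j] = point[j]
    return flat


def decode_line(flat, dim):
    """Rebuild points from flat deltas, reading dim values per point."""
    points, current = [], [0] * dim
    for i in range(0, len(flat), dim):
        for j in range(dim):
            current[j] += flat[i + j]
        points.append([float(value) for value in current])
    return points


flat = encode_line([(1, 2, 3), (4, 5, 6)], dim=3)  # [1, 2, 3, 3, 3, 3]
print(decode_line(flat, dim=3))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(decode_line(flat, dim=2))  # [[1.0, 2.0], [4.0, 5.0], [7.0, 8.0]]
```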

Encoding error when precision > 7

Hi, I have a problem encoding GeoJSON with coordinates of precision 8 or higher.
I am trying to encode the following GeoJSON:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "id": "86",
      "geometry": {
        "type": "Point",
        "coordinates": [
          139.776078424,
          35.717489323,
          36.50732
        ]
      },
      "properties": {
        "attr-shape": "CIRCLE",
        "attr-id": "1",
        "attr-colors": [
          "RED"
        ],
        "attr-type": "PROHIBITION",
        "attr-faceWidth": 0.81,
        "attr-faceHeight": 0.8,
        "attr-heading": 298.82,
        "roadagramPart": "ROAD_SIGN"
      }
    }
  ]
}

When invoking encode with precision = 8, the following error occurs:

Traceback (most recent call last):
  File "/home/swidersk/Projects/geobuf/python/encode/encode.py", line 18, in <module>
    pbf = geobuf.encode(data, 8)
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/__init__.py", line 8, in encode
    return Encoder().encode(*args)
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/encode.py", line 38, in encode
    if data_type == 'FeatureCollection': self.encode_feature_collection(data.feature_collection, obj)
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/encode.py", line 51, in encode_feature_collection
    self.encode_feature(feature_collection.features.add(), feature_json)
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/encode.py", line 58, in encode_feature
    self.encode_geometry(feature.geometry, feature_json.get('geometry'))
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/encode.py", line 111, in encode_geometry
    self.add_point(geometry.coords, coords)
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/encode.py", line 184, in add_point
    for x in point: self.add_coord(coords, x)
  File "/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/lib/python3.8/site-packages/geobuf/encode.py", line 181, in add_coord
    coords.append(coord if self.transformed else int(round(coord * self.e)))
ValueError: Value out of range: 13977607842

Execution failed for task ':python:geobufEncode'.
> Process 'command '/home/swidersk/Projects/geobuf/.gradle/python/pythonVenvs/virtualenv-3.8.2/bin/python'' finished with non-zero exit value 1

Invoking encode with the default precision or with precision 7 does not raise an exception; however, in my case those precision values are too low.
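
The value in the ValueError is the first coordinate multiplied by 10^8, which no longer fits a signed 32-bit integer. A small sketch for checking this ahead of time, assuming the coordinates are stored in a 32-bit signed field; the helper names are hypothetical.

```python
SINT32_MAX = 2 ** 31 - 1


def scaled(coord, precision):
    """Integer the encoder would store for coord at this precision."""
    return int(round(coord * 10 ** precision))


def fits_sint32(coord, precision):
    """True if the scaled coordinate fits a signed 32-bit integer."""
    return -SINT32_MAX - 1 <= scaled(coord, precision) <= SINT32_MAX


def max_safe_precision(coords, limit=15):
    """Largest precision (0..limit) at which every coordinate still fits."""
    best = 0
    for precision in range(limit + 1):
        if all(fits_sint32(c, precision) for c in coords):
            best = precision
        else:
            break
    return best


point = [139.776078424, 35.717489323, 36.50732]
print(scaled(point[0], 8))        # 13977607842
print(fits_sint32(point[0], 8))   # False
print(max_safe_precision(point))  # 7
```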

Incorrect encoding of coordinates in meters, not in degrees.

Is it possible to encode coordinates in meters instead of degrees into geobuf? Is this a bug, or is it by design?

# GeoJSON: geojson.Point((lat2y(28.3815), lon2x(153.3814))) => {"coordinates": [3297157.912557, 17074339.345159], "type": "Point"}

geobuf.decode(geobuf.encode({"coordinates": [3297157, 17074339], "type": "Point"}, 0, 2))['coordinates']
geobuf.decode(geobuf.encode({"coordinates": [3297157.912557, 17074339.345159], "type": "Point"}, 1, 2))['coordinates']

Expected output:

[3297157, 17074339]
[3297157.912557, 17074339.345159]

Actual output:

[3.297157, 17.074339]
[3.297158, 17.074339]

decode pbf to json outputs NULL

The command I'm sending is this:

geobuf decode < great-britain-latest-regular.pbf > great-britain-latest-regular.pbf.json

The only output in the JSON file is the string null. Is there anything I'm doing wrong? The file I'm working with is ~1 GB.

Many thanks,
VG

It drops feature properties for Decimal values

If a property has a value as a Decimal type, it seems to drop the property, e.g.

ipdb> feature = { 'type': 'Feature', 'geometry': {'coordinates': [0.0, 0.0], 'type': 'Point'}, 'properties': { 'name': 'null island', 'address': 'nowhere'} }
ipdb> pbf = geobuf.encode(feature)
ipdb> dict(geobuf.decode(pbf))
{'type': 'Feature', 'geometry': OrderedDict([('type', 'Point'), ('coordinates', [0.0, 0.0])]), 'properties': {'name': 'null island', 'address': 'nowhere'}}

ipdb> feature = { 'type': 'Feature', 'geometry': {'coordinates': [0.0, 0.0], 'type': 'Point'}, 'properties': { 'name': 'null island', 'address': 'nowhere', 'extent': Decimal(0.0)} }
ipdb> pbf = geobuf.encode(feature)
ipdb> dict(geobuf.decode(pbf))
{'type': 'Feature', 'geometry': OrderedDict([('type', 'Point'), ('coordinates', [0.0, 0.0])]), 'properties': {'name': 'null island', 'address': 'nowhere'}}

The use-case driving this issue is working with binary attributes in AWS DynamoDB, where floats (numeric) are all converted to Decimal in the boto3 client. It's possible to add an extra step in the serializations to handle this custom-requirement, but I figured it's worth asking whether a pythonic library for supporting JSON might also try to support Decimal values. It's fine by me to close the issue as something out-of-scope for the library.
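
The extra serialization step mentioned above could look roughly like this: a recursive pass that converts every Decimal to a float before the object is handed to the encoder. `decimals_to_floats` is a hypothetical helper, not part of pygeobuf.

```python
from decimal import Decimal


def decimals_to_floats(obj):
    """Return a copy of obj with every Decimal replaced by a float."""
    if isinstance(obj, Decimal):
        return float(obj)
    if isinstance(obj, dict):
        return {key: decimals_to_floats(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [decimals_to_floats(item) for item in obj]
    return obj


feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [Decimal("0.0"), Decimal("0.0")]},
    "properties": {"name": "null island", "extent": Decimal("12.5")},
}
clean = decimals_to_floats(feature)
print(clean["properties"]["extent"])  # 12.5
```

Converting to float can lose digits for Decimals that carry more precision than a double holds, so this only suits values where that is acceptable.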

Error with dependency protobuf v4.21.0

Cross-posting this issue, as perhaps it belongs here instead (I also posted it in the library's repo).

I receive the following error when importing a library that uses pygeobuf:

import dash_leaflet.express as dlx

With the latest version of geobuf and protobuf also installed (4.21.0 released May 25, 2022)

The error I receive is:

  File "...my_page.py", line 10, in <module>
    import dash_leaflet.express as dlx
  File "/home/.../python3.10/site-packages/dash_leaflet/express.py", line 1, in <module>
    import geobuf
  File "/home/.../python3.10/site-packages/geobuf/__init__.py", line 1, in <module>
    from .encode import Encoder
  File "/home/.../python3.10/site-packages/geobuf/encode.py", line 9, in <module>
    from . import geobuf_pb2
  File "/home/.../python3.10/site-packages/geobuf/geobuf_pb2.py", line 33, in <module>
    _descriptor.EnumValueDescriptor(
  File "/home/.../python3.10/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

I can confirm the error goes away when downgrading to protobuf 3.20.1. The error itself points to geobuf_pb2.py
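
For setups where downgrading is not an option, the second workaround from the error message can be applied from Python. This is a sketch only: the variable has to be set before geobuf (or anything that imports it) is first loaded, and the message itself warns that pure-Python parsing is much slower.

```python
import os

# Must run before the first `import geobuf` (or any package that imports it),
# because protobuf picks its implementation when it is first loaded.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# import geobuf  # uncomment in an environment where pygeobuf is installed
```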

Update proto file to support higher precision

The latest proto file for geobuf uses sint64 for coordinates, but pygeobuf uses sint32. This means that if you have a feature encoded with precision higher than 6 (which admittedly the reference encoder will refuse to do, but nowhere does it say it's not allowed), the coordinates will be read incorrectly on deserialization. I'm using higher precision coordinates as intermediary storage for our seamless census data extraction tool. Six decimal places on geographic coordinates is about 10 cm at the equator, which is not high enough resolution for some applications.
