
geo's Introduction

geo



Java utility methods for geohashing.

Features

  • simple api
  • encodes geohashes from latitude, longitude to arbitrary length (GeoHash.encodeHash)
  • decodes latitude, longitude from geohashes (GeoHash.decodeHash)
  • finds adjacent hash in any direction (GeoHash.adjacentHash), works on borders including the poles too
  • finds all 8 adjacent hashes to a hash (GeoHash.neighbours)
  • calculates hash length to enclose a bounding box (GeoHash.hashLengthToCoverBoundingBox)
  • calculates geohashes of given length to cover a bounding box. Returns coverage ratio as well (GeoHash.coverBoundingBox)
  • calculates height and width of geohashes in degrees (GeoHash.heightDegrees and GeoHash.widthDegrees)
  • encodes and decodes long values from geohashes (Base32.encodeBase32 and Base32.decodeBase32)
  • good performance (~3 million GeoHash.encodeHash calls per second on an i7, single thread)
  • no mutable types exposed by api
  • threadsafe
  • 100% unit test coverage (for what that's worth of course!)
  • Apache 2.0 licence
  • Published to Maven Central

Status: production, available on Maven Central

Maven site reports are here including javadoc.

Getting started

Add this to your pom:

<dependency>
    <groupId>com.github.davidmoten</groupId>
    <artifactId>geo</artifactId>
    <version>VERSION_HERE</version>
</dependency>
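With the dependency in place, basic usage looks like this. A minimal sketch using the methods listed in the features above (the coordinates are arbitrary):

```java
import java.util.List;

import com.github.davidmoten.geo.GeoHash;
import com.github.davidmoten.geo.LatLong;

public class GeoHashExample {
    public static void main(String[] args) {
        // encode a latitude/longitude to a hash of length 6
        String hash = GeoHash.encodeHash(-25.382708, -49.265506, 6);

        // decode back to the centre of the hash cell
        LatLong centre = GeoHash.decodeHash(hash);

        // the 8 hashes surrounding the cell
        List<String> neighbours = GeoHash.neighbours(hash);

        System.out.println(hash + " " + centre + " " + neighbours);
    }
}
```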

Bounding box searches using geohashing

What is the problem?

Databases of events occurring at specific times and places on the earth's surface are likely to be queried in terms of ranges of time and position. One such query is a bounding box query involving a time range and a position constraint defined by a bounding lat-long box.

The challenge is to make your database run these queries quickly.

Some databases either do not support, or suffer significant performance degradation on, queries over large datasets with inequality conditions on more than one variable.

For example, a search for all ship reports within a time range and within a bounding box could be achieved with a range condition on time combined with a range condition on latitude combined with a range condition on longitude (combined with logical ANDs). This type of query can perform badly on many database types, SQL and NoSQL. On Google App Engine Datastore, for instance, only one variable with inequality conditions is allowed per query; this is a sensible restriction to meet scalability guarantees.

What is a solution?

The bounding box query with a time range can be rewritten using geohashes so that only one variable is subject to a range condition: time. The method is:

  • store geohashes of all lengths (depending on the indexing strategies available, a single full-length hash may be enough) in indexed fields against each lat-long position in the database. Note that storing hashes as a single long integer value may be advantageous (see Base32.decodeBase32 to convert a hash to a long).
  • calculate a set of geohashes that wholly covers the bounding box
  • perform the query using the time range and equality against the geohashes. For example:
(startTime <= t < finishTime) and (hash3='drt' or hash3='dr2')
  • filter the results of the query to include only those results within the bounding box

The last step is necessary because the set of geohashes contains the bounding box but may be larger than it.
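The query-building step can be sketched as follows. This is a standalone illustration, not part of the library; the column names (t, hash3) follow the example query above, and in practice the hashes would come from GeoHash.coverBoundingBox:

```java
import java.util.List;

public class GeoQueryBuilder {

    // Builds a SQL where clause of the form shown above: a time range
    // plus equality against each covering hash. The hashN column name
    // is derived from the hash length; names are illustrative only.
    static String buildWhereClause(List<String> hashes) {
        StringBuilder b = new StringBuilder("(startTime <= t and t < finishTime) and (");
        for (int i = 0; i < hashes.size(); i++) {
            if (i > 0) {
                b.append(" or ");
            }
            b.append("hash").append(hashes.get(i).length())
             .append("='").append(hashes.get(i)).append("'");
        }
        return b.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildWhereClause(List.of("drt", "dr2")));
        // (startTime <= t and t < finishTime) and (hash3='drt' or hash3='dr2')
    }
}
```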

What hash length to use?

So how long should the hashes be that cover the bounding box? This depends on your aims, which might include minimizing one or more of: CPU, URL fetch time, financial cost, total data transferred from the datastore, database load, 2nd-tier load, or a heap of other possible metrics.

Calling GeoHash.coverBoundingBox with just the bounding points and no additional parameters will return hashes of a length such that the number of hashes is as many as possible but less than or equal to GeoHash.DEFAULT_MAX_HASHES (12).

You can explicitly control maxHashes by calling GeoHash.coverBoundingBoxMaxHashes.

As a quick example, take a bounding box proportioned more or less like a screen, with Schenectady NY and Hartford CT in the USA at the corners.

Here are the hash counts for different hash lengths:

m is the size in square degrees of the total hashed area and a is the area of the bounding box.

length  numHashes  m/a
1           1      1694
2           1        53
3           4         6.6
4          30         1.6
5         667         1.08
6       20227         1.02

Only testing against your database, preferably with real-life data, will determine the optimal maxHashes value. In the benchmarks section below, a test with the H2 database found that query time was optimal with maxHashes around 700. I doubt that this would hold for many other databases.

A rigorous exploration of this topic would be fun to do or see. Let me know if you've done it or have a link and I'll update this page!

Hash height and width formulas

This is the relationship between a hash of length n and its height and width in degrees:

First define this function:

    parity(n) = 0 if n is even otherwise 1

Then

    width = 180 / 2^((5n + parity(n) - 2) / 2) degrees

    height = 180 / 2^((5n - parity(n)) / 2) degrees
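As a sketch, the formulas translate directly to Java (the library's own GeoHash.widthDegrees and GeoHash.heightDegrees should agree with this):

```java
public class GeoHashDimensions {

    // parity(n) = 0 if n is even, otherwise 1
    static int parity(int n) {
        return n % 2 == 0 ? 0 : 1;
    }

    // width in degrees of a hash of length n: 180 / 2^((5n + parity(n) - 2) / 2)
    static double widthDegrees(int n) {
        return 180.0 / Math.pow(2, (5 * n + parity(n) - 2) / 2.0);
    }

    // height in degrees of a hash of length n: 180 / 2^((5n - parity(n)) / 2)
    static double heightDegrees(int n) {
        return 180.0 / Math.pow(2, (5 * n - parity(n)) / 2.0);
    }

    public static void main(String[] args) {
        // a length-1 hash cell is 45 x 45 degrees
        System.out.println(widthDegrees(1) + " x " + heightDegrees(1));
    }
}
```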

The height and width in kilometres will be dependent on what part of the earth the hash is on and can be calculated using Position.getDistanceToKm. For example at (lat,lon):

double distancePerDegreeWidth =
     new Position(lat,lon).getDistanceToKm(new Position(lat, lon+1));
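For illustration, here is a standalone haversine sketch of the same idea, assuming a mean earth radius of 6371 km (the library's Position.getDistanceToKm may use a different radius, so results can differ slightly):

```java
public class DegreesToKm {

    // great-circle (haversine) distance in km, assuming mean earth radius 6371 km
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double r = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        double lat = -35.0, lon = 149.0;
        // km per degree of longitude at this latitude;
        // multiply by a hash's width in degrees to get its width in km
        double kmPerDegreeWidth = distanceKm(lat, lon, lat, lon + 1);
        System.out.println(kmPerDegreeWidth);
    }
}
```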

Benchmarks

Inserted 10,000,000 records into an embedded H2 filesystem database, which uses B-tree indexes. The records were geographically randomly distributed across a region, then a bounding box of 1/50th the area of the region was chosen. Queries performed as follows (time is the time to run the query and iterate the results):

hashLength  numHashes  found  from   time(s)
2           2          200k   10m    56.0
3           6          200k   1.2m   10.5
4           49         200k   303k    4.5
5           1128       200k   217k    3.6
none        none       200k   200k   31.1 (multiple range query)

I was pleasantly surprised that H2 allowed me to put over 1000 conditions in the where clause. I tried the next higher hash length as well, with over 22,000 hashes, but H2 understandably threw a StackOverflowError.

To run the benchmark:

mvn clean test -Dn=10000000

Running with n=1,000,000 is much quicker and yields the same primary result:

multiple range query is 10X slower than geohash lookup if the hash length is chosen judiciously

geo's People

Contributors

anusien, bramp, bryant1410, davidmoten, dependabot[bot], dhagberg, guilhem-lk, niqueco, twillouer


geo's Issues

Remove dependency of Guava

Remove the dependency on Guava, which is especially relevant for Android projects. A library this small really does not need Guava.

What am I doing wrong?

GeoHash.encodeHash(0, 0) gives "s00000000000"
Base32.decodeBase32("s00000000000") gives 864691128455135232

If I check it using ruby I get different results

irb(main):001:0> 864691128455135232.to_s(32)
=> "o00000000000"
irb(main):002:0> "o00000000000".to_i(32)
=> 864691128455135232
irb(main):003:0> "s00000000000".to_i(32)
=> 1008806316530991104
irb(main):004:0>

Where is the issue?

GeoHash#encodeHash(double, double, int) allows invalid 'length'

The nested call to GeoHash#fromLongToString(long) does a range check on the hash but only catches some invalid cases. I can pass 20 for the length in encodeHash(), for example, and wind up with a shorter hash instead of an error. Passing 13 for the length throws an error.

GeoHash algorithm not producing expected results

Hi,
I've been having some issues with generating geohashes of points and then of bounding boxes that I would expect to share a prefix. E.g. this fairly arbitrary point in New Mexico (35.37322998046875, -108.7591552734375) should be contained in a rough bounding box of the United States (13.923403897723347, -126.91406249999999 x 55.37911044801047, -74.1796875). Am I doing something wrong?

thanks

Unit test below:

package org.digitalantiquity.skope;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.apache.log4j.Logger;
import org.digitalantiquity.skope.service.LuceneIndexingService;
import org.junit.Test;

import com.github.davidmoten.geo.Coverage;
import com.github.davidmoten.geo.GeoHash;
import com.github.davidmoten.geo.LatLong;

public class ShapefileParserTest {

    private final Logger logger = Logger.getLogger(getClass());


    @Test
    public void testHash() {
        double x1 = -126.91406249999999;
        double y1 = 13.923403897723347;
        double x2 = -74.1796875;
        double y2 = 55.37911044801047;

        logger.debug(String.format("start (%s,%s) x(%s,%s)", x1, y1, x2, y2));
        String point = "9w69jps00000";
        LatLong latLong = GeoHash.decodeHash(point);
        logger.debug(latLong);
        String hash2 = GeoHash.encodeHash(35.37322998046875, -108.7591552734375);
        assertEquals(hash2, point);
        logger.debug(hash2);
        Coverage coverage = GeoHash.coverBoundingBox(x1,y1,x2,y2,4);
        boolean seen = false;
        logger.debug(coverage);
        for (String hash : coverage.getHashes()) {
            if (hash.equals(point) || point.startsWith(hash)) {
                logger.debug(hash + " -> " + GeoHash.decodeHash(hash));
                seen = true;
            }
        }
        assertTrue("should have seen hash",seen);
    }
}

Geohash.neighbors wrong results

..for hash "u".

result = {ArrayList@3156} size = 8
0 = "g"
1 = "v"
2 = "b"
3 = "s"
4 = "z"
5 = "e"
6 = "c"
7 = "t"

g, u, v, e, s, t seem to be OK. I could imagine that u meets b, c and f over the pole. But what about "z"?

Regards

Consider removing synchronized on GeoHash::heightDegrees

Hi !

We're using GeoHash.hashContains with 2- to 4-char geohashes and are seeing that the static method heightDegrees comes up as a hot spot during CPU sampling with VisualVM. I noticed this method is synchronized, so I was wondering if it might be a good idea to pre-calculate all MAX_HASH_LENGTH (12) values statically so that the synchronization can be removed?

Is this something you may consider changing ?

Thanks
Regards

Add overload to GeoHash.encode to include both length and max parameters

As requested by @abrin.

David,
As I use it a bit, it might be nice to expose both "length" and "max" to the encode functions. Right now, one has to choose between either a number of hashes or a max hash length. But if you're selecting a larger region, it likely becomes important to have access to both.

thanks,

adam

Misuse of GeoHash.DEFAULT_MAX_HASHES

Due to DEFAULT_MAX_HASHES and MAX_HASH_LENGTH having the same value of 12, DEFAULT_MAX_HASHES has been used (and referenced in javadoc) where MAX_HASH_LENGTH should have been used. Replacing the relevant instances will have no effect on the api but should improve the documentation.

DEFAULT_MAX_HASHES should relate only to GeoHash.coverBoundingBox().

Dependency Issues

Looks like there is a dependency reference on java.awt in the library.
Although this isn't an Android library, it might be handy to clean up this kind of reference.

Package not included in Android:

    ../../../../../.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-math3/3.2/ec2544ab27e110d2d431bdad7d538ed509b21e62/commons-math3-3.2.jar: Invalid package reference in library; not included in Android: java.awt.geom. Referenced from org.apache.commons.math3.geometry.euclidean.threed.PolyhedronsSet.RotationTransform.

    ../../../../../.gradle/caches/modules-2/files-2.1/com.github.davidmoten/grumpy-core/0.2.2/48b4cff02a62a3a9ea876dd2486225aefd225706/grumpy-core-0.2.2.jar: Invalid package reference in library; not included in Android: java.awt. Referenced from com.github.davidmoten.grumpy.core.Position.

Get the distance?

I didn't find a way to get the distance between points/geohashes. Would it be possible to add something like this code, which I got from a chatbot?

    import kotlin.math.*

    /** Calculate haversine distance between two points in meters */
    fun haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double {
        val R = 6378.14 // Earth radius in kilometers
        val dLat = Math.toRadians(lat2 - lat1)
        val dLon = Math.toRadians(lon2 - lon1)
        val lat1Rad = Math.toRadians(lat1)
        val lat2Rad = Math.toRadians(lat2)
        val a = sin(dLat / 2).pow(2) + sin(dLon / 2).pow(2) * cos(lat1Rad) * cos(lat2Rad)
        val c = 2 * asin(sqrt(a))
        return R * c * 1000 // Distance in meters
    }

    /** Calculate distance between two lists of points */
    fun distance(points1: List<LatLng>, points2: List<LatLng>): Double {
        var minDistance = Double.MAX_VALUE
        for (point1 in points1) {
            for (point2 in points2) {
                val distance = haversine(point1.lat, point1.lng, point2.lat, point2.lng)
                if (distance < minDistance) {
                    minDistance = distance
                }
            }
        }
        return minDistance
    }

Link to `api` is not working anymore

The api link in the Readme.md should show API documentation, I suppose? The classes are well documented through comments, so it's not that big of a deal that this link doesn't work, but I felt the need to mention it.

Unexpected coverage with bounding box of the world

I might be using this incorrectly, but when I do

double topLeftLat = 90d;
double topLeftLon = -179d;
double bottomRightLat = -90d;
double bottomRightLon = 180d;
Coverage c = GeoHash.coverBoundingBox(topLeftLat, topLeftLon, bottomRightLat, bottomRightLon, 1);

I would have expected c.getHashes() to contain the 32 single-character Geohashes (i.e., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, b, c, d, e, f, g, h, j, k, m, n, p, q, r, s, t, u, v, w, x, y, z).

Instead, it contains only 0, 2, 8, b.

Get a bounding box by hash length

It would be great to have a BoundingBox object with:
area;
southwest.latitude;
southwest.longitude;
northeast.latitude;
northeast.longitude;
etc. Keep up the good work!

Calculation of width/height in kilometers

It's more of a question than an issue.
The documentation mentions that you can convert the width/height from degrees to kilometres using the Position.getDistanceToKm function. I checked the source code and failed to understand how to use it to convert degrees to kms.

Base32.encodeBase32(Base32.decodeBase32(geohash)) should be the identity function

I'm not sure whether this is an intentional design decision, but I expected that I could get back my original geohash String when I encode the decoded base32 long. This is useful for systems that wish to store a geohash as a long and later recover it as a String.

The issue is pretty simple. Base32.encodeBase32(Base32.decodeBase32("0000")) returns "0". The encodeBase32 function needs to prepend the result with zeros to the desired length. Then the correct geohash String can be recovered.
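One way to make the round trip the identity, as the issue suggests, is to pad the encoded string to the original hash length. A standalone sketch (not the library's actual Base32 class) using the standard geohash base-32 alphabet:

```java
public class PaddedBase32 {

    // the standard geohash base-32 alphabet (omits a, i, l, o)
    private static final String ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";

    // decode a geohash string to a long
    static long decode(String hash) {
        long value = 0;
        for (char c : hash.toCharArray()) {
            value = value * 32 + ALPHABET.indexOf(c);
        }
        return value;
    }

    // encode a long back to a geohash, left-padding with '0'
    // to the given length so that encode(decode(h)) == h
    static String encodePadded(long value, int length) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < length; i++) {
            b.append(ALPHABET.charAt((int) (value % 32)));
            value /= 32;
        }
        return b.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePadded(decode("0000"), 4)); // "0000"
    }
}
```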

Feature Request: sub-cells by direction.

First of all, I'd like to say that it's really a pleasure to use this library.

I am having a use case for which I need to identify "border cells" below a certain distance, e.g. all cells with a resolution of about 2.4x2.4 m^2 (geohash length = 9) on the border between cells "b" and "c".

What I would need is to be able to extract all cells of geohash length n+1 (e.g. length 2 starting from cells of length 1) by direction, so as to have the rightmost column of sub-cells in "b" of geohash length, say, 9, and the leftmost column of sub-cells of "c" of the same length.
