
geo-embeddings-survey's Introduction

Survey of current practices for storing embeddings in GeoParquet

A repository that serves as a survey of use cases and current data schemas for vector embeddings in GeoParquet.

About

This repository serves as a survey of all attempts to store geospatial embeddings in GeoParquet. Storing embeddings this way generally seems like a good idea, and everyone would benefit from a consistent approach. The hope with this repo is to gather information on what people have done, along with any additional ideas, and then work toward a lightweight specification / set of recommendations for how to store embeddings in GeoParquet (or other geospatial formats).

After we have sufficient submissions here, we'll find a time to gather a group to draft those recommendations.

Submitting

Contributions are very welcome. If you've created geospatial embeddings, just clone the repo, make a copy of the template, update it with your information, and put it in a folder named after your organization. Ideally, also include a small sample GeoParquet file. Please also run scripts/parquet_metadata_to_json.py to make a more human-readable copy of the JSON metadata in your GeoParquet file. It is fine if you didn't customize the Parquet metadata at all; it's still valuable to see how the GeoParquet spec is being used.
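For orientation, GeoParquet keeps its file-level metadata as a JSON string under the `geo` key of the Parquet key-value metadata. The sketch below is not the repository's scripts/parquet_metadata_to_json.py. It is only a minimal illustration of the idea, assuming the key-value metadata has already been read as a bytes-to-bytes mapping (for example via `pyarrow.parquet.read_schema(path).metadata`). The function name `metadata_to_readable_json` and the sample values are hypothetical.

```python
import json


def metadata_to_readable_json(kv_metadata):
    """Turn Parquet key-value metadata (bytes -> bytes) into pretty JSON.

    Values that hold JSON (such as the GeoParquet "geo" entry) are parsed so
    they show up as nested objects; other values are kept as plain strings.
    """
    readable = {}
    for key, value in (kv_metadata or {}).items():
        name = key.decode("utf-8")
        text = value.decode("utf-8", errors="replace")
        try:
            readable[name] = json.loads(text)
        except json.JSONDecodeError:
            readable[name] = text
    return json.dumps(readable, indent=2, sort_keys=True)


# Hypothetical sample metadata, shaped like what a GeoParquet file carries.
sample = {
    b"geo": json.dumps({
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {"encoding": "WKB", "geometry_types": ["Polygon"]},
        },
    }).encode("utf-8"),
    b"created_by": b"example-writer",
}

print(metadata_to_readable_json(sample))
```

Parsing the nested JSON rather than printing the raw string is what makes the `geo` entry readable, since it is otherwise a single long escaped line.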

To submit your information, just create a pull request for the repo. You'll then be given commit rights, so you can continue to update after the initial PR. If you are not comfortable creating PRs, just file an issue and someone can help you get it submitted.

Please note that submissions do not need to be on 'open' embeddings datasets. We're just looking to standardize, so if you're working with proprietary data, just fill out what you can. You are welcome to put up an example dataset with a proprietary license, or to just describe the dataset. Submissions that store geospatial embeddings but don't use GeoParquet are also welcome, but please adapt the metadata questions to the file format.

Current submissions

geo-embeddings-survey's People

Contributors

cholmes, bengmstrong, strixcuriosus, brunosan

geo-embeddings-survey's Issues

Considerations about connection to SimSearch DBs

I do not have a concrete suggestion for how to do this, but embeddings will sometimes be loaded into databases to support similarity search applications, so I wanted to add this dimension to the discussion. If possible, it would be great to align file-based GeoParquet embedding collections with embedding database systems.

Additional Questions on Embeddings

This issue explores questions beyond the specific field to use within GeoParquet, focusing on standardization for consistent consumer expectations and building common tools for operations like search or classification.

Chipping

Questions about breaking down input imagery into smaller areas or "chips" for AI/ML processing:

  1. How do you determine the ideal chip size for a given resolution?
    a. Consider the trade-off between relevance (not too big) and discernibility (not too small)
    b. How does chip size impact row count and data size?

  2. What tiling approach should be used for the chips? Options:
    a. Chopping every NxN pixels based on the chip size
    b. Using a defined grid (e.g., H3 or Major Tom)

  3. How do you deal with the Modifiable Areal Unit Problem (MAUP)?
    a. How to handle edges within chips that cut across areas?
    b. Would it make sense to have chips overlap?
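
To make the NxN and overlap questions concrete, here is a minimal sketch of pixel-based chipping. It is an illustration rather than a recommended scheme: the function name `chip_windows`, the scene size, and the choice to skip partial edge chips are all assumptions. It shows how chip size and overlap drive the number of chips, and therefore the number of rows in a resulting file.

```python
def chip_windows(width, height, chip_size, overlap=0):
    """Return (col_off, row_off, chip_size, chip_size) windows over a scene.

    Windows step by `chip_size - overlap` pixels. Partial chips at the right
    and bottom edges are skipped in this sketch; padding them is another option.
    """
    if not 0 <= overlap < chip_size:
        raise ValueError("overlap must be >= 0 and smaller than chip_size")
    stride = chip_size - overlap
    windows = []
    for row_off in range(0, height - chip_size + 1, stride):
        for col_off in range(0, width - chip_size + 1, stride):
            windows.append((col_off, row_off, chip_size, chip_size))
    return windows


# Hypothetical 1024 x 1024 pixel scene with 256-pixel chips.
no_overlap = chip_windows(1024, 1024, chip_size=256)
half_overlap = chip_windows(1024, 1024, chip_size=256, overlap=128)
print(len(no_overlap), len(half_overlap))  # prints: 16 49
```

In this example a 50% overlap roughly triples the row count (16 to 49 chips), which is the kind of trade-off the questions above are pointing at.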

Storage and Distribution

  1. What should the guidance be for file size?
    a. What is the ideal granularity of original scenes to GeoParquet files (containing rows of chips)?

    • A very simple 1:1 approach would translate one scene into a single GeoParquet file.
    • There may be benefits to alternatives that allow for larger or smaller GeoParquet files.
  2. Is there an ideal partitioning scheme within object storage?

  3. Alternatives to GeoParquet:
    a. Is GeoParquet preferred over PyTorch- or NumPy-style files?
    b. Has anyone considered putting embeddings in Zarr?
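
As a rough aid for the file size question, the uncompressed size of the embedding values alone is rows × dimensions × bytes per value. The sketch below works through that arithmetic. The chip count, the 768 dimensions, the float32 assumption, and the 512 MB target are hypothetical numbers chosen for illustration, and the estimate ignores geometry, other columns, and compression.

```python
def embedding_bytes(num_rows, dimensions, bytes_per_value=4):
    """Uncompressed size of the embedding values alone (float32 by default)."""
    return num_rows * dimensions * bytes_per_value


def scenes_per_file(target_bytes, bytes_per_scene):
    """How many whole scenes fit under a target file size (at least one)."""
    return max(1, target_bytes // bytes_per_scene)


# Hypothetical numbers: 10,000 chips per scene, 768-dimensional embeddings.
per_scene = embedding_bytes(10_000, 768)
print(f"{per_scene / 1e6:.1f} MB per scene")  # prints: 30.7 MB per scene

# Hypothetical target of 512 MB per file.
print(scenes_per_file(512_000_000, per_scene))  # prints: 16
```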

Searching and Analysis

  1. Searching directly from GeoParquet:
    a. Are users considering searching directly across GeoParquet stored in object stores, or do they prefer copying into a vector database?
    b. Would capturing vector indexes in the GeoParquet enable efficient searching?

  2. Are there any implications on the end analysis that would impact how embeddings are stored in GeoParquet?
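
For context on the search question, the baseline that a vector database or a stored index would improve on is a brute-force scan: compare the query embedding against every row and keep the best matches. The sketch below shows that baseline with cosine similarity in plain Python. The names `cosine_similarity` and `top_k`, the chip ids, and the tiny 3-dimensional embeddings are hypothetical, and a real workload would read rows from GeoParquet and use vectorized math.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Scan every (chip_id, embedding) row and return the k most similar."""
    scored = [(chip_id, cosine_similarity(query, emb)) for chip_id, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical rows, as they might look after reading an embeddings file.
rows = [
    ("chip-a", [1.0, 0.0, 0.0]),
    ("chip-b", [0.9, 0.1, 0.0]),
    ("chip-c", [0.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], rows, k=2)
print(results)
```

The scan touches every row, so its cost grows linearly with the number of chips. That is why the questions above ask whether an index could be captured alongside the data.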
