pnorman / cheep

CHangeset Extraction Engine Process

cheep's Issues

Build indexes and finalize DB

After the import, the following tasks remain:

Build UNIQUE indexes on the id columns
Set the id columns as PRIMARY KEYs
If necessary, build indexes on ways.nodes and relations.members
- How do we do this for the typed column?
ANALYZE all tables
Reset autovacuum
Set the id indexes as cluster indexes, but don't actually cluster
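
The checklist above could be turned into SQL roughly as follows. This is only a sketch: finalize_statements and the index names are hypothetical, and it leaves the open question about indexing relations.members unanswered. It relies on PostgreSQL's ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY USING INDEX, ALTER TABLE ... CLUSTER ON (which records the clustering index without rewriting the table), and ALTER TABLE ... RESET.

```python
def finalize_statements(table, gin_column=None):
    """Build the post-import statements for one table (hypothetical helper).

    gin_column optionally names an array column, e.g. "nodes" on ways,
    that should get a GIN index.
    """
    index = f"{table}_pkey"
    statements = [
        # Build the unique index first, then promote it to the primary key.
        f"CREATE UNIQUE INDEX {index} ON {table} (id);",
        f"ALTER TABLE {table} ADD CONSTRAINT {index} PRIMARY KEY USING INDEX {index};",
    ]
    if gin_column is not None:
        statements.append(
            f"CREATE INDEX {table}_{gin_column}_idx ON {table} USING gin ({gin_column});"
        )
    statements += [
        f"ANALYZE {table};",
        # Undo the autovacuum_enabled = false storage parameter set at creation.
        f"ALTER TABLE {table} RESET (autovacuum_enabled);",
        # Mark the index for future CLUSTER runs without clustering now.
        f"ALTER TABLE {table} CLUSTER ON {index};",
    ]
    return statements


if __name__ == "__main__":
    for table, column in (("nodes", None), ("ways", "nodes"), ("relations", None)):
        print("\n".join(finalize_statements(table, column)))
```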

Load data

cheep needs to load data into the nodes, ways, and relations tables. For performance reasons this must be done with COPY statements, and the load should be multi-threaded.
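
One possible shape for this, as a sketch rather than a design: rows are rendered in PostgreSQL's COPY text format (tab-separated, \N for NULL, backslash escapes) and handed to worker threads in chunks. The names copy_field, copy_line, chunked, load, and send_chunk are all hypothetical. In a real loader send_chunk would presumably open its own connection and stream the chunk into something like COPY nodes FROM STDIN; here it is just a callable so the sketch runs without a database.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_field(value):
    """Render one value in COPY text format."""
    if value is None:
        return r"\N"
    text = str(value)
    # Backslash must be escaped first so later escapes are not doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def copy_line(row):
    """Render one row as a tab-separated, newline-terminated COPY line."""
    return "\t".join(copy_field(v) for v in row) + "\n"


def chunked(rows, size):
    """Yield COPY payloads containing at most `size` rows each."""
    chunk = []
    for row in rows:
        chunk.append(copy_line(row))
        if len(chunk) == size:
            yield "".join(chunk)
            chunk = []
    if chunk:
        yield "".join(chunk)


def load(rows, send_chunk, workers=4, chunk_size=10000):
    """Send chunks through `send_chunk` on a thread pool, keeping chunk order.

    Executor.map consumes the generator eagerly, so a real loader would
    need to bound how many chunks are queued at once.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_chunk, chunked(rows, chunk_size)))


if __name__ == "__main__":
    rows = [(1, -1231234567, 491234567), (2, -1231230000, 491230000)]
    print(load(rows, lambda payload: payload.count("\n"), workers=2, chunk_size=1))
```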

Manage database schemas

We should have some way of migrating database schemas. Ruby can do this with ActiveRecord::Migration, but what is the equivalent in Python?
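
Whatever tool ends up being chosen, the core idea is small enough to sketch by hand: keep a version table and apply numbered statements that have not yet run. The example below is only an illustration of that idea, not a recommendation. MIGRATIONS, current_version, migrate, and the schema_version table are hypothetical names, and sqlite3 stands in for PostgreSQL purely so the sketch runs without a server (conn.execute and the ? placeholder are sqlite3-specific).

```python
import sqlite3

# Hypothetical migration list: (version number, statement).
MIGRATIONS = [
    (1, "CREATE TABLE nodes (id bigint NOT NULL, long int NOT NULL, lat int NOT NULL)"),
    (2, "CREATE TABLE ways (id bigint NOT NULL)"),
]


def current_version(conn):
    """Return the highest applied migration number, or 0 if none."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version integer NOT NULL)"
    )
    row = conn.execute("SELECT max(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, migrations=MIGRATIONS):
    """Apply every migration newer than the recorded version, in order."""
    applied = []
    version = current_version(conn)
    for number, statement in sorted(migrations):
        if number > version:
            conn.execute(statement)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (number,)
            )
            applied.append(number)
    conn.commit()
    return applied


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(migrate(connection))  # first run applies everything
    print(migrate(connection))  # second run has nothing left to apply
```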

Create database tables

The tables must exist before any data is imported. The first draft is:

CREATE TABLE nodes (
  id bigint NOT NULL,
  long int NOT NULL,
  lat int NOT NULL,
  tags hstore);

CREATE TABLE ways (
  id bigint NOT NULL,
  nodes bigint[] NOT NULL,
  tags hstore);

CREATE TYPE relation_type AS ENUM ('node','way','relation');
CREATE TYPE relation_member AS (type relation_type, id bigint);
CREATE TABLE relations (
  id bigint NOT NULL,
  members relation_member[],
  tags hstore);

The tables should be created WITH (autovacuum_enabled = false), and that setting then needs to be reset after the import.
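
As a small illustration of where the storage parameter goes, the sketch below renders a CREATE TABLE statement with the WITH clause appended. create_table_sql is a hypothetical helper; the column definitions are taken from the draft above.

```python
def create_table_sql(name, columns):
    """Render CREATE TABLE with autovacuum disabled for the bulk import.

    columns is a list of (column name, definition) pairs.
    """
    body = ",\n".join(f"  {col} {definition}" for col, definition in columns)
    return f"CREATE TABLE {name} (\n{body})\n  WITH (autovacuum_enabled = false);"


if __name__ == "__main__":
    print(create_table_sql("ways", [
        ("id", "bigint NOT NULL"),
        ("nodes", "bigint[] NOT NULL"),
        ("tags", "hstore"),
    ]))
```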

node to way mapping

With a nodes bigint[] column in the ways table, going from a way to its child nodes is easy, but going from a node to its parent ways is harder. There are three ways to do it.

1. Have a way_nodes table with way_id, node_id, and sequence columns, index it on node_id, and SELECT by node_id. The ways.nodes column could be removed if there were also an index on way_id, but that would be slower than a ways.nodes column. This is what the pgsnapshot schema does.
- Advantages: Fastest for a single lookup.
- Disadvantages: Very large table and index (a table with ~4 billion rows). Bad for cache contention.
2. Have a GIN index on ways.nodes bigint[]. Lookups can be done with the && (overlap) operator, but should be structured to minimize the number of queries. osm2pgsql slim tables do this.
- Advantages: The index takes much less disk space than a way_nodes table. Avoids duplicating data.
- Disadvantages: A 170 GB GIN index with fastupdate off is slower to update and bloats quickly.
3. Have a ways bigint[] column in the nodes table and store the IDs of the parent ways there. Lookups can be done by querying nodes.ways. AFAIK, this is untested.
- Advantages: Uses an already existing table and index. Smallest cache contention impact from the indexes. Probably fastest.
- Disadvantages: Untested. Requires more data management, because the nodes table must be updated every time a way is created. Makes an already large nodes table even larger.
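
To make the comparison concrete, here is a sketch of what each lookup might look like, plus an in-memory model of the bookkeeping the third option requires. Everything here is illustrative: the query strings assume psycopg-style %s placeholders bound to a list of node IDs, and LOOKUPS, build_parent_ways, and add_way are hypothetical names.

```python
from collections import defaultdict

# Hypothetical lookup queries, one per option, each taking a list of node IDs.
LOOKUPS = {
    "way_nodes table": "SELECT DISTINCT way_id FROM way_nodes WHERE node_id = ANY(%s)",
    "GIN on ways.nodes": "SELECT id FROM ways WHERE nodes && %s::bigint[]",
    "nodes.ways column": "SELECT id, ways FROM nodes WHERE id = ANY(%s)",
}


def build_parent_ways(ways):
    """Invert way id -> node ids into node id -> sorted parent way ids.

    This models what a nodes.ways column would have to contain.
    """
    parents = defaultdict(set)
    for way_id, node_ids in ways.items():
        for node_id in node_ids:
            parents[node_id].add(way_id)
    return {node_id: sorted(way_ids) for node_id, way_ids in parents.items()}


def add_way(parents, way_id, node_ids):
    """Record a new way: every distinct member node gains one parent."""
    for node_id in set(node_ids):
        current = parents.setdefault(node_id, [])
        if way_id not in current:
            current.append(way_id)
            current.sort()


if __name__ == "__main__":
    # Way 12 visits node 1 twice (a closed way); it is still one parent.
    ways = {10: [1, 2, 3], 11: [3, 4], 12: [1, 2, 1]}
    parents = build_parent_ways(ways)
    add_way(parents, 13, [4, 5])
    print(parents)
```

The add_way step is the extra management cost named in the third option's disadvantages: creating one way touches one nodes row per distinct member node.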

Parse OSM PBF

Make available on PyPI

Basic command line parsing, tests, and other infrastructure