sujuikodb's Issues

Do not split links at stop locations

I have found out that network links should probably not be split at transit stop locations, because

  • it makes the network more fragmented by producing short links, and more links in general;
  • HFP points are projected to links, not to stop points, and HFP points related to stop events can (randomly) end up on a link on either side of the stop point;
  • we currently have to manipulate stop point geometries by shifting and clustering them to avoid creating links that are too short, which is not good practice.

Instead of the current model, we could require that a link can only run between intersection nodes.
Or, if we require an "enriched" network model that includes street names, for instance, links could be split by nodes wherever these attributes change.

Add all required pg variables to .env_test

For the life of me, I could not make the variables in `environment` use the variables from `env_file`. So psql would try to connect as root, for example. Instead, I added all of them to .env_test and it worked like a dream. Unfortunately that meant repeating some constants.

Originally posted by @haphut in #37 (comment)

Do route versioning only by route, direction and validity date range

In nw.route_version, valid_during and version_id are always correlated within each route & dir, so we do not actually need a separate version_id in addition to the date range. Also, route_ver_id will be generated as human-readable text anyway, so we may as well build it from the start and end dates instead of ending it with a running number (version_id). A rough sketch of the resulting table follows below.
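
To make this concrete, here is a rough sketch of what the table could look like. The column names and the example route are hypothetical, and the exclusion constraint (which needs the btree_gist extension) is just one way to keep the validity ranges of a route & direction from overlapping.

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE nw.route_version (
  route_ver_id text PRIMARY KEY,          -- e.g. '1059_1_2021-06-01'
  route        text      NOT NULL,
  dir          smallint  NOT NULL CHECK (dir IN (1, 2)),
  valid_during daterange NOT NULL,
  -- no overlapping validity ranges within the same route & direction
  EXCLUDE USING gist (route WITH =, dir WITH =, valid_during WITH &&)
);

-- The human-readable id is built from route, direction and the start date:
INSERT INTO nw.route_version (route_ver_id, route, dir, valid_during)
VALUES ('1059_1_2021-06-01', '1059', 1, daterange('2021-06-01', '2021-08-16'));
```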

Minimal deployment process with Docker

  • Write a Dockerfile & a docker-compose file that allow deploying a minimal version of the required database: no schemata or data yet, just Postgres and the extensions running without problems.
  • Describe this deployment process in the main README.

Update terminology

Tasks

  • Start a glossary file
  • Define terms required in simplified network, schedule and route models

Links, edges, nodes, vertices

I've been using links to refer to the street and tram network geometries that have nodes at both ends.
However, pgRouting uses the term edge with the same meaning, which causes ambiguity, at least when using pgRouting functions.
edges would be consistent with pgRouting, but on the other hand, the Finnish translation is a bit awkward ("kaaret"), whereas "links -> linkit" works pretty well.

I also use nodes while pgRouting uses vertices.
I think nodes is better, though, since "vertices" is commonly used to refer to the ordered points that constitute a linestring geometry.

Patterns -> paths?

I chose to use patterns when referring to unique ordered sets of stops used by a route and a direction, because Digitransit uses the same term in this case.
But my network model is not in fact based on stops but rather on links, and in network models a sequence of links is often called a path.
It might be more logical to change patterns to paths so they refer more clearly to the network model.
patterns could still be used, but only when referring to an ordered set of stops, not links.

Segments -> path_links?

Currently, segments refer to links as part of a route pattern.
This is a bit problematic, since at HSL "segments / segmentit" usually refers to Hastus segments, i.e. route or network parts between Hastus places.
Hastus segments could become part of the sujuiko network model hierarchy in the future, at least because they provide a useful framework for various analyses, so segments should be reserved for that use.
Current segments could perhaps be changed to path_links?

Templates -> runtimes?

templates was just a term I came up with to refer to different sets of driving times with which a pattern can be "actualized", depending on the time of day for instance.
This name is not very descriptive.
Maybe runtimes would be better?
At the same time, segment_times could be changed to path_link_runtimes.

Questions about test automation and migrations

Look at the project in its current state. At the moment we do not have any automated tests; one has to eyeball the console for errors when running DDL scripts and queries (and hopefully remember to use transactions so the initial state can be restored...).

On the other hand, developing test automation takes time. It is also not as straightforward with databases and complex data: extracting or creating mock data and writing tests with it for various corner cases of HFP data errors, for instance, is demanding. So far, I have just modified or amended the database model whenever I have encountered a weird case in manual experimenting.

  • What kind of automated tests should we write, if any? What are some critical places for them, considering time spent vs. benefit?
  • Suggestions for setting up a minimum-effort framework for regression tests, if we want to (partially) automate checking that, given a set of test data, our functions return the correct result data. E.g. RegreSQL & Docker? A minimal plain-SQL sketch follows below.
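
For reference, a minimal plain-SQL regression test could look roughly like this: load a tiny fixture, call the function under test, and fail loudly if the result differs from the expectation. The table, function and SRID below are hypothetical placeholders, not the actual sujuiko schema.

```sql
BEGIN;

-- Hypothetical fixture: one known link.
INSERT INTO nw.link (link_id, geom)
VALUES (1, ST_SetSRID(ST_MakeLine(ST_MakePoint(0, 0), ST_MakePoint(10, 0)), 3067));

DO $$
DECLARE
  result integer;
BEGIN
  -- Hypothetical function under test: snap a point to the nearest link.
  SELECT link_id INTO result
  FROM nw.get_nearest_link(ST_SetSRID(ST_MakePoint(5, 1), 3067));

  IF result IS DISTINCT FROM 1 THEN
    RAISE EXCEPTION 'get_nearest_link: expected link 1, got %', result;
  END IF;
END
$$;

ROLLBACK;  -- leave the test database untouched
```

Such a file can be run with psql -v ON_ERROR_STOP=1 so that a failed assertion yields a non-zero exit code, which e.g. a Docker-based CI step can detect.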

Making changes to an existing database structure, i.e. migrations, is also a question. For the time being, I don't have a version running on a server, and the amount of data is so small that the dev version of the db can easily be started and hydrated from scratch. However, migrations may become relevant sooner or later.

  • What are some good example SQL(-only) repositories using e.g. Migra where one could easily see the process of developing the original SQL scripts and building up related migration scripts?
  • For the final HFP observation tables, we are and will be using TimescaleDB, which works as an abstraction layer for partitioning a table into sub-tables based on a timestamp key column. Can such extensions be a problem for database model migrations? (See the hypertable sketch below for what this looks like in the DDL.)
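
For context, the TimescaleDB part of the DDL is roughly as below (create_hypertable() is the real TimescaleDB call; the table and columns are hypothetical). Whatever migration approach we pick has to be able to replay this kind of extension-specific call, not just the plain CREATE TABLE statements.

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE obs.hfp_point (
  tst   timestamptz NOT NULL,
  jrnid uuid        NOT NULL,
  geom  geometry(POINT, 3067)
);

-- Partition the table into time-based chunks behind the scenes.
SELECT create_hypertable('obs.hfp_point', 'tst');
```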

Self-contained network model

The network model in the nw schema should be made "self-contained".
This means that all the essential data validation would happen in the nw schema, and network data could basically be imported from different sources: by hand, from OSM, from Digiroad, etc.
Also, the modification history of the link, node and stop tables should be tracked, such that e.g. manual modifications made after a data import could be saved (and reverted, if necessary); one possible sketch follows at the end of this issue.

What currently happens in stage_nw could then be detached as well, possibly into a separate Python program that converts OSM data into a format suitable for the sujuiko db, merges links between intersections, and so on.
The sujuiko network model would simply accept whatever data it is given, as long as the data fulfils the network connectivity and validity requirements.
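
One possible sketch of the modification history part, assuming a generic log table filled by row-level triggers (all names hypothetical):

```sql
CREATE TABLE nw.modification_log (
  log_id      bigserial   PRIMARY KEY,
  table_name  text        NOT NULL,
  operation   text        NOT NULL,              -- INSERT / UPDATE / DELETE
  modified_at timestamptz NOT NULL DEFAULT now(),
  old_row     jsonb,
  new_row     jsonb
);

CREATE FUNCTION nw.log_modification()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
  INSERT INTO nw.modification_log (table_name, operation, old_row, new_row)
  VALUES (
    TG_TABLE_NAME,
    TG_OP,
    CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END,
    CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END
  );
  RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$;

CREATE TRIGGER link_modification_log
AFTER INSERT OR UPDATE OR DELETE ON nw.link
FOR EACH ROW EXECUTE FUNCTION nw.log_modification();
```

Reverting a manual modification could then mean reading old_row from the log and applying it back, either by hand or with a helper procedure.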

Reorganize ddl scripts

Some of the script files in db/ddl have grown quite large, and a more modular file structure would be easier to manage in the future. E.g., for each schema, have one script file with the DDL for empty tables, and then separate scripts for each function definition (most functions here are quite long and often require some individual testing). Also have separate files for views / matviews if needed. Finally, a main script would execute the DDL scripts in the correct order; a sketch follows below.
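
The main script could simply be a psql file that includes the others in dependency order, roughly like this (the file names are hypothetical):

```sql
-- main.sql, run with: psql -v ON_ERROR_STOP=1 -f main.sql
\set ON_ERROR_STOP on

\i nw/010_tables.sql
\i nw/020_func_create_links.sql
\i nw/030_views.sql
\i sched/010_tables.sql
\i sched/020_func_route_versions.sql
-- ...one \i line per DDL file, in the correct order
```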

Create test data: manual nw corrections

Digiroad may have missing links, for example -> manual corrections should be serialized into a file so they can be applied when importing nw data into the db. Also demonstrate this with an example file; a sketch of what such a file could contain follows below.
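
The corrections file could simply be a transactional SQL script kept in version control and applied right after the Digiroad import, e.g. (all ids, table names and coordinates below are made-up placeholders):

```sql
BEGIN;

-- Add a link missing from the Digiroad extract (placeholder geometry).
INSERT INTO stage_nw.manual_link (link_id, geom)
VALUES (900001, ST_GeomFromText('LINESTRING(385790 6672150, 385810 6672185)', 3067));

-- Fix an incorrectly digitized one-way restriction (placeholder id).
UPDATE stage_nw.link
SET oneway = false
WHERE link_id = 123456;

COMMIT;
```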

Questions about plpgsql function design

Should we strive for functional programming principles with our key data transformation functions in the sujuiko model, i.e. functions that only take a certain type of input set and return a result set, without side effects? I can imagine that this would make the functions easier to test, at least. But in reality, our functions and procedures tend to touch many parts of the database at the same time, at least when we do complex joins - that's why I've wrapped complex queries into functions in the first place.

There are some good examples of perhaps-not-optimal practices in the current codebase. Table names are parameterized (and enforced to be regclass) in the stage_hfp.extract_journeys_from_raw() function, which is nice. But the function works as a black box: it inserts data into the journey table and only returns the number of records inserted, so we have to go and check the journey table separately if we want to inspect the results.

We could change the function so that it just returns the result set, leaving it up to the user or another function/procedure to insert the result into the correct table (-> added complexity). We could also change the function so that it takes sets of data as parameters instead of table names, but this would add even more complexity when invoking the function. Feeding sets of data to a function is syntactically painful in Postgres, if possible at all...

Note that we sometimes need to apply more complex logic inside functions than just joins, column calculations, type casts and window functions: especially with HFP data we might want to run through an ordered set of rows step by step, calculating values for the next row(s) depending on what we calculated for the previous row(s) and if they were kept or discarded as invalid.

In this light, what would be some good practices and examples for developing such functions that take raw data from multiple sources, mangle it in different ways and return a result set (eventually into a table)? A sketch of the set-returning approach follows below.
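
As a point of comparison, here is a sketch of the set-returning style, where the function has no side effects and the caller decides where the rows go. The table, columns and threshold below are hypothetical simplifications, not the actual sujuiko schema.

```sql
CREATE FUNCTION stage_hfp.pick_journeys(min_points integer DEFAULT 10)
RETURNS TABLE (jrnid uuid, start_tst timestamptz, n_points bigint)
LANGUAGE sql STABLE
AS $$
  SELECT r.jrnid, min(r.tst) AS start_tst, count(*) AS n_points
  FROM stage_hfp.raw AS r
  GROUP BY r.jrnid
  HAVING count(*) >= min_points;
$$;

-- Inspect the result interactively...
SELECT * FROM stage_hfp.pick_journeys(100);

-- ...or let another statement / procedure persist it.
INSERT INTO stage_hfp.journey (jrnid, start_tst, n_points)
SELECT * FROM stage_hfp.pick_journeys(100);
```

Testing such a function is then a matter of comparing its result set against an expected set, without touching the target table.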

Questions about deployment

Consider this from the same viewpoint as #17. It is common in code repositories to

  1. at least have clear CLI instructions for how to clone, build and deploy the project locally;
  2. even better, have the deployment process automated e.g. with Docker and possibly docker-compose

so that anybody else could try out and further develop the project on their machine, based purely on what is found in or referenced from the repository. I have been using docker-compose here. However, I have noticed how much time learning and configuring it can take, and I don't know if it is really worth the effort, given the limited time for actual tool development. Providing the data needed to test and use the tool is also a question. What should we strive for here?

  • Is "deploy with one click" a basic need here?
    • Should it be implemented with Docker / docker-compose, or are step-by-step instructions enough?
  • Or can we just list what installations etc. are needed, and let the newcomer worry about all the local installation and configuration mess?
  • What about the data: is it good to provide some minimal example data in a data/ directory that fits nicely in version control, so that running docker-compose up without any custom configuration would use that directory as a volume and hydrate the database with those test data files? And would instructions about the actual raw data files be provided separately (at what level of detail?) or upon request from the author?

Consider git branching policy

Think about more controlled git branch workflows. E.g.

  • master branch only includes server-ready code that is able to create a database from scratch without any errors or conflicts;
  • master branch is only updated through PRs, not by direct commits (set up a policy in GitHub);
  • test automation could later be added to master branch PRs, at least;
  • dev branch for actual development and any commits (perhaps not realistic to maintain feature branches in this case)

Redesign network & stop model

See #6 and #7. Simplify the network model by not splitting links at stop locations. Instead, stop points are given a link id reference and a coverage range value along the link length (see the sketch after the task list). Also remove unnecessary "on-the-fly" validation of the network model: allow e.g. topologically invalid elements to be created, and move the validations into separate procedures that mark elements as valid / invalid. Finally, develop a way to track network modifications over time.

  • Re-sketch the nw schema doc
  • Decide if a separate node table is needed at all. We might do well with links only.
  • Define link characteristics
  • Define characteristics of valid links in relation to each other
  • Define stop characteristics
  • Define link-stop relation characteristics
  • Define modification history model for the network
  • Create further tasks for source code updates
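
A rough sketch of the stop-link relation described above, with hypothetical names; the validity flag is only set by a separate validation procedure, not enforced on the fly:

```sql
CREATE TABLE nw.stop (
  stop_id        integer PRIMARY KEY,
  link_id        integer REFERENCES nw.link (link_id),
  -- stop point projected onto the link, as a fraction of the link length
  location_frac  float8 CHECK (location_frac BETWEEN 0 AND 1),
  -- coverage range along the link within which HFP stop events are attributed to this stop
  coverage_range numrange,
  -- set by a separate validation procedure, not on the fly
  is_valid       boolean NOT NULL DEFAULT false
);
```

A validation procedure could then e.g. flag stops whose link reference is missing or whose coverage_range extends outside the link.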

Develop a downloader app for HFP dumps

We need a convenient way to extract subsets of HFP history data from large dumps in Azure Storage for this tool. Building a direct pipe from Azure Storage (or from the Transitlog db, for example) to the sujuiko db is not reasonable, because we have to do a lot of pre-processing, filtering and validation on the data before importing it into the db. So we want to store intermediate data locally (eventually, on a server) in smaller dumps for easy inspection and transformation.

  • Plan a utility tool for inspecting and downloading available HFP datasets from Azure Storage
  • Working version of dataset inspection / listing
  • Working version of selected dataset downloading
    • Only select required columns
    • Apply type casts to columns when needed
    • Distribute results into different csv files by attribute(s)
    • Filter results by attribute(s)

Redesign schedule model

See #8. Simplify the sched model so that it only handles route versions, not schedules. For the time being, we do not need to calculate performance measures of the "scheduled vs. observed" type, as we only stick to "variation within observed" PMs. However, we could still connect observations to their route versions by using the trip id, even if we do not list all the possible planned trips in the sched model (see the sketch after the task list). This way we can later add support for schedule data more easily.

  • Rewrite sched schema doc
  • Define route version characteristics
  • Define route model characteristics: ordered sets of directed links, validation etc.
  • Create further tasks for source code updates
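
A sketch of how an observation could be connected to its route version without a full schedule model, assuming the observed journey carries the planned trip attributes route, dir and operating day (the tables and columns here are hypothetical):

```sql
SELECT j.jrnid, rv.route_ver_id
FROM obs.journey AS j
JOIN sched.route_version AS rv
  ON  rv.route = j.route
  AND rv.dir   = j.dir
  AND rv.valid_during @> j.oday;   -- the daterange contains the operating day
```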

Upgrade Postgres version locally

Old environments use PG v11; upgrade to v13, which is supported by TimescaleDB, PostGIS and pgRouting.

  • Update Dockerfile
  • Create a new local dev cluster with v13
  • Reinstall PostGIS, pgRouting and TimescaleDB locally
  • Update readme

Make `db` service report failing SQL statements to the console

Currently, if I run docker-compose.test.yml and all the DDL scripts in /docker-entrypoint-initdb.d/ run neatly without errors, I can see all the CREATE TABLE lines logged by the db service on the console afterwards.

However, if an error occurs when the db service runs the DDL scripts - for example, an SQL syntax error, a foreign key reference to a missing or incorrectly named table, and so on - I cannot see it on the console. The only thing I see in that case is that the db service was not healthy and dataimporter therefore cannot be run:

$ docker-compose -f docker-compose.test.yml up
Creating sujuikodb_db_1 ... done

ERROR: for dataimporter  Container "a2a98c87b484" is unhealthy.
ERROR: Encountered errors while bringing up the project.

Apparently there is some mechanism that prevents db log entries from appearing on the console before dataimporter has run successfully. It is quite essential to be able to spot where the errors start happening in db, though.

@haphut shall we check this in the next session?

successive_stops: use trip_template_arrays

Currently, successive stops are looked up from stage_gtfs.stop_times. This could be done a bit faster by using unnested stop id arrays once stage_gtfs.trip_template_arrays has been populated; a sketch follows below.
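
A sketch of the array-based lookup, assuming trip_template_arrays has a stop id array column (the column names here are hypothetical simplifications):

```sql
-- Successive stop pairs per trip template, from an unnested stop id array.
SELECT ttid, stop_id, next_stop_id
FROM (
  SELECT
    tta.ttid,
    u.stop_id,
    lead(u.stop_id) OVER (PARTITION BY tta.ttid ORDER BY u.stop_seq) AS next_stop_id
  FROM stage_gtfs.trip_template_arrays AS tta
  CROSS JOIN LATERAL unnest(tta.stop_ids) WITH ORDINALITY AS u(stop_id, stop_seq)
) AS pairs
WHERE next_stop_id IS NOT NULL;
```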

Use `float4` or `float8` instead of `numeric`/`decimal`

  • According to the Postgres docs, numeric is more precise but operations on it are slower than on floating point types
  • The precision of the double precision / float8 floating point type should be enough for us, considering e.g. linear location values along a link (a float between 0 and 1)
  • PostGIS functions such as ST_Length(geometry) return double precision / float8 values, so it is logical to use this type in fields that will be filled using such functions (see the sketch after this list)
  • For not-that-critical float values in tables with potentially large amounts of data (e.g. HFP data and their aggregates), I could probably use float4 to save some space
  • I think float4 and float8 are easier to read and understand as type names in the code than real and double precision
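
A small sketch of the proposed convention (the table and column names are hypothetical examples, SRID 3067 is an assumption):

```sql
CREATE TABLE nw.link_example (
  link_id  integer PRIMARY KEY,
  geom     geometry(LINESTRING, 3067) NOT NULL,
  -- filled with ST_Length(geom), which returns double precision (float8)
  length_m float8 NOT NULL
);

CREATE TABLE obs.link_speed_example (
  link_id       integer NOT NULL,
  -- non-critical aggregate in a potentially large table -> float4 is enough
  avg_speed_m_s float4
);
```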

Definition of Done

List for yourself the things to check before a task is considered done (e.g. documentation updates).

Update db model terminology

  • E.g. link_on_route instead of route_link
  • Better term for analysis_segment: section? More concise, and perhaps describes the idea of several consecutive links for analysis in an intuitive way.

Self-contained schedule model

In the same way as the network model, the sched schema should be made "independent", such that it could be given data from many different sources, while the schema itself enforces the data integrity and business rules.
Some modification history model would be useful here as well.

In the near future, schedule data will probably be imported from sources other than GTFS, since the GTFS-based methods have turned out to involve useless "detours" to a large extent.
We should be able to use Jore and / or DW instead.

Find out what pgRouting requires from link table for routing

The aim is to keep the network model as simple as possible, allowing for iterative changes using normal INSERT / UPDATE / DELETE verbs, so that modifying the network e.g. in QGIS is straightforward. Thus we do not want to use "black box" functions like pgr_createTopology() unless absolutely necessary. Still, the network model should work with pgRouting shortest path algorithms so that we can (semi-)automate the creation of link paths.

Related to #10

-> Experiment and define a suitable relation model that works directly with pgRouting shortest path algorithms; a sketch of what e.g. pgr_dijkstra() expects follows below.
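
For reference, pgr_dijkstra() only needs an "edges SQL" that returns id, source, target, cost and optionally reverse_cost, so a plain link table with node id columns maintained by us should be enough. A sketch with hypothetical column names and example node ids:

```sql
SELECT *
FROM pgr_dijkstra(
  $edges$
  SELECT
    link_id  AS id,
    i_node   AS source,
    j_node   AS target,
    length_m AS cost,
    -- a negative reverse_cost makes one-way links untraversable backwards
    CASE WHEN oneway THEN -1 ELSE length_m END AS reverse_cost
  FROM nw.link
  $edges$,
  1001,   -- start node id (example)
  1017,   -- end node id (example)
  true    -- directed graph
);
```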

Use a reference table for network vehicle modes instead of enum

Currently, the allowed vehicle modes of different network entities are defined with the nw.vehicle_mode enum type. Instead, let us use text values and a foreign key reference to a table containing the values we want to allow (see the sketch after the list below).

  • Network data will be so small that there is no significant performance and space benefit for enum over text type
  • The reference table is more flexible in terms of adding or modifying allowed values
  • Queries are more straightforward: WHERE 'bus' <@ link_modes instead of WHERE 'bus'::nw.vehicle_type <@ link_modes
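
A sketch of the reference table approach (hypothetical names). Note that Postgres cannot enforce a foreign key on the elements of an array column, so if link_modes stays an array, the allowed values would have to be checked e.g. by a trigger; alternatively the relation can be modeled as a junction table, as below:

```sql
CREATE TABLE nw.vehicle_mode (
  vehicle_mode text PRIMARY KEY   -- 'bus', 'tram', ...
);

CREATE TABLE nw.link_vehicle_mode (
  link_id      integer NOT NULL REFERENCES nw.link (link_id),
  vehicle_mode text    NOT NULL REFERENCES nw.vehicle_mode (vehicle_mode),
  PRIMARY KEY (link_id, vehicle_mode)
);

-- e.g. all bus links:
SELECT l.*
FROM nw.link AS l
JOIN nw.link_vehicle_mode AS lvm USING (link_id)
WHERE lvm.vehicle_mode = 'bus';
```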

Create test data

Create test data from 1-2 transit lines for ETL and db processing:

  • Digiroad subset
  • Manual additions / corrections to the Digiroad subset
  • Jore / Hastus route versions
  • Stop versions
  • HFP VP event csv files (a couple of trips)

Questions about documentation

Let's assume there's a new developer / GIS analyst who should be able to learn about this project and eventually contribute to it. Or there is a developer in another project who is helping to integrate our work and theirs. What is important from their perspective?

  • What would be the best way to organize the documentation? Where? (Scope: inside this repo)
    • What should be included in the root README?
    • Shall we attach more specific readme files to each context like now (e.g. network readme lives in the same directory as the nw schema scripts), or collect them into a separate docs or wiki directory?
    • Where should pics and other files linked to readme files be located?
  • Shall we make use of the Postgres COMMENT ON feature (all the way from procedures down to individual columns), or document the key things about relations, views, functions and procedures elsewhere? (A small COMMENT ON sketch follows after this list.)
  • How many and how detailed usage examples should we provide vs. relying on a newcomer to clone the repo and try things out themselves? Do we have to demonstrate e.g. what every essential relation looks like with a couple of rows? (-> the doc would become quite a wall of text)
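
If we go the COMMENT ON route, the documentation lives next to the objects and is visible in psql (\d+ for tables and columns, \df+ for functions). A small sketch with hypothetical objects:

```sql
COMMENT ON TABLE nw.link IS
  'Street or tram network link between two nodes.';
COMMENT ON COLUMN nw.link.length_m IS
  'Link length in meters, used e.g. as routing cost.';
COMMENT ON FUNCTION nw.get_nearest_link(geometry) IS
  'Returns the link closest to the given point geometry.';
```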
