datarttu / sujuikodb
Database backend for transit network & observation data.
License: MIT License
I have found out that network links should probably not be split at transit stop locations, because
Instead of the current model, we could decide that a link can only run between intersection nodes.
Or, if we require an "enriched" network model that includes street names, for instance, links could be split at nodes wherever these attributes change.
There is no clear need for "random" attributes for network entities right now, and further fields can be added later if needed.
For the life of me, I could not make the variables in environment use the variables from env_file. So psql would try to connect as root, for example. Instead, I added all of these into .env_test and it worked like a dream. Unfortunately that meant repeating some constants.
Originally posted by @haphut in #37 (comment)
Importing structures = e.g. a view with triggers and a COPY command for CSV imports.
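For illustration, a minimal sketch of such an importing structure, assuming hypothetical staging/target tables (stage_nw.node_import, nw.node) and SRID 3067; not the actual schema:

```sql
-- Hypothetical staging view: CSV rows are COPYed into the view and an
-- INSTEAD OF trigger rewrites them into the target table.
CREATE VIEW stage_nw.node_import AS
SELECT node_id, ST_AsText(geom) AS wkt_geom
FROM nw.node;

CREATE FUNCTION stage_nw.node_import_ins() RETURNS trigger AS $$
BEGIN
  -- Assumes ETRS-TM35FIN (SRID 3067) coordinates in the CSV.
  INSERT INTO nw.node (node_id, geom)
  VALUES (NEW.node_id, ST_GeomFromText(NEW.wkt_geom, 3067));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER node_import_ins_trg
INSTEAD OF INSERT ON stage_nw.node_import
FOR EACH ROW EXECUTE FUNCTION stage_nw.node_import_ins();

-- In psql:
-- \copy stage_nw.node_import FROM 'nodes.csv' WITH (FORMAT csv, HEADER true)
```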
In nw.route_version, valid_during and version_id are always correlated within each route & dir. Therefore we do not actually need a separate version_id in addition to the date range. Also, route_ver_id will be generated as human-readable text anyway, so we can well use the start and end dates in it instead of a running number at the end (version_id).
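A rough sketch of what this could look like (column names and the id format are illustrative only):

```sql
-- No separate version_id: the validity range identifies the version,
-- and route_ver_id is composed from route, dir and the validity dates.
CREATE TABLE nw.route_version (
  route        text      NOT NULL,
  dir          smallint  NOT NULL,
  valid_during daterange NOT NULL,
  route_ver_id text      PRIMARY KEY,
  UNIQUE (route, dir, valid_during)
);

-- route_ver_id could be generated like this, e.g. '1001_1_20210614_20210815':
SELECT route || '_' || dir || '_'
       || to_char(lower(valid_during), 'YYYYMMDD') || '_'
       || to_char(upper(valid_during), 'YYYYMMDD')
FROM nw.route_version;
```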
I've been using links to refer to street and tram network geometries that have nodes at both ends. However, pgRouting uses the term edge in the same meaning, which causes ambiguity at least when using pgRouting functions. edges would be uniform with pgRouting, but on the other hand, the Finnish translation is a bit weird ("kaaret") whereas "links -> linkit" works pretty well.
I also use nodes while pgRouting uses vertices. I think nodes is better, though, since "vertices" is commonly used when referring to the ordered points that constitute a linestring geometry.
I chose to use patterns when referring to unique ordered sets of stops used by a route and direction, because Digitransit uses the same term in this case. But my network model is not in fact based on stops but rather on links, and in network models a sequence of links is often called a path. It might be more logical to change patterns to paths so they refer more clearly to the network model. patterns could still be used, but only when referring to an ordered set of stops, not links.
Currently, segments refer to links as part of a route pattern. This is a bit problematic, since at HSL "segments / segmentit" usually refer to Hastus segments, i.e. route or network parts between Hastus places. Hastus segments could become part of the sujuiko network model hierarchy in the future, at least because they provide a useful framework for various analyses, so segments should be reserved for that use. Current segments could perhaps be changed to path_links?
templates was just a term I came up with to refer to different sets of driving times with which a pattern can be "actualized", depending on the time of day for instance. This name is not very descriptive. Maybe runtimes would be better? At the same time, segment_times could be changed to path_link_runtimes.
Subset from Digiroad -> commit to the repo
Look at the project in its current state. At the moment, we do not have any automated tests but one has to eyeball for errors on the console when running DDL scripts and queries (and hopefully remember to use transactions to be able to revert to the initial state...).
On the other hand, developing test automation takes time. It is also not as straightforward with databases and complex data: extracting or creating mock data and writing tests with them for various corner cases in HFP errors, for instance, is demanding. So far, I have just modified / amended the database model accordingly whenever I have encountered a weird case in manual experimenting.
Making changes to an existing database structure, i.e. migrations, is also a question. For the time being, I don't have a version running on server and the amount of data is so small that the dev version of the db can be easily started and hydrated from scratch. However, using migrations can be relevant sooner or later.
The network model in the nw schema should be made "self-contained". This means that all the essential data validation would happen in the nw schema, and network data could basically be imported from different sources: by hand, from OSM, from Digiroad, etc.
Also, modification history of link, node and stop tables should be tracked, such that e.g. manual modifications done after a data import could be saved (and reverted, if necessary).
Things currently happening in stage_nw could then be detached too, possibly into a separate Python program that converts OSM data into a format suitable for the sujuiko db, merges links between intersections, and so on.
The sujuiko network model would just receive all the data given to it, as long as it would fulfil network connectivity and validity requirements.
Some of the script files in db/ddl have grown quite big, and a more distributed file structure could be more manageable in the future. E.g., for each schema, have a script file with DDL for empty tables, and then separate scripts for each function definition (most functions are quite long here and often require some individual testing). Also separate files for defining views / matviews if needed. Finally, the DDL scripts would be executed in the correct order by a main script.
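E.g. a hypothetical db/ddl/main.sql could pull the pieces together with psql includes (file names below are made up, just to illustrate the layout; run from db/ddl with psql -f main.sql):

```sql
-- db/ddl/main.sql (hypothetical layout)
\i nw/tables.sql
\i nw/functions/upsert_links.sql      -- one file per function
\i nw/views.sql
\i sched/tables.sql
\i stage_gtfs/tables.sql
\i stage_hfp/tables.sql
```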
We could add docker build -t datarttu/sujuiko:latest . before running docker-compose. docker-compose ... --build is not always reliable, I have been told.
Originally posted by @haphut in #37 (comment)
We could use a docker run --rm script instead.
Originally posted by @haphut in #37 (comment)
Digiroad may have missing links, for example -> manual corrections should be serialized into a file somehow so they can be run when importing nw data into the db. Also demonstrate this with an example file.
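For example, such a corrections file could simply be a SQL script run right after the import (all names, ids and coordinates below are made up):

```sql
-- db/fixes/digiroad_manual_fixes.sql (hypothetical)
-- Add a connector link that is missing from the Digiroad extract:
INSERT INTO nw.link (link_id, source_node, target_node, geom)
VALUES (
  9000001, 123456, 123457,
  ST_GeomFromText('LINESTRING(385000 6672000, 385020 6672010)', 3067)
);

-- Correct a one-way attribute that is wrong in the source data:
UPDATE nw.link SET oneway = false WHERE link_id = 2345678;
```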
Should we strive for functional programming principles with our key data transformation functions in the sujuiko model, i.e. functions that only take a certain type of input set and return a result set, without side effects? I can imagine that this would make the functions easier to test, at least. But in reality, our functions and procedures tend to touch many parts of the database at the same time, at least when we do complex joins - that's why I've wrapped complex queries into functions in the first place.
There are some good examples of perhaps-not-optimal practices in the current codebase. Table names are parameterized (and enforced to be regclass) in the stage_hfp.extract_journeys_from_raw() function, which is nice. But the function works as a black box, inserting data into the journey table and only returning the number of records inserted, and then we have to go check the journey table separately if we want to inspect the results.
We could change the function so that it just returns the result set, and it is up to the user or another function/procedure to insert the result into the correct table (-> added complexity). We could also change the function so that instead of table names it takes sets of data as parameters, but this would add even more complexity when invoking the function. Feeding sets of data to a function is syntactically painful in Postgres, if even possible...
Note that we sometimes need to apply more complex logic inside functions than just joins, column calculations, type casts and window functions: especially with HFP data, we might want to run through an ordered set of rows step by step, calculating values for the next row(s) depending on what we calculated for the previous row(s) and on whether they were kept or discarded as invalid.
In this light, what would be some good practices and examples when developing such functions that take raw data from multiple sources, mangle it in different ways and return a result set (eventually into a table)?
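One possible middle ground is a set-returning function: it still takes the source table name as a parameter but leaves the insert to the caller. A sketch with an invented name and columns (not the current stage_hfp API):

```sql
CREATE FUNCTION stage_hfp.journeys_from_raw(raw_table regclass)
RETURNS TABLE (jrnid uuid, route text, dir smallint, start_ts timestamptz)
LANGUAGE plpgsql STABLE AS
$$
BEGIN
  RETURN QUERY EXECUTE format(
    'SELECT DISTINCT jrnid, route, dir, start_ts
     FROM %s
     WHERE route IS NOT NULL', raw_table);
END;
$$;

-- The caller can inspect the rows first and then decide where they go:
INSERT INTO stage_hfp.journey (jrnid, route, dir, start_ts)
SELECT * FROM stage_hfp.journeys_from_raw('stage_hfp.raw');
```

This keeps the transformation testable with a plain SELECT, while the procedure that orchestrates the pipeline still does the actual inserts.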
Rename to the more descriptive "Development", replace the missing "TODO" part with steps describing how things should go (at this point), and indicate that those steps are not yet implemented.
Consider this from the same viewpoint as #17. It is common in code repositories to set things up so that anybody else could try out and further develop the project on their machine, based purely on the stuff found in or informed about in the repository. I have been using docker-compose here. However, I have noticed how much time learning and configuring it can take, and I don't know if it really is worth the effort, given the limited time for actual tool development. Providing the data to test and use the tool is also a question. What should we strive for here?
Maybe a data/ directory that fits nicely in version control, and running docker-compose up without any custom configuration would use that directory as a volume and hydrate the database with those test data files? And would instructions about the actual raw data files be provided separately (what level of detail?) or upon request from the author?
Think about more controlled git branch workflows, e.g. …
See #6 and #7. Simplify the network model by not splitting links at stop locations. Instead, stop points are given a link id reference and a coverage range value along the link length. Also remove unnecessary "on-the-fly" validation of the network model: allow e.g. topologically invalid elements to be created, and make validations into separate procedures that mark valid / invalid elements. Finally, develop a way to track network modifications over time.
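A minimal sketch of the stop table under this model (table and column names are illustrative, assuming a 0-1 relative location along the link):

```sql
CREATE TABLE nw.stop (
  stop_id          integer PRIMARY KEY,
  link_id          integer REFERENCES nw.link (link_id),
  location_on_link float8 CHECK (location_on_link BETWEEN 0.0 AND 1.0),
  coverage_start   float8,  -- start of the stop's coverage range along the link (0..1)
  coverage_end     float8,  -- end of the coverage range along the link (0..1)
  CHECK (coverage_start <= coverage_end)
);
```

Validation would then live in separate procedures that set a validity flag on rows, rather than in triggers that block inserts on the fly.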
nw schema doc
We need a convenient way to extract subsets of HFP history data from large dumps in Azure Storage for this tool. Building a direct pipe from Azure Storage (or from the Transitlog db, for example) to the sujuiko db is not reasonable, because we have to do a lot of pre-processing, filtering and validation of the data before importing it to the db. So we want to store intermediate data locally (eventually, on a server) in smaller dumps for easy inspection and transformation.
See #8. Simplify the sched model such that it only handles route versions, not schedules. For the time being, we do not need to calculate performance measures of the type "scheduled vs observed", as we only stick to "variation within observed" PMs. However, we could still connect observations to their route versions by using the trip id, even if we do not list all the possible planned trips in the sched model. This way we can later add support for schedule data more easily.
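E.g. an observed journey could be attached to its route version by route, direction and operating date alone; a rough sketch with made-up table and column names:

```sql
SELECT j.jrnid, rv.route_ver_id
FROM obs.journey      AS j
JOIN nw.route_version AS rv
  ON  rv.route = j.route
  AND rv.dir   = j.dir
  AND rv.valid_during @> j.oday;   -- operating date within the version's validity range
```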
sched schema doc
Old environments use PG v11; upgrade to v12/13, as supported by Timescale, PostGIS and pgRouting.
Currently, if I run docker-compose.test.yml and all the DDL scripts in /docker-entrypoint-initdb.d/ run neatly without errors, I can see all the CREATE TABLE lines logged by the db service on the console afterwards.
However, if an error occurs when the db service runs the DDL scripts - for example, an SQL syntax error, a foreign key reference to a missing or incorrectly named table, and so on - I cannot see it from the console. The only thing I see in that case is that the db service was not healthy and dataimporter therefore cannot be run:
$ docker-compose -f docker-compose.test.yml up
Creating sujuikodb_db_1 ... done
ERROR: for dataimporter Container "a2a98c87b484" is unhealthy.
ERROR: Encountered errors while bringing up the project.
Apparently there is some mechanism that prevents db log entries from appearing on the console before dataimporter has been run successfully. It would be quite essential to spot where the errors start to happen in db, though.
@haphut shall we check this in the next session?
We could commit a file .env_test into the repo and remove it from .gitignore. That way the tests can be run sooner and without filling in the details.
Originally posted by @haphut in #37 (comment)
Currently, successive stops are looked up from stage_gtfs.stop_times. This could be done a bit faster by using unnested stop id arrays after stage_gtfs.trip_template_arrays has been populated.
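A sketch of the array-based lookup (assuming a stop_ids array column in stage_gtfs.trip_template_arrays; column names not verified against the current schema):

```sql
SELECT tta.ttid,
       s.stop_seq,
       s.stop_id,
       lead(s.stop_id) OVER (PARTITION BY tta.ttid ORDER BY s.stop_seq) AS next_stop_id
FROM stage_gtfs.trip_template_arrays AS tta
CROSS JOIN LATERAL
  unnest(tta.stop_ids) WITH ORDINALITY AS s (stop_id, stop_seq);
```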
- numeric is more precise, but operations on it are slower than on floating point types.
- The double precision / float8 floating point type should be enough for us, considering e.g. linear location values along a link (between 0 and 1 as a float).
- ST_Length(geometry) and similar functions return double precision / float8 values, so I think it's logical to use this type in fields that will be filled using such functions.
- Some fields could use float4 to save some space.
- float4 and float8 are easier to read and understand as type names in the code than real and double precision.
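For example (a hypothetical table, just to illustrate the type choices):

```sql
CREATE TABLE nw.example_measures (
  link_id          integer PRIMARY KEY,
  length_m         float8,  -- filled from ST_Length(geom), which returns float8
  location_on_link float8 CHECK (location_on_link BETWEEN 0.0 AND 1.0),
  speed_kmh        float4   -- float4 where saving space matters more than precision
);
```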
The where_sql parameter has been included in the staging functions of the schedule model but not yet taken into use. It should enable importing just a subset of data through the staging process.
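A sketch of how where_sql could work (function and table names are invented; the parameter is assumed to be trusted operator input, not end-user input):

```sql
CREATE FUNCTION stage_gtfs.stage_trips(where_sql text DEFAULT 'true')
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
  n bigint;
BEGIN
  EXECUTE format(
    'INSERT INTO stage_gtfs.trips_staged
     SELECT * FROM stage_gtfs.trips
     WHERE %s', where_sql);
  GET DIAGNOSTICS n = ROW_COUNT;
  RETURN n;
END;
$$;

-- Import only one route's trips:
-- SELECT stage_gtfs.stage_trips($$route_id = '1001' AND direction_id = 0$$);
```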
List yourself things to check before a task is considered done (e.g. documentation updates)
- link_on_route instead of route_link
- analysis_segment: section? More concise, and perhaps describes the idea of several consecutive links for analysis in an intuitive way.

Consider adding BEGIN; and COMMIT; or maybe use psql -1 in docker-compose.test.yml.
Originally posted by @haphut in #37 (comment)
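E.g. each DDL script could be wrapped in a transaction so that a failing statement rolls back the whole file instead of leaving the schema half-created:

```sql
BEGIN;

CREATE SCHEMA IF NOT EXISTS nw;
-- ... table, function and view definitions ...

COMMIT;
```

Alternatively, psql -1 -f script.sql runs the whole file in a single transaction without touching the scripts themselves.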
In the same way as the network model, the sched schema should be made "independent", such that it could be given data from many different sources, and the schema itself would enforce the data integrity and business rules.
Some modification history model would be useful here as well.
In the near future, schedule data will probably be imported from sources other than GTFS, since the GTFS methods have turned out to have useless "detours" to a large extent.
We should be able to use Jore and / or DW instead.
The aim is to keep the network model as simple as possible, allowing for iterative changes using normal INSERT / UPDATE / DELETE verbs, so modifying the network e.g. in QGIS would be straightforward. Thus we do not want to use "black box" functions like pgr_createTopology() unless absolutely necessary. Still, the network model should work with pgRouting shortest path algorithms so we can (semi-)automate the creation of link paths.
Related to #10 -> experiment and define a suitable relation model that works directly with pgRouting shortest path algorithms.
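For example, pgr_dijkstra only needs an edge query with id, source, target and cost columns, which the link table could provide directly; a sketch assuming link_id / source_node / target_node / geom column names:

```sql
SELECT seq, node, edge, agg_cost
FROM pgr_dijkstra(
  $$
  SELECT link_id         AS id,
         source_node     AS source,
         target_node     AS target,
         ST_Length(geom) AS cost,
         ST_Length(geom) AS reverse_cost
  FROM nw.link
  $$,
  1001,   -- start node id (made up)
  2002,   -- end node id (made up)
  true    -- directed
);
```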
Currently, allowed vehicle modes in different network entities are defined with the nw.vehicle_mode enum type. However, let us use text values and a foreign key reference to a table containing the possible values we want to allow:
WHERE 'bus' <@ link_modes instead of WHERE 'bus'::nw.vehicle_type <@ link_modes
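A sketch of the lookup-table approach (note that Postgres cannot enforce a foreign key from the elements of an array column, so an array like link_modes would need either a trigger or a junction table as below; names are illustrative):

```sql
CREATE TABLE nw.vehicle_mode (
  mode text PRIMARY KEY          -- e.g. 'bus', 'tram'
);
INSERT INTO nw.vehicle_mode (mode) VALUES ('bus'), ('tram');

CREATE TABLE nw.link_mode (
  link_id integer NOT NULL REFERENCES nw.link (link_id),
  mode    text    NOT NULL REFERENCES nw.vehicle_mode (mode),
  PRIMARY KEY (link_id, mode)
);

-- No enum casts needed in queries:
SELECT l.*
FROM nw.link AS l
JOIN nw.link_mode AS lm USING (link_id)
WHERE lm.mode = 'bus';
```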
Create test data from 1-2 transit lines for ETL and db processing:
- VP event csv files (a couple of trips)

Let's assume there's a new developer / GIS analyst who should be able to learn about this project and eventually contribute to it. Or there is a developer in another project helping to integrate our and their work. What is important from their perspective?
- Keep documentation next to the code (e.g. in the nw schema scripts), or collect it into a separate docs or wiki directory?
- Use the COMMENT ON feature (all the way from procedures up to individual columns), or document key things about relations, views, functions and procedures elsewhere?
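If we go the COMMENT ON route, it covers everything from schemas down to individual columns and shows up in psql \d+ output and in most GUI clients; for example (object names as used above, signatures not verified):

```sql
COMMENT ON SCHEMA nw IS 'Transit network model: links, nodes, stops.';
COMMENT ON TABLE nw.link IS 'Street and tram network geometries between two nodes.';
COMMENT ON COLUMN nw.link.link_id IS 'Unique link identifier.';
COMMENT ON FUNCTION stage_hfp.extract_journeys_from_raw(regclass)
  IS 'Extracts distinct journeys from a raw HFP table into the journey table.';
```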