Support for multiple data sources about traewelling HOT 5 CLOSED

MrKrisKrisu commented on September 23, 2024 2

Support for multiple data sources

from traewelling.

Comments (5)

derhuerst commented on September 23, 2024 1

I'm currently working on a really hacky POC to inject GTFS data into the DB-Rest response so that we might be able to combine multiple data sources without having to drastically change our project's internal structure. The repo will be made publicly available around the start of the GPN next week.

Currently, it's forwarding the departure request directly to db-rest v5 while simultaneously searching for GTFS departures on that IBNR. The departures provided via GTFS are then injected into the JSON. To determine what endpoint to call when we're getting a journey request, I simply took inspiration from the current HAFAS-Trip-IDs and added a "GTFS|{gtfs-id}" prefix to the trip IDs. This might be extended to combine multiple APIs from multiple (overlapping) data sources, but the first step might be to add ÖBB, SNCF, SBB, etc., and restrict them to regular public transport like buses and trams, which are not covered by DB's HAFAS system.

This is very similar to what I've been doing with match-gtfs-rt-to-gtfs: It tries to match data from a HAFAS API (e.g. the DB one) to a GTFS dataset by matching their stop/trip/route IDs/names/locations.

Over time, I've invested quite a lot of effort to make the matching logic fast and flexible enough. For example, it can match a HAFAS stop with a GTFS stop even when they don't share an ID (IBNR), have slightly different names, and slightly different geolocations.
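To illustrate the kind of fuzzy stop matching described above, here is a minimal sketch. It is not the logic used by match-gtfs-rt-to-gtfs; the field names (`id`, `name`, `lat`, `lon`), the thresholds, and the helper names are all assumptions made for this example.

```python
import math
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", without_marks).strip()


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters (haversine formula)."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def stops_match(a: dict, b: dict, max_distance_m: float = 200.0,
                min_name_similarity: float = 0.8) -> bool:
    """Assumed heuristic: equal IDs match directly; otherwise the stops must be
    both close together and similarly named."""
    if a.get("id") and a.get("id") == b.get("id"):
        return True
    close = distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m
    similarity = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return close and similarity >= min_name_similarity
```

Requiring both proximity and name similarity keeps two different stops on the same square from being merged, at the cost of missing stops whose names differ a lot between datasets.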

Unfortunately, the code has many indirections and isn't well-documented. Also, it's been a while since I've tested it with the DB HAFAS endpoint. But if you're interested, take a look!

do we form a new "proprietary" ID that "masks" the underlying DB/SNCF IDs?
This will be done w/ a proprietary combination of some proprietary prefixes and the API's original ID.

You might also want to look into Multiformats as a generalized and future-proof mechanism for "combining IDs".

[…] how do we make sure the UX is not confusing. […] How do we make sure users can find the train/trip they're looking for if they're used to a very specific naming scheme (e.g. "RE 1" vs "RE 73793", "TGV INOUI 123" vs "TGV 123")?
We need to keep track of which APIs should be used for which station. […] A general primary identifier could be IFOPT as the parent station with the API's internal station ID and a reference to the station as children. […]

The Trainline stations database might be very helpful with this.

derhuerst commented on September 23, 2024

The transport-apis project lists many transit APIs; it intends to be the "source of truth" for basic information about these APIs (their endpoints, authentication mechanisms, licensing scheme, etc.), so that projects don't each need to keep track of these changes individually. If there is anything missing over there, please create an Issue or submit a PR!

derhuerst commented on September 23, 2024

Regarding the actual idea being discussed here:
I think that many tricky technical and UX questions arise once one starts having >1 underlying data source:

1. Shall the data sources be completely separate? E.g. when I check into a train/trip as represented by the DB HAFAS, and another person checks into that (same real-world) train/trip as represented by an SNCF data source, will we see each other as being on the same train/trip?
2. If we have built a mechanism to identify two data items as being about the same (one real-world) train/trip, do we form a new "proprietary" ID that "masks" the underlying DB/SNCF IDs? If we do this, then we need to either a) keep a mapping between them for a long time, or b) make the new ID contain the underlying data source IDs somehow.
3. If we have tackled the above items, how do we make sure the UX is not confusing? Let's assume we have decided to either a) merge the properties from both data sources about one real-world item, or b) show only one set of properties. How do we make sure users can find the train/trip they're looking for if they're used to a very specific naming scheme (e.g. "RE 1" vs "RE 73793", "TGV INOUI 123" vs "TGV 123")?
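Option b) from the second question, an ID that contains the underlying data source IDs, could be sketched roughly as follows. The `source:id;source:id` layout, the source names, and the function names are made up for this illustration; none of them come from an existing project.

```python
from urllib.parse import quote, unquote


def encode_combined_id(source_ids: dict[str, str]) -> str:
    """Build one ID from several (source, native ID) pairs.

    Pairs are sorted by source name so the same set of underlying IDs always
    yields the same combined ID. Percent-encoding keeps ':' and ';' inside
    native IDs from clashing with the separators.
    """
    parts = [
        f"{quote(source, safe='')}:{quote(native_id, safe='')}"
        for source, native_id in sorted(source_ids.items())
    ]
    return ";".join(parts)


def decode_combined_id(combined: str) -> dict[str, str]:
    """Recover the underlying (source, native ID) pairs from a combined ID."""
    result = {}
    for part in combined.split(";"):
        source, native_id = part.split(":", 1)
        result[unquote(source)] = unquote(native_id)
    return result
```

Because the combined ID can be decoded back into its parts, no long-lived mapping table is needed. The trade-off is that the ID changes whenever a further data source is matched to the same trip.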

I have brainstormed more about some technical aspects of this topic in Why linked open transit data? and stable-public-transport-ids, and experimented with fusing >1 (HAFAS-like) data source in pan-european-public-transport.

TLDR: Adding another data source is technically feasible, but how do we create a usable UX from that?

HerrLevin commented on September 23, 2024

I'm currently working on a really hacky POC to inject GTFS data into the DB-Rest response so that we might be able to combine multiple data sources without having to drastically change our project's internal structure. The repo will be made publicly available around the start of the GPN next week.

Currently, it's forwarding the departure request directly to db-rest v5 while simultaneously searching for GTFS departures on that IBNR. The departures provided via GTFS are then injected into the JSON. To determine what endpoint to call when we're getting a journey request, I simply took inspiration from the current HAFAS-Trip-IDs and added a "GTFS|{gtfs-id}" prefix to the trip IDs. This might be extended to combine multiple APIs from multiple (overlapping) data sources, but the first step might be to add ÖBB, SNCF, SBB, etc., and restrict them to regular public transport like buses and trams, which are not covered by DB's HAFAS system.
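The injection and the prefix-based routing described above might look roughly like this. This is a sketch, not the POC's code: the `tripId` and `when` field names, the helper names, and the assumption that every departure has a non-null `when` in the same UTC offset are all mine.

```python
GTFS_PREFIX = "GTFS|"  # prefix format taken from the description above


def merge_departures(hafas_departures: list[dict], gtfs_departures: list[dict]) -> list[dict]:
    """Inject GTFS departures into the HAFAS departures, ordered by time."""
    tagged = [
        {**dep, "tripId": GTFS_PREFIX + dep["tripId"]}
        for dep in gtfs_departures
    ]
    # Assumption: ISO 8601 timestamps with identical offsets, which sort
    # correctly as plain strings.
    return sorted(hafas_departures + tagged, key=lambda dep: dep["when"])


def route_trip_request(trip_id: str) -> tuple[str, str]:
    """Decide which backend serves a trip, returning (backend, native ID)."""
    if trip_id.startswith(GTFS_PREFIX):
        return "gtfs", trip_id[len(GTFS_PREFIX):]
    return "hafas", trip_id
```

A later journey request for `GTFS|bus-42` would then be answered from the GTFS data with the native ID `bus-42`, while every other trip ID is passed through to db-rest unchanged.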

I might have a few ideas to combat your above-mentioned problems:

1. In our case: (mostly) yes. We want to use the "official" data endpoint for each vehicle, e.g. Karlsruhe public transport uses its open data endpoint, ICEs use DB HAFAS, TGVs use SNCF's, and so on. (This raises one bigger question: What do we do with trains crossing borders? Is the TGV data provided by SNCF more or less accurate than DB's? Just guessing by DB's polylines, everything outside of state lines is "bad data".)
2. This will be done w/ a proprietary combination of some proprietary prefixes and the API's original ID.
3. This is the biggest question in my opinion b/c it just opens even more questions. My current ideas are the following:
- We need to keep track of which APIs should be used for which station. This could be done by using a modified version of the GTFS stops table. A general primary identifier could be IFOPT as the parent station with the API's internal station ID and a reference to the station as children. Maybe even additional information such as "only long-distance trains" could be added.
- In my opinion, the "correct way" of displaying the line name, etc. is using what the "correct" API is providing. However, this could be extended by providing additional information in some sort of translation schema since it will indeed be confusing to end users in some situations. I'm not completely happy with this approach but it's the best I've come up with so far.
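The station idea from the first bullet could be modelled along these lines. This is only a sketch of one possible shape: the class names, the API names, the product names, and the placeholder IFOPT value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ApiStationRef:
    """Child entry: one API's own ID for the station."""
    api: str                                   # hypothetical API name
    station_id: str                            # that API's internal station ID
    products: Optional[frozenset] = None       # None means "all products"


@dataclass
class Station:
    """Parent entry, identified by an IFOPT ID."""
    ifopt: str
    name: str
    children: list = field(default_factory=list)


def apis_for(station: Station, product: str) -> list:
    """Return the API references that should be queried for a given product."""
    return [
        ref for ref in station.children
        if ref.products is None or product in ref.products
    ]


# Placeholder data, not real identifiers.
example = Station(
    ifopt="de:00000:1",
    name="Example Hbf",
    children=[
        ApiStationRef("db-hafas", "0000001", frozenset({"long-distance"})),
        ApiStationRef("local-gtfs", "stop-1", frozenset({"bus", "tram"})),
    ],
)
```

Here `apis_for(example, "tram")` selects only the `local-gtfs` reference, which is how a restriction like "only long-distance trains" on the DB entry would keep local departures from being fetched twice.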

This is all in its infancy at the moment but already describes the rough direction I'd like to go.

P.S.: speaking of GPN - will we see you there? 👀

vainamov commented on September 23, 2024

It's unfortunately limited to trains within Finland, but the Fintraffic API is awesome: https://www.digitraffic.fi/en/railway-traffic/
