Git Product home page Git Product logo

titsa-gtfs-exploration's Introduction

TITSA GTFS - Exploration

Context

During my college I used a lot the bus service and it bothered me a lot how close were the two next bus stop, they were extremely close. After investigating a bit I discovered that TITSA post they google data on the open.

It took me a bit to understand the GTFS but the official docs are quite good.

Requirements

The code attached is a notebook developed on top docker stack using Spark 3.2 (but most functions are retrocompat)

Also there is a docker-composer file containing the image and the volume mounted (the password for jupyter is my-password)

If you want to run the notebook run the command download.sh it will download the last zip files from Titsa webpages and unzip them in the proper directory.

Questions

What are the closest bus stops in a single trip?

Both stops could be close to each other but some may be for specific lines, so using stop_times you can get the trip stop sequence and join it with the stop master data for the coords.

For calculating the distance I used the harvesine distance (kudos to [https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark]) and compare each consecutive stop.

There were some errores reporting the same stop for the same trip those were discarded... And well, I could found the same name in two similars id's and they were a few degrees deviated one from another, as in the following img:

So I discarded also the same name as this is not really clear. And well I ended up with two pretty similar bus stops that were both the same place but included the (T) from terminal.

So after discarded that error I found...

Which was hilarious, the difference between both is about 20 meters and it takes much more in the bus.

In case you're instered: the bus stops I was referring to were ranked as: 157.

Which stops have the most lines?

Obviously there are some bus stations but they should share id or at least name.

stop_id stop_name diff_routes
9181 INTERCAMBIADOR STA.CRUZ 44
2625 INTERCAMBIADOR LAGUNA (T) 36
9413 MERIDIANO 25
9450 INTERCAMBIADOR STA.CRUZ 23
2582 COROMOTO (T) 22
2549 LEOCADIO MACHADO 22
2692 FRANCISCO SÁNCHEZ (T) 21

Well I expected some magic output but there are the main bus stations and the previous / next stops. As we can see Santa Cruz station is splitted so I tried ot group also the stops by name.

And it yield some interesting results, a lot of bus stops share it's name even if they're not related the most common one is "Cementerio" (graveyard) and the second one is "Centro de salud" (health centre).

I include here the top list and it's quite funny tough in such a small location to have so many collisions in names.

stop_name stop_id diff_stops
CEMENTERIO [1137, 1141, 1204, 1225, 1376, 4074, 4124, 4926, 5027, 5029, 7076, 7095, 7256, 7362, 9105, 9106] 16
CENTRO DE SALUD [1219, 1883, 1924, 1928, 2587, 2789, 7257, 7361, 7364, 7382, 7455, 9409] 12
EL PINO [1636, 1647, 2130, 2145, 2314, 2704, 4957, 7577, 7603, 7735, 7782] 11
EL CALVARIO [1203, 1226, 1258, 1259, 4016, 4035, 4217, 4356, 4359, 4739] 10
EL MOLINO [1519, 1571, 1971, 1977, 2573, 2574, 4301, 4308, 4642] 9
CAMPO DE FUTBOL [1622, 1628, 2128, 2147, 4389, 4533, 9362, 9370] 8
LAS TOSCAS [1206, 1223, 1305, 1350, 1765, 2310, 4728, 4733] 8

What is the longest predicted route?

This question should be quite straightforward as we have for any trip all the stops and the predicted "arrival" and "departure" for each one. So we just need to group it and... What is this?

trip_id start end
3927806 24:10:00 24:34:19
3927807 25:30:00 25:54:19
3927809 24:50:00 25:08:38
3927810 26:20:00 26:38:38
3928436 25:10:00 25:20:20
3928763 24:05:00 24:21:39
3928766 24:15:00 24:44:20
3928769 24:45:00 25:01:39
3930761 28:40:00 29:37:11
3930762 25:00:00 25:51:52
3930763 27:30:00 28:21:52
3930764 24:00:00 24:50:36
3930765 26:25:00 27:15:36
3930767 24:00:00 24:51:52
3930768 26:25:00 27:16:52
3930769 25:20:00 26:10:36
3932368 24:05:00 24:31:00
3932373 24:40:00 25:01:17
3934883 24:05:00 24:55:17
3934886 25:00:00 25:50:17

Seems that the people decided to put hour 24 and so on for representing the next day.

Checking the standard from the gtfs reference is correct:

Service day - A service day is a time period used to indicate route scheduling. The exact definition of service day varies from agency to agency but service days often do not correspond with calendar days. A service day may exceed 24:00:00 if service begins on one day and ends on a following day. For example, service that runs from 08:00:00 on Friday to 02:00:00 on Saturday, could be denoted as running from 08:00:00 to 26:00:00 on a single service day.

So... Let's fix the time and convert it into a timestamp doing this is awful in pyspark.

And after doing this, here are the top results:

route_short_name elapsed rank
330 INTERVAL '0 02:47:22' DAY TO SECOND 1
330 INTERVAL '0 02:44:55' DAY TO SECOND 2
325 INTERVAL '0 02:36:24' DAY TO SECOND 3
343 INTERVAL '0 02:28:52' DAY TO SECOND 4
342 INTERVAL '0 02:14:25' DAY TO SECOND 5
108 INTERVAL '0 02:13:36' DAY TO SECOND 6
342 INTERVAL '0 01:59:51' DAY TO SECOND 7
343 INTERVAL '0 01:56:35' DAY TO SECOND 8
325 INTERVAL '0 01:53:27' DAY TO SECOND 9
34 INTERVAL '0 01:52:27' DAY TO SECOND 10

The 330 is a "beltway", it starts and finishes on the same point, so it makes a lot of sense to have such a large estimate time.

And the 325, makes an amazing way. It only works 5 times on labour days and 3 times on weekends.

More questions will be solved soon :)

titsa-gtfs-exploration's People

Contributors

adrianabreu avatar

Watchers

James Cloos avatar  avatar

titsa-gtfs-exploration's Issues

More questions

What is the bus stop with more bus lines? (Discarding stations)
Which is the largest route?
The most largest consecutive stop do not make sense since if i recall correctly there were some lines with no stops between stations (103 for example)

ETL

It would be nice to automatize this project, download the remote files and perform the etl operations on the downloaded files so we can answer the question automatically.
We may generate some nice compress parquet files.

Swap to pandas

Ok, we did some spark but is definitely overkill here, we should change this to use pandas.

Use schemas

Avoid so many magic strings in the project by generating the schemas of each types so we can refer to the columns explictly.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.