Get notified of new posts
Exploiting train networks¶
Recently, I had to schedule people moving across a whole country (Germany).
In most planning problems, we assume that people are driving. Driving times are relatively easy estimate with the right tools (e.g., OSRM + OSM).
In this particular case, for reasons that are not relevant to this post, we needed to move people by car and by public transport, i.e., trains.
It's fairly easy to go to the website of a train company and plan a trip from point A to point B. You get all the possible trains, with their departure and arrival times, duration of the trip, number of lay-overs, price, etc.
When scheduling or routing, though, we do not know the routes we will need or use. And so we need to know all possible combinations in advance. This is not possible to do by hand on the train-company's website. It's not even practical to automate a script that calls the API a million times. We thus need to go to the source of that information.
First attempt: Deutsche Bann¶
One would think that the best source for train information is the train operator itself. Thus, I found the REST API for the Germany train operator, created an account and requested access to the endpoint for timetables.
I also installed the R package to access the API endpoints from the R-language: https://github.com/ottlngr/openbahn
This option sounded promising but turn out to be a failure in many ways:
- I never got the access I requested,
- and the free endpoints were not useful because they did not allow downloading the whole timetable at the same time.
- Finally, everything was in German and Google did not translate the whole site correctly.
Second attempt: The GTFS format¶
The GTFS stands for General Transit Feed Specification. Previously, the G was for Google, who invented it. It's a data format to store public transport schedules. It's power comes from being a standard, mostly. And also because it's not too complicated to understand. The GTFS is used widely to store static information on bus and train schedules all over the world.
For example, it's used by Google Maps to integrate public transport into the site. And also used by many known route-planning tools and data services.
In particular, and by pure luck, I discovered that Germany maintains it for its trains.
As the site (https://gtfs.de/en/) mentions:
In accordance with the EU commission's Delegated Regulation 2017/1926, the complete national static timetable data for Germany is published via an national access point in the NeTEx format since January 1st 2020. Based on this dataset, we offer daily generated timetable datasets in the GTFS format (General Transit Feed Specification) - covering the complete long distance and regional rail network of Deutsche Bahn as well as local / urban transit of all agencies and companies in Germany.. With over 20.000 lines, more than 500.000 stop points and nearly 2 million regular vehicle journeys, this is one of the largest GTFS datasets in the world.
The GTFS is very detailed, and we do not want to decide which train to get depending on the time of the day: we're more worried about estimating reasonable time for trips between two locations by train. This means that we will probably ignore a big chunk of the data.
GTFS: relevant parts¶
The information we care the most is the file stop_times.txt which, in the case of the German network, has 32M rows and has the arrival and departure time of every stop of every trip in the country.
| trip_id | stop_id | stop_sequence | arrival_time | departure_time |
|---|---|---|---|---|
| 1013745 | 280916 | 0 | 18:35:00 | 18:35:00 |
| 1013745 | 215166 | 1 | 18:36:00 | 18:36:00 |
| 1013745 | 575948 | 2 | 18:38:00 | 18:38:00 |
| 1013745 | 671819 | 3 | 18:39:00 | 18:39:00 |
| 1013745 | 493441 | 4 | 18:41:00 | 18:41:00 |
| 1013745 | 445911 | 5 | 18:42:00 | 18:42:00 |
stop_times.txt content
After filtering only for trains stops and adding a few columns (from other GTFS files) for the service of the train, we get around 1.8M rows of something like this:
| trip_id | stop_id | stop_sequence | arrival_time | departure_time | route_id | service_id |
|---|---|---|---|---|---|---|
| 1611143 | 477441 | 0 | 12:35:24 | 12:35:24 | 2736 | 3825 |
| 1611143 | 171295 | 1 | 12:36:48 | 12:37:06 | 2736 | 3825 |
| 1611143 | 110093 | 2 | 12:38:18 | 12:38:36 | 2736 | 3825 |
| 1611143 | 340352 | 3 | 12:39:48 | 12:40:06 | 2736 | 3825 |
| 1611143 | 49185 | 4 | 12:41:00 | 12:41:18 | 2736 | 3825 |
| 1611143 | 453970 | 5 | 12:42:42 | 12:43:06 | 2736 | 3825 |
Stop times input table
Finally, we convert this very detailed information into a reasonable estimation of time between consecutive train stops. We take the median of the time between stops and end up with this:
| route_id | stop_id | next_stop_id | time |
|---|---|---|---|
| 7 | 12394 | 474198 | 11 |
| 7 | 17816 | 259350 | 5 |
| 7 | 17816 | 272516 | 5 |
| 7 | 17816 | 348367 | 5 |
| 7 | 17816 | 393640 | 6 |
| 7 | 71808 | 404032 | 8 |
The time between consecutive stops (in minutes) of a given route_id
This is great: we can check every single train that moves around the country and, in theory, estimate how much time it takes to move from one stop to another.
Except we don't care about the stops: what we actually need is the public transport times between "locations". And "location" can be an ambiguous term: it can refer to a train stop, a city, a province, a specific coordinate, etc.
So the question to ask now is: what is a "location" for us?
Choosing a unit of location: postcode¶
We need to start approximating locations. The best option was to choose postcodes as (1) they're very standard, (2) they split the country in reasonably-sized areas and (3) this was the level of detail of the rest of the project. The only problem with this approach is that GTFS has no idea what postcodes are and thus, we need to somehow link a postcode with the rail network in order to take advantage of the public transport data.
So at this point we know we have pairs of locations and a rail network and we want to be able to answer the following question: How long does it take to go by train from postcode A to postcode B?
graph LR
classDef source fill:#a6cee3,stroke:#333,stroke-width:4px;
classDef sink stroke:#333,stroke-width:4px;
A[Postcode A]:::source ==> B[first stop];
subgraph Rail Network
B --> B2[...]
B2 --> C[last stop]
end
C ==> Z[Postcode B]:::sink;
To do this, we need to somehow link postcodes to train stops. We did it by calculating the distance (euclidean, I'm afraid) between postcodes and train stops. We then kept the relevant ones (i.e., those that are sufficiently close).
| postcode | stop_id | plz_name | stop_name | distance |
|---|---|---|---|---|
| 13467 | 473658 | Berlin Hermsdorf | S Hermsdorf | 22 |
| 13467 | 661905 | Berlin Hermsdorf | S Hermsdorf | 31 |
| 75378 | 314056 | Bad Liebenzell | Bad Liebenzell | 32 |
| 75378 | 546008 | Bad Liebenzell | Bad Liebenzell | 35 |
| 42555 | 13761 | Velbert | Velbert-Langenberg | 42 |
| 34117 | 335412 | Kassel | Scheidemannplatz Ri. Wilhelmsstraße | 42 |
Postcode to stop Euclidean distance (in meters)
Great, now we have the network and a way to link the network to postcodes. We now need to explore the network in an efficient way.
Design the graph¶
We need to decide what the graph is going to represent. In my case, I had many options. I had to change the definition of the graph three times before arriving to something that was "precise enough" while not running out memory / lasting hours to analyze.
We will work with two types of nodes:
- Stop nodes: each node is the combination of a stop and a train line (
route_idin theStop times input table). - Postcode nodes: we create two nodes for each postcode: a source node and a sink node.
We will have three types of edges:
- Trip arc: when the person is moving inside a train from one station to the other. These arcs connect nodes with different
stop_idbut the sameroute_id. - Transit arc: when the person is changing trains at a station. These arcs connect nodes with the same
stop_idbut differentroute_id. - Source/sink arc: when the person enters or leaves the train station (e.g., by bus, walking, taxi, etc.). These arcs connect postcodes nodes to stop nodes that are physically close.
For trip arcs, we calculate the time as the median time between stops for all trips of that route_id.
For transit arcs, we calculate the time as 30 min.
For source/sink arcs, we calculate the time as the time it takes to reach the postcode assuming a speed of 20km/h.
The toy example below shows a very simple graph with one origin (postcode Postcode A) that is linked to StationA. There, the passenger can take the light green line and at some point switch to the dark green line (at StationB). Each station is connected to nearby postcodes (sinks) in case the passenger wants to get out. Stop nodes in green diamonds. Postcode nodes in blue squares.
graph LR
classDef source fill:#a6cee3,stroke:#333,stroke-width:4px;
classDef sink stroke:#333,stroke-width:4px;
classDef lineA fill:#b2df8a;
classDef lineB fill:#33a02c;
A[Postcode A]:::source ==>|Source| B{StationA}:::lineA;
B ==>|Sink| R[Postcode A]:::sink;
B -->|Trip| C{StationB}:::lineA;
C -->|Trip| D{StationC}:::lineA;
C ==>|Sink| X[Postcode B]:::sink;
C -.->|Transit| E{StationB}:::lineB;
E ==>|Sink| X;
E -->|Trip| F{StationD}:::lineB;
D ==>|Sink| Y[Postcode C]:::sink;
F ==>|Sink| Z[Postcode D]:::sink;
This analysis, came after a few iterations of graphs that worked but had one issue or another. Below I present a couple variants that were not as good and I explain why.
Graph design: previous iterations¶
Some initial graph designs were not a very good fit:
- Only using trip arcs. Here, each stop node was not tied to the route. Transit arcs were collapsed into a single node and only trip arcs existed. This meant that the transfers were assumed instantaneous, which is not very realistic.
- Only stop nodes were used for the graph. This meant we had two steps: (1) get all shortest paths between train stops and then, (2) merge this information with postcode-to-node times. This second step took a large amount of time (and, more importantly, memory). I realized that including the postcode-to-node links in the graph would be more efficient for both steps, and it was.
Using the same example above, the graph with these two changes would become a more simple graph, but less efficient to analyze and less accurate:
graph LR
classDef lineA fill:#b2df8a;
B{StationA}:::lineA -->|Trip| C{StationB}:::lineA;
C -->|Trip| D{StationC}:::lineA;
C -->|Trip| F{StationD}:::lineA;
Creating and exploiting the graph¶
With the above data, we get a graph with 58k nodes and around 475k arcs.
Once the graph is built, we "just" need to call an efficient implementation to find all the shortest paths between all pairs of postcode nodes.
This will simulate a person using the train network to go from all possible postcodes to all possible postcodes.
I filtered the resulting to only keep the trips that took less than 6 hours, since it's probably not very efficient to let someone spend most of their shift taking trains around the country.
The resulting table of time between postcodes had 30M rows, which is not a bad result considering there are 8168 postcodes in Germany and thus 66M pairs of postcodes. It implies a ratio of 45%.
| postcode | postcode2 | min_time |
|---|---|---|
| 1067 | 1067 | 4 |
| 1069 | 1067 | 8 |
| 1097 | 1067 | 4 |
| 1099 | 1067 | 21 |
| 1108 | 1067 | 24 |
| 1109 | 1067 | 17 |
Public transport time between postcodes
Implementation¶
The whole processing was implemented in R and, with the final implementation of the graph, takes about 5 minutes to run. I used the "jonhson" algorithm offered by the igraph package. I used the tidyverse set of R packages for most of the data crunching and the data.table R package for the most critical parts: reading and writing large files and some operations (joins, filters) on large tables.
I did validate a small sample of estimations with google maps and I check that I was getting fairly similar times.
As an additional step, I analyzed past trips by employees where the above database showed a very long trip or no possible trip at all. For these pairs of postcodes I used the Google Maps API to estimate public transport times and I kept an "override" file of downloaded estimates. This was possible because the number of new missing pairs was fairly small.
Hypothesis, validation and limitations¶
We're doing several approximations for this analysis, all reasonable for the project:
- Postcodes can be big and we're using their centroid as position.
- Transfer times can be hours long or 5 minutes and we're using an estimation of 30 minutes flat. This could be improved by using the frequency of each route and using that information to estimate a more accurate transfer duration.
- 20 km/h from postcode to train stop in a straight line can be reasonable in some contexts (cities) but not in other places.
- We're filtering stops near postcodes and taking out some reasonable possibilities to well-connected stops.
Extensions¶
There's always more things one can do. One thing that would have been nice to do was to produce and store the shortest path between postcodes on top of the distance. This could have helped in more than one way:
- Knowing the trip can help with validation that the timing makes sense.
- Knowing the route and stops can help learn the first train of the day that the employee can take.
Unfortunately, the igraph API did not offer the complete path when calculating all pairs.
Another nice-to-have functionality in the igraph API would have been to take a cut-off / max-dist value parameter to limit the search. Since I was going to filter the maximum time anyway, that could have helped the algorithm avoid paths that are too long. I know that the graph-tool for python package has it.