Costing a bus operation by municipality from a GTFS with R

As a CSM at Remix, I’m in touch with a lot of different transit agencies and have the opportunity to learn about their challenges. Even though all transit agencies have the same main goal, many times their challenges can be quite unique depending on their specific situation.

This is the main reason why, sometimes, one single platform (as great as it is) can’t meet every need each of the +300 customers have. It is in this kind of situations that a CSM has to come up with a workaround so the user can get the value he/she is looking for.

This is basically what happened when one of our prospects told us they cost their lines by municipality. If Line 1 does the sequence of stops “a-b-c-d-e” and “a-b-c” belong to municipality A and “d-e” belong to municipality B they would charge the hours that takes to get from “a” to “d” to municipality A and from “d” to “e” to municipality B.

Since I’ve been learning R in my Master’s degree , I thought this could be a good opportunity to apply some of my new knowledge and start to unlock GTFS with R.

Note: Keep in mind that the code exposed in this article is not complete. To get the full code please visit my Github account.

General Transit Feed Specification (GTFS)

GTFS stands for “General Transit Feed Specification”. If you belong to the transit world, I’m sure you already know the basics of the information you can manage with it (it is, among other things, how transit agencies communicate with Google Maps). If you haven’t heard about it and want to learn more, I recommend you to take a look at their specifications here:


The first thing I had to decide was the libraries I would use. Since I had to import a lot of data and would also need to transform it quite a bit I chose these 4 libraries to get started:


The library “tidyverse” would allow me to manage the data frames easily, in a SQL like way that I find quite easy to follow and goes with my way of thinking. It is not like it will necessarily allow us to do things we couldn’t do with out it, but it will certainly help me structure my code and make it easier to read.

The library “readr” and “readxl” were to read .txt files and excel files faster.

For a first version, the data would be presented in a table, not graphically; this is why “ggplot2” is not there.

Uploading and unzipping the GTFS

Now I needed to upload the GTFS to R to start playing with it. Since I had in mind to develop a Shiny app with this script later, I decided to do it with a function that would ask me to choose the file, just to test it. Then, the script would unzip the file and extract all its content in a folder with its same name, setting this last folder as the working directory.

#Choose the GTFS to unzip
zip <- file.choose()
#Create the directory to unzip it and set it as working directory
outDir <- substring(zip, 1, nchar(zip)-4)
#Unzip it and put it in the directory
unzip(zip, exdir = outDir)

Reading the data

Now that the working directory was set, I could read the files directly with their name. We can create one data frame for each file of the GTFS (I only did this for the files that I was going to use, leaving behind calendar_dates.txt, calendar.txt and agency.txt).

#Reading the files and creating the dataframes
trips <- read_csv("trips.txt")
stop_times <- read_csv("stop_times.txt",
col_types= cols(arrival_time = col_character(),
departure_time = col_character()))
stops <- read_csv("stops.txt")
shapes <- read_csv("shapes.txt")
routes <- read_csv("routes.txt")
regions <- read_excel("/Users/santiagotoso/Google Drive/Master/Curso R/GTFS/Stops_and_stations_20180701.xlsx")

In this particular case, I had to force the column type of “arrival_time” and “departure_time” in “stop_times” since the time format in R won’t allow times greater than 24:00 and in the GTFS we find values like 24:30 to indicate night service.

The last line is to read the Excel file where the prospect indicated which municipality each stop belonged to.

Joining data

I had now all the data frames needed, great! I could start playing with the GTFS and see what insights I could get from this public and easy-to-find source of public transport data.

First thing I did was to keep only the relevant values in the Excel file. I used tidyverse to do it.

#I'll only keep the columns that are interesting to me in the regions file
regions_short <- regions %>%
IsCurrent) %>%
filter(IsCurrent == 1) %>%
filter(StopPointTypeCode == 'BUSSTOP') %>%
select(-StopPointTypeCode, - IsCurrent)

Then I brought all the relevant data from different data frames to “stop_times” that would become the center of the script from now on. Now that I’ve done more studies of this type I find this decision a bit extreme since I could have worked with a lighter “stop_times” data frame and then bring the relevant information when displaying the results. Anyway, in this case I did it like its shown in the image below.

#Join everything with the stop_times dataframe
stop_times <- stop_times %>%
left_join(trips) %>%
left_join(stops) %>%
mutate(StopPointNumber = as.numeric(stop_id)) %>%
left_join(regions_short) %>%

Basically what I did with this script was bringing all the fields from “trips”, “stops” and “regions_short” to “stop_times”. This means I had all the information there: position of the stops, name of the route, color of the route, region the stop belonged to, etc.

I did it this way because at that point I didn’t know exactly what I was looking for and wanted to keep my options open. If you do know what you’re looking for it might be better to keep (or bring) only the variables you’ll need.

Transforming data frames — Part 1

If you have worked with data before, you might know that between 70–80% of the time is spent cleaning and transforming the data. This was no exception.

For the problem I had in mind I thought it would be useful to have in the same row, the trip_id, the trip_id of the previous row and the trip_id of the following row. Like this I could easily identify if that stop was in the middle of the line, the first one or the last one of the line. In the same way I wanted to know if a given stop was in the same region as the previous one or the next one, to identify when the line changes region. These two things were crucial to the problem I was trying to solve: costing by municipality.

To that end, I created a new data frame called “trip_ids” that serves the purpose of doing exactly that.

#See when we change trip and tarif zone
trips_ids <- stop_times %>%
select(trip_id, TariffZoneNumber)
summary(trips_ids)trips_ids <- trips_ids %>%
#trip_id from the previous stop
mutate(prev_trip = ifelse(,
trips_ids[nrow(trips_ids), 1],
lag(trip_id))) %>%
#trip_id from the next stop
mutate(next_trip = ifelse(,
trips_ids[1, 1],
lead(trip_id))) %>%
#TariffZoneNumber of the previous stop
mutate(prev_zone = ifelse(,
lag(TariffZoneNumber)) ) %>%
#TariffZoneNumber of the next stop
mutate(next_zone = ifelse(,
lead(TariffZoneNumber)) )

Lines 61 and 62 are just to take two variables from “stop_times”. Lines 66 to 82 are to put the previous and following “trip_id” and municipality next to each value. In the end, “trip_ids” would look like the picture below.

It is quite easy here to identify the last stop of a trip (line 49 for example) since the value in “next_trip” is different to the value in “trip_id”. Same thing for the municipalities with lines 54 and 55 that belong to trip_id = 2 but the stops are in different municipalities.

With this auxiliary data frame we could easily identify the breaks in “stop_times”. A break meaning the stops where the line starts, ends or changes municipality.

#Store the important stop_time in a new data frame
breaks <- stop_times[(
trips_ids$trip_id != trips_ids$prev_trip |
trips_ids$trip_id != trips_ids$next_trip |
trips_ids$TariffZoneNumber != trips_ids$prev_zone |
trips_ids$TariffZoneNumber != trips_ids$next_zone
), ]

By looking for the rows where the variable “trip_id” is different from “prev_trip” or “next_trip” I found the first and last stops of each trip. By looking for the rows where the variables “TariffZoneNumber” is different from “prev_zone” and “next_zone” I found the stops where the line changes municipality. In the “breaks” data frame, I only kept those rows and discarded the rest.

Treating missing values

Now that I have the “breaks” data frame that will allow me to get the costing I want, I first need to check for missing values. I used the command “summary(breaks)” to look into it.

This function returns a summary of each of the variables. The variables that start with capital letters are the ones from the Excel file. All of those variables have 1868 missing values. On the other hand, some variables from the GTFS have 1665 missing values. The important thing was that each original stop in the GTFS had its municipality. Indeed it did, so we could just remove this missing values with no fear of missing information.

#it is. Lets filter those numbers out
breaks <- breaks %>%
filter( != 'TRUE')

Transforming data frames — Part 2

Now I need to calculate the runtime on each municipality for each trip and sum it up to get the values! Sounds easy doesn’t it? The only problem is that the variable “arrival_time” is a character! (I forced it to be one when I created the data frame). So before doing the calculations I should get the time in a numeric format.

#calculate running time for each
hours <- as.integer(substr(breaks$arrival_time, 1, 2))
minutes <- as.integer(substr(breaks$arrival_time, 4, 5))/60
breaks <- breaks %>%
mutate(hours_minutes = hours + minutes) %>%
mutate(travel_hours = ifelse(trip_id == lead(trip_id), (lead(hours_minutes) - hours_minutes), 0)) %>%
select(-timepoint, - direction_id, -hours_minutes)
#Rescue the last value
breaks[nrow(breaks), 'travel_hours'] <- 0

Since I still can’t express 24:30 in r “time” format I chose to put it as a number (24,5). To make the calculations is exactly the same and it is even better for calculating a cost later.

In the script of the image I take the first two characters of “arrival_time” to get the amount of hours and the last two divided by 60 to get the number of minutes (decimal); then I add them to get the actual “arrival_time” in the column “hours_minutes”. Finally I calculated the runtime for each segment with line 117.


Now I could group the runtimes as I wanted: by trip, by zone, etc.

I started grouping by trip and region, then by route and region. Any combination here works depending on what we want to see.

#Summarise the information by trip and route for each region
by_trip_region <- breaks %>%
group_by(trip_id, TariffZoneDisplayName, service_id) %>%
summarise(travel_hours = sum(travel_hours)) %>%
left_join(trips) %>%
select(trip_id,TariffZoneDisplayName,travel_hours,route_id, service_id) %>%
left_join(routes) %>%
by_route_region <- by_trip_region %>%
group_by(route_long_name, TariffZoneDisplayName) %>%
summarise(travel_hours = sum(travel_hours))

It is important to notice that I didn’t consider different day types till here. We should have done that! And I actually did consider it in the web app I built afterwards.

You can take a quick look at how that works in this short video, it takes the script of this article to run all the calculations and then shows it in a table you can download afterwards.

Originally published at on July 31, 2018.

Sr. Data Analyst at Remix.