Bus usage in West Yorkshire
Over the past year we've been working with the West Yorkshire Combined Authority (WYCA) to help them explore how open data publishing could help them improve travel within West Yorkshire. Previously, WYCA have published data about attitudes following the Grand Départ in 2014 and their go:cycling campaign. Recently, WYCA published their first dataset relating to bus use. This is a positive step forwards in their open data journey.
The new dataset includes individual boardings by English National Concessionary Travel Scheme holders (eligible older people) over a roughly three-month period at the start of 2017. Why doesn't the new release include all bus journeys? There are a variety of factors that make that difficult. Partly it is because general bus usage data belongs to the individual operators and they consider it to be commercially sensitive. Even if the operators were on board, there would then be the difficulty of combining data from all the different travel apps and cards from all the different operators. Given these regulatory and practical issues, the approach was to look at releasing data from the concessionary cards issued by WYCA.
Dealing with personal data
Concessionary travel card usage includes a lot of personal data. Although we promote open data, we are also very much aware that data exists on a spectrum. Data that can identify individuals should be very much on the closed side of that spectrum.
It would be great to know the types of people who were using specific bus services, when they are using them, and how much they are using them. Anyone would then be able to look for usage patterns that might identify better routes. For example, perhaps lots of people have to get two different buses to reach work and a new route could be introduced to reduce their travel time and costs. However, it is important to protect the privacy of individuals and especially think about how any dataset we publish could be combined with other datasets to re-identify people. We spent a lot of time with WYCA and YorCard (the company operating the cards) thinking about ways to anonymise the data to stop a “bad actor” re-identifying people.
The first step was to pick out the fields that could be used to identify individuals: dates of birth, gender, postcodes etc. We considered just providing age but that varies with time so, if a record was trackable across time, it would be possible to work out date of birth as time went on. One way around that would be only providing, say, birth decade and making sure individuals were not cross-referenceable across days. However, in the context of the ENCTS data, age was always going to be of less interest anyway so it was just removed. Gender was also potentially available but it could become more identifying on services with lower rates of use.
The data records a Boarding Point as a fare stage point (a group of bus stops) rather than as an individual bus stop. This helped as it reduced the ability to re-identify people. However, we realised that if we released each journey start time to the minute, that could be used to re-identify specific bus stops: the clusters of records and the gaps between them reveal the individual stops (especially if you know where the bus stops are). So, we reduced start times to 10-minute blocks. Ten minutes is still fine enough to find global patterns but coarse enough to make bus stop re-identification (and so the use of that for further personal re-identification) much harder.
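To make the idea of 10 minute blocks concrete, here is a minimal Python sketch of one way to coarsen a boarding time. It assumes times are rounded down to the start of their block; the published data may have been bucketed differently, and the function name floor_to_block is purely illustrative.

```python
from datetime import datetime


def floor_to_block(ts: datetime, minutes: int = 10) -> datetime:
    """Round a timestamp down to the start of its N-minute block.

    Assumption: blocks start on the hour (:00, :10, :20, ...).
    """
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


# Two boardings a few minutes apart fall into the same block,
# so the gap between them is no longer visible.
first = floor_to_block(datetime(2017, 1, 16, 8, 41, 5))
second = floor_to_block(datetime(2017, 1, 16, 8, 47, 31))
print(first, second)
```

Both example boardings end up with the same start time, which is what hides the stop-by-stop rhythm of a journey while keeping the broad time of day.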
YorCard undertook a further step to remove journeys where 9 or fewer boardings were completed on a single bus service within a given week. That removes around 250 journeys per week on low usage services and a small amount of incorrect boarding data.
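The low usage rule can be sketched in a few lines of Python. This is not YorCard's actual process, just an illustration of the technique: the record layout (a service name plus a date), the use of ISO weeks and the function name suppress_low_usage are all assumptions of mine.

```python
from collections import Counter
from datetime import date


def suppress_low_usage(records, threshold=10):
    """Drop records whose (service, week) group has fewer than `threshold` boardings.

    `records` is a list of (service, date) tuples, one per boarding.
    Assumption: a "week" is an ISO week; the source does not say how weeks were defined.
    """
    def key(record):
        service, day = record
        iso = day.isocalendar()
        return (service, iso[0], iso[1])  # service, ISO year, ISO week

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= threshold]


# Hypothetical data: 12 boardings on a busy service, 3 on a quiet one, same week.
sample = [("X1", date(2017, 1, 16))] * 12 + [("Z9", date(2017, 1, 17))] * 3
kept = suppress_low_usage(sample)
print(len(kept), {service for service, _ in kept})
```

With a threshold of 10, groups of 9 or fewer boardings disappear entirely, so the quiet service drops out of that week rather than appearing with a small, potentially identifying count.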
The final, published, datasets remove all information about individuals and it is no longer possible to track individuals across entries in the files.
Visualising the data
Now that we have data, we can start to look at it. I have created a simple tool that lets you look at usage for each bus service. As the datasets are over 300 MB, I had to find a way to reduce that to something usable in a web browser. I’ve summed up each 10-minute chunk of the day across the whole period so you can see what times of day are the busiest. I also plot the total across all buses in West Yorkshire so you can compare the shape of an individual service to the overall use.
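The summing step looks roughly like the Python sketch below. It is not the code behind my tool, just an illustration: it assumes each record is a single boarding given as a service name and a start time already reduced to a 10 minute block, and the function name time_of_day_profile is made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def time_of_day_profile(records):
    """Count boardings per time-of-day slot, ignoring the date.

    Returns (per_service, overall), where per_service[service][slot] and
    overall[slot] are boarding counts and slot is an "HH:MM" string.
    """
    per_service = defaultdict(lambda: defaultdict(int))
    overall = defaultdict(int)
    for service, block_start in records:
        slot = block_start.strftime("%H:%M")
        per_service[service][slot] += 1
        overall[slot] += 1
    return per_service, overall


# Hypothetical boardings on two services across two days.
boardings = [
    ("321", datetime(2017, 1, 16, 22, 10)),
    ("321", datetime(2017, 1, 17, 22, 10)),
    ("X1", datetime(2017, 1, 16, 8, 40)),
]
per_service, overall = time_of_day_profile(boardings)
print(dict(per_service["321"]), dict(overall))
```

Collapsing the dates leaves at most 144 slots per service, which is small enough to load in a browser and still shows the daily shape of each service against the overall total.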
Many buses follow the same pattern but some don't. The 321 service from Huddersfield bus station towards Marten Nest, Meltham & Thick Hollins seems to be used mostly after 10pm. My tool doesn’t tell us why that is the case but hopefully it leads to more questions which can be answered by looking at the dataset in different ways.