A case study based on public transport data feeds.
While the benefits of exposing transit data streams have been well documented, many providers fall short when it comes to making these APIs easy to use.
This article offers a developer's perspective on four major technical barriers to effective realtime API deployment in transport, with steps transport providers can take to maximize innovative use of their data.
Introduction: The state of transport data
My work introducing transport data to the Ably Hub has involved identifying the most useful, publicly available realtime transit data, converting it to a single realtime feed, and inputting it to the Hub, which then re-distributes it to users in whichever realtime protocol and data structure they need.
The benefits of open data in the mobility sector are well-known, but the process of populating the Hub threw up several recurring obstacles to its effective deployment. These range from a lack of ‘real’ realtime information, to a lack of protocol support, to heterogeneous data structures. The article below explores these challenges, with recommendations for further opening up 'open' transport data.
Note: This article uses several major providers as examples: TfL, NationalRail, NetworkRail, MTA, and Digitransit. Our research revealed ineffective realtime API deployments as a problem common to multiple transport modes, from trams to AVs. If you have had similar difficulties sharing and accessing transport streams, or want to talk about the situation across other transport sub-sectors, get in touch. We’d be interested to hear your perspectives.
Barrier 1: A lack of realtime updates
The first observation was a general lack of data sources in realtime, notably a lack of push-based APIs. With transport systems event-based by definition, REST-APIs don’t give the whole picture, as they don’t allow for inevitable deviations from pre-defined schedules.
Consider an application which is meant to keep end users updated with train locations, which are subject to change at any moment. Using pull-based protocols, you’ll need to poll the provider’s endpoint every few seconds for current information. Leave the interval too long and you risk missing an update, such as a train arriving at a different platform, and the end user misses the train. Make it too short and you use a lot of bandwidth requesting unchanged information, with each message also carrying a fairly large overhead.
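The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope estimate rather than a measurement: the function name, the 20 KB response size, and the figure of 12 real changes per hour are all assumptions chosen for illustration.

```python
def polling_cost(interval_s: float, response_bytes: int, changes_per_hour: float) -> dict:
    """Estimate the hourly cost of polling one endpoint at a fixed interval."""
    requests_per_hour = 3600 / interval_s
    # Any request beyond the number of real changes returns data we already hold.
    wasted = max(requests_per_hour - changes_per_hour, 0.0)
    return {
        "requests_per_hour": requests_per_hour,
        "bytes_per_hour": requests_per_hour * response_bytes,
        "wasted_fraction": wasted / requests_per_hour,
        # On average a change waits half an interval before the client sees it.
        "mean_staleness_s": interval_s / 2,
        "worst_staleness_s": interval_s,
    }


# Hypothetical figures: a 20 KB response and 12 genuine changes per hour.
for interval in (5, 30, 120):
    print(interval, polling_cost(interval, response_bytes=20_000, changes_per_hour=12))
```

Under these assumptions, shortening the interval cuts staleness but pushes the share of redundant requests towards 100%, which is the tension described above.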
Surprisingly, one of the largest feed specifications, GTFS Realtime, has for the most part been implemented by transport systems using pull, with the documentation only providing examples using pull mechanisms. While some companies have started using the specification with push-based protocols, such as MQTT, these early adopters are the exception rather than the rule.
TFL, held up as an example of open transport data at its best, was quite opaque when it came to providing push-based mechanisms for obtaining updates. Going on pretty much all of TFL’s public documentation, it didn’t look like anything besides their unified RESTful API was available. It took a lot of research before I found a blog post from 2015 mentioning the existence of a SignalR endpoint. Why this important functionality has remained mostly undocumented for developers was unclear.
When it comes to providing realtime data, push-based systems lighten the engineering load both for producers, who only need to provide the initial connection point, and for subscribers, who no longer need to worry about intermittently polling the provider’s endpoint. The result is instantaneous updates and far lower bandwidth costs. Unlike pull systems, push bandwidth costs remain sustainable even when thousands of developers start using your data. Finally, if you are using more sophisticated push-based models, document them! Clear, user-friendly documentation is often a deciding factor in how or even whether developers will use your API.
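To make the push model concrete, here is a minimal in-process sketch. It is not any provider's API: the `Feed` class, the topic name, and the message shape are hypothetical, and a real deployment would deliver messages over a network protocol rather than through local callbacks.

```python
from collections import defaultdict
from typing import Callable


class Feed:
    """Toy push feed: subscribers register once and are called only on change."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)
        self._last = {}

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> int:
        """Push a message to subscribers; return how many were notified."""
        if self._last.get(topic) == message:
            return 0  # nothing changed, so nothing is sent
        self._last[topic] = message
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


feed = Feed()
received = []
feed.subscribe("trains/1A23", received.append)  # hypothetical topic name

feed.publish("trains/1A23", {"platform": 4})
feed.publish("trains/1A23", {"platform": 4})  # unchanged: no traffic
feed.publish("trains/1A23", {"platform": 7})
print(received)
```

The subscriber does no polling at all, and the only messages that travel are the two that carry new information.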
Barrier 2: A fragmented system of realtime protocols
Even where realtime protocols are supported, the range of protocols a provider supports varies widely. If developers want to obtain transport information from more than one data provider - a likely scenario, given a city’s transport system tends to be run by a number of separate providers - they need to create multiple adapters between various protocols. This creates extra work for developers, who have to familiarize themselves with each provider’s chosen protocol, work out its implementation, and work out how to convert its data into a unified format suited to a particular app or service.
However, opening up ranges of protocols constitutes a headache for transport providers as well, with each new protocol requiring initial investment to integrate it with existing protocols. It’s also additional investment in terms of ongoing maintenance, ensuring compatibility continues after various updates.
To illustrate how this inconveniences data sharing, we can zoom in on train status updates. TfL provide a REST endpoint, as well as SignalR, a WebSocket-based protocol. NationalRail provide access via ActiveMQ, a message broker, through either the OpenWire or STOMP protocol. Digitransit makes use of MQTT. Most American providers use GTFS Realtime as their base feed specification and, as mentioned above, might not even provide a realtime push protocol.
The problem here lies in the fact that each of these protocols has its own set of pros and cons. Queues allow for simple communication for servers trying to construct databases of the current state of the transport network, providing the mechanisms necessary for dealing with the huge amounts of data involved. WebSockets provide simple bi-directional communication, usually with strong connection recovery and other functionality to ensure reliability. MQTT provides similar bi-directional functionality to WebSockets with reduced bandwidth, making it useful for IoT devices, but it lacks some of the additional functionality provided by WebSockets.
Each of these protocols has its own unique benefit, so it’s beneficial for providers to support as many as possible. However, the associated engineering cost makes this unlikely. Each additional protocol comes with a cost in terms of maintaining it, updating it and ensuring its availability.
As realtime APIs are more widely adopted across sectors, it’s likely the cost associated with providing protocol support will diminish. This will rely on easy-to-implement protocol adapters which offload the engineering burden for data producers and consumers alike.
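As a rough idea of what such an adapter layer involves, the sketch below decodes the same hypothetical update from a plain JSON body and from a STOMP frame into one shared type. The `TrainUpdate` type and the field names (`trainId`, `platform`, `rid`, `plat`) are invented for illustration and do not reflect any provider's real schema. The STOMP parsing is deliberately simplified: it ignores `content-length`, carriage returns, and header escaping.

```python
import json
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrainUpdate:
    """Unified shape the application works with (hypothetical)."""
    train_id: str
    platform: str
    source: str


def from_json_body(raw: bytes) -> TrainUpdate:
    """Decode a JSON payload, as a REST or WebSocket feed might deliver it."""
    data = json.loads(raw)
    return TrainUpdate(data["trainId"], str(data["platform"]), "json")


def parse_stomp_frame(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split a STOMP frame into command, headers, and body (simplified)."""
    head, _, body = raw.partition(b"\n\n")
    lines = head.decode("utf-8").split("\n")
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers.setdefault(key, value)  # first occurrence of a repeated header wins
    # The frame body ends at the NUL octet.
    return lines[0], headers, body.split(b"\x00", 1)[0]


def from_stomp_frame(raw: bytes) -> TrainUpdate:
    """Decode a STOMP MESSAGE frame carrying a JSON body."""
    command, _headers, body = parse_stomp_frame(raw)
    if command != "MESSAGE":
        raise ValueError(f"unexpected STOMP command: {command}")
    data = json.loads(body)
    return TrainUpdate(data["rid"], str(data["plat"]), "stomp")


frame = (b"MESSAGE\ndestination:/topic/trains\n"
         b"content-type:application/json\n\n"
         b'{"rid": "1A23", "plat": "4"}\x00')
print(from_stomp_frame(frame))
print(from_json_body(b'{"trainId": "1A23", "platform": 4}'))
```

Every provider added means another decoder of this kind to write and maintain, which is the engineering burden that shared adapters would remove.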
Barrier 3: Problematic data structures
In addition to a lack of unification in protocols, there is also a lack of standardization in the way transport providers structure their data. Some companies provide extended information, such as carriage formation, up-to-the-minute ETAs, and seat availability. Others scrape by with the bare minimum of time and transport mode ID.
This can be worked around, but it introduces extra complexity for consumers who must amalgamate various sources of data to extract the exact information they need. With each new data structure, developers need to work out what data corresponds to what, how to correlate similar data, and how to allow for varying degrees of accuracy.
A good illustration of the difference in depth of information between feeds is the variety of options for what has caused a disruption. GTFS Realtime includes twelve possible reasons for delays. NationalRail’s Darwin, however, has a whopping 496 options. So whilst I wanted to set up general indicators for the severity of delays, going through all these options and creating correlations was not feasible in the time I had.
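One possible shortcut, sketched below, is to match keywords in a provider's free-text reason against the twelve GTFS Realtime causes instead of correlating every option by hand. This is an illustration and not what was done for the Hub: the severity levels, the keyword list, and the sample reason strings are all assumptions, and only the twelve cause names come from the GTFS Realtime specification.

```python
from enum import Enum
from typing import Tuple


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# The twelve GTFS Realtime causes; the severity assigned to each is an assumption.
GTFS_CAUSE_SEVERITY = {
    "UNKNOWN_CAUSE": Severity.MEDIUM,
    "OTHER_CAUSE": Severity.MEDIUM,
    "TECHNICAL_PROBLEM": Severity.MEDIUM,
    "STRIKE": Severity.HIGH,
    "DEMONSTRATION": Severity.MEDIUM,
    "ACCIDENT": Severity.HIGH,
    "HOLIDAY": Severity.LOW,
    "WEATHER": Severity.MEDIUM,
    "MAINTENANCE": Severity.LOW,
    "CONSTRUCTION": Severity.LOW,
    "POLICE_ACTIVITY": Severity.HIGH,
    "MEDICAL_EMERGENCY": Severity.HIGH,
}

# Hypothetical keyword rules, checked in order; the first match wins.
KEYWORD_TO_CAUSE = [
    ("industrial action", "STRIKE"),
    ("strike", "STRIKE"),
    ("signal", "TECHNICAL_PROBLEM"),
    ("fault", "TECHNICAL_PROBLEM"),
    ("snow", "WEATHER"),
    ("flood", "WEATHER"),
    ("police", "POLICE_ACTIVITY"),
    ("medical", "MEDICAL_EMERGENCY"),
    ("engineering work", "MAINTENANCE"),
]


def classify(reason_text: str) -> Tuple[str, Severity]:
    """Map a free-text delay reason to a GTFS Realtime cause and a severity."""
    text = reason_text.lower()
    for keyword, cause in KEYWORD_TO_CAUSE:
        if keyword in text:
            return cause, GTFS_CAUSE_SEVERITY[cause]
    return "UNKNOWN_CAUSE", GTFS_CAUSE_SEVERITY["UNKNOWN_CAUSE"]


for reason in ("Signalling fault at junction", "Industrial action", "Leaves on the line"):
    print(reason, "->", classify(reason))
```

Anything the rules miss falls back to `UNKNOWN_CAUSE`, so a mapping like this loses detail and would need reviewing against the provider's full list before being relied on.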
In addition to the structures of data, the amount of data needed can vary significantly. For many consumers (including Google), the basic information provided by GTFS Realtime is plenty for their products (Google Maps). For others though, the information is insufficient. TfL is unlikely to support GTFS for this reason, opting for TransXChange instead.
This leaves the consumer largely at the mercy of providers in terms of information quality. As well as receiving transport data with too little detail, they may receive overly verbose data, which they consume at the risk of additional bandwidth costs, especially as IoT devices become more prevalent in the transport sector.
It’s easy to see why producers aren’t all providing multiple data structures to match each of their users’ needs. They’d need to create conversions between their data structures, ensure these remain consistent with standards as they develop, and create mechanisms to provide varying data structures on demand. This is extra cost, and continual extra engineering time, focussed on a problem the community can work around.
If transport data is to become truly open, however, this barrier will eventually need to be overcome. NYC’s Metropolitan Transportation Authority (MTA) has acknowledged the need to provide different data structures and feed specifications by being one of the first to provide both GTFS-R and SIRI, with simple conversion between the two. Organizations will eventually need to provide these conversions as the needs of developers continue to grow. This can be done by following an industry-wide standard, or by using some form of open-source conversion hub which helps to keep the cost of creating conversions to a minimum, and ensures consistency from company to company. The process can be accelerated if developers themselves can define the most useful formats in which companies can share data. If you have comments you’d like to add to conversations about introducing standards for transport data, get in touch.
Barrier 4: Rate limits
While the above talks about shortfalls in transport data sharing for next-gen apps and services, it’s worth noting the huge progress that’s been made over the past decade. TFL alone has 600 travel apps based on its open data, purportedly used by over 42% of Londoners. However, one sign that many transport providers are not ready to support 2019 travel apps is the fact that they often impose heavy rate limits and restrictions on data usage.
NetworkRail has a limit of 500 people using their queues at any one time. TFL’s RESTful API is limited to 500 requests a minute. Most pull-based systems I’ve encountered don’t seem to be designed to handle large numbers of requests, which inherently reduces the value in the data as it becomes less accessible.
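A quick calculation shows how fast a limit like this bites. The sketch below takes the 500 requests a minute quoted above and works out how often a given number of endpoints can each be polled. The function names and the endpoint counts are hypothetical.

```python
import math


def min_poll_interval_s(resources: int, limit: int, window_s: float = 60.0) -> float:
    """Shortest interval at which each of `resources` endpoints can be polled
    without exceeding `limit` requests per `window_s` seconds."""
    return resources * window_s / limit


def max_resources(interval_s: float, limit: int, window_s: float = 60.0) -> int:
    """Largest number of endpoints that can each be polled every `interval_s` seconds."""
    return math.floor(limit * interval_s / window_s)


LIMIT = 500  # requests per minute, the figure quoted above

# Hypothetical workloads: a handful of lines versus a few hundred stops.
for count in (11, 270):
    print(count, "endpoints ->", min_poll_interval_s(count, LIMIT), "s between polls")

print("endpoints at a 5 s refresh:", max_resources(5, LIMIT))
```

A single API key shared by every user of an app runs out of headroom quickly, which is why limits like these push consumers towards aggregators or their own caching layer.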
Transport data producers generally need to continue working on making their data as available as possible, and this includes relaxing rate limits as far as possible. At present this onus is often shifted over to transport aggregators, such as Transport API.
As more consumers realize the benefit of performing analytics on transport data, distributing it in apps, and incorporating it in their IoT networks, the demand for this data will only continue to increase. As demand increases and more data is available at increasing update rates, the issue of rate limits and infrastructure stability will also become more pressing.
Wrap up: Advice for transport providers
In terms of data sharing, the transport industry has made huge progress over the last decade, and it’s thanks to this we have apps like CityMapper. However, it’s questionable whether transport providers are equipped to share data at the scale and in the format required by developers today. Producers of data aren’t providing data over the necessary protocols, nor is there any indication of the transport sector unifying over a single data structure.
At present, many transport companies appear to rely on transport aggregators, such as Transport API, to deal with the costs and complexity of restructuring their data into a single format and redistributing it at scale. But this has only shifted the cost to the consumer. As data is consumed at ever-growing rates, this cost will become a blocker to innovation and growth based on open data. As new sources of data become available, for example from multi-modal and automated, IoT-controlled logistics and public transport systems, all with similar data sharing needs, it’s more important than ever to open up the data as much as possible. Push-based protocols will go a long way to reducing the cost in systems, and hopefully open-source solutions can be developed and accepted for easy conversion between protocols and data structures.
So long as transport providers are able to keep up with the rate of growth and what today’s developers require, we’re in for some exciting developments. To build developer ecosystems, publishers of data in the transport sector need to bear in mind the following main considerations:
- Provide push feeds to keep up with the ever-growing need for realtime data
- Provide data over as many common protocols as possible, ensuring their data is accessible for as many applications as possible
- Provide varying levels of detail dependent on the end-user’s request, ensuring the consumer is receiving the data they need; look to provide functionality such as deltas to further enhance usability.
- Invest in increasing rate limits, opening data up as much as possible. In the long run, it’ll be worth it.
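On the deltas mentioned in the list above: the idea is to send only what has changed since the last message rather than the full record each time. The sketch below shows one simple way this could work for flat records. The `set`/`unset` message shape and the field names are assumptions for illustration, not a published delta format.

```python
_MISSING = object()  # sentinel so that a stored None is not mistaken for "absent"


def make_delta(previous: dict, current: dict) -> dict:
    """Describe how to turn `previous` into `current` with as little data as possible."""
    changed = {k: v for k, v in current.items() if previous.get(k, _MISSING) != v}
    removed = [k for k in previous if k not in current]
    return {"set": changed, "unset": removed}


def apply_delta(state: dict, delta: dict) -> dict:
    """Rebuild the current record from the last known state and a delta."""
    new_state = {k: v for k, v in state.items() if k not in delta["unset"]}
    new_state.update(delta["set"])
    return new_state


# Hypothetical train record before and after a platform change.
before = {"train_id": "1A23", "platform": "4", "eta": "10:02", "status": "on time"}
after = {"train_id": "1A23", "platform": "7", "eta": "10:02"}

delta = make_delta(before, after)
print(delta)
print(apply_delta(before, delta) == after)
```

Only the changed platform and the dropped status field travel over the wire, which matters most for bandwidth-constrained consumers such as IoT devices.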
Ably is a global cloud network for streaming data and managing the full lifecycle of realtime APIs.
To find out about the current state of data-transfer infrastructure, the development of the new realtime data economy, and how your organization can turn data streams into revenue streams, talk to Ably’s tech team.
For more information on how to expose data streams for free, visit the Ably Hub.