ODI Leeds

Releasing Synthetic A&E Data

Over the coming months we're working with the NHS to release synthetic data about A&E departments - specifically the waiting times for individuals and what they mean for the pressures that departments are facing. This blog is part one of a four-step process:

This blog summarising our plan.

  1. Development of a synthetic dataset by analysts within a secure NHS environment, under the instruction and supervision of information governance experts in NHS England.
  2. A closed session to be held at ODI Leeds where experts in data security and anonymisation (along with interested parties and other stakeholders) can understand the methodology used to create the synthetic data. Then the suitability of the dataset can be assessed, including its potential to support innovative and exploratory analysis.
  3. Reflecting on feedback from the closed session, making any changes to the proposed dataset, and undertaking a final quality assessment of the synthetic data.
  4. Release of the synthetic A&E dataset as open data.

We think that better use of data can improve care and save lives, but it must be treated extremely carefully. If at any step we discover problems with the data that can't be fixed, we will stop the process.

Why?

Summary data about A&E departments in UK hospitals is big news. Waiting times data is analysed and fills front pages, the TV, and the radio. Things get political very quickly.

Information on A&E activity is widely requested by people, and some places have done well at making this information available. In Northern Ireland, a single website with all A&E waiting times provides value to people and helps to even out demand. When data is more open, it lets more people and businesses provide more great services at a quicker pace.

But an average waiting time doesn't tell us much about what's really happening.

Who needed care? At what time? For what problem? And what happened as a result?

Analysts across the NHS already collect information and use it to improve care and reduce costs. They know that being more open will help. Working in the open would put them in touch with more external experts and introduce them to new techniques. Working in the open would make it easier for managers to judge external products & services and figure out how to buy & use them efficiently.

But detailed A&E admissions data is far too sensitive to be made open. It is rightly afforded special category status within both the GDPR and the Data Protection Act 2018.

Does this mean we can't do anything? We don't think so. Using the data appropriately, we can create datasets that are both powerful and protective of people's privacy.

What is synthetic data?

The most common way to take sensitive personal data and make it safe to release is to aggregate it. You take your data, answer questions like "how many people arrived with a broken limb, by day, for every A&E?" and then release the tables. In this way, only summaries are published and individual patient data remains secure.
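As a rough illustration of the aggregation described above, here is a minimal Python sketch. The records, hospital names, and the function name `count_by_hospital_and_day` are all invented for this example; it is not the process used by the NHS, just one simple way to turn patient-level rows into a table of counts.

```python
from collections import Counter

# Hypothetical patient-level records, invented for illustration:
# (hospital, arrival date, injury type)
records = [
    ("Hospital A", "2018-11-01", "broken limb"),
    ("Hospital A", "2018-11-01", "broken limb"),
    ("Hospital A", "2018-11-02", "burn"),
    ("Hospital B", "2018-11-01", "broken limb"),
    ("Hospital B", "2018-11-02", "broken limb"),
]


def count_by_hospital_and_day(rows, injury):
    """Count arrivals with the given injury, per hospital per day."""
    return Counter(
        (hospital, day) for hospital, day, kind in rows if kind == injury
    )


# Only this summary table would be published, not the rows above.
table = count_by_hospital_and_day(records, "broken limb")
for (hospital, day), n in sorted(table.items()):
    print(hospital, day, n)
```

The individual rows never leave the secure environment; only the counts in `table` do.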

Alternatively, we create and share synthetic data. You take your data and analyse the patterns and relationships within it. You then create new data in a way that maintains those patterns and relationships but no actual patient data remains. No one's information is identifiable, but most of the value of the data remains.
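To make the idea concrete, here is a deliberately naive Python sketch. It assumes a made-up table of injury type and arrival hour, learns how often each combination occurs, and then draws brand new rows from those frequencies. The data and the function names (`fit_joint_distribution`, `sample_synthetic`) are hypothetical, and this is not the methodology being used for the A&E dataset. A simple frequency model like this does not guarantee privacy on its own, which is exactly why the method for a real release needs expert review.

```python
import random
from collections import Counter

# Hypothetical source rows, invented for illustration:
# (injury type, arrival hour)
real_rows = [
    ("broken limb", 14), ("broken limb", 15), ("broken limb", 14),
    ("burn", 19), ("burn", 20),
    ("chest pain", 3),
]


def fit_joint_distribution(rows):
    """Record how often each combination of values occurs."""
    return Counter(rows)


def sample_synthetic(distribution, n, seed=0):
    """Draw n new rows with probability proportional to observed counts."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    combos = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(combos, weights=weights, k=n)


distribution = fit_joint_distribution(real_rows)
synthetic_rows = sample_synthetic(distribution, n=1000)

# Compare one pattern in the real and synthetic data.
real_share = sum(1 for kind, _ in real_rows if kind == "broken limb") / len(real_rows)
synthetic_share = sum(1 for kind, _ in synthetic_rows if kind == "broken limb") / len(synthetic_rows)
print("share of broken limbs (real):", real_share)
print("share of broken limbs (synthetic):", synthetic_share)
```

The synthetic rows are generated from the learned pattern rather than copied from patients, yet the proportion of broken limbs and their typical arrival hours stay close to the original.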

Aggregate data is better for immediate consumption - high level overviews create new stories, can help teams improve, and inform general decisions.

Synthetic data is better for exploring. You can start building a tool on synthetic data straight away. There's no need to wait for a licence, or an FTP login, or developer credentials to get started. You can test it, improve it, and pitch it to investors or the NHS, knowing that with just a few hours of work it can be switched over to the real dataset.

That's why, for this project, we want to release a synthetic dataset of A&E admissions. The columns of the data will directly relate to the collected variables (e.g. injury type or time), and the size of the dataset - tens of millions of rows - will be similar too.

Analysts at the NHS are constantly working to provide a better service by understanding their data better; whether by themselves, learning from external experts, or working in partnership with companies specialising in data analytics and building great tools and services. We think that synthetic A&E data can help, and we think that you can help us.

How can you help us?

Our project has four steps, and we need your help at part two. NHS colleagues have worked through the data with help from experts in data anonymisation and transformed it into synthetic data.

But if you disagree with this approach, we want to hear from you. Maybe you think that even if A&E data can be successfully anonymised it shouldn't be released. Maybe you'd like to look at the details of how we've created the synthetic dataset to check the methodology and ensure that the richness of the data remains. Maybe you work in or with the NHS, like Beautiful Information, on these sorts of things. Maybe you're a company like Privitar, who we've worked with in the past, who have ways of checking how anonymous anonymised data really is. Or maybe you've got an idea that uses this data to improve healthcare in the UK.

Whatever the reason for your interest, we want to hear from you.

You can also register your interest in attending the 'closed session' about synthetic A&E data. The date is still to be confirmed but we are looking at some time in the first quarter of 2019.