ODI Leeds

Anonymisation and re-identification

We work to encourage local authorities and government bodies to open up data. Data exists on a spectrum and, most often, when we talk about open data we mean non-personal things such as bus fares or bin collections. Once in a while a local authority or combined authority will ask us about datasets that have a wider use but involve personally identifying information.

Protecting privacy is critical. Generally speaking, all personal data should be removed. However, there are some cases where a dataset would be near useless without some notion of individuals, even though it should not be possible to match those individuals to real-life people. In these cases people turn to anonymisation methods.

Anonymisation can involve removing unnecessary fields (e.g. names) and aggregating other fields (e.g. providing postcode outcodes instead of full postcodes). These methods mostly group people and avoid small groups (by age, gender, geography etc.) in the data. However, this approach is prone to re-identification as more fields are included, particularly if individuals turn out to be unique when the dataset is looked at as a whole. If any of the fields can be linked to external data that carries extra information, this could also lead to re-identification.
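
As a rough illustration of the removal and aggregation described above, here is a minimal Python sketch. The records, field names and the `generalise` helper are made up for the example; it assumes UK postcodes, where the inward code is always the last three characters, so the outcode is whatever comes before it.

```python
import re

# Hypothetical example records; none of this data is real.
records = [
    {"name": "A. Example", "postcode": "LS1 4AP", "journeys": 12},
    {"name": "B. Example", "postcode": "ls11 5dj", "journeys": 3},
]

def outcode(postcode: str) -> str:
    """Return the outward part of a UK postcode.

    The inward code is always three characters, so strip whitespace
    and drop the last three characters.
    """
    compact = re.sub(r"\s+", "", postcode.upper())
    return compact[:-3]

def generalise(record: dict) -> dict:
    """Drop the name entirely and replace the full postcode with its outcode."""
    return {
        "outcode": outcode(record["postcode"]),
        "journeys": record["journeys"],
    }

published = [generalise(r) for r in records]
```

This only covers one record at a time. Whether the result is safe to publish still depends on how many people end up sharing each combination of values.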

Sonia Duarte of ODI HQ recently published a blog post about the perceived risks of re-identification and an ODI HQ R&D project about managing the risks of re-identification. That is a three-year project running until March next year and you can get in touch with them to provide feedback.

Tips

If you are currently trying to anonymise data, here are some tips and things to keep in mind:

  • Don't include obviously identifying things such as names, postal addresses, and unique IDs.
  • If geography has to be included, grouping people by area (e.g. by postcode outcode) will reduce the ability to locate a person. However, be aware of any other, different, geographic fields in the dataset (or fields that could be matched to the dataset using external sources) that could create overlapping areas with small numbers of people in them. For example, if publishing bus pass usage data, the postcode outcode together with a bus stop location will narrow down the likely geographic region of the real individual.
  • Age can sometimes be useful in published datasets, e.g. to see if some age groups are losing out. You should not include dates of birth as these are clearly very identifying. Publishers often attempt to retain some useful information by putting people into age brackets, e.g. 0-17, 18-24, 25-44, 45-64, 65-84, and 85+. This works if you are publishing data that doesn't allow an individual to be tracked across time, either within the dataset itself or across updated releases in the future. However, if individuals can be tracked over time in your dataset(s), their date of birth can "leak" out when they move from one age bracket to the next. You could avoid this in a couple of ways:
    • Removing the month and day when calculating the age bracket, i.e. treating everyone as if they were born on 1st January of their birth year.
    • Providing their birth decade because this doesn't change over time.
    However, you should keep an eye out for other fields that depend on age and could change at particular points in time.
  • Time-based information could identify people in a larger dataset if combined with other fields or other, external, information. It could even be used to add more precision to aggregated geography if, say, the events happen along a bus route or in some sequence. Only provide times at the resolution that is actually needed. The ISO 8601 date format is good for this because it allows reduced precision, e.g. 2019-03 for a month or 2019 for a year.
  • Gender can be identifying in cases where people have a non-binary gender or have changed gender within published datasets. You can remove these individuals from your datasets to avoid them being re-identified. Be aware that this will bias your dataset with regard to non-binary/trans groups.
  • If you create a unique ID/hash for an individual, do not use any of the individual's personal data to create it, as there is a small (but increasing with time) chance of someone being able to reverse engineer it with advances in computing power. Also be aware that if you assign unique IDs in a specific order, the ordering may add information to your dataset. For example, if your dataset was ordered by name when an ordered ID was assigned, an individual's placement relative to others gives some idea of what their name was.
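
Some of the tips above (age, time resolution and unique IDs) can be sketched in Python. This is only an illustration under stated assumptions: all function names are hypothetical, the top bracket is assumed to mean "85 and over", and the IDs come from Python's `secrets` module so that they are not derived from personal data and carry no ordering.

```python
import secrets
from datetime import date, datetime

# Age brackets from the tips; the open-ended top bracket is an assumption.
BRACKETS = [(0, 17), (18, 24), (25, 44), (45, 64), (65, 84)]

def age_bracket(birth_year: int, on: date) -> str:
    """Bracket calculated from the birth year only (as if born on 1 January)."""
    age = on.year - birth_year
    if age < 0:
        raise ValueError("birth_year is after the reference date")
    for low, high in BRACKETS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "85+"

def birth_decade(birth_year: int) -> str:
    """A value that never changes over time, e.g. 1987 -> '1980s'."""
    return f"{birth_year // 10 * 10}s"

# ISO 8601 reduced-precision formats, coarsest first.
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d", "hour": "%Y-%m-%dT%H"}

def reduce_resolution(ts: datetime, level: str) -> str:
    """Format a timestamp at only the resolution that is needed."""
    return ts.strftime(FORMATS[level])

def assign_random_ids(keys) -> dict:
    """Map each internal key to a random ID unrelated to the key or its order."""
    ids = {}
    used = set()
    for key in keys:
        token = secrets.token_hex(8)  # 16 hex characters
        while token in used:          # extremely unlikely, but keep IDs unique
            token = secrets.token_hex(8)
        used.add(token)
        ids[key] = token
    return ids
```

The mapping returned by `assign_random_ids` would need to be kept private (or discarded), since anyone holding it can undo the pseudonymisation. It may also be worth sorting or shuffling the published rows so that the file order does not preserve the original ordering.
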
In summary, there are two main themes to be aware of:
  1. Don't include small groups. When all fields are combined, individuals shouldn't be in groups small enough that they can be re-identified.
  2. Be aware of hard edges in the data. If you have aggregated fields into groups with "hard boundaries" (e.g. age brackets), make sure that individuals can't be matched across these boundaries in a way that allows the original value (e.g. date of birth) to be recovered.
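
Both themes can be checked for in a rough way with a few lines of Python. The sketch below is illustrative only: the function names, the example rows and the threshold `k` are assumptions, and what counts as a "small" group is a decision for the publisher. The second function shows how much a hard edge can give away when brackets are calculated from the exact date of birth and the same person is seen on either side of a boundary.

```python
from collections import Counter
from datetime import date

def small_groups(rows, fields, k=5):
    """Return each combination of `fields` shared by fewer than k rows."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}

def birth_date_window(seen_below: date, seen_above: date, boundary_age: int):
    """Window a date of birth must fall in if the same person was below
    `boundary_age` on `seen_below` and at or above it on `seen_above`.

    Returns (exclusive lower bound, inclusive upper bound). Dates falling
    on 29 February are not handled in this sketch.
    """
    lower = seen_below.replace(year=seen_below.year - boundary_age)
    upper = seen_above.replace(year=seen_above.year - boundary_age)
    return lower, upper

# Hypothetical, already-aggregated rows.
rows = [
    {"outcode": "LS1", "age": "25-44", "gender": "F"},
    {"outcode": "LS1", "age": "25-44", "gender": "F"},
    {"outcode": "LS1", "age": "25-44", "gender": "F"},
    {"outcode": "LS2", "age": "65-84", "gender": "M"},
]

risky = small_groups(rows, ["outcode", "age", "gender"], k=3)
window = birth_date_window(date(2019, 3, 1), date(2019, 4, 1), 18)
```

Here `risky` contains the single LS2 row, and `window` shows that two monthly releases are enough to narrow a date of birth down to a one-month range.
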
As mentioned in a recent guest blog post by Jonathan Pearson of NHS England, after anonymisation you could also consider techniques such as swapping, "jittering" and removing a percentage of records from the final extract.
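
The post does not define these techniques in detail, so the following Python sketch is one possible reading of each: "jittering" as adding small random noise to numeric values, "swapping" as exchanging one field's values between randomly chosen pairs of rows, and removal as dropping a random fraction of rows. The function names and parameters are hypothetical.

```python
import random

def jitter(values, max_shift, rng):
    """Add uniform random noise in [-max_shift, max_shift] to each value."""
    return [v + rng.uniform(-max_shift, max_shift) for v in values]

def swap_field(rows, field, fraction, rng):
    """Swap `field` between random pairs covering roughly `fraction` of rows.

    Works on copies, so the input rows are left untouched.
    """
    rows = [dict(r) for r in rows]
    n_pairs = int(len(rows) * fraction / 2)
    chosen = rng.sample(range(len(rows)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        rows[i][field], rows[j][field] = rows[j][field], rows[i][field]
    return rows

def drop_fraction(rows, fraction, rng):
    """Remove a random `fraction` of rows, keeping the rest in order."""
    keep = len(rows) - int(len(rows) * fraction)
    kept_indices = sorted(rng.sample(range(len(rows)), keep))
    return [rows[i] for i in kept_indices]

# Hypothetical extract; a seeded generator is used only to make the demo repeatable.
rng = random.Random(0)
extract = [{"outcode": f"LS{i}", "journeys": i} for i in range(1, 11)]

swapped = swap_field(extract, "outcode", 0.5, rng)
reduced = drop_fraction(swapped, 0.2, rng)
noisy = jitter([row["journeys"] for row in reduced], 2, rng)
```

For a real release you would not publish the seed, and the amount of noise, swapping and removal needs to be balanced against how useful the data remains.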

If you have any tips of your own you can add them to our Open Data Tips page on GitHub.