A paper published last month raises deep concerns about the way that user data is currently anonymised, and suggests that no complex dataset can be successfully anonymised using standard techniques.
The research has some pretty terrifying consequences.
It suggests that the data held by hospitals, credit agencies, and tech firms can never be protected sufficiently.
Anonymity and identification
Anonymised datasets are supposed to allow researchers to work with data without being able to uniquely identify individuals.
A hospital, for instance, might remove names and dates of birth from patient data, and then release this to health researchers.
The problem is that if this data contains enough attributes about each person, it can be reverse-engineered in order to identify individuals.
A number of high-profile examples of this have occurred in the past few years.
In 2008, for instance, researchers de-anonymised a publicly available Netflix dataset of film ratings by comparing the ratings with public scores on the IMDb film website.
The recent research was carried out by Belgium’s Université Catholique de Louvain (UCLouvain) and Imperial College London, and aims to estimate how easily individuals in data sets like this can be re-identified.
Their conclusions are striking: a dataset that contains 15 demographic attributes, for instance, “would render 99.98 per cent of people in Massachusetts unique”.
The problem gets even worse when the population covered by the data is smaller.
If town-level location data is included, for instance, “it would not take much to reidentify people living in Harwich Port, Massachusetts, a city of fewer than 2,000 inhabitants”.
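The paper’s headline numbers come from careful statistical modelling, but the basic effect is easy to reproduce with a toy simulation. The sketch below is a rough illustration only, with made-up attribute cardinalities rather than the paper’s model: it shows how quickly a handful of demographic attributes makes people unique in a simulated population of 100,000.

```python
import random
from collections import Counter

random.seed(0)

# Illustrative attribute cardinalities (e.g. area code, birth year,
# gender, number of children...) -- assumptions, not the paper's model.
CARDINALITIES = [1000, 80, 2, 5, 4, 20, 10, 6]

# Simulate 100,000 people, each represented as a tuple of attribute values.
people = [tuple(random.randrange(c) for c in CARDINALITIES)
          for _ in range(100_000)]

# Fraction of people made unique by their first k attributes.
for k in range(1, len(CARDINALITIES) + 1):
    counts = Counter(p[:k] for p in people)
    unique = sum(1 for p in people if counts[p[:k]] == 1)
    print(f"{k} attributes: {unique / len(people):6.1%} unique")
```

With one attribute almost nobody is unique; by the time all eight are combined, nearly everyone is — which is the core of the re-identification problem.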
The scale of the problem
These might seem like abstract concerns, but they are not.
The problem is that plenty of companies release ‘anonymous’ data sets as a core part of their business.
In many cases, in fact, selling this data provides the core income for these companies, and the larger the amount of data on each individual, the more the data sells for.
This gives companies an incentive to release huge, detailed data sets that can easily be de-anonymised.
The researchers of the recent report highlight a few examples of this.
They draw particular attention, for instance, to one set of data sold to the computer software firm Alteryx, which contained 248 attributes per household for 120 million Americans.
The concerns raised by this research will affect companies and consumers differently.
Many companies work with an ‘anonymise, release, and forget’ attitude. The research suggests that this practice might have to change, because it shows that supposedly secure data sets are anything but.
It also raises a question about whether anonymisation is enough for companies to comply with relevant legislation like the GDPR.
For consumers, this research merely confirms a long-term concern about how private data is used by companies.
Even where users do not give explicit permission, plenty of companies release supposedly anonymous data by accident.
Gary Stevens, CISO at the community research group HostingCanada.org, is one of a growing number of voices that have repeatedly raised concerns about the logging policies of supposedly secure VPN and email providers.
Stevens points out that even VPNs can leak data – and they often do – meaning that privacy applications which are supposed to protect us actually do the opposite.
Novel solutions to the problems raised by the research are in development, and several large companies have already taken measures to reduce the risk of re-identification.
Differential privacy is one such approach and is used by companies such as Apple and Uber.
Both of these companies release data sets from time to time, and these sets contain huge numbers of data points for each (anonymous) individual.
To avoid the risk of re-identification, though, differential privacy deliberately makes each data point ‘fuzzy’: carefully calibrated random noise is added to each value before release.
The noise is chosen so that it averages out over the whole data set: aggregate statistics remain accurate, but individual records become unreliable, defeating attempts to uniquely identify each user.
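As a concrete, simplified illustration of this idea, the classic way to make a counting query differentially private is the Laplace mechanism. The sketch below is a minimal example of that mechanism, not a description of Apple’s or Uber’s actual implementations:

```python
import random

def dp_count(records, predicate, epsilon):
    """Epsilon-differentially-private count via the Laplace mechanism.

    A count changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponentials is a Laplace(0, 1/epsilon) sample.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

ages = [random.randrange(18, 90) for _ in range(10_000)]
print(dp_count(ages, lambda a: a >= 65, epsilon=0.5))  # true count, plus noise
```

A single noisy answer is off by a few units, but averaged over the whole data set the statistics remain accurate — while any individual’s contribution is hidden in the noise.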
Another approach is homomorphic encryption, which allows data to remain encrypted while still being manipulated by researchers. In the near future, it is also likely that AIs will provide a more exotic solution to the problem.
AIs can be trained to work with large data sets, and then produce anonymous synthetic data that is statistically equivalent to the original but contains entirely different values.
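Homomorphic encryption can be hard to picture, so here is a toy sketch of the Paillier cryptosystem, one well-known additively homomorphic scheme: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny primes below are for illustration only; a real deployment would use a vetted cryptographic library and large keys.

```python
import random
from math import gcd

# Toy Paillier key generation (tiny primes -- illustration only!).
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)                          # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:                     # r must be coprime to n
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Adding encrypted values without ever decrypting them:
c_sum = (encrypt(5) * encrypt(7)) % n2
print(decrypt(c_sum))  # 12
```

The appeal for privacy research is clear: an analyst can compute sums over encrypted records without ever seeing any individual’s raw data.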
Users, however, are unlikely to wait for these technologies to be implemented.
In fact, data from recent years suggests that users are more concerned than ever about their online privacy, and are increasingly turning to privacy tools to protect themselves against data leaks.
Whilst these tools might provide an extra level of defence against accidental releases of information, they do not protect consumers against the kind of releases covered in recent research.
In these cases, users had (knowingly or not) given consent for ‘anonymous’ data to be released. They – and the companies who released this information – were simply unaware that it could be re-identified.
Solutions to this issue will not be easy; they rely on increased awareness, among both users and tech companies, of just how flawed current anonymisation practices are.
Editor’s note: e27 publishes relevant guest contributions from the community. Share your honest opinions and expert knowledge by submitting your content here.
Image Credit: Guillermo Latorre