Although future work is needed, it seems likely
that most large-scale metadata data sets—for example, browsing history, financial records, and
transportation and mobility data—will have a high
unicity. Despite technological and behavioral differences (Fig. 5B and fig. S3), we showed credit
card records to be as reidentifiable as mobile
phone data and their unicity to be robust to
coarsening or noise. Like credit card and mobile
phone metadata, Web browsing or transportation data sets are generated as side effects of human interaction with technology, are subject
to the same idiosyncrasies of human behavior,
and are also sparse and high-dimensional (for example, in the number of Web sites one can visit or
the number of possible entry-exit combinations of
metro stations). This means that these data can
probably be reidentified relatively easily if released in a simply anonymized form and that
they probably cannot be anonymized by simply
coarsening the data.
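As an illustration of the quantities discussed above, the following is a minimal sketch of how unicity could be estimated and how coarsening enters the calculation. It assumes that each user's trace is a set of (shop, day) points and that unicity is the fraction of users whose trace is the only one containing p points drawn at random from it. The names `unicity` and `coarsen`, the point format, and the toy traces are hypothetical and are not taken from the authors' algorithms.

```python
import random


def unicity(traces, p, seed=0):
    """Estimate the share of users singled out by p random points of their trace.

    traces: dict mapping a user id to a set of (shop, day) points (assumed format).
    """
    rng = random.Random(seed)
    # Only users with at least p points can be tested.
    eligible = [user for user, points in traces.items() if len(points) >= p]
    if not eligible:
        raise ValueError("no user has at least p points")
    unique = 0
    for user in eligible:
        # Points assumed to be known to an attacker.
        known = set(rng.sample(sorted(traces[user]), p))
        # Count every trace (including the user's own) that contains them all.
        matches = sum(1 for points in traces.values() if known <= points)
        if matches == 1:
            unique += 1
    return unique / len(eligible)


def coarsen(traces, days_per_bin):
    """Lower the temporal resolution by grouping days into bins of fixed width."""
    return {
        user: {(shop, day // days_per_bin) for shop, day in points}
        for user, points in traces.items()
    }


# Made-up toy data, for illustration only.
traces = {
    "a": {("bakery", 1), ("cafe", 2)},
    "b": {("bakery", 1), ("cafe", 3)},
    "c": {("gym", 1), ("cafe", 2)},
}

print(unicity(traces, p=2))
print(unicity(coarsen(traces, days_per_bin=4), p=2))
```

With only three made-up users, coarsening changes the result far more abruptly than it would in a large, sparse data set; the sketch shows the mechanics of the measurement, not the magnitude of the effect reported here.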
Our results render the concept of PII, on which
the applicability of U.S. and European Union (EU)
privacy laws depends, inadequate for metadata
data sets (18). On the one hand, the U.S. specific-types approach—for which the lack of names,
home addresses, phone numbers, or other listed
PII is enough to not be subject to privacy laws—
is obviously not sufficient to protect the privacy
of individuals in high-unicity metadata data sets.
On the other hand, open-ended definitions expanding privacy laws to “any information concerning an identified or identifiable person” (30)
in the EU proposed data regulation or “[when the]
re-identification to a particular person is not possible” (31) for Deutsche Telekom are probably
impossible to prove and could very strongly limit
any sharing of the data (32).
From a technical perspective, our results emphasize the need to move, when possible, to more
advanced and probably interactive individual (33)
or group (34) privacy-conscientious technologies,
as well as the need for more research in computational privacy. From a policy perspective, our
findings highlight the need to reform our data
protection mechanisms beyond PII and anonymity and toward a more quantitative assessment
of the likelihood of reidentification. Finding the
right balance between privacy and utility is absolutely crucial to realizing the great potential of
REFERENCES AND NOTES
1. S. Higginbotham, “For science, big data is the
microscope of the 21st century” (2011); http://gigaom.
2. D. Lazer et al., Science 323, 721–723 (2009).
3. J. Giles, Nature 488, 448–450 (2012).
4. D. J. Watts, Winter Issue of The Bridge on Frontiers of
Engineering 43, 5–10 (2013).
5. A. Wesolowski et al., Science 338, 267–270 (2012).
6. S. Charaudeau, K. Pakdaman, P.-Y. Boëlle, PLOS ONE 9,
7. N. Eagle, M. Macy, R. Claxton, Science 328, 1029–1031
8. V. Padmanabhan, R. Ramjee, P. Mohan, U.S. Patent 8,423,255
9. G. Boulton, Nature 486, 441 (2012).
10. M. McNutt, Science 346, 679 (2014).
11. T. Bloom, “Data access for the open access literature: PLOS’s
data policy” (2013); www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy.
12. K. Burns, “In US cities, open data is not just nice to have;
it’s the norm” The Guardian, 21 October 2013;
13. Massachusetts Bay Transportation Authority, “Real-time
commuter rail data” (2010); www.mbta.com/rider_tools/
14. Y.-A. de Montjoye, Z. Smoreda, R. Trinquart, C. Ziemlicki,
V. D. Blondel, D4D-Senegal: The second mobile phone data for
development challenge. (2014); http://arxiv.org/abs/1407.4885.
15. V. D. Blondel et al., Data for Development: The D4D
challenge on mobile phone data. (2012); http://arxiv.org/
16. P. Mutchler, “MetaPhone: The sensitivity of telephone
metadata” (2014); http://webpolicy.org/2014/03/12/
17. Y.-A. de Montjoye, J. Quoidbach, F. Robic, A. Pentland,
Predicting personality using novel mobile phone-based
metrics. in Proc. SBP (Springer, Berlin, Heidelberg, 2013),
18. P. M. Schwartz, D. J. Solove, Calif. Law Rev. 102, 877–916 (2014).
19. Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, V. D. Blondel,
Sci. Rep. 3, 1376 (2013).
20. A. Narayanan, V. Shmatikov, Robust de-anonymization of large
sparse datasets. in IEEE Symposium on Security and Privacy,
Oakland, CA, 18 to 22 May 2008 (IEEE, New York, 2008),
21. A. C. Solomon, R. Hill, E. Janssen, S. A. Sanders,
J. R. Heiman, Uniqueness and how it impacts privacy
in health-related social science datasets. in Proc. IHI
(Association for Computing Machinery, New York, 2012),
22. L. Sweeney, Int. J. Unc. Fuzz. Knowl. Based Syst. 10, 557–570
23. 2013 Federal Reserve payments study (2013);
24. eMarketer, “US mobile payments to top 1 billion in 2013”
25. “The trust advantage: How to win with big data” (2013);
26. C.-L. Huang, M.-C. Chen, C.-J. Wang, Expert Syst. Appl. 33,
27. S. Bhattacharyya, S. Jha, K. Tharakunnel, J. C. Westland,
Decis. Support Syst. 50, 602–613 (2011).
28. C. Krumme, A. Llorente, M. Cebrian, A. S. Pentland, E. Moro,
Sci. Rep. 3, 1645 (2013).
29. Materials and methods are available as supplementary
materials on Science Online.
30. European Commission, “General data protection regulation”
31. Deutsche Telekom, “Guiding principle big data” (2014); www.
32. Y.-A. de Montjoye, J. Kendall, C. Kerry, Enabling Humanitarian
Use of Mobile Phone Data. Brookings Issues in Technology Innovation
Series (Brookings Institution, Washington, DC, 2014), vol. 26.
33. Y.-A. de Montjoye, S. S. Wang, A. S. Pentland, IEEE Data Eng. Bull.
35, 5–8 (2012).
34. C. Dwork, in Automata, Languages and Programming (Lecture
Notes in Computer Science Series, Springer, Berlin, Heidelberg,
2006), vol. 4052, pp. 1–12.
For contractual and privacy reasons, we unfortunately cannot
make the raw data available. Upon request we can, however, make
individual-level data of gender, income level, resolution (h, v, a),
and unicity (true, false), along with the appropriate documentation,
available for replication. This allows the re-creation of Figs. 2 to 4,
as well as the GLM and all of the unicity statistics. A
randomly subsampled data set for the four-point case can be
found at http://web.media.mit.edu/~yva/uniqueintheshoppingmall/
and in the supplementary materials. This work was supported in
part by the Geocrowd Initial Training Network funded by the
European Commission as an FP7-People Marie Curie Action under
grant agreement number 264994, and in part by the Army
Research Laboratory under Cooperative Agreement Number
W911NF-09-2-0053. Y.-A.d.M. was partially supported by the
Belgian American Educational Foundation and Wallonie-Bruxelles
International. L. R. did part of this work while visiting the MIT
Media Lab. We gratefully acknowledge B. Bozkaya and a bank that
wishes to remain anonymous for access to the data. Views and
conclusions in this document are those of the authors and
should not be interpreted as representing the policies, either
expressed or implied, of the sponsors.
Materials and Methods
Figs. S1 to S5
Tables S1 and S2
Algorithms S1 and S2
20 May 2014; accepted 23 December 2014
Fig. 5. Distributions of the financial records. (A) Probability density function of the price of a transaction in dollars equivalent. (B) Probability density function of the spatial distance between two consecutive
transactions of the same user. The best fits of a power law (dotted line) and an exponential distribution
(dot-dashed line) are given as a reference. The dashed lines mark the diameters of the first and second
largest cities in the country. Thirty percent of the successive transactions of a user are less than 1 km
apart (the shaded area). Beyond that, the density drops by an order of magnitude to a plateau between 2 and 20 km,
roughly the radius of the two largest cities in the country. This shows that financial metadata are different
from mobility data: The likelihood of a short travel distance is very high and then plateaus, and the overall
distribution does not follow a power-law or exponential distribution.
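The quantity plotted in panel (B) can be sketched as follows, assuming each transaction is stored as a (timestamp, (latitude, longitude)) pair per user. The great-circle (haversine) distance, the function names, and the toy coordinates are assumptions made for illustration; the caption does not state how the distances were computed.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def consecutive_distances(transactions):
    """Distances between successive transactions of the same user.

    transactions: dict mapping a user id to a list of (timestamp, (lat, lon)).
    """
    distances = []
    for visits in transactions.values():
        ordered = sorted(visits)  # order each user's transactions by timestamp
        for (_, here), (_, there) in zip(ordered, ordered[1:]):
            distances.append(haversine_km(here, there))
    return distances


def share_within(distances, limit_km=1.0):
    """Fraction of distances strictly below limit_km."""
    return sum(d < limit_km for d in distances) / len(distances)


# Made-up toy data, for illustration only.
transactions = {
    "u1": [(1, (0.0, 0.0)), (2, (0.0, 0.005)), (3, (0.0, 0.5))],
    "u2": [(1, (10.0, 10.0)), (2, (10.0, 10.0))],
}

distances = consecutive_distances(transactions)
print(share_within(distances, limit_km=1.0))
```

A histogram of `distances` on logarithmic axes would give an empirical counterpart of the density in panel (B), and `share_within` corresponds to the shaded area below 1 km.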