Furthermore, financial traces contain one additional column that can be used to reidentify an
individual: the price of a transaction. A piece of
outside information, a spatiotemporal tuple, can
become a triple: space, time, and the approximate
price of the transaction. The data set contains the
exact price of each transaction, but we assume
that we only observe an approximation of this
price with a precision a, which we call the price resolution.
Prices are approximated by bins whose size is
increasing; that is, the size of a bin containing
low prices is smaller than the size of a bin containing high prices. The size of a bin is a function
of the price resolution a and of the median price
m of the bin (29). Although knowing the location
of my local coffee shop and the approximate time
I was there this morning helps to reidentify me,
Fig. 2 (blue bars) shows that also knowing the
approximate price of my coffee significantly increases the chances of reidentifying me. In fact,
adding the approximate price of the transaction
increases, on average, the unicity of the data set
by 22% (fig. S2, when a = 0.50, ⟨Δε⟩ = 0.22).
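As an illustration of how unicity can be estimated from such tuples or triples, the following minimal Python sketch draws p points at random from a trace and checks whether any other trace contains all of them. The data layout (a dict mapping a user to a list of points) and the names `is_unique` and `estimate_unicity` are assumptions made for this example; the actual procedure is described in (29) and may differ.

```python
import random


def is_unique(points, user, trace_sets):
    """Return True if no trace other than `user`'s contains all `points`."""
    return all(
        other == user or not points <= trace
        for other, trace in trace_sets.items()
    )


def estimate_unicity(traces, p, n_samples=1000, seed=0):
    """Estimate the share of traces singled out by p randomly drawn points.

    `traces` maps a user id to a list of points, e.g. (shop, day) tuples or
    (shop, day, price_bin) triples. This layout is assumed for illustration.
    """
    rng = random.Random(seed)
    trace_sets = {user: set(trace) for user, trace in traces.items()}
    eligible = [user for user, trace in traces.items() if len(trace) >= p]
    if not eligible:
        raise ValueError("no trace has at least p points")
    hits = 0
    for _ in range(n_samples):
        user = rng.choice(eligible)
        # Sampling from the list keeps repeated points as likely as they occur.
        points = set(rng.sample(traces[user], p))
        hits += is_unique(points, user, trace_sets)
    return hits / n_samples


# Toy input: same shop and day for both users, different price bins.
without_price = {"u1": [("cafe", 1)], "u2": [("cafe", 1)]}
with_price = {"u1": [("cafe", 1, "low")], "u2": [("cafe", 1, "high")]}
print(estimate_unicity(without_price, p=1))
print(estimate_unicity(with_price, p=1))
```

In this toy input the two traces cannot be told apart from the spatiotemporal tuple alone, whereas the added price bin separates them.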
The unicity ε of the data set naturally decreases
with its resolution. Coarsening the data along any
or all of the three dimensions makes reidentification
harder. We artificially lower the spatial resolution
of our data by aggregating shops in clusters
of increasing size v based on their spatial proximity
(29). This means that we do not know the
exact shop in which the transaction happened, but
only that it happened in this geographical area.
We also artificially lower the temporal resolution
of the data by increasing the time window h of a
transaction from 1 day to up to 15 days. Finally, we
increase the size of the bins for price a from 50 to
75%. In practice, this means that the bin into which a
$15.13 transaction falls widens from $5 to $16
(a = 0.50) to $5 to $34 (a = 0.75) (table S2).
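The three coarsening steps can be sketched in Python as follows. The shop-to-cluster mapping `cluster_of`, the zero-based day index, and in particular the price rule are assumptions made for this example: bins here simply grow geometrically by a factor (1 + a) / (1 - a), which reproduces the property that bins widen with price and with a, but not the bin edges of table S2. The actual bin definition is given in (29).

```python
import math


def coarsen_day(day, h):
    """Map a zero-based day index to the index of its h-day window."""
    return day // h


def price_bin(price, a):
    """Hypothetical bin index: edges grow geometrically by (1 + a) / (1 - a)."""
    if price <= 0 or not 0 < a < 1:
        raise ValueError("need price > 0 and 0 < a < 1")
    ratio = (1 + a) / (1 - a)
    return math.floor(math.log(price) / math.log(ratio))


def bin_bounds(index, a):
    """Lower and upper edge of a hypothetical price bin."""
    ratio = (1 + a) / (1 - a)
    return ratio ** index, ratio ** (index + 1)


def coarsen(point, h, cluster_of, a):
    """Turn an exact (shop, day, price) point into a coarse triple."""
    shop, day, price = point
    return cluster_of[shop], coarsen_day(day, h), price_bin(price, a)


# Illustrative mapping of shops to a geographical cluster (assumed).
cluster_of = {"cafe": "area_7", "bakery": "area_7"}
print(coarsen(("cafe", 14, 15.13), h=15, cluster_of=cluster_of, a=0.50))
print(coarsen(("bakery", 3, 20.00), h=15, cluster_of=cluster_of, a=0.50))
```

With these assumed settings, two transactions that differ in shop, day, and price collapse onto the same coarse triple, which is the intended effect of lowering the resolution.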
Figure 3 shows that coarsening the data is not
enough to protect the privacy of individuals in
financial metadata data sets. Although unicity
decreases with the resolution of the data, it only
decreases slowly along the spatial (v), temporal
(h), and price (a) axes. Furthermore, this decrease
is easily overcome by collecting a few more points
(table S1). For instance, at a very low resolution
of h = 15 days, v = 350 shops, and an approximate
price a = 0.50, we have less than a 15% chance of
reidentifying an individual knowing four points
(ε₄ < 0.15). However, if we know 10 points, we
now have more than an 80% chance of reidentifying this person (ε₁₀ > 0.8). This means that
even financial data sets that are noisy and/or coarse
along all of the dimensions provide little anonymity.
We also studied the effects of gender and
income on the likelihood of reidentification.
Figure 4A shows that women are easier to reidentify
than men, whereas Fig. 4B shows that the
higher somebody’s income is, the easier it is to
reidentify him or her. In fact, in a generalized
linear model (GLM), the odds of a woman being
reidentified are 1.214 times those of a man.
Similarly, the odds of high-income and
medium-income people being reidentified are,
respectively, 1.746 and 1.172 times those
of low-income people (29). Although a full
causal analysis or investigation of the determinants
of reidentification of individuals is beyond
the scope of this paper, we investigate a couple
of variables through which gender or income
could influence unicity. A linear discriminant
analysis shows that the entropy of shops, that is, how
one shares his or her time between the shops he
or she visits, is the most discriminative factor for
both gender and income (29).
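To make the notion of the entropy of shops concrete, the short Python sketch below computes the Shannon entropy of the share of transactions a person makes in each shop. Measuring "time shared between shops" by transaction counts, and the name `shop_entropy`, are assumptions made for this example; the exact definition used in the analysis is given in (29).

```python
import math
from collections import Counter


def shop_entropy(shops):
    """Shannon entropy (in bits) of how transactions are spread over shops."""
    counts = Counter(shops)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


# One shop only: no uncertainty about where the next transaction happens.
print(shop_entropy(["cafe", "cafe", "cafe", "cafe"]))
# Four shops visited equally often: the maximum for four shops.
print(shop_entropy(["cafe", "bakery", "gym", "bookshop"]))
```

A person who always visits the same shop has an entropy of 0 bits, and one who spreads transactions evenly over four shops has an entropy of 2 bits.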
Our estimation of unicity picks the points at
random from an individual’s financial trace. These
points thus follow the financial trace’s nonuniform distributions (Fig. 5A and fig. S3A). We
are thus more likely to pick a point where most
of the points are concentrated, which makes the
sampled points less useful on average. However, even in this case,
seven points were enough to reidentify all of the
traces considered (fig. S4). More sophisticated reidentification strategies could collect points that
would maximize the decrease in unicity.
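One such strategy can be sketched in Python as a greedy selection that, at each step, keeps the point shared with the fewest traces still compatible with the points already chosen. This is a hypothetical illustration, not a strategy evaluated in this paper: it assumes an adversary who can inspect the other traces in the data set, and the name `greedy_points` and the list-of-points layout are invented for the example.

```python
def greedy_points(target, others, p):
    """Pick up to p points of `target` that rule out other traces fastest.

    Returns the chosen points and the number of other traces that still
    contain all of them (0 means the target is singled out).
    """
    candidates = [set(trace) for trace in others]
    remaining = set(target)
    chosen = []
    while remaining and candidates and len(chosen) < p:
        # Choose the point contained in the fewest still-matching traces;
        # sorting first makes ties resolve deterministically.
        best = min(
            sorted(remaining),
            key=lambda point: sum(point in trace for trace in candidates),
        )
        chosen.append(best)
        remaining.discard(best)
        candidates = [trace for trace in candidates if best in trace]
    return chosen, len(candidates)


target = [("cafe", 1), ("cafe", 1), ("bookshop", 3)]
others = [[("cafe", 1)], [("cafe", 1), ("gym", 2)]]
print(greedy_points(target, others, p=2))
```

In this toy input a random draw would most often return the frequent café point, which both other traces share, whereas the greedy choice picks the rare bookshop point and singles out the target with one point.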
538 30 JANUARY 2015 • VOL 347 ISSUE 6221 sciencemag.org SCIENCE
Fig. 3. Unicity (ε₄) when we lower the resolution of the data set on any or all of the three dimensions; with four spatiotemporal tuples [(A), no price]
and with four spatiotemporal-price triples [(B), a = 0.75; (C), a = 0.50]. Although unicity decreases with the resolution of the data, the decrease is easily
overcome by collecting a few more points. Even at very low resolution (h = 15 days, v = 350 shops, price a = 0.50), we have more than an 80% chance of
reidentifying an individual with 10 points (ε₁₀ > 0.8) (table S1).
Fig. 4. Unicity for different categories of users (v = 1, h = 1).
(A) It is significantly easier to reidentify women (ε₄ = 0.93) than men
(ε₄ = 0.89). (B) The higher a person’s income is, the easier he or she
is to reidentify. High-income people (ε₄ = 0.93) are significantly
easier to reidentify than medium-income people (ε₄ = 0.91), and
medium-income people are themselves significantly easier to reidentify than low-income people (ε₄ = 0.88). Significance levels were
tested with a one-tailed t test (P < 0.05). Error bars denote the 95%
confidence interval on the mean.