method to compute their average salary, that group could deduce the salary of their original colleague. The protocol delivers completely accurate results but can also
Fully homomorphic encryption of cloud data. Suppose a researcher wishes to study salaries using bank data. People routinely and confidently send such information encrypted over the Internet. Financial institutions decrypt the messages and perform calculations. But what if the bank or other data receiver not only could perform calculations without ever decrypting the private information but also could return encrypted answers that only the sender could decode? Long thought to be impossible, “fully homomorphic encryption” methods have recently been devised (10) to do just that. Most algorithms are still too slow for practical applications. Proposed protocols could, however, analyze a population’s encrypted data but only allow statistics to be decrypted if participants verify that calculations have been done to their satisfaction (11).
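To make the idea concrete, here is a minimal sketch using the Paillier cryptosystem, which is additively homomorphic rather than fully homomorphic; fully homomorphic schemes such as Gentry's (10) extend the same principle to arbitrary computations. The salaries, parameter sizes, and function names below are illustrative assumptions, not part of the protocols cited above.

```python
# Toy sketch of computing on encrypted salaries with the Paillier
# cryptosystem, which is additively (not fully) homomorphic.
# Demo-sized primes only; real deployments use moduli of 2048 bits
# or more. Requires Python 3.8+ for pow(x, -1, n) modular inverses.
import random

p, q = 1_000_003, 1_000_033          # tiny demo primes (insecure)
n, n2 = p * q, (p * q) ** 2
g = n + 1                            # standard generator choice
lam = (p - 1) * (q - 1)              # Euler phi; Carmichael lambda also works
mu = pow(lam, -1, n)                 # simple form valid because g = n + 1

def encrypt(m: int) -> int:
    """Encrypt m under the public key (n, g); randomized each call."""
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    """Decrypt with the private key (lam, mu)."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# The data holder multiplies ciphertexts; plaintexts add underneath,
# so the sum is computed without ever decrypting an individual salary.
salaries = [52_000, 61_500, 48_250]
ciphertexts = [encrypt(s) for s in salaries]

encrypted_total = 1
for c in ciphertexts:
    encrypted_total = (encrypted_total * c) % n2

# Only the key holder can decrypt the aggregate.
assert decrypt(encrypted_total) == sum(salaries)
print("average salary:", decrypt(encrypted_total) / len(salaries))
```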
By giving control over their data to potential subjects rather than to researchers, such techniques jeopardize plans for replicability and reuse, as well as for representative or even adequate sampling. Supposing there are results to release, it may still be possible for a researcher to violate the privacy of individuals who participate in the study. Any protocol that allows exact counts of subpopulations is vulnerable to a “differencing attack,” for example. To find out whether the CEO of a company earns more than $1 million, just make two simple inquiries: how many employees earn over $1 million in salary, and how many who are not the CEO earn over $1 million. It may seem straightforward to rule out lines of questioning like this. Provably, however, no algorithm can reliably determine whether a given set of questions that seem to ask only about statistical aggregates would nevertheless have answers that, taken together, reveal private information (12).
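A minimal sketch shows how little machinery the differencing attack needs; the employee records and query interface here are hypothetical.

```python
# Minimal sketch of a differencing attack: two innocent-looking
# aggregate queries pin down one person's private value. All data
# and names are hypothetical.
employees = [
    {"name": "CEO",   "salary": 2_400_000},
    {"name": "Alice", "salary": 1_150_000},
    {"name": "Bob",   "salary":   890_000},
]

def count_over(records, threshold):
    """An exact counting query, as a statistics service might offer."""
    return sum(1 for r in records if r["salary"] > threshold)

# Query 1: how many employees earn over $1 million?
q1 = count_over(employees, 1_000_000)
# Query 2: how many employees other than the CEO earn over $1 million?
q2 = count_over([r for r in employees if r["name"] != "CEO"], 1_000_000)

# Each answer is a harmless-looking aggregate; their difference is not.
print("CEO earns over $1 million:", q1 - q2 == 1)   # prints True
```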
Differential privacy for curated data. Consider a data set D that contains my personal information and another data set D′ that is missing my data but otherwise the same. A research protocol would be privacy-preserving if it could not distinguish between D and an adjacent D′. It also would not be very useful. But what if the protocol could barely and rarely make such a distinction? Consider the probabilities that a certain methodology generates a given answer to a given question when applied to D as compared with D′. The ratio of those two probabilities should be as close to one as possible. The log of that ratio measures the loss of privacy incurred when the protocol answers the given question. If the log is always less than ε for any adjacent data sets, the protocol provides ε-differential privacy.
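Written out, with M denoting the randomized protocol (notation introduced here, matching the standard formulation of the definition), the requirement is:

```latex
% epsilon-differential privacy: for every pair of adjacent data sets
% D and D' (differing in one person's records) and every possible answer a,
\[
  \left| \ln \frac{\Pr[\,M(D) = a\,]}{\Pr[\,M(D') = a\,]} \right| \le \varepsilon
\]
% Equivalently: Pr[M(D) = a] <= e^epsilon * Pr[M(D') = a].
```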
Dwork, McSherry, Nissim, and Smith formulated this definition, showed it captures basic intuitions about privacy, and devised research protocols that provide ε-differential privacy (13). Data are held by a trusted curator who accepts only certain questions from the investigator. The curator performs calculations behind a firewall but returns answers only after adding a small amount of carefully chosen noise. It suffices, for example, to draw noise from a Laplace distribution with parameter 1/ε when responding to a counting query, as sketched below. There are limits on the type and number of questions allowed, as each could deplete a privacy budget by as much as ε.
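As a sketch of the curator's noise step, the following assumes a counting query, whose answer changes by at most 1 when any one person's data are added or removed; the data and function names are illustrative.

```python
# Sketch of the Laplace mechanism for a counting query. A count has
# sensitivity 1, so noise with scale 1/epsilon yields
# epsilon-differential privacy for that single query.
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    """Answer 'how many records satisfy predicate?' with epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical example: smaller epsilon -> stronger privacy, noisier answer.
salaries = [52_000, 61_500, 48_250, 1_150_000, 2_400_000]
for eps in (1.0, 0.1):
    answer = private_count(salaries, lambda s: s > 1_000_000, eps)
    print(f"epsilon={eps}: noisy count over $1M ~ {answer:.2f}")
```

Running the loop with ε = 1 versus ε = 0.1 makes the trade-off tangible: the tighter the privacy guarantee, the wider the spread of the noisy answers.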
Choosing ε for a differentially private protocol determines how the research will trade accuracy against privacy. The smaller ε is, the less leakage of information, but at the cost of more noise. One promising application is the Census Bureau's OnTheMap project (14). Payroll records in each state have been carefully perturbed and aggregated to create a “synthetic database.” The public can query that database to receive approximate, but quite accurate, answers to a large class of counting and geographic questions (15).
PICKING PROTOCOLS. Setting aside administrative, financial, legal, or institutional factors that do not bear directly on accuracy and privacy, some basic suggestions for comparing protocols are clear. Potential subjects considering participation in a study should ask if there is another protocol that would yield at least as reliable scientific results while offering better privacy protection. Researchers designing studies should ask if the protocols will actually deliver the levels of accuracy and privacy anticipated.
Funders or others deciding whether a research plan moves forward should also ask about the broader incentive effects of using a particular methodology. Accuracy and privacy achieved by a protocol are public goods and, hence, subject to free-rider problems. To increase the chances of curing a disease, say, every patient wants accurate research, but preferably using other people's data rather than their own. To decrease the chances of linkage or other attacks, every researcher wants all other projects held to high thresholds of privacy protection, but preferably not their own.
Policy-makers reviewing U.S. legislation should also ask about laws like FERPA, HIPAA, and the Privacy Act of 1974, which govern data collection and use by educators, health care providers, and federal officials, respectively (17). Do these actually promote accuracy and privacy, or are they based on outmoded ideas about anonymization and identifiability, for example? Unlike other countries, the United States has no legislation specifically regulating or facilitating the use of personal information by academic researchers.
Critically, society as a whole must also ask about promising and threatening aspects of new information technologies. How well society balances the accuracy and privacy of research protocols will determine whether “big data” allows everyone to benefit from advances in empirical science or benefits only those private interests who hold enormous and growing stores of sensitive information about us all. ■
REFERENCES AND NOTES
1. RDC Research Opportunities, Center for Economic Studies (CES), https://www.census.gov/ces/.
2. L. Einav, J. Levin, Science 346, 1243089 (2014).
3. Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset.
4. L. Sweeney, J. Law Med. Ethics 25, 98 (1997).
5. A. Narayanan, V. Shmatikov, Proc. of IEEE Symp. on Security and Privacy (IEEE, 2008), pp. 111–125.
6. How Unique Are You?, http://aboutmyinfo.org/.
7. C. Dwork, in Privacy, Big Data, and the Public Good, J. Lane et al., Eds. (Cambridge Univ. Press, Cambridge, 2014).
8. E(r) = p/2 + p/4 + (1 − p)/4 = p/2 + 1/4, where r is the fraction reporting “yes” and p is the true proportion; hence p can be estimated as 2r − 1/2.
9. M. Prabhakaran, A. Sahai, Eds., Secure Multi-party Computation (IOS Press, Amsterdam, 2013).
10. C. Gentry, STOC ’09: Proc. of the 41st ACM Symp. on Theory of Computing (ACM, New York, 2009), pp. 169–178.
11. A. López-Alt, E. Tromer, V. Vaikuntanathan, STOC ’12: Proc. of the 44th ACM Symp. on Theory of Computing (ACM, New York, 2012), pp. 1219–1234.
12. C. Dwork, in Automata, Languages and Programming (Springer, New York, 2006), pp. 1–12.
13. C. Dwork, F. McSherry, K. Nissim, A. Smith, Proc. 3rd Theory of Cryptography Conference (TCC) (Springer, New York, 2006), pp. 265–284.
14. A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, L. Vilhuber, Proc. 24th IEEE International Conf. on Data Engineering (ICDE) (IEEE, 2008), pp. 277–286.
15. OnTheMap, http://onthemap.ces.census.gov.
16. L. D. Harmon, B. Julesz, Science 180, 1194 (1973).
17. The Family Educational Rights and Privacy Act (FERPA) and the Health Insurance Portability and Accountability Act (HIPAA).

[Figure caption] Privacy is breached when “secure” data can be linked with publicly available data.