Balancing privacy versus accuracy in research protocols

Restricting data at collection, processing, or release

By Daniel L. Goroff*

Vice President and Program Director, Alfred P. Sloan Foundation, New York, NY 10111, USA. E-mail: gorof@sloan.org. *Opinions or errors are the author's own rather than those of the foundation or its grantees.
Designing protocols for research using personal data entails trade-offs between accuracy and privacy. Any suggestion that would make empirical work less precise, open, representative, or replicable seems contrary to the needs and values of science. A careful reexamination has begun of what “accuracy” or “privacy” should mean and how research plans can balance these objectives.
Attitudes toward research that analyzes personal data should depend both on how well the protocol generates valuable statistics and on how well it protects confidential details. There is always some risk of a leak, so it hardly makes sense to accept that risk for a study incapable of producing valid and robust results. It would also be reassuring to know that the same or better scientific reliability could not be obtained via some other protocol that provides more privacy protection.
PARSING PROTOCOLS. A given research
plan can be assessed by comparing it along
accuracy and privacy dimensions with other
potential protocols. Many purport to deliver
more than they do on either score. Research
on even a simple population statistic—say,
average salary—involves collecting, processing, and releasing data. Various protocols
can introduce obfuscation, or not, at any
combination of these three stages. Eight
examples follow, starting with traditional
methods whose strengths and shortcomings
motivate more recent approaches.
Open data. Suppose a researcher wishes to
study faculty wages. Some U.S. states publish
names, salaries, and other information about
public university employees. There are no
restrictions on data collecting and sampling,
linking and analysis, or release and reuse.
This is the ideal supported by “open data”
advocates. It facilitates accuracy but not confidentiality. People who care about keeping
their pay private need to be aware of such
policies before they decide to take a position.
Data enclaves for federal data. Suppose
a researcher wishes to study U.S. wage and
employment trends more broadly. Academics can apply for access to Research Data
Centers run by the U.S. Census Bureau (1).
Approved researchers are subject to prosecution for misuse of private information
under the same terms as government officials. Computations typically take place in
a data enclave disconnected from the rest of
the world. Papers must be reviewed by the
Census Bureau before they can
be released, mainly to ensure
that information is aggregated
or obfuscated enough to protect individuals’ privacy. This
is akin to how pixelating the
photo of an unfamiliar face
renders it unidentifiable (see
image). Federal enclaves have
produced no known security
breaches and are becoming
less cumbersome to use, but
replication is problematic.
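To make the idea of aggregation concrete, here is a minimal sketch, in Python, of one common disclosure rule: a minimum cell size, under which a group statistic is released only if the group is large enough. The threshold k = 5 and the records are invented for illustration; the Census Bureau's actual review criteria are more involved.

    # Minimum-cell-size rule: release a group's average salary only if the
    # group has at least k members. Threshold and data are hypothetical.
    from collections import defaultdict

    K = 5  # hypothetical minimum cell size

    def releasable_averages(records, k=K):
        """Average salary per group, suppressing cells smaller than k."""
        groups = defaultdict(list)
        for group, salary in records:
            groups[group].append(salary)
        return {g: sum(s) / len(s) for g, s in groups.items() if len(s) >= k}

    records = [("A", 52000), ("A", 61000), ("A", 58000), ("A", 49000),
               ("A", 63000), ("B", 75000), ("B", 80000)]
    print(releasable_averages(records))  # group B (only 2 members) is suppressed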
NDAs for online business data. Suppose a researcher wishes to
study the relation between
salary and other behaviors.
Online companies often ask or
draw inferences about users’
income, usually for unstated
purposes. Researchers who
seek such data rarely gain access without
signing a nondisclosure agreement (NDA)
that gives the company control over what
details may be released. Arrangements like
this usually protect proprietary interests of
businesses rather than privacy interests of
customers. NDAs can also preclude replication of results or reuse of data (2).
Anonymization of administrative data.
Suppose a researcher wishes to study earnings of cab drivers. New York City recently
released “anonymized” data about every
taxi trip taken in 2013. These data were reidentified by exploiting weak encoding and by linking with other publicly available data sets. Not only is it possible to track the earnings of each cabbie by name, but one can also map the GPS coordinates at either end of each ride and even deduce the trip times, fares, and tips of certain celebrities (3).
This joins many other examples of data sets that were released with assurances that they had been scrubbed of any personally identifiable information but were easily linked with other public information to yield private confidences, including health records of Governor Weld (4) and movie rental histories of Netflix users (5). Sweeney even suggests that a vast majority of Americans can be uniquely identified using only zip code, sex, and birthday data (6). So anonymization can reduce accuracy while failing to protect private information against “linkage attacks.” In other words, “sanitizing data doesn’t” and “deidentified data isn’t” (7).
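A linkage attack is easy to demonstrate in miniature. The sketch below, with entirely invented records, joins a hypothetical “anonymized” release to a hypothetical public roster on exactly the quasi-identifiers Sweeney names: zip code, sex, and birthday.

    # Toy linkage attack in the spirit of (6): an "anonymized" release keeps
    # quasi-identifiers that also appear, with names, in a public roster.
    # Joining on those fields re-identifies the records. All data are invented.

    anonymized = [  # names removed, sensitive attribute kept
        {"zip": "10111", "sex": "F", "birthday": "1970-03-14", "salary": 91000},
        {"zip": "02139", "sex": "M", "birthday": "1985-11-02", "salary": 47000},
    ]
    public_roster = [  # e.g., a voter list or staff directory with names
        {"name": "A. Smith", "zip": "10111", "sex": "F", "birthday": "1970-03-14"},
        {"name": "B. Jones", "zip": "02139", "sex": "M", "birthday": "1985-11-02"},
    ]

    def key(r):
        return (r["zip"], r["sex"], r["birthday"])

    names = {key(r): r["name"] for r in public_roster}
    for record in anonymized:
        name = names.get(key(record))
        if name:  # the quasi-identifiers are unique enough to link the two sets
            print(name, "earns", record["salary"])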
Randomized response in survey data. A researcher may want to estimate what percentage of a group lives in poverty and so gives each person a coin to flip, together with these instructions: “If it lands heads, truthfully answer yes or no to the question ‘Is your income below the poverty line?’ If it lands tails, flip again. If the second toss is a head, answer truthfully, but if the second toss is a tail, then lie by giving the answer opposite to what is true.” Twice the fraction of yes responses minus one-half provides a good estimate of the actual fraction. (A truthful answer arrives with probability 3/4 and a lie with probability 1/4, so if p is the true fraction, the chance of a yes is (3/4)p + (1/4)(1 − p) = p/2 + 1/4; solving for p gives the estimator.) Even if you know who answered what, that does not tell you who is impoverished. The usefulness of this technique depends on having lots of participants, all of whom follow instructions. There are variants that provide more efficient estimators, but some accuracy is sacrificed in any case.
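A short simulation makes the estimator concrete. The true poverty rate of 0.3 and the sample size below are arbitrary assumptions for illustration; the responses follow the coin-flip instructions above.

    # Simulation of the coin-flip protocol. Each answer is randomized, yet
    # 2 * (fraction of "yes") - 1/2 recovers the population rate.
    import random

    def respond(is_poor):
        if random.random() < 0.5:   # first flip heads: answer truthfully
            return is_poor
        if random.random() < 0.5:   # tails then heads: answer truthfully
            return is_poor
        return not is_poor          # tails then tails: lie

    true_rate, n = 0.3, 100_000
    answers = [respond(random.random() < true_rate) for _ in range(n)]
    estimate = 2 * (sum(answers) / n) - 0.5
    print(f"true rate {true_rate:.3f}, estimate {estimate:.3f}")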
Secure multiparty computation for reporting sensitive data. Suppose a researcher would like to calculate the average salary of a group, but without anyone ever communicating her own. Say there are three people. Each generates two random numbers and gives one to each of the other two participants. Everyone then adds the two random numbers she generated to her own salary, subtracts the two numbers she was given, and reports the result. All the random numbers cancel when these three results are added, so their sum equals the sum of the salaries. Dividing by three gives the average. Special and more convoluted computations can secretly carry out operations beyond just taking averages.
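The three-party protocol is simple to simulate. In the sketch below the salaries are invented; each report is masked by random numbers that cancel only in the sum.

    # Simulation of the three-party averaging protocol described above.
    import random

    salaries = [52000, 61000, 87000]
    n = len(salaries)

    # masks[i][j] is the random number participant i hands to participant j
    # (diagonal entries are generated but never exchanged or used).
    masks = [[random.randint(-10**9, 10**9) for _ in range(n)] for _ in range(n)]

    reports = []
    for i in range(n):
        handed_out = sum(masks[i][j] for j in range(n) if j != i)
        received = sum(masks[j][i] for j in range(n) if j != i)
        reports.append(salaries[i] + handed_out - received)  # salary stays masked

    # Every mask is added once and subtracted once, so the total is exact.
    assert sum(reports) == sum(salaries)
    print("average salary:", sum(reports) / n)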
Although no individual’s salary was communicated, this protocol does not necessarily keep participants from finding out one another’s personal information. If, for example, all but one collude by using the same random numbers, they can deduce the remaining participant’s salary from her reported value.
Figure: Aggregation and averaging, as in the pixelated image, can hide identities. Linkage with other information, like the familiar portrait of President Lincoln on U.S. currency, can undermine obfuscation. Image from (16).