Twitter: Big data
IN THEIR POLICY FORUM “The parable
of Google Flu: Traps in big data analysis”
(14 March, p. 1203), D. Lazer et al. remark
upon recent failures of Google Flu Trends
(GFT) and cast these as limitations of big
data analysis in general. However, many
of these limitations have been overcome
by other big data systems. Specifically,
analyses that use Twitter for influenza
surveillance account for the concerns of
replicability, overfitting, construct validity,
granularity, and temporal confounds that
Lazer et al. have identified.
For example, the results of GFT cannot
be replicated because they are based on
proprietary data, whereas Twitter data
are open. A community of researchers has
replicated analyses based on these data
[e.g., (1)]. Changes to the social media
platform itself, such as reengineering the
underlying algorithms, need not adversely
impact replicability as long as these data
Previous articles have correctly
remarked that GFT and keyword-based
systems overestimate influenza prevalence
by conflating signals of influenza awareness
(such as media attention) with signals
of actual infection (2, 3). These signals
are separable on Twitter. Our work (4)
has recently shown that the rate of tweets
indicative of actual influenza infection is
strongly correlated with the U.S. Centers
for Disease Control and Prevention’s Influenza-Like Illness
rates, even though these rates were not
used for system development and despite
focused media attention. Furthermore, our
Twitter evaluations do not suffer from peak
overestimation as does GFT. Our
analysis also explicitly controls for seasonality
and temporal autocorrelation, meaning
that our results directly capture flu trends
and not simply seasonal variations.
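One generic way to guard against a correlation that is inflated by shared seasonality or autocorrelation is to correlate week-to-week changes instead of raw levels. The following is a minimal sketch of that idea using first differencing. It is not the method used in the cited work (4); the function names and the weekly values are hypothetical and invented for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def first_difference(series):
    """Week-to-week changes; removes slow trends shared by both series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def differenced_correlation(tweet_rate, ili_rate):
    """Correlate changes rather than levels (an assumed, simplified control)."""
    return pearson(first_difference(tweet_rate), first_difference(ili_rate))


# Hypothetical weekly rates, NOT real surveillance data.
tweet_rate = [1.0, 1.4, 2.1, 3.0, 2.6, 1.9, 1.2]
ili_rate = [1.1, 1.5, 2.4, 3.2, 2.5, 2.0, 1.4]

raw_r = pearson(tweet_rate, ili_rate)
diff_r = differenced_correlation(tweet_rate, ili_rate)
print(f"raw r = {raw_r:.3f}, differenced r = {diff_r:.3f}")
```

If the differenced correlation stays high, the agreement between the two series is less likely to be a by-product of both simply rising in winter and falling in spring.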
Finally, New York City’s Department of
Health and Mental Hygiene successfully conducted
a blind evaluation of our method
using municipal data, also resulting in a
strong correlation. Our system successfully
demonstrates the ability to understand
the prevalence of flu at local levels.
Concerns that are specific to GFT should
not be overly generalized to other big data
David Andre Broniatowski,1*
Michael J. Paul,2 Mark Dredze2
1Department of Engineering Management and
Systems Engineering, George Washington
University, Washington, DC 20052, USA.
2Department of Computer Science, Johns Hopkins
University, Baltimore, MD 21218, USA.
*Corresponding author. E-mail: broniatowski@gwu.
1. S. Tuarob, C. S. Tucker, M. Salathé, N. Ram, J. Biomed.
Inform. 49, 255 (2014).
2. P. Copeland et al., Proc. Int. Soc. Negl. Trop. Dis. 3 (2013).
3. D. Butler, Nature 494, 155 (2013).
4. D. Broniatowski, M. J. Paul, M. Dredze, PLOS ONE 8,
WE THANK BRONIATOWSKI, Paul, and
Dredze for giving us the opportunity to
reemphasize the potential of big data and
make the more obvious point that not all
big data projects have the problems currently plaguing Google Flu Trends (GFT),
nor are these problems inherent to the
field in general.
Our Policy Forum is meant to provide
a constructive critique by highlighting
possible pitfalls of big data analysis. These
pitfalls are not the same for all big data
sets, but are certainly not unique to GFT.
We do agree that Twitter has substantial
scientific potential and is distinctive in
the public availability of its data. Indeed,
one of us (A.V.) is using Twitter data for
influenza surveillance in the context of the
recent Centers for Disease Control and Prevention (CDC)
“Predict the Flu Challenge” (1).
Twitter data provide an excellent
representation of those who choose to
express an opinion publicly, which can be
of tremendous value for many research
purposes. However, these data may be
manipulated by both the service provider
(such as Google) and the user (such as
companies marketing a product), as we
explain in our Policy Forum. In light of
these trends, whether these data can be
used to represent the entire United States
population remains an open question.
Who uses Twitter and how they use it
have changed markedly over the past several
years. The algorithmic underpinning
of Twitter (which identifies “what’s trending”)
is subject to constant and invisible
tinkering. The system is under constant
attack, with armies of bots ready to produce
content for the highest bidder (2, 3).
The norms of expression on Twitter are
heterogeneous and still rapidly evolving—
who feels the need to publicly express
that they have flu symptoms on Twitter,
and are these predispositions evenly
distributed throughout the population (4)?
Bodnar and Salathé’s cautionary tale (5) on
Twitter-based influenza surveillance clearly
shows that seemingly irrelevant tweets
(such as those about zombies) are moderately
indicative of influenza prevalence,
and that the choice of validation methods
has a large effect on reported success.
It is possible that one day we will have
reliable prediction of flu prevalence from
social media. Certainly, this would require
a careful evaluation and recalibration of
methodologies, public and independent
replication of results, and the explicit
evaluation of error processes. Clearly, not all
big data projects have the same
syndromes as GFT presently does, but by
building strong collaborations and adhering to rigorous standards, we should be
able to extract considerably more information from these highly informative new
David Lazer,1,2 Ryan Kennedy,1,3,4
Gary King,3 Alessandro Vespignani5,6,3
1Lazer Laboratory, Northeastern University, Boston,
MA 02115, USA. 2Harvard Kennedy School, Harvard
University, Cambridge, MA 02138, USA. 3Institute
for Quantitative Social Science, Harvard University,
Cambridge, MA 02138, USA. 4University of Houston,
Houston, TX 77204, USA. 5Laboratory for the
Modeling of Biological and Sociotechnical Systems,
Northeastern University, Boston, MA 02115, USA.
6Institute for Scientific Interchange Foundation,
*Corresponding author. E-mail: firstname.lastname@example.org
1. CDC, CDC Competition Encourages Use of Social Media to
Predict Flu (www.cdc.gov/flu/news/predict-flu-challenge.
of social botnets for spam distribution and digital-influence manipulation” [2013 IEEE Conference on
Communications and Network Security (CNS), 2013].
3. N. Bilton, “Friends, and influence, for sale online” (20 April
Edited by Jennifer Sills