Transparency, Granularity, and All-Data
The GFT parable is an important case study
from which we can learn critical lessons as we
move forward in the age of big data analysis.
Transparency and Replicability. Replication is a growing concern across the academy. The supporting materials for the GFT-related papers did not meet emerging community standards: the core search
terms were not identified, and the larger search corpus was not provided. Google cannot make its
full arsenal of data available to outsiders, nor
would doing so be ethically acceptable, given privacy
issues. However, there is no such constraint
regarding the derivative, aggregated data.
Even if one had access to all of Google’s data,
it would be impossible to replicate the analyses of the original paper from the information
provided regarding the analysis. Although it is
laudable that Google developed Google Correlate ostensibly from the concept used for
GFT, the public technology cannot be utilized
to replicate their findings. Clicking the link
titled “match the pattern of actual flu activity
(this is how we built Google Flu Trends!)” will
not, ironically, produce a replication of the
GFT search terms (14). Oddly, the few search
terms offered in the papers (14) do not seem
to be strongly related to either GFT or the
CDC data (SM); we surmise that the authors
felt an unarticulated need to cloak the actual
search terms identified.
What is at stake is twofold. First, science
is a cumulative endeavor, and to stand on the
shoulders of giants requires that scientists
be able to continually assess work on which
they are building (25). Second, accumulation of knowledge requires fuel in the form of
data. There is a network of researchers waiting to improve the value of big data projects
and to squeeze more actionable information
out of these types of data. The initial vision
regarding GFT—that producing a more accurate picture of the current prevalence of contagious diseases might allow for life-saving
interventions—is fundamentally correct, and
all analyses suggest that there is indeed valuable signal to be extracted.
Google is a business, but it also holds in
trust data on the desires, thoughts, and
connections of humanity. Making money
“without doing evil” (paraphrasing Google’s
motto) is not enough when it is feasible to do
so much good. It is also incumbent upon academia to build institutional models to facilitate collaborations with such big data projects—something that is too often missing
now in universities (26).
Use Big Data to Understand the Unknown.
Because a simple lagged model for flu prevalence
will perform so well, there is little room
for improvement on the CDC data for model
projections [this does not apply to other
methods to directly measure flu prevalence,
e.g., (20, 27, 28)]. If you are 90% of the way
there, at most, you can gain that last 10%.
What is more valuable is to understand the
prevalence of flu at very local levels, which is
not practical for the CDC to widely produce,
but which, in principle, more finely granular
measures of GFT could provide. Such a finely
granular view, in turn, would provide powerful
input into generative models of flu propagation
and more accurate prediction of the flu
months ahead of time (29–33).
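The point about limited headroom can be illustrated with a minimal sketch. It assumes made-up weekly data with an annual cycle (not CDC or GFT data), a hypothetical two-week reporting lag, and a one-variable least-squares fit; all function and variable names are illustrative only. Whatever share of variance the lagged baseline already explains out of sample caps what any additional signal could add.

```python
import math
import random


def make_synthetic_ili(n_weeks=260, seed=0):
    """Made-up weekly flu-like series: annual cycle plus small noise."""
    rng = random.Random(seed)
    return [
        2.0 + 1.5 * math.sin(2 * math.pi * t / 52.0) + rng.gauss(0, 0.1)
        for t in range(n_weeks)
    ]


def fit_lagged(series, lag):
    """Ordinary least squares for y[t] = a + b * y[t - lag]."""
    x = series[:-lag]
    y = series[lag:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(series, lag, a, b):
    """Share of variance in y[t] explained by the lagged prediction."""
    x = series[:-lag]
    y = series[lag:]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


LAG_WEEKS = 2  # assumed reporting delay, chosen for illustration

series = make_synthetic_ili()
train, test = series[:208], series[208:]  # four seasons to fit, one held out

a, b = fit_lagged(train, LAG_WEEKS)
r2_test = r_squared(test, LAG_WEEKS, a, b)
headroom = 1.0 - r2_test  # most that any extra signal could still explain

print(f"lag coefficient b = {b:.3f}")
print(f"held-out R^2 of lagged baseline = {r2_test:.3f}")
print(f"remaining headroom = {headroom:.3f}")
```

On strongly seasonal data such as this, the lagged term alone tends to leave only a small unexplained share, which is the headroom a search-based signal would have to compete for.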
Study the Algorithm. Twitter, Facebook,
Google, and the Internet more generally are
constantly changing because of the actions
of millions of engineers and consumers.
Researchers need a better understanding of
how these changes occur over time. Scientists
need to replicate findings using these
data sources across time and using other data
sources to ensure that they are observing
robust patterns and not evanescent trends. For
example, it is eminently feasible to do controlled
experiments with Google, e.g., looking
at how Google search results will differ based
on location and past searches (34). More generally,
studying the evolution of socio-technical
systems embedded in our societies is
intrinsically important.
The algorithms underlying Google, Twitter,
and Facebook help determine what we find
out about our health, politics, and friends.
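A controlled comparison of the kind described above might be scored as follows. This is a minimal sketch, not the procedure of (34): the result lists are invented placeholders, and the Jaccard index is one assumed choice of overlap measure. Two identical control queries give a noise baseline, and any extra difference in the treatment query is attributed to the varied factor (here, location).

```python
def jaccard(results_a, results_b):
    """Jaccard index of two result lists, ignoring rank order."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical result lists for one query (placeholder identifiers, not real URLs).
control_a = ["r1", "r2", "r3", "r4"]   # baseline account
control_b = ["r1", "r2", "r3", "r7"]   # identical setup, queried at the same time
treatment = ["r1", "r2", "r5", "r6"]   # same setup except for location

# Difference between identical setups estimates noise unrelated to the treatment.
noise = 1.0 - jaccard(control_a, control_b)

# Difference between control and treatment mixes noise with the treatment effect.
observed = 1.0 - jaccard(control_a, treatment)

# What remains after subtracting the noise baseline is attributed to location.
attributable = observed - noise

print(f"noise baseline = {noise:.3f}")
print(f"observed difference = {observed:.3f}")
print(f"attributable to treatment = {attributable:.3f}")
```

Repeating the same comparison at several points in time would also show whether the measured effect is stable or drifts as the underlying system changes.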
It’s Not Just About Size of the Data. There
is a tendency for big data research and more
traditional applied statistics to live in two
different realms—aware of each other’s
existence but generally not very trusting of
each other. Big data offer enormous possibilities for understanding human interactions at a societal scale, with rich spatial and
temporal dynamics, and for detecting complex interactions and nonlinearities among
variables. We contend that these are the most
exciting frontiers in studying human behavior. However, traditional “small data” often
offer information that is not contained (or
containable) in big data, and the very factors
that have enabled big data are enabling more
traditional data collection. The Internet has
opened the way for improving standard surveys, experiments, and health reporting
(35). Instead of focusing on a “big data revolution,” perhaps it is time we were focused
on an “all data revolution,” where we recognize that the critical change in the world has
been innovative analytics, using data from
all traditional and new sources, and providing a deeper, clearer understanding of our world.
References and Notes
1. D. Butler, Nature 494, 155 (2013).
2. D. R. Olson et al., PLOS Comput. Biol. 9, e1003256
3. A. McAfee, E. Brynjolfsson, Harv. Bus. Rev. 90, 60 (2012).
4. S. Goel et al., Proc. Natl. Acad. Sci. U.S.A. 107, 17486
5. A. Tumasjan et al., in Proceedings of the 4th International
AAAI Conference on Weblogs and Social Media, Atlanta,
Georgia, 11 to 15 July 2010 (Association for Advancement
of Artificial Intelligence, 2010), pp. 178.
6. J. Bollen et al., J. Comput. Sci. 2, 1 (2011).
7. F. Ciulla et al., EPJ Data Sci. 1, 8 (2012).
8. P. T. Metaxas et al., in Proceedings of PASSAT—IEEE
Third International Conference on Social Computing,
Boston, MA, 9 to 11 October 2011 (IEEE, 2011), pp. 165;
9. D. Lazer et al., Science 323, 721 (2009).
10. A. Vespignani, Science 325, 425 (2009).
11. G. King, Science 331, 719 (2011).
12. D. Boyd, K. Crawford, Inform. Commun. Soc. 15, 662
13. J. Ginsberg et al., Nature 457, 1012 (2009).
14. S. Cook et al., PLOS ONE 6, e23610 (2011).
15. P. Copeland et al., Int. Soc. Negl. Trop. Dis. 2013, 3
16. C. Viboud et al., Am. J. Epidemiol. 158, 996 (2003).
17. W. W. Thompson et al., J. Infect. Dis. 194 (Suppl. 2),
18. I. M. Hall et al., Epidemiol. Infect. 135, 372 (2007).
19. J. B. S. Ong et al., PLOS ONE 5, e10036 (2010).
20. J. R. Ortiz et al., PLOS ONE 6, e18687 (2011).
21. Organizing lists of related searches, Google; http://
22. Improving health searches, because your health matters, Google; http://insidesearch.blogspot.com/2012/02/
23. E. Mustafaraj, P. Metaxas, in Proceedings of the WebSci10,
Raleigh, NC, 26 and 27 April 2010 (Web Science
Trust, 2010); http://journal.webscience.org/317/.
24. J. Ratkiewicz et al., in Proceedings of 5th International
AAAI Conference on Weblogs and Social Media, San
Francisco, CA, 7 to 11 August 2011 (AAAI, 2011),
25. G. King, PS Polit. Sci. Polit. 28, 443 (1995).
26. P. Voosen, Chronicle of Higher Education, 13 September
27. R. Lazarus et al., BMC Public Health 1, 9 (2001).
28. R. Chunara et al., Online J. Public Health Inform. 5, e133
29. D. Balcan et al., Proc. Natl. Acad. Sci. U.S.A. 106, 21484
30. D. L. Chao et al., PLOS Comput. Biol. 6, e1000656
31. J. Shaman, A. Karspeck, Proc. Natl. Acad. Sci. U.S.A. 109,
32. J. Shaman et al., Nat. Commun. 4, 2837 (2013).
33. E. O. Nsoesie et al., PLOS ONE 8, e67164 (2013).
34. A. Hannak et al., in Proceedings of the 22nd International World Wide Web Conference, Rio de Janeiro, 13
to 17 May 2013 (Association for Computing Machinery,
New York, 2013), pp. 527–538.
35. A. J. Berinsky et al., Polit. Anal. 20, 351–368 (2012).
Acknowledgments: This research was funded, in part, by
NSF grant no. 1125095, Army Research Office (ARO) grant
no. W911NF-12-1-0556, and, in part, by the Intelligence
Advanced Research Projects Activity (IARPA) via Department
of Interior National Business Center (DoI/NBC) contract
D12PC00285. The views and conclusions contained herein are
those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either
expressed or implied, of the NSF, ARO, IARPA, DoI/NBC, or the
U.S. government. See SM for data and methods.