Funding is provided by U.S. NIH grant R01-AR055309, NIH Skin
Diseases Research Core grant P30-AR057217, and the Edward and
Fannie Gray Hall Center for Human Appearance. M.V.P. is
supported by a pilot grant from the Diabetes and Endocrinology
Research Center (University of Pennsylvania), a Dermatology
Foundation research grant, an Edward Mallinckrodt Jr.
Foundation grant, a Pew Charitable Trust grant, and NIH grants
R01-AR067273 and R01-AR069653. M.A.L. is supported by NIH
grant DK49210, M.I. by NIH grant R01-AR066022, S.E.M. by NIH
grant R37-AR047709 and Penn Skin Biology and Diseases
Resource-based Core grant P30-AR069589, W.S.P. by NIH grant
R01-AI047833, R.K.G. by NIH grant R01-DK104789, T.-L.T. by NIH
grant R01-GM095821, B.A.H. by NIH grant R01-NS05487, R.R. by
California Institute for Regenerative Medicine training grant TG2-
01152, C.F.G.-J. by the NSF Graduate Research Fellowship Program
(DGE-1321846) and a training grant from MBRS-IMSD (Initiative
for Maximizing Student Development; GM055246), X.W. by a
Canadian Institutes of Health Research postdoctoral fellowship
(MFE-123724), J.W.O. by National Research Foundation of Korea grant
2016R1C1B1015211, C.H.L. by the Cutaneous Biology and Skin
Disease training program (T32-AR064184), Y.R.L. by an NIH National
Research Service Award F30 training grant and a Paul and Daisy
Soros Fellowship for New Americans, H.-L.L. by NIH T32 training grant
T32-CA009054-37, and M.S. by American Heart Association
postdoctoral fellowship 16POST26420136. Retn-lacZ mice were
generated with the Transgenic Mouse Core of the University of
Pennsylvania Diabetes Research Center (NIH grant DK19525). We
thank Y. Mishina for providing Bmpr1aflox mice, C.-M. Chuong for
providing K14-Noggin mice, V. Scarfone and C. Tu for their assistance
with fluorescence-activated cell sorting and tissue culture, Z. Yang for
technical assistance, and P. Sterling for reviewing the manuscript.
SMA-CreERT2 mice are available from P.C. under a material transfer
agreement with the University of California, Irvine. P.C. and D.M. are
inventors on patents EP 1 692 936 B1 and US 7112715 B2, held by
GIE-CERBM (Centre Européen de Recherche en Biologie et Médecine),
that cover the method for generating conditional DNA recombination
in mice by using the Cre-ERT2 fusion protein. M.V.P., C.F.G.-J., and
G.C. are co-inventors on a patent application filed through the U.S.
Patent and Trademark Office by the University of Pennsylvania
describing the BMP pathway as a target for promoting neogenic fat
formation, among other claims.
Materials and Methods
Figs. S1 to S24
Tables S1 to S5
24 August 2016; accepted 19 December 2016
Published online 5 January 2017
DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification
Lixin Chen, Pingfang Liu, Thomas C. Evans Jr.,* Laurence M. Ettwiller*
Mutations in somatic cells generate a heterogeneous genomic population and may
result in serious medical conditions. Although cancer is typically associated with somatic
variations, advances in DNA sequencing indicate that cell-specific variants affect a
number of phenotypes and pathologies. Here, we show that mutagenic damage accounts
for the majority of the erroneous identification of variants with low to moderate (1 to 5%)
frequency. More important, we found signatures of damage in most sequencing data sets in
widely used resources, including the 1000 Genomes Project and The Cancer Genome
Atlas, establishing damage as a pervasive cause of sequencing errors. The extent of this
damage directly confounds the determination of somatic variants in these data sets.
Genomic variations in somatic cells can result in disease states, including cancer (1–3). Thus, accurate tumor-associated variant detection, which may help direct personalized treatments, is important for cancer
diagnosis and prognosis. Next-generation sequencing (NGS) has revolutionized variant identification and characterization. Nonetheless, owing
to tumor heterogeneity and/or contamination
by normal cells, somatic cancer variants are often
found at low allelic frequencies (4, 5), confounding their identification.
Detection of low allelic frequency variants is
achieved through deep sequencing and specialized data analysis algorithms that detect variants
in a limited number of reads. Data analysis is
challenged by artifactual errors that display the
same low allelic frequency as cancer mutations,
with the level of artifactual errors defining the
threshold for low allelic variant detection. Most
sequencing errors are thought to result from
polymerase chain reaction mistakes or sequencing miscalls (6). Meanwhile, mutagenic DNA
damage is recognized as a major source of sequencing errors only in specialized samples—for
example, formalin-fixed paraffin-embedded (7),
ancient (8), and circulating tumor DNA (9). Furthermore, another study demonstrated that library
preparation induces oxidative damage (10), raising the possibility that sequencing high-quality
human genomic DNA may also be affected by damage.
We explored this possibility by measuring
damage in sequencing runs. For this, we used
the fact that mutagenic damage leads to a global
imbalance between variants detected in read
1 (R1) and read 2 (R2) in paired-end sequencing (Fig. 1A) (11). The degree of this imbalance
directly correlates with the amount of damage
present in a sample. We devised an analysis
strategy based on this imbalance to deconvolute
both the origin and orientation of variants and
computed a metric, the Global Imbalance Value
(GIV) score, that is indicative of damage (11).
The algorithm produces 12 GIV scores, one
per variant type. Here, a sample with a GIV score above 1.5 is
defined as damaged. At this GIV score, there
are 1.5 times more variants on R1 than on R2,
suggesting that at least one-third of the R1 variants
are erroneous. Undamaged DNA samples have
a GIV score of 1. To experimentally validate the
GIV score and provide an independent damage
quantification, we used human genomic DNA
containing various amounts of 7,8-dihydro-8-
oxoguanine (8-oxo-dG), resulting in G-to-T transversions after amplification (10, 12). We also
treated the damaged DNA with an enzyme cocktail that repairs DNA damage before library
preparation (11, 13). Sequencing the same sample with and without DNA repair enzyme treatment quantified the rate of erroneous variants
specifically introduced by damage. Confirming previous findings (10), the G-to-T transversion frequency varied according to library
preparation conditions (figs. S1 and S2) (11).
Notably, excess G-to-T variants were only observed in R1 sequences, whereas C-to-A variants
were in excess in R2 sequences, leading to a
GIV_G_T score > 1 (Fig. 1B). Repair enzyme treatment abolished this imbalance and reduced
the GIV_G_T score to 1. The GIV score correlated
with the variant excess measured experimentally (fig. S1E), demonstrating that the GIV score
can be used to accurately estimate the extent of
damage in publicly available data sets. We estimated that the GIV score calculation is accurate at >2 million reads (fig. S3B).
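The arithmetic behind the GIV threshold can be illustrated with a minimal sketch. It assumes the score for one variant type is simply the count of that variant in R1 divided by its count in R2, with equal numbers of R1 and R2 bases examined; the exact normalization is given in the supplementary methods (11) and is not reproduced here. The function names, the `G_T`-style labels, and the input layout are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# The 12 single-base substitution types (reference base -> variant base),
# matching the "12 GIV scores, one per variant type" described in the text.
VARIANT_TYPES = [f"{ref}_{alt}" for ref, alt in permutations("ACGT", 2)]

# A GIV score above this value is defined as damaged in the text.
DAMAGE_THRESHOLD = 1.5


def giv_score(r1_count, r2_count):
    """Ratio of variants observed in read 1 to variants observed in read 2.

    Assumption: raw counts are comparable because R1 and R2 cover the
    same number of bases.
    """
    if r2_count <= 0:
        raise ValueError("need at least one R2 variant to form a ratio")
    return r1_count / r2_count


def min_erroneous_fraction(giv):
    """Lower bound on the fraction of R1 variants that are in excess.

    If R1 carries `giv` times as many variants as R2, then (giv - 1)
    out of every `giv` R1 variants have no R2 counterpart.
    Scores at or below 1 show no R1 excess and return 0.
    """
    return max(0.0, (giv - 1.0) / giv)


def summarize(counts):
    """Build a per-variant-type report.

    `counts` maps a variant type to a (r1_count, r2_count) tuple;
    this input format is hypothetical.
    """
    report = {}
    for vtype, (r1, r2) in counts.items():
        giv = giv_score(r1, r2)
        report[vtype] = {
            "giv": giv,
            "damaged": giv > DAMAGE_THRESHOLD,
            "min_erroneous_fraction": min_erroneous_fraction(giv),
        }
    return report
```

For example, `min_erroneous_fraction(1.5)` evaluates to one-third, which matches the statement that a score of 1.5 implies at least one-third erroneous variants, and a balanced sample with equal R1 and R2 counts yields a score of 1 and a fraction of 0.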
To estimate the extent of damage in public
data sets, we determined the GIV scores of individual sequencing runs from the 1000 Genomes
Project (14) and a subset of The Cancer Genome
Atlas (TCGA) data set (11). Both data sets showed
widespread damage, particularly those leading to
752 17 FEBRUARY 2017 • VOL 355 ISSUE 6326
New England Biolabs Inc., 240 County Road, Ipswich, MA
*Corresponding author. Email: firstname.lastname@example.org (T.C.E.), ettwiller@