lineages that may come from multiple archaic
ancestors (Fig. 1A), allowing for the recovery of
To identify surviving Neandertal lineages, we
developed a two-stage computational strategy (fig.
S3) (10). First, we identify candidate introgressed
sequences by using an extension of a previously
developed summary statistic referred to as S* (11),
which is sensitive to the signatures of introgres-
sion (Fig. 1B) and is calculated without using
the Neandertal reference genome. We performed
coalescent simulations for a wide variety of de-
mographic scenarios and found that our imple-
mentation of S* can distinguish introgressed from
nonintrogressed sequences (Fig. 1C and fig. S4).
Second, we refine the set of candidate introgressed
sequences using an orthogonal approach by com-
paring them to the Neandertal reference genome
and testing whether they match significantly more
than expected by chance (10). We estimate that
the use of S* alone, as compared to our two-stage
approach, would recover ~30% of Neandertal
lineages at a false discovery rate (FDR) = 20%
(fig. S5) (10).
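
The two-stage logic described above can be sketched as a pair of filters. This is a minimal illustration, not the authors' implementation: the `Window` record, the function name, and the fixed `match_alpha` cutoff are hypothetical (the text controls the second stage through an FDR, with details in the supplement), and the P values are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One genomic window with the two P values used by the two stages."""
    name: str
    s_star_p: float       # stage 1: S* P value (no Neandertal genome used)
    neand_match_p: float  # stage 2: Neandertal-match P value


def two_stage_filter(windows, s_star_alpha=0.01, match_alpha=0.05):
    """Keep windows that pass both stages.

    s_star_alpha follows the P <= 0.01 threshold quoted in the text;
    match_alpha is a hypothetical stand-in for the FDR-based cutoff.
    """
    # Stage 1: candidate introgressed sequences from S* alone.
    candidates = [w for w in windows if w.s_star_p <= s_star_alpha]
    # Stage 2: refine candidates by their match to the Neandertal reference.
    return [w for w in candidates if w.neand_match_p <= match_alpha]
```

Because the second stage only ever sees windows that passed the first, the Neandertal reference genome is used for refinement and never for discovery.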
We applied this framework to whole-genome
sequences from 379 Europeans and 286 East
Asians from the 1000 Genomes Project (table
S1) (12). Specifically, we calculated S* in 50-kb
sliding windows (tables S2 to S8) (10) and used
a computationally efficient approach to determine
statistical significance through coalescent simulations
(fig. S6) (10). At an S* threshold corresponding
to P ≤ 0.01, we identified ~40 Gb of
candidate introgressed sequence. Note that S* P
values are robust to demographic uncertainty (fig.
S7). The distribution of Neandertal-match P val-
ues for this set of candidate introgressed sequences
(Fig. 1D) demonstrates a strong skew toward zero,
consistent with the hypothesis that these sequences
are strongly enriched for Neandertal lineages.
The distribution of Neandertal-match P values
for sequences that do not possess significant evi-
dence of introgression, as revealed by S*, is ap-
proximately uniform (Fig. 1D) (10), indicating
that our statistical approach is able to distinguish
between introgressed and nonintrogressed lineages
(fig. S8) (10).
At FDR = 5%, we identified more than 15 Gb
of introgressed sequence across all individuals, spanning ~20% (600 Mb) of the Neandertal
genome (Fig. 1E and table S9). Of the 600 Mb
of distinct sequence, ~25% (149 Mb) was shared
between Europeans and East Asians. On average, we found 23 Mb of introgressed sequence
per individual (Fig. 1F), with East Asian individuals inheriting 21% more Neandertal sequence
than Europeans. Within subpopulations, we found
small but statistically significant variation in the
amount of introgressed sequence among Europeans (Kruskal-Wallis rank sum test, P = 4.2 ×
10⁻¹²), but not among East Asians (P = 0.43).
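
For readers unfamiliar with the Kruskal-Wallis rank sum test, the following minimal sketch computes its H statistic in plain Python. The function name is hypothetical and any input values are illustrative only, not data from this study. Under the null hypothesis, H is approximately chi-square distributed with (number of groups − 1) degrees of freedom, which is how a P value would be obtained.

```python
def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic.

    groups: list of lists of per-individual values, one list per
    subpopulation (at least two distinct values overall).
    """
    # Pool all observations, remembering which group each came from.
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    h = (12 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    correction = 1 - tie_term / (n**3 - n)
    return h / correction
```

Because the test works on ranks, it makes no assumption that the per-individual amounts are normally distributed.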
The average length of introgressed haplotypes
was ~57 kb (Fig. 2A), and ~26% of all protein-coding genes had one or more exons that overlapped a Neandertal sequence (Fig. 2B). On a
broad scale, the genomic distribution of Nean-
Fig. 1. Recovering Neandertal lineages from the DNA of modern humans.
(A) Schematic representation illustrating that low levels of introgression may
facilitate the recovery of substantial amounts of archaic sequence. Lines rep-
resent DNA from contemporary individuals, and colored boxes indicate archaic
sequences. Different colored boxes represent sequences inherited from distinct
archaic ancestors. (B) Genealogies of loci in Europeans and Africans in the
presence of introgression. The expected signature of an introgressed lineage
(blue) that our method exploits is high levels of divergence that persists over
relatively long haplotype blocks. (C) Receiver operating characteristic curve (red) illustrating
the performance of S* for detecting an introgressed sequence in simulated
data (10). The black diagonal dashed line represents random predictions.
(D) Distribution of P values testing for an enrichment of Neandertal variants
for S* candidate and randomly selected regions. (E) Amount of Neandertal
sequence recovered as a function of FDR. The inset Venn diagram shows the
amount of sequence overlap between East Asians (ASN) and Europeans (EUR)
at an FDR of 5%. (F) Violin plots showing the distribution of the amount of
introgressed sequence identified per individual for East Asian and European
populations (population abbreviations are described in table S1).