Flow sorting and DNA amplification (11–14) of
more than 1000 co-occurring Prochlorococcus
cells allowed us to explore the cell-by-cell genomic composition of these wild populations.
We were able to identify coherent subpopulations
at the whole-genome level and their relationship
to those defined by the ITS region, explore finely
resolved diversity patterns within and between
subpopulations, and examine shifting abundances
with seasonal changes in the habitat.
We first examined the population composition by sequencing the ITS regions of hundreds
of Prochlorococcus cells in each sample, revealing the presence of finely resolved clusters within
the broadly defined ecotypes (Fig. 1B). The populations were composed of tens to hundreds of
clusters of nearly identical ITS sequences (>98% similar) within
the coarse-grained ecotypes (Fig. 1, B and C). The
relative abundance of cells belonging to the different clusters changed with season (Fig. 1, A to
C) (15), suggesting shifts in their relative fitness in
response to environmental changes.
To study the fine-scale genomic variation and
compare it with the ITS-defined clusters, we sequenced
the partial genomes (representing, on average,
70% of the total genome) of 90 individual
cells (30 per sample) from the largest nearly
identical ITS cluster, cN2 (Figs. 1C and 2), as
well as 6 cells from two other clusters, cN1 and
c9301. For each time of year, cells were randomly
selected for genome sequencing from within
the major ITS ribotypes (>99% similar) within
cluster cN2 (C1 to C5) (30 cells), as well as from
c9301-C8 and cN1-C9 (one cell each), as detailed
in (15). We used a modified mediator genome
reference assembly approach (15, 18) to
analyze between-cell variation in the partial genomes
recovered. The topologies of the ITS and
genomic trees were highly congruent (Fig. 2),
indicating that ITS sequences can serve as a proxy
for genome sequences in Prochlorococcus at a
much finer level of resolution than previously
demonstrated (4, 19). The genomic data further
revealed that the largest cluster cN2 is divided
into five major clades [C1 to C5 (Fig. 2)] and a few
additional minor clades represented by only one
cell each. The delineation of clades C1 to C5 was
highly robust and also observed in trees constructed
from genomic position subsets (figs. S1 and S2).
To explore the evolutionary forces that shaped
the cN2 C1 to C5 clades, we examined differences
in nucleotide sequences within and between clades.
For example, the C1 and C3 subpopulations (Fig.
2B) differ in 52,885 dimorphic single-nucleotide
polymorphisms (SNPs), which represent 3.2% of
their genomes (Fig. 3A, blue). The dimorphic
SNPs between C1 and C3 are scattered across the
genomes, occurring in 1519 out of 1974 genes
(most of them core genes); 8% of these SNPs are
found in intergenic regions (9% of the genome is
noncoding). Of the intragenic SNPs, 37% are
nonsynonymous, thus affecting the amino acid sequences of the proteins they encode. In contrast to
the scattered nature of the sequence variation between the C1 and C3 clades, the polymorphism
within them is confined to a few regions of the
genome (Fig. 3A, black), indicating that most
regions along the genome are conserved within
clades and differ between them (15). This pattern
holds for all pairwise comparisons among clades C1 to
C5 (figs. S3 and S4).
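The distinction between dimorphic sites and within-clade polymorphism can be made concrete with a minimal sketch. It assumes that a site counts as dimorphic when every covered cell in one clade shares a single base, every covered cell in the other clade shares a single base, and the two bases differ; the exact criteria used in the study are given in (15) and may differ. The function names, the "N" placeholder for uncovered positions, and the toy alignments are all hypothetical.

```python
from typing import List, Optional

MISSING = "N"  # assumed placeholder for positions not recovered in a partial genome


def clade_consensus(column: List[str]) -> Optional[str]:
    """Return the base shared by all covered cells at one site.

    Returns None if the site is polymorphic within the clade or if no cell
    in the clade covers it.
    """
    bases = {base for base in column if base != MISSING}
    return bases.pop() if len(bases) == 1 else None


def dimorphic_sites(clade_a: List[str], clade_b: List[str]) -> List[int]:
    """Return 0-based positions that are uniform within each clade but differ between them."""
    length = len(clade_a[0])
    sites = []
    for i in range(length):
        base_a = clade_consensus([seq[i] for seq in clade_a])
        base_b = clade_consensus([seq[i] for seq in clade_b])
        if base_a is not None and base_b is not None and base_a != base_b:
            sites.append(i)
    return sites


# Toy aligned sequences standing in for two clades (not real data).
c1 = ["ACGTAC", "ACGTNC", "ACGTAC"]
c3 = ["ATGTAA", "ATGTAA", "NTGTAC"]
print(dimorphic_sites(c1, c3))
```

In this toy case only the second position qualifies: the last position is excluded because the second clade is polymorphic there, which is the kind of within-clade variation the text describes as confined to a few genomic regions.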
This emerging pattern was further supported
by a standard measure of genetic differentiation
between populations, the fixation index (FST) (20),
applied at gene-by-gene resolution to the five cN2
clades, C1 to C5 (Fig. 3, B and C). Seventy-five
percent of the core genes had high FST values
(>0.8) (Fig. 3, B and C) (15), meaning that different
clades contained significantly different alleles.
Some of the differentiated core genes have functions involved in the interaction between the cell
and environmental stimuli [e.g., transporters, genes
that affect oxidative stress responses, and cell
surface biosynthesis and modification (Data S1)];
that is, they are not all simply “housekeeping genes”
that control central metabolism. For example, alleles
of phosphoglucosamine mutase, which is involved
in the biosynthesis of outer membrane lipopolysaccharides (21), differ by an average of 10% of their
amino acid sequences (Fig. 3C), with substitutions
in the hydrophilic center of the enzyme (21), possibly affecting its specificity and kinetics.
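A gene-by-gene FST calculation of this kind can be illustrated with a minimal sketch. The estimator actually applied is specified in (15, 20) and is not reproduced here; the version below assumes one common diversity-ratio form, FST = 1 - pi_within / pi_between, where pi is the mean proportion of differing sites between pairs of aligned alleles. The function names, the "N" placeholder for uncovered positions, and the toy alleles are hypothetical.

```python
from itertools import combinations, product
from statistics import mean
from typing import Dict, List


def pairwise_diff(a: str, b: str, missing: str = "N") -> float:
    """Proportion of differing sites among positions covered in both alleles."""
    covered = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not covered:
        raise ValueError("alleles share no covered positions")
    return sum(x != y for x, y in covered) / len(covered)


def gene_fst(clades: Dict[str, List[str]]) -> float:
    """Assumed estimator: 1 - (mean within-clade diversity / mean between-clade diversity)."""
    within = [
        pairwise_diff(a, b)
        for seqs in clades.values()
        for a, b in combinations(seqs, 2)
    ]
    between = [
        pairwise_diff(a, b)
        for (_, seqs_1), (_, seqs_2) in combinations(clades.items(), 2)
        for a, b in product(seqs_1, seqs_2)
    ]
    if not within or not between:
        raise ValueError("need at least two clades and one clade with two or more cells")
    pi_within, pi_between = mean(within), mean(between)
    if pi_between == 0:
        return 0.0  # identical alleles everywhere: no differentiation
    return 1.0 - pi_within / pi_between


# Toy alleles of one gene in two clades (not real data).
fixed = {"C1": ["ACGT", "ACGT"], "C3": ["TCGA", "TCGA"]}
print(gene_fst(fixed))
```

Under this assumed form, a gene whose alleles are uniform within clades and differ between them scores 1, and a gene with the same alleles in every clade scores 0, so a threshold such as >0.8 singles out genes dominated by between-clade differences.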
We next asked whether different clade subpopulations carry distinct sets of flexible genes.
Using de novo assemblies to capture regions unmapped by the reference assemblies (15), we
found that each subpopulation carries a small set
of distinct genes, typically in the form of cassettes
within genomic islands (Table 1). Cassettes containing genes in the glycosyltransferase family
account for much of the gene content variation
between these clade subpopulations (Table 1 and
table S1). The gene content in these cassettes
suggests involvement in outer membrane modifications, possibly affecting phage attachment (22),
recognition by grazers (23), cell-to-cell communication, or interactions with other bacteria (24).
We conclude that these clade subpopulations
have distinct “genomic backbones” (and are
Fig. 2. ITS-rRNA sequence and whole-genome neighbor-joining phylogenetic trees at a fine
resolution of diversity. (A) Phylogenetic tree based on ITS-rRNA sequences of 96 single cells (90 cN2
ribotypes, three cN1 ribotypes, and three c9301 ribotypes), as well as five additional high-light–adapted
cultured strains. (B) Phylogenetic tree of the 96 single cells based on whole-genome sequences. The colored
symbols to the left of the leaf labels in (A) and (B) represent the different clades depicted from the deep
branches observed in the whole-genome tree. The sample origin of each cell is marked with red, blue, and
green squares (representing autumn, winter, and spring, respectively) on the right. Distance units are base
substitutions per site (see scale bar) (15). Bootstrap values <80 are marked as black dots on the internal
nodes in (B) (fig. S1). Cells marked with fall into an ITS clade that differs from the genome-defined clade.
Neighbor-joining trees in (A) and (B) were constructed using p-distance.
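For reference, the p-distance between two aligned sequences is the proportion of compared sites at which they differ. The sketch below builds the kind of pairwise distance matrix a neighbor-joining method takes as input. It assumes that positions missing in either cell are skipped, which is one plausible way to handle partial genomes; the handling used in the study is described in (15). The names and toy sequences are hypothetical.

```python
from typing import Dict


def p_distance(a: str, b: str, missing: str = "N") -> float:
    """Proportion of differing sites among positions covered in both sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue  # assumed: skip sites not recovered in one of the cells
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("sequences share no covered positions")
    return differing / compared


def distance_matrix(seqs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Symmetric matrix of pairwise p-distances, keyed by cell name."""
    names = list(seqs)
    return {m: {n: p_distance(seqs[m], seqs[n]) for n in names} for m in names}


# Toy aligned sequences for three cells (not real data).
cells = {"cell_a": "ACGTACGT", "cell_b": "ACGTACGA", "cell_c": "ANGTTCGA"}
matrix = distance_matrix(cells)
print(matrix["cell_a"]["cell_b"])
```

Because p-distance applies no correction for multiple substitutions at the same site, it is best suited to closely related sequences, where such events are rare.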