time and sorted by VOT. The first electrode responds to all plosives with approximately the same latency and amplitude, irrespective of VOT. The second electrode responds only to plosive phonemes with short VOTs (voiced), and the third electrode responds primarily to plosives with long VOTs (unvoiced).
To examine the nonlinear relationship between
VOT and response amplitude for voiced-plosive
electrodes (labeled voiced in Fig. 2D) compared
with plosive electrodes with no sensitivity to
voicing feature (labeled coronal, labial and dorsal
in Fig. 2D), we fitted a linear and exponential
function to VOT-response pairs (fig. S11B). The
difference between these two fits specifies the
nonlinearity of this transformation, shown for all
plosive electrodes in Fig. 4C. Voiced-plosive electrodes (pink) all show strong nonlinear bias for
short VOTs compared with all other plosive electrodes (gray). We quantified the degree and direction of this nonlinear bias for these two groups
of plosive electrodes by measuring the average
second-derivative of the curves in Fig. 4C. This
measure maps electrodes with nonlinear preference for short VOTs (e.g., electrode e2 in Fig. 4B)
to negative values and electrodes with nonlinear
preference for long VOTs (e.g., electrode e3 in
Fig. 4B) to positive values. The distribution of this
measure for voiced-plosive electrodes (Fig. 4D,
red distribution) shows significantly greater nonlinear bias compared with the remaining plosive
electrodes (Fig. 4D, gray distribution) (P < 0.001,
Wilcoxon rank-sum test). This suggests a specialized mechanism for spatially distributed, nonlinear rate encoding of VOT and contrasts with
previously described temporal encoding mechanisms (26, 28).
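The analysis above can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the function names, the log-linear method for the exponential fit, and the sign convention of the difference curve are all assumptions.

```python
import numpy as np
from scipy.stats import ranksums

def vot_nonlinearity(vot, resp):
    """Index of how nonlinearly response amplitude varies with VOT.

    Fits a linear and an exponential function to VOT-response pairs,
    evaluates their difference on a fine grid, and returns the mean
    second derivative of that difference curve. The log-linear
    exponential fit assumes strictly positive responses, and the
    orientation of the difference (hence the sign of the index) is an
    illustrative choice.
    """
    vot, resp = np.asarray(vot, float), np.asarray(resp, float)
    a, b = np.polyfit(vot, resp, 1)              # linear fit
    d, log_c = np.polyfit(vot, np.log(resp), 1)  # exponential fit (log-linear)
    grid = np.linspace(vot.min(), vot.max(), 200)
    diff = np.exp(log_c + d * grid) - (a * grid + b)
    # mean second derivative of the difference curve
    return np.mean(np.gradient(np.gradient(diff, grid), grid))

def compare_groups(voiced_indices, other_indices):
    """Wilcoxon rank-sum test between the nonlinearity indices of
    voiced-plosive electrodes and the remaining plosive electrodes."""
    return ranksums(voiced_indices, other_indices)
```

A distribution of `vot_nonlinearity` values per electrode group, compared with `compare_groups`, mirrors the comparison reported for Fig. 4D.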
We performed a similar analysis for fricatives,
measuring duration, which aids the distinction between voiced (/z/ and /v/) and unvoiced fricatives
(/s/, /ʃ/, /θ/, /f/); spectral peak, which differentiates
/f/ and /v/ versus coronal /s/ and /z/ versus dorsal /ʃ/;
and F2 of the following vowel (16) (fig. S12).
These parameters can be decoded reliably from
population responses (Fig. 4A; P < 0.001, t test).
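Such population decoding can be sketched with a simple least-squares linear decoder; this is a minimal illustration under that assumption, and the authors' actual decoding method may differ.

```python
import numpy as np

def fit_linear_decoder(R_train, y_train):
    """Least-squares weights mapping population responses
    (trials x electrodes) to an acoustic parameter such as
    fricative duration or spectral peak."""
    X = np.column_stack([np.ones(len(R_train)), R_train])  # add intercept
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return w

def decode(R, w):
    """Predict the acoustic parameter for held-out trials."""
    X = np.column_stack([np.ones(len(R)), R])
    return X @ w
```

Decoding accuracy would then be assessed on held-out trials, e.g. by correlating predicted and measured parameter values.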
Because plosives and fricatives can be sub-specified by using similar acoustic parameters, we
determined whether the response of electrodes to
these parameters depends on their phonetic category (i.e., fricative or plosive). We compared the
partial correlation values of neural responses with
spectral peak, duration, and F2 onset of fricative
and plosive phonemes (Fig. 4E), where each point
corresponds to an electrode color-coded by its cluster grouping in Fig. 2D. High correlation values
(r = 0.70, 0.87, and 0.79; P < 0.001; t test) suggest that electrodes respond to these acoustic parameters independent of their phonetic context.
The similarity of responses to these isolated acoustic parameters suggests that electrode selectivity
to specific phonetic features (shown with colors
in Fig. 4E) emerges from combined tuning to multiple acoustic parameters that define phonetic contrasts (24, 25).
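The partial-correlation analysis can be sketched as a standard residual-based partial correlation; the helper name and the choice of covariates here are assumptions, not the authors' procedure.

```python
import numpy as np

def partial_corr(x, y, covars):
    """Correlation between x and y after linearly regressing out the
    covariates (e.g. the other acoustic parameters) from both."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    Z = np.column_stack([np.ones(len(x)),
                         np.atleast_2d(covars).reshape(len(x), -1)])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residualize x
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residualize y
    return np.corrcoef(rx, ry)[0, 1]
```

Comparing an electrode's partial correlation with, say, spectral peak across fricative trials versus plosive trials is one way to ask whether its tuning is independent of phonetic category, as in Fig. 4E.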
We have characterized the STG representation
of the entire American English phonetic inventory.
We used direct cortical recordings with high spatial
and temporal resolution to determine how selectivity for phonetic features relates to acoustic spectrotemporal receptive field properties in
STG. We found evidence for both spatially local
and distributed selectivity to perceptually relevant
aspects of speech sounds, which together appear to
give rise to our internal representation of a phoneme.
We found selectivity for some higher-order
acoustic parameters, such as examples of nonlinear, spatial encoding of VOT, which could have
important implications for the categorical representation of this temporal cue. Furthermore, we
observed a joint differential encoding of F1 and
F2 at single cortical sites, suggesting the spectral integration previously posited in theories of combination-sensitive neurons for vowels.
Our results are consistent with previous single-unit recordings in human STG, which have not
demonstrated invariant, local selectivity to single
phonemes (30, 31). Instead, our findings suggest
a multidimensional feature space for encoding the
acoustic parameters of speech sounds (25). Phonetic features defined by distinct acoustic cues for
manner of articulation were the strongest determinants of selectivity, whereas place-of-articulation
cues were less discriminable. This might explain
some patterns of perceptual confusability between
phonemes (32) and is consistent with feature hierarchies organized around acoustic cues (17),
where phoneme similarity space in STG is driven
more by auditory-acoustic properties than articulatory ones (33). A featural representation has greater
universality across languages, minimizes the need
for precise unit boundaries, and accounts for coarticulation and temporal overlap better than phoneme-based models of speech perception (17).
Fig. 4. Neural encoding of plosive and fricative
phonemes. (A) Prediction accuracy of plosive and
fricative acoustic parameters from neural population responses. Error bars indicate SEM. (B) Response
of three example electrodes to all plosive phonemes
sorted by VOT. (C) Nonlinearity of VOT-response
transformation and (D) distributions of nonlinearity
for all plosive-selective electrodes identified in Fig.
2D. Voiced plosive-selective electrodes are shown
in pink, and the rest in gray. (E) Partial correlation
values between response of electrodes and acoustic parameters shared between plosives and fricatives
(**P < 0.01, t test). Dots (electrodes) are color-coded
by their cluster grouping from Fig. 2D.