Phonetic Feature Encoding in Human
Superior Temporal Gyrus
Nima Mesgarani,1* Connie Cheung,1 Keith Johnson,2 Edward F. Chang1†
During speech perception, linguistic elements such as consonants and vowels are extracted from
a complex acoustic speech signal. The superior temporal gyrus (STG) participates in high-order
auditory processing of speech, but how it encodes phonetic information is poorly understood. We used
high-density direct cortical surface recordings in humans while they listened to natural, continuous
speech to reveal the STG representation of the entire English phonetic inventory. At single electrodes,
we found response selectivity to distinct phonetic features. Encoding of acoustic properties was
mediated by a distributed population response. Phonetic features could be directly related to tuning for
spectrotemporal acoustic cues, some of which were encoded in a nonlinear fashion or by integration of
multiple cues. These findings demonstrate the acoustic-phonetic representation of speech in human STG.
Phonemes—and the distinctive features composing them—are hypothesized to be the smallest contrastive units that change a word's meaning (e.g., /b/ and /d/ as in bad versus dad) (1). The superior temporal gyrus (Brodmann area 22, STG) has a key role in acoustic-phonetic processing because it responds to speech over other sounds (2) and focal electrical stimulation there selectively interrupts speech discrimination (3). These findings raise fundamental questions about the representation of speech sounds, such as whether local neural encoding is specific for phonemes, acoustic-phonetic features, or low-level spectrotemporal parameters. A major challenge in addressing this in natural speech is that cortical processing of individual speech sounds is extraordinarily spatially discrete and rapid (4–7).
We recorded direct cortical activity from six
human participants implanted with high-density
multielectrode arrays as part of their clinical evaluation for epilepsy surgery (8). These recordings
provide simultaneous high spatial and temporal
resolution while sampling population neural activity from temporal lobe auditory speech cortex.
We analyzed high gamma (75 to 150 Hz) cortical
surface field potentials (9, 10), which correlate
with neuronal spiking (11, 12).
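The high-gamma measure described above can be illustrated with a generic signal-processing sketch: band-pass the field potential to 75 to 150 Hz, take the analytic-signal (Hilbert) amplitude, and z-score the result. This is a minimal illustration of the general technique, not the authors' actual preprocessing pipeline; the filter order and normalization choices here are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def high_gamma_envelope(lfp, fs, band=(75.0, 150.0)):
    """Band-pass a field potential to the high-gamma range, take the
    Hilbert envelope, and z-score it. Generic sketch; filter order and
    normalization are illustrative assumptions."""
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype="band")
    filtered = filtfilt(b, a, lfp)          # zero-phase band-pass
    env = np.abs(hilbert(filtered))         # analytic-signal amplitude
    return (env - env.mean()) / env.std()   # z-scored envelope

# usage: 2 s of synthetic signal with a 100-Hz component, sampled at 400 Hz
fs = 400
t = np.arange(0, 2, 1 / fs)
sig = np.sin(2 * np.pi * 100 * t) + 0.5 * np.random.randn(t.size)
z = high_gamma_envelope(sig, fs)
```

The zero-phase filtering (`filtfilt`) matters here because phase-shifted envelopes would misalign neural responses relative to phoneme onsets in the subsequent analyses.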
Participants listened to natural speech samples featuring a wide range of American English
speakers (500 sentences spoken by 400 people)
(13). Most speech-responsive sites were found in
posterior and middle STG (Fig. 1A, 37 to 102 sites
per participant, comparing speech versus silence,
P < 0.01, t test). Neural responses demonstrated a
distributed spatiotemporal pattern evoked during
listening (Fig. 1, B and C, and figs. S1 and S2).
We segmented the sentences into time-aligned
sequences of phonemes to investigate whether
STG sites show preferential responses. We estimated the mean neural response at each electrode
to every phoneme and found distinct selectiv-
1Department of Neurological Surgery, Department of Physiology, and Center for Integrative Neuroscience, University of
California, San Francisco, CA 94143, USA. 2Department of
Linguistics, University of California, Berkeley, CA 94720, USA.
*Present address: Department of Electrical Engineering,
Columbia University, New York, NY 10027, USA.
†Corresponding author. E-mail: firstname.lastname@example.org
Fig. 1. Human STG cortical selectivity to speech sounds. (A) Magnetic resonance image surface reconstruction of one participant's cerebrum. Electrodes (red) are plotted with opacity signifying the t test value when comparing responses to silence and speech (P < 0.01, t test). (B) Example sentence and its acoustic waveform, spectrogram, and phonetic transcription. (C) Neural responses evoked by the sentence at selected electrodes. z score indicates normalized response. (D) Average responses at five example electrodes to all English phonemes and their PSI vectors.
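The phoneme-aligned averaging described in the text (estimating the mean neural response at each electrode to every phoneme from a time-aligned transcription) can be sketched as follows. The input format (`(phoneme_label, onset_seconds)` pairs) and the fixed post-onset window are hypothetical simplifications for illustration, not the paper's exact procedure.

```python
import numpy as np

def mean_response_per_phoneme(hg, fs, segments, win=0.15):
    """Average a z-scored high-gamma trace over all occurrences of each
    phoneme, using a fixed window after each phoneme onset.
    `segments` is a list of (phoneme_label, onset_seconds) pairs from a
    time-aligned transcription (hypothetical input format)."""
    sums, counts = {}, {}
    n = int(win * fs)                       # window length in samples
    for label, onset in segments:
        start = int(onset * fs)
        if start + n > hg.size:             # skip windows past the end
            continue
        sums[label] = sums.get(label, 0.0) + hg[start:start + n].mean()
        counts[label] = counts.get(label, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}

# usage with a toy trace: a response after the /b/ onset, none after /d/
fs = 100
hg = np.zeros(fs * 2)
hg[50:65] = 1.0
resp = mean_response_per_phoneme(hg, fs, [("b", 0.5), ("d", 1.5)])
```

Averages of this kind, computed per electrode across all phoneme instances, are the kind of summary from which a phoneme-selectivity profile such as the PSI vectors in Fig. 1D could be derived.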