Speakers are typically aware of the ideas they want to express but not the steps involved in getting from those ideas to a series of motor movements. Consider describing an image with the sentence, “Bees are stinging a man.” The idea that starts the production process contains no order between the semantic representations of BEES, MAN, and STING, just their roles in the event. The BEES are the agents who act, and the MAN is the patient being acted on. The speaker must decide the order in which to express them. The semantic representation of MAN may be chosen to start the sentence because speakers prefer to mention humans before nonhumans despite BEES performing the action. Word order or grammatical marking must indicate that it is the BEES who sting the MAN and not vice versa. The expression of these concepts follows the syntax of the language. In English, the action STING follows the subject of the sentence BEES, whereas it would come at the end of the sentence in Japanese. The verb expressing STING needs to agree in number with the word expressing BEES (“are stinging” vs. “is stinging”). In addition, the speaker must decide on the words to use. For example, the representation of MAN could be expressed as “guy,” “dude,” “victim,” or another context-appropriate word. The speaker must also decide whether to use “a” vs. “the” before the selected nouns. The sounds of the words are retrieved and organized before they can be converted into motor movements. Language production research is concerned with studying these steps.
Much can be learned by examining what happens to a system when it falls apart. The roots of language production research are often attributed to Meringer and Mayer’s analysis of spontaneous speech errors by adult speakers (as cited in Fromkin, 1971). The late 1960s and early 1970s saw a boom in speech error research (e.g., Fromkin, 1971; Garrett, 1975; Nooteboom, 1973), which produced the first influential theories outlining the stages of language production. Scholars also created methods for inducing errors in the lab (Baars et al., 1975). This control over content allowed researchers to manipulate variables that were correlated with errors in spontaneous speech to uncover causation. This work was fundamental in establishing the processing steps and building blocks involved in planning to speak.
Starting in the 1980s, much experimental work concerned how speakers decide the syntactic structure of their sentences [see Psycholinguistics]. Researchers often ask participants to describe line drawings of scenes (e.g., bees stinging a man). Presenting pictures allows researchers to specify a message that constrains what is said across participants without giving them specific words or syntactic structures to use. For example, studies showed that people prefer to mention humans at the start of sentences even if they are not agents (e.g., “A man is stung by bees”; Sridhar, 1988).
As computers became widely available, researchers increasingly measured the time from the onset of a picture of an object (e.g., a bed) on a computer screen to when a participant begins to articulate its name (a.k.a., naming latency). In picture-naming experiments, participants are asked to name objects or actions as quickly and accurately as possible. Such studies find, for example, that common words (e.g., “bed”) are produced faster and more accurately than uncommon ones (e.g., “bell”), a phenomenon known as the word frequency effect (e.g., Oldfield & Wingfield, 1964). Producing object and action names often takes over 900 ms when participants are not repeating or pretrained on their names (e.g., Szekely et al., 2005). Less than 150 ms of this time is typically needed to recognize the familiar objects or actions depicted (Potter, 1975); most of the time is instead devoted to the many steps of word production.
In 1989, Willem Levelt published the seminal book Speaking (Levelt, 1989). In it, he exhaustively reviewed work on language production and established much of the terminology and theoretical framework for the later work described below.
Studies measuring response latencies to produce simple sentences sometimes found that manipulating the difficulty of words at the end of an utterance (as opposed to the beginning) did not delay speech onset (e.g., Schriefers et al., 1998). This supported the argument that fluent sentence production can be incremental, meaning the speaker prepares upcoming parts of an utterance while articulating earlier parts. Around the start of the new millennium, researchers started monitoring speakers’ eye movements as they described pictures (Griffin & Bock, 2000; Meyer et al., 1998). This method goes beyond response latencies by providing direct insight into what happens before and after the articulation of an utterance begins. Although speakers can recognize what is depicted in simple line drawings within tenths of a second without moving their eyes (e.g., Bock et al., 2003), they nonetheless typically gaze at the objects they mention just before saying their names (Griffin & Bock, 2000). For example, when describing bees stinging a man, a speaker looks at the man for about a second before saying “A man.” While articulating “A man is stung by,” the speaker looks at the bees for about a second before saying “bees.” The time spent looking at the objects corresponds to the time needed to name the objects on their own and reflects how difficult it is to retrieve their names (Meyer et al., 1998).
Language production research has historically focused on spoken language. A growing body of research demonstrates many similarities between spoken and signed language production despite the difference in modality (Emmorey, 2023) [see Sign Language]. Acquiring a native spoken or signed language differs from learning a written language in that spoken and signed languages do not need to be explicitly taught. In addition, speaking and signing typically occur with the time pressure of saying something in the moment to an interlocutor without the opportunity for extensive consideration or editing as found in writing. Although writing differs from spoken and signed production in many ways, several basic findings, such as frequency effects, replicate in writing (e.g., Bonin et al., 2002). This article focuses on spoken languages.
Speaking starts with an intention to express something (Fromkin, 1971; Levelt, 1989). The content is first established with macroplanning, which consists of creating a goal (e.g., convincing a friend to watch a particular movie) and the conceptual content associated with it (e.g., pointing out the strength of its reviews, the skill of its lead actor, and the fact that it is only playing this week). Because all of this information cannot be squeezed into the production system at once, the content for an utterance is packaged into a preverbal message during a microplanning stage. A message can correspond to a proposition or two (e.g., that the movie is only playing this week) that is used to generate a sentence or a single concept like naming an object (“chair”).
Messages are shaped by what the speaker thinks the listener knows or can infer (audience design based on common ground; see Yoon & Brown, 2023). For example, messages include information about whether an element has been discussed recently in a conversation, allowing a pronoun like “she” or “it” to be used instead of “the actor” or “the movie.”
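The notion of a preverbal message can be made concrete with a small sketch. The Python snippet below represents the stinging event as a structured object with event roles and discourse status; the class names (Message, Referent) and fields are hypothetical conveniences for illustration, not part of any published model.

```python
from dataclasses import dataclass

@dataclass
class Referent:
    """A conceptual entity in a message; not yet a word."""
    concept: str          # e.g., "MAN"
    human: bool = False   # humans tend to be mentioned early
    given: bool = False   # recently mentioned -> a pronoun becomes possible

@dataclass
class Message:
    """A preverbal message: event roles are specified, but word order is not."""
    action: str
    agent: Referent
    patient: Referent

# The message behind "Bees are stinging a man": roles, but no order or words yet.
sting_event = Message(
    action="STING",
    agent=Referent("BEES", human=False),
    patient=Referent("MAN", human=True),
)

# Speakers often mention the human referent first ("A man is stung by bees").
first_mentioned = max((sting_event.agent, sting_event.patient),
                      key=lambda referent: referent.human)
print(first_mentioned.concept)  # MAN

# Once the man has been mentioned, marking him as given licenses a pronoun ("he").
sting_event.patient.given = True
```

The point of the sketch is that the message specifies who did what to whom and what the listener can be assumed to know, while leaving word choice and word order to later stages.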
Messages need to be converted into sequences of words that express the relationships between the parts of each message. Across languages, there is a tendency to express previously mentioned and human referents early in sentences (e.g., start with “man” rather than “bees”; Sridhar, 1988).
Aside from message content, another influence on word order is the structure of sentences that speakers have recently said or heard. Even when no words overlap between sentences, people are more likely to say a sentence with the same structure as one they have recently heard (Bock, 1986). For example, after hearing a passive sentence like, “A passerby was jostled by a drunk,” in which the patient precedes the agent, a speaker is more likely to describe a picture as “A man is being stung by bees,” which has the same patient-before-agent sequence, than if they had heard the active form, “A drunk jostled a passerby.” This phenomenon has several labels but is usually called syntactic priming or structural persistence. It occurs in natural conversations as well as in the lab. Such priming suggests that there are procedures for forming sentences that are independent of the words used and that are changed by experience.
Speech errors throw light on the structure of word representations. The most common type of word error involves substituting an intended word with one that has a similar meaning in context (“couch” for “chair”; see Dell et al., 1997). The selection of a word based on meaning can go awry while the sounds of the intruding word are correctly retrieved (e.g., substituting “couch” for “chair” but pronouncing “couch” correctly). In phonological errors, the correct word is selected based on meaning, but sound retrieval goes awry (saying “bear” for “chair”). This suggests that there is a step of production primarily concerned with selecting words based on meaning rather than sound and a subsequent one in which sounds are retrieved without much concern for meaning (Fromkin, 1971; Nooteboom, 1973). If retrieval happened in one step, one would not expect errors to be divided as cleanly into semantic and phonological types as they are. Errors can occur at either stage, and in unimpaired speakers, they involve near misses in that the result is semantically or phonologically related to the intended word.
Further support for this separation of word and sound stages comes from the tip-of-the-tongue phenomenon, in which a speaker has a strong sense of knowing which word they want to say but cannot come up with all the sounds (signers experience the analogous tip-of-the-fingers state; see Emmorey, 2023). This occurs most often for proper names (e.g., “Who directed Pink Flamingos?”) and uncommon words (“What do you call a word that is spelled the same forward and backward?”; Burke et al., 1991). People in a tip-of-the-tongue state usually know syntactic information about the missing word such as its grammatical gender in languages that have it (e.g., Vigliocco et al., 1997). So, an Italian speaker failing to retrieve the sounds for the word “pen” would nonetheless know that it has feminine grammatical gender and thus would be preceded by “la” instead of “il.” This word-specific knowledge indicates that speakers in a tip-of-the-tongue state have not just identified the intended concept. People in tip-of-the-tongue states sometimes have partial phonological information about the intended word, such as the first sound and number of syllables (Brown & McNeill, 1966). Providing additional semantic information about the intended word provides little or no benefit, whereas providing sounds is very helpful (Meyer & Bock, 1992). As a result, theories posit a word representation called a lemma, which marks that there is a word in a speaker's vocabulary that expresses a particular meaning and has specific grammatical properties but does not contain information about the sounds of the word (Kempen & Hoenkamp, 1987). Speakers first select a lemma and then retrieve and organize its sounds (phonological encoding).
Many phenomena indicate that the lemmas for semantically related words compete with each other for selection. One way to envision this is that putting a concept into words involves activating a set of semantic features (Fromkin, 1971). For example, the concept CHAIR could be represented with features such as IS FURNITURE, HAS LEGS, IS MANUFACTURED, HOLDS ONE PERSON, and USED FOR SITTING. While some features are relatively specific to chairs, others are shared with other objects such as couches, stools, and tables. The activated features spread activation to corresponding lemmas (e.g., Dell et al., 1997). So, the “chair” lemma will be highly activated but so will lemmas for “couch,” “stool,” and “table.” The most highly activated lemma is selected. Due to noisy activation or residual activation from recent use, another highly active lemma may be selected by mistake. Lemmas that do not share semantic features with “chair” receive no activation from its features and are therefore unlikely to be accidentally selected. Activation of semantically related lemmas increases the time needed to produce a word (Levelt, 1989). For example, producing a semantically related word increases the time to name a picture (e.g., “chair” is slower after “couch” compared to after “lime”; e.g., Damian & Als, 2005).
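One way to picture this account is a toy spreading-activation simulation in which shared semantic features activate lemmas and the most active lemma is selected. The sketch below is only an illustration in the spirit of models like Dell et al. (1997); the feature lists, noise level, and function name are invented.

```python
import random

# Invented semantic features for a few lemmas (illustration only).
LEMMA_FEATURES = {
    "chair": {"IS_FURNITURE", "HAS_LEGS", "HAS_BACK", "HOLDS_ONE", "FOR_SITTING"},
    "couch": {"IS_FURNITURE", "HAS_LEGS", "HAS_BACK", "FOR_SITTING"},
    "stool": {"IS_FURNITURE", "HAS_LEGS", "HOLDS_ONE", "FOR_SITTING"},
    "table": {"IS_FURNITURE", "HAS_LEGS"},
    "lime":  {"IS_FRUIT", "IS_GREEN"},
}

def select_lemma(intended_features, noise_sd=0.4):
    """Activate each lemma in proportion to the features it shares with the
    intended concept, add a little noise, and select the most active lemma."""
    activations = {
        lemma: len(intended_features & features) + random.gauss(0, noise_sd)
        for lemma, features in LEMMA_FEATURES.items()
    }
    return max(activations, key=activations.get)

# Intending CHAIR: "chair" usually wins, but "couch" or "stool" occasionally
# slips in (a semantic substitution error); "lime" essentially never does.
outcomes = [select_lemma(LEMMA_FEATURES["chair"]) for _ in range(1000)]
print({lemma: outcomes.count(lemma) for lemma in LEMMA_FEATURES})
```

Because “couch” and “stool” share most of chair’s features, they occasionally win when noise favors them, mirroring semantic substitutions, whereas the unrelated “lime” receives no feature activation and is essentially never selected.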
One of the biggest influences on picture-naming latencies is a variable called picture name agreement (see also uncertainty and codability; e.g., Szekely et al., 2005). The more words that are considered appropriate names for an object or action, the longer it takes to name it. For example, a picture of a baby might be named “baby” by speakers 99% of the time, even though it could also be called an “infant” or “child.” The picture would be considered to have high name agreement. In contrast, another picture might elicit “couch” 55% of the time and “sofa” 45%. This would constitute medium name agreement, and all else being equal, the picture of the baby would be named faster than the picture of the couch. Although name agreement is calculated by looking at the variety of names produced across speakers, evidence suggests that it reflects the activation of multiple lemmas within a speaker (Balatsou et al., 2022). Name-agreement effects can be accounted for by assuming that semantic features activate the lemmas of low name-agreement pictures less strongly than they activate the lemma of a high-agreement picture.
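Name agreement itself is a simple descriptive statistic, commonly operationalized as the percentage of speakers who produce the modal (most frequent) name for a picture. The response counts below are invented to show the computation; norming studies such as Szekely et al. (2005) report this kind of measure over real participant data.

```python
from collections import Counter

def name_agreement(responses):
    """Return the modal name and the percentage of speakers who produced it."""
    counts = Counter(responses)
    modal_name, modal_count = counts.most_common(1)[0]
    return modal_name, 100 * modal_count / len(responses)

# Hypothetical naming responses for two pictures.
baby_picture = ["baby"] * 99 + ["infant"]        # high name agreement
couch_picture = ["couch"] * 55 + ["sofa"] * 45   # medium name agreement

print(name_agreement(baby_picture))   # ('baby', 99.0)
print(name_agreement(couch_picture))  # ('couch', 55.0)
```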
A morpheme is the smallest unit of language that carries meaning. For example, “bees” contains the morpheme for “bee” and a plural “s,” whereas “babysitting” contains “baby,” “sit,” and “ing” [see Morphology]. Relatively little research has addressed the production of morphologically complex words, and some theories omit them (see Wheeldon & Konopka, 2018). Some morphological processes may intervene between lemma selection and phonological encoding, but we will not further explore them here.
Having selected a lemma, the speaker must retrieve phonological information about the intended word, sometimes termed a lexeme (Kempen & Hoenkamp, 1987). This information includes not only the individual sounds (sometimes described as phonemes or segments) but also how they combine with syllable frames to form syllables. A syllable is a group of sounds that includes a vowel. For example, the words “tack” and “cat” are each one syllable long, and they contain the same segments but in different syllable positions. Theories of production often posit a syllable frame that combines and orders initial consonants, a vowel, and then final consonants (e.g., Dell et al., 1997).
Syllable frames and the speech sounds that make up those syllable frames may be retrieved independently. The evidence for independence comes from the ability to prime a syllable frame when the speech sounds within the syllables differ (Costa & Sebastian-Gallés, 1998). So, all else being equal, “ta-co” would prime “ki-wi” because both have two syllables composed of a consonant and vowel, but “lim-bo” wouldn't prime “ki-wi” because it has an extra consonant (“m”) in its first syllable frame. Other work suggests that syllable frames are filled incrementally, that is, speech sounds are retrieved in the order they will be spoken (Meyer, 1991). Thus, the “l” in “limbo” would be inserted in the syllable frame for “lim” before the “m” is.
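A rough sketch of this frame-and-filler idea: syllable frames are ordered consonant (C) and vowel (V) slots, and segments are inserted left to right. The dictionaries and the fill_frames helper below are hypothetical illustrations of the logic, not the machinery of any particular model (e.g., Dell et al., 1997; Meyer, 1991).

```python
# A syllable frame is an ordered list of slots: onset consonant(s), vowel, coda.
# "limbo" -> CVC + CV; "kiwi" and "taco" -> CV + CV.
FRAMES = {
    "taco":  [["C", "V"], ["C", "V"]],
    "kiwi":  [["C", "V"], ["C", "V"]],
    "limbo": [["C", "V", "C"], ["C", "V"]],
}

SEGMENTS = {
    "taco":  ["t", "a", "k", "o"],
    "kiwi":  ["k", "i", "w", "i"],
    "limbo": ["l", "i", "m", "b", "o"],
}

def fill_frames(word):
    """Insert segments into syllable frames in left-to-right order,
    mirroring the claim that phonological encoding is incremental."""
    segments = iter(SEGMENTS[word])
    filled = []
    for frame in FRAMES[word]:
        filled.append([(slot, next(segments)) for slot in frame])
    return filled

print(fill_frames("limbo"))
# [[('C', 'l'), ('V', 'i'), ('C', 'm')], [('C', 'b'), ('V', 'o')]]

# Frame priming: "taco" and "kiwi" share the same frame shapes; "limbo" does not.
print(FRAMES["taco"] == FRAMES["kiwi"])   # True
print(FRAMES["limbo"] == FRAMES["kiwi"])  # False
```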
The sounds that words share with other words in a person’s vocabulary also influence the speed and success of phonological encoding in several ways (see Wheeldon & Konopka, 2018). For example, words that overlap in sounds with many other words tend to be less vulnerable to speech errors and tip-of-the-tongue states. However, the details of how the retrieval of a word is influenced by similar-sounding words are complicated.
Traditionally, most production researchers have tacitly or explicitly assumed that once abstract speech sounds have been retrieved and ordered as part of phonological encoding, they are converted directly into motor movements such that further investigation is better left to researchers focused on motor movements rather than language. So, ironically, the physical act of speaking is often ignored by those who study the processes that result in speech (see, e.g., Buchwald, 2014; Bürki, 2023). Nevertheless, fine-grained differences in how speech sounds are produced must become explicit at some point. For example, the “p” sounds in “pan,” “span,” and “nap” are articulated slightly differently, although we think of them all as “p.”
Researchers often assume that speakers strive to produce fluent speech (Clark & Clark, 1977). Pausing or saying “um” or “uh” in the middle of an utterance suggests that some aspect of the following utterance has not been planned yet. Such disfluencies are extremely common. For example, disfluencies occur once every 20 words in spontaneous speech (Shriberg, 1994), whereas speech errors occur roughly once every 1,000 words (Deese, 1984). Disfluencies suggest that speakers often plan less than an entire utterance down to the level of speech sounds before articulating the first word. While perfectly fluent speech could result from planning an utterance entirely before speaking, research shows that it may often be due to last-second preparation when the timing works out to have each unit ready when needed (Griffin, 2003). The consensus is that people rarely, if ever, plan entire utterances before beginning to speak even when fluent (e.g., Kempen & Hoenkamp, 1987; Levelt, 1989). Instead, production is incremental, with processing occurring at multiple levels simultaneously. For example, the lemma for an upcoming word is selected while an earlier word is articulated (e.g., selecting “bees” while saying “man”).
Much work has addressed the minimum amount of planning, at different steps of production, that is needed before speech onset for fluent speech. However, variations in methodology make it challenging to establish a minimum or modal degree of planning prior to speech onset because speakers operate under different constraints and with different goals across studies. Research on the scope of planning before speech has addressed issues such as the degree to which speakers prefer to know the gist of an event and select a sentence structure before speaking (e.g., Norcliffe et al., 2015); when verbs are selected (e.g., Schriefers et al., 1998); the time course for encoding nouns in complex noun phrases like, “The acorn that is over the bear…” (Allum & Wheeldon, 2007); whether final morphemes in a word can be encoded before initial ones (e.g., “print” before “re” in “reprint”; Roelofs, 1996); whether articulation of a word can begin before all of its sounds have been activated (e.g., do you need to know that “limbo” ends in “o” before starting to articulate the “l”?; Roelofs, 2002); and so on. It is uncontroversial to say that when speaking fluently, speakers at least select the lemma and initial sound of the first content word in an utterance before speaking; otherwise, they would have nothing to say. The rest is subject to debate.
Language production research focuses on processing in adult speakers who have acquired their language and might be thought to have stable representations of their language with no need for change. However, producing a word or sentence has a long-lasting impact on speaking. For example, repetition priming for producing words can persist for weeks (Cave, 1997). Semantic interference from producing “couch” affects the time and accuracy for producing “chair” over multiple seconds and several intervening, unrelated words (e.g., “flower” and “truck”; e.g., Schnur et al., 2006). Likewise, syntactic priming can be detected over the production of many structurally unrelated sentences (Bock & Griffin, 2000). Whereas earlier models focused on accounting for immediate effects of produced and perceived words (e.g., Levelt et al., 1999), more recent ones have incorporated learning mechanisms to accommodate long-lasting effects (e.g., Oppenheim et al., 2010). Each time someone says a word or uses a syntactic structure, they become faster and more accurate when doing so again. Learning is often conceptualized as strengthening the connections between representations that lead to successful speech, allowing for greater activation to spread between them (e.g., Damian & Als, 2005). For example, each time a lemma is selected, its semantic features become better able to activate it in the future. In addition, some models also incorporate a weakening of connections between activated features and lemmas that are not selected (see Oppenheim et al., 2010). So, after producing “chair,” the overlapping features such as USED FOR SITTING activate “couch” less, making it slightly slower to produce. To account for long-lasting syntactic priming, the mappings between roles in a message (e.g., agent and patient) and the ordering of the phrases expressing them in a sentence are strengthened with use.
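A minimal sketch of such a learning rule, loosely in the spirit of Oppenheim et al. (2010): the connection from an activated feature to the selected lemma is strengthened, and connections to activated but unselected lemmas are weakened. The weight values, learning rates, and function name below are invented for illustration and are not the published model’s parameters.

```python
# Connection weights from one shared feature to two lemmas (invented values).
weights = {("FOR_SITTING", "chair"): 1.0, ("FOR_SITTING", "couch"): 1.0}

def learn(feature, selected, competitors, boost=0.1, penalty=0.05):
    """Strengthen the feature->selected connection; weaken feature->competitor
    connections that were active but lost (the 'dark side' of learning)."""
    weights[(feature, selected)] += boost
    for lemma in competitors:
        weights[(feature, lemma)] = max(0.0, weights[(feature, lemma)] - penalty)

# Saying "chair" makes "chair" easier next time and "couch" slightly harder.
learn("FOR_SITTING", selected="chair", competitors=["couch"])
print(weights)
# {('FOR_SITTING', 'chair'): 1.1, ('FOR_SITTING', 'couch'): 0.95}
```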
Conceptualizing priming effects as the result of slight but persistent strengthening of the mappings between representations allows these phenomena to connect to language acquisition, which is traditionally thought of as learning [see Language Acquisition]. For example, the syntactic priming effects seen in neurotypical adults are argued to rest on fundamentally the same learning mechanisms that underlie other learning (Bock & Griffin, 2000), such as children learning to produce passive sentences (e.g., Kumarage et al., 2022), second language learners acquiring new syntactic structures (e.g., Shin & Christianson, 2012), and the relearning of syntactic structures by people with agrammatic aphasia (e.g., Lee & Man, 2017).
Early models of word production assumed that semantically related words such as “couch” and “sofa” battled for selection, so that the more available “sofa” was, the longer it would take to select “couch” (Levelt et al., 1999). Effectively, everything was delayed until the single best word was selected. More recent approaches and data argue for a “good enough” approach to production, in which speakers will produce the first available word that adequately expresses their intention (Goldberg & Ferreira, 2022; Oppenheim et al., 2010). So, the selection of a lemma or syntactic structure is more of a horse race in which the strength of the second horse does not slow the first horse, rather than a battle to the death in which stronger competitors delay the victory of the winner.
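The difference between the two views can be caricatured with two toy selection-time rules: a competitive rule in which selection time depends on the target’s activation relative to its competitors, and a race rule in which only the target’s own activation matters. The equations and parameter values below are illustrative assumptions, not those of either published model.

```python
def competitive_time(target, others, scale=1000):
    """Selection time grows as competitors become more active
    (proportional to the inverse of the target's relative activation)."""
    return scale * sum([target, *others]) / target

def race_time(target, others, threshold=2.0, scale=1000):
    """Selection time depends only on how quickly the target itself reaches
    threshold; 'others' is deliberately ignored (competitors do not slow the winner)."""
    return scale * threshold / target

weak_rival, strong_rival = 0.2, 0.9

# Competitive rule: a strong "sofa" slows the selection of "couch".
print(competitive_time(1.0, [weak_rival]))    # 1200.0
print(competitive_time(1.0, [strong_rival]))  # 1900.0

# Race rule: "couch" is selected in the same time either way.
print(race_time(1.0, [weak_rival]))           # 2000.0
print(race_time(1.0, [strong_rival]))         # 2000.0
```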
The default for psycholinguistic studies is that participants self-identify as native speakers of the tested language (e.g., Dell et al., 1997; Meyer, 1991; Roelofs, 1996). The majority of people, particularly those outside the United States, can carry out conversations in at least two languages (e.g., Byers-Heinlein, 2019). Many of the studies discussed here were carried out with bilingual participants, ignoring the potential influence of another language. However, there is a vibrant subarea of psycholinguistics concerning how the knowledge of two or more languages influences production. The primary finding is that a speaker’s first-learned or dominant language influences processing in their other language, and a second or nondominant language also influences processing of the first language (see Bailey et al., 2024).
Cognates are words that share sounds and meaning across languages such as the English “guitar” and Spanish “guitarra.” Bilinguals are faster to produce cognates than noncognates (translation equivalents that are not similar in form, e.g., English “dog” and Spanish “perro”; Costa et al., 2000). Cognates are also less likely to trigger tip-of-the-tongue states than matched noncognates (Gollan & Acenas, 2004). This may be attributed to lemmas in both languages becoming active and providing converging activation to overlapping sound representations (Costa et al., 2000).
Syntactic priming occurs across languages (see van Gompel & Arai, 2018). For example, the German sentence, “Der Musiker verkaufte dem Agenten etwas Kokain” (“The musician sold the agent some cocaine”), would prime speakers to say, “A woman is giving a girl a lunchbox,” rather than “A woman is giving a lunchbox to a girl.” Cross-language syntactic priming has been replicated across many pairs of languages, including typologically unrelated ones. It is often considered evidence for shared syntactic representations.
English and Dutch were the predominant languages used in language production research in the 20th century (the Max Planck Institute for Psycholinguistics officially opened in the Netherlands in 1980). As of 2009, only approximately 30 out of the world’s 5,000 languages had been explored in laboratory experiments on sentence production (Jaeger & Norcliffe, 2009). Although some linguists argue for many universal tendencies across languages, there is great variability among them along many dimensions that a priori should affect processing (see Evans & Levinson, 2009). For example, languages differ in their basic word order. In English, a typical word order like “A girl ate pasta” has the order of subject, verb, object. In contrast, about 13% of languages place the main verb (“ate”) at the start of the sentence, and over 56% put the verb after the subject and object (Hammarström, 2016). Work on such languages has been important in testing the extent of incrementality in sentence production and influences on the choice of sentence structure (e.g., Momma et al., 2015; Norcliffe et al., 2015).
The search for the basic building blocks used in sentence production presents a strong example of the hazard of making universal claims based on a limited set of languages. Logically, we form sentences by combining words in novel ways, so words or lemmas are building blocks in speaking. In addition, morphemes (such as “kick” and “-ing” in “kicking”) are the smallest units of meaning in a language and combine to form words. Speech error patterns reinforce the intuition that these are the units combined in speaking. For example, words accidentally exchange with other words in sentences (e.g., “She ate a park in the sandwich”). Morphemes can be recombined by accident, as in saying, “I’m not in the read for mooding,” instead of “mood for reading,” in which the morphemes “mood” and “read” exchange while the progressive “-ing” remains in place. Such errors are attested in languages that are unrelated to Indo-European ones (Wells-Jensen, 2007). These observations suggest that universal steps in production involve selecting and placing words and morphemes in sequences.
Representations of single sounds are implicated as building blocks by common speech errors, such as saying “what the bell” instead of “what the hell,” in Indo-European languages such as English, Dutch, and Spanish, and they have therefore also been considered universal units of sentence production (e.g., Bock, 1991). However, as research has expanded beyond Indo-European languages, the question of sublexical units has become more complicated. For example, errors are more likely to involve whole syllables than individual sounds in Mandarin Chinese (see Alderete & O’Séaghdha, 2022). Likewise, Japanese speech errors tend to involve phonological units that are bigger than a single sound (morae). English has about 9,000 different syllables, and Dutch has at least 12,000, whereas Mandarin has 400 syllables (1,200 including tones), and Japanese only has 100 morae. These cross-linguistic differences in error patterns and vocabulary suggest that unit-like behavior at the level of phonological encoding is influenced by how frequently sound sequences are combined in the vocabularies of different languages. In other words, words tend to break apart in errors at points where the following sound is not very predictable (low transitional probability), and this predictability varies across languages [see Statistical Learning]. Fortunately, researchers are increasingly studying production in typologically different languages, allowing true universal processing principles to be discovered (e.g., Wells-Jensen, 2007; Norcliffe et al., 2015; Sauppe, 2017).
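Transitional probability here is simply the conditional probability of the next sound given the current one, estimated over a language’s vocabulary (real analyses would use phonological transcriptions and usage frequencies). The tiny “vocabulary” below, with letters standing in for sounds, is invented to show the computation.

```python
from collections import Counter, defaultdict

# A toy vocabulary of words as sequences of sounds (letters as stand-ins).
vocabulary = ["hat", "hell", "hill", "ham", "bat", "bell"]

# Count how often each sound is followed by each other sound.
continuations = defaultdict(Counter)
for word in vocabulary:
    for current, following in zip(word, word[1:]):
        continuations[current][following] += 1

def transitional_probability(current, following):
    """P(following sound | current sound) across the vocabulary."""
    total = sum(continuations[current].values())
    return continuations[current][following] / total if total else 0.0

# "h" is followed by several different vowels, so each continuation is
# relatively unpredictable; "e" is always followed by "l" in this toy set.
print(transitional_probability("h", "e"))  # 0.25
print(transitional_probability("e", "l"))  # 1.0
```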
To speak is to act. Although language can be seen as an unparalleled system of combining symbols of many levels of abstraction, it shares many characteristics with producing other motor actions (Lashley, 1951). This is most easily seen in comparisons with playing music, which also involves sequencing movements with a complex hierarchical structure. Ordering errors across the two domains share many similarities. In addition, the timing of executing one action (articulating a word or flexing an arm) while preparing the next is similar in many ways (Griffin, 2003).
Baddeley's (1992) model of working memory (which is responsible for the short-term storage and manipulation of information) includes a phonological loop for rehearsing verbal material such as word lists [see Working Memory]. Psycholinguists have often considered how working memory capacity interacts with language production processes (see Schwering & MacDonald, 2020).
Historically, neuroimaging and electrophysiological measures have not played a large role in language production research because the movements required for speaking disrupted measurement. Also, the research questions often differed, with imaging typically focusing on localization rather than understanding processes [see Neuroscience of Language]. However, researchers are increasingly finding ways to circumvent the limitations of neuroimaging techniques and relate brain activity to production processes. Such studies, for example, have found very early phonological effects in word production, supporting the hypothesis that phonological encoding can begin before lemma selection is complete or that the stages are less distinct than typically assumed (Strijkers et al., 2017).
I am indebted to Maria Bedny, Michael Frank, Bill McNeill, Lauretta Reeves, and Bryan Register for comments on drafts of this text and Eri Pilon for copyediting.
Buchwald, A. (2014). Phonetic processing. In M. Goldrick, V. S. Ferreira, & M. Miozzo (Eds.), The Oxford handbook of language production (pp. 245–258). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199735471.001.0001
Goldrick, M. (2014). Phonological processing: The retrieval and encoding of word form information in speech production. In M. Goldrick, V. S. Ferreira, & M. Miozzo (Eds.), The Oxford handbook of language production (pp. 228–244). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199735471.001.0001
Oppenheim, G. M., Dell, G. S., & Schwartz, M. F. (2010). The dark side of incremental learning: A model of cumulative semantic interference during lexical access in speech production. Cognition, 114(2), 227–252. https://doi.org/10.1016/j.cognition.2009.09.007
Wheeldon, L. R., & Konopka, A. (2023). Grammatical encoding for speech production. Cambridge University Press. https://doi.org/10.1017/9781009264518