A large language model (LLM) is a computational system, typically a deep neural network with a large number of tunable parameters (i.e., weights), that implements a mathematical function called a language model. A language model (LM), in its most general form, is a probability distribution over possible sequences of words and other elements in a language. For example, given a sequence w1, w2, ..., wm, an LM gives the probability P of that sequence: P(w1, w2, ..., wm). In a natural language context, sequence elements—referred to as tokens—can be words, word parts, punctuation, or other symbols. In practice, LMs can be used to predict unseen tokens in a sequence. For example, given sequence w1, w2, ... wm–1, the probability that w is the mth token in this sequence can be computed as the conditional probability P(w | w1, w2, ..., wm–1). LMs have been used extensively in many areas of natural language processing, ranging from speech recognition and translation to text generation and chatbots. The neural networks underlying LLMs are trained using broad collections of text typically obtained from websites, digitized books, and other digital resources.
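As a concrete illustration of this definition, the following minimal Python sketch treats a language model simply as a function from a preceding token sequence to a conditional distribution over the next token; the tiny probability table and all values in it are invented purely for illustration and do not correspond to any particular system.

```python
# A toy "language model": a lookup from a context (the preceding tokens)
# to a probability distribution over the next token. All values are invented.
toy_lm = {
    ("the",): {"cat": 0.4, "dog": 0.4, "mat": 0.2},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def next_token_probs(context):
    """Return the conditional distribution P(w | w1, ..., wm-1) for this toy model."""
    return toy_lm.get(tuple(context), {})

# In this toy model, P(sat | "the cat") = 0.7
print(next_token_probs(["the", "cat"])["sat"])
```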
A central concept in language modeling is the Markov chain, a mathematical model of stochastic processes developed by Andrey Markov in the early 1900s (Gagniuc, 2017), in which the probability of a token in a sequence depends only on a fixed number of immediately preceding tokens.
Shannon (1948) adopted a Markov chain framework to propose language modeling with n-grams, or sequences of length n. Given a vocabulary of tokens from which sequences can be formed, a unigram model (n = 1) is a table that gives the probability P(wi) of each token independently; a bigram (n = 2) model gives, for each pair of tokens wi, wj, the probability that wj will follow wi in a sequence; and a trigram model (n = 3) similarly gives conditional probabilities that each token will follow a particular sequence of two tokens, and so on.
Using a bigram model, for example, the probability of any sequence w1, w2, ... wm can be estimated as P(w1)P(w2 | w1)...P(wm | wm–1), and given a sequence, the next token can be sampled from the probability distribution over tokens following the previous token. N-gram models are the simplest form of language modeling. The probabilities that make up an n-gram model are estimated by measuring the frequencies of token co-occurrences in text corpora; these values are then stored in a probability table, from which the probability of each possible token following a given sequence can be quickly retrieved. Shannon (1951) showed how n-gram models can be used to estimate the entropy of natural languages, where entropy measures, on average, how unpredictable a word is given the sequence of words that precedes it.
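To make the estimation procedure concrete, the following sketch builds a bigram model by counting token co-occurrences in a tiny, made-up corpus; a realistic implementation would also need smoothing to handle token pairs never seen in training, which is omitted here.

```python
# A minimal bigram model estimated from co-occurrence counts in a toy corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigram_counts = Counter(corpus)
bigram_counts = defaultdict(Counter)
for w_prev, w_next in zip(corpus, corpus[1:]):
    bigram_counts[w_prev][w_next] += 1

def p_bigram(w_next, w_prev):
    """Estimate P(w_next | w_prev) from co-occurrence frequencies."""
    return bigram_counts[w_prev][w_next] / unigram_counts[w_prev]

def p_sequence(tokens):
    """P(w1) * P(w2 | w1) * ... * P(wm | wm-1), with P(w1) taken from unigram frequency."""
    p = unigram_counts[tokens[0]] / len(corpus)
    for w_prev, w_next in zip(tokens, tokens[1:]):
        p *= p_bigram(w_next, w_prev)
    return p

print(p_bigram("sat", "cat"))            # 1.0 in this toy corpus
print(p_sequence("the cat sat".split()))
```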
In the decades following Shannon’s work, n-grams and other statistical language models were used extensively in natural language processing. However, among other limitations, n-gram models suffer from the “curse of dimensionality”: the size of the model’s probability table increases exponentially with n.
In the 1990s and 2000s, several groups showed that neural networks could be trained to implement language models more efficiently and accurately than n-gram models. Most notably, Bengio et al. (2000) proposed the basic structure for neural language modeling still used today: given an input sequence of tokens from a text, the neural network is trained to predict the probability that each token in the model’s vocabulary will be the next token in the sequence. Tokens are represented by numerical vectors (embeddings) whose values are also learned from data.
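The sketch below illustrates the structure just described, with arbitrary sizes and untrained random weights standing in for the learned parameters; it is a schematic of the idea rather than a faithful reproduction of Bengio et al.'s model.

```python
# Schematic neural language model: context tokens -> learned embeddings ->
# hidden layer -> softmax over the vocabulary. Sizes and weights are arbitrary;
# training (omitted) would adjust embeddings and weights to maximize the
# probability of observed next tokens.
import numpy as np

vocab_size, embed_dim, context_len, hidden_dim = 1000, 32, 3, 64
rng = np.random.default_rng(0)

embeddings = rng.normal(size=(vocab_size, embed_dim))            # learned in practice
W_hidden = rng.normal(size=(context_len * embed_dim, hidden_dim))
W_out = rng.normal(size=(hidden_dim, vocab_size))

def next_token_distribution(context_token_ids):
    x = embeddings[context_token_ids].reshape(-1)                # concatenate context embeddings
    h = np.tanh(x @ W_hidden)                                    # hidden layer
    logits = h @ W_out
    exp = np.exp(logits - logits.max())                          # softmax over the vocabulary
    return exp / exp.sum()

probs = next_token_distribution([12, 7, 431])
print(probs.shape, probs.sum())   # (1000,) 1.0
```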
Language models based on n-grams and on traditional feedforward neural networks allow only a fixed number of tokens—the context—to be used in predicting next-token probabilities. The invention of recurrent neural networks (RNNs), in which tokens in a sequence are input to the network over a series of time steps, enabled neural language modeling with, in principle, unbounded context length. However, such networks have difficulty capturing longer-range relationships between tokens in a sequence. To address this problem, versions of RNNs were created with gating mechanisms that allow information to be retained over longer spans, such as long short-term memory networks (Hochreiter & Schmidhuber, 1997) and gated recurrent units (Cho et al., 2014).
In 2017, researchers at Google proposed a novel type of neural network, called a transformer architecture (Vaswani et al., 2017), that was entirely feedforward (meaning that it had no recurrence) but captured long-range dependencies among sequence tokens via a mechanism called attention. Transformers substantially outperformed previous RNNs on language modeling. Moreover, unlike RNNs, many aspects of the transformer architecture are easily parallelizable on specialized hardware, enabling substantial scaling of training and inference processes. At present, nearly all LLMs use transformer architectures, the largest having one trillion or more parameters.
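The core attention operation can be summarized in a few lines of code. The following numpy sketch implements scaled dot-product attention, the basic form described by Vaswani et al. (2017); the dimensions and random inputs are placeholders, and real transformers apply this operation in parallel across many heads and layers, with additional learned projections and masking.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each output position is a weighted mixture of the value vectors V,
    with weights given by the similarity of its query to every key."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # how strongly each token attends to each other token
    return weights @ V

seq_len, d_k = 5, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)   # (5, 16)
```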
LLMs are generative models, which, given a prompt (decomposed into tokens), compute a probability distribution over the model’s vocabulary and probabilistically select the next token to generate. This process can be iterated by appending the generated token to the original prompt and using this new prompt to choose the next token, and so on, to generate new text of any length. LLMs typically have a fixed-length context window for their input, in which the original prompt and added tokens are stored; when the context window is full, a token is dropped from the beginning in order to add a new token at the end.
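The generation loop just described can be sketched as follows; here `model` is a placeholder for any function that maps a token sequence to a next-token distribution (such as the toy models sketched earlier), and the window size is an arbitrary illustrative value.

```python
import random

def generate(model, prompt_tokens, n_new_tokens, context_window=8):
    """Iteratively sample next tokens, appending each to the growing sequence."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        context = tokens[-context_window:]        # oldest tokens fall out of a full window
        probs = model(context)                    # distribution over the vocabulary
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(next_token)                 # append and repeat
    return tokens
```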
LLMs are built on the transformer architecture, which consists of layers of transformer blocks. Each block contains an attention layer that feeds into a traditional multilayer perceptron. Each attention layer can have multiple attention heads, each of which computes new token embeddings that incorporate some aspect of the context of other tokens in the sequence. The final output of the network is a probability distribution over all tokens in the vocabulary, from which new tokens can be probabilistically selected. As an example of the scale of LLMs, OpenAI’s GPT-3 model has 96 layers of transformer blocks, each of which contains 96 attention heads, with a total of 175 billion tunable parameters (Brown et al., 2020). More recent models have one trillion or more parameters (Minaee et al., 2024).
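The reported parameter counts can be roughly reproduced from these architectural sizes. The short calculation below uses the GPT-3 configuration reported by Brown et al. (2020): 96 layers, model width 12,288, and a vocabulary of roughly 50,000 tokens, together with the standard approximation of about 12 × width² weights per transformer block (attention projections plus a four-times-wider perceptron). It is a back-of-the-envelope check, not an exact count.

```python
# Rough parameter count for GPT-3 from its reported sizes.
n_layers, d_model, vocab_size = 96, 12288, 50257

per_block = 12 * d_model ** 2                          # ~4*d^2 attention + ~8*d^2 MLP weights
total = n_layers * per_block + vocab_size * d_model    # transformer blocks + token embeddings

print(f"{total / 1e9:.0f} billion parameters")         # ~175 billion
```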
LLMs are trained on large corpora of text data. A masked training method was used for the early LLM BERT (Devlin et al., 2018): certain tokens in the input text are omitted, and the training objective is to predict these “masked” tokens. OpenAI’s GPT models use an autoregressive training method: at each position in an input text, the model’s objective is to predict the next token from the tokens that precede it.
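In code, the autoregressive objective amounts to scoring, at each position, the probability the model assigns to the token that actually comes next. The sketch below computes the corresponding average negative log-probability (the quantity minimized during training) for a single token sequence; `model` is again a placeholder for any next-token-distribution function.

```python
import math

def next_token_loss(model, token_ids):
    """Average negative log-probability of each actual next token."""
    total = 0.0
    for i in range(1, len(token_ids)):
        probs = model(token_ids[:i])              # distribution given the preceding tokens
        total += -math.log(probs[token_ids[i]])   # penalize low probability on the true next token
    return total / (len(token_ids) - 1)
```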
Such self-supervised training, without any specific task objective, is called pre-training. The pre-trained model is called a base model or a foundation model. For many applications, the model must be fine-tuned (i.e., additionally trained) on data designed or collected for a specific task. Such fine-tuning takes several forms. Supervised fine-tuning uses curated texts with human-created labels relevant to specific tasks an LLM might be adapted for, such as summarization or sentiment analysis. Instruction tuning uses datasets of human-created examples in which prompts contain instructions and the desired responses show the model how to carry out those instructions. Similar curated datasets are also used to train LLMs to engage in conversation (e.g., to become chatbots). Other fine-tuning methods include reinforcement learning from human feedback (RLHF), often used to train LLMs to avoid biased, inappropriate, or otherwise unwanted outputs. Like pre-training, all of these fine-tuning methods result in a model that predicts new tokens in response to a prompt; however, by tuning the weights to produce a friendly, nontoxic chatbot that obeys instructions (e.g., OpenAI’s ChatGPT), they go beyond the pure language-modeling objective, which relies only on the statistics of natural language found in general text corpora.
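As an illustration of the kind of data involved, the following (entirely invented) examples show the general shape of instruction-tuning pairs: each prompt contains an instruction, and the target response demonstrates how to carry it out. Fine-tuning continues to use next-token prediction, but on such curated pairs rather than general web text.

```python
# Hypothetical instruction-tuning examples, invented for illustration.
instruction_data = [
    {
        "prompt": "Summarize in one sentence: The committee met on Tuesday and "
                  "voted 7-2 to postpone the decision until next quarter.",
        "response": "The committee voted to delay the decision until next quarter.",
    },
    {
        "prompt": "Translate to French: Where is the train station?",
        "response": "Où est la gare ?",
    },
]
```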
LLMs have been remarkably successful, both in their abilities to model and generate natural language and in their benefits as foundations on which more general AI systems can be built (hence the term foundation models). LLMs now dominate applications in natural language processing and have been applied to many areas beyond language, including computer code generation (Poldrack et al., 2023; Roziere et al., 2023), medical applications (Zhou et al., 2023), robotics (Vemprala et al., 2023), financial prediction (Yu et al., 2023), and other diverse applications.
In spite of their successes, however, these models have several weaknesses that have limited their widespread deployment. Because their pretraining objective is to generate statistically likely next tokens, they are prone to what have been called hallucinations: generating (often plausible-sounding) untrue statements or information (Xu et al., 2024). LLMs have also absorbed human-like biases concerning race, gender, and other attributes; despite fine-tuning designed to mitigate such biases, models built on LLMs have exhibited them in both explicit and subtle ways (Weidinger et al., 2022). Other liabilities of LLMs include the possibly illegal inclusion of copyrighted materials in their training data and outputs (Karamolegkou et al., 2023), the security risks and privacy violations such models make possible (Yao et al., 2024), their use to spread disinformation (Jiang et al., 2024), and the unsustainable amounts of electricity, water, and other resources required for their operation (Luccioni et al., 2023).
The quality of pre-trained LLMs is often evaluated in terms of perplexity, an information-theoretic measure that captures how well LLMs perform at predicting tokens that follow given sequences (Rosenfeld, 2000). However, perplexity does not always correlate well with the capability of LLMs (and their fine-tuned versions) to perform specific tasks. Thus, these models are typically tested on standard benchmarks that assess abilities to perform diverse tasks involving language and reasoning as well as on standardized tests designed for humans (Chang et al., 2023). There has been considerable controversy over how informative such evaluations are. Some issues with benchmark-based evaluations are (1) the possibility of data contamination—whether parts of these benchmarks (or very similar items) are contained in the models’ training data, which is often impossible to determine, as commercial companies typically don’t reveal training data; (2) the possibility of shortcuts—unintended statistical associations in tests that might be used by the model to predict answers without actually performing the underlying general ability being tested; and (3) the issue of test validity—whether scoring high on a benchmark translates into similar performance in real-world tasks.
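Concretely, perplexity is the exponentiated average negative log-probability that a model assigns to the actual tokens of a held-out text, so lower values indicate better prediction. A minimal sketch, with `model` again standing in for any next-token-distribution function:

```python
import math

def perplexity(model, token_ids):
    """Exponentiated average negative log-probability of the actual next tokens."""
    log_prob_sum = 0.0
    for i in range(1, len(token_ids)):
        probs = model(token_ids[:i])
        log_prob_sum += math.log(probs[token_ids[i]])
    return math.exp(-log_prob_sum / (len(token_ids) - 1))
```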
Many studies have shown that the quality of LLMs, measured both by perplexity and by performance on benchmarks, scales in predictable ways with model size (number of parameters), amount of pre-training data, and computational resources for training (Kaplan et al., 2020). Other studies have shown that certain capabilities in LLMs seem to emerge only at certain scales (Wei et al., 2022a). However, it has also been argued that the apparent abrupt emergence of such capabilities is an artifact of the evaluation metrics used, not an intrinsic property of scaling (Schaeffer et al., 2024).
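As an illustration of the form such scaling relationships take, the sketch below evaluates a power law of the kind fit by Kaplan et al. (2020) for test loss as a function of (non-embedding) parameter count; the constants are approximate values from that paper and are shown only to convey the qualitative pattern of smooth, predictable improvement with scale.

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted test loss (nats per token) under an assumed power law in model size."""
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> predicted loss {predicted_loss(n):.2f}")
```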
The abilities of fine-tuned LLMs to reason and plan have been widely debated and remain controversial. Numerous studies have claimed that LLMs can perform sophisticated mathematical and other types of reasoning (Huang & Chang, 2023) and that their abilities can be improved via chain-of-thought prompting, in which they are given examples of worked-out reasoning patterns or instructed to “think step by step” (Wei et al., 2022b; Kojima et al., 2022). However, other studies have pointed to many limitations in LLMs’ capacity for reasoning and planning and have questioned the generality and robustness of such abilities (McCoy, 2023; Wu et al., 2023), hypothesizing that these systems’ performance on certain problems reflects “approximate retrieval” of similar reasoning patterns from their training data (Kambhampati, 2024) rather than abstract reasoning abilities.
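For illustration, a chain-of-thought prompt of the kind used in these studies might look like the following (the arithmetic problems here are invented): a worked example spells out its intermediate reasoning steps, and the model is then expected to produce similar step-by-step reasoning for the new question.

```python
# A hypothetical few-shot chain-of-thought prompt, invented for illustration.
chain_of_thought_prompt = """\
Q: A box holds 4 red pens and 3 blue pens. Ana takes 2 boxes. How many pens does she have?
A: Each box holds 4 + 3 = 7 pens. Two boxes hold 2 * 7 = 14 pens. The answer is 14.

Q: A shelf has 5 rows of 6 books. Two books are removed. How many books remain?
A: Let's think step by step.
"""
```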
To extend their capabilities, LLMs are being augmented in many ways: they are being given the ability to call on “scratchpads” (Nye et al., 2021), code interpreters, symbolic calculators, and other external tools; to search the web and use the results to support their claims (Gao et al., 2023); and to perform actions on the internet. In addition, LLMs are being extended to be multimodal—that is, to integrate language with data in other modalities, such as images and videos (Koh et al., 2023; Liu et al., 2024).
LLMs have touched on nearly all areas of cognitive science and have played many roles, including (among others) as proof of principle for or against linguistic hypotheses [see Language Acquisition] (Mahowald et al., 2024; Piantadosi, 2023), as models of neuroscientific and cognitive processes [see Neuroscience of Syntax] (Hardy et al., 2023; Schrimpf et al., 2020), as proposed replacements for human participants in experiments (Crockett & Messeri, 2023; Dillion et al., 2023), and as foils for showing how humans and machines differ (Mitchell et al., 2023; West et al., 2023; Yiu et al., 2023).
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.