Learning English is no easy task, as countless students well know. But when the student is a computer, one approach works surprisingly well: Simply feed mountains of text from the internet to a giant mathematical model called a neural network. That’s the operating principle behind generative language models like OpenAI’s ChatGPT, whose ability to converse coherently (if not always truthfully) on a wide range of topics has surprised researchers and the public over the past year.
But the approach has its drawbacks. For one thing, the “training” procedure required to transmute vast text archives into state-of-the-art language models is costly and time-intensive. For another, even the people who train large language models find it hard to understand their inner workings; that, in turn, makes it hard to predict the many ways they can fail.
Faced with these difficulties, some researchers have opted to train smaller models on smaller data sets and then study their behavior. “It’s like sequencing the Drosophila genome versus sequencing the human genome,” said Ellie Pavlick, a language model researcher at Brown University.
Now, in a paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories.
Machine learning researchers have embraced this lesson. GPT-3.5, the large language model that powers the ChatGPT interface, has nearly 200 billion parameters, and it was trained on a data set comprising hundreds of billions of words. (OpenAI hasn’t released the corresponding figures for its successor, GPT-4.) Training such large models typically requires at least 1,000 specialized processors called GPUs running in parallel for weeks at a time. Only a few companies can muster the requisite resources, let alone train and compare different models.
The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way. Their results hint at new research directions that might be helpful for training larger models and understanding their behavior.
“I found this paper very informative,” said Chandra Bhagavatula, a language model researcher at the Allen Institute for Artificial Intelligence in Seattle. “The concept itself is super interesting.”
Once Upon a Time
The neural networks at the heart of language models are mathematical structures loosely inspired by the human brain. Each one contains many artificial neurons arranged in layers, with connections between neurons in adjacent layers. The neural network’s behavior is governed by the strength of these connections, called parameters. In a language model, the parameters control which words the model might spit out next, given an initial prompt and the words it has generated already.
A model only truly comes to life during training, when it repeatedly compares its own output to the text in its training data set and adjusts its parameters to increase the resemblance. An untrained network with random parameters is trivially easy to assemble from a few lines of code, but it will just produce gibberish. After training, it can often plausibly continue unfamiliar text. Larger models often undergo further fine-tuning that teaches them to answer questions and follow instructions, but the bulk of the training is mastering word prediction.
Success at word prediction requires a language model to master many different skills. For example, the rules of English grammar suggest that the next word after the word “going” is likely to be “to,” regardless of the subject of the text. In addition, a system needs factual knowledge to complete “the capital of France is,” and completing a passage containing the word “not” requires a rudimentary grasp of logic.
“Raw language is very complicated,” said Timothy Nguyen, a machine learning researcher at DeepMind. “In order for interesting linguistic capabilities to arise, people have resorted to ‘more data is better.’”