Large language models are a class of AI algorithm that relies on a large number of computational nodes and an equally large number of connections among them. They can be trained to perform a variety of functions (protein folding, anyone?), but they're mostly recognized for their capabilities with human languages.
LLMs trained to simply predict the next word that will appear in text can produce human-sounding conversations and essays, although with some worrying accuracy issues. The systems have demonstrated a variety of behaviors that appear to go well beyond the simple language capabilities they were trained to handle.
We can apparently add analogies to the list of items that LLMs have inadvertently mastered. A team from the University of California, Los Angeles has tested the GPT-3 LLM using questions that should be familiar to any Americans who have spent time on standardized tests like the SAT. In all but one variant of these questions, GPT-3 managed to outperform undergrads who presumably had mastered these tests just a few years earlier. The researchers suggest that this indicates that large language models are able to master reasoning by analogy.
Different kinds of reasoning
The UCLA team, Taylor Webb, Keith Holyoak, and Hongjing Lu, relied on a large collection of ways that past research has tested humans' ability to reason via analogy. The classic form of this is the completion of a comparison—think "cold is to ice as hot is to ____"—where you have to select the best completion from a set of options.
Related tests involve figuring out the rules behind transformations of a series of letters. So, for example, if the series a b c d is transformed to a b c e, then the rule is to replace the last letter of the series with its alphabetical successor. The participant's understanding of the rule is tested by asking them to use the rule to transform a different set of letters. Similar tests with numbers can involve complex rules, such as "only even numbers in order, but can be ascending or descending."
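To make the letter-series rule concrete, here is a minimal Python sketch of it. It is only an illustration of the rule as described above, not the researchers' test materials or code; the function names successor and apply_last_letter_rule are hypothetical, and the sketch assumes lowercase English letters.

```python
import string


def successor(letter: str) -> str:
    """Return the next letter of the alphabet (hypothetical helper).

    Assumes a single lowercase English letter; 'z' has no successor.
    """
    alphabet = string.ascii_lowercase
    index = alphabet.index(letter)
    if index == len(alphabet) - 1:
        raise ValueError("'z' has no alphabetical successor")
    return alphabet[index + 1]


def apply_last_letter_rule(series: list[str]) -> list[str]:
    """Replace the last letter of the series with its alphabetical successor."""
    return series[:-1] + [successor(series[-1])]


# The source pair from the text: a b c d -> a b c e
print(apply_last_letter_rule(["a", "b", "c", "d"]))

# The same rule applied to a different set of letters
print(apply_last_letter_rule(["i", "j", "k", "l"]))
```

A test-taker, human or otherwise, only sees the first pair of series and has to infer a rule like this one before applying it to the new letters.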
On all of these tests, GPT-3 consistently outperformed undergrads, although the margins varied depending on the specific test involved. The researchers also found that the software could develop rules based on a series of numbers, and then apply them to a different domain, such as descriptions of temperatures like "warm" and "chilly." They conclude that "these results suggest that GPT-3 has developed an abstract notion of successorship that can be flexibly generalized between different domains."
But there were also some odd glitches. The software didn't consistently recognize when it was being presented with these problems. It showed a high error rate unless it was given a prompt to supply an answer, and also when the question was phrased as a sentence rather than as sets of values.