Source: https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/

Artificial intelligence seems more powerful than ever, with chatbots like Bard and ChatGPT capable of producing uncannily humanlike text. But for all their talents, these bots still leave researchers wondering: Do such models actually understand what they are saying? “Clearly, some people believe they do,” said the AI pioneer Geoff Hinton in a recent conversation with Andrew Ng, “and some people believe they are just stochastic parrots.”

This evocative phrase comes from a 2021 paper co-authored by Emily Bender, a computational linguist at the University of Washington. It suggests that large language models (LLMs), which form the basis of modern chatbots, generate text only by combining information they have already seen “without any reference to meaning,” the authors wrote, which makes an LLM “a stochastic parrot.”

These models power many of today’s biggest and best chatbots, so Hinton argued that it’s time to determine the extent of what they understand. The question, to him, is more than academic. “So long as we have those differences” of opinion, he said to Ng, “we are not going to be able to come to a consensus about dangers.”

New research may have intimations of an answer. A theory developed by Sanjeev Arora of Princeton University and Anirudh Goyal, a research scientist at Google DeepMind, suggests that the largest of today’s LLMs are not stochastic parrots.
The authors argue that as these models get bigger and are trained on more data, they improve on individual language-related abilities and also develop new ones by combining skills in a manner that hints at understanding: combinations that were unlikely to exist in the training data.

This theoretical approach, which provides a mathematically provable argument for how and why an LLM can develop so many abilities, has convinced experts like Hinton and others. And when Arora and his team tested some of its predictions, they found that these models behaved almost exactly as expected. From all accounts, they’ve made a strong case that the largest LLMs are not just parroting what they’ve seen before.

“[They] cannot be just mimicking what has been seen in the training data,” said Sébastien Bubeck, a mathematician and computer scientist at Microsoft Research who was not part of the work. “That’s the basic insight.”

More Data, More Power

The emergence of unexpected and diverse abilities in LLMs, it’s fair to say, came as a surprise. These abilities are not an obvious consequence of the way the systems are built and trained. An LLM is a massive artificial neural network, which connects individual artificial neurons. These connections are known as the model’s parameters, and their number denotes the LLM’s size. Training involves giving the LLM a sentence with the last word obscured, for example, “Fuel costs an arm and a ___.” The LLM predicts a probability distribution over its entire vocabulary, so if it knows, say, a thousand words, it predicts a thousand probabilities. It then picks the most likely word to complete the sentence: presumably, “leg.”

Initially, the LLM might choose words poorly.
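The prediction step described above can be sketched with a toy example: raw scores for each vocabulary word are turned into a probability distribution with a softmax, and the most likely word completes the sentence. The four-word vocabulary and hand-picked scores here are illustrative assumptions, not taken from any real model.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution that sums to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores for "Fuel costs an arm and a ___."
vocab = ["leg", "dog", "cloud", "hand"]
scores = [4.0, 0.5, 0.1, 2.0]

probs = softmax(scores)
prediction = vocab[probs.index(max(probs))]
print(prediction)  # "leg" has the highest score, so it is chosen
```

A real LLM does the same thing over a vocabulary of tens of thousands of tokens, with scores produced by the network rather than written by hand.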
The training algorithm then calculates a loss (the distance, in some high-dimensional mathematical space, between the LLM’s answer and the actual word in the original sentence) and uses this loss to tweak the parameters. Now, given the same sentence, the LLM will calculate a better probability distribution, and its loss will be slightly lower. The algorithm does this for every sentence in the training data (possibly billions of sentences) until the LLM’s overall loss drops to acceptable levels. A similar process is used to test the LLM on sentences that weren’t part of the training data.

A trained and tested LLM, when presented with a new text prompt, will generate the most likely next word, append it to the prompt, generate another next word, and continue in this manner, producing a seemingly coherent reply. Nothing in the training process suggests that bigger LLMs, built using more parameters and training data, should also improve at tasks that require reasoning to answer.

But they do. Big enough LLMs demonstrate abilities, from solving elementary math problems to answering questions about the goings-on in others’ minds, that smaller models don’t have, even though they are all trained in similar ways.

“Where did that [ability] emerge from?” Arora wondered. “And can that emerge from just next-word prediction?”

Arora teamed up with Goyal to answer such questions analytically. “We were trying to come up with a theoretical framework to understand how emergence happens,” Arora said.

The duo turned to mathematical objects called random graphs. A graph is a collection of points (or nodes) connected by lines (or edges), and in a random graph the presence of an edge between any two nodes is decided randomly, say, by a coin flip. The coin can be biased, so that it comes up heads with some probability p.
If the coin comes up heads for a given pair of nodes, an edge forms between those two nodes; otherwise they remain unconnected. As the value of p changes, the graphs can show sudden transitions in their properties. For example, when p exceeds a certain threshold, isolated nodes (those that aren’t connected to any other node) abruptly disappear.

Arora and Goyal realized that random graphs, which give rise to unexpected behaviors after they meet certain thresholds, could be a way to model the behavior of LLMs. Neural networks have become almost too complex to analyze, but mathematicians have been studying random graphs for a long time and have developed various tools to analyze them. Maybe random graph theory could give researchers a way to understand and predict the apparently unexpected behaviors of large LLMs.

Connecting Skills to Text

The researchers decided to focus on “bipartite” graphs, which contain two types of nodes. In their model, one type of node represents pieces of text: not individual words but chunks that could be a paragraph to a few pages long. These nodes are arranged in a straight line. Below them, in another line, is the other set of nodes. These represent the skills needed to make sense of a given piece of text. Each skill could be almost anything. Perhaps one node represents an LLM’s ability to understand the word “because,” which incorporates some notion of causality; another could represent being able to divide two numbers; yet another might represent the ability to detect irony. “If you understand that the piece of text is ironical, a lot of things flip,” Arora said. “That’s relevant to predicting words.”

To be clear, LLMs are not trained or tested with skills in mind; they’re built only to improve next-word prediction. But Arora and Goyal wanted to understand LLMs from the perspective of the skills that might be required to comprehend a single text.
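Both ideas above, coin-flip edge formation with probability p and the bipartite skill–text layout, can be sketched in a few lines. The node counts and probabilities below are arbitrary illustrative choices, not values from the paper; the sudden disappearance of isolated nodes as p grows is the point.

```python
import random

def random_bipartite(n_skills, n_texts, p, seed=0):
    """Connect each (skill, text) pair independently with probability p;
    return, for each text node, the set of skill nodes it links to."""
    rng = random.Random(seed)
    return {t: {s for s in range(n_skills) if rng.random() < p}
            for t in range(n_texts)}

def isolated_text_fraction(graph):
    """Fraction of text nodes connected to no skill node at all."""
    return sum(1 for skills in graph.values() if not skills) / len(graph)

# As p crosses a threshold, isolated text nodes abruptly vanish
for p in [0.001, 0.01, 0.1]:
    g = random_bipartite(n_skills=100, n_texts=500, p=p)
    print(f"p={p}: isolated fraction = {isolated_text_fraction(g):.2f}")
```

With 100 skill nodes, a text node stays isolated with probability (1 − p)^100, which falls from about 0.9 at p = 0.001 to essentially zero at p = 0.1: a small change in p, a qualitative change in the graph.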
A connection between a skill node and a text node, or between multiple skill nodes and a text node, means the LLM needs those skills to understand the text in that node. Multiple pieces of text might also draw from the same skill or set of skills; for example, a set of skill nodes representing the ability to understand irony would connect to the numerous text nodes where irony occurs.

The challenge now was to connect these bipartite graphs to actual LLMs and see if the graphs could reveal something about the emergence of powerful abilities. But the researchers could not rely on any information about the training or testing of actual LLMs: companies like OpenAI or DeepMind don’t make their training or test data public. Also, Arora and Goyal wanted to predict how LLMs will behave as they get even bigger, and there’s no such information available for forthcoming chatbots. There was, however, one crucial piece of information that the researchers could access.

Since 2021, researchers studying the performance of LLMs and other neural networks have seen a universal trait emerge. They noticed that as a model gets bigger, whether in size or in the amount of training data, its loss on test data (the difference between predicted and correct answers on new texts, after training) decreases in a very specific manner. These observations have been codified into equations called the neural scaling laws. So Arora and Goyal designed their theory to depend not on data from any individual LLM, chatbot, or set of training and test data, but on the universal law these systems are all expected to obey: the loss predicted by scaling laws.

Maybe, they reasoned, improved performance, as measured by the neural scaling laws, was related to improved skills. And these improved skills could be defined in their bipartite graphs by the connection of skill nodes to text nodes.
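The “very specific manner” in which loss decreases is, roughly, a power law in model size. A minimal sketch of such a law follows; the constants n_c and alpha are illustrative assumptions in the style of published scaling-law fits, not fitted or published values.

```python
def scaling_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law test loss L(N) = (N_c / N) ** alpha.
    n_c and alpha are made-up illustrative constants."""
    return (n_c / n_params) ** alpha

# Loss keeps shrinking, slowly and predictably, as the model grows
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} parameters -> loss {scaling_law_loss(n):.3f}")
```

The usefulness of such a law is exactly what Arora and Goyal exploited: it predicts the test loss of bigger, even not-yet-built models without needing their training data.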
Establishing this link between neural scaling laws and bipartite graphs was the key that would allow them to proceed.

The researchers started by assuming that there exists a hypothetical bipartite graph that corresponds to an LLM’s behavior on test data. To leverage the change in the LLM’s loss on test data, they imagined a way to use the graph to describe how the LLM gains skills.

Take, for instance, the skill “understands irony.” This idea is represented with a skill node, so the researchers look to see what text nodes this skill node connects to. If almost all of these connected text nodes are successful, meaning that the LLM’s predictions on the text represented by these nodes are highly accurate, then the LLM is competent in this particular skill. But if more than a certain fraction of the skill node’s connections go to failed text nodes, then the LLM fails at this skill.

This connection between bipartite graphs and LLMs allowed Arora and Goyal to use the tools of random graph theory to analyze LLM behavior by proxy. Studying these graphs revealed certain relationships between the nodes. These relationships, in turn, translated to a logical and testable way to explain how large models gained the skills necessary to achieve their unexpected abilities.

Scaling Up Skills

Arora and Goyal first explained one key behavior: why bigger LLMs become more skilled than their smaller counterparts on individual skills. They started with the lower test loss predicted by the neural scaling laws. In a graph, this lower test loss is represented by a fall in the fraction of failed test nodes. So there are fewer failed test nodes overall. And if there are fewer failed test nodes, then there are fewer connections between failed test nodes and skill nodes. Therefore, a greater number of skill nodes are connected to successful test nodes, suggesting a growing competence in skills for the model.
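The competence criterion described above can be sketched directly: a skill counts as acquired when at most a given fraction of the text nodes it connects to are failures. The graph, skill names, failed texts, and threshold below are all invented for illustration.

```python
# Hypothetical bipartite graph: each skill maps to the text nodes it serves
skill_to_texts = {
    "irony": ["t1", "t2", "t3", "t4"],
    "causality": ["t2", "t5"],
    "division": ["t6", "t7", "t8"],
}
failed_texts = {"t3", "t6", "t7"}  # texts the model predicted poorly

def competent_skills(graph, failed, max_fail_fraction=0.3):
    """A skill is acquired if the fraction of its connected
    text nodes that failed stays at or under the threshold."""
    acquired = []
    for skill, texts in graph.items():
        fail_frac = sum(1 for t in texts if t in failed) / len(texts)
        if fail_frac <= max_fail_fraction:
            acquired.append(skill)
    return acquired

# "division" fails: 2 of its 3 connected texts are failures
print(competent_skills(skill_to_texts, failed_texts))  # ['irony', 'causality']
```

As test loss falls, the failure set shrinks, and more skills clear the threshold, which is the mechanism behind the behavior just described.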
“A very slight reduction in loss gives rise to the machine acquiring competence of these skills,” Goyal said.

Next, the pair found a way to explain a larger model’s unexpected abilities. As an LLM’s size increases and its test loss decreases, random combinations of skill nodes develop connections to individual text nodes. This suggests that the LLM also gets better at using more than one skill at a time and begins generating text using multiple skills (combining, say, the ability to use irony with an understanding of the word “because”), even if those exact combinations of skills weren’t present in any piece of text in the training data.

Imagine, for example, an LLM that could already use one skill to generate text. If you scale up the LLM’s number of parameters or training data by an order of magnitude, it will become similarly competent at generating text that requires two skills. Go up another order of magnitude, and the LLM can now perform tasks that require four skills at once, again with the same level of competency. Bigger LLMs have more ways of putting skills together, which leads to a combinatorial explosion of abilities.

And as an LLM is scaled up, the possibility that it encountered all these combinations of skills in the training data becomes increasingly unlikely. According to the rules of random graph theory, every combination arises from a random sampling of possible skills. So, if there are about 1,000 underlying individual skill nodes in the graph, and you want to combine four skills, then there are approximately 1,000 to the fourth power (that is, 1 trillion) possible ways to combine them.

Arora and Goyal see this as proof that the largest LLMs don’t just rely on combinations of skills they saw in their training data. Bubeck agrees.
“If an LLM is really able to perform those tasks by combining four of those thousand skills, then it must be doing generalization,” he said. Meaning, it’s very likely not a stochastic parrot.

True Creativity?

But Arora and Goyal wanted to go beyond theory and test their claim that LLMs get better at combining more skills, and thus at generalizing, as their size and training data increase. Together with other colleagues, they designed a method called “skill-mix” to evaluate an LLM’s ability to use multiple skills to generate text.

To test an LLM, the team asked it to generate three sentences on a randomly chosen topic that illustrated some randomly chosen skills. For example, they asked GPT-4 (the LLM that powers the most powerful version of ChatGPT) to write about dueling (sword fights, basically). Moreover, they asked it to display skills in four areas: self-serving bias, metaphor, statistical syllogism and common-knowledge physics. GPT-4 answered with: “My victory in this dance with steel [metaphor] is as certain as an object’s fall to the ground [physics]. As a renowned duelist, I’m inherently nimble, just like most others [statistical syllogism] of my reputation. Defeat? Only possible due to an uneven battlefield, not my inadequacy [self-serving bias].” When asked to check its output, GPT-4 reduced it to three sentences.
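The sampling step behind a skill-mix-style evaluation can be sketched as follows: pick a random topic and a random set of k skills, then phrase the request. The topic and skill lists here are small stand-ins, and the prompt wording is ours, not the paper's; the last line reproduces the article's combinatorial count.

```python
import random

skills = ["metaphor", "self-serving bias", "statistical syllogism",
          "common-knowledge physics", "irony", "understanding 'because'"]
topics = ["dueling", "gardening", "chess"]

def make_skill_mix_prompt(k, seed=None):
    """Sample a topic and k distinct skills, skill-mix style
    (illustrative wording, not the paper's actual prompt)."""
    rng = random.Random(seed)
    topic = rng.choice(topics)
    chosen = rng.sample(skills, k)
    return (f"Write three sentences about {topic} that together "
            f"illustrate these skills: {', '.join(chosen)}.")

print(make_skill_mix_prompt(4, seed=1))

# With ~1,000 skills, ordered 4-skill combinations explode, as the article notes:
print(1000 ** 4)  # 1,000,000,000,000: about 1 trillion
```

Because the combinations are drawn at random, a model that handles most of them cannot have memorized them all from training text, which is the force of the argument.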
New Theory Suggests Chatbots Can Understand Text
2024-01-23 21:58:32