How Chain-of-Thought Reasoning Helps Neural Networks Compute
2024-03-22 21:58:17
Source: https://www.quantamagazine.org/how-chain-of-thought-reasoning-helps-neural-networks-compute-20240321/#comments

Your grade school teacher probably didn’t show you how to add 20-digit numbers. But if you know how to add smaller numbers, all you need is paper and pencil and a bit of patience. Start with the ones place and work leftward step by step, and soon you’ll be stacking up quintillions with ease.

Problems like this are easy for humans, but only if we approach them in the right way. “How we humans solve these problems is not ‘stare at it and then write down the answer,’” said Eran Malach, a machine learning researcher at Harvard University. “We actually walk through the steps.”

That insight has inspired researchers studying the large language models that power chatbots like ChatGPT. While these systems might ace questions involving a few steps of arithmetic, they’ll often flub problems involving many steps, like calculating the sum of two large numbers. But in 2022, a team of Google researchers showed that asking language models to generate step-by-step solutions enabled the models to solve problems that had previously seemed beyond their reach. Their technique, called chain-of-thought prompting, soon became widespread, even as researchers struggled to understand what makes it work.

Now, several teams have explored the power of chain-of-thought reasoning by using techniques from an arcane branch of theoretical computer science called computational complexity theory. It’s the latest chapter in a line of research that uses complexity theory to study the intrinsic capabilities and limitations of language models. These efforts clarify where we should expect models to fail, and they might point toward new approaches to building them.

“They remove some of the magic,” said Dimitris Papailiopoulos, a machine learning researcher at the University of Wisconsin, Madison. “That’s a good thing.”

Training Transformers

Large language models are built around mathematical structures called artificial neural networks. The many “neurons” inside these networks perform simple mathematical operations on long strings of numbers representing individual words, transmuting each word that passes through the network into another. The details of this mathematical alchemy depend on another set of numbers called the network’s parameters, which quantify the strength of the connections between neurons.

To train a language model to produce coherent outputs, researchers typically start with a neural network whose parameters all have random values, and then feed it reams of data from around the internet. Each time the model sees a new block of text, it tries to predict each word in turn: It guesses the second word based on the first, the third based on the first two, and so on. It compares each prediction to the actual text, then tweaks its parameters to reduce the difference. Each tweak only changes the model’s predictions a tiny bit, but somehow their collective effect enables a model to respond coherently to inputs it has never seen.
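The training recipe described above can be sketched in a few lines of code. This is a minimal illustration, not anything a real lab runs: the stand-in model is far simpler than a transformer (it guesses each word from the single word before it), and the vocabulary size, dimensions, and random block of "text" are made-up placeholders. The loop still shows the core objective: predict each next word, compare to the actual text, and tweak the parameters a tiny bit to reduce the difference.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32                          # made-up sizes for illustration
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))  # toy stand-in for a transformer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 16))         # one "block of text" as token ids

for _ in range(100):
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict word i+1 from word i
    logits = model(inputs)                             # the model's guesses for each next word
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                    # compare guesses to the actual text...
    optimizer.step()                                   # ...and nudge the parameters to reduce the gap
```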
Researchers have been training neural networks to process language for 20 years. But the work really took off in 2017, when researchers at Google introduced a new kind of network called a transformer.

“This was proposed seven years ago, which seems like prehistory,” said Pablo Barceló, a machine learning researcher at the Pontifical Catholic University of Chile.

What made transformers so transformative is that it’s easy to scale them up — to increase the number of parameters and the amount of training data — without making training prohibitively expensive. Before transformers, neural networks had at most a few hundred million parameters; today, the largest transformer-based models have more than a trillion. Much of the improvement in language-model performance over the past five years comes from simply scaling up.

Transformers made this possible by using special mathematical structures called attention heads, which give them a sort of bird’s-eye view of the text they’re reading. When a transformer reads a new block of text, its attention heads quickly scan the whole thing and identify relevant connections between words — perhaps noting that the fourth and eighth words are likely to be most useful for predicting the 10th. Then the attention heads pass words along to an enormous web of neurons called a feedforward network, which does the heavy number crunching needed to generate the predictions that help it learn.

Real transformers have multiple layers of attention heads separated by feedforward networks, and only spit out predictions after the last layer. But at each layer, the attention heads have already identified the most relevant context for each word, so the computationally intensive feedforward step can happen simultaneously for every word in the text. That speeds up the training process, making it possible to train transformers on increasingly large sets of data. Even more important, it allows researchers to spread the enormous computational load of training a massive neural network across many processors working in tandem.

To get the most out of massive data sets, “you have to make the models really large,” said David Chiang, a machine learning researcher at the University of Notre Dame. “It’s just not going to be practical to train them unless it’s parallelized.”

However, the parallel structure that makes it so easy to train transformers doesn’t help after training — at that point, there’s no need to predict words that already exist. During ordinary operation, transformers output one word at a time, tacking each output back onto the input before generating the next word, but they’re still stuck with an architecture optimized for parallel processing.

As transformer-based models grew and certain tasks continued to give them trouble, some researchers began to wonder whether the push toward more parallelizable models had come at a cost. Was there a way to understand the behavior of transformers theoretically?
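The attention step described above can be sketched with a little linear algebra. This is a deliberately stripped-down illustration that assumes away the learned pieces of real attention heads (the query, key, and value projections, multiple heads, and so on): every word scores every other word for relevance, and the scores decide how much each word contributes to the updated representation at each position. The numbers are random placeholders.

```python
import numpy as np

def attention(X):
    """X holds one vector per word (one row per word)."""
    scores = X @ X.T / np.sqrt(X.shape[1])         # how relevant is word j to word i?
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row's weights sum to 1
    return weights @ X                             # blend the words by their relevance

X = np.random.randn(10, 8)   # 10 words, each represented by 8 numbers
print(attention(X).shape)    # every position is updated in the same pass
```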
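The word-at-a-time operation described a couple of paragraphs up can likewise be written as a short loop. Here `model` is a hypothetical stand-in for one full pass of a trained network, not any particular library's API; the point is only that generation is sequential, with each output tacked back onto the input before the next pass.

```python
def generate(model, prompt_tokens, n_new_tokens):
    """Sequential generation: one new token per full pass over the text so far."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        next_token = model(tokens)   # one pass over everything produced so far
        tokens.append(next_token)    # the output becomes part of the next input
    return tokens

# Toy usage with a dummy "model" that just computes a function of the input:
print(generate(lambda toks: sum(toks) % 7, [3, 1, 4], 5))
```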
The Complexity of Transformers

Theoretical studies of neural networks face many difficulties, especially when they try to account for training. Neural networks use a well-known procedure to tweak their parameters at each step of the training process. But it can be difficult to understand why this simple procedure converges on a good set of parameters.

Rather than consider what happens during training, some researchers study the intrinsic capabilities of transformers by imagining that it’s possible to adjust their parameters to any arbitrary values. This amounts to treating a transformer as a special type of programmable computer.

“You’ve got some computing device, and you want to know, ‘Well, what can it do? What kinds of functions can it compute?’” Chiang said.

These are the central questions in the formal study of computation. The field dates back to 1936, when Alan Turing first imagined a fanciful device, now called a Turing machine, that could perform any computation by reading and writing symbols on an infinite tape. Computational complexity theorists would later build on Turing’s work by proving that computational problems naturally fall into different complexity classes defined by the resources required to solve them.

In 2019, Barceló and two other researchers proved that an idealized version of a transformer with a fixed number of parameters could be just as powerful as a Turing machine. If you set up a transformer to repeatedly feed its output back in as an input and set the parameters to the appropriate values for the specific problem you want to solve, it will eventually spit out the correct answer.

That result was a starting point, but it relied on some unrealistic assumptions that would likely overestimate the power of transformers. In the years since, researchers have worked to develop more realistic theoretical frameworks.

One such effort began in 2021, when William Merrill, now a graduate student at New York University, was leaving a two-year fellowship at the Allen Institute for Artificial Intelligence in Seattle. While there, he’d analyzed other kinds of neural networks using techniques that seemed like a poor fit for transformers’ parallel architecture. Shortly before leaving, he struck up a conversation with the Allen Institute for AI researcher Ashish Sabharwal, who’d studied complexity theory before moving into AI research. They began to suspect that complexity theory might help them understand the limits of transformers.

“It just seemed like it’s a simple model; there must be some limitations that one can just nail down,” Sabharwal said.

The pair analyzed transformers using a branch of computational complexity theory, called circuit complexity, that is often used to study parallel computation and had recently been applied to simplified versions of transformers. Over the following year, they refined several of the unrealistic assumptions in previous work. To study how the parallel structure of transformers might limit their capabilities, the pair considered the case where transformers didn’t feed their output back into their input — instead, their first output would have to be the final answer. They proved that the transformers in this theoretical framework couldn’t solve any computational problems that lie outside a specific complexity class. And many math problems, including relatively simple ones like solving linear equations, are thought to lie outside this class.

Basically, they showed that parallelism did come at a cost — at least when transformers had to spit out an answer right away. “Transformers are quite weak if the way you use them is you give an input, and you just expect an immediate answer,” Merrill said.
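One way to picture the contrast, as a rough cartoon rather than Merrill and Sabharwal's actual formal framework: a single forward pass pushes every position through a fixed stack of layers in parallel, so the number of strictly sequential steps is capped by the depth of the stack no matter how long the input is, while recycling outputs lets the sequential work grow with the number of extra passes. The `layers` functions below are placeholders, not real transformer components.

```python
def one_pass(tokens, layers):
    """One forward pass: each layer updates every position 'in parallel'."""
    state = list(tokens)
    for layer in layers:                 # fixed depth, chosen before training
        state = [layer(state, i) for i in range(len(state))]
    return state

def pass_with_feedback(tokens, layers, extra_steps):
    """Recycle the output: each extra pass stacks more sequential computation."""
    for _ in range(extra_steps):
        tokens = tokens + [one_pass(tokens, layers)[-1]]   # append the new output
    return tokens

# Toy usage with placeholder "layers":
layers = [lambda state, i: state[i] + max(state)] * 3
print(one_pass([1, 2, 3], layers))
print(pass_with_feedback([1, 2, 3], layers, extra_steps=4))
```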
Thought Experiments

Merrill and Sabharwal’s results raised a natural question — how much more powerful do transformers become when they’re allowed to recycle their outputs? Barceló and his co-authors had studied this case in their 2019 analysis of idealized transformers, but with more realistic assumptions the question remained open. And in the intervening years, researchers had discovered chain-of-thought prompting, giving the question a newfound relevance.

Merrill and Sabharwal knew that their purely mathematical approach couldn’t capture all aspects of chain-of-thought reasoning in real language models, where the wording in the prompt can be very important. But no matter how a prompt is phrased, as long as it causes a language model to output step-by-step solutions, the model can in principle reuse the results of intermediate steps on subsequent passes through the transformer. That could provide a way to evade the limits of parallel computation.

Meanwhile, a team from Peking University had been thinking along similar lines, and their preliminary results were positive. In a May 2023 paper, they identified some math problems that should be impossible for ordinary transformers in Merrill and Sabharwal’s framework, and showed that intermediate steps enabled the transformers to solve these problems.

In October, Merrill and Sabharwal followed up their earlier work with a detailed theoretical study of the computational power of chain of thought. They quantified how that extra computational power depends on the number of intermediate steps a transformer is allowed to use before it must spit out a final answer. In general, researchers expect the appropriate number of intermediate steps for solving any problem to depend on the size of the input to the problem. For example, the simplest strategy for adding two 20-digit numbers requires twice as many intermediate addition steps as the same approach to adding two 10-digit numbers.

Examples like this suggest that transformers wouldn’t gain much from using just a few intermediate steps. Indeed, Merrill and Sabharwal proved that chain of thought only really begins to help when the number of intermediate steps grows in proportion to the size of the input, and many problems require the number of intermediate steps to grow much larger still.
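That scaling claim can be made concrete with the grade-school procedure from the opening of the article, written as an explicit chain of intermediate steps (this is an illustration of the arithmetic, not code from the paper). The step count grows in proportion to the number of digits, so 20-digit addition takes twice as many intermediate steps as 10-digit addition.

```python
def add_step_by_step(a: str, b: str):
    """Add two decimal numbers given as digit strings, one digit position per step."""
    a, b = a.zfill(len(b)), b.zfill(len(a))       # pad the shorter number with zeros
    carry, digits, steps = 0, [], 0
    for x, y in zip(reversed(a), reversed(b)):    # start at the ones place, work leftward
        total = int(x) + int(y) + carry
        digits.append(str(total % 10))
        carry = total // 10
        steps += 1
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits)), steps

print(add_step_by_step("1234567890", "9876543210"))    # 10 digits -> 10 steps
print(add_step_by_step("12345678901234567890",
                       "98765432109876543210"))        # 20 digits -> 20 steps
```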