In recent years, scaling up the size of language models has proven to be a reliable way to improve performance on a range of natural language processing (NLP) tasks. Today's language models at the scale of 100B or more parameters achieve strong performance on tasks like sentiment analysis and machine translation, even with few or no training examples. Even the largest language models, however, can still struggle with certain multi-step reasoning tasks, such as math word problems and commonsense reasoning. How might we enable language models to perform such reasoning tasks?
In "Chain of Thought Prompting Elicits Reasoning in Large Language Models," we explore a prompting method for improving the reasoning abilities of language models. Called chain of thought prompting, this method enables models to decompose multi-step problems into intermediate steps. With chain of thought prompting, language models of sufficient scale (~100B parameters) can solve complex reasoning problems that are not solvable with standard prompting methods.
Comparison to Standard Prompting
With standard prompting (popularized by GPT-3), the model is given examples of input–output pairs (formatted as questions and answers) before being asked to predict the answer for a test-time example (shown below on the left). In chain of thought prompting (below, right), the model is prompted to produce intermediate reasoning steps before giving the final answer to a multi-step problem. The idea is that a model-generated chain of thought would mimic an intuitive thought process when working through a multi-step reasoning problem. While generating a thought process has previously been accomplished via fine-tuning, we show that such thought processes can be elicited by including a few examples of chain of thought via prompting only, which does not require a large training dataset or any modification of the language model's weights.
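As a concrete illustration, the two prompt formats can be assembled as plain strings. This is a minimal sketch; the exemplar uses the well-known tennis-ball word problem, and the exact exemplar wording in the paper's prompts may differ:

```python
# Standard few-shot prompting: each exemplar maps a question directly to an answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"  # the model continues from here with just a final answer
)

# Chain of thought prompting: the exemplar answer spells out the intermediate
# reasoning steps before the final answer, which the model imitates at test time.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"  # the model continues from here with its own chain of thought
)
```

The only difference between the two prompts is the worked-out reasoning in the exemplar answer; the test question and everything else stays the same.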
Chain of thought reasoning allows models to decompose complex problems into intermediate steps that are solved individually. Moreover, the language-based nature of chain of thought makes it applicable to any task that a person could solve via language. We find through empirical experiments that chain of thought prompting can improve performance on various reasoning tasks, and that successful chain of thought reasoning is an emergent property of model scale; that is, the benefits of chain of thought prompting only materialize with a sufficient number of model parameters (around 100B).
Arithmetic Reasoning
One class of tasks where language models typically struggle is arithmetic reasoning (i.e., solving math word problems). Two benchmarks in arithmetic reasoning are MultiArith and GSM8K, which test the ability of language models to solve multi-step math problems like the one shown in the figure above. We evaluate both the LaMDA collection of language models, ranging from 422M to 137B parameters, and the PaLM collection of language models, ranging from 8B to 540B parameters. We manually compose chains of thought to include in the examples for chain of thought prompting.
For these two benchmarks, using standard prompting leads to relatively flat scaling curves: increasing the scale of the model does not substantially improve performance (shown below). However, we find that when using chain of thought prompting, increasing model scale leads to improved performance that substantially outperforms standard prompting for large model sizes.
*Employing chain of thought prompting enables language models to solve arithmetic reasoning problems for which standard prompting has a mostly flat scaling curve.*
On the GSM8K dataset of math word problems, PaLM shows remarkable performance when scaled to 540B parameters. As shown in the table below, combining chain of thought prompting with the 540B parameter PaLM model leads to new state-of-the-art performance of 58%, surpassing the prior state of the art of 55%, which was achieved by fine-tuning GPT-3 175B on a large training set and then ranking potential solutions via a specially trained verifier. Moreover, follow-up work on self-consistency shows that the performance of chain of thought prompting can be improved further by taking the majority vote over a broad set of generated reasoning processes, which results in 74% accuracy on GSM8K.
*Chain of thought prompting with PaLM achieves a new state of the art on the GSM8K benchmark of math word problems. For a fair comparison against fine-tuned GPT-3 baselines, the chain of thought prompting results shown here also use an external calculator to compute basic arithmetic functions (i.e., addition, subtraction, multiplication and division).*
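One way such an external calculator can be wired in is to post-process the model's generated chain of thought, recomputing every arithmetic expression it writes down and overwriting any incorrect results. This is a minimal sketch of that idea under our own assumptions (integer-only expressions of the form "a op b = c"); the exact mechanism used in the paper may differ:

```python
import re

def fix_arithmetic(chain: str) -> str:
    """Recompute each 'a op b = c' in the chain, overwriting wrong results."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a // b}  # integer division for simplicity

    def recompute(m: re.Match) -> str:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return f"{a} {op} {b} = {ops[op](a, b)}"

    # Match integer expressions like "5 + 6 = 10" and replace the stated result.
    return re.sub(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*-?\d+", recompute, chain)
```

For example, `fix_arithmetic("5 + 6 = 10. The answer is 10.")` corrects the equation to `5 + 6 = 11`, isolating the model's planning ability from its (sometimes unreliable) mental arithmetic.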
Commonsense Reasoning
In addition to arithmetic reasoning, we consider whether the language-based nature of chain of thought prompting also makes it applicable to commonsense reasoning, which involves reasoning about physical and human interactions under the presumption of general background knowledge. For these evaluations, we use the CommonsenseQA and StrategyQA benchmarks, as well as two domain-specific tasks from the BIG-Bench collaboration concerning date understanding and sports understanding. Example questions are below:
*Example questions from CommonsenseQA, StrategyQA, date understanding, and sports understanding.*
As shown below, for CommonsenseQA, StrategyQA, and date understanding, performance improved with model scale, and employing chain of thought prompting led to additional small improvements. Chain of thought prompting had the biggest improvement on sports understanding, for which PaLM 540B's chain of thought performance surpassed that of an unaided sports enthusiast (95% vs. 84%).
*Chain of thought prompting also improves performance on various types of commonsense reasoning tasks.*
Conclusions
Chain of thought prompting is a simple and broadly applicable method for improving the ability of language models to perform various reasoning tasks. Through experiments on arithmetic and commonsense reasoning, we find that chain of thought prompting is an emergent property of model scale. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.
Acknowledgements
It was an honor and a privilege to work with Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Quoc Le on this project.