Emergent Abilities In LLMs Light The Way Towards Progress
Here is why it is not the beginning of runaway intelligence.
Last year, large language models (LLMs) broke record after record. ChatGPT reached 1 million users faster than Facebook, Spotify, and Instagram did. LLMs helped create billion-dollar companies, and, most notably, they helped us recognize the divine nature of ducks.
2023 has started, and ML progress is likely to continue at breakneck speed. This is a great time to take a look at one of the most interesting papers from last year.
Emergent Abilities in LLMs
In a recent paper from Google Brain, Jason Wei and his colleagues allowed us a peek into the future.
Their research showed that further scaling LLMs will allow them, among other things, to [1]:
Become better at math
Understand even more subtleties of human language
Reduce hallucination and answer truthfully
…
(See the plot on break-out performance below for a full list)
Some Context:
If you have played around with ChatGPT or any of the other LLMs, you were likely as impressed as I was. However, you have probably also seen the models go off the rails here and there. The model might hallucinate gibberish, give untrue answers, or fail at performing basic math.
Why does this happen?
LLMs are commonly trained by maximizing the likelihood over all tokens in a body of text. Put more simply, they learn to predict the next word in a sequence of words.
Hence, if such a model learns to do any math at all, it learns it by picking up the concepts expressed in human language (and thereby math).
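To make this objective concrete, here is a minimal sketch of next-token prediction in PyTorch. The tiny vocabulary, the embedding-plus-linear "model", and all hyperparameters are invented for illustration; real LLMs optimize the same cross-entropy loss, just with transformers over billions of tokens.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and "corpus" (hypothetical, for illustration only).
vocab = ["the", "sum", "of", "two", "plus", "is", "four"]
stoi = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([stoi[w] for w in
                       ["the", "sum", "of", "two", "plus", "two", "is", "four"]])

# A deliberately tiny "language model": an embedding followed by a linear layer.
emb = torch.nn.Embedding(len(vocab), 16)
head = torch.nn.Linear(16, len(vocab))
opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(200):
    # Predict token t+1 from token t (real models condition on the whole prefix).
    logits = head(emb(tokens[:-1]))
    # Maximizing likelihood == minimizing cross-entropy on the next token.
    loss = F.cross_entropy(logits, tokens[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
```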
Let's look at the following sentence.
"The sum of two plus two is ..."
The model figures out that the most likely missing word is "four".
The fact that LLMs learn this at all is mind-bending to me! However, once the math gets more complicated, LLMs begin to struggle.
There are many other cases where the models fail to capture the elaborate interactions and meanings behind words. One other example is that of words that change their meaning with context: when the model encounters the word "bed", it needs to figure out from the context whether the text is talking about a "river bed" or a "bed" to sleep in.
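You can watch this disambiguation at work in any modern contextual model. The sketch below uses BERT (a smaller cousin of today's LLMs) via Hugging Face's transformers library to compare the embedding of "bed" in two contexts; the sentences are made up, and the exact similarity value will vary by model.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bed_vector(sentence):
    # Return the contextual embedding of the token "bed" in the sentence.
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bed"))
    return hidden[idx]

v_sleep = bed_vector("I fell asleep in my bed.")
v_river = bed_vector("The river bed dried out in the summer.")
print(torch.cosine_similarity(v_sleep, v_river, dim=0))  # noticeably below 1.0
```

If "bed" had a single fixed meaning, the two vectors would be identical; the gap between them is the model encoding context.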
What they discovered:
For smaller models, performance on the challenging tasks outlined above remains roughly at the level of random guessing. However, performance shoots up once a certain number of training FLOPs is reached.
The figure below visualizes this effect on eight benchmarks. The critical number of training FLOPs is around 10^23. The big version of GPT-3 already lies to the right of this point, but we seem to be at the beginning stages of performance increases.
Break-Out Performance At Critical Scale
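To see what such a break-out looks like, the snippet below plots a curve with the same qualitative shape using entirely made-up numbers: accuracy sits near random chance and then takes off around 10^23 training FLOPs. It is a toy illustration, not the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented numbers that mimic the qualitative shape of the paper's curves.
flops = np.logspace(19, 25, 50)  # training compute (FLOPs)
chance = 0.25                    # accuracy of random guessing
accuracy = chance + (0.9 - chance) / (1 + np.exp(-4 * (np.log10(flops) - 23)))

plt.semilogx(flops, accuracy)
plt.axvline(1e23, linestyle="--", label="~10^23 FLOPs")
plt.xlabel("Training FLOPs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```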
They observed similar improvements for (few-shot) prompting strategies, such as multi-step reasoning and instruction following. If you are interested, I also encourage you to check out Jason Wei's blog, where he lists a total of 137 emergent abilities.
Looking at the results, one could be forgiven for thinking that simply making models bigger will make them more powerful. That would only be half the story.
(Language) models are primarily scaled along three dimensions: the number of parameters, the amount of training compute, and the dataset size. Hence, these sudden jumps in performance are likely to also occur with, for example, bigger and/or cleaner datasets.
There is other research suggesting that current models, such as GPT-3, are undertrained. Therefore, scaling datasets promises to boost performance in the near term without using more parameters. To me, it is not even clear whether these emergent jumps mean anything more than that current LLMs are simply not yet properly fit to the data.
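For a rough sense of where GPT-3 sits on the compute axis, a common rule of thumb from the scaling-law literature estimates training compute as C ≈ 6 · N · D FLOPs for N parameters and D training tokens. Plugging in GPT-3's widely reported figures lands just past the 10^23 break-out point from the figure above; treat this as a back-of-the-envelope estimate, not an exact number.

```python
def train_flops(n_params, n_tokens):
    # Rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

# GPT-3: ~175B parameters trained on ~300B tokens (publicly reported figures).
print(f"{train_flops(175e9, 300e9):.2e}")  # ~3.15e+23 FLOPs, just past 10^23
```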
So what does this mean exactly?
This beautiful paper shines a light on the fact that our understanding of how to train these large models is still very limited. The lack of understanding is largely due to the sheer cost of training LLMs: running the same number of experiments that people run for smaller models would cost hundreds of millions of dollars.
I understand how this paper became so popular. It perfectly fits a narrative of “OMG AI will suddenly explode in smartness and we won’t know what to do”.
Whether that is true remains to be seen. My hunch is that these "jumps" in performance are exactly what we should expect. If a model learns skill A and skill B, and their combination enables it to perform skill C, you would expect a sudden jump. However, this does not mean that performance suddenly goes through the roof. It rather implies that performance gains will not track a smooth line but move along a more jittery path.
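Here is a toy way to make that intuition concrete: if a task only succeeds when a whole chain of sub-skills succeeds, smooth progress on each sub-skill still shows up as an abrupt-looking jump on the task. All numbers below are invented purely for illustration.

```python
import numpy as np

# Toy numbers: per-step accuracy improves smoothly with scale...
scale = np.logspace(20, 24, 9)               # training FLOPs (invented)
per_step = 1 / (1 + (1e22 / scale) ** 1.5)   # smooth S-curve in log-compute
# ...but a task that chains 10 such steps only works once every step does.
k = 10
task_accuracy = per_step ** k

for s, p1, pk in zip(scale, per_step, task_accuracy):
    print(f"{s:.0e} FLOPs: step acc {p1:.2f} -> {k}-step task acc {pk:.3f}")
```

Each sub-skill improves gradually, yet the chained task stays near zero for most of the range and then shoots up, which is exactly the jittery path described above.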
Either way, the results strongly hint that we still have a lot to discover. Improvements in our understanding of how to train these models, together with further scaling, will continue the exhilarating performance gains of the last years.
Such exciting times to be alive!
As always, I really enjoyed making this for you and I sincerely hope you found it useful!