
A Simple Trick That Improves Reasoning In LLMs

Reduce errors by 10%!


Some questions are easier than others.

What color is the sky when the sun is out?

This does not require us to do a whole lot of thinking. It’s trivial to us. The same goes for an LLM.

It can simply output the answer.

However, asking a question such as “What color is the sky across the hours of the day?” is much harder.

The color will change depending on the weather, the time of day, humidity, and a gazillion other factors. If a human tried to answer this question, she would break the reasoning into multiple steps.

For example:

  • How does the color of the sky change when the angle of the sun changes relative to the Earth?

  • How does humidity influence this change?

Then, a human would try to draw upon physical principles to explain each of these effects. Lastly, she would draw general conclusions based on these principles before working her way across different levels of abstraction to arrive at a satisfactory answer.

Boom.

“The sky goes from blue to red to black, whereas the red is more pronounced on humid days.”

This multi-step reasoning is one of the hardest challenges for LLMs to date.

Most of the errors a model makes on benchmarks happen during this kind of reasoning. Different methods, such as chain-of-thought prompting, have been used to address these shortcomings.

In a recent paper, researchers from Google DeepMind proposed asking so-called step-back questions to improve a model’s reasoning abilities.

In the following, we will look at how this works and see how we can use it to improve our workflows.

Let’s start at the beginning! But first, a quick word from our sponsor:

MaxAI.me - Maximize Efficiency with AI-Powered Browsing

Transform your browsing with MaxAI.me! Instantly summarize articles, videos, and PDFs; perfect your writing across any webpage with a single click; craft emails; and search the web with AI. Loved by 1M+ users and ranked #1 of the day and the week on ProductHunt, it's your gateway to effortless online efficiency. No more tedious tasks, just seamless, AI-powered magic. Install now and join the revolution of smart browsing!

What Is A Step-Back Question?

The authors define a step-back question as a question at a higher level of abstraction that is derived from the original question.

Let’s say we want a model to answer our sky-color question from above. A step-back question would ask the model to provide us with a list of all possible colors the sky can have.

So far so simple.

But why is this helpful?

The authors argue that step-back questions work because they are typically much easier to answer, and answering them first yields helpful abstractions. Grounding the final answer in these abstractions helps avoid reasoning errors in the intermediate steps.

Here is how the authors suggest implementing this strategy.

How To Do Step-Back Prompting

In the paper, the authors suggest performing two distinct steps:

  1. Abstraction: The model is prompted to ask a generic step-back question about the concepts that underlie the original question, such as: “What are the physics principles involved in solving this problem?”

  2. Reasoning: In this second step, the model is asked the original question. In doing so, it is provided with the output of the step-back prompt. The authors term this step Abstraction-grounded Reasoning because the model can reason about the solution using information about the high-level concept or principle. A minimal sketch of both steps follows below.
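
To make this concrete, here is a minimal sketch of the two steps in Python. The `call_llm` helper and the prompt wording are stand-ins of my own (the paper prompts PaLM-2L with few-shot exemplars), so treat this as an illustration rather than the authors' exact setup.

```python
# Minimal sketch of step-back prompting.
# NOTE: `call_llm` is a hypothetical helper, not from the paper or any specific
# library. Wire it up to whatever LLM API you use; it should take a prompt
# string and return the model's text response.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Connect this to your LLM provider of choice.")


def step_back_answer(question: str) -> str:
    # Step 1: Abstraction. Ask a generic, higher-level question about the
    # principles behind the original question.
    abstraction_prompt = (
        "Here is a question:\n"
        f"{question}\n\n"
        "Instead of answering it directly, state the general principles or "
        "concepts needed to answer it, and explain each one briefly."
    )
    principles = call_llm(abstraction_prompt)

    # Step 2: Abstraction-grounded reasoning. Answer the original question,
    # grounding the reasoning in the principles from step 1.
    reasoning_prompt = (
        f"Principles:\n{principles}\n\n"
        "Using these principles, answer the original question step by step:\n"
        f"{question}"
    )
    return call_llm(reasoning_prompt)


# Example usage with the sky-color question from above:
# print(step_back_answer("What color is the sky across the hours of the day?"))
```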

Before we wrap up, I would like to emphasize another neat aspect of the paper.

Categorizing Errors In Step-Back Results

Comparing the results from step-back prompting against the PaLM-2L baseline shows that the method corrects 20.5% of the baseline's errors while introducing 11.9% new ones.

So, a net positive of roughly 10%.
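
As a quick back-of-the-envelope check (my own arithmetic from the two percentages above, not a separate figure from the paper):

```python
# Back-of-the-envelope check on the error-analysis numbers quoted above.
errors_fixed = 0.205       # share of baseline errors corrected by step-back prompting
errors_introduced = 0.119  # share of new errors introduced by step-back prompting
net_improvement = errors_fixed - errors_introduced
print(f"Net error reduction: {net_improvement:.1%}")  # ~8.6%, i.e. roughly 10%
```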

To further understand what types of errors the model made, the authors annotated the incorrect outputs, splitting them into five different classes.

Figure: Error categorization for step-back prompting (from the paper).

Four of the five error types happen during the reasoning step.

Combined, they make up over 90% of all errors made by the model. Less than 10% of errors are caused by the model creating the wrong abstractions (Principle Error) during step-back prompting.

So, reasoning remains the bottleneck.

I would have liked to see a comparison of the error distribution with and without step-back prompting. However, a net reduction in errors of 10% is still great for such a simple method.

I dearly hope this gave you some food for thought. Go ahead and check out the original paper. It’s a fun read.

If you have feedback or questions, send me a reply or hit me up on Twitter or LinkedIn.

Lots of love and see you next time!

P.S. If you found this useful, please share it with a friend or subscribe here ⭕️ if you haven’t already.