
The Gorilla Paper Changes How LLMs Use Tools

From a few hard-coded tools to the vast space of cloud-based APIs.


LLMs are fundamentally limited.

They are built from a fixed set of weights.

As a result, their knowledge is frozen at training time. To work around this, vendors offer more and more plugins that let their models use external tools through APIs. This enables the models to perform more complex tasks using Google, Python, or translation services.

What is changing?

Gorilla marks the transition from using a few hard-coded tools to opening LLMs up to use the vast space of cloud-based APIs. If we extrapolate this path into the future, LLMs could become the primary interface to compute infrastructure and the web.

Gorilla vs. GPT-4 and Claude

This sounds quite lofty. And it is!

There is surely a long way to go before we get there, but this week’s paper takes an exciting first step in the right direction.

Let’s check it out!

But before we jump in, a quick word from our sponsor. It is the email platform I use to send you this post, and I can highly recommend it. Please check them out if you ever consider starting a newsletter yourself!

Quit sending emails like a dinosaur.

It’s the year 2024 and all the top newsletters are using beehiiv.

beehiiv was created by the same early Morning Brew employees who scaled their daily email to over 4 million subscribers. And now every newsletter on beehiiv has access to the same tools and winning formula.

So what exactly does beehiiv offer?

  • World-class growth tools like the referral program and recommendation network

  • Monetization via the beehiiv Ad Network and premium subscriptions (i.e. beehiiv helps you get paid)

  • Seamless content creation with a sleek collaborative editor

  • Best-in-class inbox deliverability of 98.7%

  • Oh and it’s the most affordable by a mile…

Take your newsletter to the next level — get started for free.

Motivation: Why was it still hard for models to use tools?

State-of-the-art LLMs such as GPT-4 struggle to generate accurate API calls.

This often happens due to their tendency to hallucinate. Further, much of the prior work on tool use in language models has focused on integrating a small set of well-documented APIs into the model.

In essence, the API documentation was just dumped into the prompt and then the model was asked to generate an API call.

This approach is limited. Very limited.

It is impossible to fit all of the world’s APIs into the model’s context window. So, to eventually integrate a model with millions of tools, a completely different approach is needed.
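To make the limitation concrete, here is a minimal sketch of that prompt-stuffing approach. The build_prompt helper and the abbreviated doc snippets are made up for illustration; they are not from the paper.

# Naive prompt-stuffing: paste every API's documentation into the prompt
# and hope the model picks the right call. Works for a handful of tools,
# but breaks down long before "all of the world's APIs".
API_DOCS = {
    # hypothetical, abbreviated documentation snippets
    "torch.hub.load": "torch.hub.load(repo, model, *args): load a model from a GitHub repo.",
    "translate": "translate(text, target_lang): translate text via an external service.",
}

def build_prompt(user_request: str) -> str:
    docs = "\n".join(f"- {name}: {doc}" for name, doc in API_DOCS.items())
    return (
        "You can call the following APIs:\n"
        f"{docs}\n\n"
        f"User request: {user_request}\n"
        "Respond with a single API call."
    )

prompt = build_prompt("Load an uncased BERT model")
# Every additional API adds tokens to every single prompt, so the context
# window, not the model, becomes the bottleneck.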

Here is where Gorilla comes in!

What is Gorilla?

In a sentence, Gorilla is a finetuned LLaMA-based model that writes API calls better than GPT-4 does.

Let’s break down their approach to understand what they actually did.

How Does It Work?

Their approach can be broken down into three steps.

First, they constructed a sizable dataset of API calls and their documentation. Then, they used the self-instruct method to simulate a user instructing the model to use these APIs. Last, they finetuned LLaMA on their data and did several interesting experiments to investigate how much a retriever could boost performance.

Let’s zoom in on each of the three points to get a better understanding.

Their dataset (APIBench) was created by scraping APIs from public model hubs (TorchHub, TensorHub and HuggingFace). After several cleaning and filtering steps, this resulted in a dataset with more than 1700 documented API calls.
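For intuition, a cleaned entry might be normalized into a small record like the one below. The field names are my assumption, not the published APIBench schema.

# Hypothetical shape of one documented API call after cleaning and filtering.
# Field names are illustrative; see the released APIBench data for the real schema.
api_entry = {
    "hub": "TorchHub",  # TorchHub, TensorHub, or HuggingFace
    "api_call": "torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')",
    "documentation": "Load a pretrained BERT (uncased) model via the pytorch-transformers repo.",
}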

The second step is where the actual training dataset was built.

They used self-instruct to build instruction-API pairs. In plain English, this means they showed each of the documented APIs to GPT-4. Then they asked the model to generate 10 potential real-world use cases that would result in a call to that API.

Let’s look at an example!

If GPT-4 were presented with the following API call:

model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')

It would output an instruction such as: “Load an uncased BERT model”.
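Conceptually, the self-instruct step looks something like the sketch below. The ask_gpt4 helper is a hypothetical stand-in for whatever GPT-4 client the authors used, and the prompt wording is my own.

# Sketch of the self-instruct step: for each documented API, ask GPT-4 for
# realistic user instructions, then pair every instruction with that API call.
def ask_gpt4(prompt: str) -> list[str]:
    """Hypothetical wrapper around a GPT-4 call; returns one instruction per line."""
    raise NotImplementedError  # plug in your client of choice here

def build_pairs(api_entry: dict, n: int = 10) -> list[dict]:
    prompt = (
        "Here is an API call and its documentation:\n"
        f"{api_entry['api_call']}\n{api_entry['documentation']}\n\n"
        f"Write {n} real-world user instructions that this API call would satisfy."
    )
    return [{"instruction": instr, "api_call": api_entry["api_call"]}
            for instr in ask_gpt4(prompt)]

# One resulting pair could be:
# {"instruction": "Load an uncased BERT model",
#  "api_call": "torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')"}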

Almost done.

Just one more step to go.

In the third and final step, they finetuned a LLaMA model on the {instruction, API} pairs. The resulting model outperformed GPT-4, ChatGPT, and base LLaMA by 20%, 10%, and 83%, respectively.

Not bad.

However, these results only refer to using the model in a zero-shot manner. The model did not have any access to additional API documentation.
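As an illustration, one {instruction, API} pair could be rendered into a prompt/target pair roughly like this. The template is my assumption; the paper’s exact formatting may differ.

# Hypothetical rendering of one fine-tuning example. In the zero-shot setting,
# the prompt contains only the instruction -- no API documentation.
pair = {
    "instruction": "Load an uncased BERT model",
    "api_call": "torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')",
}
prompt = f"### Instruction:\n{pair['instruction']}\n\n### API call:\n"
target = pair["api_call"]
# The model learns to continue `prompt` with `target`; at inference time it
# sees only a new instruction and must produce the API call from memory.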

This raises the question: What happens if we integrate a retriever and give the model access to up-to-date API documentation?

The Integration Of Retrievers And Some Criticism

APIs change all the time.

This is a challenge for any model because the frequency of API updates is likely to outpace any retraining schedule. That makes tool use particularly susceptible to changes in the very APIs the model is supposed to call.

With arguments like this one, the authors drive home the point that retrievers are likely to play an important part in tool use by LLMs. To approach this challenge, they trained LLaMA with different retrievers in the loop.

With mixed results.

They found that using a so-called oracle retriever, which always provides the correct piece of documentation, greatly boosted performance.

That’s not really a surprise if you ask me.

However, using standard retrievers such as BM25 and GPT-Index was shown to degrade performance by double-digit percentages. The authors conclude that using a sub-par retriever tends to confuse the model more than it helps.

This is where I have to disagree slightly with their otherwise great approach. They say that they only include the top-1 result from the retriever.

That makes no sense to me.

Everyone who has ever worked with information retrieval knows that it is unrealistic to expect the retriever to return the correct paragraph in the top position every time. If they had included the top 10 or top 50 results from the retrieval step, the results might have looked very different.
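To make that concrete, here is a rough sketch of top-k retrieval over the API documentation with BM25, using the rank_bm25 package. Feeding the model the top 10 candidates instead of only the top 1 is my suggestion, not something the paper evaluates.

# Rough sketch: retrieve the k highest-scoring documentation snippets with BM25
# and prepend all of them to the prompt instead of only the single best hit.
from rank_bm25 import BM25Okapi

docs = [
    "torch.hub.load(repo, model, ...): load a model from a GitHub repo.",
    "transformers.pipeline(task, model=...): create an inference pipeline.",
    # ... one entry per documented API
]
bm25 = BM25Okapi([d.lower().split() for d in docs])

def retrieve(query: str, k: int = 10) -> list[str]:
    return bm25.get_top_n(query.lower().split(), docs, n=k)

context = "\n".join(retrieve("Load an uncased BERT model"))
# With k > 1, the correct snippet only has to land somewhere in the top k,
# a much easier bar for an off-the-shelf retriever to clear.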

I guess we can’t always get what we want. But I still wonder why they did that.

Before we wrap up, let’s end on a positive note!

I love the fact that they made their dataset APIBench publicly available. In an increasingly closed-source world, I am always delighted to see such acts of kindness to the community!

If you have feedback or questions, hit reply on this email or connect with me on Twitter or LinkedIn.

Lots of love and see you next time!

P.S. If you found this useful, please, share it with a friend or subscribe here ⭕️ if you haven’t already.