Gorilla LLM: Redefining Language Models

RUiNtheExtinct
11 min read · Oct 20, 2023


Gorilla LLM is one of the latest innovations in the world of language models, and it promises to reshape how we interact with artificial intelligence. In this article, I’ll explore what Gorilla LLM is, how it functions, its distinctions from ChatGPT, and its practical applications.

A short demo of Gorilla LLM’s content generation capabilities

What is Gorilla LLM?

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in various tasks, such as mathematical reasoning and code generation. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today’s state-of-the-art LLMs such as GPT-4, largely because they struggle to generate accurate input arguments and tend to hallucinate the wrong usage of an API call. Gorilla LLM, a fine-tuned LLaMA-based model released to face that challenge head-on, surpasses GPT-4 at writing API calls.

When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible API updates and version changes. Gorilla also substantially mitigates the hallucination commonly encountered when prompting LLMs directly. To evaluate the model’s ability, Gorilla LLM uses APIBench, a comprehensive dataset consisting of 1,600+ HuggingFace, TorchHub, and TensorFlow Hub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla models and source code are available here.

Gorilla is an LLM that can provide appropriate API calls. Zero-shot Gorilla outperforms GPT-4, ChatGPT, and Claude. Gorilla is extremely reliable and significantly reduces hallucination errors in making API calls.

Responses when the given prompt was for speech recognition

Example API calls generated by GPT-4 and Gorilla when the given prompt is speech recognition. In this example, GPT-4 presents a model that doesn’t exist. In contrast, Gorilla identifies the task correctly and suggests a fully qualified API call.

How Does Gorilla LLM Work?

Gorilla LLM’s operation is underpinned by its API augmentation. It leverages APIs to access real-time information, enabling it to deliver highly relevant and up-to-date responses. This capability is achieved through retrieval-aware fine-tuning, which allows it to interact with a variety of deep-learning models, making Gorilla LLM incredibly versatile.

Now, there are a few steps that the model has to take.

First, it needs to understand the natural language that you are using. To deduce which API to use from all available APIs, it has to pick one that can perform the operation you’re requesting.

Next, the model has to accurately figure out the inputs to this API call. This is where it becomes very tricky: in a lot of cases, there are very similar APIs with different inputs, and this is where most current LLMs fail. To summarize the whole process, they fine-tuned a LLaMA 7-billion-parameter model, and they are now calling it the Gorilla 7-billion-parameter model.

The model is fine-tuned specifically for making API calls. There was a filtering process for how they selected these APIs: they kept the ones whose model cards had complete information. Then, for each of these API calls, they created instruction–API pairs.

Basically, Gorilla is an instruction fine-tuned model trained on these instruction and API pairs. Interestingly, this dataset was generated using GPT-4. For inference, they used an information retrieval system, very similar to retrieving information from your PDF files: when the user provides an input query, the system looks at all the APIs and picks the one closest to it. They tried different retrievers here, including one based on GPT-4. Once you have the API, the Gorilla model, based on its knowledge, creates the inputs for that API call. That, in short, is how the whole thing works. Based on the reported results, this approach gives around 20% better results than GPT-4 and, surprisingly, only 10% better than ChatGPT. So it seems ChatGPT is better than GPT-4 at making these API calls.
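
The retrieval-then-prompt flow described above can be sketched in a few lines. Everything here is illustrative: the API entries, the token-overlap scorer (standing in for the real retrievers, e.g. the GPT-based one), and the prompt template are assumptions, not Gorilla's actual implementation.

```python
# Illustrative sketch of the retrieval step: score every API description
# against the user query and keep the closest match. The API docs and the
# token-overlap scorer are stand-ins for Gorilla's real retrievers.

API_DOCS = {
    "openai/whisper-large": "automatic speech recognition transcribe audio",
    "facebook/detr-resnet-101": "object detection in images",
    "bigcode/starcoder": "code generation and completion",
}

def retrieve_api(query: str) -> str:
    """Return the API whose description shares the most tokens with the query."""
    q_tokens = set(query.lower().split())
    def score(item):
        _, desc = item
        return len(q_tokens & set(desc.split()))
    best_api, _ = max(API_DOCS.items(), key=score)
    return best_api

def build_prompt(query: str) -> str:
    """Prepend the retrieved API doc to the query, roughly what
    retrieval-aware fine-tuning expects at inference time."""
    api = retrieve_api(query)
    return (f"Use this API documentation for reference: "
            f"{api}: {API_DOCS[api]}\n{query}")

print(retrieve_api("I need speech recognition for an audio file"))
# prints "openai/whisper-large"
```

A real retriever would use embeddings or BM25-style ranking over the full API documentation rather than raw token overlap, but the overall shape (retrieve the closest doc, then condition the model on it) is the same.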

In summary, it’s a language model trained on API calls rather than plain language, which is why they call it an API App Store for LLMs. On making correct API calls, it beats GPT-4 by around 20%.

How is Gorilla LLM Different from ChatGPT?

The niche that Gorilla LLM caters to in this wild west of LLMs

First, let’s talk about why Gorilla is even important. Language models by themselves cannot interact with the physical world, so they need external tools. ChatGPT, for example, has access to plugins, which are essentially API calls to external tools. If you’re familiar with LangChain, it has a concept of agents: agents can use tools to interact with the external world, and those tools are accessed through API calls. However, the number of tools inside LangChain is very limited. That’s where Gorilla comes in. Using this approach, a platform like LangChain gains access to an order of magnitude more APIs or tools than it previously had.

Gorilla LLM distinguishes itself from ChatGPT in two key ways. Firstly, it excels in handling API calls, offering better performance in this aspect. Secondly, Gorilla LLM can seamlessly integrate with specialized deep learning models, making it adaptable to a wide range of tasks and domains, whereas ChatGPT may not have the same level of versatility.

Paired with a document retriever, Gorilla showcases impressive adaptability to changes in documents, facilitating smooth updates to APIs and their versions. This blend of retrieval systems and Gorilla’s core capabilities points to a future where LLMs can utilize tools with higher precision, stay updated with dynamic documentation, and consistently produce reliable outputs.

How Can We Use Gorilla LLM?

Running Gorilla locally

  • The instructions to run Gorilla LLM locally can be found here.
  • You can run the Gorilla Colab notebook. For any further instructions regarding running that notebook, you may consult this video.
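
If you do serve the model locally, queries are typically sent through an OpenAI-style chat endpoint. The sketch below only assembles such a request; the base URL, port, and model name are placeholders for whatever your local server exposes, not official Gorilla values.

```python
# Hedged sketch: build a chat-completions request for a locally served
# Gorilla model. The base URL and model name are placeholders; substitute
# whatever endpoint your own server exposes.
import json
import urllib.request

def build_request(prompt: str,
                  model: str = "gorilla-7b-hf-v1",
                  base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Assemble an OpenAI-style chat-completions request for the prompt."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("I would like to translate English to Japanese.")
# urllib.request.urlopen(req)  # only works with a server actually running
```

The actual call is left commented out since it requires a running server; the point is the shape of the payload.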

Various Gorilla Models

Gorilla provides five different models.

  • gorilla-7b-hf-v1: Finetuned on LLaMA-7B, it chooses from 925 Hugging Face APIs in a 0-shot fashion (without any retrieval).
  • gorilla-7b-th-v0: Finetuned on LLaMA-7B, it chooses from 94 (exhaustive) APIs from Torch Hub.
  • gorilla-7b-tf-v0: Finetuned on LLaMA-7B, it chooses from 626 (exhaustive) APIs from TensorFlow Hub.
  • gorilla-mpt-7b-hf-v0: Finetuned on MPT-7B, it chooses from 925 Hugging Face APIs in a 0-shot fashion (without any retrieval).
  • gorilla-falcon-7b-hf-v0: Finetuned on Falcon-7B, it chooses from 925 Hugging Face APIs in a 0-shot fashion (without any retrieval).

The first 3 models are all based on LLaMA, so you can’t really use them for commercial purposes due to LLaMA’s license.

However, gorilla-mpt-7b-hf-v0 and gorilla-falcon-7b-hf-v0 are Apache 2.0 licensed models (commercially usable) fine-tuned on MPT-7B and Falcon-7B respectively.

All gorilla weights are hosted here.

For the gorilla-7b-hf-v1, gorilla-7b-tf-v0, and gorilla-7b-th-v0 models, only their delta weights are available, but you can get the actual weights by following the instructions provided here.

Out of the 5 models, gorilla-7b-hf-v1 has shown the best performance on the non-commercial side of things, and gorilla-mpt-7b-hf-v0 on the commercial side.

The gorilla-falcon-7b-hf-v0 model has been outperformed by the gorilla-mpt-7b-hf-v0 model in most scenarios that I have tested. Here are some of them:

  • With the prompt “write an essay for me on apples”, the output was the same, but the Falcon model took about 15 s longer to produce it.
  • The results were similar for multiple prompts of the type “translate <some text> from <language-1> to <language-2> for me”. While both outputs were correct, the Falcon-based model took 10–15 s longer on average to return the results.
  • With the prompt “draw an anime-style image for me”, the MPT model called the lllyasviel/control_v11p_sd15s2_lineart_anime model while the Falcon model called stabilityai/stable-diffusion-xl-base-1.0. While both models can generate images, the former does a better job for this prompt because it is fine-tuned on lineart_anime images.

Between the 3 LLaMA-based models, all take about the same amount of time to generate the output, with an error margin of 1–2 seconds. But in terms of the quality and validity of the output, gorilla-7b-hf-v1 has the other two beat, simply due to the larger collection of APIs from different domains available to it.

So keeping the above factors in mind, for the comparison between the commercially non-viable and viable models, I will be using gorilla-7b-hf-v1 and gorilla-mpt-7b-hf-v0 respectively as the benchmarks.

While the gorilla-7b-hf-v1 model was about 1–2 s faster at generating output than the gorilla-mpt-7b-hf-v0 model in some cases, the difference is negligible in most.

Now, in terms of output generated, gorilla-7b-hf-v1 calls the correct APIs on the first try in the majority of cases, while for the gorilla-mpt-7b-hf-v0 model:

  • It called the optimal API for prompts from the domains of text generation and translation on the first try in most cases, but with a small caveat that I will discuss below.
  • For prompts related to image and video generation, it took me about 3–5 tries on a lot of prompts to get the best-suited API for the type of content I was going for, with the video generation prompts giving the most inaccurate results.
  • Now, the caveat I mentioned earlier: in some cases (about 40%), the output was incomplete, i.e., it was cut off at, say, just calling the API and not doing anything with the result.

How do I add my own APIs?

Now comes the most important question: if Gorilla LLM distinguishes itself by being the best at finding the right API to call based on your prompt, can you add your own APIs to it?

The answer is yes. You can, but you might not enjoy the process, as it is not very streamlined and the model only supports APIs from the previously mentioned 3 platforms.

To contribute, you have to submit an API JSON file or a URL JSON file following a particular format to the Gorilla repository and then raise a Pull Request. But there is no guarantee that your PR will be accepted or even seen.

You can learn how to add your own APIs here.

The question that follows: if it's so difficult to add my own APIs, can I train my own custom version of Gorilla LLM?

No, you can’t, as the Gorilla team has yet to release their training code. But if their repository is to be believed, they will be doing so soon enough, so fingers crossed.

Using Gorilla with Langchain, Toolformer, AutoGPT, etc.

Gorilla is an end-to-end model, specifically tailored to serve correct API calls without requiring any additional coding. It’s designed to work as part of a wider ecosystem and can be flexibly integrated with other tools.

LangChain is a versatile developer tool. Its “agents” can efficiently swap in any LLM, Gorilla included, making it a highly adaptable solution for various needs.
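
The swap-in point can be illustrated with a toy agent. To be clear, this is not LangChain's actual agent API; it is a hypothetical minimal loop showing that an agent only needs a callable mapping a prompt to an API call, so any backend (Gorilla included) can be plugged in. The `fake_gorilla` stub stands in for a real model query.

```python
# Hypothetical minimal "agent" (NOT LangChain's real API): the agent only
# depends on a prompt -> api-call callable, so any LLM backend can be
# swapped in, Gorilla included.
from typing import Callable

class MiniAgent:
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm  # any prompt -> api-call-string backend

    def run(self, task: str) -> str:
        """Ask the backend for the right API call for this task."""
        return self.llm(f"Find the right API call for: {task}")

# Stub standing in for a real Gorilla query.
def fake_gorilla(prompt: str) -> str:
    return "pipeline('automatic-speech-recognition', model='openai/whisper-large')"

agent = MiniAgent(llm=fake_gorilla)
print(agent.run("transcribe this audio file"))
```

Replacing `fake_gorilla` with a real request to a served Gorilla model is the whole integration story; nothing else in the agent needs to change.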

AutoGPT, on the other hand, concentrates on the art of prompting GPT-series models. It’s worth noting that Gorilla, as a fully fine-tuned model, consistently shows remarkable accuracy and lower hallucination, outperforming GPT-4 in making specific API calls.

Toolformer, meanwhile, zeroes in on a select set of tools, providing specialized functionalities. Gorilla, in contrast, can manage thousands of API calls, offering broader coverage over a more extensive range of tools.

The beauty of these tools truly shines when they collaborate, complementing each other’s strengths and capabilities to create an even more powerful and comprehensive solution. This is where your contribution can make a difference. Gorilla LLM enthusiastically welcomes any input to further refine and enhance these tools.

Weaknesses

As seen above in how Gorilla works, given an input in natural language, it interprets the input, selects the domain closest to it, and calls the API in that domain closest to the prompt.

For example, if you want to, say, “Write an article”, or “Translate text from English to Japanese”, it can do it just fine.

But if you want to “Write an article in English and then translate it to Japanese”, it will fail, because the problem belongs to multiple domains. That is, Gorilla can only respond to single-domain prompts.

You can, in some cases, engineer the prompt to work around this issue, but the results are suboptimal in most cases. The problem can also be solved if the API itself can handle multi-domain functionality.
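
One such prompt-engineering workaround can be sketched as splitting a multi-domain request into single-domain sub-prompts and querying Gorilla once per sub-prompt. Both the naive splitter and the `call_gorilla` stub below are hypothetical; a real splitter would need far more robust parsing.

```python
# Hedged sketch of the workaround: split a chained, multi-domain request
# on "and then" into single-domain sub-prompts, then query Gorilla once
# per piece. The splitter is a naive heuristic and call_gorilla is a stub.

def split_prompt(prompt: str) -> list[str]:
    """Split a chained request into single-domain sub-requests."""
    return [part.strip() for part in prompt.split(" and then ")]

def call_gorilla(sub_prompt: str) -> str:
    """Stub standing in for a real Gorilla query."""
    return f"<API call for: {sub_prompt}>"

prompt = "Write an article in English and then translate it to Japanese"
for sub in split_prompt(prompt):
    print(call_gorilla(sub))
```

You would still have to wire the output of the first API call into the input of the second yourself, which is exactly the orchestration gap the text describes.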

Furthermore, it only gives output in the form of Python code. To get the actual desired output, you need to run the Python code returned by Gorilla LLM, and if you lack the minimum hardware required to run that code, you will not get your result.
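
That last step looks roughly like the sketch below: Gorilla hands you Python source and you execute it yourself. The returned snippet here is a trivial stand-in; real ones load deep-learning models and need the corresponding hardware (and you should review generated code before executing it).

```python
# Minimal sketch of the final step: Gorilla returns Python source, which
# you must run yourself to get the actual result. The snippet below is a
# harmless stand-in; real outputs load heavy deep-learning models.

gorilla_output = "result = 'transcribed text placeholder'"

namespace: dict = {}
exec(gorilla_output, namespace)  # run the generated code in an isolated dict
print(namespace["result"])
# prints "transcribed text placeholder"
```

In practice you would sandbox this execution rather than use a bare `exec`, precisely because the generated code is untrusted and resource-hungry.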

Finally, the biggest weakness of Gorilla LLM is the limited number of APIs that it can call and, as mentioned in previous sections, the lack of streamlined support for either adding your own APIs or training/fine-tuning your own custom model.

With these weaknesses in mind, although Gorilla LLM as a model has great potential, it is very limited in scope as of now.

Gorilla Use Cases

Gorilla LLM has a myriad of potential applications. Its ability to handle API calls efficiently makes it invaluable for tasks requiring access to multiple APIs, such as code generation, image classification, video auto-annotation, and customer support.

Additionally, its adaptability to different domains makes it suitable for content generation, code assistance, and more.

Here are some real-life applications for the above-mentioned domains that Gorilla LLM can be used for:

1. Code Generation:

  • On the variety of code generation tasks I tested it on, Gorilla called upon the bigcode/starcoder and lmsys/vicuna-7b-v1.3 models. You might notice that Vicuna is not as good for coding tasks as models like StarCoder, but Gorilla LLM currently has no way to evaluate the quality of its output, so the model it chooses is out of the end user’s control.

2. Computer Vision:

  • For computer vision use cases, Gorilla LLM called the facebook/dino-vits8, facebook/detr-resnet-101-dc5, YOLO, and a few other APIs hosted on HuggingFace, as appropriate for the CV task given in the prompt, be it object tracking, detection, etc.

3. Video Auto-Annotation:

  • Gorilla called APIs such as openai/whisper-large for prompts mentioning speech recognition and models like naver-clova-ix/donut-base for those mentioning captions and annotation. Depending on the prompt, it thus produced different types of models: the former generates text based on the audio, while the latter generates captions based on the frames of the video.
  • Thus using Gorilla requires some degree of prompt engineering and technical knowledge.

4. Content Generation:

  • As shown in the small demo above, Gorilla called appropriate models for different content-generation tasks.

The integration of an LLM with external APIs can revolutionize many industries by automating tasks, enhancing accuracy, and providing real-time solutions.

Conclusion

In conclusion, Gorilla LLM represents a significant leap forward in the field of language models. Its API-driven approach and versatility make it a powerful tool for a wide range of applications and carve out a niche for itself. As we continue to explore its capabilities, Gorilla LLM is poised to reshape how we interact with AI across various domains. However, it is limited by its small selection of APIs and the need for highly specific prompts to get your job done.
