Ollama: from zero to running an LLM in less than 2 minutes!

Diego Coy

The Artificial Intelligence (AI) field has been fueled by open source initiatives from the very beginning: from the data sets used in model training, to frameworks, libraries, and tooling, to the models themselves. These initiatives have mainly focused on empowering researchers and a subset of experts, facilitating their investigations and further contributions. Fortunately for the rest of us – technologists without deep AI knowledge – there has been a wave of open source initiatives aimed at letting us leverage the new opportunities AI brings.

Data sourcing, model training, the underlying math, and the associated coding are done by groups of dedicated folks who then release models such as Mixtral or Stable Diffusion. Other people then build wrappers around those models so that using them becomes a matter of basic configuration, and in some cases nowadays, of just executing a command, allowing us to focus on leveraging the models and simply building on top of them. That’s the power of open source!

One such tool that has caught the internet’s attention lately is Ollama, a cross-platform tool that can be installed on a wide variety of hardware, including Scaleway’s H100 PCIe GPU Instances.

A model

Before diving into Ollama and how to use it, it is important to spend a few moments getting a basic understanding of what a machine learning (ML) model is. This is by no means intended to be an extensive explanation of AI concepts, but rather a quick guide that will help you find your way toward experiencing the power of AI firsthand.

A model is a representation of the patterns an algorithm has learned from analyzing data it was fed during its training phase. The goal of a Machine Learning model is to make predictions or decisions based on new, unseen data.

A model is generally trained by feeding it labeled or unlabeled data – depending on the type of model – and then adjusting the model's parameters to minimize the error between the expected and actual outputs.
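
If you prefer seeing ideas as code, here is a deliberately tiny sketch of that training loop: a single-parameter model fitted with plain Python. It is nothing like how an LLM is actually trained, but it illustrates the same principle of nudging parameters to shrink the error between expected and actual outputs.

# Toy illustration only: "train" a one-parameter model y = w * x
# by nudging w to reduce the error on a tiny, made-up dataset.
data = [(1, 2), (2, 4), (3, 6)]  # (input, expected output) pairs; the hidden pattern is y = 2x

w = 0.0             # the single parameter, starting from an arbitrary guess
learning_rate = 0.01

for step in range(1000):
    for x, expected in data:
        predicted = w * x
        error = predicted - expected
        w -= learning_rate * error * x  # adjust the parameter in the direction that shrinks the error

print(w)  # ends up very close to 2.0: the "pattern" learned from the data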

By the end of its training phase, a model will be distributed as either a set of multiple files including the patterns it learned, configuration files, or a single file containing everything it needs. The number of files will vary depending on the frameworks and tools used to train it, and most tools today can adapt to the different ways a model is distributed.

The size of a machine learning model refers to the number of parameters that make up the model and, in turn, its file size: from a couple of megabytes to tens of gigabytes. A larger model can typically learn more complex patterns from the training data. However, larger models also require more computational resources, which can negatively affect their practicality.
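
As a rough back-of-the-envelope sketch of why parameter count drives file size, assuming 16-bit weights (2 bytes per parameter) and ignoring quantization and other overhead:

# Approximate model file size: parameters x bytes per parameter
def approx_size_gb(parameters, bytes_per_param=2):
    return parameters * bytes_per_param / 1e9

print(approx_size_gb(7e9))   # a 7B model  -> roughly 14 GB
print(approx_size_gb(70e9))  # a 70B model -> roughly 140 GB

Quantized variants (4-bit weights, for example) shrink those numbers considerably, which is one reason the same model is often distributed in several file sizes.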

Some of the most popular models today have been trained on huge amounts of data, with Llama 2 reaching 70 billion parameters (also known as Llama 2 70B). However, a model’s size doesn’t always correlate with its accuracy: models trained with fewer parameters, such as Mixtral 8x7B, claim to outperform Llama 2 70B in certain benchmarks.

Choosing the right tool for the job

When the task at hand can easily be handled by a smaller model, choosing it over a larger one that will potentially require far more hardware resources can be the most efficient optimization you achieve without having to tweak anything else.

Depending on your needs, using the 7B version of Llama 2 instead of the 70B one may cover your use case and provide faster results. In other cases, you may find that a model trained for a smaller set of specific tasks, rather than a more generic one, is the best call. Making the right choice will take some time spent trying out different alternatives, but it can yield improved inference times and better use of hardware resources.

Choosing the right tool can also be seen from the hardware angle: should you use a regular x86-64 CPU, an Arm CPU, a gaming GPU, or a Tensor Core GPU? That is a conversation worth having in a separate blog post. For this scenario, we’ll stick with Scaleway’s H100 PCIe GPU Instances, as they run the fastest hardware of their kind.

Ollama: up and running in less than 2 minutes

Finally, we get to talk about Ollama, an open source tool that hides away all the technical details and complexity of finding and downloading the right LLM, setting it up, and then deploying it. Ollama was originally developed with the idea of enabling people to run LLMs locally on their own computers, but that doesn’t mean you can’t use it on an H100 PCIe GPU Instance; in fact, the Instance’s vast resources will supercharge your experience.

After creating your H100 PCIe GPU Instance, getting Ollama up and running is just a matter of running the installation command:

curl -fsSL https://ollama.com/install.sh | sh

Note: It’s always a good idea to take a moment to review installation scripts before execution. Although convenient, running scripts directly from the internet without understanding their content can pose significant security risks.

Once installed, you can run any of the supported models available in their model library. For instance, you can run Mixtral from Mistral AI, a model licensed under Apache 2.0 that is on par with, and sometimes outperforms, GPT-3.5, by using the run command:

ollama run mixtral

Ollama will begin the download process, which will take just a few seconds thanks to the 10Gb/s networking capabilities of Scaleway’s H100 PCIe GPU Instances. Once done, you will be able to interact with the model through your terminal. You can start a conversation with the model, as you would with ChatGPT or any other AI chatbot; the difference here is that your conversation is kept locally on your H100 PCIe GPU Instance, and only you have access to the prompts you submit and the answers you receive.

The Ollama model library showcases a variety of models you can try out on your own, helping you decide what’s the best tool for the job, be it a compact model such as TinyLlama, or a big one like Llama 2. There are multimodal models, like LLaVA, which include a vision encoder that enables both visual and language understanding. There are also models made for specific use cases, such as Code Llama, an LLM that can help in the software development process, or Samantha Mistral, a model trained in philosophy, psychology, and personal relationships.

But as you may be thinking, interacting with a model from a terminal through an SSH connection is a good way to experiment, but doesn’t allow you to bring any value to your users. Luckily, Ollama’s features don’t stop there.

Serving a Model

Besides its simplicity, the reason we decided to highlight this tool for a first hands-on approach toward AI is its ability to expose the model as an API that you can interact with through HTTP requests.

By default, Ollama’s API server won’t accept requests from devices over the internet. You can change this behavior by updating Ollama’s system service settings (for example, by setting the OLLAMA_HOST environment variable in the service definition), as described in their documentation.

Once the Ollama service restarts, you can begin making HTTP calls to your server:

curl http://your-instance.instances.scw.cloud:11434/api/chat -d '{
  "model": "mixtral",
  "messages": [
    {
      "role": "system",
      "content": "You are a system that acts as an API server. Answer with the following JSON: \"common names\" (array of strings, a maximum of 3), \"family\" (string), \"genus\" (string), \"specific epithet\" (string), \"distribution\" (array of strings), \"origin\" (array of strings), \"known uses\" (a JSON object with a field named \"description\" where you specify its uses by industry, and the following boolean fields: medicinal, edible)."
    },
    {
      "role": "user",
      "content": "Lactuca sativa"
    }
  ],
  "stream": false
}'
  • The “model” attribute lets you specify which model you want to chat with, ensuring flexibility in different use cases where more than one model is required to obtain the desired response.
  • The “messages” attribute allows you to specify messages by role. In this case, the message with the system role lets you define how the model should interact with the user messages. The message with the user role is the user prompt fed to the model.
  • The “stream”: false attribute makes the server reply with a single JSON object, instead of a stream of partial objects, one per generated chunk of tokens.

The API’s response to the previous request would look like this:

{
  "model": "mixtral",
  "created_at": "2023-12-31T14:35:23.089402623Z",
  "message": {
    "role": "assistant",
    "content": " {\n\"common_names\": [\"garden lettuce\", \"lettuce\", \"cultivated lettuce\"],\n\"family\": \"Asteraceae\",\n\"genus\": \"Lactuca\",\n\"specific_epithet\": \"sativa\",\n\"distribution\": [\"Native to the Mediterranean region, now widely cultivated worldwide\"],\n\"origin\": [\"Originally domesticated in ancient Egypt over 4500 years ago\"],\n\"known_uses\": {\n\"description\": \"Lactuca sativa is primarily used as a leaf vegetable in salads and sandwiches. It is also used in soups, wraps, and other culinary applications. The leaves can be eaten raw or cooked.\",\n\"medicinal\": true,\n\"edible\": true\n}\n}"
  },
  "done": true,
  // ... skipped for simplicity
}
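
Because the system prompt asks the model to answer with JSON, the interesting part of the reply is itself a JSON string inside message.content. Here is a minimal Python sketch of sending a similar request with the requests library and turning that string into a regular dictionary. It assumes the same placeholder hostname as above and a shortened system prompt, and since the model isn’t guaranteed to always return strictly valid JSON, real code should handle parsing errors.

import json
import requests  # third-party HTTP client (pip install requests)

OLLAMA_API_URL = "http://your-instance.instances.scw.cloud:11434"

payload = {
    "model": "mixtral",
    "messages": [
        # A shortened version of the system prompt shown above, for brevity
        {"role": "system", "content": "You are a system that acts as an API server. Answer only with JSON."},
        {"role": "user", "content": "Lactuca sativa"},
    ],
    "stream": False,
}

# Send the chat request and wait for the full (non-streamed) reply
response = requests.post(f"{OLLAMA_API_URL}/api/chat", json=payload, timeout=120).json()

# The assistant's answer is itself a JSON document encoded as a string
plant = json.loads(response["message"]["content"])
print(plant["family"])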

Having an API accessible over HTTP will give you the ability to empower your products and services by taking advantage of the model(s) of your choosing, and the guidance provided by your “system prompts”.

Integrating with your applications

Being able to interact with the model through an HTTP endpoint gives you the flexibility to call it from basically any device, platform, and programming language, and if you’re already using Python or JavaScript, there are official Ollama libraries you can use to abstract some complexity away. Here’s the default example for the Python library:

from ollama import Client

OLLAMA_API_URL = "http://your-instance.instances.scw.cloud:11434"
ollama_client = Client(host=OLLAMA_API_URL)

response = ollama_client.chat(model='llama2', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])

print(response['message']['content'])
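
The library can also stream the answer as it is generated instead of waiting for the full reply, which is handy for chat-style interfaces. Here is the same call with stream=True, again pointing at the placeholder hostname used above:

from ollama import Client

OLLAMA_API_URL = "http://your-instance.instances.scw.cloud:11434"
ollama_client = Client(host=OLLAMA_API_URL)

# With stream=True the client returns an iterator of partial responses
stream = ollama_client.chat(
    model='llama2',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries the next piece of the assistant's reply
    print(chunk['message']['content'], end='', flush=True)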

Assuming you have already deployed your services using Instances (Virtual Machines), Elastic Metal (bare metal), or a Serverless solution, making them talk to your model is only a matter of pointing them in the right direction, either with regular HTTP calls from your preferred client, or with one of the official libraries. For more information, check out Ollama’s GitHub repository.

In conclusion

Even though Ollama’s current tagline is “Get up and running with large language models, locally”, as you can see, it can be tweaked to serve its API over the internet and integrate with your existing software solutions in just a few minutes. Even if you decide to use a different approach when going to production, it is a great resource that can help you get familiar with the process of running and communicating with a wide range of LLMs.

Note: Even though there’s community interest in a built-in authentication method, Ollama currently does not prevent unauthorized access to the API. This means you should take measures to protect it with your preferred method (Nginx Proxy Manager, for example, or by following and adapting this guide) so that it only accepts requests from your application server.

The open source tooling ecosystem around AI has skyrocketed during the last few years, and will continue to evolve, making it even easier for us developers to leverage AI in our applications without necessarily having to understand what’s happening under the hood: you can be a successful web developer without even understanding what the V8 engine is, the same way you don’t need to understand how your car’s engine works before being able to drive.

This blog post guided you through one of the simplest approaches to helping developers, and technologists in general, understand that “AI is doable”: it doesn’t take a team of AI researchers and years of study to harness its power!
