In recent years, large language models (LLMs) like GPT-3 have demonstrated impressive natural language capabilities. However, their proprietary nature and computational requirements have put them out of reach for most developers and researchers. This is starting to change thanks to open-source models and tools like Ollama that make them practical to run locally.

A tool named Ollama addresses this by simplifying the process of running open-source LLMs locally. This essay explores Ollama's features, its step-by-step installation process, and the subsequent ease of interacting with large language models on personal computers.

An Introduction to Ollama

Ollama offers a user-friendly interface for running large language models locally, specifically on macOS and Linux, with Windows support on the horizon. It supports an impressive range of models, including Llama 2, Llama 2 Uncensored, and the newly released Mistral 7B, among others. The straightforward installation and ease of use are particularly appealing to individuals and small teams who may not have extensive computational resources or technical expertise.

Ollama Installation

The installation process begins by downloading Ollama from the provided link. After opening the downloaded file, users are prompted to move the application to the Applications folder and to proceed through a few straightforward steps to complete the installation. The process is intuitive and doesn't appear to require any advanced technical knowledge. This simplicity is a significant boon for users unfamiliar with command-line operations.
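For Linux users, Ollama documents a one-line install script instead of the drag-and-drop application; the command below reflects the install instructions published at the time of writing and may change:

    # Linux only: install Ollama via the official install script (macOS uses the downloaded app)
    curl https://ollama.ai/install.sh | sh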

Running Large Language Models with Ollama

Once installed, running a model is as simple as executing a command in the terminal. For instance, the command to run Llama 2 is provided by default, and switching to another model, say Mistral 7B, is just as straightforward: copying the corresponding command and pasting it into the terminal initiates the download of the chosen model, as shown below. Despite the large file sizes, which may result in lengthy download times, the process remains uncomplicated.
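For example, the following terminal commands (the same ones listed in Ollama's model library later in this article) start a chat with Llama 2 and Mistral 7B; the model weights are downloaded automatically on first run:

    # Run Llama 2 locally (weights are downloaded on first use)
    ollama run llama2

    # Switch to Mistral 7B instead
    ollama run mistral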

Ollama Performance and Resource Requirements

Responses are fast, averaging 60-80 tokens per second on an M2 chip. Ollama also features a verbose mode that provides insights into model performance. For example, it shows the number of tokens generated and prompt/response rates. This helps users better understand how models behave.
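In the terminal, these statistics are exposed through a verbose flag; the sketch below assumes the --verbose option of ollama run, and the numbers shown are illustrative rather than measured:

    # Run a prompt with timing statistics enabled
    ollama run llama2 --verbose
    # After each response, Ollama prints statistics along these lines (illustrative values):
    #   prompt eval rate:   80 tokens/s
    #   eval rate:          65 tokens/s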

GitHub - jmorganca/ollama: Get up and running with Llama 2 and other large language models locally

Ollama's GitHub page provides valuable information regarding the memory requirements for running different models. For instance, running a 3-billion-parameter model requires around 8 GB of RAM, while a 7-billion-parameter model needs around 16 GB. This transparency is essential, as it helps users gauge the feasibility of running particular models on their machines.

Interactive Features and API Serving

Beyond merely running models, Ollama facilitates interactive experimentation. Users can pose questions to the models and receive responses directly in the terminal. Additionally, the verbose mode provides insights into the number of tokens processed and the speed of token processing. A notable feature is the ability to serve the installed model through a REST API, allowing applications to request and receive responses from the model in real time.
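As a minimal sketch of the API workflow, assuming Ollama's default local port of 11434 and its documented /api/generate endpoint:

    # Start the Ollama server (listens on localhost:11434 by default)
    ollama serve

    # From another terminal, send a prompt to a locally installed model
    curl http://localhost:11434/api/generate -d '{
      "model": "llama2",
      "prompt": "Why is the sky blue?"
    }'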

Ollama Model Hub/Library

Ollama's model hub makes switching between different LLMs straightforward. It lists specifications such as parameter count, download size, and RAM needs for each one. The available models range from compact 3-billion-parameter variants up to the 70-billion-parameter version of Llama 2.

Ollama supports a growing list of open-source models, available at ollama.ai/library.

Here are some example open-source models that can be downloaded:

  • Mistral - 7B - 4.1GB - ollama run mistral
  • Llama 2 - 7B - 3.8GB - ollama run llama2
  • Code Llama - 7B - 3.8GB - ollama run codellama
  • Llama 2 Uncensored - 7B - 3.8GB - ollama run llama2-uncensored
  • Llama 2 13B - 13B - 7.3GB - ollama run llama2:13b
  • Llama 2 70B - 70B - 39GB - ollama run llama2:70b
  • Orca Mini - 3B - 1.9GB - ollama run orca-mini
  • Vicuna - 7B - 3.8GB - ollama run vicuna
Note: You should have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.
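Besides ollama run, a couple of related commands help manage these downloads; this is a minimal sketch assuming the standard CLI subcommands:

    # Download a model ahead of time without starting a chat session
    ollama pull llama2:13b

    # List the models installed locally, with their sizes
    ollama list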

Integration Capabilities

Ollama further shines with its integration capabilities. It is compatible with popular frameworks such as LangChain, LlamaIndex, and even LiteLLM, thereby broadening its utility. Such integrations make Ollama an attractive option for a variety of use cases and user types.

The Broader Implication

By simplifying the local operation of large language models, Ollama lowers the entry barrier to harnessing the power of these cutting-edge AI tools. It fosters a more inclusive environment where individuals and small entities can explore, interact with, and benefit from large language models without hefty investments in computational resources or technical expertise.

Tools like Ollama are instrumental in democratizing access to advanced AI technologies. By offering a simplified, local avenue to run large language models, Ollama brings a slice of the AI revolution closer to the masses.
