# Ollama

> Source: https://aiwiki.ai/wiki/ollama
> Updated: 2026-06-20
> Categories: Developer Tools, Large Language Models, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Ollama** is a free, open-source runtime for downloading, running, and managing open-weight [large language models](/wiki/large_language_model) (LLMs) locally on personal computers and servers. Often described as "Docker for LLMs," it lets a user pull and chat with a model in a single command (for example `ollama run llama3.2`) without cloud services, API keys, or deep machine-learning expertise. Built primarily in [Go](https://en.wikipedia.org/wiki/Go_(programming_language)) and powered by [llama.cpp](/wiki/llama_cpp) under the hood, Ollama is one of the most widely adopted local-inference platforms in the AI ecosystem, with more than 175,000 stars and over 16,700 forks on GitHub as of June 2026, making its repository one of the most-starred AI projects on the platform.[1][2] It is released under the MIT license, and the current stable line is the v0.30.x series (v0.30.10 was published on June 17, 2026).[2]

## History

### When was Ollama released and who created it?

The company behind Ollama was founded in 2021 by Jeffrey Morgan and Michael Chiang in Palo Alto, California, and the tool itself had its first public release in 2023.[3] The company participated in Y Combinator's Winter 2021 batch[4] and raised $125,000 in pre-seed funding from investors including Y Combinator, Essence Venture Capital, Rogue Capital, and Sunflower Capital.[5]

Prior to founding Ollama, Morgan and Chiang, along with Sean Li, created Kitematic, a tool designed to simplify [Docker](https://en.wikipedia.org/wiki/Docker_(software)) container management on macOS, which was eventually acquired by Docker, Inc.[3] Jeffrey Morgan and Sean Li graduated from the University of Waterloo (BASc 2013, Software Engineering), while Michael Chiang was an electrical engineering student there at the time of Kitematic's acquisition. This experience in making complex command-line tools accessible through simpler interfaces directly influenced Ollama's design philosophy.

The platform quickly gained traction in the open-source AI community for its ease of use and Docker-like simplicity in managing LLMs. Initial releases focused on core functionality for running models like [LLaMA](/wiki/llama) 2, with subsequent updates introducing features such as multimodal support and tool calling.

### Key Milestones

| Date | Milestone | Notes |
| --- | --- | --- |
| 2021 | Company Founded | Participated in Y Combinator W21 batch |
| March 23, 2021 | Pre-seed Funding | Raised $125,000 from Y Combinator and other investors |
| 2023 | Public Launch | Basic model management and inference capabilities |
| February 8, 2024 | [OpenAI](/wiki/openai) Compatibility | Initial compatibility with the OpenAI Chat Completions API at `/v1/chat/completions` |
| February 15, 2024 | Windows Preview | Native Windows build with built-in GPU acceleration and always-on API |
| March 14, 2024 | AMD GPU Preview | Preview acceleration on supported AMD Radeon/Instinct cards on Windows and Linux |
| November 2024 | Structured Outputs | JSON Schema-based constrained output via the `format` parameter (v0.5+) |
| June 2025 | Secure Minions | Collaboration with Stanford's Hazy Research for encrypted local-cloud inference |
| July 30, 2025 | Desktop App (v0.10) | Official GUI app for macOS and Windows with file drag-and-drop and context-length controls |
| September 19, 2025 | Cloud Models (Preview) | Option to run larger models on datacenter hardware while maintaining local workflows |
| September 24, 2025 | Web Search API | REST API for augmenting models with live web data |
| October 2025 | Vulkan Support (Experimental) | Vulkan GPU backend in v0.12.6-rc0 for broader AMD and Intel GPU coverage |
| January 2026 | Image Generation (Experimental) | Local text-to-image with Z-Image Turbo and FLUX.2 Klein on macOS |
| January 2026 | `ollama launch` Command | Zero-config setup for coding tools such as Claude Code, Codex, and OpenCode |
| February 2026 | OpenClaw Integration | Personal AI assistant bridging messaging apps to local models |
| March 18, 2026 | Version 0.18.2 | Stable release with performance improvements and OpenClaw support |
| June 17, 2026 | Version 0.30.10 | Current stable release in the v0.30.x line[2] |

Development is rapid: the project moved from the v0.18.x line in March 2026 to the v0.30.x line by June 2026, and the GitHub repository's tagline by mid-2026 advertised support for "Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models."[2]

### Funding and Revenue

Ollama's only publicly confirmed funding round is the $125,000 pre-seed raised during Y Combinator's W21 batch.[5] Third-party estimates put Ollama's 2024 revenue at approximately $3.2 million with a team of roughly 21 people; the source of that revenue is unclear, since the Ollama Cloud subscription tiers were not introduced until September 2025.[3] The company reportedly received an M&A offer in April 2025, though neither the prospective acquirer nor the terms have been publicly disclosed.[3]

## Architecture and Technical Implementation

### How does Ollama work under the hood?

Ollama is built primarily in Go and leverages llama.cpp as its underlying inference engine through CGo bindings.[2] The llama.cpp project, created by Georgi Gerganov in March 2023, provides an efficient C++ implementation of LLaMA and other language models, enabling them to run on consumer-grade hardware. Because Ollama wraps llama.cpp, it inherits support for a wide range of model architectures and quantization formats while presenting a much simpler interface to the end user.

When a user runs a model, Ollama handles downloading the model weights from its registry, loading them into memory with appropriate quantization, allocating GPU or CPU resources, and exposing a local HTTP server for interaction. The server runs on `127.0.0.1:11434` by default and supports both streaming and non-streaming responses.
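
The streaming behaviour described above can be sketched with Python's standard library. This is a minimal illustration, not Ollama's own client: it assumes the server streams one JSON object per line with `response` and `done` fields (the shape used by `/api/generate`), and the helper names `iter_response_text` and `generate` are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://127.0.0.1:11434"  # default address stated in the text


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a stream of newline-delimited JSON objects."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream


def generate(prompt: str, model: str = "llama3.2") -> str:
    """Send a streaming /api/generate request and join the fragments."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running Ollama server
        return "".join(iter_response_text(resp))
```

Keeping the parser separate from the network call means the parsing logic can be exercised on canned bytes without a server; `generate("Why is the sky blue?")` would only work once `ollama serve` is running and the model has been pulled.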

### Model Format

Ollama primarily uses the GGUF (GPT-Generated Unified Format) file format for storing and loading models. GGUF replaced the earlier [GGML](/wiki/ggml) format and provides better compatibility, metadata handling, and performance optimization for quantized models. Storing weights in quantized form is what allows very large models (for example, 70 billion parameters) to run on machines with limited VRAM.

Ollama can also import models from specific [Safetensors](/wiki/safetensors) directories for supported architectures (for example Llama, [Mistral](/wiki/mistral_ai), Gemma, Phi).

### Quantization

[Quantization](/wiki/quantization) is central to Ollama's ability to run large models on consumer hardware. By reducing the precision of model weights from 16-bit floating point to lower bit representations (such as 4-bit or 8-bit integers), quantization dramatically reduces both memory usage and computation time. Ollama supports multiple quantization levels through GGUF:

| Quantization | Bits per Weight | Typical Use Case | Quality vs. Size Trade-off |
| --- | --- | --- | --- |
| Q2_K | 2 | Extreme compression for very limited hardware | Noticeable quality loss |
| Q4_0 / Q4_K_M | 4 | Default for most users; good balance | Minimal quality loss |
| Q5_K_M | 5 | Higher quality with moderate size | Near full-precision quality |
| Q6_K | 6 | High quality for users with sufficient RAM | Very close to full precision |
| Q8_0 | 8 | Near-lossless for critical applications | Large files, high memory use |
| FP16 | 16 | Full precision (no quantization) | Maximum quality, maximum size |

Ollama also added experimental support for NVFP4 and FP8 quantization in late 2025, leveraging [NVIDIA](/wiki/nvidia) hardware for faster token generation at lower precision.[28]
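
As a rough illustration of why bit width matters, the size of the weights alone can be estimated as parameters × bits per weight ÷ 8. The sketch below uses nominal bit widths; real GGUF files are somewhat larger (K-quant variants mix precisions and carry metadata), and the KV cache and runtime buffers add further memory on top. The function name `estimate_weight_gb` is hypothetical.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bit widths only; actual files and memory use will be higher.
for name, bits in [("4-bit", 4), ("8-bit", 8), ("FP16", 16)]:
    print(f"70B at {name}: ~{estimate_weight_gb(70, bits):.0f} GB")
# prints:
# 70B at 4-bit: ~35 GB
# 70B at 8-bit: ~70 GB
# 70B at FP16: ~140 GB
```

Under these assumptions a 70B model drops from about 140 GB at 16 bits to about 35 GB at 4 bits, which is the difference between datacenter-only and high-end workstation hardware.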

## Installation

### System Requirements

| Platform | Minimum Version | Installation Method |
| --- | --- | --- |
| macOS | 11 Big Sur or later | Download `.dmg` from official website |
| Linux | Ubuntu 18.04 or equivalent | `curl -fsSL https://ollama.com/install.sh \| sh` |
| Windows | Windows 10 22H2 or later | Download `.exe` installer |
| Docker | Any platform | `docker pull ollama/ollama` |

| Model Size | RAM Required | Storage | GPU VRAM (Optional) |
| --- | --- | --- | --- |
| 3B parameters | 8GB | 10GB+ | 4GB |
| 7B parameters | 16GB | 20GB+ | 8GB |
| 13B parameters | 32GB | 40GB+ | 16GB |
| 70B parameters | 64GB+ | 100GB+ | 48GB+ |

### Quick Start

After installation, getting started with Ollama takes a single command:

```
ollama run llama3.2
```

This command downloads the model (if not already present) and starts an interactive chat session. The Ollama server launches automatically in the background when any command is run, or it can be started explicitly with `ollama serve`.

### GPU Support

Ollama provides hardware acceleration across multiple GPU vendors:

| GPU Platform | API | Operating Systems | Notes |
| --- | --- | --- | --- |
| [NVIDIA](/wiki/nvidia) | [CUDA](/wiki/cuda) | Windows, Linux | Compute capability 5.0+; auto-detected |
| AMD | ROCm v7 | Linux | Supported Radeon and Instinct cards |
| Apple Silicon | Metal | macOS | Native acceleration on M1/M2/M3/M4 chips |
| AMD / Intel | Vulkan (experimental) | Linux, Windows | Added in v0.12.6-rc0 (October 2025) for broader GPU coverage |

Ollama automatically detects available GPUs and allocates model layers accordingly. For models that exceed GPU VRAM, Ollama splits inference between GPU and CPU, loading as many layers as possible onto the GPU.[19]

## Features

### What is Ollama used for?

Ollama is used to run open-weight LLMs entirely on a user's own hardware for chat, coding assistance, retrieval-augmented generation, agentic tool use, multimodal (vision) tasks, and, since 2026, experimental local image generation. Because all processing can occur on the loopback interface by default, it is widely adopted for privacy-sensitive workloads where prompts and data must never leave the machine.

### Core Capabilities

| Area | Details | Notes |
| --- | --- | --- |
| Server and Port | Local HTTP server at `127.0.0.1:11434` | Configurable via `OLLAMA_HOST` environment variable |
| Core Endpoints | `/api/generate`, `/api/chat`, `/api/embeddings`, model management | Streaming JSON supported |
| OpenAI Compatibility | `/v1/chat/completions` | Drop-in replacement for many OpenAI-based clients |
| Anthropic Compatibility | Anthropic Messages API (v0.14.0+) | Enables Claude Code and similar tools |
| Local-First Design | All processing occurs locally by default | Ensures complete data privacy |
| Multimodal Support | Text, images, and other data types | Self-contained projection layers |
| Tool Calling | External function calls with streaming support | Enhances reasoning and automation |
| Structured Outputs | JSON Schema-constrained responses | Type-safe API responses (v0.5+) |
| Thinking Mode | Controllable chain-of-thought reasoning | For DeepSeek R1, Qwen3, and similar models |
| Web Search | REST API for live web augmentation | Free tier available (v0.12+) |
| Cloud Integration | Hybrid mode for larger models | Maintains local workflows (v0.12.0+) |
| Image Generation | Experimental text-to-image | Z-Image Turbo and FLUX.2 Klein (v0.14+) |
| Performance | Flash attention, GPU/CPU overlap | Batch processing for efficiency |

### Command-Line Interface

| Command | Description | Example |
| --- | --- | --- |
| `ollama run` | Runs a model interactively | `ollama run llama3.2` |
| `ollama pull` | Downloads a model | `ollama pull gemma:2b` |
| `ollama create` | Creates custom model from Modelfile | `ollama create mymodel -f ./Modelfile` |
| `ollama list` | Lists installed models | `ollama list` |
| `ollama rm` | Removes a model | `ollama rm llama3.2` |
| `ollama cp` | Copies a model | `ollama cp llama3.2 mymodel` |
| `ollama push` | Uploads model to registry | `ollama push mymodel` |
| `ollama serve` | Starts the Ollama server | `ollama serve` |
| `ollama show` | Displays model information and metadata | `ollama show llama3.2` |
| `ollama ps` | Lists running models and resource usage | `ollama ps` |
| `ollama launch` | Sets up coding tools (Claude Code, Codex, etc.) | `ollama launch claude` |

### Modelfile

A key component of Ollama is the **Modelfile**, which serves as a blueprint for creating and sharing models. Similar to a Dockerfile, the Modelfile defines model behavior and configuration.

| Instruction | Description | Example |
| --- | --- | --- |
| FROM | (Required) Specifies the base model or local GGUF path | `FROM llama3.2` or `FROM ./model.gguf` |
| PARAMETER | Sets model parameters | `PARAMETER temperature 0.7`, `PARAMETER num_ctx 4096` |
| SYSTEM | Defines system message/persona | `SYSTEM "You are a helpful assistant"` |
| TEMPLATE | Sets prompt template format | `TEMPLATE "[INST] {{ .System }} {{ .Prompt }} [/INST]"` |
| ADAPTER | Applies [LoRA](/wiki/lora)/QLoRA adapters | `ADAPTER /path/to/adapter.bin` |
| LICENSE | Specifies model license | `LICENSE "MIT"` |
| MESSAGE | Provides conversation history for few-shot learning | `MESSAGE user "What is 1+1?"`, `MESSAGE assistant "2"` |

#### Example Modelfile

```
# Specify the base model
FROM llama3.2

# Set model parameters
PARAMETER temperature 0.8
PARAMETER num_ctx 4096
PARAMETER stop <|eot_id|>

# Set the system message
SYSTEM """
You are an expert Python programming assistant.
Always provide clear, concise code examples.
Your responses must be formatted in Markdown.
"""

# Define the chat template
TEMPLATE """
<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
```

This custom model can be created with: `ollama create my-assistant -f ./Modelfile`
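
When several variants of a model are needed, a Modelfile can be generated from a script rather than written by hand. The following is a minimal sketch that emits only the `FROM`, `PARAMETER`, and `SYSTEM` instructions listed above; `render_modelfile` is a hypothetical helper, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render a minimal Modelfile with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    for key, value in parameters.items():
        lines.append(f"PARAMETER {key} {value}")
    # Triple quotes allow a multi-line system message.
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.2",
        "You are an expert Python programming assistant.",
        {"temperature": 0.8, "num_ctx": 4096},
    )
    with open("Modelfile", "w", encoding="utf-8") as fh:
        fh.write(text)
```

The resulting file can then be passed to `ollama create my-assistant -f ./Modelfile` as usual.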

### REST API

Ollama exposes a REST API on port 11434 by default, providing programmatic access to model functionality:

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/generate` | POST | Generate text completion |
| `/api/chat` | POST | Chat conversation interface |
| `/api/embed` | POST | Generate text [embeddings](/wiki/embeddings) (supersedes the deprecated `/api/embeddings` path) |
| `/api/pull` | POST | Download a model |
| `/api/push` | POST | Upload a model to the registry |
| `/api/show` | POST | Show model information |
| `/api/tags` | GET | List installed models |
| `/api/delete` | DELETE | Remove a model |
| `/v1/chat/completions` | POST | OpenAI-compatible chat endpoint |
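
A non-streaming call to `/api/chat` might look like the sketch below. It uses only the standard library; the payload fields (`model`, `messages`, `stream`) and the `message.content` field of the reply reflect the chat API as commonly documented, while the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(model, messages, base_url="http://127.0.0.1:11434"):
    """Build a POST request for /api/chat with streaming disabled."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages):
    """Send the request and return the assistant's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, messages)) as resp:
        reply = json.load(resp)
    return reply["message"]["content"]
```

A multi-turn conversation is carried by appending each assistant reply to `messages` and sending the whole list again, since the server does not keep conversation state between requests.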

### OpenAI-Compatible API

Since February 2024, Ollama has provided an OpenAI-compatible API endpoint at `/v1/chat/completions`.[6] This allows developers to use Ollama as a drop-in replacement for OpenAI in applications that rely on the OpenAI client library. By simply changing the base URL to `http://localhost:11434/v1`, existing code written for the [OpenAI API](/wiki/openai_api) works with local Ollama models. This compatibility extends to features such as streaming, tool calling, and structured outputs.[6]

Starting with v0.14.0 (released in early 2026), Ollama also added compatibility with the [Anthropic](/wiki/anthropic) Messages API, enabling tools like [Claude Code](/wiki/claude_code) to work with local open-source models through Ollama.[18]
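
To make the OpenAI-compatible route concrete, here is a minimal standard-library sketch of the request and response shapes. The base URL is the one given above; the `Authorization` value is a placeholder (OpenAI-style clients expect a key, and the assumption here is that a local Ollama server does not check it). The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # base URL given in the text


def build_openai_chat_request(model, user_text, api_key="ollama"):
    """Build an OpenAI-style chat completion request aimed at a local server."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder key, assumed unused locally
        },
        method="POST",
    )


def first_choice_text(response_body: dict) -> str:
    """Extract the assistant text from an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"]
```

Note the difference from the native API: the reply text lives under `choices[0].message.content` rather than `message.content`, which is why existing OpenAI client code keeps working once only the base URL is changed.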

## Supported Models

Ollama supports a wide range of open-source language models through its model library at `ollama.com/library`. The registry hosts over 100 model families with various parameter sizes and quantizations, and new models are added regularly as they are released.[25]

### Popular Models

| Model Name | Parameters | Category | Use Case | Creator |
| --- | --- | --- | --- | --- |
| [Llama 3.2](/wiki/llama) | 1B, 3B, 11B, 90B | General / Vision | General-purpose chat, reasoning, and image understanding | [Meta AI](/wiki/meta_ai) |
| [Gemma](/wiki/gemma) 3 | 1B, 4B, 12B, 27B | General | Lightweight on-device inference; multilingual | [Google DeepMind](/wiki/google_deepmind) |
| [DeepSeek-R1](/wiki/deepseek) | 1.5B, 7B, 14B, 32B, 70B | Reasoning | Complex reasoning with visible chain-of-thought | DeepSeek AI |
| [Qwen 3](/wiki/qwen) | 0.6B, 1.7B, 4B, 8B, 14B, 30B, 32B | General / Code | Multilingual; 128K context; coding and agentic tasks | [Alibaba](/wiki/alibaba_cloud) |
| [Mistral](/wiki/mistral_ai) / Mixtral | 7B, 8x7B, 8x22B | General | High-efficiency models using [mixture of experts](/wiki/mixture_of_experts) | [Mistral AI](/wiki/mistral_ai) |
| Phi 4 | 3B, 14B | Reasoning | Small language models for efficient reasoning | [Microsoft](/wiki/microsoft) |
| CodeLlama | 7B, 13B, 34B | Code | Specialized for code generation and programming | [Meta AI](/wiki/meta_ai) |
| Qwen3-Coder | 8B, 30B | Code | Optimized for coding and agentic workflows | Alibaba |
| GLM-5 | Various | General / Code | Open model with strong tool use and coding | Zhipu AI |
| Kimi-K2.5 | Various (cloud) | General / Code | Large cloud-hosted model with strong performance | Moonshot AI |
| gpt-oss | 20B, cloud variants | General / Code | OpenAI's open-source safeguard and general models | [OpenAI](/wiki/openai) |
| LLaVA | 7B, 13B | Vision | Visual language model for text and image understanding | Various |
| Llama 3.2 Vision | 11B, 90B | Vision | Multimodal image reasoning and captioning | [Meta AI](/wiki/meta_ai) |
| nomic-embed-text | 137M | Embedding | Text embeddings for retrieval and RAG | Nomic AI |
| mxbai-embed-large | 335M | Embedding | High-performance embeddings (MTEB benchmark leader) | Mixedbread AI |
| Snowflake Arctic Embed | 568M | Embedding | Multilingual embedding model for retrieval tasks | Snowflake |
| Z-Image Turbo | 6B | Image Generation | Text-to-image with bilingual text rendering | Alibaba Tongyi Lab |
| FLUX.2 Klein | 4B, 9B | Image Generation | Fast local image generation with text support | Black Forest Labs |

The newest model families surfaced in the GitHub repository's tagline as of mid-2026 include Kimi-K2.6, GLM-5.1, and MiniMax, alongside long-standing families such as DeepSeek, gpt-oss, Qwen, and Gemma.[2]

### Vision Model Support

Ollama supports multimodal vision models that can process both text and images. Users can pass images to supported models through the API or by dragging and dropping files in the desktop app. Vision models available through Ollama include:

- **Llama 3.2 Vision** (11B and 90B): Meta's multimodal models optimized for visual recognition, image reasoning, captioning, and answering questions about images. The architecture adds a vision adapter (image encoder plus cross-attention layers) on top of the Llama 3.1 text transformer.[24]
- **LLaVA 1.6**: An end-to-end trained multimodal model combining a vision encoder with Vicuna for general-purpose visual and language understanding.
- **Gemma 3**: Google DeepMind's models with built-in image understanding capabilities.
- **Qwen3-VL**: Alibaba's vision-language model with support for image and video inputs.

Vision models can be used with the same `ollama run` command by providing an image path:

```
ollama run llama3.2-vision "Describe this image: ./photo.jpg"
```

### Embedding Model Support

Ollama supports dedicated embedding models for tasks such as [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG), semantic search, and text classification. Embeddings are generated through the `/api/embed` endpoint (which supersedes the deprecated `/api/embeddings` path) or the client libraries.[23]

| Embedding Model | Dimensions | Context Length | Notes |
| --- | --- | --- | --- |
| nomic-embed-text | 768 | 8,192 tokens | Surpasses OpenAI text-embedding-ada-002 on short and long context tasks |
| mxbai-embed-large | 1,024 | 512 tokens | SOTA for BERT-large sized models on the MTEB benchmark |
| Snowflake Arctic Embed | 1,024 | 512 tokens | Multilingual support for retrieval workloads |
| all-minilm | 384 | 256 tokens | Lightweight model for fast similarity search |
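
Once an embedding model has returned vectors, semantic search usually reduces to comparing them. The sketch below ranks documents by cosine similarity; it operates on plain lists of floats, so it makes no assumption about how the vectors were fetched, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In a RAG pipeline the top-ranked documents would then be pasted into the prompt of a chat model; a vector database performs the same comparison at scale with an index instead of a linear scan.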

### Tool Calling and Function Calling

Ollama supports tool calling (also known as function calling), which allows models to invoke external functions and incorporate the results into their responses.[8] This feature is available through the `/api/chat` endpoint by specifying a `tools` parameter containing a list of available functions with their descriptions and parameter schemas.[20]

Key aspects of Ollama's tool calling support:

- **Streaming with tool calls**: A parser built into Ollama understands the structure of tool calls, enabling real-time streaming of both text and function invocations without waiting for the entire response to complete.[9]
- **Parallel tool calls**: Models can request multiple tool calls in a single response. All tool responses can then be sent back to the model together.
- **Supported models**: Models with strong tool calling performance include Llama 3.1 and later, Qwen 2.5 and Qwen 3, Mistral and [Mixtral](/wiki/mixtral) variants, GLM-5, and specialized models like FunctionGemma.

Tool calling enables use cases such as data retrieval, calculations, API integration, and agentic workflows where the model plans and executes multi-step tasks.
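
The round trip can be sketched as follows: the application declares a tool schema, the model replies with one or more `tool_calls`, and the application runs them and returns the results as `tool` messages. The field names (`tool_calls`, `function.name`, `function.arguments` as an already-parsed object, and `tool_name` on the result message) are assumptions about the chat response format; `add_numbers`, `REGISTRY`, and `run_tool_calls` are hypothetical.

```python
# JSON-schema description of one tool, in the shape passed via the `tools` parameter.
ADD_TOOL = {
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two integers",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}


def add_numbers(a: int, b: int) -> int:
    return a + b


# Maps tool names the model may request to local Python callables.
REGISTRY = {"add_numbers": add_numbers}


def run_tool_calls(assistant_message: dict) -> list:
    """Execute every requested tool and return `tool` messages for the next turn."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        handler = REGISTRY.get(fn["name"])
        if handler is None:
            output = f"unknown tool: {fn['name']}"  # report instead of raising
        else:
            output = handler(**fn["arguments"])
        results.append({"role": "tool", "tool_name": fn["name"], "content": str(output)})
    return results
```

Because the loop handles a list, parallel tool calls need no special casing: all results are appended to the conversation together and the model is called again to produce its final answer.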

### Structured Outputs

Since version 0.5, Ollama supports structured outputs that constrain a model's response to conform to a specific JSON schema.[10] By passing a JSON schema to the `format` parameter, Ollama generates a grammar that forces the output to match the defined structure. This is useful for extracting typed data from model responses, building reliable data pipelines, and integrating LLM outputs with downstream systems.

Best practices for structured outputs include lowering the temperature (for example, setting it to 0) for deterministic completions, defining schemas with Pydantic (Python) or Zod (JavaScript), and including the schema in the system prompt to ground the model's response.[21]
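
A minimal sketch of these practices, using a hand-written JSON schema instead of Pydantic so that it needs no third-party packages. The `format` and `options.temperature` fields follow the description above; `PERSON_SCHEMA` and the helper names are hypothetical, and the check performed here covers only required keys, not full schema validation.

```python
import json

PERSON_SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}


def build_structured_payload(model, prompt, schema):
    """Build a chat payload that asks for output constrained to `schema`."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,                 # JSON schema constraining the output
        "options": {"temperature": 0},    # lower temperature for determinism
        "stream": False,
    }


def parse_structured_reply(reply: dict, schema: dict) -> dict:
    """Decode the reply's JSON content and check that required keys are present."""
    data = json.loads(reply["message"]["content"])
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data
```

Even with constrained decoding, validating the parsed result on the application side is a cheap safeguard, for example against responses truncated by a token limit.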

### Thinking Mode

Ollama provides the ability to enable or disable "thinking" for reasoning models such as [DeepSeek](/wiki/deepseek) R1 and [Qwen](/wiki/qwen) 3. When thinking mode is active, the model generates an explicit chain-of-thought reasoning trace before producing its final answer. This improves accuracy on complex tasks like mathematics, logic puzzles, and multi-step planning, while also providing transparency into the model's reasoning process. Thinking can be toggled on or off through the API, and the thinking level can be controlled in v0.17.7 and later.[29]
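
In API terms, toggling thinking amounts to one extra request field, with the reasoning trace returned separately from the answer. The sketch below assumes a boolean `think` field on the request and a `thinking` field on the response message; both names, like the helper functions, should be read as assumptions rather than a definitive reference.

```python
def build_thinking_payload(model, prompt, think=True):
    """Build a chat payload with the reasoning trace switched on or off."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,   # assumed request field controlling thinking mode
        "stream": False,
    }


def split_reply(reply: dict):
    """Return (reasoning_trace, final_answer); the trace is empty if absent."""
    message = reply["message"]
    return message.get("thinking", ""), message.get("content", "")
```

Keeping the two parts separate lets an application log or display the trace for transparency while passing only the final answer to downstream code.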

### Web Search

Announced on September 24, 2025, Ollama's web search feature provides a REST API that augments models with live information from the web.[13] The API includes two endpoints:

- **Web Search** (`/api/web_search`): Returns search results with titles, URLs, and content snippets for a given query.
- **Web Fetch** (`/api/web_fetch`): Retrieves the full content of a specific web page.

Both endpoints require Bearer token authentication. Ollama provides a free tier of web searches for individuals, with higher rate limits available through Ollama Cloud subscriptions. The web search and fetch functions can also be exposed as tools that instruction-following models call autonomously via function calling.[22]
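
A request to the search endpoint could be assembled as in the sketch below. The path and Bearer authentication come from the description above; the host `https://ollama.com`, the `query` body field, and the `OLLAMA_API_KEY` environment variable are assumptions, and the helper names are hypothetical.

```python
import json
import os
import urllib.request

# Assumed host for the /api/web_search path mentioned in the text.
WEB_SEARCH_URL = "https://ollama.com/api/web_search"


def build_web_search_request(query: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated POST request for the web search endpoint."""
    return urllib.request.Request(
        WEB_SEARCH_URL,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def web_search(query: str) -> dict:
    """Run a search using a key from the environment (requires network access)."""
    api_key = os.environ["OLLAMA_API_KEY"]  # assumed variable name
    with urllib.request.urlopen(build_web_search_request(query, api_key)) as resp:
        return json.load(resp)
```

Reading the key from the environment rather than hard-coding it keeps the credential out of source control.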

### Image Generation (Experimental)

In January 2026, Ollama introduced experimental support for local text-to-image generation.[15] The initial release supports two models:

- **Z-Image Turbo** (6B parameters): A bilingual (English/Chinese) model from Alibaba's Tongyi Lab that generates photorealistic images.
- **FLUX.2 Klein** (4B and 9B variants): A fast image-generation model from [Black Forest Labs](/wiki/black_forest_labs) with strong text-rendering capabilities.

Image generation is currently available on macOS, with Windows and Linux support planned for future releases. Users can configure generation parameters including width, height, step count, random seeds, and negative prompts.

## Ollama Model Library and Registry

The Ollama model library at `ollama.com/library` serves as a centralized registry for discovering and downloading models. It functions similarly to Docker Hub, allowing users to browse, search, and pull models with a single command. Each model page shows available tags (representing different parameter sizes and quantization levels), file sizes, and documentation.

Users can also push custom models to the registry after creating them with a Modelfile, making it possible to share fine-tuned or customized models with the community. The registry supports versioning through tags, so users can pin specific model versions for reproducibility.

As of early 2026, the library hosts over 100 model families spanning categories such as general-purpose chat, code generation, vision, embedding, tools/function calling, and image generation.[25]

## Desktop Application

Ollama v0.10.0, released on July 30, 2025, introduced a native desktop application for macOS and Windows.[11] Previously, Ollama operated exclusively through the command line and API. The desktop app brings a graphical chat interface that lowers the barrier to entry for non-technical users; Ollama described it simply as "An easier way to chat with models."[11]

Key features of the desktop app include:

- **Chat interface**: A polished conversation UI for interacting with any installed model.
- **File drag-and-drop**: Users can drag PDFs, images, and code files directly into the chat window for multimodal processing.
- **Context length controls**: A slider allowing precise control over how much context the model retains, up to 128K tokens for supported models.
- **Model management**: Browse, download, and switch between models from within the GUI.
- **Thinking mode toggle**: Enable or disable chain-of-thought reasoning for supported models.

The desktop app runs the same Ollama server under the hood, so the CLI, API, and third-party integrations continue to work alongside the graphical interface.

## Ollama Cloud

Announced on September 19, 2025, Ollama Cloud extends the platform beyond local hardware by allowing users to run larger models on datacenter-grade GPUs while maintaining the same tools and workflows.[12] Cloud models appear alongside local models in the Ollama interface and work through the same API endpoints, including the OpenAI-compatible API.

Ollama Cloud is designed for cases where a model is too large to fit in local memory (for example, 70B+ parameter models that require 48GB or more of VRAM). The cloud infrastructure uses high-memory GPUs with fast interconnects optimized for LLM inference.

### Privacy

Ollama states that its cloud service does not retain user data, maintaining the platform's privacy-first principles even when offloading to remote hardware.

### Pricing

#### How much does Ollama cost?

Ollama itself is free and open source under the MIT license, and running models locally has no usage cost. Ollama Cloud launched with fixed-price subscription tiers, billed against the platform's GPU infrastructure rather than per token:[12]

| Plan | Price | Details |
| --- | --- | --- |
| Free | $0/month | Local models only; free tier of web search API; 1 concurrent cloud model |
| Pro | $20/month (or $200/year) | Cloud model access with higher usage limits and up to 3 concurrent cloud models |
| Max | $100/month | Cloud model access with the highest rate limits and up to 10 concurrent cloud models |

## Coding Tools Integration

In January 2026, Ollama introduced the `ollama launch` command (v0.15), which provides zero-configuration setup for popular AI coding tools.[16] Rather than manually setting environment variables and API endpoints, users can run a single command to connect coding agents to local or cloud models.

Supported coding tools include:

- **[Claude](/wiki/claude) Code**: Anthropic's CLI coding agent, connected via Ollama's Anthropic-compatible API.
- **OpenAI [Codex](/wiki/openai_codex)**: OpenAI's coding agent.
- **OpenCode**: Open-source coding assistant.
- **Droid**: Factory's coding agent.

Usage examples:

```
# Interactive picker for all supported tools
ollama launch

# Launch Claude Code directly
ollama launch claude

# Launch with a specific model
ollama launch claude --model qwen3-coder
```

Popular local models for coding tasks include GLM-4.7-flash, qwen3-coder, and gpt-oss:20b. Running them with the recommended 64,000-token context length requires around 23GB of VRAM.

## OpenClaw Integration

In February 2026, Ollama announced integration with [OpenClaw](/wiki/openclaw), an open-source personal AI assistant framework that gained over 113,000 GitHub stars within days of its January 2026 launch.[17] OpenClaw bridges messaging platforms (WhatsApp, Telegram, Slack, Discord, iMessage) to [AI agents](/wiki/ai_agents), allowing users to interact with their local models from any chat application.

Ollama's integration provides a streamlined setup command:

```
ollama launch openclaw
```

This automatically configures the connection between OpenClaw and the user's local Ollama models, enabling tasks such as email management, calendar scheduling, and general assistance through familiar messaging interfaces.

## Secure Minions

In June 2025, Ollama partnered with Stanford's Hazy Research lab to introduce Secure Minions, a protocol for private collaboration between local and cloud models.[14] The protocol allows a small local model (such as Gemma 3 4B running on Ollama) to work together with a larger cloud model (such as [GPT-4o](/wiki/gpt-4)) while keeping all raw data encrypted end-to-end.

In the Secure Minions protocol, the raw context stays on the local device and can only be accessed by the local LLM. The cloud model orchestrates the local models and aggregates their outputs, but never sees plaintext data. As Ollama describes it, "The raw context stays local and can only be accessed by the local LLM."[14] Messages are encrypted before being sent to the cloud and decrypted only inside an NVIDIA Hopper H100 GPU running in confidential-computing (secure enclave) mode, with remote attestation used to verify the GPU's secure state before any encrypted message is processed; according to Ollama, "No plaintext is exposed during transmission or remote LLM inference."[14]

According to the underlying research (the Minions work, presented around ICML 2025), this approach reduces cloud costs by 5x to 30x while achieving 98% of the accuracy of using the frontier model directly; the paper's MinionS protocol specifically reports a 5.7x cost reduction at 97.9% of frontier accuracy.[31] The latency overhead is minimal: less than 1% even with long prompts and large local models.[14]

## Privacy and Security

### Is Ollama private and offline by default?

Yes. By default, Ollama operates entirely locally:

- Server binds to `127.0.0.1:11434` (loopback interface only)
- No prompts or responses sent to external servers
- Complete data privacy for sensitive information
- Offline operation after model download

To expose on a network, users must explicitly set the `OLLAMA_HOST` environment variable (for example `OLLAMA_HOST=0.0.0.0:11434`).
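
On the client side, the same variable can be used to decide which server to talk to. The sketch below is a simplified resolver, not Ollama's own parsing rules: it falls back to the loopback default, adds `http://` and port 11434 when they are omitted, and does not handle IPv6 literals. `resolve_base_url` is a hypothetical helper.

```python
import os


def resolve_base_url(env=None) -> str:
    """Turn an OLLAMA_HOST-style value into a base URL, defaulting to loopback."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or "127.0.0.1:11434"
    if "://" not in host:
        host = "http://" + host          # assume plain HTTP when no scheme is given
    scheme, rest = host.split("://", 1)
    rest = rest.rstrip("/")
    if ":" not in rest:
        rest += ":11434"                 # assume the default port when omitted
    return f"{scheme}://{rest}"
```

Note that `0.0.0.0` is a bind address meaning "all interfaces" on the server; a client on another machine would normally set the variable to the server's actual hostname or IP address instead.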

### Security Vulnerabilities

Ollama has addressed several security vulnerabilities:

| CVE | Description | Affected Versions | Status |
| --- | --- | --- | --- |
| CVE-2024-37032 | Remote code execution via path traversal caused by insufficient digest validation ("Probllama") | <0.1.34 | Fixed |
| CVE-2025-0312 | Malicious GGUF model exploitation | <=0.3.14 | Fixed |
| CNVD-2025-04094 | Unauthorized access due to improper configuration | Various | Configuration issue |

Users are advised to keep Ollama updated and configure the server securely, especially when exposing the API to a network.

## Integration and Ecosystem

### Programming Languages

Ollama provides official client libraries:

| Language | Installation | Notes |
| --- | --- | --- |
| Python | `pip install ollama` | Full API coverage; async support; Pydantic integration |
| JavaScript / TypeScript | `npm install ollama` | Browser and Node.js support |
| Go | Native API (same language as Ollama) | Direct integration without additional libraries |
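A minimal sketch of the Python client in use is shown below. It assumes the `ollama` package is installed, a local server is running, and the `llama3.2` model has already been pulled; the helper names `build_messages` and `ask` are illustrative and not part of the library.

```python
from typing import Dict, List


def build_messages(user_prompt: str, system_prompt: str = "") -> List[Dict[str, str]]:
    """Assemble the role/content message list that the chat call expects."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def ask(prompt: str, model: str = "llama3.2") -> str:
    """Send one prompt to a locally running Ollama server and return the reply text."""
    import ollama  # third-party client: pip install ollama

    response = ollama.chat(model=model, messages=build_messages(prompt))
    return response["message"]["content"]
```

Calling `ask("Why is the sky blue?")` returns the model's reply as a string. The import is kept inside `ask` so that the message-building helper can be reused or tested without the client installed.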

### Third-Party Integrations

Ollama's ecosystem has grown substantially, with integrations spanning user interfaces, development frameworks, databases, and more.

#### User Interfaces

| Tool | Description |
| --- | --- |
| Open WebUI | Self-hosted web-based chat interface with RAG support; one of the most popular Ollama frontends |
| Continue.dev | VS Code and JetBrains extension for AI-assisted coding with local models |
| AnythingLLM | Multi-model chat application with document processing |
| Enchanted | Native macOS/iOS app for Ollama |
| OpenClaw | Personal AI assistant bridging messaging apps to local models |

#### Development Frameworks

| Framework | Description |
| --- | --- |
| [LangChain](/wiki/langchain) | LLM application framework with dedicated `langchain-ollama` package |
| LlamaIndex | Data framework for building RAG applications with LLMs |
| AutoGen | [Microsoft](/wiki/microsoft)'s multi-agent conversation framework |
| Semantic Kernel | Microsoft's AI orchestration SDK |
| Spring AI | Java/Spring integration for enterprise applications |
| CrewAI | Multi-agent orchestration framework |

#### Database and Infrastructure

- PostgreSQL with pgai extension for AI-powered queries
- ChromaDB and other vector databases for RAG pipelines
- IoT device integrations for edge AI deployments

## Ollama for Enterprise and Teams

As of early 2026, Ollama is developing team and enterprise features, though formal plans have not yet been publicly released. According to Ollama's website, team and enterprise plans are "coming soon," and interested organizations can contact hello@ollama.com for details.

Features reported to be under development include:

- Team collaboration with shared model workspaces
- Usage analytics and monitoring
- Centralized team model management
- SSO (single sign-on) integration
- Professional support and SLAs

Ollama Cloud's hybrid approach (local plus cloud inference) positions the platform for enterprise adoption, particularly for organizations that need to process sensitive data locally while offloading larger workloads to private cloud infrastructure. For organizations requiring enterprise-grade autoscaling, multi-tenant throughput, and GPU pooling, alternatives such as [vLLM](/wiki/vllm) or managed inference services may be more appropriate until Ollama's enterprise features mature.

## Comparisons with Similar Tools

Ollama competes in the growing market for local LLM runners and inference tools. Each tool has distinct strengths depending on the user's needs.

| Feature | Ollama | LM Studio | GPT4All | Jan | LocalAI |
| --- | --- | --- | --- | --- | --- |
| Primary Interface | CLI and API (desktop app added 2025) | GUI-focused | GUI-focused | GUI-focused | API-focused |
| License | MIT (open source) | Proprietary (free for personal use) | MIT (open source) | AGPL (open source) | MIT (open source) |
| Model Sources | Ollama registry + GGUF | [Hugging Face](/wiki/hugging_face) + GGUF | Curated list + GGUF | Hugging Face + GGUF | Multiple backends |
| OpenAI API Compatibility | Yes | Yes | Limited | Yes | Yes (primary focus) |
| Concurrent Handling | Excellent (batching) | Limited | Limited | Limited | Good |
| Tool Calling | Yes (streaming) | Yes | No | Limited | Yes |
| Cloud Option | Ollama Cloud ($20-100/mo) | No | No | No | No |
| macOS Performance | Good (Metal) | Better (MLX support) | Good | Good | Good |
| Best For | Developers, automation, API integration | Non-technical users, GUI workflows | Beginners, offline chat | Privacy-focused personal assistant | API hub, multi-backend orchestration |

### Ollama vs. LM Studio

Ollama and [LM Studio](/wiki/lmstudio) are the two most popular tools for running LLMs locally. Ollama excels at automation, scripting, and integration through its API and CLI, while LM Studio provides a more polished graphical interface. LM Studio also supports Apple's MLX framework for optimized performance on Apple Silicon, which Ollama does not currently support. Ollama's API runs as a system service (always available in the background), whereas LM Studio requires manually starting its server.

### How does Ollama differ from llama.cpp?

Because Ollama is built on top of llama.cpp, it inherits much of the same model support and inference performance. The key difference is that Ollama adds model management (pulling, pushing, listing, and creating models), a REST API, and a simpler user experience. Advanced users who need direct control over inference parameters, or who want to avoid the overhead of Ollama's server, may prefer to use llama.cpp directly.
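The model-management layer is reachable over the REST API. As a hedged sketch, the code below lists locally stored models by querying the server's `/api/tags` endpoint with the standard library only; the helper names and the sample payload values are made up for illustration, and the base URL assumes the default loopback binding.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def model_names(tags_payload: Dict[str, Any]) -> List[str]:
    """Extract model names from a parsed /api/tags response body."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def list_local_models(base_url: str = "http://127.0.0.1:11434") -> List[str]:
    """Ask a running Ollama server which models are stored locally."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return model_names(json.load(response))


# Illustrative payload in the shape the endpoint returns (values are made up).
sample = {"models": [{"name": "llama3.2:latest"}, {"name": "example-model:7b"}]}
print(model_names(sample))
```

`list_local_models()` needs a running server, whereas `model_names` works on any already-parsed response, which keeps the parsing step easy to test in isolation.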

## Community and Reception

Ollama has one of the largest open-source AI communities. As of June 2026, the main repository had surpassed 175,000 GitHub stars and 16,700 forks, up from roughly 166,000 stars and 15,000 forks in March 2026.[2] The project maintains active development with thousands of commits and frequent releases (it advanced from the v0.18.x line to the v0.30.x line over the first half of 2026). The companion libraries (ollama-python and ollama-js) are also widely used, with nearly 1,000 forks for the Python library alone.[2]

The project has been praised for advancing local AI accessibility, reducing costs for developers and researchers, and enabling privacy-preserving AI workflows. It is frequently cited as the easiest way to get started with open-source LLMs.

### Licensing Controversy

Community members have criticized Ollama over license compliance for dependencies such as llama.cpp, particularly regarding proper attribution. The Ollama team has been working to address these concerns.

## Significance and Impact

Ollama has been a key driver in the democratization of large language models by:

- Enabling developers to build and test AI-powered applications locally without cost
- Allowing researchers to experiment with various open-source models easily
- Empowering hobbyists to run state-of-the-art AI on personal computers
- Enhancing privacy for users who can leverage powerful AI without data leaving their machine
- Fostering a community-driven approach to AI development
- Providing an on-ramp for enterprises exploring private AI deployments

The tool is widely used in education, research, and enterprise for privacy-sensitive applications and has become a foundational tool in the open-source AI movement. Its Docker-like experience for AI models has set the standard that competitors measure themselves against, and its growing feature set (cloud models, web search, image generation, coding tool integration) signals a trajectory toward becoming a comprehensive local AI platform.

## See Also

- [Large language model](/wiki/large_language_model)
- [llama.cpp](/wiki/llama_cpp)
- [GGUF](/wiki/gguf)
- [LangChain](/wiki/langchain)
- [LlamaIndex](/wiki/llamaindex)
- [Hugging Face](/wiki/hugging_face)
- [Docker](https://en.wikipedia.org/wiki/Docker_(software))
- [LM Studio](/wiki/lmstudio)
- [Mixture of Experts](/wiki/mixture_of_experts)
- [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
- [Embeddings](/wiki/embeddings)

## References

1. [Ollama Official Website](https://ollama.com)
2. [Ollama GitHub Repository](https://github.com/ollama/ollama)
3. [Ollama on Crunchbase](https://www.crunchbase.com/organization/ollama)
4. [Ollama on Y Combinator](https://www.ycombinator.com/companies/ollama)
5. [Ollama Pre-Seed Funding Round (Crunchbase)](https://www.crunchbase.com/funding_round/ollama-pre-seed--c51b44d8)
6. [Ollama Blog: OpenAI Compatibility](https://ollama.com/blog/openai-compatibility)
7. [Ollama Blog: Windows Preview](https://ollama.com/blog/windows-preview)
8. [Ollama Blog: Tool Support](https://ollama.com/blog/tool-support)
9. [Ollama Blog: Streaming Tool Calls](https://ollama.com/blog/streaming-tool)
10. [Ollama Blog: Structured Outputs](https://ollama.com/blog/structured-outputs)
11. [Ollama Blog: New Desktop App](https://ollama.com/blog/new-app)
12. [Ollama Blog: Cloud Models](https://ollama.com/blog/cloud-models)
13. [Ollama Blog: Web Search](https://ollama.com/blog/web-search)
14. [Ollama Blog: Secure Minions](https://ollama.com/blog/secureminions)
15. [Ollama Blog: Image Generation](https://ollama.com/blog/image-generation)
16. [Ollama Blog: ollama launch](https://ollama.com/blog/launch)
17. [Ollama Blog: OpenClaw Integration](https://ollama.com/blog/openclaw)
18. [Ollama Blog: Claude Code Integration](https://ollama.com/blog/claude)
19. [Ollama Docs: Hardware Support](https://docs.ollama.com/gpu)
20. [Ollama Docs: Tool Calling](https://docs.ollama.com/capabilities/tool-calling)
21. [Ollama Docs: Structured Outputs](https://docs.ollama.com/capabilities/structured-outputs)
22. [Ollama Docs: Web Search](https://docs.ollama.com/capabilities/web-search)
23. [Ollama Blog: Embedding Models](https://ollama.com/blog/embedding-models)
24. [Ollama Blog: Llama 3.2 Vision](https://ollama.com/blog/llama3.2-vision)
25. [Ollama Model Library](https://ollama.com/library)
26. [Open WebUI GitHub Repository](https://github.com/open-webui/open-webui)
27. [LangChain Ollama Integration](https://docs.langchain.com/oss/python/integrations/providers/ollama)
28. [Infralovers: Ollama in 2025 Major Updates](https://www.infralovers.com/blog/2025-08-13-ollama-2025-updates/)
29. [GitHub Releases: ollama/ollama](https://github.com/ollama/ollama/releases)
30. [Phoronix: ollama Experimental Vulkan Support](https://www.phoronix.com/news/ollama-Experimental-Vulkan)
31. [Minions: Cost-efficient Collaboration between On-device and Cloud Language Models (OpenReview, ICML 2025)](https://openreview.net/forum?id=qGDlzt3dKz)

## External Links

- [Official Website](https://ollama.com)
- [Ollama on GitHub](https://github.com/ollama/ollama)
- [Ollama Docker Hub](https://hub.docker.com/r/ollama/ollama)
- [Ollama integration with Hugging Face](https://huggingface.co/docs/hub/en/ollama)
- [Official Documentation](https://docs.ollama.com)