LM Studio is a desktop application for discovering, downloading, and running large language models locally on personal hardware. Developed by Element Labs, Inc., a company founded by Yagil Burowski, LM Studio provides a graphical user interface that allows users to interact with open-weight LLMs without relying on cloud services or sending data to external servers. The application supports GGUF models through llama.cpp and MLX models on Apple Silicon, and it exposes an OpenAI-compatible API server for integration with third-party tools and applications.
LM Studio was first publicly released in May 2023 and has since become one of the most widely used tools for running local LLMs. The application is available for macOS, Windows, and Linux. As of March 2026, the latest stable release is version 0.4.7.
LM Studio was created by Yagil Burowski, who previously worked at Apple. The application is built by Element Labs, Inc., a software company based in Brooklyn, New York. Burowski founded the company in 2023 with the goal of making local AI inference accessible to a broad audience through a polished desktop experience.
The initial public release, version 0.1.x, launched in May 2023. This first version introduced the core functionality of browsing and downloading models from Hugging Face and running them locally through a chat interface. It primarily targeted Windows users, with experimental support for macOS and Linux.
Version 0.2.x arrived in early 2024, bringing improved platform support including more robust macOS compatibility. Later patches in this series, such as 0.2.22, introduced Flash Attention integration for faster inference.
The 0.3.x series, which began in mid-2024, represented a major expansion of capabilities. Version 0.3.0 (August 2024) added built-in RAG, a light theme, internationalization, and the Structured Outputs API. Version 0.3.4 (October 2024) added Apple MLX support for efficient inference on Apple Silicon. Version 0.3.5 introduced headless mode and on-demand model loading. Version 0.3.10 (February 2025) brought speculative decoding. Version 0.3.13 (March 2025) added support for Google Gemma 3 and multimodal image input. Version 0.3.14 (April 2025) introduced multi-GPU controls. Version 0.3.17 (June 2025) added Model Context Protocol (MCP) support. Version 0.3.21 (August 2025) shipped day-one support for OpenAI's gpt-oss models. Flash Attention was enabled by default for CUDA in version 0.3.31, and for the Vulkan and Metal backends in version 0.3.32.
Version 0.4.0 was released on January 28, 2026. This release introduced llmster (a headless daemon for server deployments), parallel request processing with continuous batching, a new stateful REST API, and a refreshed user interface with split-view chat and developer mode. Subsequent updates through 0.4.7 added LM Link for remote instance connectivity, end-to-end encryption via Tailscale, and improved tool calling for newer model families.
The following table summarizes major releases in LM Studio's development history:
| Version | Date | Key Features |
|---|---|---|
| 0.1.x | May 2023 | Initial public release; core model downloading and chat functionality; primarily Windows support with experimental macOS and Linux builds |
| 0.2.x | Early 2024 | Enhanced platform support, improved macOS compatibility, Flash Attention integration (0.2.22) |
| 0.3.0 | August 2024 | Built-in RAG, light theme, internationalization, Structured Outputs API |
| 0.3.4 | October 2024 | Apple MLX engine for Apple Silicon Macs, vision model support via MLX |
| 0.3.5 | October 2024 | Headless mode, on-demand model loading, server auto-start, CLI model downloads |
| 0.3.10 | February 2025 | Speculative Decoding for llama.cpp and MLX engines |
| 0.3.13 | March 2025 | Google Gemma 3 support, multimodal image input |
| 0.3.14 | April 2025 | Multi-GPU controls for NVIDIA GPUs |
| 0.3.17 | June 2025 | Model Context Protocol (MCP) host support |
| 0.3.21 | August 2025 | Day-one support for OpenAI's gpt-oss models |
| 0.3.29 | 2025 | OpenAI Responses API with local models |
| 0.3.31-32 | 2025 | Flash Attention enabled by default for CUDA, Vulkan, and Metal |
| 0.4.0 | January 28, 2026 | llmster headless daemon, continuous batching, new REST API, revamped UI with split view |
| 0.4.5-0.4.6 | February 2026 | LM Link for remote instance connectivity with end-to-end encryption via Tailscale |
| 0.4.7 | March 2026 | Improved tool calling for Qwen 3.5 and GLM models, enhanced Anthropic API compatibility |
LM Studio raised $19.3 million in a Series B funding round on April 25, 2025. Investors include Matrix, Preston-Werner Ventures, and Torch Capital. As of 2025, Element Labs had a team of approximately 16 people.
LM Studio is an Electron application. The frontend is built with React and TypeScript, and the build system relies on Vite. The application bundles two inference backends: llama.cpp for GGUF models and Apple's MLX framework for MLX-format models on Apple Silicon hardware.
The llama.cpp backend supports multiple compute variants, including CPU-only, CUDA (for NVIDIA GPUs), Vulkan (cross-platform GPU), ROCm (for AMD GPUs), and Metal (for Apple GPUs). This allows LM Studio to take advantage of GPU acceleration across all major hardware platforms.
The MLX engine is implemented in Python and combines three core libraries: mlx-lm (Apple's library for LLM inference on MLX), Outlines (for structured generation and JSON schema enforcement), and mlx-vlm (for vision model support). LM Studio deploys Python using python-build-standalone with stacked virtual environments for portable, cross-machine compatibility. In benchmarks, Llama 3.2 1B running on an M3 Max chip achieved approximately 250 tokens per second using the MLX backend.
LM Studio includes a built-in model browser that connects directly to Hugging Face, the largest repository of open-weight models. Users can search for models by name, keyword, or model identifier without leaving the application. The browser displays available model variants with different quantization levels (such as Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, and Q8_0), allowing users to choose a balance between model quality and memory requirements.
The quantization level indicates how many bits are used to represent each model weight. More aggressive quantization (such as Q4) reduces memory usage and increases inference speed but introduces minor quality loss. Lighter quantization (such as Q8) preserves more of the original model's accuracy but requires more memory. Q5_K_M is generally recommended as a good balance between performance and quality.
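The memory impact of a given quantization level can be estimated with simple arithmetic: parameter count times bits per weight, divided by eight, plus runtime overhead. The effective bits-per-weight figure and overhead factor below are illustrative assumptions (K-quants mix precisions, so effective size sits slightly above the nominal bit width), not LM Studio's own numbers:

```python
def approx_model_memory_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough memory estimate for a quantized model.

    K-quants use mixed precision, so effective bits-per-weight is a bit
    above the nominal level (e.g. ~4.5 for Q4_K_M); `overhead` is a loose
    allowance for the KV cache and runtime buffers. Both figures are
    illustrative, not exact.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at ~4.5 effective bits needs roughly 4-5 GB,
# while the same model at 8 bits needs roughly 8-9 GB:
print(round(approx_model_memory_gb(7, 4.5), 1))
print(round(approx_model_memory_gb(7, 8.0), 1))
```

This kind of back-of-the-envelope estimate is how users typically decide which quantization variant of a model will fit their RAM or VRAM.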
The lmstudio-community organization on Hugging Face provides pre-quantized GGUF models maintained by community members, making it straightforward to find optimized versions of popular models. LM Studio also maintains a curated model catalog on its website, highlighting popular and trending models across families like Llama, Qwen, Mistral, Gemma, Phi, and DeepSeek.
The chat interface provides a conversational experience similar to ChatGPT or Claude. Users can create multiple conversations organized in folders, adjust generation parameters (temperature, top-p, max tokens, repeat penalty), manage conversation history, and customize model behavior through system prompts.
Version 0.4.0 introduced split-view chat, which allows users to view two conversations side by side by dragging and dropping chat tabs. This feature is useful for comparing outputs from different models or different prompt configurations. Configuration presets let users save and quickly switch between different parameter settings for various use cases.
Chat conversations can be exported as PDF (including images), Markdown, or plain text files.
Introduced in version 0.3.0 (August 2024), LM Studio's "Chat with Documents" feature enables retrieval-augmented generation (RAG) directly within the desktop application. Users can attach files in formats including PDF, DOCX, TXT, CSV, and plain text to a chat session. If the document is short enough to fit within the model's context window, LM Studio inserts the full file contents into the conversation. For longer documents, the application automatically switches to RAG mode, chunking the document and retrieving relevant sections based on the user's query.
The feature supports up to 5 files with a maximum combined size of 30 megabytes. Citations are provided at the end of responses, showing which portions of the uploaded documents were used to generate the answer. All document processing happens locally on the user's machine, ensuring that sensitive files never leave the device.
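The full-insertion-versus-RAG decision described above can be sketched as a simple branch on document length. The chunk size here is an illustrative default, not LM Studio's actual setting:

```python
def prepare_document(tokens, context_window, chunk_size=256):
    """Mirror of the described behavior: insert the whole document when it
    fits the model's context window, otherwise fall back to chunking so
    relevant sections can be retrieved per query."""
    if len(tokens) <= context_window:
        return "full", [tokens]
    # Too long: split into fixed-size chunks for retrieval.
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    return "rag", chunks

mode, chunks = prepare_document(list(range(100)), context_window=4096)
print(mode, len(chunks))      # short file: inserted whole

mode, chunks = prepare_document(list(range(10_000)), context_window=4096)
print(mode, len(chunks))      # long file: chunked for retrieval
```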
One of LM Studio's most important features for developers is its built-in local API server. When started, the server exposes endpoints on localhost (port 1234 by default) that are compatible with OpenAI's API specification. This means any application, script, or library designed to work with the OpenAI API can be redirected to use a local model by simply changing the base URL.
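Because the server mirrors OpenAI's API shape, redirecting an existing client is a one-line change to the base URL. A minimal stdlib sketch (the model name is a placeholder; sending the request assumes a running LM Studio server):

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, stream=False):
    """Build an OpenAI-style chat completion request.

    Pointing base_url at LM Studio's local server
    (http://localhost:1234/v1 by default) is the only change needed
    versus the hosted OpenAI API.
    """
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:1234/v1",
    "llama-3.2-1b-instruct",
    [{"role": "user", "content": "Hello!"}],
)
# urllib.request.urlopen(req) would send this to a running server.
```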
The server supports the following endpoints:
| Endpoint | Path | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Standard chat completion requests with streaming support |
| Embeddings | /v1/embeddings | Text embedding generation for semantic search and RAG |
| Models | /v1/models | Lists currently loaded models |
| Responses | /v1/responses | OpenAI Responses API for stateful conversations |
| Anthropic Messages | /v1/messages | Anthropic-compatible endpoint (v0.4.1+) |
| Native REST API | /api/v1/* | LM Studio's own stateful REST API with full feature access |
Starting with version 0.4.0, the server supports parallel request processing through continuous batching. Instead of queuing requests one by one, the model can process up to N requests simultaneously (default: 4 parallel slots). This is powered by a unified KV cache that dynamically allocates memory across concurrent requests of varying sizes.
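The scheduling idea behind continuous batching can be illustrated with a toy scheduler: a finished request frees its slot immediately, so queued requests join mid-flight rather than waiting for the whole batch to drain. This is a conceptual sketch, not LM Studio's implementation:

```python
from collections import deque

def continuous_batching(requests, max_slots=4):
    """Toy continuous-batching scheduler over (id, tokens_to_generate)
    pairs. Returns which requests ran at each decode step."""
    queue = deque(requests)
    active = {}        # request_id -> tokens still to generate
    schedule = []
    while queue or active:
        # Fill any free slots from the queue (the "continuous" part).
        while queue and len(active) < max_slots:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        schedule.append(sorted(active))
        # One decode step for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed immediately
    return schedule

# With 2 slots, "c" takes over "b"'s slot the moment "b" finishes,
# while "a" keeps generating:
print(continuous_batching([("a", 2), ("b", 1), ("c", 3)], max_slots=2))
```

A static batcher would instead wait for both "a" and "b" to finish before admitting "c", leaving the freed slot idle.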
The server also supports Just-In-Time model loading, which can automatically load a model when a request arrives and optionally unload it after a period of inactivity. This is useful for server deployments where multiple models need to be available but only one is active at a time.
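The load-on-request, unload-on-idle lifecycle can be sketched as a small state machine. The TTL eviction policy below is an illustrative assumption, not LM Studio's exact logic:

```python
import time

class JITModelSlot:
    """Toy sketch of just-in-time model loading with idle-timeout unload."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable for testing
        self.loaded = None      # name of the resident model, if any
        self.last_used = 0.0
        self.loads = 0          # counts (expensive) weight loads

    def handle_request(self, model_name):
        now = self.clock()
        # Evict after a period of inactivity.
        if self.loaded is not None and now - self.last_used > self.ttl:
            self.loaded = None
        # Load on demand when a request arrives for a non-resident model.
        if self.loaded != model_name:
            self.loaded = model_name   # stands in for loading weights
            self.loads += 1
        self.last_used = now
        return self.loaded
```

Requests arriving within the idle window reuse the resident model; a request after the window pays the load cost again.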
Version 0.4.0 introduced llmster, a headless daemon that packages the core inference engine without the GUI. llmster can run on servers, cloud instances, or any machine that lacks a graphical display. It supports Linux, macOS, and Windows, making LM Studio suitable for production server deployments in addition to desktop use. The llmster daemon supports the same API endpoints and model management features as the desktop application.
LM Studio provides official SDKs and command-line tools for developers:
| Tool | Package | Language | License |
|---|---|---|---|
| TypeScript SDK | @lmstudio/sdk (npm) | TypeScript | MIT |
| Python SDK | lmstudio (pip) | Python | MIT |
| CLI | lms | Command-line | MIT |
The SDKs allow programmatic model loading, inference, tool calling, and structured output enforcement. The Python SDK supports Pydantic for schema validation, while the TypeScript SDK supports Zod. Both SDKs can enforce JSON schema output from models.
The lms CLI tool provides commands for model management, server control, and interactive chat sessions from the terminal. The lms chat command (added in 0.4.0) allows terminal-based conversations with slash commands and support for pasting larger content blocks. Models can also be downloaded from the CLI using commands like lms get lmstudio-community/llama-3.2-1b-instruct-gguf@q4_k_m.
LM Studio 0.3.17 introduced support for Model Context Protocol (MCP), a standard originally introduced by Anthropic for connecting LLMs to external tools and data sources. Users can configure MCP servers through an mcp.json file or through the application's Tools & Integrations menu. Each MCP server runs in a separate, isolated process for stability and security.
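A minimal mcp.json sketch, assuming the common mcpServers layout used by MCP hosts; the server name, package, and environment variable below are illustrative placeholders:

```json
{
  "mcpServers": {
    "example-search": {
      "command": "npx",
      "args": ["-y", "@example/mcp-search-server"],
      "env": { "SEARCH_API_KEY": "..." }
    }
  }
}
```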
When a model invokes a tool through MCP, LM Studio displays a confirmation dialog that lets the user review and optionally edit the tool call arguments before execution. MCP is also accessible through the REST API, enabling programmatic tool use in server deployments. Common use cases include connecting models to web search APIs, file system access, database queries, and the Hugging Face API.
Speculative decoding is an inference optimization technique that can speed up text generation by 20 to 50 percent without reducing output quality. It works by pairing a large main model with a smaller draft model. The draft model quickly proposes candidate tokens, which the main model then verifies in parallel. Since verification is faster than generation from scratch, the overall throughput increases.
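The propose-then-verify loop can be demonstrated with toy next-token functions (real engines verify all draft positions in a single batched forward pass rather than one at a time):

```python
def speculative_step(main_model, draft_model, prefix, k=4):
    """One round of (greedy) speculative decoding with toy models."""
    # 1. The cheap draft model proposes k candidate tokens.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The main model verifies the proposals position by position.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        verified = main_model(ctx)
        if verified != tok:
            # First mismatch: keep the main model's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All k drafts accepted: verification yields one bonus token.
    accepted.append(main_model(ctx))
    return accepted

# Toy models: the "main" model counts upward; the draft agrees below 5.
main = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 5 else ctx[-1] + 2

print(speculative_step(main, draft, [0]))  # all 4 drafts accepted + bonus
print(speculative_step(main, draft, [4]))  # draft diverges after 1 token
```

When the draft model agrees often, each verification pass yields several tokens for roughly the cost of one, which is where the speedup comes from; the output is identical to what the main model would have produced alone.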
LM Studio added speculative decoding support in version 0.3.10 (February 2025). The system attempts to offload the draft model entirely to the GPU when one is available. Speculative decoding is accessible through both the chat interface and the OpenAI-compatible API.
LM Studio supports structured output through JSON schema enforcement. When a JSON schema is provided with a request to the /v1/chat/completions endpoint, the model is constrained to produce valid JSON conforming to the specified schema. This follows the same format as OpenAI's Structured Output API, ensuring compatibility with existing OpenAI client libraries.
The SDKs surface APIs to enforce the model's output format using Pydantic (for Python) or Zod (for TypeScript). On Apple Silicon, the MLX backend uses the Outlines library to enforce structured generation constraints.
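In the OpenAI-compatible request shape, the schema travels in the response_format field. A minimal sketch of such a request body (the model name and schema are illustrative):

```python
import json

# A hand-written JSON schema for the shape the model must emit.
book_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

# OpenAI-style structured output request body, as accepted by
# the /v1/chat/completions endpoint.
request_body = {
    "model": "llama-3.2-1b-instruct",
    "messages": [{"role": "user", "content": "Name a famous novel."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "book", "schema": book_schema},
    },
}

print(json.dumps(request_body, indent=2))
```

With this constraint in place, the model's reply is guaranteed to parse as JSON matching book_schema rather than free-form text.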
Version 0.3.14 (April 2025) introduced multi-GPU controls, allowing users to distribute model layers across multiple GPUs. This is particularly useful for running larger models that exceed the VRAM capacity of a single GPU. Users can enable or disable specific GPUs, set priority order for memory allocation, and limit dedicated GPU memory usage to prevent model weights from spilling into slower shared memory.
LM Studio supports multimodal vision models that can process both text and images. On Apple Silicon, the mlx-vlm library enables running vision models such as LLaVA. The llama.cpp backend also supports vision-capable models on all platforms. Users can attach images to chat messages for visual question answering, image description, and other multimodal tasks. Supported vision models include Gemma 3 (multimodal), Qwen-VL, GLM-4V, and Llama 3.2 Vision.
LM Studio supports two primary model formats:
| Format | Backend | Hardware | Description |
|---|---|---|---|
| GGUF | llama.cpp | All platforms (CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan) | The standard format for quantized LLM inference; supports a wide range of quantization levels from Q2 through Q8, F16, and F32 |
| MLX | Apple MLX | Apple Silicon only (M1, M2, M3, M4) | Apple's optimized format for on-device inference; hosted by the mlx-community organization on Hugging Face |
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It stores quantized model weights in a single file and supports quantization levels from 2-bit to 8-bit, with various K-quant variants (such as Q4_K_M and Q5_K_M) that apply different precision to different model layers. Dropping from FP16 to Q8 is mathematically lossy but empirically invisible to most users. Dropping to Q4 introduces minor stylistic changes but retains logical reasoning capabilities.
MLX models use Apple's MLX framework and are typically available as separate uploads on Hugging Face under the mlx-community organization. MLX models cannot be used with the llama.cpp engine, and GGUF models cannot be used with the MLX engine. The two formats serve different hardware targets and cannot be interchanged.
System requirements on macOS:

| Requirement | Specification |
|---|---|
| Chip | Apple Silicon only (M1, M2, M3, M4 series) |
| Operating System | macOS 14.0 (Sonoma) or newer |
| RAM | 16 GB recommended; 8 GB possible with smaller models |
| Note | Intel-based Macs are not supported |
System requirements on Windows:

| Requirement | Specification |
|---|---|
| Architecture | x64 (with AVX2 instruction set) and ARM (Snapdragon X Elite) |
| RAM | 16 GB minimum recommended |
| GPU | At least 4 GB dedicated VRAM recommended for GPU acceleration |
| GPU Vendors | NVIDIA (CUDA), AMD (Vulkan/ROCm), Intel (Vulkan) |
System requirements on Linux:

| Requirement | Specification |
|---|---|
| Architecture | x64 and ARM64 (aarch64) |
| Distribution | Ubuntu 20.04 or newer |
| Format | Distributed as AppImage |
| CPU | x64 builds include AVX2 support by default |
| GPU | NVIDIA (CUDA), AMD (ROCm) |
For GPU acceleration on all platforms, at least 4 GB of dedicated VRAM is recommended. Larger models and lighter quantization levels require proportionally more VRAM. Models that exceed available VRAM can still run using a combination of GPU and CPU offloading, though at reduced speed. On AMD systems with Variable Graphics Memory (VGM), users can allocate system RAM as dedicated VRAM.
| GPU Type | Backend | Platform | Notes |
|---|---|---|---|
| NVIDIA (CUDA) | CUDA 12.8 | Windows, Linux | Full support including multi-GPU, Flash Attention, and RTX 50-series |
| Apple Silicon (Metal) | Metal / MLX | macOS | Uses unified memory; supports both llama.cpp (Metal) and MLX backends |
| AMD | ROCm / Vulkan | Windows, Linux | ROCm support on Linux including AMD 9000 series; Vulkan on Windows |
| Intel | Vulkan | Windows, Linux | Vulkan backend for Intel Arc and integrated GPUs |
LM Studio can run any model available in GGUF format on Hugging Face, as well as MLX-format models for Apple Silicon. This includes thousands of models across many families. Some notable model families supported include:
| Model Family | Developer | Available Sizes | Notes |
|---|---|---|---|
| Llama 3 / 3.1 / 3.2 | Meta | 1B, 3B, 8B, 70B, 405B | Widely used general-purpose models |
| Qwen 2.5 / 3 / 3.5 | Alibaba | 0.5B to 235B | Strong multilingual and coding support |
| Mistral / Mistral Small / Nemo | Mistral AI | 7B, 12B, 24B | Efficient European-developed models |
| Gemma 2 / 3 / 3n | Google | 270M to 27B | Lightweight and multimodal variants |
| Phi 3 / 4 | Microsoft | 3B, 14B | Compact models with strong reasoning |
| DeepSeek R1 | DeepSeek | 7B to 70B (distilled) | Reasoning-focused models with chain-of-thought capabilities |
| gpt-oss | OpenAI | 20B, 120B | OpenAI's open-weight models, released August 2025 under Apache 2.0 |
| Codestral / Devstral | Mistral AI | 22B, 24B | Code-specialized models |
| GLM-4 | Zhipu AI | 9B, 30B | Multimodal and general-purpose |
| NVIDIA Nemotron | NVIDIA | 30B, 120B | Enterprise-grade reasoning models |
| Granite 4.0 | IBM | 3B, 7B, 32B | Enterprise and research models |
| Falcon | TII | 7B, 40B, 180B | Open-weight models from the Technology Innovation Institute |
LM Studio partnered with OpenAI to provide day-one support for gpt-oss models when they launched in August 2025. The gpt-oss-20b model can run on devices with as little as 16 GB of memory, while the gpt-oss-120b model requires approximately 80 GB. These models feature configurable reasoning effort (low, medium, or high) and native agentic capabilities including function calling, web browsing, Python execution, and structured outputs.
Embedding models are also supported for use with RAG workflows. Available embedding models include nomic-embed-text-v1.5 and EmbeddingGemma.
LM Studio is proprietary freeware. The desktop application is free for both personal and commercial use, with no subscription fees, usage limits, or API call charges. In July 2025, Element Labs removed the previous requirement for a separate commercial license, stating that the old model created "high friction" that caused teams to "self-select out of using LM Studio altogether" rather than navigate procurement processes.
While the main desktop application is closed-source, several components of the LM Studio ecosystem are open source under the MIT license. These include the TypeScript SDK (lmstudio-js), the Python SDK (lmstudio-python), and the CLI tool (lms).
For organizations that need advanced features, LM Studio offers additional plans:
| Plan | Price | Features |
|---|---|---|
| Personal / Work | Free | Full desktop application, local server, SDKs, CLI, all inference features |
| Teams | Free (self-serve) | Private sharing of artifacts within teams, Hub organization |
| Enterprise | Contact sales | Single Sign-On (SSO), model and MCP gating, private collaboration, priority support |
The Enterprise plan targets Fortune 500 companies, universities, and large organizations that require centralized administration and compliance features.
Privacy is one of LM Studio's primary advantages. Because all model inference happens locally on the user's hardware, no data is sent to external servers during normal operation. Conversations, documents, and queries remain entirely on the local machine. This makes LM Studio suitable for working with sensitive or confidential information where data residency requirements would prevent the use of cloud-based AI services.
For organizations that need to share access to local models across a network, LM Studio introduced LM Link in version 0.4.5 (February 2026). LM Link allows users to connect to remote LM Studio instances and use their models as if they were local. The feature includes end-to-end encryption through a partnership with Tailscale.
LM Studio is one of several tools available for running LLMs locally. The most commonly compared alternatives are Ollama and GPT4All.
| Feature | LM Studio | Ollama | GPT4All |
|---|---|---|---|
| Primary Interface | GUI desktop application | CLI and REST API | GUI desktop application |
| Primary Audience | Power users, developers, experimenters | Developers, DevOps engineers | Beginners, privacy-focused users |
| Model Source | Hugging Face (in-app browser) | Ollama model library, Hugging Face | Curated model list, Hugging Face |
| Model Formats | GGUF, MLX | GGUF (via Modelfile), Safetensors | GGUF |
| OpenAI-Compatible API | Yes (port 1234) | Yes (port 11434) | Yes |
| Apple Silicon Optimization | MLX engine + Metal | Metal via llama.cpp | Metal via llama.cpp |
| NVIDIA GPU Support | CUDA 12.8 | CUDA | CUDA |
| AMD GPU Support | ROCm, Vulkan | ROCm | Vulkan (limited) |
| Multi-GPU | Yes (v0.3.14+) | Yes | No |
| Document Chat (RAG) | Built-in (PDF, DOCX, TXT, CSV) | No (requires external tools) | Yes (LocalDocs) |
| MCP Support | Yes (v0.3.17+) | No | No |
| Speculative Decoding | Yes (v0.3.10+) | Yes | No |
| Docker / Headless | Yes (llmster daemon) | Yes (official Docker image) | No |
| Developer SDKs | TypeScript, Python | REST API, community libraries | Python bindings |
| Vision Models | Yes | Yes | Limited |
| Structured Output | Yes (JSON schema) | Yes (JSON mode) | No |
| Parallel Requests | Yes (continuous batching) | Yes | No |
| License | Proprietary freeware | MIT (open source) | MIT (open source) |
| Platforms | macOS, Windows, Linux | macOS, Windows, Linux | macOS, Windows, Linux |
| Pricing | Free | Free | Free |
Ollama is a command-line tool and inference server that prioritizes simplicity and minimal resource usage. It is popular among developers who embed LLM capabilities into larger applications or containerized environments. Ollama uses its own model packaging format built on top of GGUF files and maintains a curated model library at ollama.com. It has over 134,000 stars on GitHub and is licensed under the MIT license, making it fully open source.
LM Studio provides a richer graphical experience with more configuration options, making it better suited for users who want fine-grained control over model parameters, quantization choices, and GPU allocation. LM Studio's MLX support gives it a distinct advantage on Apple Silicon hardware, and its built-in model browser provides a more streamlined model discovery experience. Ollama's Docker integration and lightweight footprint (tens of megabytes versus several hundred for LM Studio) make it more practical for containerized server environments and CI/CD pipelines.
GPT4All is developed by Nomic AI and focuses on accessibility for non-technical users. Its standout feature is LocalDocs, a built-in retrieval-augmented generation system that lets users upload local documents and query them through the chat interface without any additional setup. LocalDocs supports various file formats and uses Nomic's embedding models to bring information from local files into conversations. GPT4All is fully open source under the MIT license and does not require a GPU to run.
LM Studio offers more advanced features for developers and power users, including MCP support, speculative decoding, multi-GPU controls, split-view chat, and comprehensive developer SDKs. GPT4All is better suited for users who primarily want to chat with a local model and query their own documents without dealing with API configurations or advanced inference settings.
LM Studio is built on the Electron framework, providing a consistent cross-platform experience across macOS, Windows, and Linux. The interface follows a modern, clean design with both light and dark themes.
The main application window is organized into several sections, including a chat view for conversations, a Discover tab for browsing and downloading models, a Developer tab for the local API server, and a My Models view for managing downloaded files.
The application supports keyboard shortcuts for common operations and includes a "Mission Control" overlay (accessible via hotkey) for quick model switching and status monitoring. Version 0.4.0 introduced a refreshed UI with the split-view chat interface, enabling side-by-side model comparison. The redesign also improved the model loading workflow and added better visual feedback for GPU memory usage and token generation speed.
Flash Attention is an optimization that significantly reduces memory usage and increases inference speed, especially for models with long context windows. LM Studio enabled Flash Attention by default for CUDA in version 0.3.31 and for Vulkan and Metal in version 0.3.32.
Version 0.4.0 introduced continuous batching for the llama.cpp engine, allowing the LM Studio server to dynamically combine multiple incoming requests into a single batch. Users can configure the maximum number of concurrent predictions (default: 4) when loading a model. The unified KV cache feature allows flexible memory allocation across concurrent requests of varying sizes. This feature is particularly valuable for developer workflows where multiple applications or services send requests to the same local model simultaneously. MLX support for continuous batching is in development.
KV (key-value) caching across prompts improves response times in multi-turn conversations by reusing computations from earlier messages in the chat history. According to LM Studio's documentation, this optimization can reduce processing time from approximately 10 seconds to 0.11 seconds in tested scenarios.
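The prefix-reuse idea can be modeled with a toy cache that tracks how many leading tokens match the previous prompt and only "computes" the new suffix. This is a conceptual illustration of why multi-turn follow-ups are fast, not LM Studio's implementation:

```python
class PrefixKVCache:
    """Toy model of cross-prompt KV caching: tokens matching the cached
    prefix are reused instead of recomputed on the next turn."""

    def __init__(self):
        self.cached = []   # tokens with "stored" KV entries

    def process(self, tokens):
        """Return the number of tokens needing fresh computation."""
        shared = 0
        while (shared < min(len(self.cached), len(tokens))
               and self.cached[shared] == tokens[shared]):
            shared += 1
        self.cached = list(tokens)
        return len(tokens) - shared

cache = PrefixKVCache()
turn1 = ["<sys>", "Hello"]
print(cache.process(turn1))                        # every token is new
turn2 = turn1 + ["Hi!", "How", "are", "you?"]
print(cache.process(turn2))                        # only the 4 new tokens
```

On a real engine the per-token work is the expensive attention computation, so reprocessing only the suffix instead of the full history accounts for the large reported speedup.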
Actual performance varies significantly based on hardware, model size, quantization level, and context length. The following approximate benchmarks provide a general reference:
| Hardware | Model Size | Approximate Speed |
|---|---|---|
| Apple M3 Max | 1B (MLX) | ~250 tokens/second |
| Apple M3 Max | 7B-14B | 35-70 tokens/second |
| NVIDIA RTX 4060 | 7B (Q4) | ~30-40 tokens/second |
| NVIDIA RTX 4090 | 7B (Q4) | 100+ tokens/second |
| CPU-only (modern x64) | 7B (Q4) | 5-15 tokens/second |
LM Studio has developed a growing ecosystem of community contributions and third-party integrations. The official Discord server (discord.gg/lmstudio) serves as the primary community hub where users ask questions, share configurations, and report issues. The LM Studio team maintains an active presence on the server.
The LM Studio GitHub organization (github.com/lmstudio-ai) hosts the open-source SDKs, CLI tool, and documentation. The lms CLI repository has accumulated over 4,400 stars, and the TypeScript SDK over 1,500 stars.
The lmstudio-community organization on Hugging Face is a community-driven effort where members quantize and upload optimized GGUF versions of popular models. This resource makes it easier for users to find models that are ready to run in LM Studio without performing quantization themselves.
LM Studio also integrates with a range of third-party tools and frameworks that can target its OpenAI-compatible endpoints.
LM Studio also maintains a presence on Twitter, LinkedIn, and publishes technical blog posts at lmstudio.ai/blog covering new releases, feature guides, and integration tutorials.
LM Studio serves a variety of use cases across personal and professional contexts, from private, offline chat with sensitive material to prototyping LLM-powered applications against the local API server.
Despite its strengths, LM Studio has several notable limitations: