LM Studio is a desktop application for discovering, downloading, and running large language models locally on personal hardware. Developed by Element Labs, Inc., the company founded by Yagil Burowski, LM Studio provides a graphical user interface that allows users to interact with open-weight LLMs without relying on cloud services or sending data to external servers. The application supports GGUF models through llama.cpp and MLX models on Apple Silicon, and it exposes an OpenAI-compatible API server for integration with third-party tools and applications.
LM Studio was first publicly released in May 2023 and has since become one of the most widely used tools for running local LLMs. The application is available for macOS, Windows, and Linux. As of April 2026, the latest stable release is version 0.4.12.
LM Studio was created by Yagil Burowski, who previously worked at Apple. The application is built by Element Labs, Inc., a software company based in Brooklyn, New York. Burowski founded the company in 2023 with the goal of making local AI inference accessible to a broad audience through a polished desktop experience. He earned a Bachelor of Science in Computer Science from the University of Pennsylvania (2013–2017) and serves as the company's chief executive officer. By 2025 the company reported approximately 16 employees and annual revenue of roughly $1.8 million.
The initial public release, version 0.1.x, launched in May 2023. This first version introduced the core functionality of browsing and downloading models from Hugging Face and running them locally through a chat interface. It primarily targeted Windows users, with experimental support for macOS and Linux.
Version 0.2.x arrived in early 2024, bringing improved platform support including more robust macOS compatibility. Later patches in this series, such as 0.2.22, introduced Flash Attention integration for faster inference.
The 0.3.x series, which began in mid-2024, represented a major expansion of capabilities. Version 0.3.0 (August 2024) added built-in RAG, a light theme, internationalization, and the Structured Outputs API. Version 0.3.4 (October 2024) added Apple MLX support for efficient inference on Apple Silicon. Version 0.3.5 introduced headless mode and on-demand model loading. Version 0.3.10 (February 2025) brought speculative decoding. Version 0.3.13 (March 2025) added Google Gemma 3 support and multimodal image input. Version 0.3.14 (April 2025) introduced multi-GPU controls. Version 0.3.15 (May 2025) added support for NVIDIA RTX 50-series GPUs through the CUDA 12.8 runtime and improved tool use in the API. Version 0.3.17 (June 2025) added Model Context Protocol (MCP) support. Version 0.3.21 (August 2025) shipped with support for OpenAI's gpt-oss models at launch. Flash Attention was enabled by default for the CUDA, Vulkan, and Metal backends in versions 0.3.31 and 0.3.32.
Version 0.4.0 was released on January 28, 2026. This release introduced llmster (a headless daemon for server deployments), parallel request processing with continuous batching, a new stateful REST API, and a refreshed user interface with split-view chat and developer mode. Subsequent updates through 0.4.12 expanded the platform with LM Link for remote instance connectivity, end-to-end encryption via Tailscale, OAuth-based authentication for MCP servers, native reasoning_effort and reasoning_tokens parameters in the OpenAI-compatible endpoints, and improved tool calling for newer model families such as Qwen 3.5, Qwen 3.6, Gemma 4, and GLM.
The following table summarizes major releases in LM Studio's development history:
| Version | Date | Key Features |
|---|---|---|
| 0.1.x | May 2023 | Initial public release; core model downloading and chat functionality; primarily Windows support with experimental macOS and Linux builds |
| 0.2.x | Early 2024 | Enhanced platform support, improved macOS compatibility, Flash Attention integration (0.2.22) |
| 0.3.0 | August 2024 | Built-in RAG, light theme, internationalization, Structured Outputs API |
| 0.3.4 | October 2024 | Apple MLX engine for Apple Silicon Macs, vision model support via MLX |
| 0.3.5 | October 2024 | Headless mode, on-demand model loading, server auto-start, CLI model downloads |
| 0.3.10 | February 2025 | Speculative Decoding for llama.cpp and MLX engines |
| 0.3.13 | March 2025 | Google Gemma 3 support, multimodal image input |
| 0.3.14 | April 2025 | Multi-GPU controls for NVIDIA GPUs |
| 0.3.15 | May 2025 | RTX 50-series support via CUDA 12.8, improved tool use API, preset publishing |
| 0.3.17 | June 2025 | Model Context Protocol (MCP) host support |
| 0.3.21 | August 2025 | Day-one support for OpenAI's gpt-oss models |
| 0.3.29 | 2025 | OpenAI Responses API with local models |
| 0.3.31-32 | 2025 | Flash Attention enabled by default for CUDA, Vulkan, and Metal |
| 0.4.0 | January 28, 2026 | llmster headless daemon, continuous batching, new REST API, revamped UI with split view |
| 0.4.1 | February 2026 | Anthropic-compatible /v1/messages endpoint for Claude SDK clients |
| 0.4.6 | February 27, 2026 | LM Link for remote instance connectivity with end-to-end encryption via Tailscale |
| 0.4.7 | March 18, 2026 | Improved tool calling for Qwen 3.5 and GLM models, enhanced Anthropic API compatibility |
| 0.4.8 | March 26, 2026 | reasoning_effort and reasoning_tokens parameters in OpenAI-compatible endpoints, CUDA memory improvements |
| 0.4.9 | April 2, 2026 | Anthropic-compatible API parameters expanded, fix for UI freeze when deleting chat folders |
| 0.4.10 | April 9, 2026 | OAuth support for MCP servers, improved Gemma 4 tool calling reliability |
| 0.4.11 | April 10, 2026 | Updated Gemma 4 chat template support |
| 0.4.12 | April 17, 2026 | Qwen 3.6 support, improved chat PDF export styling, Windows OAuth fix for MCP servers |
LM Studio raised $19.3 million in a Series B funding round on April 25, 2025. Investors include Matrix, Preston-Werner Ventures, and Torch Capital. As of 2025, Element Labs had a team of approximately 16 people. The capital was used to expand the engineering team, accelerate work on the MLX engine and llmster daemon, and fund the buildout of the LM Studio Hub for sharing presets and community models.
LM Studio is a desktop application built on Electron. The frontend uses React and TypeScript, and the build system relies on Vite. The application bundles two inference backends: llama.cpp for GGUF models and Apple's MLX framework for MLX-format models on Apple Silicon hardware.
The llama.cpp backend supports multiple compute variants, including CPU-only, CUDA (for NVIDIA GPUs), Vulkan (cross-platform GPU), ROCm (for AMD GPUs), and Metal (for Apple GPUs). This allows LM Studio to take advantage of GPU acceleration across all major hardware platforms.
The MLX engine is implemented in Python and combines three core libraries: mlx-lm (Apple's library for running LLMs on MLX), Outlines (for structured generation and JSON schema enforcement), and mlx-vlm (for vision model support). LM Studio deploys Python using python-build-standalone with stacked virtual environments for portable, cross-machine compatibility. In benchmarks, Llama 3.2 1B running on an M3 Max chip achieved approximately 250 tokens per second using the MLX backend.
In 2025, the team migrated to a unified MLX engine architecture. Under the new design, mlx-lm always provides the text model implementation, and mlx-vlm contributes vision components as modular add-ons that emit image embeddings consumable by the text model. The unified architecture significantly improved throughput for multimodal workloads and simplified the integration of new vision-language models on Apple Silicon. The complete backend matrix for inference is summarized below:
| Backend | Format | Supported Hardware | Notes |
|---|---|---|---|
| llama.cpp (CPU) | GGUF | x86_64 with AVX2, ARM64 | Used as a fallback when no GPU acceleration is available |
| llama.cpp (CUDA) | GGUF | NVIDIA RTX 20-series and newer; full support for RTX 50-series via CUDA 12.8 | CUDA Graphs reduce CPU overhead by up to 35 percent; Flash Attention CUDA kernels add up to 15 percent throughput |
| llama.cpp (ROCm) | GGUF | AMD RDNA2 and newer (Linux); Radeon RX 9000 series | Open compute stack for AMD discrete GPUs |
| llama.cpp (Vulkan) | GGUF | NVIDIA, AMD, Intel Arc on Windows and Linux | Cross-vendor GPU backend, used as the default AMD path on Windows |
| llama.cpp (Metal) | GGUF | Apple Silicon (M1, M2, M3, M4) | Default GGUF backend on macOS |
| MLX (text) | MLX | Apple Silicon (M1, M2, M3, M4) | Uses mlx-lm with Outlines for structured generation |
| MLX (vision) | MLX | Apple Silicon (M1, M2, M3, M4) | Uses mlx-vlm modules as add-ons in the unified MLX engine |
LM Studio includes a built-in model browser that connects directly to Hugging Face, the largest repository of open-weight models. Users can search for models by name, keyword, or model identifier without leaving the application. The browser displays available model variants with different quantization levels (such as Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, and Q8_0), allowing users to choose a balance between model quality and memory requirements.
The quantization level indicates how many bits are used to represent each model weight. More aggressive quantization (such as Q4) reduces memory usage and increases inference speed at the cost of some quality loss. Less aggressive quantization (such as Q8) preserves more of the original model's accuracy but requires more memory. Q5_K_M is generally recommended as a good balance between performance and quality.
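The relationship between bit width and memory can be sketched with simple arithmetic. The function below is a back-of-the-envelope estimate, not an exact figure: real GGUF files mix precisions across layers (the K-quant schemes) and also store metadata, and the effective bits-per-weight values used here are rough averages.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight size in gigabytes: parameters x bits, converted to bytes.

    Real GGUF files differ because K-quants apply different precision to
    different layers and the file also stores metadata; treat this as a
    back-of-the-envelope figure only.
    """
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at several quantization levels
# (bits-per-weight values are approximate averages):
for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{estimate_weight_gb(8e9, bits):.1f} GB")
```

This is why an 8B model that needs roughly 16 GB at full F16 precision can fit comfortably on an 8 GB GPU at Q4_K_M.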
The lmstudio-community organization on Hugging Face provides pre-quantized GGUF models maintained by community members, making it straightforward to find optimized versions of popular models. LM Studio also maintains a curated model catalog on its website, highlighting popular and trending models across families like Llama, Qwen, Mistral, Gemma, Phi, and DeepSeek.
The chat interface provides a conversational experience similar to ChatGPT or Claude. Users can create multiple conversations organized in folders, adjust generation parameters (temperature, top-p, max tokens, repeat penalty), manage conversation history, and customize model behavior through system prompts.
Version 0.4.0 introduced split-view chat, which allows users to view two conversations side by side by dragging and dropping chat tabs. This feature is useful for comparing outputs from different models or different prompt configurations. Configuration presets let users save and quickly switch between different parameter settings for various use cases.
Chat conversations can be exported as PDF (including images), Markdown, or plain text files. Version 0.4.12 improved PDF export styling so that long conversations and embedded images render with better typography and pagination.
LM Studio bundles a system prompt together with sampling parameters, context length, prompt template, and other advanced settings into a reusable artifact known as a Preset. Presets are stored as .preset.json files inside ~/.lmstudio/hub on macOS and Linux or %USERPROFILE%\.lmstudio\hub on Windows. Users can save the current configuration through the "Save Preset" button in the right-hand panel, then switch between presets in seconds when moving between tasks such as creative writing, coding, or summarization.
Starting in version 0.3.15, presets can be imported from a local file or a URL and published to the LM Studio Hub for others to download. Per-model defaults let users set a preset that loads automatically whenever a particular model is selected, which removes the need to retune parameters every time the model changes.
Introduced in version 0.3.0 (August 2024), LM Studio's "Chat with Documents" feature enables retrieval-augmented generation (RAG) directly within the desktop application. Users can attach files in formats including PDF, DOCX, CSV, and plain text (TXT) to a chat session. If the document is short enough to fit within the model's context window, LM Studio inserts the full file contents into the conversation. For longer documents, the application automatically switches to RAG mode, chunking the document and retrieving relevant sections based on the user's query.
The feature supports up to 5 files with a maximum combined size of 30 megabytes. Citations are provided at the end of responses, showing which portions of the uploaded documents were used to generate the answer. All document processing happens locally on the user's machine, ensuring that sensitive files never leave the device.
One of LM Studio's most important features for developers is its built-in local API server. When started, the server exposes endpoints on localhost (port 1234 by default) that are compatible with OpenAI's API specification. This means any application, script, or library designed to work with the OpenAI API can be redirected to use a local model by simply changing the base URL.
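A minimal sketch of that redirection, using only the Python standard library: the helper names (`build_chat_request`, `chat`) are illustrative, and the model identifier should match whatever model is loaded locally. The request itself requires the LM Studio server to be running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble a standard OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the assistant's reply.

    Requires LM Studio's server to be running with a model loaded.
    """
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

An application using the official OpenAI client library achieves the same thing by setting its base URL to `http://localhost:1234/v1`; no other code changes are needed.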
The server supports the following endpoints:
| Endpoint | Path | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Standard chat completion requests with streaming support |
| Embeddings | /v1/embeddings | Text embedding generation for semantic search and RAG |
| Models | /v1/models | Lists currently loaded models |
| Responses | /v1/responses | OpenAI Responses API for stateful conversations |
| Anthropic Messages | /v1/messages | Anthropic-compatible endpoint (v0.4.1+) |
| Native REST API | /api/v1/* | LM Studio's own stateful REST API with full feature access |
Starting with version 0.4.0, the server supports parallel request processing through continuous batching. Instead of queuing requests one by one, the model can serve a configurable number of requests concurrently (default: four parallel slots). This is powered by a unified KV cache that dynamically allocates memory across concurrent requests of varying sizes.
The server also supports Just-In-Time model loading, which can automatically load a model when a request arrives and optionally unload it after a period of inactivity. This is useful for server deployments where multiple models need to be available but only one is active at a time.
Version 0.4.1 introduced an Anthropic-compatible /v1/messages endpoint that mirrors the official Claude API. Tools that already speak the Anthropic Messages protocol can talk to LM Studio with no code changes by setting the base URL to http://localhost:1234 and the API key to any non-empty string such as lmstudio. The endpoint supports streaming via Server-Sent Events with message_start, content_block_delta, and message_stop frames, plus tool use, system messages, and temperature controls. Subsequent releases through 0.4.9 expanded the parameter coverage so that frameworks such as Claude Code, OpenCode, and the official Anthropic Python SDK can drive a local model in place of a hosted Claude deployment.
LM Studio exposes the chain-of-thought traces produced by reasoning models such as DeepSeek R1, gpt-oss, and Qwen reasoning variants. By default the server returns reasoning tokens in a separate reasoning_content field, keeping them out of the main content stream so that downstream clients can choose whether to display the model's internal thoughts. Version 0.4.8 added native reasoning_effort and reasoning_tokens parameters to the OpenAI-compatible /v1/chat/completions endpoint, allowing callers to set the depth of deliberation (low, medium, or high) and to limit the budget of reasoning tokens. The /api/v1/models response includes a reasoning field that indicates which reasoning controls each loaded model accepts.
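Because reasoning tokens arrive in their own field, a client can split thoughts from the visible answer with a few lines. The helper name and the sample message below are illustrative; the field layout follows the `reasoning_content` convention described above.

```python
def split_reasoning(choice_message: dict) -> tuple[str, str]:
    """Separate a reasoning model's hidden deliberation from its visible answer.

    LM Studio returns reasoning tokens in a separate `reasoning_content`
    field, so clients can choose whether to display them.
    """
    reasoning = choice_message.get("reasoning_content", "")
    answer = choice_message.get("content", "")
    return reasoning, answer

# A message shaped like LM Studio's OpenAI-compatible output (illustrative):
message = {
    "role": "assistant",
    "reasoning_content": "The user asked for 2+2; addition gives 4.",
    "content": "The answer is 4.",
}
thoughts, answer = split_reasoning(message)
```

A client that wants to cap deliberation would additionally send `"reasoning_effort": "medium"` or a `reasoning_tokens` budget in the request body, per the 0.4.8 parameters described above.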
Version 0.4.0 introduced llmster, a headless daemon that packages the core inference engine without the GUI. llmster can run on servers, cloud instances, or any machine that lacks a graphical display. It supports Linux, macOS, and Windows, making LM Studio suitable for production server deployments in addition to desktop use. The llmster daemon supports the same API endpoints and model management features as the desktop application.
LM Link, introduced in version 0.4.6 on February 27, 2026, allows a user's LM Studio instances to talk to one another over an end-to-end encrypted mesh network. The feature is built on top of custom Tailscale mesh VPNs, with every request wrapped in WireGuard encryption. Devices are never exposed to the public internet and there are no API keys to rotate; identity is verified through the user's LM Studio and Tailscale credentials.
To enable LM Link on a remote GPU machine, the user runs lms link enable on the host. Once linked, the remote model appears in the local LM Studio UI alongside on-device models, and the local server at localhost:1234 transparently forwards requests through the encrypted tunnel. This makes it possible to point a developer tool such as Claude Code or a custom SDK at a local port while inference runs on a high-VRAM workstation elsewhere on the user's network. LM Link is free during the preview period for up to two users with five devices each (10 devices total), and Element Labs has stated that paid tiers for additional devices and users will arrive after the feature exits preview.
LM Studio provides official SDKs and command-line tools for developers:
| Tool | Package | Language | License |
|---|---|---|---|
| TypeScript SDK | @lmstudio/sdk (npm) | TypeScript | MIT |
| Python SDK | lmstudio (pip) | Python | MIT |
| CLI | lms | Command-line | MIT |
The SDKs allow programmatic model loading, inference, tool calling, and structured output enforcement. The Python SDK supports Pydantic for schema validation, while the TypeScript SDK supports Zod. Both SDKs can enforce JSON schema output from models.
The lms CLI tool provides commands for model management, server control, and interactive chat sessions from the terminal. The most frequently used commands are summarized below:
| Command | Purpose |
|---|---|
| lms bootstrap | Adds the lms binary to the user's shell after the first launch of the GUI |
| lms ls | Lists all locally downloaded models |
| lms get <model> | Searches Hugging Face and downloads a model, with optional @quant suffix such as @q4_k_m |
| lms load <model> | Loads a model into memory; flags such as --gpu max push every layer to the GPU |
| lms server start | Starts the local OpenAI-compatible API server |
| lms chat | Opens an interactive terminal chat with the loaded model |
| lms link enable | Activates LM Link on the current host so remote devices can use its models |
| lms push | Publishes a preset, plugin, or other artifact to the LM Studio Hub |
| lms dev | Runs a hot-reloading development server for plugin authors |
The lms chat command, added in 0.4.0, supports slash commands and pasting multi-line content blocks. Models can be downloaded from the CLI using commands like lms get lmstudio-community/llama-3.2-1b-instruct-gguf@q4_k_m.
The plugin system, introduced as part of the developer-mode workflows in the 0.3.x and 0.4.x lines, lets developers extend LM Studio with TypeScript modules that run inside the application. Plugins ship as packages that integrate at four extension points along the prediction pipeline:
| Plugin Type | Role | Typical Use |
|---|---|---|
| Tools Provider | Returns an array of Tool objects that the model can invoke during generation | Calculators, web search, file system access, custom REST integrations |
| Prompt Preprocessor | Intercepts the user message after the Send button is pressed and rewrites it before the model receives it | Document retrieval, prompt expansion, safety filtering, persona injection |
| Generator | Replaces the local LLM with an alternative token source | Routing requests to remote APIs while keeping LM Studio's UI and tool stack |
| Custom Configuration | Adds per-chat or global settings exposed in LM Studio's UI | User-tunable plugin parameters with type-safe schemas |
Plugins run on Node.js v22.21.1 in a sandboxed environment. Multiple prompt preprocessors form a pipeline in which each preprocessor receives the output of the previous one, and a single plugin package can register tools, preprocessors, and generators together to build sophisticated integrations. The lms dev command provides hot reloading during development, and lms push ships finished plugins to the LM Studio Hub.
The LM Studio Hub is the central marketplace and registry for shareable LM Studio artifacts. Users can browse, install, and publish presets, plugins, and curated model collections through both the desktop GUI and the lms CLI. Presets and plugins downloaded from the Hub are stored under ~/.lmstudio/hub, mirroring the layout of a user's local artifacts so that personal and shared content live side by side. Organizations can create Hub orgs that act as a private namespace for sharing artifacts with teammates while keeping content out of the public catalog. The --private flag on lms push keeps a publication unlisted for accounts that support private artifacts.
LM Studio 0.3.17 introduced support for the Model Context Protocol (MCP), a standard originally introduced by Anthropic for connecting LLMs to external tools and data sources. Users can configure MCP servers through an mcp.json file or through the application's Tools & Integrations menu. Each MCP server runs in a separate, isolated process for stability and security.
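A minimal mcp.json might look like the following. This sketch assumes the widely used MCP configuration convention (a top-level "mcpServers" map of server names to launch commands); the server name and directory path are illustrative, and @modelcontextprotocol/server-filesystem is one of the reference MCP servers.

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/me/docs"]
    }
  }
}
```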
When a model invokes a tool through MCP, LM Studio displays a confirmation dialog that lets the user review and optionally edit the tool call arguments before execution. MCP is also accessible through the REST API, enabling programmatic tool use in server deployments. Common use cases include connecting models to web search APIs, file system access, database queries, and the Hugging Face API.
Version 0.4.10 added OAuth support for MCP servers, which removes the need to copy bearer tokens or configure authorization headers by hand. Users add an integration, log in through a browser-based OAuth flow, and the resulting credentials are stored securely so the MCP server can present them on the user's behalf. A subsequent fix in 0.4.12 resolved an issue that prevented the OAuth flow from completing inside certain Windows environments.
Speculative decoding is an inference optimization technique that can speed up text generation by 20 to 50 percent without reducing output quality. It works by pairing a large main model with a smaller draft model. The draft model quickly proposes candidate tokens, which the main model then verifies in parallel. Since verification is faster than generation from scratch, the overall throughput increases.
LM Studio added speculative decoding support in version 0.3.10 (February 2025). The system attempts to offload the draft model entirely to the GPU when one is available. Speculative decoding is accessible through both the chat interface and the OpenAI-compatible API.
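The expected gain can be sketched with a simplified model: assume the draft proposes k tokens per pass and each is accepted independently with probability a. Generation stops at the first rejection, and the main model always contributes one token of its own (a correction, or a bonus token if everything was accepted). These assumptions ignore the draft model's own compute cost, so real-world speedups, such as the 20 to 50 percent figure above, are smaller than the raw token multiplier.

```python
def expected_tokens_per_pass(k: int, a: float) -> float:
    """Expected tokens produced per main-model verification pass when the
    draft proposes k tokens, each accepted independently with probability a.

    This is the geometric series 1 + a + a^2 + ... + a^k: each term is the
    probability that generation survives to that position, and the leading
    1 is the main model's own guaranteed token.
    """
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

# With 4 draft tokens and an 80% acceptance rate, each expensive
# main-model pass yields about 3.36 tokens instead of 1.
speedup = expected_tokens_per_pass(4, 0.8)
```

The model also shows why a well-matched draft model matters: if the acceptance rate drops toward zero, the expression collapses to 1 token per pass and the draft work is wasted.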
LM Studio supports structured output through JSON schema enforcement. When a JSON schema is provided with a request to the /v1/chat/completions endpoint, the model is constrained to produce valid JSON conforming to the specified schema. This follows the same format as OpenAI's Structured Output API, ensuring compatibility with existing OpenAI client libraries.
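A request in that format can be assembled as follows. The helper name, schema, and model identifier are illustrative; the `response_format` layout follows OpenAI's Structured Outputs convention (`type: "json_schema"` wrapping a named, strict schema).

```python
def build_structured_request(model: str, prompt: str, schema: dict, name: str) -> dict:
    """OpenAI-style structured-output request: the json_schema response
    format constrains generation to valid instances of `schema`."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "strict": True, "schema": schema},
        },
    }

# A schema describing the JSON object the model must emit:
book_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

payload = build_structured_request(
    "local-model", "Name a classic sci-fi novel.", book_schema, "book"
)
```

Posting this payload to /v1/chat/completions guarantees the reply parses as JSON matching book_schema, which removes the need for retry-and-validate loops in client code.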
The SDKs surface APIs to enforce the model's output format using Pydantic (for Python) or Zod (for TypeScript). On Apple Silicon, the MLX backend uses the Outlines library to enforce structured generation constraints.
Version 0.3.14 (April 2025) introduced multi-GPU controls, allowing users to distribute model layers across multiple GPUs. This is particularly useful for running larger models that exceed the VRAM capacity of a single GPU. Users can enable or disable specific GPUs, set priority order for memory allocation, and limit dedicated GPU memory usage to prevent model weights from spilling into slower shared memory.
LM Studio supports multimodal vision models that can process both text and images. On Apple Silicon, the mlx-vlm library enables running vision models such as LLaVA. The llama.cpp backend also supports vision-capable models on all platforms. Users can attach images to chat messages for visual question answering, image description, and other multimodal tasks. Supported vision models include Gemma 3 (multimodal), Qwen-VL, GLM-4V, and Llama 3.2 Vision.
LM Studio supports two primary model formats:
| Format | Backend | Hardware | Description |
|---|---|---|---|
| GGUF | llama.cpp | All platforms (CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan) | The standard format for quantized LLM inference; supports a wide range of quantization levels from Q2 through Q8, F16, and F32 |
| MLX | Apple MLX | Apple Silicon only (M1, M2, M3, M4) | Apple's optimized format for on-device inference; hosted by the mlx-community organization on Hugging Face |
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It stores quantized model weights in a single file and supports quantization levels from 2-bit to 8-bit, with various K-quant variants (such as Q4_K_M and Q5_K_M) that apply different precision to different model layers. Dropping from FP16 to Q8 is mathematically lossy but empirically invisible to most users. Dropping to Q4 introduces minor stylistic changes but retains logical reasoning capabilities.
MLX models use Apple's MLX framework and are typically available as separate uploads on Hugging Face under the mlx-community organization. MLX models cannot be used with the llama.cpp engine, and GGUF models cannot be used with the MLX engine. The two formats serve different hardware targets and cannot be interchanged.
LM Studio's system requirements differ by platform. On macOS, the requirements are:

| Requirement | Specification |
|---|---|
| Chip | Apple Silicon only (M1, M2, M3, M4 series) |
| Operating System | macOS 14.0 (Sonoma) or newer |
| RAM | 16 GB recommended; 8 GB possible with smaller models |
| Note | Intel-based Macs are not supported |
On Windows, the requirements are:

| Requirement | Specification |
|---|---|
| Architecture | x64 (with AVX2 instruction set) and ARM (Snapdragon X Elite) |
| RAM | 16 GB minimum recommended |
| GPU | At least 4 GB dedicated VRAM recommended for GPU acceleration |
| GPU Vendors | NVIDIA (CUDA), AMD (Vulkan/ROCm), Intel (Vulkan) |
On Linux, the requirements are:

| Requirement | Specification |
|---|---|
| Architecture | x64 and ARM64 (aarch64) |
| Distribution | Ubuntu 20.04 or newer |
| Format | Distributed as AppImage |
| CPU | x64 builds include AVX2 support by default |
| GPU | NVIDIA (CUDA), AMD (ROCm) |
For GPU acceleration on all platforms, at least 4 GB of dedicated VRAM is recommended. Larger models with higher quantization levels require proportionally more VRAM. Models that exceed available VRAM can still run using a combination of GPU and CPU offloading, though at reduced speed. On AMD systems with Variable Graphics Memory (VGM), users can allocate system RAM as dedicated VRAM.
| GPU Type | Backend | Platform | Notes |
|---|---|---|---|
| NVIDIA (CUDA) | CUDA 12.8 | Windows, Linux | Full support including multi-GPU, Flash Attention, and RTX 50-series |
| Apple Silicon (Metal) | Metal / MLX | macOS | Uses unified memory; supports both llama.cpp (Metal) and MLX backends |
| AMD | ROCm / Vulkan | Windows, Linux | ROCm support on Linux including Radeon RX 9000 series; Vulkan on Windows |
| Intel | Vulkan | Windows, Linux | Vulkan backend for Intel Arc and integrated GPUs |
LM Studio can run any model available in GGUF format on Hugging Face, as well as MLX-format models for Apple Silicon. This includes thousands of models across many families. Some notable model families supported include:
| Model Family | Developer | Available Sizes | Notes |
|---|---|---|---|
| Llama 3 / 3.1 / 3.2 | Meta | 1B, 3B, 8B, 70B, 405B | Widely used general-purpose models |
| Qwen 2.5 / 3 / 3.5 / 3.6 | Alibaba | 0.5B to 235B | Strong multilingual and coding support; Qwen 3.6 added in version 0.4.12 |
| Mistral / Mistral Small / Nemo | Mistral AI | 7B, 12B, 24B | Efficient European-developed models |
| Gemma 2 / 3 / 3n / 4 | Google | 270M to 27B | Lightweight and multimodal variants; updated chat template for Gemma 4 in 0.4.11 |
| Phi 3 / 4 | Microsoft | 3B, 14B | Compact models with strong reasoning |
| DeepSeek R1 | DeepSeek | 7B to 70B (distilled) | Reasoning-focused models with chain-of-thought capabilities |
| gpt-oss | OpenAI | 20B, 120B | OpenAI's open-weight models, released August 2025 under Apache 2.0 |
| Codestral / Devstral | Mistral AI | 22B, 24B | Code-specialized models |
| GLM-4 | Zhipu AI | 9B, 30B | Multimodal and general-purpose |
| NVIDIA Nemotron | NVIDIA | 30B, 120B | Enterprise-grade reasoning models |
| Granite 4.0 | IBM | 3B, 7B, 32B | Enterprise and research models |
| Falcon | TII | 7B, 40B, 180B | Open-weight models from the Technology Innovation Institute |
LM Studio partnered with OpenAI to provide day-one support for gpt-oss models when they launched in August 2025. The gpt-oss-20b model can run on devices with as little as 16 GB of memory, while the gpt-oss-120b model requires approximately 80 GB. These models feature configurable reasoning effort (low, medium, or high) and native agentic capabilities including function calling, web browsing, Python execution, and structured outputs. The gpt-oss-20b release uses a mixture-of-experts design with 21 billion total parameters and 3.6 billion active parameters per token, packaged with native MXFP4 quantization for the MoE layer so that the full model fits inside 16 GB of memory.
Embedding models are also supported for use with RAG workflows. Available embedding models include nomic-embed-text-v1.5 and EmbeddingGemma.
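Once vectors come back from the /v1/embeddings endpoint, ranking document chunks for RAG reduces to cosine similarity. A stdlib-only sketch (function names illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means the
    same direction (semantically similar), 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k document chunks closest to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

This is essentially what a "Chat with Documents" retrieval step does: embed the query, score every chunk, and feed the top matches into the model's context.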
LM Studio is proprietary freeware. The desktop application is free for both personal and commercial use, with no subscription fees, usage limits, or API call charges. In July 2025, Element Labs removed the previous requirement for a separate commercial license, stating that the old model created "high friction" that caused teams to "self-select out of using LM Studio altogether" rather than navigate procurement processes.
While the main desktop application is closed-source, several components of the LM Studio ecosystem are open source under the MIT license. These include the TypeScript SDK (lmstudio-js), the Python SDK (lmstudio-python), and the CLI tool (lms).
For organizations that need advanced features, LM Studio offers additional plans:
| Plan | Price | Features |
|---|---|---|
| Personal / Work | Free | Full desktop application, local server, SDKs, CLI, all inference features |
| Teams | Free (self-serve) | Private sharing of artifacts within teams, Hub organization |
| Enterprise | Contact sales | Single Sign-On (SSO), model and MCP gating, private collaboration, priority support |
The Enterprise plan targets Fortune 500 companies, universities, and large organizations that require centralized administration and compliance features.
Privacy is one of LM Studio's primary advantages. Because all model inference happens locally on the user's hardware, no data is sent to external servers during normal operation. Conversations, documents, and queries remain entirely on the local machine. This makes LM Studio suitable for working with sensitive or confidential information where data residency requirements would prevent the use of cloud-based AI services.
Key privacy characteristics include fully local inference, conversations and documents that are stored only on the local file system, and an API server that binds to localhost by default.
For organizations that need to share access to local models across a network, LM Studio introduced LM Link in version 0.4.6 (February 2026). LM Link allows users to connect to remote LM Studio instances and use their models as if they were local. The feature includes end-to-end encryption through a partnership with Tailscale.
LM Studio is one of several tools available for running LLMs locally. Commonly compared alternatives include Ollama, Jan, and GPT4All; the table below compares LM Studio with Ollama and GPT4All.
| Feature | LM Studio | Ollama | GPT4All |
|---|---|---|---|
| Primary Interface | GUI desktop application | CLI and REST API | GUI desktop application |
| Primary Audience | Power users, developers, experimenters | Developers, DevOps engineers | Beginners, privacy-focused users |
| Model Source | Hugging Face (in-app browser) | Ollama model library, Hugging Face | Curated model list, Hugging Face |
| Model Formats | GGUF, MLX | GGUF (via Modelfile), Safetensors | GGUF |
| OpenAI-Compatible API | Yes (port 1234) | Yes (port 11434) | Yes |
| Anthropic-Compatible API | Yes (/v1/messages, v0.4.1+) | Via community proxies | No |
| Apple Silicon Optimization | MLX engine + Metal | Metal via llama.cpp | Metal via llama.cpp |
| NVIDIA GPU Support | CUDA 12.8 | CUDA | CUDA |
| AMD GPU Support | ROCm, Vulkan | ROCm | Vulkan (limited) |
| Multi-GPU | Yes (v0.3.14+) | Yes | No |
| Document Chat (RAG) | Built-in (PDF, DOCX, TXT, CSV) | No (requires external tools) | Yes (LocalDocs) |
| MCP Support | Yes (v0.3.17+, OAuth in 0.4.10+) | No | No |
| Speculative Decoding | Yes (v0.3.10+) | Yes | No |
| Docker / Headless | Yes (llmster daemon) | Yes (official Docker image) | No |
| Developer SDKs | TypeScript, Python | REST API, community libraries | Python bindings |
| Plugin System | Yes (TypeScript plugins) | No | No |
| Vision Models | Yes | Yes | Limited |
| Structured Output | Yes (JSON schema) | Yes (JSON mode) | No |
| Parallel Requests | Yes (continuous batching) | Yes | No |
| Remote Access | LM Link via Tailscale (v0.4.5+) | Manual networking | No |
| License | Proprietary freeware | MIT (open source) | MIT (open source) |
| Platforms | macOS, Windows, Linux | macOS, Windows, Linux | macOS, Windows, Linux |
| Pricing | Free | Free | Free |
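As the table notes, LM Studio's local server speaks the OpenAI chat-completions dialect on port 1234. The following is a minimal sketch of calling it from Python using only the standard library; the `model` value is a placeholder (LM Studio serves whichever model is loaded), and the server must already be running:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address

def build_chat_request(prompt: str, model: str = "local-model") -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": model,  # placeholder; LM Studio routes to the loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires LM Studio's server running with a model loaded):
#   print(chat("Name three uses of a local LLM."))
```

Because the endpoint mirrors OpenAI's, the official OpenAI client libraries also work against LM Studio by overriding the base URL and supplying any placeholder API key.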
Ollama is a command-line tool and inference server that prioritizes simplicity and minimal resource usage. It is popular among developers who embed LLM capabilities into larger applications or containerized environments. Ollama uses its own model packaging format built on top of GGUF files and maintains a curated model library at ollama.com. It has over 134,000 stars on GitHub and is licensed under the MIT license, making it fully open source.
LM Studio provides a richer graphical experience with more configuration options, making it better suited for users who want fine-grained control over model parameters, quantization choices, and GPU allocation. LM Studio's MLX support gives it a distinct advantage on Apple Silicon hardware, and its built-in model browser provides a more streamlined model discovery experience. Ollama's Docker integration and lightweight footprint (tens of megabytes versus several hundred for LM Studio) make it more practical for containerized server environments and CI/CD pipelines. Because both tools share llama.cpp as the underlying inference engine, raw token throughput typically falls within a few percentage points; the practical choice tends to come down to interface preference, MLX support, and whether the user wants a built-in plugin and Hub ecosystem.
Jan is an open-source desktop application for running local LLMs that markets itself as a privacy-first, offline-first alternative to ChatGPT. Jan's source code lives on GitHub under the AGPL license and the project leans heavily on community contributions. Like LM Studio, Jan ships an OpenAI-compatible local server, supports GGUF models through llama.cpp, and lets users browse and download models from inside the application.
LM Studio differentiates itself through more polished tooling for power users: the MLX engine and unified KV cache deliver higher throughput on Apple Silicon and under concurrent requests, the Hub provides a centralized registry for presets and plugins, and the LM Link feature exposes remote models behind an encrypted Tailscale tunnel. Jan tends to win on transparency thanks to its open-source codebase and a smaller installation footprint, while LM Studio leads on feature breadth, hardware optimization, and ecosystem integrations.
GPT4All is developed by Nomic AI and focuses on accessibility for non-technical users. Its standout feature is LocalDocs, a built-in retrieval-augmented generation system that lets users upload local documents and query them through the chat interface without any additional setup. LocalDocs supports various file formats and uses Nomic's embedding models to bring information from local files into conversations. GPT4All is fully open source under the MIT license and does not require a GPU to run.
LM Studio offers more advanced features for developers and power users, including MCP support, speculative decoding, multi-GPU controls, split-view chat, and comprehensive developer SDKs. GPT4All is better suited for users who primarily want to chat with a local model and query their own documents without dealing with API configurations or advanced inference settings.
LM Studio is built on the Electron framework, providing a consistent cross-platform experience across macOS, Windows, and Linux. The interface follows a modern, clean design with both light and dark themes.
The main application window is organized into several sections:
- Chat — the primary conversation interface
- Developer — the local server, logs, and API configuration
- My Models — management of downloaded models
- Discover — the in-app model browser for finding and downloading models
The application supports keyboard shortcuts for common operations and includes a "Mission Control" overlay (accessible via hotkey) for quick model switching and status monitoring. Version 0.4.0 introduced a refreshed UI with the split-view chat interface, enabling side-by-side model comparison. The redesign also improved the model loading workflow and added better visual feedback for GPU memory usage and token generation speed.
Flash Attention is an optimization that significantly reduces memory usage and increases inference speed, especially for models with long context windows. LM Studio enabled Flash Attention by default for CUDA in version 0.3.31 and for Vulkan and Metal in version 0.3.32.
In May 2025, LM Studio shipped a major performance update for NVIDIA users by upgrading to the CUDA 12.8 runtime. The update added CUDA Graphs, which group multiple GPU operations into a single CPU launch and reduce host-side overhead by up to 35 percent. Flash Attention CUDA kernels add a further throughput boost of up to 15 percent. The CUDA 12.8 runtime also enables full compatibility with the entire NVIDIA RTX 50-series Blackwell lineup. With a compatible driver, LM Studio automatically upgrades to the new runtime, which shortens model load times and increases sustained generation throughput on consumer NVIDIA hardware. Public benchmarks have measured the RTX 4090 at roughly 150 tokens per second with these optimizations on small to mid-size models.
Version 0.4.0 introduced continuous batching for the llama.cpp engine, allowing the LM Studio server to dynamically combine multiple incoming requests into a single batch. Users can configure the maximum number of concurrent predictions (default: 4) when loading a model. The unified KV cache feature allows flexible memory allocation across concurrent requests of varying sizes. This feature is particularly valuable for developer workflows where multiple applications or services send requests to the same local model simultaneously. MLX support for continuous batching is in development.
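Continuous batching matters most when several requests arrive at once. The sketch below fires concurrent requests at the local server from Python; the endpoint and port are LM Studio defaults, the thread count mirrors the default of 4 concurrent predictions, and the server interleaves overlapping requests rather than queueing each to completion:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:1234/v1/chat/completions"  # LM Studio default

def make_payload(prompt: str) -> dict:
    """One chat-completions request body per prompt."""
    return {"messages": [{"role": "user", "content": prompt}]}

def send(prompt: str) -> str:
    """Send a single request; the server batches overlapping requests."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(make_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_concurrently(worker, prompts, max_workers: int = 4):
    """Issue requests in parallel from the client side; continuous
    batching on the server combines them into shared decode steps."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, prompts))

# Example (requires a loaded model):
#   replies = run_concurrently(send, ["Summarize A", "Translate B"])
```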
KV (key-value) caching across prompts improves response times in multi-turn conversations by reusing computations from earlier messages in the chat history. According to LM Studio's documentation, this optimization can reduce processing time from approximately 10 seconds to 0.11 seconds in tested scenarios.
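This reuse works because each turn resends the growing message history and the server recognizes the unchanged prefix; nothing cache-specific is required on the client. A sketch of the client-side pattern (the class name is illustrative, not part of any LM Studio API):

```python
class Conversation:
    """Accumulates chat history so each request shares a prefix with
    the previous one, letting the server reuse its KV cache."""

    def __init__(self, system=None):
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def next_request(self, user_text):
        """Append the user turn and return the full request body."""
        self.messages.append({"role": "user", "content": user_text})
        return {"messages": list(self.messages)}

    def record_reply(self, assistant_text):
        """Store the assistant turn so the next request extends the prefix."""
        self.messages.append({"role": "assistant", "content": assistant_text})
```

Only the newly appended turns require fresh prompt processing; everything before them hits the cache, which is the source of the large improvement the documentation reports.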
Actual performance varies significantly based on hardware, model size, quantization level, and context length. The following approximate benchmarks provide a general reference:
| Hardware | Model Size | Approximate Speed |
|---|---|---|
| Apple M3 Max | 1B (MLX) | ~250 tokens/second |
| Apple M3 Max | 7B-14B | 35-70 tokens/second |
| NVIDIA RTX 4060 | 7B (Q4) | ~30-40 tokens/second |
| NVIDIA RTX 4090 | 7B (Q4) | 100+ tokens/second |
| NVIDIA RTX 4090 | Small models with CUDA Graphs | ~150 tokens/second |
| CPU-only (modern x64) | 7B (Q4) | 5-15 tokens/second |
LM Studio has developed a growing ecosystem of community contributions and third-party integrations. The official Discord server (discord.gg/lmstudio) serves as the primary community hub where users ask questions, share configurations, and report issues. The LM Studio team maintains an active presence on the server.
The LM Studio GitHub organization (github.com/lmstudio-ai) hosts the open-source SDKs, CLI tool, and documentation. The lms CLI repository has accumulated over 4,400 stars, and the TypeScript SDK over 1,500 stars.
The lmstudio-community organization on Hugging Face is a community-driven effort where members quantize and upload optimized GGUF versions of popular models. This resource makes it easier for users to find models that are ready to run in LM Studio without performing quantization themselves.
Notable integrations with third-party tools include:
- Anthropic-compatible clients, which can point at the local /v1/messages endpoint to drive a local model in place of a hosted Claude deployment.

LM Studio also maintains a presence on Twitter and LinkedIn and publishes technical blog posts at lmstudio.ai/blog covering new releases, feature guides, and integration tutorials.
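For clients that speak the Anthropic dialect, the request shape differs from the OpenAI one mainly in its top-level fields. A hedged sketch of a minimal /v1/messages payload aimed at the local server (field names follow Anthropic's public API schema; the model value is a placeholder):

```python
import json
import urllib.request

def build_messages_request(prompt, model="local-model", max_tokens=256):
    """Minimal Anthropic-style /v1/messages request body."""
    return {
        "model": model,            # placeholder; LM Studio serves the loaded model
        "max_tokens": max_tokens,  # required field in the Anthropic schema
        "messages": [{"role": "user", "content": prompt}],
    }

def send(prompt, base="http://localhost:1234"):
    """POST to the Anthropic-compatible endpoint and extract the reply."""
    req = urllib.request.Request(
        f"{base}/v1/messages",
        data=json.dumps(build_messages_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # Anthropic-style responses carry text inside content blocks
    return reply["content"][0]["text"]

# Example (requires the server running with a loaded model):
#   print(send("Hello from a local model"))
```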
LM Studio serves a variety of use cases across personal and professional contexts, including private chat with sensitive or confidential documents, local application development against the OpenAI-compatible API, side-by-side model evaluation, and offline AI assistance where cloud services are unavailable or prohibited.
Despite its strengths, LM Studio has several notable limitations: