LM Studio is a desktop application for discovering, downloading, and running large language models locally on personal hardware. Developed by Element Labs, Inc., a company founded by Yagil Burowski, LM Studio provides a graphical user interface that allows users to interact with open-weight LLMs without relying on cloud services or sending data to external servers. The application supports GGUF models through llama.cpp and MLX models on Apple Silicon, and it exposes an OpenAI-compatible API server for integration with third-party tools and applications.
LM Studio was first publicly released in May 2023 and has since become one of the most widely used tools for running local LLMs. The application is available for macOS, Windows, and Linux. As of March 2026, the latest stable release is version 0.4.7.
LM Studio was created by Yagil Burowski, who previously worked at Apple. The application is built by Element Labs, Inc., a software company based in Brooklyn, New York. Burowski founded the company in 2023 with the goal of making local AI inference accessible to a broad audience through a polished desktop experience.
The initial public release, version 0.1.x, launched in May 2023. This first version introduced the core functionality of browsing and downloading models from Hugging Face and running them locally through a chat interface. It primarily targeted Windows users, with experimental support for macOS and Linux.
Version 0.2.x arrived in early 2024, bringing improved platform support including more robust macOS compatibility. Later patches in this series, such as 0.2.22, introduced Flash Attention integration for faster inference.
The 0.3.x series, which began in mid-2024, represented a major expansion of capabilities. Version 0.3.0 (August 2024) added built-in RAG, a light theme, internationalization, and the Structured Outputs API. Version 0.3.4 (October 2024) added Apple MLX support for efficient inference on Apple Silicon. Version 0.3.5 introduced headless mode and on-demand model loading. Version 0.3.10 (February 2025) brought speculative decoding. Version 0.3.13 (March 2025) added support for Google Gemma 3 and multimodal image input. Version 0.3.14 (April 2025) introduced multi-GPU controls. Version 0.3.17 (June 2025) added Model Context Protocol (MCP) support. Version 0.3.21 (August 2025) shipped day-one support for OpenAI's gpt-oss models. Flash Attention was enabled by default for CUDA in version 0.3.31, and for the Vulkan and Metal backends in version 0.3.32.
Version 0.4.0 was released on January 28, 2026. This release introduced llmster (a headless daemon for server deployments), parallel request processing with continuous batching, a new stateful REST API, and a refreshed user interface with split-view chat and developer mode. Subsequent updates through 0.4.7 added LM Link for remote instance connectivity, end-to-end encryption via Tailscale, and improved tool calling for newer model families.
The following table summarizes major releases in LM Studio's development history:
| Version | Date | Key Features |
|---|---|---|
| 0.1.x | May 2023 | Initial public release; core model downloading and chat functionality; primarily Windows support with experimental macOS and Linux builds |
| 0.2.x | Early 2024 | Enhanced platform support, improved macOS compatibility, Flash Attention integration (0.2.22) |
| 0.3.0 | August 2024 | Built-in RAG, light theme, internationalization, Structured Outputs API |
| 0.3.4 | October 2024 | Apple MLX engine for Apple Silicon Macs, vision model support via MLX |
| 0.3.5 | October 2024 | Headless mode, on-demand model loading, server auto-start, CLI model downloads |
| 0.3.10 | February 2025 | Speculative Decoding for llama.cpp and MLX engines |
| 0.3.13 | March 2025 | Google Gemma 3 support, multimodal image input |
| 0.3.14 | April 2025 | Multi-GPU controls for NVIDIA GPUs |
| 0.3.17 | June 2025 | Model Context Protocol (MCP) host support |
| 0.3.21 | August 2025 | Day-one support for OpenAI's gpt-oss models |
| 0.3.29 | 2025 | OpenAI Responses API with local models |
| 0.3.31-32 | 2025 | Flash Attention enabled by default for CUDA, Vulkan, and Metal |
| 0.4.0 | January 28, 2026 | llmster headless daemon, continuous batching, new REST API, revamped UI with split view |
| 0.4.5-0.4.6 | February 2026 | LM Link for remote instance connectivity with end-to-end encryption via Tailscale |
| 0.4.7 | March 2026 | Improved tool calling for Qwen 3.5 and GLM models, enhanced Anthropic API compatibility |
LM Studio raised $19.3 million in a Series B funding round on April 25, 2025. Investors include Matrix, Preston-Werner Ventures, and Torch Capital. As of 2025, Element Labs had a team of approximately 16 people.
LM Studio is an Electron application. The frontend is built with React and TypeScript, and the build system relies on Vite. The application bundles two inference backends: llama.cpp for GGUF models and Apple's MLX framework for MLX-format models on Apple Silicon hardware.
The llama.cpp backend supports multiple compute variants, including CPU-only, CUDA (for NVIDIA GPUs), Vulkan (cross-platform GPU), ROCm (for AMD GPUs), and Metal (for Apple GPUs). This allows LM Studio to take advantage of GPU acceleration across all major hardware platforms.
The MLX engine is implemented in Python and combines three core libraries: mlx-lm (Apple's library for LLM inference on MLX), Outlines (for structured generation and JSON schema enforcement), and mlx-vlm (for vision model support). LM Studio deploys Python using python-build-standalone with stacked virtual environments for portable, cross-machine compatibility. In benchmarks, Llama 3.2 1B running on an M3 Max chip achieved approximately 250 tokens per second using the MLX backend.
LM Studio includes a built-in model browser that connects directly to Hugging Face, the largest repository of open-weight models. Users can search for models by name, keyword, or model identifier without leaving the application. The browser displays available model variants with different quantization levels (such as Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, and Q8_0), allowing users to choose a balance between model quality and memory requirements.
The quantization level indicates how many bits are used to represent each model weight. More aggressive quantization (such as Q4) reduces memory usage and increases inference speed but introduces minor quality loss. Lighter quantization (such as Q8) preserves more of the original model's accuracy but requires more memory. Q5_K_M is generally recommended as a good balance between performance and quality.
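The memory impact of a given quantization level can be estimated with simple arithmetic: parameter count times bits per weight, divided by eight, plus runtime overhead. The effective bits-per-weight figure and overhead factor below are illustrative assumptions (K-quants mix precisions, so effective size sits slightly above the nominal bit width), not LM Studio's own numbers:

```python
def approx_model_memory_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough memory estimate for a quantized model.

    K-quants use mixed precision, so effective bits-per-weight is a bit
    above the nominal level (e.g. ~4.5 for Q4_K_M); `overhead` is a loose
    allowance for the KV cache and runtime buffers. Both figures are
    illustrative, not exact.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at ~4.5 effective bits needs roughly 4-5 GB,
# while the same model at 8 bits needs roughly 8-9 GB:
print(round(approx_model_memory_gb(7, 4.5), 1))
print(round(approx_model_memory_gb(7, 8.0), 1))
```

This kind of back-of-the-envelope estimate is how users typically decide which quantization variant of a model will fit their RAM or VRAM.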
The lmstudio-community organization on Hugging Face provides pre-quantized GGUF models maintained by community members, making it straightforward to find optimized versions of popular models. LM Studio also maintains a curated model catalog on its website, highlighting popular and trending models across families like Llama, Qwen, Mistral, Gemma, Phi, and DeepSeek.
The chat interface provides a conversational experience similar to ChatGPT or Claude. Users can create multiple conversations organized in folders, adjust generation parameters (temperature, top-p, max tokens, repeat penalty), manage conversation history, and customize model behavior through system prompts.
Version 0.4.0 introduced split-view chat, which allows users to view two conversations side by side by dragging and dropping chat tabs. This feature is useful for comparing outputs from different models or different prompt configurations. Configuration presets let users save and quickly switch between different parameter settings for various use cases.
Chat conversations can be exported as PDF (including images), Markdown, or plain text files.
Introduced in version 0.3.0 (August 2024), LM Studio's "Chat with Documents" feature enables retrieval-augmented generation (RAG) directly within the desktop application. Users can attach files in formats including PDF, DOCX, TXT, CSV, and plain text to a chat session. If the document is short enough to fit within the model's context window, LM Studio inserts the full file contents into the conversation. For longer documents, the application automatically switches to RAG mode, chunking the document and retrieving relevant sections based on the user's query.
The feature supports up to 5 files with a maximum combined size of 30 megabytes. Citations are provided at the end of responses, showing which portions of the uploaded documents were used to generate the answer. All document processing happens locally on the user's machine, ensuring that sensitive files never leave the device.
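The full-insertion-versus-RAG decision described above can be sketched as a simple branch on document length. The chunk size here is an illustrative default, not LM Studio's actual setting:

```python
def prepare_document(tokens, context_window, chunk_size=256):
    """Mirror of the described behavior: insert the whole document when it
    fits the model's context window, otherwise fall back to chunking so
    relevant sections can be retrieved per query."""
    if len(tokens) <= context_window:
        return "full", [tokens]
    # Too long: split into fixed-size chunks for retrieval.
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    return "rag", chunks

mode, chunks = prepare_document(list(range(100)), context_window=4096)
print(mode, len(chunks))      # short file: inserted whole

mode, chunks = prepare_document(list(range(10_000)), context_window=4096)
print(mode, len(chunks))      # long file: chunked for retrieval
```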
One of LM Studio's most important features for developers is its built-in local API server. When started, the server exposes endpoints on localhost (port 1234 by default) that are compatible with OpenAI's API specification. This means any application, script, or library designed to work with the OpenAI API can be redirected to use a local model by simply changing the base URL.
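Because the server mirrors OpenAI's API shape, redirecting an existing client is a one-line change to the base URL. A minimal stdlib sketch (the model name is a placeholder; sending the request assumes a running LM Studio server):

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, stream=False):
    """Build an OpenAI-style chat completion request.

    Pointing base_url at LM Studio's local server
    (http://localhost:1234/v1 by default) is the only change needed
    versus the hosted OpenAI API.
    """
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:1234/v1",
    "llama-3.2-1b-instruct",
    [{"role": "user", "content": "Hello!"}],
)
# urllib.request.urlopen(req) would send this to a running server.
```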
The server supports the following endpoints:
| Endpoint | Path | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Standard chat completion requests with streaming support |
| Embeddings | /v1/embeddings | Text embedding generation for semantic search and RAG |
| Models | /v1/models | Lists currently loaded models |
| Responses | /v1/responses | OpenAI Responses API for stateful conversations |
| Anthropic Messages | /v1/messages | Anthropic-compatible endpoint (v0.4.1+) |
| Native REST API | /api/v1/* | LM Studio's own stateful REST API with full feature access |
Starting with version 0.4.0, the server supports parallel request processing through continuous batching. Instead of queuing requests one by one, the model can process up to N requests simultaneously (default: 4 parallel slots). This is powered by a unified KV cache that dynamically allocates memory across concurrent requests of varying sizes.
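The scheduling idea behind continuous batching can be illustrated with a toy scheduler: a finished request frees its slot immediately, so queued requests join mid-flight rather than waiting for the whole batch to drain. This is a conceptual sketch, not LM Studio's implementation:

```python
from collections import deque

def continuous_batching(requests, max_slots=4):
    """Toy continuous-batching scheduler over (id, tokens_to_generate)
    pairs. Returns which requests ran at each decode step."""
    queue = deque(requests)
    active = {}        # request_id -> tokens still to generate
    schedule = []
    while queue or active:
        # Fill any free slots from the queue (the "continuous" part).
        while queue and len(active) < max_slots:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        schedule.append(sorted(active))
        # One decode step for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed immediately
    return schedule

# With 2 slots, "c" takes over "b"'s slot the moment "b" finishes,
# while "a" keeps generating:
print(continuous_batching([("a", 2), ("b", 1), ("c", 3)], max_slots=2))
```

A static batcher would instead wait for both "a" and "b" to finish before admitting "c", leaving the freed slot idle.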
The server also supports Just-In-Time model loading, which can automatically load a model when a request arrives and optionally unload it after a period of inactivity. This is useful for server deployments where multiple models need to be available but only one is active at a time.
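The load-on-request, unload-on-idle lifecycle can be sketched as a small state machine. The TTL eviction policy below is an illustrative assumption, not LM Studio's exact logic:

```python
import time

class JITModelSlot:
    """Toy sketch of just-in-time model loading with idle-timeout unload."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable for testing
        self.loaded = None      # name of the resident model, if any
        self.last_used = 0.0
        self.loads = 0          # counts (expensive) weight loads

    def handle_request(self, model_name):
        now = self.clock()
        # Evict after a period of inactivity.
        if self.loaded is not None and now - self.last_used > self.ttl:
            self.loaded = None
        # Load on demand when a request arrives for a non-resident model.
        if self.loaded != model_name:
            self.loaded = model_name   # stands in for loading weights
            self.loads += 1
        self.last_used = now
        return self.loaded
```

Requests arriving within the idle window reuse the resident model; a request after the window pays the load cost again.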
Version 0.4.0 introduced llmster, a headless daemon that packages the core inference engine without the GUI. llmster can run on servers, cloud instances, or any machine that lacks a graphical display. It supports Linux, macOS, and Windows, making LM Studio suitable for production server deployments in addition to desktop use. The llmster daemon supports the same API endpoints and model management features as the desktop application.
LM Studio provides official SDKs and command-line tools for developers:
| Tool | Package | Language | License |
|---|---|---|---|
| TypeScript SDK | @lmstudio/sdk (npm) | TypeScript | MIT |
| Python SDK | lmstudio (pip) | Python | MIT |
| CLI | lms | Command-line | MIT |
The SDKs allow programmatic model loading, inference, tool calling, and structured output enforcement. The Python SDK supports Pydantic for schema validation, while the TypeScript SDK supports Zod. Both SDKs can enforce JSON schema output from models.
The lms CLI tool provides commands for model management, server control, and interactive chat sessions from the terminal. The lms chat command (added in 0.4.0) allows terminal-based conversations with slash commands and support for pasting larger content blocks. Models can also be downloaded from the CLI using commands like lms get lmstudio-community/llama-3.2-1b-instruct-gguf@q4_k_m.
LM Studio 0.3.17 introduced support for Model Context Protocol (MCP), a standard originally introduced by Anthropic for connecting LLMs to external tools and data sources. Users can configure MCP servers through an mcp.json file or through the application's Tools & Integrations menu. Each MCP server runs in a separate, isolated process for stability and security.
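A minimal mcp.json sketch, assuming the common mcpServers layout used by MCP hosts; the server name, package, and environment variable below are illustrative placeholders:

```json
{
  "mcpServers": {
    "example-search": {
      "command": "npx",
      "args": ["-y", "@example/mcp-search-server"],
      "env": { "SEARCH_API_KEY": "..." }
    }
  }
}
```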
When a model invokes a tool through MCP, LM Studio displays a confirmation dialog that lets the user review and optionally edit the tool call arguments before execution. MCP is also accessible through the REST API, enabling programmatic tool use in server deployments. Common use cases include connecting models to web search APIs, file system access, database queries, and the Hugging Face API.
Speculative decoding is an inference optimization technique that can speed up text generation by 20 to 50 percent without reducing output quality. It works by pairing a large main model with a smaller draft model. The draft model quickly proposes candidate tokens, which the main model then verifies in parallel. Since verification is faster than generation from scratch, the overall throughput increases.
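The propose-then-verify loop can be demonstrated with toy next-token functions (real engines verify all draft positions in a single batched forward pass rather than one at a time):

```python
def speculative_step(main_model, draft_model, prefix, k=4):
    """One round of (greedy) speculative decoding with toy models."""
    # 1. The cheap draft model proposes k candidate tokens.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The main model verifies the proposals position by position.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        verified = main_model(ctx)
        if verified != tok:
            # First mismatch: keep the main model's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All k drafts accepted: verification yields one bonus token.
    accepted.append(main_model(ctx))
    return accepted

# Toy models: the "main" model counts upward; the draft agrees below 5.
main = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 5 else ctx[-1] + 2

print(speculative_step(main, draft, [0]))  # all 4 drafts accepted + bonus
print(speculative_step(main, draft, [4]))  # draft diverges after 1 token
```

When the draft model agrees often, each verification pass yields several tokens for roughly the cost of one, which is where the speedup comes from; the output is identical to what the main model would have produced alone.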
LM Studio added speculative decoding support in version 0.3.10 (February 2025). The system attempts to offload the draft model entirely to the GPU when one is available. Speculative decoding is accessible through both the chat interface and the OpenAI-compatible API.
LM Studio supports structured output through JSON schema enforcement. When a JSON schema is provided with a request to the /v1/chat/completions endpoint, the model is constrained to produce valid JSON conforming to the specified schema. This follows the same format as OpenAI's Structured Output API, ensuring compatibility with existing OpenAI client libraries.
The SDKs surface APIs to enforce the model's output format using Pydantic (for Python) or Zod (for TypeScript). On Apple Silicon, the MLX backend uses the Outlines library to enforce structured generation constraints.
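In the OpenAI-compatible request shape, the schema travels in the response_format field. A minimal sketch of such a request body (the model name and schema are illustrative):

```python
import json

# A hand-written JSON schema for the shape the model must emit.
book_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

# OpenAI-style structured output request body, as accepted by
# the /v1/chat/completions endpoint.
request_body = {
    "model": "llama-3.2-1b-instruct",
    "messages": [{"role": "user", "content": "Name a famous novel."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "book", "schema": book_schema},
    },
}

print(json.dumps(request_body, indent=2))
```

With this constraint in place, the model's reply is guaranteed to parse as JSON matching book_schema rather than free-form text.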
Version 0.3.14 (April 2025) introduced multi-GPU controls, allowing users to distribute model layers across multiple GPUs. This is particularly useful for running larger models that exceed the VRAM capacity of a single GPU. Users can enable or disable specific GPUs, set priority order for memory allocation, and limit dedicated GPU memory usage to prevent model weights from spilling into slower shared memory.
LM Studio supports multimodal vision models that can process both text and images. On Apple Silicon, the mlx-vlm library enables running vision models such as LLaVA. The llama.cpp backend also supports vision-capable models on all platforms. Users can attach images to chat messages for visual question answering, image description, and other multimodal tasks. Supported vision models include Gemma 3 (multimodal), Qwen-VL, GLM-4V, and Llama 3.2 Vision.
LM Studio supports two primary model formats:
| Format | Backend | Hardware | Description |
|---|---|---|---|
| GGUF | llama.cpp | All platforms (CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan) | The standard format for quantized LLM inference; supports a wide range of quantization levels from Q2 through Q8, F16, and F32 |
| MLX | Apple MLX | Apple Silicon only (M1, M2, M3, M4) | Apple's optimized format for on-device inference; hosted by the mlx-community organization on Hugging Face |
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It stores quantized model weights in a single file and supports quantization levels from 2-bit to 8-bit, with various K-quant variants (such as Q4_K_M and Q5_K_M) that apply different precision to different model layers. Dropping from FP16 to Q8 is mathematically lossy but empirically invisible to most users. Dropping to Q4 introduces minor stylistic changes but retains logical reasoning capabilities.
MLX models use Apple's MLX framework and are typically available as separate uploads on Hugging Face under the mlx-community organization. MLX models cannot be used with the llama.cpp engine, and GGUF models cannot be used with the MLX engine. The two formats serve different hardware targets and cannot be interchanged.
System requirements on macOS:

| Requirement | Specification |
|---|---|
| Chip | Apple Silicon only (M1, M2, M3, M4 series) |
| Operating System | macOS 14.0 (Sonoma) or newer |
| RAM | 16 GB recommended; 8 GB possible with smaller models |
| Note | Intel-based Macs are not supported |
System requirements on Windows:

| Requirement | Specification |
|---|---|
| Architecture | x64 (with AVX2 instruction set) and ARM (Snapdragon X Elite) |
| RAM | 16 GB minimum recommended |
| GPU | At least 4 GB dedicated VRAM recommended for GPU acceleration |
| GPU Vendors | NVIDIA (CUDA), AMD (Vulkan/ROCm), Intel (Vulkan) |
System requirements on Linux:

| Requirement | Specification |
|---|---|
| Architecture | x64 and ARM64 (aarch64) |
| Distribution | Ubuntu 20.04 or newer |
| Format | Distributed as AppImage |
| CPU | x64 builds include AVX2 support by default |
| GPU | NVIDIA (CUDA), AMD (ROCm) |
For GPU acceleration on all platforms, at least 4 GB of dedicated VRAM is recommended. Larger models and lighter quantization levels require proportionally more VRAM. Models that exceed available VRAM can still run using a combination of GPU and CPU offloading, though at reduced speed. On AMD systems with Variable Graphics Memory (VGM), users can allocate system RAM as dedicated VRAM.
| GPU Type | Backend | Platform | Notes |
|---|---|---|---|
| NVIDIA (CUDA) | CUDA 12.8 | Windows, Linux | Full support including multi-GPU, Flash Attention, and RTX 50-series |
| Apple Silicon (Metal) | Metal / MLX | macOS | Uses unified memory; supports both llama.cpp (Metal) and MLX backends |
| AMD | ROCm / Vulkan | Windows, Linux | ROCm support on Linux including AMD 9000 series; Vulkan on Windows |
| Intel | Vulkan | Windows, Linux | Vulkan backend for Intel Arc and integrated GPUs |
LM Studio can run any model available in GGUF format on Hugging Face, as well as MLX-format models for Apple Silicon. This includes thousands of models across many families. Some notable model families supported include:
| Model Family | Developer | Available Sizes | Notes |
|---|---|---|---|
| Llama 3 / 3.1 / 3.2 | Meta | 1B, 3B, 8B, 70B, 405B | Widely used general-purpose models |
| Qwen 2.5 / 3 / 3.5 | Alibaba | 0.5B to 235B | Strong multilingual and coding support |
| Mistral / Mistral Small / Nemo | Mistral AI | 7B, 12B, 24B | Efficient European-developed models |
| Gemma 2 / 3 / 3n | Google | 270M to 27B | Lightweight and multimodal variants |
| Phi 3 / 4 | Microsoft | 3B, 14B | Compact models with strong reasoning |
| DeepSeek R1 | DeepSeek | 7B to 70B (distilled) | Reasoning-focused models with chain-of-thought capabilities |
| gpt-oss | OpenAI | 20B, 120B | OpenAI's open-weight models, released August 2025 under Apache 2.0 |
| Codestral / Devstral | Mistral AI | 22B, 24B | Code-specialized models |
| GLM-4 | Zhipu AI | 9B, 30B | Multimodal and general-purpose |
| NVIDIA Nemotron | NVIDIA | 30B, 120B | Enterprise-grade reasoning models |
| Granite 4.0 | IBM | 3B, 7B, 32B | Enterprise and research models |
| Falcon | TII | 7B, 40B, 180B | Open-weight models from the Technology Innovation Institute |
LM Studio partnered with OpenAI to provide day-one support for gpt-oss models when they launched in August 2025. The gpt-oss-20b model can run on devices with as little as 16 GB of memory, while the gpt-oss-120b model requires approximately 80 GB. These models feature configurable reasoning effort (low, medium, or high) and native agentic capabilities including function calling, web browsing, Python execution, and structured outputs.
Embedding models are also supported for use with RAG workflows. Available embedding models include nomic-embed-text-v1.5 and EmbeddingGemma.
LM Studio is proprietary freeware. The desktop application is free for both personal and commercial use, with no subscription fees, usage limits, or API call charges. In July 2025, Element Labs removed the previous requirement for a separate commercial license, stating that the old model created "high friction" that caused teams to "self-select out of using LM Studio altogether" rather than navigate procurement processes.
While the main desktop application is closed-source, several components of the LM Studio ecosystem are open source under the MIT license. These include the TypeScript SDK (lmstudio-js), the Python SDK (lmstudio-python), and the CLI tool (lms).
For organizations that need advanced features, LM Studio offers additional plans:
| Plan | Price | Features |
|---|---|---|
| Personal / Work | Free | Full desktop application, local server, SDKs, CLI, all inference features |
| Teams | Free (self-serve) | Private sharing of artifacts within teams, Hub organization |
| Enterprise | Contact sales | Single Sign-On (SSO), model and MCP gating, private collaboration, priority support |
The Enterprise plan targets Fortune 500 companies, universities, and large organizations that require centralized administration and compliance features.
Privacy is one of LM Studio's primary advantages. Because all model inference happens locally on the user's hardware, no data is sent to external servers during normal operation. Conversations, documents, and queries remain entirely on the local machine. This makes LM Studio suitable for working with sensitive or confidential information where data residency requirements would prevent the use of cloud-based AI services.
For organizations that need to share access to local models across a network, LM Studio introduced LM Link in version 0.4.5 (February 2026). LM Link allows users to connect to remote LM Studio instances and use their models as if they were local. The feature includes end-to-end encryption through a partnership with Tailscale.
LM Studio is one of several tools available for running LLMs locally. The most commonly compared alternatives are Ollama and GPT4All.
| Feature | LM Studio | Ollama | GPT4All |
|---|---|---|---|
| Primary Interface | GUI desktop application | CLI and REST API | GUI desktop application |
| Primary Audience | Power users, developers, experimenters | Developers, DevOps engineers | Beginners, privacy-focused users |
| Model Source | Hugging Face (in-app browser) | Ollama model library, Hugging Face | Curated model list, Hugging Face |
| Model Formats | GGUF, MLX | GGUF (via Modelfile), Safetensors | GGUF |
| OpenAI-Compatible API | Yes (port 1234) | Yes (port 11434) | Yes |
| Apple Silicon Optimization | MLX engine + Metal | Metal via llama.cpp | Metal via llama.cpp |
| NVIDIA GPU Support | CUDA 12.8 | CUDA | CUDA |
| AMD GPU Support | ROCm, Vulkan | ROCm | Vulkan (limited) |
| Multi-GPU | Yes (v0.3.14+) | Yes | No |
| Document Chat (RAG) | Built-in (PDF, DOCX, TXT, CSV) | No (requires external tools) | Yes (LocalDocs) |
| MCP Support | Yes (v0.3.17+) | No | No |
| Speculative Decoding | Yes (v0.3.10+) | Yes | No |
| Docker / Headless | Yes (llmster daemon) | Yes (official Docker image) | No |
| Developer SDKs | TypeScript, Python | REST API, community libraries | Python bindings |
| Vision Models | Yes | Yes | Limited |
| Structured Output | Yes (JSON schema) | Yes (JSON mode) | No |
| Parallel Requests | Yes (continuous batching) | Yes | No |
| License | Proprietary freeware | MIT (open source) | MIT (open source) |
| Platforms | macOS, Windows, Linux | macOS, Windows, Linux | macOS, Windows, Linux |
| Pricing | Free | Free | Free |
Ollama is a command-line tool and inference server that prioritizes simplicity and minimal resource usage. It is popular among developers who embed LLM capabilities into larger applications or containerized environments. Ollama uses its own model packaging format built on top of GGUF files and maintains a curated model library at ollama.com. It has over 134,000 stars on GitHub and is licensed under the MIT license, making it fully open source.
LM Studio provides a richer graphical experience with more configuration options, making it better suited for users who want fine-grained control over model parameters, quantization choices, and GPU allocation. LM Studio's MLX support gives it a distinct advantage on Apple Silicon hardware, and its built-in model browser provides a more streamlined model discovery experience. Ollama's Docker integration and lightweight footprint (tens of megabytes versus several hundred for LM Studio) make it more practical for containerized server environments and CI/CD pipelines.
GPT4All is developed by Nomic AI and focuses on accessibility for non-technical users. Its standout feature is LocalDocs, a built-in retrieval-augmented generation system that lets users upload local documents and query them through the chat interface without any additional setup. LocalDocs supports various file formats and uses Nomic's embedding models to bring information from local files into conversations. GPT4All is fully open source under the MIT license and does not require a GPU to run.
LM Studio offers more advanced features for developers and power users, including MCP support, speculative decoding, multi-GPU controls, split-view chat, and comprehensive developer SDKs. GPT4All is better suited for users who primarily want to chat with a local model and query their own documents without dealing with API configurations or advanced inference settings.
LM Studio is built on the Electron framework, providing a consistent cross-platform experience across macOS, Windows, and Linux. The interface follows a modern, clean design with both light and dark themes.
The main application window is organized into several sections, including a chat view for conversations, a Discover tab for browsing and downloading models, a Developer tab for the local API server, and a My Models view for managing downloaded files.
The application supports keyboard shortcuts for common operations and includes a "Mission Control" overlay (accessible via hotkey) for quick model switching and status monitoring. Version 0.4.0 introduced a refreshed UI with the split-view chat interface, enabling side-by-side model comparison. The redesign also improved the model loading workflow and added better visual feedback for GPU memory usage and token generation speed.
Flash Attention is an optimization that significantly reduces memory usage and increases inference speed, especially for models with long context windows. LM Studio enabled Flash Attention by default for CUDA in version 0.3.31 and for Vulkan and Metal in version 0.3.32.
Version 0.4.0 introduced continuous batching for the llama.cpp engine, allowing the LM Studio server to dynamically combine multiple incoming requests into a single batch. Users can configure the maximum number of concurrent predictions (default: 4) when loading a model. The unified KV cache feature allows flexible memory allocation across concurrent requests of varying sizes. This feature is particularly valuable for developer workflows where multiple applications or services send requests to the same local model simultaneously. MLX support for continuous batching is in development.
KV (key-value) caching across prompts improves response times in multi-turn conversations by reusing computations from earlier messages in the chat history. According to LM Studio's documentation, this optimization can reduce processing time from approximately 10 seconds to 0.11 seconds in tested scenarios.
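The prefix-reuse idea can be modeled with a toy cache that tracks how many leading tokens match the previous prompt and only "computes" the new suffix. This is a conceptual illustration of why multi-turn follow-ups are fast, not LM Studio's implementation:

```python
class PrefixKVCache:
    """Toy model of cross-prompt KV caching: tokens matching the cached
    prefix are reused instead of recomputed on the next turn."""

    def __init__(self):
        self.cached = []   # tokens with "stored" KV entries

    def process(self, tokens):
        """Return the number of tokens needing fresh computation."""
        shared = 0
        while (shared < min(len(self.cached), len(tokens))
               and self.cached[shared] == tokens[shared]):
            shared += 1
        self.cached = list(tokens)
        return len(tokens) - shared

cache = PrefixKVCache()
turn1 = ["<sys>", "Hello"]
print(cache.process(turn1))                        # every token is new
turn2 = turn1 + ["Hi!", "How", "are", "you?"]
print(cache.process(turn2))                        # only the 4 new tokens
```

On a real engine the per-token work is the expensive attention computation, so reprocessing only the suffix instead of the full history accounts for the large reported speedup.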
Actual performance varies significantly based on hardware, model size, quantization level, and context length. The following approximate benchmarks provide a general reference:
| Hardware | Model Size | Approximate Speed |
|---|---|---|
| Apple M3 Max | 1B (MLX) | ~250 tokens/second |
| Apple M3 Max | 7B-14B | 35-70 tokens/second |
| NVIDIA RTX 4060 | 7B (Q4) | ~30-40 tokens/second |
| NVIDIA RTX 4090 | 7B (Q4) | 100+ tokens/second |
| CPU-only (modern x64) | 7B (Q4) | 5-15 tokens/second |
LM Studio has developed a growing ecosystem of community contributions and third-party integrations. The official Discord server (discord.gg/lmstudio) serves as the primary community hub where users ask questions, share configurations, and report issues. The LM Studio team maintains an active presence on the server.
The LM Studio GitHub organization (github.com/lmstudio-ai) hosts the open-source SDKs, CLI tool, and documentation. The lms CLI repository has accumulated over 4,400 stars, and the TypeScript SDK over 1,500 stars.
The lmstudio-community organization on Hugging Face is a community-driven effort where members quantize and upload optimized GGUF versions of popular models. This resource makes it easier for users to find models that are ready to run in LM Studio without performing quantization themselves.
LM Studio also integrates with a range of third-party tools and frameworks that can target its OpenAI-compatible endpoints.
LM Studio also maintains a presence on Twitter, LinkedIn, and publishes technical blog posts at lmstudio.ai/blog covering new releases, feature guides, and integration tutorials.
LM Studio serves a variety of use cases across personal and professional contexts, from private, offline chat with sensitive material to prototyping LLM-powered applications against the local API server.
Despite its strengths, LM Studio has several notable limitations: