Microsoft Foundry Local
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 ยท 3,763 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 ยท 3,763 words
Add missing citations, update stale details, or suggest a clearer explanation.
Foundry Local is an on-device artificial intelligence runtime from Microsoft that lets applications run open weight language models entirely on a user's own hardware. The project ships a lightweight inference engine, a curated model catalog, native software development kits in four languages, a command line interface, and an optional local web server that speaks the same request and response format as the OpenAI API. It is part of the broader Azure AI Foundry product family, but it does not require an Azure subscription, a network connection at runtime, or any cloud account. Once a model has been downloaded to a device, the entire chat or transcription loop happens locally.
Microsoft introduced Foundry Local in public preview at Microsoft Build on 19 May 2025 and declared the product generally available on 9 April 2026. The runtime supports Windows, macOS on Apple silicon, and Linux on x64. It uses ONNX Runtime under the hood and selects the best available execution provider for the host machine, which means it will use an NPU or GPU when one is present and quietly fall back to the CPU when one is not. The official runtime adds roughly twenty megabytes to an application bundle, which is small enough that developers can ship it inside their installer and treat local inference as just another library dependency.
The product sits in a crowded category that already includes Ollama, LM Studio, GPT4All, and a long tail of community runners. What sets Foundry Local apart is the focus on shipping inside a finished consumer or enterprise app rather than running as a standalone developer tool. The SDK ships under the MIT license, the CLI ships under standard Microsoft software license terms, and individual models keep their own upstream licenses, which include Apache License 2.0, MIT, and various source available terms depending on the model in question.
The arrival of Foundry Local marks a deliberate change in how Microsoft talks about its AI stack. Before 2025 the company aimed almost all of its generative AI marketing at cloud workloads on Azure, with OpenAI hosted models in Azure OpenAI Service sitting at the center. On-device work was treated as the domain of Windows Copilot Runtime and the small handful of Phi models that Microsoft Research had been publishing since late 2023. There was no clean way for a developer outside Microsoft to take one of those models, embed it in a desktop application, and ship the result to customers without writing a great deal of glue code around quantization, hardware probing, and driver updates.
The shift toward local inference was driven by several real world forces. First, the cost of cloud inference at scale became hard to ignore. Many independent developers and startups noticed that their monthly bills for chat features dwarfed every other line in their infrastructure spend. Second, regulators in Europe and elsewhere kept tightening the rules around sending user content to remote endpoints, especially for healthcare, legal, and education products. Third, the consumer hardware market got dramatically better at running large language models. The arrival of Qualcomm Snapdragon X laptops in mid 2024, the Apple silicon M3 and M4 chips, and the Intel Lunar Lake series brought neural processing units, abbreviated NPU, into the mainstream of mid range Windows machines and every recent Mac. Fourth, the open weight ecosystem produced a wave of models in the one to ten billion parameter range that were genuinely useful for chat and transcription rather than novelty grade.
Microsoft already had two adjacent projects that made Foundry Local possible. ONNX Runtime, originally released in 2018, had grown into a cross platform inference library with execution providers for CUDA, DirectML, CoreML, OpenVINO, QNN, and several other backends. Olive, a model optimization toolkit, handled the offline work of converting and quantizing transformer models so they fit into consumer memory budgets. Foundry Local pulled both of those pieces together into a single user facing product with a friendlier surface area. The cloud counterpart, Azure AI Foundry, continued to host frontier models through the Foundry Models catalog, and Microsoft positioned Foundry Local as the matching on-device runtime so that a team could prototype against the cloud catalog and then migrate suitable workloads down to client devices without rewriting their application code.
Foundry Local is built as a small native binary plus a set of language bindings. The binary contains the lifecycle manager, the model catalog client, and the inference loop. When an application calls the SDK, the manager spawns or attaches to the local service, downloads the model files on first use, caches them on disk, and then either runs inference in process through the bindings or exposes an HTTP server that other processes can call.
Under the SDK there is a layered execution stack. ONNX Runtime sits at the bottom and provides the abstract model graph and the kernel dispatch logic. Above it Foundry Local maintains a registry of execution providers. An execution provider is a hardware specific backend that knows how to run ONNX operators on a particular accelerator. On Windows the registry typically includes the DirectML provider for Direct3D 12 GPUs, the QNN provider for Qualcomm NPUs, the OpenVINO provider for Intel iGPU and NPU silicon, and a CPU provider as the universal fallback. On macOS it includes a CoreML provider that targets the Metal stack and the Apple Neural Engine. On Linux x64 it ships CPU and CUDA providers, with additional providers being added over the course of 2026.
When the manager starts, it probes the host machine, reports the available execution providers to the application through a discoverEps call, and then downloads the matching native packages on demand. The first run of a Foundry Local enabled app is therefore noticeably longer than subsequent runs, because the runtime is acquiring both the model weights and the hardware specific shared libraries it needs. After that initial setup the bytes stay on disk and the manager skips straight to loading.
Models in the catalog are not raw PyTorch checkpoints. They are ONNX graphs that have been quantized and recompiled by the Olive pipeline so that they run efficiently on the target hardware. A single model alias such as phi-4-mini typically resolves to several physical variants, including int4 weights for CPU, int4 weights packed for NPU, and a fp16 GPU variant. The manager picks the variant that matches the chosen execution provider, which is why the same alias can deliver very different memory footprints and throughput numbers on different machines. Microsoft has not published every internal quantization recipe, but the documentation makes clear that every catalog entry goes through a process the team calls quantization and compression to fit consumer memory budgets.
The runtime keeps the key value cache, the tokenizer, and the streaming logic inside its own process. Applications never have to load PyTorch, Transformers, or Hugging Face tooling. The optional web server speaks the OpenAI Responses API and the older Chat Completions API, which means that a developer who already wrote code against the OpenAI SDK can swap the base URL for http://localhost and get the same calls working against a local model with no other code change. That compatibility decision is one of the more important design choices in the project, because it lets a team move between cloud and local backends with minimal friction.
The Foundry Local catalog is curated rather than open. Microsoft has explicitly said that the project is not designed to host every model on the public internet, and that the catalog is meant to include models that have been quantized, tested across a range of consumer hardware, and small enough to ship to end users. The current catalog covers two main task families: chat completions and audio transcription. The table below lists the model families that Microsoft has confirmed in the product documentation as of mid 2026.
| Family | Origin | Task | License notes |
|---|---|---|---|
| Phi-4 and Phi-4 mini | Microsoft Research | Chat and reasoning | MIT |
| Phi-4 mini reasoning | Microsoft Research | Reasoning oriented chat | MIT |
| GPT OSS | OpenAI | Chat | Open weight, see OpenAI terms |
| Llama variants | Meta | Chat | Llama Community License |
| Qwen and Qwen 2.5 | Alibaba | Chat | Apache 2.0 for several variants |
| DeepSeek | DeepSeek | Chat and reasoning | DeepSeek model license |
| Mistral | Mistral AI | Chat | Apache 2.0 for several variants |
| Whisper | OpenAI | Audio transcription | MIT |
The exact list of available aliases changes over time as new versions are added and older ones are deprecated. The CLI exposes the live list through foundry model ls, and the SDK exposes the same data through the catalog object. Each model entry includes its size, its supported hardware variants, and the upstream license text. Models that Microsoft has aliased for convenience can be invoked by short names such as qwen2.5-0.5b or phi-4-mini, and the manager handles the resolution to the appropriate quantized file automatically.
One important point that often confuses newcomers is that Foundry Local does not load models from the Hugging Face Hub directly. The catalog is hosted by Microsoft, and models are added only after the Olive pipeline has produced an ONNX variant that the team is willing to support. This is the trade off behind the product's promise of a reliable end user experience. Developers who need access to a model outside the catalog typically reach for Ollama or LM Studio instead, both of which let users load arbitrary GGUF files.
Foundry Local targets three operating systems. The exact installation flow differs by platform but the resulting CLI and service behave the same way.
| Platform | Install command | Hardware acceleration paths |
|---|---|---|
| Windows 10 and 11, x64 and ARM64 | winget install Microsoft.FoundryLocal | Windows ML, DirectML for GPU, QNN for Qualcomm NPU, OpenVINO for Intel iGPU and NPU, CUDA for NVIDIA GPU, CPU fallback |
| macOS on Apple silicon | brew tap microsoft/foundrylocal then brew install foundrylocal | CoreML provider targeting Metal and the Apple Neural Engine |
| Linux x64 | Distributed via the Foundry Local GitHub releases page | CPU provider, CUDA provider for NVIDIA GPUs, additional providers rolling out through 2026 |
The Windows experience is the most polished of the three. The Windows package integrates with Windows ML, the system level inference runtime that Microsoft has positioned as the umbrella for all local model execution on the operating system. That integration means the Foundry Local SDK on Windows can use the system managed pool of execution providers, which keeps driver updates aligned with regular Windows Update cycles instead of forcing every app to ship its own GPU runtime.
Microsoft documents minimum system requirements of 8 gigabytes of RAM and 3 gigabytes of free disk space, with a recommended configuration of 16 gigabytes of RAM and 15 gigabytes of free disk space to accommodate larger models and additional execution provider packages. An internet connection is required for the first run, because the manager needs to fetch both the model weights and the relevant execution provider, but every run after that can be fully offline.
The Linux build is the youngest of the three. At GA it covered x64 with CPU and CUDA acceleration, with explicit notes that more execution providers and ARM64 builds were planned over the course of 2026. The macOS build covers only Apple silicon. There is no Intel Mac build, which mirrors Apple's own deprecation timeline for that architecture.
The Foundry Local CLI is the first thing most developers touch. After installation the foundry command becomes available in a terminal. The basic commands cover listing the catalog, inspecting individual models, downloading weights, and running an interactive chat session.
foundry model ls # list models in the catalog
foundry model info phi-4-mini # show details for one model
foundry model run qwen2.5-0.5b # download if needed and start a chat
foundry service status # check whether the local service is running
The foundry model run command opens a REPL style chat in the terminal. It is useful for quick smoke tests but it is not the intended production interface. The intended path is the SDK, which is published for four languages.
| Language | Package name | Notes |
|---|---|---|
| C# | Microsoft.AI.Foundry.Local and Microsoft.AI.Foundry.Local.WinML | The WinML variant uses the Windows system runtime for broader hardware coverage |
| Python | foundry-local-sdk and foundry-local-sdk-winml | Requires Python 3.11 or later |
| JavaScript and TypeScript | foundry-local-sdk and foundry-local-sdk-winml | Requires Node.js 20 or later |
| Rust | foundry-local-sdk, with an optional winml feature flag | Requires Rust 1.70 or later |
All four SDKs follow the same shape. An application creates a manager with a configuration object, discovers the available execution providers, downloads and registers them, retrieves a model from the catalog, downloads its weights if they are not cached, loads the model into memory, and then opens a chat client that exposes both blocking and streaming completion calls. When the application is done with the model it calls unload, which returns the memory to the operating system.
The streaming completion path is the one most consumer apps use, because it gives the user a typewriter style response instead of waiting for the full reply. The pattern mirrors the OpenAI streaming API closely. In Python it looks roughly like this:
for chunk in client.complete_streaming_chat(messages):
print(chunk.choices<sup><a href="#cite_note-0" class="cite-ref">[0]</a></sup>.delta.content, end="", flush=True)
The equivalent in C# uses await foreach on a streaming response. JavaScript uses async iterators. Rust uses tokio_stream::StreamExt::next. The conceptual model is the same across all four languages, which makes the SDK reasonably easy to port between stacks.
For applications that need to share a loaded model across multiple processes, Foundry Local also exposes an optional local HTTP server with OpenAI compatible endpoints. This is the path most often used to wire up tools like LangChain, Semantic Kernel, or any other library that already speaks OpenAI Chat Completions. For an embedded single user scenario, Microsoft recommends the in process SDK instead, because it avoids the overhead of an extra HTTP hop and a separate server process.
Foundry Local lands in the same conceptual space as a handful of established projects, and the design trade offs make for an interesting contrast. The table below summarizes the main differences as of mid 2026.
| Dimension | Foundry Local | Ollama | LM Studio |
|---|---|---|---|
| Primary audience | Application developers shipping AI in their own installer | Developers and power users running local models from a terminal | End users who want a desktop GUI for local chat |
| Runtime engine | ONNX Runtime with hardware specific execution providers | llama.cpp with custom Go orchestration | llama.cpp wrapped in an Electron app |
| Model format | Curated ONNX graphs produced by Microsoft's Olive pipeline | GGUF files pulled from a public registry | GGUF files, with direct browsing of Hugging Face |
| Model catalog | Closed and curated | Open registry plus user uploaded Modelfiles | Open browsing of any Hugging Face GGUF |
| Hardware backends | CPU, NVIDIA GPU, AMD GPU, Intel iGPU and NPU, Qualcomm NPU, Apple Metal and ANE | CPU, NVIDIA GPU, Apple Metal, some AMD support | CPU, NVIDIA GPU, Apple Metal, Vulkan |
| Distribution model | SDK bundled inside customer apps, plus standalone CLI | Standalone background service with HTTP API | Standalone desktop application with optional server mode |
| OpenAI compatible API | Yes, including the Responses API | Yes, partial Chat Completions | Yes, Chat Completions |
| Native SDKs | C#, Python, JavaScript, Rust | Community SDKs for several languages, none official until 2025 | Community SDKs |
| Licensing | SDK under MIT, CLI under Microsoft terms, models keep upstream licenses | Open source, MIT | Proprietary, free for personal and limited commercial use |
| Installer footprint | About 20 MB for the runtime | Larger, includes the model server | Larger, includes a full Electron GUI |
The practical division of labor is roughly this. Ollama is the easiest path for a developer who wants to grab a model from a public registry and call it from a script. LM Studio is the easiest path for a non technical user who wants a visual chat window. Foundry Local is the easiest path for a team that wants to embed local inference inside a finished product, control the model selection process, and rely on Microsoft for hardware compatibility, driver updates, and OpenAI API parity.
The closed catalog is the most polarizing piece of that trade. Some developers see it as a feature, because it removes the support burden of arbitrary user supplied models and produces a predictable end user experience. Others see it as a limitation, because their use case depends on a specific community fine tune that will never appear in the Microsoft catalog. The right answer depends entirely on whether the application is trying to ship a known good experience or trying to expose the full long tail of open weight models.
Foundry Local uses a split license model that is worth understanding before shipping the runtime in a commercial product.
The SDK packages, including the C# library, the Python package, the JavaScript module, and the Rust crate, are released under the MIT license. That is a permissive license that allows commercial use, redistribution, and modification with attribution. The MIT license sits in the project's LICENSE file on GitHub.
The CLI and the background service are released under standard Microsoft software license terms rather than under MIT. Those terms allow free installation and use but are not an open source license in the OSI sense. Teams that plan to redistribute the CLI inside their installer should review the Microsoft terms carefully, because the conditions differ from those of the SDK.
Each model in the catalog keeps its own upstream license. Phi-4 ships under MIT. Several Qwen and Mistral variants ship under Apache 2.0. Llama variants ship under the Meta Llama Community License, which restricts certain large scale commercial deployments. GPT OSS uses OpenAI's own open weight terms. Whisper is MIT. DeepSeek models ship under the DeepSeek model license. The CLI surfaces the relevant license string for each model through the foundry model info command, and the SDK exposes the same information through the catalog object. Microsoft has been explicit that the user of the model, not Microsoft, is responsible for complying with the model's upstream license terms.
The overall effect is that Foundry Local is open enough for most product teams to embed in their applications without legal friction, but it is not as cleanly open source as some competing runtimes. A team that wants a fully MIT or Apache licensed stack end to end will need to read each individual model license before shipping.
The reception of Foundry Local in the developer community has been broadly positive. Most coverage during the public preview period focused on three points. First, the runtime is genuinely small, which is unusual for a product with so much surface area, and the twenty megabyte installer footprint compares favorably to the multi hundred megabyte downloads associated with other local model stacks. Second, the OpenAI API compatibility removed an enormous amount of integration work for teams that already had OpenAI based prototypes, which made migration to local inference feel cheap rather than expensive. Third, the automatic hardware detection actually worked, which is a low bar in principle but a high bar in practice on a fragmented Windows ecosystem.
The most common criticism during preview was the closed catalog. Developers who wanted to run a specific quantized variant from Hugging Face had to reach for Ollama or LM Studio instead. Microsoft has acknowledged the trade off and framed the curation as a deliberate quality decision, while continuing to expand the catalog with each release. The April 2026 GA announcement noted that resumable downloads, token by token streaming on every supported platform, and improved hardware detection had landed as part of the production push.
Foundry Local has also become a common starting point for developers experimenting with NPU acceleration on the latest generation of consumer hardware. The Qualcomm Snapdragon X Copilot Plus PCs in particular benefit from Foundry Local because the QNN execution provider is one of the few easy ways to run a model on the dedicated Hexagon NPU. Apple silicon machines see similar gains from the CoreML provider, which uses the Neural Engine for the parts of the graph it can accelerate.
In the enterprise market the story is shaped by the relationship with the cloud Foundry product. Teams that already use Azure AI Foundry for cloud inference appreciate that they can ship the same prompt and the same OpenAI client code against a local Foundry runtime on a customer device, which makes hybrid deployments cleaner than they would be otherwise. That story has helped Foundry Local pick up adoption in regulated industries where keeping prompt and response data on a customer device is preferable to sending it to a cloud endpoint.
The runtime has not displaced Ollama in the hobbyist developer community, which is the part of the market most invested in being able to run any model from any source. It has, however, carved out a distinct niche as the default choice for product teams that want to ship a local AI feature inside a finished application and want Microsoft to take responsibility for the hardware integration work.