Microsoft Foundry Local

AI Infrastructure Developer Tools Microsoft Open Source AI

19 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

10 citations

Revision

v2 · 3,759 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Foundry Local is an on-device artificial intelligence runtime from Microsoft that lets applications run open weight language models entirely on a user's own hardware. The project ships a lightweight inference engine, a curated model catalog, native software development kits in four languages, a command line interface, and an optional local web server that speaks the same request and response format as the OpenAI API. It is part of the broader Azure AI Foundry product family, but it does not require an Azure subscription, a network connection at runtime, or any cloud account. Once a model has been downloaded to a device, the entire chat or transcription loop happens locally.

Microsoft introduced Foundry Local in public preview at Microsoft Build on 19 May 2025 and declared the product generally available on 9 April 2026. The runtime supports Windows, macOS on Apple silicon, and Linux on x64. It uses ONNX Runtime under the hood and selects the best available execution provider for the host machine, which means it will use an NPU or GPU when one is present and quietly fall back to the CPU when one is not. The official runtime adds roughly twenty megabytes to an application bundle, which is small enough that developers can ship it inside their installer and treat local inference as just another library dependency.

The product sits in a crowded category that already includes Ollama, LM Studio, GPT4All, and a long tail of community runners. What sets Foundry Local apart is the focus on shipping inside a finished consumer or enterprise app rather than running as a standalone developer tool. The SDK ships under the MIT license, the CLI ships under standard Microsoft software license terms, and individual models keep their own upstream licenses, which include Apache License 2.0, MIT, and various source available terms depending on the model in question.

Background

The arrival of Foundry Local marks a deliberate change in how Microsoft talks about its AI stack. Before 2025 the company aimed almost all of its generative AI marketing at cloud workloads on Azure, with OpenAI hosted models in Azure OpenAI Service sitting at the center. On-device work was treated as the domain of Windows Copilot Runtime and the small handful of Phi models that Microsoft Research had been publishing since late 2023. There was no clean way for a developer outside Microsoft to take one of those models, embed it in a desktop application, and ship the result to customers without writing a great deal of glue code around quantization, hardware probing, and driver updates.

The shift toward local inference was driven by several real world forces. First, the cost of cloud inference at scale became hard to ignore. Many independent developers and startups noticed that their monthly bills for chat features dwarfed every other line in their infrastructure spend. Second, regulators in Europe and elsewhere kept tightening the rules around sending user content to remote endpoints, especially for healthcare, legal, and education products. Third, the consumer hardware market got dramatically better at running large language models. The arrival of Qualcomm Snapdragon X laptops in mid 2024, the Apple silicon M3 and M4 chips, and the Intel Lunar Lake series brought neural processing units, abbreviated NPU, into the mainstream of mid range Windows machines and every recent Mac. Fourth, the open weight ecosystem produced a wave of models in the one to ten billion parameter range that were genuinely useful for chat and transcription rather than novelty grade.

Microsoft already had two adjacent projects that made Foundry Local possible. ONNX Runtime, originally released in 2018, had grown into a cross platform inference library with execution providers for CUDA, DirectML, CoreML, OpenVINO, QNN, and several other backends. Olive, a model optimization toolkit, handled the offline work of converting and quantizing transformer models so they fit into consumer memory budgets. Foundry Local pulled both of those pieces together into a single user facing product with a friendlier surface area. The cloud counterpart, Azure AI Foundry, continued to host frontier models through the Foundry Models catalog, and Microsoft positioned Foundry Local as the matching on-device runtime so that a team could prototype against the cloud catalog and then migrate suitable workloads down to client devices without rewriting their application code.

Architecture

Foundry Local is built as a small native binary plus a set of language bindings. The binary contains the lifecycle manager, the model catalog client, and the inference loop. When an application calls the SDK, the manager spawns or attaches to the local service, downloads the model files on first use, caches them on disk, and then either runs inference in process through the bindings or exposes an HTTP server that other processes can call.

Under the SDK there is a layered execution stack. ONNX Runtime sits at the bottom and provides the abstract model graph and the kernel dispatch logic. Above it Foundry Local maintains a registry of execution providers. An execution provider is a hardware specific backend that knows how to run ONNX operators on a particular accelerator. On Windows the registry typically includes the DirectML provider for Direct3D 12 GPUs, the QNN provider for Qualcomm NPUs, the OpenVINO provider for Intel iGPU and NPU silicon, and a CPU provider as the universal fallback. On macOS it includes a CoreML provider that targets the Metal stack and the Apple Neural Engine. On Linux x64 it ships CPU and CUDA providers, with additional providers being added over the course of 2026.

When the manager starts, it probes the host machine, reports the available execution providers to the application through a discoverEps call, and then downloads the matching native packages on demand. The first run of a Foundry Local enabled app is therefore noticeably longer than subsequent runs, because the runtime is acquiring both the model weights and the hardware specific shared libraries it needs. After that initial setup the bytes stay on disk and the manager skips straight to loading.

Models in the catalog are not raw PyTorch checkpoints. They are ONNX graphs that have been quantized and recompiled by the Olive pipeline so that they run efficiently on the target hardware. A single model alias such as phi-4-mini typically resolves to several physical variants, including int4 weights for CPU, int4 weights packed for NPU, and a fp16 GPU variant. The manager picks the variant that matches the chosen execution provider, which is why the same alias can deliver very different memory footprints and throughput numbers on different machines. Microsoft has not published every internal quantization recipe, but the documentation makes clear that every catalog entry goes through a process the team calls quantization and compression to fit consumer memory budgets.

The runtime keeps the key value cache, the tokenizer, and the streaming logic inside its own process. Applications never have to load PyTorch, Transformers, or Hugging Face tooling. The optional web server speaks the OpenAI Responses API and the older Chat Completions API, which means that a developer who already wrote code against the OpenAI SDK can swap the base URL for http://localhost and get the same calls working against a local model with no other code change. That compatibility decision is one of the more important design choices in the project, because it lets a team move between cloud and local backends with minimal friction.

Supported models

The Foundry Local catalog is curated rather than open. Microsoft has explicitly said that the project is not designed to host every model on the public internet, and that the catalog is meant to include models that have been quantized, tested across a range of consumer hardware, and small enough to ship to end users. The current catalog covers two main task families: chat completions and audio transcription. The table below lists the model families that Microsoft has confirmed in the product documentation as of mid 2026.

Family	Origin	Task	License notes
Phi-4 and Phi-4 mini	Microsoft Research	Chat and reasoning	MIT
Phi-4 mini reasoning	Microsoft Research	Reasoning oriented chat	MIT
GPT OSS	OpenAI	Chat	Open weight, see OpenAI terms
Llama variants	Meta	Chat	Llama Community License
Qwen and Qwen 2.5	Alibaba	Chat	Apache 2.0 for several variants
DeepSeek	DeepSeek	Chat and reasoning	DeepSeek model license
Mistral	Mistral AI	Chat	Apache 2.0 for several variants
Whisper	OpenAI	Audio transcription	MIT

The exact list of available aliases changes over time as new versions are added and older ones are deprecated. The CLI exposes the live list through foundry model ls, and the SDK exposes the same data through the catalog object. Each model entry includes its size, its supported hardware variants, and the upstream license text. Models that Microsoft has aliased for convenience can be invoked by short names such as qwen2.5-0.5b or phi-4-mini, and the manager handles the resolution to the appropriate quantized file automatically.

One important point that often confuses newcomers is that Foundry Local does not load models from the Hugging Face Hub directly. The catalog is hosted by Microsoft, and models are added only after the Olive pipeline has produced an ONNX variant that the team is willing to support. This is the trade off behind the product's promise of a reliable end user experience. Developers who need access to a model outside the catalog typically reach for Ollama or LM Studio instead, both of which let users load arbitrary GGUF files.

Platforms

Foundry Local targets three operating systems. The exact installation flow differs by platform but the resulting CLI and service behave the same way.

Platform	Install command	Hardware acceleration paths
Windows 10 and 11, x64 and ARM64	`winget install Microsoft.FoundryLocal`	Windows ML, DirectML for GPU, QNN for Qualcomm NPU, OpenVINO for Intel iGPU and NPU, CUDA for NVIDIA GPU, CPU fallback
macOS on Apple silicon	`brew tap microsoft/foundrylocal` then `brew install foundrylocal`	CoreML provider targeting Metal and the Apple Neural Engine
Linux x64	Distributed via the Foundry Local GitHub releases page	CPU provider, CUDA provider for NVIDIA GPUs, additional providers rolling out through 2026

The Windows experience is the most polished of the three. The Windows package integrates with Windows ML, the system level inference runtime that Microsoft has positioned as the umbrella for all local model execution on the operating system. That integration means the Foundry Local SDK on Windows can use the system managed pool of execution providers, which keeps driver updates aligned with regular Windows Update cycles instead of forcing every app to ship its own GPU runtime.

Microsoft documents minimum system requirements of 8 gigabytes of RAM and 3 gigabytes of free disk space, with a recommended configuration of 16 gigabytes of RAM and 15 gigabytes of free disk space to accommodate larger models and additional execution provider packages. An internet connection is required for the first run, because the manager needs to fetch both the model weights and the relevant execution provider, but every run after that can be fully offline.

The Linux build is the youngest of the three. At GA it covered x64 with CPU and CUDA acceleration, with explicit notes that more execution providers and ARM64 builds were planned over the course of 2026. The macOS build covers only Apple silicon. There is no Intel Mac build, which mirrors Apple's own deprecation timeline for that architecture.

Command line interface and SDK usage

The Foundry Local CLI is the first thing most developers touch. After installation the foundry command becomes available in a terminal. The basic commands cover listing the catalog, inspecting individual models, downloading weights, and running an interactive chat session.

foundry model ls                 # list models in the catalog
foundry model info phi-4-mini    # show details for one model
foundry model run qwen2.5-0.5b   # download if needed and start a chat
foundry service status           # check whether the local service is running

The foundry model run command opens a REPL style chat in the terminal. It is useful for quick smoke tests but it is not the intended production interface. The intended path is the SDK, which is published for four languages.

Language	Package name	Notes
C#	`Microsoft.AI.Foundry.Local` and `Microsoft.AI.Foundry.Local.WinML`	The WinML variant uses the Windows system runtime for broader hardware coverage
Python	`foundry-local-sdk` and `foundry-local-sdk-winml`	Requires Python 3.11 or later
JavaScript and TypeScript	`foundry-local-sdk` and `foundry-local-sdk-winml`	Requires Node.js 20 or later
Rust	`foundry-local-sdk`, with an optional `winml` feature flag	Requires Rust 1.70 or later

All four SDKs follow the same shape. An application creates a manager with a configuration object, discovers the available execution providers, downloads and registers them, retrieves a model from the catalog, downloads its weights if they are not cached, loads the model into memory, and then opens a chat client that exposes both blocking and streaming completion calls. When the application is done with the model it calls unload, which returns the memory to the operating system.

The streaming completion path is the one most consumer apps use, because it gives the user a typewriter style response instead of waiting for the full reply. The pattern mirrors the OpenAI streaming API closely. In Python it looks roughly like this:

for chunk in client.complete_streaming_chat(messages):
    print(chunk.choices[0].delta.content, end="", flush=True)

The equivalent in C# uses await foreach on a streaming response. JavaScript uses async iterators. Rust uses tokio_stream::StreamExt::next. The conceptual model is the same across all four languages, which makes the SDK reasonably easy to port between stacks.

For applications that need to share a loaded model across multiple processes, Foundry Local also exposes an optional local HTTP server with OpenAI compatible endpoints. This is the path most often used to wire up tools like LangChain, Semantic Kernel, or any other library that already speaks OpenAI Chat Completions. For an embedded single user scenario, Microsoft recommends the in process SDK instead, because it avoids the overhead of an extra HTTP hop and a separate server process.

Comparison to Ollama and LM Studio

Foundry Local lands in the same conceptual space as a handful of established projects, and the design trade offs make for an interesting contrast. The table below summarizes the main differences as of mid 2026.

Dimension	Foundry Local	Ollama	LM Studio
Primary audience	Application developers shipping AI in their own installer	Developers and power users running local models from a terminal	End users who want a desktop GUI for local chat
Runtime engine	ONNX Runtime with hardware specific execution providers	llama.cpp with custom Go orchestration	llama.cpp wrapped in an Electron app
Model format	Curated ONNX graphs produced by Microsoft's Olive pipeline	GGUF files pulled from a public registry	GGUF files, with direct browsing of Hugging Face
Model catalog	Closed and curated	Open registry plus user uploaded Modelfiles	Open browsing of any Hugging Face GGUF
Hardware backends	CPU, NVIDIA GPU, AMD GPU, Intel iGPU and NPU, Qualcomm NPU, Apple Metal and ANE	CPU, NVIDIA GPU, Apple Metal, some AMD support	CPU, NVIDIA GPU, Apple Metal, Vulkan
Distribution model	SDK bundled inside customer apps, plus standalone CLI	Standalone background service with HTTP API	Standalone desktop application with optional server mode
OpenAI compatible API	Yes, including the Responses API	Yes, partial Chat Completions	Yes, Chat Completions
Native SDKs	C#, Python, JavaScript, Rust	Community SDKs for several languages, none official until 2025	Community SDKs
Licensing	SDK under MIT, CLI under Microsoft terms, models keep upstream licenses	Open source, MIT	Proprietary, free for personal and limited commercial use
Installer footprint	About 20 MB for the runtime	Larger, includes the model server	Larger, includes a full Electron GUI

The practical division of labor is roughly this. Ollama is the easiest path for a developer who wants to grab a model from a public registry and call it from a script. LM Studio is the easiest path for a non technical user who wants a visual chat window. Foundry Local is the easiest path for a team that wants to embed local inference inside a finished product, control the model selection process, and rely on Microsoft for hardware compatibility, driver updates, and OpenAI API parity.

The closed catalog is the most polarizing piece of that trade. Some developers see it as a feature, because it removes the support burden of arbitrary user supplied models and produces a predictable end user experience. Others see it as a limitation, because their use case depends on a specific community fine tune that will never appear in the Microsoft catalog. The right answer depends entirely on whether the application is trying to ship a known good experience or trying to expose the full long tail of open weight models.

Licensing

Foundry Local uses a split license model that is worth understanding before shipping the runtime in a commercial product.

The SDK packages, including the C# library, the Python package, the JavaScript module, and the Rust crate, are released under the MIT license. That is a permissive license that allows commercial use, redistribution, and modification with attribution. The MIT license sits in the project's LICENSE file on GitHub.

The CLI and the background service are released under standard Microsoft software license terms rather than under MIT. Those terms allow free installation and use but are not an open source license in the OSI sense. Teams that plan to redistribute the CLI inside their installer should review the Microsoft terms carefully, because the conditions differ from those of the SDK.

Each model in the catalog keeps its own upstream license. Phi-4 ships under MIT. Several Qwen and Mistral variants ship under Apache 2.0. Llama variants ship under the Meta Llama Community License, which restricts certain large scale commercial deployments. GPT OSS uses OpenAI's own open weight terms. Whisper is MIT. DeepSeek models ship under the DeepSeek model license. The CLI surfaces the relevant license string for each model through the foundry model info command, and the SDK exposes the same information through the catalog object. Microsoft has been explicit that the user of the model, not Microsoft, is responsible for complying with the model's upstream license terms.

The overall effect is that Foundry Local is open enough for most product teams to embed in their applications without legal friction, but it is not as cleanly open source as some competing runtimes. A team that wants a fully MIT or Apache licensed stack end to end will need to read each individual model license before shipping.

Reception

The reception of Foundry Local in the developer community has been broadly positive. Most coverage during the public preview period focused on three points. First, the runtime is genuinely small, which is unusual for a product with so much surface area, and the twenty megabyte installer footprint compares favorably to the multi hundred megabyte downloads associated with other local model stacks. Second, the OpenAI API compatibility removed an enormous amount of integration work for teams that already had OpenAI based prototypes, which made migration to local inference feel cheap rather than expensive. Third, the automatic hardware detection actually worked, which is a low bar in principle but a high bar in practice on a fragmented Windows ecosystem.

The most common criticism during preview was the closed catalog. Developers who wanted to run a specific quantized variant from Hugging Face had to reach for Ollama or LM Studio instead. Microsoft has acknowledged the trade off and framed the curation as a deliberate quality decision, while continuing to expand the catalog with each release. The April 2026 GA announcement noted that resumable downloads, token by token streaming on every supported platform, and improved hardware detection had landed as part of the production push.

Foundry Local has also become a common starting point for developers experimenting with NPU acceleration on the latest generation of consumer hardware. The Qualcomm Snapdragon X Copilot Plus PCs in particular benefit from Foundry Local because the QNN execution provider is one of the few easy ways to run a model on the dedicated Hexagon NPU. Apple silicon machines see similar gains from the CoreML provider, which uses the Neural Engine for the parts of the graph it can accelerate.

In the enterprise market the story is shaped by the relationship with the cloud Foundry product. Teams that already use Azure AI Foundry for cloud inference appreciate that they can ship the same prompt and the same OpenAI client code against a local Foundry runtime on a customer device, which makes hybrid deployments cleaner than they would be otherwise. That story has helped Foundry Local pick up adoption in regulated industries where keeping prompt and response data on a customer device is preferable to sending it to a cloud endpoint.

The runtime has not displaced Ollama in the hobbyist developer community, which is the part of the market most invested in being able to run any model from any source. It has, however, carved out a distinct niche as the default choice for product teams that want to ship a local AI feature inside a finished application and want Microsoft to take responsibility for the hardware integration work.

References

Microsoft. "What is Foundry Local?" Microsoft Learn, https://learn.microsoft.com/en-us/azure/foundry-local/what-is-foundry-local.
Microsoft Foundry Blog. "Foundry Local is now Generally Available." 9 April 2026, https://devblogs.microsoft.com/foundry/foundry-local-ga/.
Microsoft. "Foundry Local on GitHub." https://github.com/microsoft/Foundry-Local.
Microsoft. "Get started with Foundry Local." Microsoft Learn, https://learn.microsoft.com/en-us/azure/foundry-local/get-started.
Microsoft. "Foundry Local LICENSE." GitHub, https://github.com/microsoft/Foundry-Local/blob/main/LICENSE.
Microsoft. "Foundry Local security and privacy reference." GitHub, https://github.com/microsoft/Foundry-Local/blob/main/docs/reference/reference-security-privacy.md.
Windows Developer Blog. "Advancing Windows for AI development: New platform capabilities and tools introduced at Build 2025." 19 May 2025, https://blogs.windows.com/windowsdeveloper/2025/05/19/advancing-windows-for-ai-development-new-platform-capabilities-and-tools-introduced-at-build-2025/.
Microsoft Foundry Blog. "What's new in Microsoft Foundry, April 2026." https://devblogs.microsoft.com/foundry/whats-new-in-microsoft-foundry-apr-2026/.
Microsoft. "Microsoft Foundry Models overview." Microsoft Learn, https://learn.microsoft.com/en-us/azure/foundry/concepts/foundry-models-overview.
ONNX Runtime project. "ONNX Runtime." https://onnxruntime.ai/.

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Microsoft Maia 200

Background

Architecture

Supported models

Platforms

Command line interface and SDK usage

Comparison to Ollama and LM Studio

Licensing

Reception

See also

References

Improve this article

Related Articles

Semantic Kernel

Guidance (library)

Ray (framework)

XLA (Accelerated Linear Algebra)

Supabase

Apache MXNet