Apple Foundation Models (AFM) are the large language models developed by Apple to power Apple Intelligence, the company's generative AI system introduced at WWDC 2024. The family consists of two primary models: AFM-on-device, a roughly 3-billion-parameter model optimized to run locally on Apple Silicon, and AFM-server, a larger mixture-of-experts model deployed in Apple's Private Cloud Compute infrastructure. Both models underpin features including Writing Tools, Smart Reply, notification summaries, and Siri's extended language-understanding capabilities.
Apple described the technical details of these models in a paper published on arXiv in July 2024, and released a more comprehensive 2025 tech report covering architectural improvements, multilingual expansion, and new developer APIs. Unlike most AI models from large technology companies, AFM runs substantially on the user's own device, a design choice that reflects Apple's long-standing emphasis on local processing and data minimization.
Apple Intelligence was announced at WWDC 2024 in June of that year as the company's first major foray into generative AI integrated across iOS, iPadOS, and macOS. Rather than rely entirely on third-party APIs or cloud inference, Apple built its own foundation models and deployed them using a combination of on-device inference and a hardened cloud backend.
Apple's AI research work predates the public announcement substantially. The company had been working on neural language models and on-device machine learning for years, publishing work on topics such as federated learning, differential privacy, and on-device speech recognition. The foundation model effort represents a consolidation of those threads into a single, general-purpose language model suitable for consumer use cases at scale.
The initial October 2024 release with iOS 18.1 supported U.S. English only. iOS 18.2 in December 2024 extended availability to localized English in Canada, Australia, New Zealand, Ireland, the United Kingdom, and South Africa. iOS 18.4 in April 2025 added French, German, Italian, Portuguese (Brazil), Spanish, Japanese, Korean, and Simplified Chinese, bringing Apple Intelligence to the European Union for the first time. By the 2025 tech report, the models supported 16 languages with ongoing expansion.
AFM-on-device runs on devices with an A17 Pro chip or newer (iPhone 15 Pro and later) or any M-series Apple Silicon chip (iPad and Mac). The minimum configuration is an M1 chip, which Apple introduced in late 2020. The Neural Engine in these chips handles the bulk of inference; the M1's Neural Engine can execute 11 trillion operations per second, which is sufficient for the model's low-bit (mixed 2- to 4-bit) quantized weights. The AFM-server model runs on Apple-designed server hardware in Apple's data centers, also using Apple Silicon.
AFM-on-device is a dense, decoder-only Transformer with approximately 2.73 billion parameters (2.58 billion non-embedding plus 0.15 billion embedding weights), typically described as a 3-billion-parameter model. The model uses a standard modern Transformer configuration:
| Parameter | Value |
|---|---|
| Model dimension | 3,072 |
| Feed-forward dimension | ~8,192 (SwiGLU) |
| Attention heads (query) | 24 |
| Attention heads (key/value) | 8 |
| Layers | 26 |
| Head dimension | 128 |
| Vocabulary size | 49,000 tokens |
| Context length (production) | 4,096 tokens |
| Extended context | 32,768 tokens |
The architecture uses grouped-query attention with 8 key/value heads against 24 query heads, which reduces the memory footprint of the KV cache substantially compared to multi-head attention. Positional embeddings use RoPE (Rotary Position Embeddings). Normalization is RMSNorm applied before each sublayer. The activation function is SwiGLU. Input and output embedding matrices are shared.
One notable architectural decision is a two-block structure with a 5:3 depth ratio. The full 26-layer model is divided into a first block of roughly 16 layers and a second block of roughly 10 layers. All key-value caches in the second block are shared with the KV caches produced by the final layer of the first block. This reduces KV cache memory by 37.5% compared to a conventional setup where each layer maintains its own cache. The practical effect is a significant reduction in time-to-first-token, which is the latency metric most noticeable to users when starting a generation.
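A back-of-envelope sketch (in Swift, using the figures from the table above; not Apple's implementation) of how grouped-query attention and the two-block cache sharing combine to shrink KV cache memory:

```swift
// Illustrative KV-cache accounting for the on-device configuration.
// Byte math assumes the 8-bit KV-cache quantization described below.
let layers = 26
let firstBlockLayers = 16          // ~5/8 of depth; block 2 has ~10 layers
let kvHeads = 8                    // grouped-query attention (vs. 24 query heads)
let headDim = 128
let contextTokens = 4_096
let bytesPerValue = 1              // 8-bit quantized cache entries

// Per-layer cache: keys + values for every token in the context.
let perLayerBytes = 2 * kvHeads * headDim * contextTokens * bytesPerValue

// Only the first block's layers allocate caches; every second-block layer
// reuses the cache written by the first block's final layer.
let sharedTotal = firstBlockLayers * perLayerBytes
let unsharedTotal = layers * perLayerBytes
let savings = 1.0 - Double(sharedTotal) / Double(unsharedTotal)
// savings ≈ 0.385 with 16 of 26 layers owning caches; Apple quotes 37.5%,
// i.e. the exact 3/8 implied by the 5:3 block ratio.
```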
On iPhone 15 Pro, Apple reported a time-to-first-token of approximately 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second before token speculation. These numbers position the model within the range needed for interactive use without perceptible lag.
The production on-device model uses a mixed-precision scheme averaging 3.7 bits per weight (the model can be compressed to 3.5 bits without significant quality loss). Most projection weights use 4-bit palettization with 16-column grouping. Some layers use 2-bit quantization. Embedding layers use 8-bit per-channel quantization. KV caches use 8-bit quantization.
Critically, Apple applies quantization-aware training (QAT) rather than post-training quantization. This means the model is trained with simulated quantization noise so that the weights settle into positions that are robust to low-bit representation. Apple also trains lightweight accuracy-recovery LoRA adapters on approximately 10 billion tokens to compensate for any residual quantization loss. The company uses an internal tool called Talaria to optimize per-layer bit-rate assignments, balancing model quality against memory and latency budgets.
Activation quantization is applied separately from weight quantization. KV cache updates use efficient Neural Engine kernels optimized for the specific memory layout of Apple Silicon.
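A minimal sketch of the fake-quantization step at the heart of QAT, assuming a uniform per-group grid (Apple's palettization uses learned lookup tables, so this is a simplification):

```swift
// During QAT, the forward pass snaps each weight to a low-bit grid so the
// network trains against the same rounding error it will see at inference.
// The backward pass (not shown) treats the rounding as identity — the
// straight-through estimator.
func fakeQuantize(_ weights: [Float], bits: Int) -> [Float] {
    let levels = Float((1 << bits) - 1)
    guard let lo = weights.min(), let hi = weights.max(), hi > lo else {
        return weights
    }
    let scale = (hi - lo) / levels
    return weights.map { w in
        ((w - lo) / scale).rounded() * scale + lo
    }
}

// Example: 4-bit quantization of a hypothetical 16-column weight group.
let group: [Float] = [-0.31, 0.12, 0.07, -0.02, 0.25, -0.18, 0.01, 0.09,
                      -0.11, 0.30, -0.27, 0.04, 0.16, -0.08, 0.22, -0.05]
let quantized = fakeQuantize(group, bits: 4)   // each value on a 16-level grid
```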
AFM-server is a larger model running on Apple Silicon servers inside the Private Cloud Compute (PCC) infrastructure. Apple has not publicly disclosed the total parameter count. The model uses a different vocabulary with 100,000 tokens (expanded to 150,000 in the 2025 update to support more languages).
The architectural innovation in AFM-server is a design called Parallel-Track Mixture-of-Experts (PT-MoE). Conventional mixture-of-experts (MoE) architectures route each token to a subset of expert feed-forward networks, reducing the active parameters per forward pass. The PT-MoE design extends this with an additional dimension of parallelism called track parallelism.
In track parallelism, the server model contains multiple smaller Transformer stacks called tracks. Each track processes tokens independently in parallel. Synchronization between tracks occurs only at the input and output boundaries of each track block, not at every layer. Conventional tensor parallelism requires synchronization at every layer, which generates 2L synchronization points for a model with L layers. With PT-MoE and track block depth D, synchronization points are reduced to L/D. At D=4, this represents an 87.5% reduction in inter-device communication overhead.
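The arithmetic behind the quoted reduction, using a hypothetical layer count (Apple has not disclosed the server model's depth):

```swift
// Synchronization-point accounting per the text: tensor parallelism syncs
// twice per layer (2L); track parallelism syncs once per track block (L/D).
let layers = 32            // hypothetical L
let trackBlockDepth = 4    // D, as in the 87.5% example

let tensorParallelSyncs = 2 * layers                 // 64
let trackParallelSyncs = layers / trackBlockDepth    // 8
let reduction = 1.0 - Double(trackParallelSyncs) / Double(tensorParallelSyncs)
// reduction == 0.875 — the 87.5% figure quoted above, independent of L
```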
Each track block also has its own MoE layers. The combined structure gives the server model high total capacity (via the MoE experts) while keeping inference latency manageable (via reduced synchronization). Apple additionally uses interleaved global-local attention in the server model: most attention layers use sliding-window local attention over nearby tokens, with global attention layers interspersed at intervals. This combination supports efficient processing of longer sequences up to 65,536 tokens.
The server model includes a ViT-g vision encoder (approximately 1 billion parameters) for image understanding tasks.
Both models are trained using Apple's AXLearn framework, an open-source library built on JAX and XLA that Apple released in 2023 under the Apache 2.0 license. AXLearn supports data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) training simultaneously across thousands of accelerators. AFM-server was pre-trained on 8,192 TPUv4 chips for 6.3 trillion tokens.
The pre-training data mixture combines web pages crawled by Applebot, licensed content from publishers, publicly available code and mathematics datasets, and filtered synthetic data.
Apple does not use private user data or user interactions in training. The company applies filtering to remove personally identifiable information and low-quality content.
Pre-training uses a sequence length of 4,096 tokens with batch size 4,096. A continued pre-training phase of 1 trillion tokens at sequence length 8,192 emphasizes code and mathematics. A context-lengthening phase trains on 100 billion tokens at 32,768 sequence length.
AFM-on-device is not trained from scratch at its final size. Instead, it is initialized from a pruned version of a 6.4-billion-parameter model. The pruning procedure learns sparse masks using a Soft-Top-K masking method (similar to methods from Wang et al. 2020 and Xia et al. 2023). Pruning is applied only to the hidden dimension of feed-forward layers. The mask is learned over 188 billion tokens using the same data mixture as core pre-training.
After pruning to the 3B target size, the model is trained with knowledge distillation for the full 6.3 trillion token core pre-training run. The distillation loss replaces the standard cross-entropy target labels with a convex combination of the true one-hot labels and the teacher model's top-1 predictions, with weight 0.9 assigned to the teacher's labels. The teacher model is AFM-server or a larger model in the training pipeline.
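A sketch of the mixed target described above (assumptions: per-token targets and a fixed 0.9 teacher weight; the exact loss plumbing is not public):

```swift
// Builds the distillation target for one token position: 0.9 weight on the
// teacher's top-1 prediction, 0.1 on the ground-truth label. Cross-entropy
// is then computed against this distribution instead of the one-hot label.
func distillationTarget(trueToken: Int, teacherTop1: Int, vocabSize: Int) -> [Float] {
    var target = [Float](repeating: 0, count: vocabSize)
    target[teacherTop1] += 0.9
    target[trueToken] += 0.1   // collapses to 1.0 when teacher and label agree
    return target
}
```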
This combination yields measurable gains: initializing from the pruned model improves benchmark results by 0 to 2% over random initialization at the same parameter count. Adding distillation boosts MMLU by approximately 5 percentage points and GSM8K (math reasoning) by approximately 3 percentage points. Distillation was not found to be helpful during the continued pre-training phase, so that phase uses the same recipe as AFM-server.
Both models go through supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF). Learning rates are 5e-6 for the server model and 2e-5 for the on-device model, with dropout of 0.1.
Apple developed two algorithmic innovations for post-training:
Iterative Teaching Committee (iTeC): A multi-round data collection strategy in which a committee of models (not just the single model being trained) generates candidate responses. This committee approach produces higher-quality synthetic training data than self-improvement methods that rely on a single model. Human annotators and automated judges then filter and rank the committee outputs.
Mirror Descent with Leave-One-Out advantage estimation (MDLOO): An RLHF algorithm that uses mirror descent policy optimization with a leave-one-out baseline for variance reduction in the advantage estimate. This is more stable than standard PPO-based methods in Apple's training setup. The reward model uses soft labels that encode the intensity of human preferences rather than hard binary win/loss labels, and uses single-sided grading as a regularization technique.
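A sketch of the leave-one-out baseline (this is the standard LOO estimator; Apple's exact formulation inside MDLOO may differ):

```swift
// For K responses sampled from the same prompt, each response's advantage
// is its reward minus the mean reward of the other K−1 responses. This
// centers the advantages without training a separate value network.
func leaveOneOutAdvantages(rewards: [Double]) -> [Double] {
    precondition(rewards.count >= 2, "need at least two samples")
    let total = rewards.reduce(0, +)
    let k = Double(rewards.count)
    return rewards.map { r in r - (total - r) / (k - 1) }
}

// Example: rewards [1.0, 0.0, 0.5] → advantages [0.75, -0.75, 0.0]
```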
Synthetic data generation is used across several domains, including mathematics, tool use, and coding.
The 2025 models added RLHF training data in all 16 supported languages. In human evaluations, the multilingual RLHF phase yielded a 16:9 win-to-loss ratio over SFT-only models.
AFM-on-device uses a LoRA-style adapter system to specialize the base model for individual features without retraining the full model. Apple refers to these as task-specific adapters.
Adapters are small neural network modules inserted into specific layers of the frozen pre-trained model. They modify the linear projection matrices in the self-attention layers and the fully connected layers in the feed-forward networks. Adapter weights use 16-bit (float16) representation. A rank-16 adapter on the roughly 3-billion-parameter model occupies tens of megabytes of storage.
Supported ranks are 8, 16, and 32. Adapters are initialized from accuracy-recovery adapters (the same components used to compensate for quantization loss) to provide a warm start.
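The underlying LoRA math, sketched for one projection matrix (the α/rank scaling is the standard LoRA convention, assumed here rather than confirmed by Apple):

```swift
// A frozen projection W (dOut × dIn) is adapted as W + (alpha/rank)·B·A,
// where A is rank × dIn and B is dOut × rank. Only A and B are trained,
// so a rank-16 adapter adds just rank·(dIn + dOut) parameters per matrix.
struct LoRADelta {
    let rank: Int
    let alpha: Float
    let a: [[Float]]   // rank × dIn
    let b: [[Float]]   // dOut × rank

    // Applies the low-rank update to an input vector x: (alpha/rank)·B·(A·x)
    func apply(to x: [Float]) -> [Float] {
        let scale = alpha / Float(rank)
        let ax = a.map { row in zip(row, x).reduce(0) { $0 + $1.0 * $1.1 } }
        return b.map { row in scale * zip(row, ax).reduce(0) { $0 + $1.0 * $1.1 } }
    }
}
```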
At runtime, adapters are dynamically loaded, temporarily cached in memory, and swapped as the user switches between features. A writing task loads the Writing Tools adapter; a summarization task loads the summarization adapter; a reply suggestion task loads the Smart Reply adapter. This design keeps the base model in a fixed location in memory while feature-specific behavior is provided by swappable modules.
The 2025 Foundation Models Framework for developers allows third-party developers to train their own rank-32 adapters using Apple's Python adapter training toolkit, enabling custom on-device capabilities for specialized apps.
When a request cannot be handled on-device due to complexity or context length, Apple Intelligence routes it to AFM-server via Private Cloud Compute (PCC). PCC is a cloud inference infrastructure designed from the ground up with privacy as a first-order constraint rather than an afterthought.
PCC runs on custom Apple Silicon server nodes. These servers use the same hardware security technologies as iPhone: a Secure Enclave, Secure Boot, and hardware-rooted trust chains. The operating system is a hardened subset of iOS and macOS, stripped of components not needed for LLM inference, minimizing the attack surface.
PCC enforces several cryptographic guarantees:
Stateless processing: User data exists on PCC nodes only for the duration of a request. The Secure Enclave randomizes encryption keys on every reboot without persisting them. After a response is returned, no user data remains in any form.
No privileged runtime access: PCC nodes have no remote shell, no interactive debugging capability, and no general-purpose logging. Only pre-specified, structured, audited logs and metrics can leave the node.
End-to-end encryption to specific nodes: User devices encrypt requests directly to the public keys of specific validated PCC nodes, not to a general load balancer. The load balancer can route requests but cannot decrypt them. This prevents a compromised load balancer from reading user data.
Target diffusion: Multiple mechanisms prevent an attacker from reliably routing a specific user's requests to a compromised node. Request metadata excludes personally identifiable information. Authorization uses RSA Blind Signatures that grant access without identifying the user. An OHTTP relay operated by a third party hides the source device's IP address.
Code signing: All software running on PCC nodes must be part of a trust cache signed by Apple and approved for that specific node, loaded by the Secure Enclave in a way that cannot be modified at runtime.
Apple commits to publishing cryptographic measurements of all code running on PCC in an append-only, tamper-proof transparency log. Software images become publicly available within 90 days of deployment, enabling independent security researchers to verify that the deployed software matches the published measurements. Apple provides a Virtual Research Environment (VRE) for Mac that allows researchers to simulate PCC node behavior locally. Security-critical source code, sepOS firmware, and the iBoot bootloader are published in plaintext.
This transparency architecture is notable among cloud AI providers. Most cloud inference systems offer privacy policies but not cryptographic verifiability. Apple's design allows anyone to confirm that Apple itself cannot access user requests, not merely by trusting Apple's claims but by inspecting the software.
Apple released the Foundation Models Framework for developers at WWDC 2025 in June of that year. The framework ships with iOS 26, iPadOS 26, macOS Tahoe, and visionOS 26. It gives third-party developers access to AFM-on-device through a Swift API.
Guided generation: Developers annotate Swift types with the @Generable macro. The framework uses constrained decoding to guarantee that model output conforms to the specified type schema. A struct with an integer field and a string field will always produce valid integers and strings, not malformed JSON or hallucinated field names. This removes the fragile parsing step that typically accompanies LLM-generated structured data.
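A minimal example of guided generation, following the pattern in Apple's documentation (the type, field names, and guide descriptions here are illustrative):

```swift
import FoundationModels

// The @Generable macro derives a schema the framework uses for constrained
// decoding, so the model's output always deserializes into this type.
@Generable
struct TripSuggestion {
    @Guide(description: "Name of the destination city")
    var city: String
    @Guide(description: "Estimated trip length in days")
    var days: Int
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Suggest a short autumn getaway from San Francisco.",
    generating: TripSuggestion.self
)
// response.content is a TripSuggestion — no JSON parsing, no malformed fields.
print(response.content.city, response.content.days)
```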
Tool calling: The model can invoke developer-defined tools as callbacks during generation. Both parallel tool invocation (where multiple tools can be called simultaneously) and serial chains (where one tool's output informs the next call) are handled automatically by the framework.
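A sketch of a developer-defined tool, based on the protocol shape Apple demonstrated at WWDC 2025 (the weather lookup itself is hypothetical):

```swift
import FoundationModels

struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Retrieve current weather conditions for a city"

    @Generable
    struct Arguments {
        @Guide(description: "The city to look up")
        var city: String
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        // A real app would query a weather service here.
        ToolOutput("Weather in \(arguments.city): 18°C, light rain")
    }
}

// The session decides when to invoke the tool during generation.
let session = LanguageModelSession(tools: [WeatherTool()])
let answer = try await session.respond(to: "Do I need an umbrella in Tokyo today?")
```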
Stateful sessions: Multi-turn conversations maintain context across turns through session objects. Developers do not manage conversation history manually.
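A brief illustration of session statefulness (prompts are illustrative):

```swift
import FoundationModels

// The session object carries the transcript, so a follow-up prompt can
// refer back to earlier turns without the developer re-sending history.
let session = LanguageModelSession(instructions: "You are a concise travel assistant.")
let first = try await session.respond(to: "Suggest a weekend trip from Boston.")
let followUp = try await session.respond(to: "What should I pack for it?")
// "it" resolves against the destination suggested in the first turn.
```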
Custom adapters: Developers can train rank-32 LoRA adapters using Apple's Python toolkit and package them for distribution with their apps. These adapters load and unload at runtime the same way Apple's own task adapters do.
Offline operation: Because inference runs entirely on-device, apps using the framework work without a network connection, have no per-query cost, and do not require API keys.
As few as three lines of Swift code are needed for basic text generation. The framework integrates with Swift concurrency (async/await) and Combine.
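The minimal case looks roughly like this (per Apple's documented pattern; exact signatures may vary across OS releases):

```swift
import FoundationModels

// Basic on-device text generation: no API key, no network, no per-query cost.
let session = LanguageModelSession()
let response = try await session.respond(to: "Write a one-sentence summary of mixture-of-experts models.")
print(response.content)
```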
In addition to the base model access, Apple provides a set of system-level adapters for common tasks:
| Adapter | Purpose |
|---|---|
| Summarization | Condense text to key points |
| Writing Tools | Rewrite, proofread, adjust tone |
| Smart Reply | Suggest message responses |
| Content tagging | Classify and tag content |
| Extraction | Pull structured data from unstructured text |
Third-party apps using these adapters benefit from Apple's own fine-tuning data without needing to train their own models.
By October 2025, dozens of apps in the App Store used the Foundation Models framework.
The following benchmarks come from Apple's 2024 technical report and 2025 tech report. Comparisons are against models that were publicly available at the time of each publication.
| Model | Params | MMLU (5-shot) | Notes |
|---|---|---|---|
| AFM-on-device | 3B | 61.4% | Apple |
| Phi-3-mini | 3.8B | 68.8% | Microsoft |
| Mistral-7B | 7B | ~64% | Mistral AI |
| Gemma-7B | 7B | ~64% | Google |
| Llama-3-8B | 8B | ~66% | Meta |
Despite scoring lower than Phi-3-mini on MMLU in the raw pre-training evaluation, Apple's post-training human evaluations showed AFM-on-device preferred over Phi-3-mini 47.7% of the time vs 25% for Phi-3-mini (with the remainder tied), suggesting that task-specific alignment and adapter fine-tuning close much of the gap on practical use cases.
| Model | Params | MMLU vs AFM-on-device | MGSM vs AFM-on-device |
|---|---|---|---|
| Qwen-2.5-3B | 3B | Lower | — |
| Gemma-3-4B | 4B | Lower | — |
| Gemma-3n-E4B | 4B | Lower | Slightly higher |
The 2025 on-device model outperforms Qwen-2.5-3B, Gemma-3-4B, and Gemma-3n-E4B on MMLU and multilingual benchmarks (MMMLU), though Gemma-3n-E4B edges it slightly on MGSM (multilingual math reasoning).
| Model | MMLU | Human eval win rate vs AFM-server |
|---|---|---|
| AFM-server | 75.4% | baseline |
| DBRX-Instruct | ~74% | AFM-server preferred |
| Mixtral-8x22B | ~77% | comparable |
| GPT-3.5 | ~70% | AFM-server preferred |
| Llama-3-70B | ~79% | comparable |
The server model achieved a win rate of over 50%, with 27.4% ties, against GPT-3.5 in human evaluation of writing and summarization tasks. On the Berkeley Function Calling Leaderboard for tool use, AFM-server achieved the best overall accuracy in the 2024 report, ahead of Gemini 1.5 Pro and GPT-4 at that evaluation date.
| Model | Comparison |
|---|---|
| Llama 4 Scout | AFM-server slightly behind |
| Qwen-3-235B | AFM-server behind |
| GPT-4o | AFM-server behind |
The 2025 server model is positioned as competitive with models of comparable total and active parameter counts but trails significantly larger models.
As noted above, Apple does not use private user data or user interactions in training. Web crawl data is collected by Applebot, with publishers able to opt out. Licensed data agreements with publishers cover specific content types. Synthetic data is generated by other models and filtered for quality. Personally identifiable information is removed from training corpora.
When processing happens on-device, no data leaves the user's device at all. The model weights, adapters, and KV cache all reside in device memory. This covers the large majority of Apple Intelligence features in everyday use.
When cloud inference is required, PCC provides the cryptographic and architectural guarantees described above. Apple's position is that the combination of on-device processing for the majority of requests and cryptographically verifiable stateless cloud processing for the remainder gives users stronger privacy guarantees than services that route everything through conventional cloud inference.
Apple's responsible AI taxonomy for the foundation models covers 12 primary safety categories with 51 subcategories. More than 10% of post-training data addresses adversarial prompts or safety-relevant scenarios. The models scan inputs and outputs to detect content that should not be processed or surfaced.
Violation rates on adversarial benchmarks are lower than those of open-source and commercial models of comparable size. Apple reported that the summarization adapter did not amplify sensitive content in over 99% of targeted adversarial test cases.
Code execution capabilities are sandboxed in a locked-down Firecracker micro-VM environment. Red-teaming is conducted by both internal teams and external researchers under voluntary participation agreements.
Locale-specific safety evaluation covers culture-specific bias and sensitive content across all supported languages, with human red-teaming by annotators fluent in each language.
| Aspect | AFM-on-device | Phi-3 mini | Gemma 3 4B | Llama 3.2 3B |
|---|---|---|---|---|
| Parameters | ~3B | 3.8B | 4B | 3B |
| Target platform | Apple Silicon | Mobile/cloud | Mobile/cloud | Mobile/cloud |
| Quantization | 2-4 bit QAT | INT4 | INT4 | INT4 |
| Context window | 32K (extended) | 4K | 128K | 128K |
| Developer API | Foundation Models (Swift) | Open weights | Open weights | Open weights |
| Privacy model | On-device + PCC | User-managed | User-managed | User-managed |
| Task adapters | Dynamic LoRA swap | Fine-tune manually | Fine-tune manually | Fine-tune manually |
| Offline operation | Yes | Yes (self-hosted) | Yes (self-hosted) | Yes (self-hosted) |
| Cost to developers | Free (on-device) | Varies by hosting | Varies by hosting | Varies by hosting |
The primary distinction from open-weight models like Phi-3, Gemma 3, or Llama 3.2 is that AFM-on-device is not available as open weights. Developers access it only through Apple's API. This means developers cannot inspect the full model weights, fine-tune the base model (only adapters), or deploy it outside Apple hardware. In return, they get a pre-integrated, privacy-preserving, zero-cost inference runtime with guaranteed hardware acceleration on all supported Apple devices.
AFM-on-device and AFM-server collectively power Apple Intelligence features from iOS 18.1 through iOS 26, including Writing Tools, Smart Reply, notification summaries, and Siri's extended language-understanding capabilities.
AFM-on-device is constrained by the memory and compute budget of mobile hardware. At roughly 3 billion parameters and 3.7 bits per weight, the model weights occupy approximately 1.4 GB of device storage (3 × 10⁹ weights × 3.7 bits ÷ 8 bits per byte ≈ 1.4 × 10⁹ bytes). This leaves limited headroom for longer contexts or more capable architectures without hardware advances.
Capability gaps compared to frontier models are real. The server model lags behind Qwen-3-235B and GPT-4o on general benchmarks. The on-device model, while competitive for its size class, cannot perform complex multi-step reasoning that larger models handle more reliably.
The model is not accessible as open weights. Developers cannot run it outside Apple hardware, audit the weights, or fine-tune beyond the adapter interface. This limits research use and independent safety auditing of the base model.
Apple Intelligence is not available on devices with A16 Bionic or older chips, excluding a large portion of the installed iPhone base. In mainland China, Apple Intelligence had not launched as of early 2026 due to regulatory requirements.
The on-device context window in production is limited to 4,096 tokens for most features, with extended 32,768-token contexts available for specific use cases. Inputs requiring longer context are routed to AFM-server, which introduces latency and requires network connectivity.