Pixtral
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 6,909 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 6,909 words
Add missing citations, update stale details, or suggest a clearer explanation.
Pixtral is a family of multimodal vision-language models developed by Mistral AI, a French AI company founded in April 2023. The family consists of two models: Pixtral 12B, announced on September 11, 2024, as Mistral's first publicly available vision model, and Pixtral Large, a 124-billion-parameter model released on November 18, 2024.[1][2][3] Both models combine a custom vision encoder with a text decoder to process images and text within a single 128,000-token context window. Pixtral 12B is open-source under the Apache 2.0 license; Pixtral Large is released under the Mistral Research License (MRL) with a separate commercial license for production use.[4][5] Pixtral introduced several architectural novelties, including a dedicated vision encoder trained from scratch, 2D rotary positional encoding (RoPE-2D), and special break tokens that allow the model to process images at their native resolution and aspect ratio without padding or cropping.[1] Both models were subsequently deprecated on Mistral's API in late 2025 and early 2026 and replaced by the Ministral 3 14B and Mistral Large 3 families.[6]
Mistral AI was founded in Paris in April 2023 by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, all veterans of Google DeepMind and Meta's research teams. The company built an early reputation for releasing compact, high-performing text models including Mistral 7B and Mistral Large, positioning itself as Europe's most prominent open AI laboratory and a direct competitor to OpenAI and Anthropic. By mid-2024, Mistral had raised roughly EUR 1 billion in funding and was valued at approximately EUR 5.8 billion. In June 2024, the company closed a USD 645 million Series B round led by General Catalyst that valued it at about USD 6 billion, making it minority-owned in part by Microsoft following a separate earlier investment.[7]
Despite strong text model performance, Mistral had no multimodal offering through early 2024 while GPT-4o, Gemini 1.5, and Claude 3 had all added vision capabilities to their flagship products. The absence of image understanding put Mistral at a disadvantage for enterprise use cases involving documents, charts, and visual data. Building a vision system required more than bolting a pre-trained image encoder onto an existing decoder; Mistral chose to train a new vision encoder from scratch, giving them more control over how the encoder represents images and making it possible to support variable-resolution inputs natively.[1]
Arthur Mensch, Mistral's CEO, had prior multimodal research experience: before co-founding Mistral he worked at DeepMind Paris on projects including Flamingo, one of the early large vision-language models. This background informed Pixtral's design philosophy, which prioritizes high-resolution document understanding and the ability to reason over multiple images in a single conversation.
The choice of name reflects a portmanteau common in vision-language work: "Pix" from pixel and "tral" from Mistral. Internally the encoder is referred to as Pixtral-ViT, signalling its lineage from the Vision Transformer family but with several deliberate architectural departures.[1]
Pixtral 12B was first revealed on September 11, 2024, when Mistral released the model weights directly via a torrent link on X (formerly Twitter), several days before the official blog announcement.[7][8] The torrent contained the raw weights plus a params.json describing the architecture, and the bundle was approximately 25 GB.[8] Mistral then published a detailed launch post on September 17, 2024, and made the model available on Hugging Face under the model identifier mistralai/Pixtral-12B-2409.[2][4] It was simultaneously made available through Mistral's API platform, La Plateforme, and through the Le Chat web interface. The launch described it as Mistral's first natively multimodal model, meaning vision was built in from training rather than added as a post-hoc adapter.
The model has 12 billion parameters in its multimodal decoder and 400 million parameters in its dedicated vision encoder, giving a total parameter count of approximately 12.4 billion.[2][4] The decoder is built on the same architectural foundation as Mistral Nemo 12B, with 40 transformer layers, a hidden dimension of 5,120, 32 attention heads, and 8 key-value heads for grouped-query attention.[1] The context window spans 131,072 tokens, which at the model's patch granularity of 16x16 pixels is large enough to accommodate multiple images alongside long text in a single conversation.
The original release post described Pixtral 12B as able to understand both natural images and documents, a distinction that mattered because many multimodal models at the time were stronger on photographs and weaker on charts, tables, and PDFs. On DocVQA, which measures a model's ability to answer questions about document images, Pixtral 12B scored 90.7 on the ANLS metric, placing it above GPT-4 Turbo and matching or beating several larger open models. On ChartQA it scored 81.8, also strong for its size class.[1][2]
The Apache 2.0 license was notable given the model's capabilities. At the time of release, most competitive multimodal models were proprietary. The license permits unrestricted commercial use, fine-tuning, and redistribution, which drew attention from developers who wanted a capable vision model they could run locally or fine-tune for specialized applications.[4]
Pixtral 12B's release came roughly two months after Mistral Nemo 12B (July 2024), and Mistral explicitly positioned the multimodal model as a "drop-in replacement" for Mistral Nemo, allowing applications that already used the text-only model to add vision with minimal refactoring.[1] Both share the same Tekken tokenizer and the same decoder hidden dimensions; the multimodal version augments the input pipeline rather than restructuring the language stack.
Pixtral Large was announced on November 18, 2024, roughly two months after the 12B release.[3][5] The model has 124 billion parameters in total, comprising a 123-billion-parameter multimodal decoder based on Mistral Large 2 (specifically Mistral-Large-Instruct-2407) and a 1-billion-parameter vision encoder.[5] Mistral released the weights under a dual licensing structure: the Mistral Research License (MRL) for non-commercial and research uses, with a separate commercial license required for production deployments.[3][5]
The scale difference between the two Pixtral models is significant. Pixtral Large requires over 300 GB of GPU memory for full-precision inference, placing it firmly in the data-center tier rather than the consumer GPU tier.[5] Mistral recommends vLLM (version 0.6.4.post1 or higher) with tensor parallelism across at least 8 GPUs for practical deployment.[5] Despite this, the model is available via API at La Plateforme under the alias pixtral-large-latest, where developers can access it without managing infrastructure.[3]
On benchmark evaluations, Pixtral Large outperformed GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet on several multimodal tasks. On MathVista, which tests mathematical reasoning over visual inputs such as geometry diagrams and data plots, Pixtral Large scored 69.4 percent, the highest of any model in the comparison set at the time.[3][5] On DocVQA it scored 93.3 ANLS, and on MM-MT-Bench, Mistral's own evaluation suite for real-world multimodal instruction following, it scored 7.4 out of 10.[3][5] The model was also ranked as the best open-weights model on the LMSys Vision Arena leaderboard at launch, outpacing the nearest competitor by approximately 50 ELO points.[3]
Pixtral Large maintains the text capabilities of its underlying Mistral Large 2 decoder. Unlike some multimodal models that see text performance decline after vision training, Pixtral Large's benchmark scores on text-only tasks remained close to Mistral Large 2's baseline, which Mistral attributed to their training procedure and the careful separation of the vision encoder from the core decoder.[3]
The Pixtral Large launch was bundled with a major upgrade of Le Chat, Mistral's consumer-facing web assistant. The simultaneous Le Chat updates added web search with citations, an interactive canvas pane for collaborative document drafting, agentic workflows for tasks such as invoice processing, and image generation through a partnership with Black Forest Labs' FLUX.1 Pro model.[9][10] Mistral framed the combined product as a direct competitor to ChatGPT, with Pixtral Large supplying the vision and reasoning back end.
The Pixtral vision encoder, called Pixtral-ViT, was trained from scratch rather than adapted from an existing encoder such as CLIP or SigLIP.[1] In Pixtral 12B it has 400 million parameters; in Pixtral Large it has 1 billion parameters.[2][5] Both versions process images as sequences of 16x16 pixel patches.[1]
The encoder for Pixtral 12B has 24 transformer layers with 1,024-dimensional hidden states and 16 attention heads.[1][11] Its 64-dimensional head size and 4,096-token internal context length give it enough capacity to represent high-resolution inputs. The MLP intermediate hidden dimension is 4,096 and the patch size is 16, identical to the original Vision Transformer ViT-L configuration on these axes.[11] The encoder's output, a sequence of patch embeddings, is projected into the decoder's embedding space through a two-layer fully connected network with a GELU activation and an intermediate hidden size equal to the encoder's hidden dimension.[1] Pixtral Large scales these dimensions roughly proportionally to reach 1 billion encoder parameters, while preserving the same patching, positional encoding, and projector design.[5]
A standard Vision Transformer processes images at a fixed resolution by resizing them to a predetermined size (commonly 224x224 or 448x448) and sometimes adding padding. This resizing degrades fine detail in high-resolution documents and creates distortions for images whose natural aspect ratio differs from the target. Pixtral-ViT was designed to avoid both problems.
Standard vision transformers use either learned absolute position embeddings or 1D positional encodings that assign a single index to each patch in raster order. These break down when images have different sizes or aspect ratios because the meaning of any given position index depends on the image dimensions.
Pixtral-ViT replaces these with RoPE-2D, a two-dimensional extension of rotary position embedding.[1] Each patch at grid position (i, j) receives a position encoding computed from both its row index i and its column index j independently. The Pixtral paper writes the operation as RoPE-2D(x, theta) = M(i,j; theta) x, where the rotation matrix M applies one set of rotation pairs parameterised by i to half of the embedding dimensions and another set parameterised by j to the other half.[1] The rotary mechanism ensures that the dot product between any two patch embeddings depends only on their relative position (delta-i, delta-j) rather than their absolute coordinates. This property, called the relative position property, allows the encoder to generalize to images of arbitrary size at inference time, including sizes never seen during training.[1]
The decision to use RoPE-2D rather than learned 2D embeddings tied directly to Pixtral's goal of variable-resolution support. Learned absolute embeddings require a fixed maximum patch grid; RoPE-2D extrapolates naturally because the rotation matrices are deterministic functions of position rather than learned parameters that must be sized in advance.[1]
Two special tokens, [IMG BREAK] and [IMG END], are inserted into the image token sequence.[1] An [IMG BREAK] token is placed at the end of each row of patches, giving the decoder a signal about where row boundaries fall. An [IMG END] token marks the end of the entire image. These tokens allow the model to distinguish between two images that contain the same total number of patches but have different shapes, for example a 16x4 image versus an 8x8 image with equivalent patch counts. The image token signalling is implemented in the mistral_common library starting with version 1.4.0, which also added the [IMG] token that flags an image's position in the text sequence.[8]
When multiple images are processed in a single forward pass, the patch sequences are concatenated along the sequence dimension and a block-diagonal attention mask is applied to the vision encoder. This prevents patches from one image from attending to patches from another image during encoding, while still allowing the full batch to be computed in a single kernel call for efficiency.[1] The Pixtral paper describes this as "sequence packing" and observes that it yields throughput improvements over the more common approach of padding each image to a fixed shape and running them as a batch.[1]
The decoder in Pixtral 12B has 40 transformer layers with 32 attention heads, 8 key-value heads for grouped-query attention, a 5,120-dimensional hidden size, and a 128-dimensional head size.[1] In Pixtral Large the decoder matches Mistral Large 2's architecture, which has roughly 123 billion parameters.[5] Both decoders use sliding-window attention and a large context window of 131,072 tokens.
The vision-language projection between the encoder and decoder is a two-layer MLP with GELU activation. Its input dimension matches the encoder's hidden size and its output dimension matches the decoder's embedding size. There is no cross-attention mechanism between vision and language streams; instead, image tokens are treated as a prefix in the input sequence, and the decoder attends to them through standard causal self-attention.[1] This decoder-only design contrasts with the cross-attention approach used by Flamingo, which the Pixtral paper notes was a deliberate choice to simplify training, allow the decoder to retain its text capabilities, and let the same parameters serve both vision-grounded and text-only inputs.[1]
The tokenizer is Tekken, the same byte-pair encoding scheme used by Mistral Nemo, derived from OpenAI's tiktoken and trained over more than one hundred languages and source code corpora.[8] Tekken is registered with a vocabulary size of 131,072 tokens. Pixtral 12B reserves a handful of these slots for the multimodal control tokens ([IMG], [IMG BREAK], [IMG END]) and standard chat-template control tokens.[8]
A central design goal of the Pixtral family is the ability to process images at their native resolution and aspect ratio without scaling or padding.[1] In practice, a user can submit a 1920x1080 screenshot, a 400x600 receipt photo, and a 3000x2000 landscape photograph in the same conversation, and each will be tokenized according to its actual dimensions.
The number of tokens an image consumes scales with the number of 16x16 patches it contains. A 1024x1024 image produces 4,096 patch tokens; a 256x256 thumbnail produces only 256. This means the model naturally allocates more processing capacity to larger, more detailed images and less to small thumbnails, which is the correct behavior for document-heavy use cases where image resolution carries meaning.
For images that would produce an extremely large number of patches, users can resize before submission to control token consumption. The model's 128K context window imposes a practical ceiling: at 16x16 patches, a context window of 128K tokens can accommodate roughly 30 high-resolution images if only images are present, or fewer if long text is interleaved.[5]
The variable-resolution capability was listed by Mistral as one of Pixtral's primary differentiators relative to models like LLaVA and earlier multimodal models that required fixed input sizes.[1] In the arXiv paper, the authors provided examples of how fixed-resolution models lose fine print when a document is downscaled to fit a 336x336 or 448x448 input size, while Pixtral preserves it by keeping the image at its original dimensions and consuming more tokens proportionally. The tradeoff is that token consumption is not fixed: a large image costs more context window space than a small one, and users of very high-resolution images must plan accordingly.
Mistral has not published a fully detailed description of Pixtral's training data or training procedure. The arXiv paper for Pixtral 12B notes that the model was trained on interleaved image-and-text data and that the training included both pre-training and instruction fine-tuning phases.[1] Pixtral 12B is described as a "natively multimodal" model in the sense that the decoder saw image tokens during pre-training rather than only during fine-tuning of a previously text-only checkpoint.[2]
For the vision encoder, Pixtral-ViT was pre-trained on image data before being integrated with the language decoder. The encoder was trained from scratch rather than initialized from a checkpoint, which Mistral said allowed them to optimize specifically for variable-resolution inputs without inheriting constraints from encoders designed for fixed-size inputs.[1]
The decoder for Pixtral 12B is built on Mistral Nemo 12B, a 12-billion-parameter multilingual text model that Mistral developed in collaboration with NVIDIA and released in July 2024.[7] The decoder for Pixtral Large is built on Mistral Large 2 (Mistral-Large-Instruct-2407), released in late July 2024. In both cases, the text decoder was adapted for multimodal input during training rather than being frozen. Mistral has emphasized that the text performance of both models remained close to their respective base decoders after multimodal training, which is not always the case when vision is added to a pre-trained language model through naive fine-tuning.[1]
The instruction-tuned variants, available under the -Instruct designation, were fine-tuned on a dataset that Mistral describes as covering natural image understanding, document comprehension, chart and figure interpretation, and complex multi-turn visual question answering.[1]
Mistral did not publish specific details about dataset size, data sources, or compute used for training. The arXiv paper noted that Pixtral 12B used the same tokenizer as Mistral Nemo, called Tekken, which encodes text and image tokens in a unified vocabulary.[1]
A noteworthy section of the Pixtral 12B paper is dedicated to evaluation methodology itself, which the authors argue is often a hidden confounder in multimodal benchmark reporting. The paper introduces what it calls "explicit" prompts that specify the expected output format directly to the model, together with three progressively more lenient parsing levels: a baseline exact-match comparator, Level 1 (alternate phrasing tolerated), Level 2 (markdown stripping), and Level 3 (substring matching).[1] Mistral showed that small changes in evaluation setup, such as how prompts are phrased or how outputs are parsed, can dramatically shift reported scores for some model families.
The reported numbers use chain-of-thought evaluation for MMMU and MathVista, which tends to produce higher scores than direct-answer evaluation.[1] For DocVQA, the ANLS (Average Normalized Levenshtein Similarity) metric is used, which rewards partial credit for near-correct string matches and penalizes near-misses less severely than exact match. Readers comparing these scores with numbers from other publications should check whether the same evaluation protocol was used.
The Pixtral 12B paper, posted to arXiv on October 9, 2024, lists 39 authors from Mistral AI's Science team, listed alphabetically by surname rather than by contribution.[1] The author list includes Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amelie Heliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothee Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick von Platen, Nikhil Raghuraman, Baptiste Roziere, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang, and Sophia Yang.[1]
Pixtral 12B is released under the Apache 2.0 open-source license.[4] This permits anyone to use, modify, and redistribute the model commercially without restriction, subject only to attribution requirements. The weights are hosted on Hugging Face under mistralai/Pixtral-12B-2409 and can be downloaded freely.
Pixtral Large is released under a dual-licensing structure. The Mistral Research License (MRL) covers non-commercial and research uses. Commercial deployments require a separate commercial license from Mistral. The weights are available on Hugging Face under mistralai/Pixtral-Large-Instruct-2411 but carry the MRL by default.[5]
This licensing split mirrors Mistral's broader model strategy: smaller models receive maximally open licenses to build developer mindshare, while flagship-scale models carry commercial terms that generate revenue from enterprise users. The MRL itself is a custom license drafted by Mistral that grants free use for research and education but explicitly prohibits commercial deployment; the commercial license is negotiated on a per-customer basis.[5][12]
The following table shows Pixtral 12B's performance on standard multimodal benchmarks compared with models in a similar parameter range, as reported in the Pixtral 12B arXiv paper and the Mistral release post.[1][2]
| Benchmark | Pixtral 12B | Qwen2-VL 7B | LLaVA-OV 7B | Phi-3.5 Vision | Claude 3 Haiku | Gemini 1.5 Flash 8B |
|---|---|---|---|---|---|---|
| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 38.3 | 50.4 | 50.7 |
| MathVista (CoT) | 58.0 | 54.4 | 36.1 | 39.3 | 44.8 | 56.9 |
| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 67.7 | 69.6 | 78.0 |
| DocVQA (ANLS) | 90.7 | 94.5 | 90.5 | 74.4 | 74.6 | 79.5 |
| VQAv2 | 78.6 | 75.9 | 78.3 | 56.1 | 68.4 | 65.5 |
| MM-MT-Bench | 6.05 | -- | -- | -- | -- | -- |
Among models in the 7B-12B parameter range, Pixtral 12B was the strongest across most benchmarks at the time of release. It was particularly strong on ChartQA and DocVQA, reflecting the emphasis on document understanding in its training.
For context against larger closed models:[1][2]
| Benchmark | Pixtral 12B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMMU (CoT) | 52.5 | 68.6 | 68.0 |
| MathVista (CoT) | 58.0 | 64.6 | 64.4 |
| ChartQA (CoT) | 81.8 | 85.1 | 87.6 |
| DocVQA (ANLS) | 90.7 | 88.9 | 90.3 |
| VQAv2 | 78.6 | 77.8 | 70.7 |
Pixtral 12B's DocVQA score of 90.7 exceeded both GPT-4o (88.9) and Claude 3.5 Sonnet (90.3), a result Mistral highlighted in the release announcement. On MMMU and MathVista, the larger closed models retained a meaningful advantage.
Pixtral 12B also retained competitive text-only performance: on the Hugging Face model card, Mistral reports 69.2 on MMLU (5-shot), 48.1 on MATH (pass@1), 72.0 on HumanEval (pass@1), 7.68 on text MT-Bench, and 61.3 on IF-Eval, which are within a small margin of Mistral Nemo 12B's text scores.[4]
The following table shows Pixtral Large's performance, as reported on the Hugging Face model card and Mistral's release post.[3][5]
| Benchmark | Pixtral Large | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro | Llama 3.2 90B |
|---|---|---|---|---|---|
| MMMU (CoT) | 64.0 | 68.6 | 68.4 | 66.3 | 53.7 |
| MathVista (CoT) | 69.4 | 65.4 | 67.1 | 67.8 | 49.1 |
| ChartQA (CoT) | 88.1 | 85.2 | 89.1 | 83.8 | 70.8 |
| DocVQA (ANLS) | 93.3 | 88.5 | 88.6 | 92.3 | 85.7 |
| VQAv2 | 80.9 | 76.4 | 69.5 | 70.6 | 67.0 |
| AI2D (BBox) | 93.8 | 93.2 | 76.9 | 94.6 | -- |
| MM-MT-Bench | 7.4 | 6.7 | 7.3 | 6.8 | 5.5 |
Pixtral Large scored highest on MathVista (69.4 percent) among the models listed. On DocVQA it scored 93.3, above all listed competitors. On MM-MT-Bench, its score of 7.4 exceeded GPT-4o (6.7), Gemini 1.5 Pro (6.8), and Claude 3.5 Sonnet (7.3). Claude 3.5 Sonnet retained a narrow edge on ChartQA (89.1 vs 88.1), and GPT-4o and Gemini 1.5 Pro retained leads on MMMU (68.6 and 66.3 vs 64.0).
The model also outperformed Llama 3.2 90B Vision across every benchmark in Mistral's comparison set, with the largest gaps on VQAv2 (80.9 vs 67.0) and MathVista (69.4 vs 49.1).[3][5]
Alongside the Pixtral 12B release, Mistral introduced MM-MT-Bench, a multi-turn multimodal instruction-following benchmark designed to evaluate vision-language models in practical conversational scenarios.[1] The benchmark contains 92 multi-turn conversations spanning five image categories: charts (21 conversations), tables (19), PDF pages (24), diagrams (20), and miscellaneous (8).[1] It uses GPT-4o as a judge to score responses on a 1-to-10 scale for correctness and completeness. Mistral argued that existing benchmarks like MMMU and MathVista emphasize single-turn academic-style questions and undervalue conversational quality, justifying the new benchmark. The paper reports a Pearson correlation of 0.91 between MM-MT-Bench scores and LMSys Vision Arena ELO ratings, which the authors interpret as evidence that MM-MT-Bench captures the kind of capability that user-facing arena votes reward.[1]
MM-MT-Bench was released alongside Pixtral 12B as an open evaluation suite and has since been used by several other research groups to evaluate vision-language models, in addition to Mistral's own subsequent releases.[1] A complementary benchmark, MM-IF-Eval, evaluates strict adherence to format instructions in multimodal contexts; Pixtral 12B scored 52.7 on that suite.[4]
On the LMSys Vision Arena leaderboard, Pixtral 12B was reported with an ELO rating of around 1076 at the time of release, placing it well above other models in its size class.[11] Pixtral Large topped the same leaderboard among open-weights vision models at its November 2024 launch, leading the nearest open competitor by approximately 50 ELO points.[3] Third-party evaluations have generally tracked Mistral's reported numbers within a few points; Artificial Analysis assigns Pixtral Large an aggregate Intelligence Index score of 14, ranking it about middle-of-pack among the open-weight vision models tracked there.[13]
Both Pixtral models were available through Mistral's first-party platforms during their active period: La Plateforme (the developer API), where Pixtral 12B was accessible as pixtral-12b-2409 and Pixtral Large as pixtral-large-2411 (also under the alias pixtral-large-latest), and the Le Chat consumer interface.[2][3] Pixtral Large was a core part of Le Chat's late-2024 product expansion, which added image generation, web search with citations, an interactive canvas, and agent-style automated workflows.[9][10]
The Apache 2.0 license on Pixtral 12B enabled rapid hosting by third-party inference providers. Pixtral 12B was available on OpenRouter, Together AI, Hyperbolic, Replicate, Amazon Bedrock Marketplace, and several other inference platforms.[14] On Amazon Bedrock Marketplace, Pixtral 12B was deployable on ml.g6.12xlarge instances and scalable up to 100 replicas per deployment.[14] The Pixtral Large weights, though gated by the MRL, were also hosted by some providers for research evaluation. Snowflake's Cortex AI integrated Pixtral Large for enterprise users in April 2025, exposing it through the COMPLETE function with built-in support for English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.[15]
Mistral Large 2 (and by extension Pixtral Large) became available through Google Cloud Vertex AI and Microsoft Azure AI Foundry as part of the broader Mistral partnership announcements in late 2024.[9]
Pixtral 12B can be run locally through several inference stacks. Mistral's own mistral-inference library supports the model, as does vLLM (with the mistral_common tokenizer dependency, version 1.4.4 or higher), Ollama, and the Hugging Face Transformers library.[4][16] The full-precision 12B model fits on a single high-end consumer GPU (24 GB) with quantization, while running Pixtral Large locally requires multiple data-center GPUs. Hugging Face Transformers integrated Pixtral via the existing LlavaForConditionalGeneration class, with contributions from Hugging Face maintainers amyeroberts and ArthurZ.[16]
Pixtral Large requires vLLM 0.6.4.post1 or newer and mistral_common 1.5.0 or newer, with the recommended configuration using --tensor-parallel-size 8, --config-format mistral, --load-format mistral, --tokenizer_mode mistral, and --limit_mm_per_prompt 'image=10' to bound the per-prompt image count.[5]
Pixtral 12B is available free for local deployment under the Apache 2.0 license. Via La Plateforme's API, pricing at launch was approximately USD 0.15 per million input tokens and USD 0.15 per million output tokens. Some provider comparisons have cited pricing as low as USD 0.10 per million tokens for both input and output on certain tiers, making it one of the more affordable multimodal API options available.
For cost comparison, GPT-4o was priced at USD 2.50 per million input tokens and USD 10.00 per million output tokens at the time, meaning Pixtral 12B via La Plateforme cost roughly 15-25x less per token for API access.
Pixtral Large via La Plateforme is priced at USD 2.00 per million input tokens and USD 6.00 per million output tokens.[13] Artificial Analysis computes a blended rate of about USD 2.40 per million tokens using its standard 7:2:1 cache/input/output assumption.[13] This places it roughly in the same tier as mid-range frontier models. The open weights can be self-hosted, which eliminates per-token API costs but requires infrastructure capable of running a 124B parameter model.
Independent benchmarking by Artificial Analysis recorded an output throughput of about 53 tokens per second and a time-to-first-token of about 1.67 seconds for Pixtral Large via La Plateforme, slightly slower than the median for comparable open-weight vision models at the time of measurement.[13]
Pixtral's variable-resolution architecture makes it well-suited for document processing tasks. Because it does not resize images before tokenization, text in PDFs, scanned forms, and financial statements is presented to the model at full resolution. The DocVQA benchmark scores for both models reflect this strength: Pixtral 12B scored 90.7 and Pixtral Large scored 93.3, both competitive with or better than larger closed-source models.
Practical applications include extracting structured data from invoices, reading handwritten fields in scanned forms, and answering questions over lengthy PDFs uploaded alongside textual queries. Amazon's reference Pixtral 12B notebook on Bedrock demonstrates handwriting recognition, vehicle damage assessment, and structured-data extraction from product images as canonical use cases.[14] In March 2025, Mistral released a dedicated document-understanding service, Mistral OCR, separate from the Pixtral chat models, that uses similar vision technology specifically tuned for high-throughput document parsing.[17] The OCR service evolved into Mistral OCR 3, released in December 2025 with pricing of USD 2 per 1,000 pages.
ChartQA measures a model's ability to answer questions about bar charts, line plots, pie charts, and other visualizations. Pixtral 12B's score of 81.8 and Pixtral Large's score of 88.1 indicate strong chart reasoning. Applications include summarizing financial dashboards, extracting trend lines from research graphs, and comparing figures across multiple charts in a single conversation.
MathVista tests models on geometry problems, algebra embedded in figures, and statistical reasoning over plots. Pixtral Large's 69.4 percent on MathVista was the highest in Mistral's benchmark comparison at launch. This makes it useful for educational technology applications, automated grading of visual math problems, and scientific literature analysis where equations and diagrams appear together.
Both Pixtral models support multiple images within a single context window. At 128K tokens, the practical limit is roughly 30 high-resolution images or more lower-resolution images. This makes the models suitable for comparative analysis tasks such as comparing product photos, reviewing sequences of UI screenshots, or analyzing a series of medical images with accompanying text. The Hugging Face Transformers Pixtral chat template demonstrates the canonical multi-image pattern with a single prompt referencing two distinct image inputs.[16]
Pixtral can read code screenshots, interpret software architecture diagrams, and describe UI mockups. Developers have used it to convert Figma wireframes into HTML/CSS descriptions, extract code from screenshots when copy-paste is unavailable, and explain data flow diagrams.
Mistral highlighted multilingual OCR and multilingual document understanding as use cases for Pixtral Large.[3] The underlying Mistral Large 2 decoder has strong multilingual text capabilities, and those carry over when processing non-English documents. This makes Pixtral Large suitable for reading scanned government documents in French or Spanish, extracting data from Chinese financial filings, or answering questions about Arabic-language medical records, tasks that require both vision and language components to handle non-Latin scripts correctly. Snowflake's Cortex AI integration explicitly lists English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish as supported languages for Pixtral Large.[15]
The combination of strong document understanding, a large context window, and API availability on La Plateforme makes Pixtral useful in enterprise document workflows. Common integration patterns include attaching Pixtral to PDF upload pipelines for automated data extraction, using it as a back-end for chatbots that need to answer questions about uploaded images, and routing visual questions through Pixtral in mixed text-image RAG (retrieval-augmented generation) systems. The Apache 2.0 license on Pixtral 12B means teams can fine-tune it on proprietary document formats without any licensing complications. Tutorials on LoRA-based fine-tuning of Pixtral 12B for domain-specific datasets, including medical images and satellite imagery, have been published by independent researchers and the Hugging Face community.[18]
Pixtral's design choices differ from those of its main 2024 competitors in several specific ways. The table below contrasts the headline architectural decisions reported by each model's authors.
| Model | Vision encoder origin | Native variable resolution | Multi-image support | License (largest variant) |
|---|---|---|---|---|
| Pixtral 12B / Pixtral Large | Trained from scratch (Pixtral-ViT) | Yes (RoPE-2D + break tokens) | Yes, up to ~30 images at 128K | Apache 2.0 / Mistral Research License |
| Llama 3.2 11B / 90B Vision | Pre-trained image encoder bolted on with cross-attention adapters | Limited (fixed tile slicing) | Yes via tile decomposition | Llama 3.2 Community License |
| Qwen2-VL 7B / 72B | Modified ViT with dynamic resolution | Yes (variable patch grid) | Yes | Qwen License / Apache 2.0 (some sizes) |
| GPT-4o (vision) | Closed | Closed (assumed tile slicing) | Yes | Proprietary |
| Claude 3.5 Sonnet (vision) | Closed | Closed (assumed tile slicing) | Yes | Proprietary |
| LLaVA-OneVision 7B | SigLIP encoder, projector-trained | No (fixed 384x384) | Yes via interleaving | Apache 2.0 (parts) |
Pixtral and Qwen2-VL are the principal open-weight families to ship native variable-resolution support, but they take different routes to it. Qwen2-VL's encoder is a modified ViT that adjusts its patch grid based on input resolution; Pixtral-ViT relies on RoPE-2D plus break tokens to handle variable input shapes within a more conventional encoder skeleton.[1] Llama 3.2 Vision instead pre-processes images into fixed-size tiles and uses cross-attention adapters to inject the visual stream into the Llama 3.1 decoder.[19]
On the benchmark axis, Pixtral Large outperformed Llama 3.2 90B Vision across every metric in Mistral's launch comparison while having a smaller decoder (123B versus Llama 3.2 90B's 88B-parameter language model plus a much larger vision encoder).[3][5] Against the closed frontier, Pixtral Large traded blows with GPT-4o and Claude 3.5 Sonnet, leading on DocVQA, MathVista, VQAv2, and MM-MT-Bench while trailing on MMMU.[5]
The Pixtral 12B announcement generated considerable interest in the open-source AI community. The combination of strong document-understanding benchmarks and a fully permissive Apache 2.0 license was highlighted in coverage by TechCrunch, VentureBeat, and SiliconAngle.[7][9] VentureBeat described it as a model that "can analyze images without any limits," referring to the variable-resolution capability.[20] At the time of release, Mistral was valued at USD 6 billion, and the multimodal expansion was seen as an important step in the company's effort to remain competitive with OpenAI and Anthropic.[7]
One frequently noted point in reviews was the model's strength on DocVQA relative to its size. Achieving 90.7 ANLS while using 12B parameters, and doing so under an open license, was considered notable because most models with comparable document scores were either larger or proprietary. Independent analyses from MarkTechPost and InfoQ echoed this assessment.[12][21]
Pixtral Large's November 2024 release drew commentary about the LMSys Vision Arena results in particular. Being the top open-weights model by a 50 ELO margin placed it clearly above Llama 3.2 90B and other open alternatives at the time, and its performance exceeding GPT-4o on the MM-MT-Bench instruction-following suite was cited by researchers as evidence that open models were narrowing the gap with proprietary frontier models on multimodal tasks. Coverage from VentureBeat and SiliconAngle framed the Le Chat upgrade plus Pixtral Large bundle as Mistral's most direct challenge yet to ChatGPT.[9][10]
Researcher Simon Willison, writing within hours of the September 11 torrent drop, called attention to the unusual release format and noted that the params.json shipped in the torrent already documented the vision encoder dimensions and patch size, allowing independent observers to verify the architecture before Mistral's blog post went live.[8]
Both Pixtral models were subsequently deprecated on Mistral's API in favor of newer multimodal releases.[6]
Pixtral 12B (pixtral-12b-2409) was deprecated on December 2, 2025, with retirement scheduled for December 31, 2025. Mistral recommended Ministral 3 14B as the replacement, a model in the Ministral family of compact-but-capable models that combine vision with text understanding.[6]
Pixtral Large (pixtral-large-2411) was deprecated on February 27, 2026, with retirement scheduled for May 31, 2026. Mistral recommended Mistral Large 3 as the replacement.[6] Mistral Large 3 is part of Mistral's frontier model line, with vision capabilities integrated rather than offered as a separate Pixtral-branded SKU. The Pixtral brand was effectively retired with these deprecations, as Mistral consolidated vision into its main model line.
The Hugging Face weights for both Pixtral models remained available after API deprecation, allowing continued local use under their respective licenses for research and self-hosted deployments. The Pixtral architecture and the RoPE-2D plus break-token design also persisted in Mistral's later vision systems, including the Mistral OCR family, which inherited the variable-resolution tokenization scheme.[17]
Researchers evaluating Pixtral have identified several consistent limitations.
Spatial reasoning over three-dimensional scenes is a known weakness. When images require understanding of depth, occlusion, or three-dimensional object placement, Pixtral (in common with most vision-language models) performs below its level on flat document tasks. Error analyses describe two primary failure modes: encoding errors where visual elements like colors or shapes are misidentified, and visio-semantic errors where the model fails to reason correctly about spatial relationships.
On MMMU, which includes graduate-level questions across academic disciplines, both Pixtral 12B and Pixtral Large score below GPT-4o and Claude 3.5 Sonnet. MMMU tests broad scientific reasoning with images, and the gap suggests that Pixtral's training was more optimized for document-style tasks than for general academic visual reasoning.
The model lacks built-in content moderation. The Hugging Face model card notes that Pixtral has no built-in moderation mechanisms, and deployers are expected to add their own safety layers.[4][5] This is typical for open-weight models but is a practical consideration for production deployments.
For very large images that generate thousands of patch tokens, inference speed and memory consumption can be significant. The flexible resolution that makes Pixtral strong on documents also means users must manage token budgets when submitting high-resolution inputs. Pixtral Large in particular has been measured at roughly 53 output tokens per second on La Plateforme, slower than several frontier models, in part because of its large parameter count.[13]
Training data details remain undisclosed. It is unclear which image datasets were used, what data governance practices were applied, or whether any copyright review was performed on training images. This ambiguity is a limitation for regulated industries where data provenance must be documented.
Early adopters also noted that Pixtral Large at launch was less suited to high-throughput OCR than dedicated OCR systems, a gap Mistral implicitly acknowledged by releasing the separate Mistral OCR product line in 2025.[17] For pure text-extraction workloads, Pixtral's chat interface adds overhead compared with a dedicated OCR pipeline.