Pixtral
Last reviewed
May 7, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 4,067 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 7, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 4,067 words
Add missing citations, update stale details, or suggest a clearer explanation.
Pixtral is a family of multimodal vision-language models developed by Mistral AI, a French AI company founded in April 2023. The family consists of two models: Pixtral 12B, released on September 11, 2024, as Mistral's first publicly available vision model, and Pixtral Large, a 124-billion-parameter model released on November 18, 2024. Both models combine a custom vision encoder with a text decoder to process images and text within a single 128,000-token context window. The 12B model is open-source under the Apache 2.0 license; the Large model carries Mistral's commercial license. Pixtral introduced several architectural novelties, including a dedicated vision encoder trained from scratch, 2D rotary positional encoding, and special break tokens that allow the model to process images at their native resolution and aspect ratio without padding or cropping.
Mistral AI was founded in Paris in April 2023 by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, all veterans of DeepMind and Meta's research teams. The company built an early reputation for releasing compact, high-performing text models including Mistral 7B and Mistral Large, positioning itself as Europe's most prominent open AI laboratory and a direct competitor to OpenAI and Anthropic. By mid-2024, Mistral had raised roughly €1 billion in funding and was valued at approximately €5.8 billion.
Despite strong text model performance, Mistral had no multimodal offering through early 2024 while GPT-4o, Gemini 1.5, and Claude 3 had all added vision capabilities to their flagship products. The absence of image understanding put Mistral at a disadvantage for enterprise use cases involving documents, charts, and visual data. Building a vision system required more than bolting a pre-trained image encoder onto an existing decoder; Mistral chose to train a new vision encoder from scratch, giving them more control over how the encoder represents images and making it possible to support variable-resolution inputs natively.
Arthur Mensch, Mistral's CEO, had prior multimodal research experience: before co-founding Mistral he worked at DeepMind Paris on projects including Flamingo, one of the early large vision-language models. This background informed Pixtral's design philosophy, which prioritizes high-resolution document understanding and the ability to reason over multiple images in a single conversation.
Pixtral 12B was announced on September 11, 2024, and released on the same day via Hugging Face under the model identifier mistralai/Pixtral-12B-2409. It was simultaneously made available through Mistral's API platform, La Plateforme, and through the Le Chat web interface. The announcement described it as Mistral's first natively multimodal model, meaning vision was built in from training rather than added as a post-hoc adapter.
The model has 12 billion parameters in its multimodal decoder and 400 million parameters in its dedicated vision encoder, giving a total parameter count of approximately 12.4 billion. The decoder is built on the same architectural foundation as Mistral Nemo 12B, with 40 transformer layers, a hidden dimension of 5,120, 32 attention heads, and 8 key-value heads for grouped-query attention. The context window spans 131,072 tokens, which at the model's patch granularity of 16x16 pixels is large enough to accommodate multiple images alongside long text in a single conversation.
The original release post described Pixtral 12B as able to understand both natural images and documents, a distinction that mattered because many multimodal models at the time were stronger on photographs and weaker on charts, tables, and PDFs. On DocVQA, which measures a model's ability to answer questions about document images, Pixtral 12B scored 90.7 on the ANLS metric, placing it above GPT-4 Turbo and matching or beating several larger open models. On ChartQA it scored 81.8, also strong for its size class.
The Apache 2.0 license was notable given the model's capabilities. At the time of release, most competitive multimodal models were proprietary. The license permits unrestricted commercial use, fine-tuning, and redistribution, which drew attention from developers who wanted a capable vision model they could run locally or fine-tune for specialized applications.
Pixtral Large was announced on November 18, 2024, roughly two months after the 12B release. The model has 124 billion parameters, with a 123-billion-parameter multimodal decoder based on Mistral Large 2 and a 1-billion-parameter vision encoder. Mistral released it under an open-weights model, accessible for download under the Mistral Research License (MRL) for non-commercial use and a separate commercial license for production deployments.
The scale difference between the two Pixtral models is significant. Pixtral Large requires over 300 GB of GPU memory for full-precision inference, placing it firmly in the data-center tier rather than the consumer GPU tier. Mistral recommends vLLM with tensor parallelism across at least 8 GPUs for practical deployment. Despite this, the model is available via API at La Plateforme, where developers can access it without managing infrastructure.
On benchmark evaluations, Pixtral Large outperformed GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet on several multimodal tasks. On MathVista, which tests mathematical reasoning over visual inputs such as geometry diagrams and data plots, Pixtral Large scored 69.4%, the highest of any model in the comparison set at the time. On DocVQA it scored 93.3 ANLS, and on MM-MT-Bench, Mistral's own evaluation suite for real-world multimodal instruction following, it scored 7.4 out of 10. The model was also ranked as the best open-weights model on the LMSys Vision Arena leaderboard at launch, outpacing the nearest competitor by approximately 50 ELO points.
Pixtral Large maintains the text capabilities of its underlying Mistral Large 2 decoder. Unlike some multimodal models that see text performance decline after vision training, Pixtral Large's benchmark scores on text-only tasks remained close to Mistral Large 2's baseline, which Mistral attributed to their training procedure and the careful separation of the vision encoder from the core decoder.
The Pixtral vision encoder, called Pixtral-ViT, was trained from scratch rather than adapted from an existing encoder such as CLIP or SigLIP. In Pixtral 12B it has 400 million parameters; in Pixtral Large it has 1 billion parameters. Both versions process images as sequences of 16x16 pixel patches.
The encoder has 24 transformer layers with 1,024-dimensional hidden states and 16 attention heads. Its 64-dimensional head size and 4,096-token internal context length give it enough capacity to represent high-resolution inputs. The encoder's output, a sequence of patch embeddings, is projected into the decoder's embedding space through a two-layer fully connected network with a GeLU activation and an intermediate hidden size equal to the encoder's hidden dimension.
A standard ViT processes images at a fixed resolution by resizing them to a predetermined size (commonly 224x224 or 448x448) and sometimes adding padding. This resizing degrades fine detail in high-resolution documents and creates distortions for images whose natural aspect ratio differs from the target. Pixtral-ViT was designed to avoid both problems.
Standard vision transformers use either learned absolute position embeddings or 1D positional encodings that assign a single index to each patch in raster order. These break down when images have different sizes or aspect ratios because the meaning of any given position index depends on the image dimensions.
Pixtral-ViT replaces these with RoPE-2D, a two-dimensional extension of rotary positional encoding. Each patch at grid position (i, j) receives a position encoding computed from both its row index i and its column index j independently. The rotary mechanism ensures that the dot product between any two patch embeddings depends only on their relative position (delta-i, delta-j) rather than their absolute coordinates. This property, called the relative position property, allows the encoder to generalize to images of arbitrary size at inference time, including sizes never seen during training.
Two special tokens, [IMG BREAK] and [IMG END], are inserted into the image token sequence. An [IMG BREAK] token is placed at the end of each row of patches, giving the decoder a signal about where row boundaries fall. An [IMG END] token marks the end of the entire image. These tokens allow the model to distinguish between two images that contain the same total number of patches but have different shapes, for example a 16x4 image versus a 8x8 image with equivalent patch counts.
When multiple images are processed in a single forward pass, the patch sequences are concatenated along the sequence dimension and a block-diagonal attention mask is applied to the vision encoder. This prevents patches from one image from attending to patches from another image during encoding, while still allowing the full batch to be computed in a single kernel call for efficiency.
The decoder in Pixtral 12B has 40 transformer layers with 32 attention heads, 8 key-value heads for grouped-query attention, a 5,120-dimensional hidden size, and a 128-dimensional head size. In Pixtral Large the decoder matches Mistral Large 2's architecture, which has roughly 123 billion parameters. Both decoders use sliding window attention and a large context window of 131,072 tokens.
The vision-language projection between the encoder and decoder is a two-layer MLP. Its input dimension matches the encoder's hidden size and its output dimension matches the decoder's embedding size. There is no cross-attention mechanism between vision and language streams; instead, image tokens are treated as a prefix in the input sequence, and the decoder attends to them through standard causal self-attention.
A central design goal of the Pixtral family is the ability to process images at their native resolution and aspect ratio without scaling or padding. In practice, a user can submit a 1920x1080 screenshot, a 400x600 receipt photo, and a 3000x2000 landscape photograph in the same conversation, and each will be tokenized according to its actual dimensions.
The number of tokens an image consumes scales with the number of 16x16 patches it contains. A 1024x1024 image produces 4,096 patch tokens; a 256x256 thumbnail produces only 256. This means the model naturally allocates more processing capacity to larger, more detailed images and less to small thumbnails, which is the correct behavior for document-heavy use cases where image resolution carries meaning.
For images that would produce an extremely large number of patches, users can resize before submission to control token consumption. The model's 128K context window imposes a practical ceiling: at 16x16 patches, a context window of 128K tokens can accommodate roughly 30 high-resolution images if only images are present, or fewer if long text is interleaved.
The variable-resolution capability was listed by Mistral as one of Pixtral's primary differentiators relative to models like LLaVA and earlier multimodal models that required fixed input sizes. In the arxiv paper, the authors provided examples of how fixed-resolution models lose fine print when a document is downscaled to fit a 336x336 or 448x448 input size, while Pixtral preserves it by keeping the image at its original dimensions and consuming more tokens proportionally. The tradeoff is that token consumption is not fixed: a large image costs more context window space than a small one, and users of very high-resolution images must plan accordingly.
Mistral has not published a detailed description of Pixtral's training data or training procedure. The arxiv paper for Pixtral 12B notes that the model was trained on interleaved image-and-text data and that the training included both pre-training and instruction fine-tuning phases.
For the vision encoder, Pixtral-ViT was pre-trained on image data before being integrated with the language decoder. The encoder was trained from scratch rather than initialized from a checkpoint, which Mistral said allowed them to optimize specifically for variable-resolution inputs without inheriting constraints from encoders designed for fixed-size inputs.
The decoder for Pixtral 12B is built on Mistral Nemo 12B. The decoder for Pixtral Large is built on Mistral Large 2. In both cases, the text decoder was adapted for multimodal input during training rather than being frozen. Mistral has emphasized that the text performance of both models remained close to their respective base decoders after multimodal training, which is not always the case when vision is added to a pre-trained language model through naive fine-tuning.
The instruction-tuned variants, available under the -Instruct designation, were fine-tuned on a dataset that Mistral describes as covering natural image understanding, document comprehension, chart and figure interpretation, and complex multi-turn visual question answering.
Mistral did not publish specific details about dataset size, data sources, or compute used for training. The arxiv paper noted that Pixtral 12B used the same tokenizer as Mistral Nemo, called Tekken, which encodes text and image tokens in a unified vocabulary.
The paper also described how the evaluation methodology itself was carefully controlled. Mistral noted that small changes in evaluation setup, such as how prompts are phrased or how outputs are parsed, can dramatically shift reported scores for some model families. The reported numbers use chain-of-thought evaluation for MMMU and MathVista, which tends to produce higher scores than direct-answer evaluation. For DocVQA, the ANLS (Average Normalized Levenshtein Similarity) metric is used, which rewards partial credit for near-correct string matches and penalizes near-misses less severely than exact match. Readers comparing these scores with numbers from other publications should check whether the same evaluation protocol was used.
Pixtral 12B is released under the Apache 2.0 open-source license. This permits anyone to use, modify, and redistribute the model commercially without restriction, subject only to attribution requirements. The weights are hosted on Hugging Face under mistralai/Pixtral-12B-2409 and can be downloaded freely.
Pixtral Large is released under a dual-licensing structure. The Mistral Research License (MRL) covers non-commercial and research uses. Commercial deployments require a separate commercial license from Mistral. The weights are available on Hugging Face under mistralai/Pixtral-Large-Instruct-2411 but carry the MRL by default.
This licensing split mirrors Mistral's broader model strategy: smaller models receive maximally open licenses to build developer mindshare, while flagship-scale models carry commercial terms that generate revenue from enterprise users.
The following table shows Pixtral 12B's performance on standard multimodal benchmarks compared with models in a similar parameter range, as reported in the Pixtral 12B arxiv paper and the Mistral release post.
| Benchmark | Pixtral 12B | Qwen2-VL 7B | LLaVA-OV 7B | Phi-3.5 Vision | Claude 3 Haiku | Gemini 1.5 Flash 8B |
|---|---|---|---|---|---|---|
| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 38.3 | 50.4 | 50.7 |
| MathVista (CoT) | 58.0 | 54.4 | 36.1 | 39.3 | 44.8 | 56.9 |
| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 67.7 | 69.6 | 78.0 |
| DocVQA (ANLS) | 90.7 | 94.5 | 90.5 | 74.4 | 74.6 | 79.5 |
| VQAv2 | 78.6 | 75.9 | 78.3 | 56.1 | 68.4 | 65.5 |
| MM-MT-Bench | 6.05 | -- | -- | -- | -- | -- |
Among models in the 7B-12B parameter range, Pixtral 12B was the strongest across most benchmarks at the time of release. It was particularly strong on ChartQA and DocVQA, reflecting the emphasis on document understanding in its training.
For context against larger closed models:
| Benchmark | Pixtral 12B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMMU (CoT) | 52.5 | 68.6 | 68.0 |
| MathVista (CoT) | 58.0 | 64.6 | 64.4 |
| ChartQA (CoT) | 81.8 | 85.1 | 87.6 |
| DocVQA (ANLS) | 90.7 | 88.9 | 90.3 |
| VQAv2 | 78.6 | 77.8 | 70.7 |
Pixtral 12B's DocVQA score of 90.7 exceeded both GPT-4o (88.9) and Claude 3.5 Sonnet (90.3), a result Mistral highlighted in the release announcement. On MMMU and MathVista, the larger closed models retained a meaningful advantage.
The following table shows Pixtral Large's performance, as reported on the Hugging Face model card and Mistral's release post.
| Benchmark | Pixtral Large | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro | Llama 3.2 90B |
|---|---|---|---|---|---|
| MMMU (CoT) | 64.0 | 68.6 | 68.4 | 66.3 | 53.7 |
| MathVista (CoT) | 69.4 | 65.4 | 67.1 | 67.8 | 49.1 |
| ChartQA (CoT) | 88.1 | 85.2 | 89.1 | 83.8 | 70.8 |
| DocVQA (ANLS) | 93.3 | 88.5 | 88.6 | 92.3 | 85.7 |
| VQAv2 | 80.9 | 76.4 | 69.5 | 70.6 | 67.0 |
| AI2D (BBox) | 93.8 | 93.2 | 76.9 | 94.6 | -- |
| MM-MT-Bench | 7.4 | 6.7 | 7.3 | 6.8 | 5.5 |
Pixtral Large scored highest on MathVista (69.4%) among the models listed. On DocVQA it scored 93.3, above all listed competitors. On MM-MT-Bench, its score of 7.4 exceeded GPT-4o (6.7), Gemini 1.5 Pro (6.8), and Claude 3.5 Sonnet (7.3). Claude 3.5 Sonnet retained a narrow edge on ChartQA (89.1 vs 88.1) and GPT-4o and Gemini 1.5 Pro retained leads on MMMU (68.6 and 66.3 vs 64.0).
Pixtral's variable-resolution architecture makes it well-suited for document processing tasks. Because it does not resize images before tokenization, text in PDFs, scanned forms, and financial statements is presented to the model at full resolution. The DocVQA benchmark scores for both models reflect this strength: Pixtral 12B scored 90.7 and Pixtral Large scored 93.3, both competitive with or better than larger closed-source models.
Practical applications include extracting structured data from invoices, reading handwritten fields in scanned forms, and answering questions over lengthy PDFs uploaded alongside textual queries.
ChartQA measures a model's ability to answer questions about bar charts, line plots, pie charts, and other visualizations. Pixtral 12B's score of 81.8 and Pixtral Large's score of 88.1 indicate strong chart reasoning. Applications include summarizing financial dashboards, extracting trend lines from research graphs, and comparing figures across multiple charts in a single conversation.
MathVista tests models on geometry problems, algebra embedded in figures, and statistical reasoning over plots. Pixtral Large's 69.4% on MathVista was the highest in Mistral's benchmark comparison at launch. This makes it useful for educational technology applications, automated grading of visual math problems, and scientific literature analysis where equations and diagrams appear together.
Both Pixtral models support multiple images within a single context window. At 128K tokens, the practical limit is roughly 30 high-resolution images or more lower-resolution images. This makes the models suitable for comparative analysis tasks such as comparing product photos, reviewing sequences of UI screenshots, or analyzing a series of medical images with accompanying text.
Pixtral can read code screenshots, interpret software architecture diagrams, and describe UI mockups. Developers have used it to convert Figma wireframes into HTML/CSS descriptions, extract code from screenshots when copy-paste is unavailable, and explain data flow diagrams.
Mistral highlighted multilingual OCR and multilingual document understanding as use cases for Pixtral Large. The underlying Mistral Large 2 decoder has strong multilingual text capabilities, and those carry over when processing non-English documents. This makes Pixtral Large suitable for reading scanned government documents in French or Spanish, extracting data from Chinese financial filings, or answering questions about Arabic-language medical records, tasks that require both vision and language components to handle non-Latin scripts correctly.
The combination of strong document understanding, a large context window, and API availability on La Plateforme makes Pixtral useful in enterprise document workflows. Common integration patterns include attaching Pixtral to PDF upload pipelines for automated data extraction, using it as a back-end for chatbots that need to answer questions about uploaded images, and routing visual questions through Pixtral in mixed text-image RAG (retrieval-augmented generation) systems. The Apache 2.0 license on Pixtral 12B means teams can fine-tune it on proprietary document formats without any licensing complications.
Pixtral 12B is available free for local deployment under the Apache 2.0 license. Via La Plateforme's API, pricing at launch was approximately $0.15 per million input tokens and $0.15 per million output tokens. Some provider comparisons have cited pricing as low as $0.10 per million tokens for both input and output on certain tiers, making it one of the more affordable multimodal API options available.
For cost comparison, GPT-4o was priced at $2.50 per million input tokens and $10.00 per million output tokens at the time, meaning Pixtral 12B via La Plateforme cost roughly 15-25x less per token for API access.
Pixtral Large via La Plateforme is priced at $2.00 per million input tokens and $6.00 per million output tokens. This places it roughly in the same tier as mid-range frontier models. The open weights can be self-hosted, which eliminates per-token API costs but requires infrastructure capable of running a 124B parameter model.
The Pixtral 12B announcement generated considerable interest in the open-source AI community. The combination of strong document-understanding benchmarks and a fully permissive Apache 2.0 license was highlighted in coverage by TechCrunch, VentureBeat, and SiliconAngle. VentureBeat described it as a model that "can analyze images without any limits," referring to the variable-resolution capability. At the time of release, Mistral was valued at $6 billion, and the multimodal expansion was seen as an important step in the company's effort to remain competitive with OpenAI and Anthropic.
One frequently noted point in reviews was the model's strength on DocVQA relative to its size. Achieving 90.7 ANLS while using 12B parameters, and doing so under an open license, was considered notable because most models with comparable document scores were either larger or proprietary.
Pixtral Large's November 2024 release drew commentary about the LMSys Vision Arena results in particular. Being the top open-weights model by a 50 ELO margin placed it clearly above Llama 3.2 90B and other open alternatives at the time, and its performance exceeding GPT-4o on the MM-MT-Bench instruction-following suite was cited by researchers as evidence that open models were narrowing the gap with proprietary frontier models on multimodal tasks.
Both models have since been marked as deprecated on Mistral's website, replaced by newer multimodal releases. Mistral has continued developing its vision capabilities and has described further multimodal models as part of its roadmap.
Researchers evaluating Pixtral have identified several consistent limitations.
Spatial reasoning over three-dimensional scenes is a known weakness. When images require understanding of depth, occlusion, or three-dimensional object placement, Pixtral (in common with most vision-language models) performs below its level on flat document tasks. Error analyses describe two primary failure modes: encoding errors where visual elements like colors or shapes are misidentified, and visio-semantic errors where the model fails to reason correctly about spatial relationships.
On MMMU, which includes graduate-level questions across academic disciplines, both Pixtral 12B and Pixtral Large score below GPT-4o and Claude 3.5 Sonnet. MMMU tests broad scientific reasoning with images, and the gap suggests that Pixtral's training was more optimized for document-style tasks than for general academic visual reasoning.
The model lacks built-in content moderation. The Hugging Face model card notes that Pixtral has no built-in moderation mechanisms, and deployers are expected to add their own safety layers. This is typical for open-weight models but is a practical consideration for production deployments.
For very large images that generate thousands of patch tokens, inference speed and memory consumption can be significant. The flexible resolution that makes Pixtral strong on documents also means users must manage token budgets when submitting high-resolution inputs.
Training data details remain undisclosed. It is unclear which image datasets were used, what data governance practices were applied, or whether any copyright review was performed on training images. This ambiguity is a limitation for regulated industries where data provenance must be documented.