MathVista is a benchmark for evaluating the mathematical reasoning capabilities of foundation models in visual contexts. It was introduced by Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao from the University of California, Los Angeles, the University of Washington, and Microsoft Research. The benchmark consolidates 6,141 examples drawn from 28 existing multimodal datasets and three newly created datasets. Posted to arXiv in October 2023 (arXiv:2310.02255) and accepted as an oral presentation at ICLR 2024 (one of only 85 orals out of roughly 7,300 submissions, a 1.2% acceptance rate), MathVista has become one of the most widely used benchmarks for measuring how well multimodal AI systems combine visual perception with mathematical problem solving. It is reported in nearly every major frontier model release and forms a standard component of the multimodal evaluation suite alongside MMMU, MathVerse, and MATH-Vision.
MathVista was the first large-scale, unified benchmark to combine challenging visual perception (charts, plots, geometry diagrams, scientific figures, document images, puzzle test patterns) with rigorous mathematical reasoning (algebra, arithmetic, geometry, logic, statistics, scientific computation). Earlier benchmarks tended to test either text-based math (such as GSM8K, MATH, MathQA) or pure visual question answering, but not the compositional skill of reading a diagram and solving a math problem grounded in it. By aggregating 28 existing datasets and creating three new ones, for 31 sources in total, MathVista became a benchmark broad enough to test general multimodal mathematical competence rather than narrow domain skill.
Large language models and large multimodal models have shown strong problem-solving abilities across many tasks, ranging from text generation and code completion to image captioning and visual question answering. Before MathVista, however, the ability of these models to perform mathematical reasoning in visual contexts had not been studied in a systematic, unified way. The lack of a shared benchmark made cross-model comparisons difficult and obscured a critical question: can a model that excels at text math actually solve a geometry problem when the diagram is the primary input?
Existing benchmarks tended to focus on either textual math problems or visual question answering separately, without combining the two in a way that tests genuine compositional reasoning. A model might be able to solve algebraic equations from text or describe the contents of a chart, but the question of whether it could look at a geometry diagram and compute an unknown angle, or interpret a bar chart and perform statistical calculations, remained largely unaddressed at scale.
The MathVista authors identified several specific gaps in the existing evaluation landscape. First, many visual math datasets covered only narrow domains, such as geometry or chart reading, without spanning the full range of visual mathematical tasks. Second, certain visual contexts were missing entirely from existing benchmarks, including puzzle test figures that require inductive reasoning, functional plots that demand algebraic analysis, and scientific figures from academic papers. Third, there was no unified framework for comparing model performance across diverse mathematical reasoning types and visual contexts simultaneously. Fourth, evaluation pipelines were inconsistent across studies, with some teams using exact match, others using free-form scoring, and others using human evaluation, making leaderboard comparisons unreliable.
To address these gaps, the research team designed MathVista as a meta-benchmark that aggregates problems from a broad range of sources and fills the remaining coverage gaps with three entirely new datasets. The resulting benchmark requires fine-grained, deep visual understanding and compositional reasoning, skills that all state-of-the-art foundation models at the time of its release found challenging.
MathVista comprises 6,141 examples sourced from 31 datasets in total: 28 existing multimodal datasets and three newly created ones. The existing datasets are divided into 9 math-targeted question answering (MathQA) datasets and 19 general visual question answering (VQA) datasets. To maintain balance, each source dataset contributes up to 400 examples to MathVista. The final dataset contains 2,666 examples from MathQA sources, 2,739 examples from VQA sources, and 736 examples from the three new datasets.
The following table lists all 31 source datasets incorporated into MathVista, along with their category, primary task type, grade level, and subject focus.
| Dataset | Category | Task Type | Grade Level | Subject |
|---|---|---|---|---|
| GEOS | Math-targeted | Geometry Problem Solving | High School | Geometry |
| GeoQA+ | Math-targeted | Geometry Problem Solving | High School | Geometry |
| Geometry3K | Math-targeted | Geometry Problem Solving | High School | Geometry |
| UniGeo | Math-targeted | Geometry Problem Solving | High School | Geometry |
| CLEVR-Math | Math-targeted | Math Word Problem | Elementary | Arithmetic |
| TabMWP | Math-targeted | Math Word Problem | Elementary/High School | Arithmetic |
| IconQA | Math-targeted | Math Word Problem | Elementary | Miscellaneous |
| SciBench | Math-targeted | Textbook QA | College | Miscellaneous |
| TheoremQA | Math-targeted | Textbook QA | College | Miscellaneous |
| FunctionQA | Math-targeted (New) | Textbook QA | College | Algebra |
| PaperQA | Math-targeted (New) | Figure QA | College | Miscellaneous |
| IQTest | Math-targeted (New) | Figure QA | Elementary | IQ Test |
| PlotQA | General VQA | Figure QA | N/A | Miscellaneous |
| ChartQA | General VQA | Figure QA | N/A | Miscellaneous |
| FigureQA | General VQA | Figure QA | N/A | Miscellaneous |
| DVQA | General VQA | Figure QA | N/A | Data Visualization |
| MapQA | General VQA | Figure QA | High School | Geography |
| DocVQA | General VQA | Figure QA | N/A | Miscellaneous |
| ScienceQA | General VQA | Textbook QA | Elementary/High School | Miscellaneous |
| TQA | General VQA | Textbook QA | High School | Biology |
| AI2D | General VQA | Textbook QA | High School | Miscellaneous |
| VQA2.0 | General VQA | Visual QA | N/A | Miscellaneous |
| VQA-AS | General VQA | Visual QA | N/A | Miscellaneous |
| Super-CLEVR | General VQA | Visual QA | N/A | Miscellaneous |
| A-OKVQA | General VQA | Visual QA | N/A | Miscellaneous |
| TextVQA | General VQA | Visual QA | N/A | Miscellaneous |
| VizWiz | General VQA | Visual QA | N/A | Miscellaneous |
| KVQA | General VQA | Visual QA | N/A | Miscellaneous |
| PMC-VQA | General VQA | Visual QA | College | Medicine |
| VQA-RAD | General VQA | Visual QA | College | Medicine |
| ParsVQA-Caps | General VQA | Visual QA | N/A | Miscellaneous |
The table illustrates one of MathVista's key design choices: rather than restricting itself to clean, math-focused inputs, the benchmark deliberately incorporates messy real-world visual contexts including medical scans (PMC-VQA, VQA-RAD), document scans (DocVQA), natural photographs (VQA2.0, A-OKVQA, TextVQA), and accessibility images (VizWiz). The intuition is that any model claiming general multimodal competence should be able to perform mathematical reasoning even when the visual context is not explicitly designed for math.
To fill specific gaps in existing benchmarks, the MathVista team created three entirely new datasets. All questions in these datasets were manually annotated by graduate students in STEM fields and further refined through a rigorous review process involving three independent reviewers with 99.2% inter-annotation consistency.
IQTest contains 228 examples and evaluates logical and inductive reasoning using puzzle test figures sourced from online learning platforms. The problems require pattern recognition, abstract thinking, and calculations based on visual sequences and matrices. These problems test a form of visual intelligence that is fundamentally different from the domain-specific math tested in most other datasets, and they remain among the hardest items in MathVista even for current frontier models.
FunctionQA contains 400 examples and focuses on algebraic reasoning over functional plots. Problems require the model to analyze graphs of mathematical functions and answer questions involving variables, expressions, equations, and function properties such as domain, range, intercepts, asymptotes, monotonicity, and symmetry. FunctionQA demands subtle visual perception of plot features (axis labels, tick marks, gridlines) combined with formal algebraic knowledge, and was the first dataset to specifically target this combination at scale.
PaperQA contains 107 examples and presents questions derived from informative illustrations in academic papers, including tables, figures, and charts. The source papers were released in August 2023, after the training cutoff of GPT-4V, ensuring that the content is not present in the training data of models evaluated on MathVista at release. PaperQA tests scientific reasoning and the ability to extract and compute with quantitative information from research publications, a skill increasingly relevant for AI assistants used in research workflows.
MathVista organizes its problems along three complementary taxonomic dimensions: mathematical reasoning type, task type, and visual context type. This multi-dimensional classification enables fine-grained analysis of where models succeed and where they fail, rather than collapsing performance into a single accuracy score.
The benchmark encompasses seven distinct categories of mathematical reasoning:
| Reasoning Type | Abbreviation | Description |
|---|---|---|
| Algebraic | ALG | Reasoning about variables, expressions, equations, and functions |
| Arithmetic | ARI | Basic numerical computation including addition, subtraction, multiplication, and division |
| Geometric | GEO | Reasoning about shapes, angles, areas, volumes, and spatial relationships |
| Logical | LOG | Deductive and inductive reasoning, pattern recognition, and logical inference |
| Numeric | NUM | Numeric common sense and estimation in everyday contexts |
| Scientific | SCI | Reasoning about scientific concepts, formulas, and phenomena |
| Statistical | STA | Interpreting data distributions, calculating statistics, and reading charts |
The annotation of reasoning types was performed through a combination of automatic classification and manual labeling. On a random sample of 1,000 examples, the automatic annotations matched the human labels exactly 94.1% of the time, and accuracy for individual reasoning-type labels reached 98.79%, confirming the reliability of the classification process.
Problems in MathVista fall into five main task categories. Figure question answering (FQA) covers questions that require statistical reasoning over charts, plots, and figures: reading values from bar charts, identifying trends in line plots, or comparing data points in scatter plots. Geometry problem solving (GPS) deals with geometric topics such as computing angles, areas, or side lengths from diagrams, and is sourced largely from standardized math tests and textbook exercises. Math word problems (MWP) are arithmetic reasoning problems set in everyday scenarios where the visual component may include synthetic scenes, abstract diagrams, or tables. Textbook question answering (TQA) covers knowledge-intensive reasoning problems based on scientific topics and figures from educational textbooks, often requiring domain-specific concepts in addition to mathematical reasoning. Visual question answering (VQA) refers to general visual reasoning questions that involve mathematical components, such as counting objects in a natural image, estimating quantities, or performing arithmetic based on visual information.
MathVista includes a diverse array of visual contexts spanning more than a dozen distinct categories: geometry diagrams, synthetic scenes (such as CLEVR-style rendered images), bar charts, natural images, scientific figures, tables, function plots, abstract scenes, puzzle test figures, scatter plots, line plots, pie charts, document images, medical images, and map charts. This diversity ensures that models cannot rely on a single type of visual understanding and must generalize across fundamentally different image types. A model that excels at chart reading but cannot interpret a CT scan, or one that handles geometry diagrams but stumbles on hand-drawn puzzle figures, will see its performance differ markedly across the visual context breakdown.
Problems span four difficulty tiers based on the educational level required to solve them: elementary school (basic counting, simple arithmetic, pattern recognition), high school (geometry, algebra, data interpretation), college (calculus, advanced algebra, scientific reasoning, statistical analysis), and daily life or not applicable (practical numeric reasoning and estimation). Many of the most difficult problems come from college-level scientific contexts, particularly the SciBench, TheoremQA, FunctionQA, and PaperQA subsets.
MathVista includes two types of questions. Multiple-choice questions ask the model to select from a set of provided answer options, with between 2 and 8 choices per question. Free-form questions require the model to generate a specific answer, which can be an integer, a floating-point number, or a text string. The mix of question formats is intentional: multiple-choice items can be partly guessed and benefit from process-of-elimination, while free-form items demand a precise answer and provide a stricter test of true reasoning.
The dataset is available in three languages: English, Chinese (from GeoQA+), and Persian (from ParsVQA-Caps), though the majority of examples are in English.
MathVista is divided into two official splits. The testmini split contains 1,000 examples and serves as a development and validation subset designed for fast iteration. It was constructed to be statistically representative of the full test set, with a KL divergence of only 0.008 and a total variation distance of 0.035 from the full distribution. Ground-truth answers for testmini are publicly available, which makes it the de facto standard for reported scores in model release papers.
The test split contains 5,141 examples and is the main evaluation split with private ground-truth labels. Researchers must submit predictions to the MathVista evaluation server (or the Kaggle leaderboard) to receive scores on this split. The hidden test labels prevent overfitting and make the test split a more rigorous benchmark for academic comparison, although in practice most published numbers refer to testmini because of its open ground truth and shorter evaluation time.
This split design allows researchers to develop and debug their methods on testmini while the hidden test labels guard against overfitting to the full test set. In practice, testmini has nonetheless become the more frequently cited figure, partly because of its public ground truth and partly because the 1,000-example size keeps API costs manageable when evaluating expensive proprietary models.
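For readers unfamiliar with the two statistics quoted for testmini, both can be computed directly from the category distributions of the two splits. The sketch below uses made-up category proportions purely for illustration; the official figures come from the real testmini and test distributions.

```python
import numpy as np

# Hypothetical proportions over, e.g., the five task types; stand-ins for
# the real testmini and full-test category distributions.
p_testmini = np.array([0.28, 0.22, 0.20, 0.16, 0.14])
p_full = np.array([0.27, 0.23, 0.21, 0.15, 0.14])

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same categories."""
    return float(np.sum(p * np.log(p / q)))

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * float(np.sum(np.abs(p - q)))

print(kl_divergence(p_testmini, p_full))   # a small value indicates a representative split
print(total_variation(p_testmini, p_full))
```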
The evaluation pipeline for MathVista consists of three stages: response generation, answer extraction, and score calculation.
During the first stage, each model generates responses given an input query. The query incorporates the task description, the question text, the answer choices (for multiple-choice problems), and relevant metadata. For augmented LLM approaches that cannot directly process images, visual information is converted to text through image captions and OCR (optical character recognition) before being fed to the model. This caption-and-OCR augmentation produced the strongest text-only baselines reported in the original paper, narrowing but not closing the gap with native multimodal models.
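A minimal sketch of this query assembly for the caption-and-OCR setting is shown below. The field names and template wording are illustrative assumptions, not the exact prompts from the MathVista repository.

```python
def build_text_only_query(example, caption, ocr_text):
    """Assemble a text-only prompt from image-derived text plus the question.

    `caption` and `ocr_text` stand in for the outputs of an external captioning
    model and an OCR system; the dictionary keys below are illustrative rather
    than the official MathVista schema.
    """
    parts = [
        "Answer the following question about an image.",
        f"Image caption: {caption}",
        f"Text detected in the image (OCR): {ocr_text}",
        f"Question: {example['question']}",
    ]
    if example.get("choices"):  # multiple-choice problems list their options
        options = ", ".join(
            f"({chr(65 + i)}) {choice}" for i, choice in enumerate(example["choices"])
        )
        parts.append(f"Choices: {options}")
    parts.append("Answer:")
    return "\n".join(parts)
```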
Because different models produce answers in different formats, a standardized answer extraction step is necessary. The MathVista team developed an answer extractor based on GPT-4 that parses model outputs and identifies the final answer. A preliminary study on 200 examples showed that GPT-4 can extract the answer text with more than 99.5% accuracy, making this approach highly reliable. The extractor handles a variety of output formats: chain-of-thought traces ending with "the answer is X," multiple-choice selections ("A"), boxed expressions (\boxed{42}), and short numerical free-form answers.
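The official extractor is LLM-based, but the listed output formats can be approximated with simple rules. The sketch below is a rule-based stand-in for the GPT-4 extractor, useful for offline debugging rather than official scoring.

```python
import re

def extract_final_answer(response: str) -> str:
    """Heuristically pull a final answer out of a model response.

    Covers the formats mentioned above: "the answer is X", \\boxed{...},
    a bare option letter, or a trailing number. The real MathVista pipeline
    delegates this step to a GPT-4-based extractor.
    """
    # Chain-of-thought endings such as "... the answer is 42."
    match = re.search(r"answer is\s*:?\s*([^\n.]+)", response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    # LaTeX \boxed{...} expressions
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match:
        return match.group(1).strip()
    # A bare multiple-choice letter, optionally parenthesized
    match = re.search(r"^\(?([A-H])\)?$", response.strip())
    if match:
        return match.group(1)
    # Fall back to the last number in the response, if any
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()
```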
For both multiple-choice and free-form questions, accuracy is used as the evaluation metric. Multiple-choice questions are scored by exact match with the correct answer option, while free-form questions with numerical answers allow for small tolerance based on the specified precision level. There is no partial credit, no chain-of-thought scoring, and no reasoning grade: a problem is either right or wrong. The simplicity is deliberate, since it makes scores reproducible and easy to compare across labs.
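A minimal sketch of this scoring rule is shown below, assuming each record exposes `question_type`, `answer_type`, `precision`, and `answer` fields as described on the public dataset card; treat the exact field names and tolerance handling as assumptions and defer to the official evaluation code.

```python
def is_correct(prediction: str, example: dict) -> bool:
    """Exact-match accuracy with a precision-based tolerance for numeric answers."""
    gold = str(example["answer"]).strip()
    pred = prediction.strip()

    # Multiple-choice: simple normalized string comparison of the chosen option.
    if example["question_type"] == "multi_choice":
        return pred.lower() == gold.lower()

    # Free-form numeric answers: round both sides to the specified precision.
    if example["answer_type"] in ("integer", "float"):
        try:
            digits = int(example.get("precision") or 0)
            return round(float(pred), digits) == round(float(gold), digits)
        except ValueError:
            return False

    # Free-form text answers.
    return pred.lower() == gold.lower()
```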
The original paper evaluated models using several prompting strategies. Zero-shot prompting gives the model only the question and image without any examples. Few-shot prompting provides a small number of solved examples before the target question. Chain-of-thought (CoT) prompts the model to show its reasoning step by step before arriving at an answer. Program-of-thought (PoT) prompts the model to generate executable code (typically Python) that computes the answer, enabling precise mathematical calculations and offloading arithmetic to a deterministic interpreter. The PoT setting was particularly effective for the augmented LLM baselines because it shifted the burden of arithmetic from the language model to a Python interpreter.
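The program-of-thought idea is easy to illustrate: prompt the model for code instead of an answer, execute the code, and read off the result. The template and execution below are a generic sketch of the technique under assumed conventions, not MathVista's official PoT pipeline, and model-generated code should be sandboxed in any real use.

```python
POT_TEMPLATE = """\
Read the problem and write Python code that computes the answer.
Store the final result in a variable named `answer`.

Problem: {problem}

Python code:
"""

def run_program_of_thought(generated_code: str):
    """Execute model-generated code in a scratch namespace and return `answer`."""
    namespace = {}
    exec(generated_code, namespace)  # illustration only; sandbox this in practice
    return namespace.get("answer")

# Hand-written stand-in for a model's code output.
fake_model_output = "angles = [35, 65]\nanswer = 180 - sum(angles)"
print(run_program_of_thought(fake_model_output))  # 80
```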
The original MathVista paper evaluated 12 prominent foundation models on the testmini subset. The results revealed a significant performance gap between the best model and human-level performance.
| Model | Type | Overall Accuracy |
|---|---|---|
| Human Performance | Baseline | 60.3% |
| GPT-4V | Proprietary LMM | 49.9% |
| Multimodal Bard | Proprietary LMM | 34.8% |
| PoT GPT-4 (Caption+OCR) | Augmented LLM | 33.9% |
| CoT GPT-4 (Caption+OCR) | Augmented LLM | 33.2% |
| CoT Claude-2 (Caption+OCR) | Augmented LLM | 33.2% |
| CoT GPT-4 (text-only) | Text-only LLM | 29.2% |
| CoT ChatGPT (2-shot) | Text-only LLM | 26.8% |
| Frequent Guess | Baseline | 26.3% |
| LLaVA (LLaMA-2-13B) | Open-source LMM | 26.1% |
| Zero-shot GPT-4 (text-only) | Text-only LLM | 26.1% |
| InstructBLIP (Vicuna-7B) | Open-source LMM | 25.3% |
| LLaVAR | Open-source LMM | 25.2% |
| LLaMA-Adapter-V2-7B | Open-source LMM | 23.9% |
| miniGPT4 (LLaMA-2-7B) | Open-source LMM | 23.1% |
| mPLUG-Owl (LLaMA-7B) | Open-source LMM | 22.2% |
| IDEFICS-9B-Instruct | Open-source LMM | 19.8% |
| Random Chance | Baseline | 17.9% |
GPT-4V achieved an overall accuracy of 49.9%, substantially outperforming Multimodal Bard, the second-best performer, by 15.1 percentage points. However, GPT-4V still fell short of human performance by 10.4 percentage points, highlighting the remaining challenge of visual mathematical reasoning for AI systems at the time of the paper.
The gap between proprietary models (GPT-4V, Multimodal Bard) and open-source LLaVA-class models was equally striking. The best open-source LMM at release scored 26.1%, slightly below the frequent-guess baseline of 26.3%. Open-source multimodal math reasoning was, in late 2023, essentially at guessing level for many subtasks.
GPT-4V's performance varied considerably across different reasoning types and task categories, revealing specific strengths and weaknesses:
| Category | GPT-4V Accuracy | Human Accuracy |
|---|---|---|
| Task: Textbook QA (TQA) | 65.2% | 63.2% |
| Task: Math Word Problems (MWP) | 57.5% | 73.0% |
| Task: Geometry Problem Solving (GPS) | 50.5% | 48.4% |
| Task: Figure QA (FQA) | 43.1% | 59.7% |
| Task: Visual QA (VQA) | 38.0% | 55.9% |
| Reasoning: Scientific (SCI) | 63.1% | N/A |
| Reasoning: Statistical (STA) | 55.8% | 63.9% |
| Reasoning: Algebraic (ALG) | 53.0% | 50.9% |
| Reasoning: Geometric (GEO) | 51.0% | N/A |
| Reasoning: Arithmetic (ARI) | 49.0% | 59.2% |
| Reasoning: Logical (LOG) | 21.6% | 40.7% |
| Reasoning: Numeric (NUM) | 20.1% | 53.8% |
GPT-4V surpassed human performance on Geometry Problem Solving (GPS) and Textbook QA (TQA) tasks, as well as on algebraic reasoning. The model struggled severely with logical reasoning (21.6% vs. 40.7% human) and numeric common sense (20.1% vs. 53.8% human), indicating that certain types of visual-mathematical reasoning remain particularly difficult for current AI systems. The Logical and Numeric categories largely overlap with the IQTest puzzle figures, where pattern recognition and abstract inference dominate, and where pretraining on web text and standardized math problems offers little help.
As newer and more powerful multimodal models have been released, performance on MathVista has improved dramatically. The following table shows selected results from the evolving leaderboard, drawn from official model release reports and the project's public leaderboard:
| Model | Developer | MathVista Accuracy (testmini) | Approximate Date |
|---|---|---|---|
| Kimi K2.5 | Moonshot AI | ~90.1% | 2026 |
| o3 | OpenAI | ~86.8% | 2025 |
| o4-mini | OpenAI | ~84.3% | 2025 |
| Qwen2.5-VL-72B | Alibaba | 74.8% | Jan 2025 |
| o1 | OpenAI | 73.9% | Sept 2024 |
| Llama 4 Maverick | Meta | 73.7% | 2025 |
| GPT-4.5 | OpenAI | 72.3% | 2025 |
| GPT-4.1 | OpenAI | 72.2% | 2025 |
| Pixtral Large | Mistral AI | 69.4% | 2024 |
| Grok-2 | xAI | 69.0% | Aug 2024 |
| Grok-2 mini | xAI | 68.1% | Aug 2024 |
| Gemini 1.5 Pro | Google | 68.1% | 2024 |
| Claude 3.5 Sonnet | Anthropic | 67.7% | June 2024 |
| LLaVA-OneVision | Open source | 67.5% | Aug 2024 |
| InternVL2 | OpenGVLab | 66.3% | 2024 |
| InternVL2-Pro | OpenGVLab | 65.8% | Sept 2024 |
| InternVL2-8B-MPO | OpenGVLab | 65.7% | Nov 2024 |
| GPT-4o (May 2024) | OpenAI | 63.8% | May 2024 |
| Gemini 1.0 Ultra | Google | 53.0% | Dec 2023 |
| Claude 3 Opus | Anthropic | 50.5% | March 2024 |
| GPT-4V | OpenAI | 49.9% | Oct 2023 |
| Multimodal Bard | Google | 34.8% | Oct 2023 |
| Human Performance | N/A | 60.3% | Oct 2023 |
| Random Chance | N/A | 17.9% | Oct 2023 |
Multiple models now surpass human-level performance on MathVista. GPT-4o was among the first widely deployed models to cross this threshold, achieving 63.8% compared to the 60.3% human baseline. OpenAI's reasoning-focused models (o1 and o3) have pushed performance even higher, and by 2026 several frontier multimodal systems sit comfortably above 80%. The 2024 leaderboard showed at least eight models above the human average, and the 2025 to 2026 generation has effectively saturated the easier portions of MathVista, with most remaining errors concentrated in the IQTest, FunctionQA, and PaperQA subsets and in college-level scientific reasoning.
The progress narrative on MathVista is one of the most striking in the field. From GPT-4V's 49.9% at launch in October 2023 to scores in the high 80s and low 90s by 2026, the benchmark went from "cutting-edge multimodal systems lag humans by 10 points" to "frontier models exceed humans by 25 to 30 points" in roughly two and a half years. Most of that gain came from three sources: native multimodal pretraining replacing OCR-and-caption pipelines, chain-of-thought prompting and inference-time reasoning (especially for o1 and o3), and broader training data covering charts, geometry diagrams, and scientific figures.
One of the most significant findings from the original MathVista study was the prevalence of hallucination in model outputs. In a detailed analysis of 250 predictions from Multimodal Bard, the researchers found that 49.6% of responses contained hallucinations, where the model introduced facts not mentioned in the context of the image or question. Hallucinated content ranged from invented chart labels and fabricated geometric measurements to misattributed paper titles and made-up scientific formulas.
The error analysis revealed several distinct failure categories:
| Error Type | Percentage |
|---|---|
| Incorrect answer with incorrect explanation | 44.6% |
| Wrong calculations despite correct reasoning approach | 19.5% |
| Correct reasoning with hallucinated facts | 18.6% |
| Correct answer with partially or completely incorrect explanation | 14.9% |
| Correct answer obtained through wrong reasons | 8.1% |
The finding that 8.1% of correct answers were reached through entirely wrong reasoning is particularly noteworthy. It indicates that evaluation based solely on final answers can overestimate a model's true mathematical understanding, and it has motivated subsequent benchmarks (such as MathVerse) to design problems that are harder to game with surface-level pattern matching.
The paper also explored several techniques for improving model performance. Self-verification (allowing the model to verify and potentially correct its own answers) enhanced single-inference accuracies by up to 12.7%. Self-consistency sampling (using multiple inference samples and selecting the most common answer) reduced error rates by 19.3% on complex math problems. Multi-turn dialogue with guided human intervention through follow-up prompts allowed models to resolve 85.4% of advanced geometry inquiries, suggesting that interactive approaches could significantly improve performance even when single-shot accuracy was limited. These findings foreshadowed the rise of inference-time reasoning techniques that became central to o1, o3, and other reasoning models in 2024 and 2025.
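Self-consistency in particular is straightforward to sketch: sample several responses at nonzero temperature, extract an answer from each, and keep the most common one. The snippet below assumes a `generate` callable (prompt in, sampled response out) and a deliberately simplified answer extraction; it illustrates the general technique rather than the paper's exact setup.

```python
from collections import Counter

def self_consistency_answer(generate, prompt, n_samples=5):
    """Majority vote over several sampled responses."""
    answers = []
    for _ in range(n_samples):
        response = generate(prompt)
        # Simplified extraction: take the last whitespace-separated token.
        answers.append(response.strip().split()[-1])
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```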
The evaluation revealed clear patterns in where different model architectures excel. Augmented LLMs with Program-of-Thought performed best on tasks requiring precise computation, such as figure question answering and statistical reasoning, because the generated code could perform exact arithmetic. Multimodal Bard showed surprisingly strong performance on geometry problem solving (47.1%), approaching the human level of 48.4%, while struggling with logical reasoning (13.5%). Open-source LMMs like LLaVA (26.1%) showed a substantial gap compared to proprietary alternatives, though this gap has narrowed considerably with subsequent open-source developments such as InternVL2-Pro (65.8%), Qwen2.5-VL-72B (74.8%), and LLaVA-OneVision (67.5%).
The human performance baseline of 60.3% was established through Amazon Mechanical Turk using annotators with a minimum of a high school diploma. Workers were required to complete five questions within 20 minutes. The relatively moderate human accuracy (compared to, for example, near-perfect performance on simpler VQA benchmarks) confirms that MathVista captures genuinely difficult mathematical reasoning tasks. The 60.3% number represents an average lay-person score, not an expert score: PhD mathematicians or graduate students in STEM would likely score much higher.
Human performance varied significantly across task types:
| Task Type | Human Accuracy |
|---|---|
| Math Word Problems (MWP) | 73.0% |
| Textbook QA (TQA) | 63.2% |
| Figure QA (FQA) | 59.7% |
| Visual QA (VQA) | 55.9% |
| Geometry Problem Solving (GPS) | 48.4% |
Human performance also varied across reasoning types:
| Reasoning Type | Human Accuracy |
|---|---|
| Statistical (STA) | 63.9% |
| Arithmetic (ARI) | 59.2% |
| Numeric (NUM) | 53.8% |
| Algebraic (ALG) | 50.9% |
| Logical (LOG) | 40.7% |
The low human accuracy on logical reasoning problems (40.7%) and geometry problem solving (48.4%) indicates that these categories are intrinsically difficult, not just challenging for AI models. Human geometry performance in particular reflects the well-documented gap between recognizing a triangle on a page and computing an unknown angle from the diagram, a skill that depends heavily on prior training rather than general intelligence.
The construction of MathVista involved a careful, multi-stage curation process. The authors emphasized that benchmark quality, not benchmark size, was the primary design goal.
The team reviewed more than 70 publicly available datasets and selected 28 that contained examples requiring mathematical reasoning in visual contexts. For VQA-format datasets, heuristic rules were first applied to automatically identify examples with numeric answers or those containing quantity-related keywords. This yielded 4,949 candidate examples. Three expert annotators then manually reviewed each candidate to determine whether it genuinely involved mathematical reasoning, using majority voting to resolve disagreements. This process yielded 2,739 confirmed examples from VQA sources.
Every example in MathVista was enriched with structured metadata, including question type (multiple-choice or free-form), answer type (integer, float, or text), language, source dataset, category, task type, grade level, visual context type, and mathematical reasoning types. For the reasoning type labels, automatic annotation methods achieved 94.1% exact-match agreement with manual labels on a random sample of 1,000 examples, and individual label accuracy reached 98.79%. The richness of the metadata is one of MathVista's distinguishing features, since it allows researchers to slice the leaderboard along many axes rather than reporting only an aggregate accuracy number.
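That metadata makes per-category breakdowns straightforward to compute. A minimal sketch of slicing accuracy by task type is shown below, assuming each record carries a nested `metadata` dictionary with a `task` key (an assumption about the public schema; adjust the field names to match the actual dataset).

```python
from collections import defaultdict

def accuracy_by_task(examples, correctness):
    """Aggregate accuracy per task type.

    `examples` is an iterable of MathVista records and `correctness` a parallel
    sequence of booleans, one per example.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for example, correct in zip(examples, correctness):
        task = example["metadata"]["task"]  # assumed field layout
        totals[task] += 1
        hits[task] += int(correct)
    return {task: hits[task] / totals[task] for task in totals}
```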
For the three new datasets, three independent reviewers annotated each question. Among the 736 new questions, only 6 exhibited disagreements in the annotated answers before final consensus was reached through discussion. This low disagreement rate of less than 1% reflects the high quality of the annotation process. Subsequent academic studies have noted that MathVista is one of the cleaner multimodal math benchmarks, with fewer ambiguous, mislabeled, or unsolvable items than several competing datasets.
MathVista occupies a specific niche in the broader multimodal evaluation landscape. The following table compares MathVista with several closely related text-only and multimodal math benchmarks:
| Benchmark | Modality | Examples | Focus | Year |
|---|---|---|---|---|
| GSM8K | Text | 8,500 | Grade school math word problems | 2021 |
| MathQA | Text | 37,000 | Multiple-choice text math | 2019 |
| MATH | Text | 12,500 | Competition math | 2021 |
| MMLU | Text | 15,908 | General knowledge (incl. math) | 2020 |
| GPQA | Text | 448 | Graduate-level science | 2023 |
| HumanEval | Text | 164 | Python coding | 2021 |
| ChartQA | Image+Text | 9,608 | Chart question answering | 2022 |
| TabMWP | Image+Text | 38,431 | Table-based math | 2022 |
| GeoQA+ | Image+Text | 5,010 | Geometry problem solving | 2022 |
| MathVista | Image+Text | 6,141 | Multimodal math reasoning | 2023 |
| MMMU | Image+Text | 11,500 | College-level multimodal exam | 2024 |
| MathVerse | Image+Text | 15,672 | Diagram dependence in math | 2024 |
| MATH-Vision | Image+Text | 3,040 | Competition multimodal math | 2024 |
| WE-MATH | Image+Text | 6,500 | Knowledge concept hierarchy | 2024 |
| DynaMath | Image+Text | 5,010 (variants) | Robustness via dynamic variants | 2024 |
| CharXiv | Image+Text | 2,323 | Chart QA in scientific papers | 2024 |
| MathBench | Image+Text | 3,709 | Hierarchical math reasoning | 2024 |
Published at NeurIPS 2024, MATH-Vision provides 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 mathematical disciplines across 5 difficulty levels, MATH-V focuses on competition-level problems that are considerably more challenging than those in MathVista. Where MathVista draws from diverse educational and everyday settings, MATH-Vision specifically targets the kind of advanced, multi-step reasoning found in mathematical olympiads and standardized competitions. Several frontier models that score above 80% on MathVista still struggle to reach 50% on MATH-Vision, illustrating that MathVista does not capture the full difficulty curve of multimodal math.
Presented at ECCV 2024, MathVerse addresses a specific limitation of existing benchmarks: the tendency for textual questions to contain enough information that models can answer correctly without truly interpreting the accompanying diagrams. MathVerse collects 2,612 high-quality multi-subject math problems with diagrams and transforms each problem into six distinct versions that offer varying degrees of information in the text versus the image. This design tests whether multimodal models genuinely extract information from visual inputs or simply rely on textual cues. The resulting 15,672 test samples provide a more equitable evaluation of true visual reasoning, and many models that perform well on MathVista see substantial drops when tested on MathVerse's vision-only or vision-dominant variants.
Accepted at ACL 2025, WE-MATH focuses on the underlying principles of mathematical knowledge acquisition rather than end-to-end performance. It consists of 6,500 visual math problems spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. WE-MATH introduces a four-dimensional evaluation metric that distinguishes between Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), providing deeper insight into how models learn and apply mathematical concepts. WE-MATH and MathVista are often reported together because they probe complementary aspects of multimodal math: aggregate task performance versus underlying knowledge mastery.
DynaMath focuses on evaluating the robustness of vision-language models by dynamically generating variants of visual math problems. Rather than using static test sets, DynaMath modifies problem parameters (such as numbers, colors, or geometric properties) to test whether models have genuinely learned mathematical reasoning or have merely memorized specific problem-answer pairs. This is one of the most direct attempts to address contamination concerns that affect static benchmarks like MathVista.
The Massive Multi-discipline Multimodal Understanding and Reasoning benchmark (MMMU), released in late 2023, evaluates multimodal models on college-exam-level problems across 30 subjects. Although MMMU is broader than MathVista (covering art history, law, medicine, and many other domains in addition to math and science), the two benchmarks are usually reported side by side in frontier model release notes because they jointly cover general multimodal competence (MMMU) and visual mathematical reasoning (MathVista).
MathVista's text-only counterparts include GSM8K (grade school math word problems), MATH (competition math), MathQA (multiple-choice text math), and parts of MMLU. Frontier models often score above 95% on GSM8K, in the high 80s on MATH, and near 90% on MMLU, while their MathVista scores typically lag by 5 to 20 points because of the additional visual perception burden. This gap has narrowed substantially in 2025 to 2026 as native multimodal pretraining has matured.
Despite its broad adoption, MathVista has several recognized limitations.
Multiple-choice items make up a substantial fraction of the dataset, and even strong-looking accuracies can be partly inflated by guessing: with 2 to 8 answer options per multiple-choice question, random chance scores 17.9% overall and a frequent-guess strategy reaches 26.3%. Several published critiques have noted that overall MathVista scores can mask weaker free-form performance, and some authors prefer to report per-task or free-form-only results.
Data contamination is a real concern. Most of the 28 source datasets predate the training cutoff of frontier models released in 2024 and 2025, so direct memorization of test items is plausible. The PaperQA subset is the only portion explicitly designed to lie outside model training data (drawn from August 2023 papers), but for the rest, leakage between training corpora and test items cannot be ruled out. DynaMath was developed in part to address this concern by dynamically generating problem variants.
MathVista is largely English. Although Chinese (GeoQA+) and Persian (ParsVQA-Caps) problems are present, the benchmark cannot be used to assess multilingual multimodal math performance in any rigorous way.
The benchmark is increasingly saturated. With several frontier models above 80% on testmini and humans at 60.3%, the gradient that originally distinguished GPT-4V from open-source baselines has flattened at the top of the leaderboard. MATH-Vision, MathVerse, WE-MATH, and DynaMath have largely taken over as the harder benchmarks for distinguishing the next generation of multimodal reasoners.
The answer extraction step uses GPT-4 as a parser, which itself can fail in edge cases. The original paper estimated more than 99.5% extractor accuracy, and the small residual error is concentrated in long, ambiguous outputs from older models that produce verbose, unstructured responses.
The testmini split has become more popular than the held-out test split, partly because its labels are public. This convenience-driven preference reduces the value of the hidden test split as a contamination guardrail, since results on the public testmini are reported far more often.
Finally, MathVista does not capture every important dimension of visual math reasoning. Long multi-page proofs, interactive problem solving, multi-image reasoning across separate figures, and reasoning over video are all outside its scope. Newer benchmarks such as VideoMath, M3GIA, and MMMath fill some of these gaps.
MathVista has had a considerable impact on multimodal AI evaluation. Since its publication, it has become a standard benchmark reported in the technical papers of nearly every major multimodal model release, including GPT-4V, GPT-4o, GPT-4.1, GPT-4.5, o1, o3, Gemini 1.0/1.5/2.0, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3.2 Vision, Llama 4, Qwen-VL, Qwen2-VL, Qwen2.5-VL, Pixtral, Grok-2, InternVL, and LLaVA-OneVision. The benchmark is hosted on Hugging Face under the AI4Math organization, available on Kaggle for open benchmarking, and its evaluation code is maintained as an open-source repository on GitHub.
The benchmark's multi-dimensional taxonomy, spanning task types, reasoning types, visual contexts, and grade levels, has proven especially valuable for diagnosis. It allows researchers not only to report a single accuracy number but also to identify specific areas of strength and weakness, guiding targeted improvements in model architecture, training data, and reasoning strategies. For example, the consistently low logical-reasoning scores led several teams to add IQ-puzzle-style synthetic data to their training mixes, and the gap on PaperQA highlighted the need for scientific-figure pretraining.
The rapid progress from GPT-4V's 49.9% at launch in October 2023 to scores above 85% on testmini by early 2026 also provides one of the clearest case studies of the pace of improvement in multimodal AI. The detailed error analysis categories introduced in the MathVista paper, particularly around hallucination in mathematical reasoning, have influenced how the field thinks about and evaluates the reliability of AI-generated mathematical solutions, and many subsequent papers cite MathVista's hallucination breakdown as a methodological template.
MathVista has also served as a launchpad for researchers in the field. Lead author Pan Lu, who started the project as a UCLA PhD student under Kai-Wei Chang, was awarded a Bloomberg Data Science PhD Fellowship in part for this work, and several co-authors have gone on to lead multimodal evaluation efforts at industry labs.
In the current evaluation landscape, MathVista is rarely reported alone. The standard "multimodal math suite" for a frontier model release in 2025 to 2026 typically includes MathVista, MMMU, MathVerse, MATH-Vision, and often DocVQA, ChartQA, and AI2D as supplementary chart and document understanding tasks. Reasoning-focused models such as OpenAI's o-series, Anthropic's extended-thinking Claude variants, and Google's Gemini Deep Think branch usually report inference-time scaling curves on MathVista showing how accuracy increases with reasoning budget.
For open-source multimodal models, MathVista remains one of the central public benchmarks. The Qwen-VL series, InternVL series, LLaVA-OneVision series, MiniCPM-V series, and DeepSeek-VL series all report MathVista numbers in their release notes, and the open-source frontier has effectively closed the gap with proprietary models on this benchmark by 2025.
MathVista has also been incorporated into evaluation harnesses such as VLMEvalKit, lm-evaluation-harness's multimodal extension, and EvalScope's VLM benchmarks, making it easy for any new model to produce a directly comparable score. This standardization has helped MathVista persist as a core benchmark even as it approaches saturation, since the cost of evaluation is low and the historical scores stretch back to GPT-4V.
Community practitioners have observed that MathVista performance correlates moderately well with general multimodal reasoning ability but only weakly with specialized scientific or olympiad-level math capability. As of 2026, a high MathVista score is necessary but not sufficient evidence that a multimodal model can serve as a useful research or engineering assistant on quantitative tasks.
MathVista is freely available under a CC BY-SA 4.0 license. The accompanying usage terms state that the dataset is intended as a test set only and should not be used for training, a restriction designed to limit contamination of future model evaluations.
Key resources include the Hugging Face dataset at AI4Math/MathVista, the GitHub repository at lupantech/MathVista, the project website and leaderboard at mathvista.github.io, the paper on arXiv at arXiv:2310.02255, and the ICLR 2024 proceedings.
The dataset can be loaded directly using the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("AI4Math/MathVista")
testmini = dataset["testmini"]  # 1,000 examples
test = dataset["test"]          # 5,141 examples
```
A typical evaluation run on testmini (1,000 examples) takes roughly 30 to 90 minutes for an API-served frontier model, depending on rate limits and prompt strategy. Running the full 5,141-example test split costs proportionally more and requires submitting predictions to the official evaluation server (or Kaggle) since the labels are not publicly distributed.