MathVista is a comprehensive benchmark for evaluating the mathematical reasoning capabilities of foundation models in visual contexts. Introduced by Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao from UCLA, the University of Washington, and Microsoft Research, the benchmark consolidates 6,141 examples drawn from 28 existing multimodal datasets and three newly created datasets. Published on arXiv in October 2023 and accepted as an oral presentation at ICLR 2024 (one of only 85 orals out of 7,304 submissions, a 1.2% acceptance rate), MathVista has become one of the most widely used benchmarks for measuring how well multimodal AI systems combine visual perception with mathematical problem-solving.
Large language models and large multimodal models have demonstrated impressive problem-solving abilities across many tasks and domains, ranging from text generation and code completion to image captioning and visual question answering. However, prior to MathVista, the ability of these models to perform mathematical reasoning in visual contexts had not been studied in a systematic, unified way.
Existing benchmarks tended to focus on either textual math problems or visual question answering separately, without combining the two in a way that tests genuine compositional reasoning. A model might be able to solve algebraic equations from text or describe the contents of a chart, but the question of whether it could look at a geometry diagram and compute an unknown angle, or interpret a bar chart and perform statistical calculations, remained largely unaddressed at scale.
The MathVista authors identified several specific gaps in the existing evaluation landscape. First, many visual math datasets covered only narrow domains, such as geometry or chart reading, without spanning the full range of visual mathematical tasks. Second, certain visual contexts were missing entirely from existing benchmarks, including puzzle test figures that require inductive reasoning, functional plots that demand algebraic analysis, and scientific figures from academic papers. Third, there was no unified framework for comparing model performance across diverse mathematical reasoning types and visual contexts simultaneously.
To address these gaps, the research team designed MathVista as a meta-benchmark that aggregates problems from a broad range of sources while also filling the missing gaps with three entirely new datasets. The result is a benchmark that requires fine-grained, deep visual understanding and compositional reasoning, skills that all state-of-the-art foundation models at the time of its release found challenging.
MathVista comprises 6,141 examples sourced from 31 datasets in total: 28 existing multimodal datasets and three newly created ones. The existing datasets are divided into 9 math-targeted question answering (MathQA) datasets and 19 general visual question answering (VQA) datasets. To maintain balance, each source dataset contributes at most 400 examples to MathVista. The final dataset contains 2,666 examples from MathQA sources, 2,739 examples from VQA sources, and 736 examples from the three new datasets.
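The per-source cap can be sketched as a simple sampling step. This is an illustrative sketch only: the function name and the use of seeded random sampling are assumptions, not the authors' actual selection procedure.

```python
import random

def cap_per_source(examples_by_source, cap=400, seed=0):
    """Limit each source dataset's contribution to at most `cap` examples."""
    rng = random.Random(seed)
    capped = {}
    for source, examples in examples_by_source.items():
        if len(examples) <= cap:
            # Small sources (e.g. the new datasets) are kept whole.
            capped[source] = list(examples)
        else:
            # Large sources are down-sampled to the cap.
            capped[source] = rng.sample(examples, cap)
    return capped

# Toy illustration: a large source is capped at 400, a small one kept whole.
pools = {
    "DVQA": [f"dvqa-{i}" for i in range(5000)],
    "IQTest": [f"iq-{i}" for i in range(228)],
}
capped = cap_per_source(pools)
print(len(capped["DVQA"]), len(capped["IQTest"]))  # 400 228
```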
The following table lists all 31 source datasets incorporated into MathVista, along with their category, primary task type, grade level, and subject focus.
| Dataset | Category | Task Type | Grade Level | Subject |
|---|---|---|---|---|
| GEOS | Math-targeted | Geometry Problem Solving | High School | Geometry |
| GeoQA+ | Math-targeted | Geometry Problem Solving | High School | Geometry |
| Geometry3K | Math-targeted | Geometry Problem Solving | High School | Geometry |
| UniGeo | Math-targeted | Geometry Problem Solving | High School | Geometry |
| CLEVR-Math | Math-targeted | Math Word Problem | Elementary | Arithmetic |
| TabMWP | Math-targeted | Math Word Problem | Elementary/High School | Arithmetic |
| IconQA | Math-targeted | Math Word Problem | Elementary | Miscellaneous |
| SciBench | Math-targeted | Textbook QA | College | Miscellaneous |
| TheoremQA | Math-targeted | Textbook QA | College | Miscellaneous |
| FunctionQA | Math-targeted (New) | Textbook QA | College | Algebra |
| PaperQA | Math-targeted (New) | Figure QA | College | Miscellaneous |
| IQTest | Math-targeted (New) | Figure QA | Elementary | IQ Test |
| PlotQA | General VQA | Figure QA | N/A | Miscellaneous |
| ChartQA | General VQA | Figure QA | N/A | Miscellaneous |
| FigureQA | General VQA | Figure QA | N/A | Miscellaneous |
| DVQA | General VQA | Figure QA | N/A | Data Visualization |
| MapQA | General VQA | Figure QA | High School | Geography |
| DocVQA | General VQA | Figure QA | N/A | Miscellaneous |
| ScienceQA | General VQA | Textbook QA | Elementary/High School | Miscellaneous |
| TQA | General VQA | Textbook QA | High School | Biology |
| AI2D | General VQA | Textbook QA | High School | Miscellaneous |
| VQA2.0 | General VQA | Visual QA | N/A | Miscellaneous |
| VQA-AS | General VQA | Visual QA | N/A | Miscellaneous |
| Super-CLEVR | General VQA | Visual QA | N/A | Miscellaneous |
| A-OKVQA | General VQA | Visual QA | N/A | Miscellaneous |
| TextVQA | General VQA | Visual QA | N/A | Miscellaneous |
| VizWiz | General VQA | Visual QA | N/A | Miscellaneous |
| KVQA | General VQA | Visual QA | N/A | Miscellaneous |
| PMC-VQA | General VQA | Visual QA | College | Medicine |
| VQA-RAD | General VQA | Visual QA | College | Medicine |
| ParsVQA-Caps | General VQA | Visual QA | N/A | Miscellaneous |
To fill specific gaps in existing benchmarks, the MathVista team created three entirely new datasets. All questions in these datasets were manually annotated by graduate students in STEM fields and further refined through a rigorous review process involving three independent reviewers with 99.2% inter-annotation consistency.
IQTest (228 examples): This dataset evaluates logical and inductive reasoning using puzzle test figures sourced from online learning platforms. The problems require pattern recognition, abstract thinking, and calculations based on visual sequences and matrices. These problems test a form of visual intelligence that is fundamentally different from the domain-specific math tested in most other datasets.
FunctionQA (400 examples): This dataset focuses on algebraic reasoning over functional plots. Problems require the model to analyze graphs of mathematical functions and answer questions involving variables, expressions, equations, and function properties such as domain, range, intercepts, and asymptotes. FunctionQA demands subtle visual perception of plot features combined with formal algebraic knowledge.
PaperQA (107 examples): This dataset presents questions derived from informative illustrations in academic papers, including tables, figures, and charts. The source papers were released in August 2023, ensuring that the content is not present in the training data of models evaluated on MathVista. PaperQA tests scientific reasoning and the ability to extract and compute with quantitative information from research publications.
MathVista organizes its problems along three complementary taxonomic dimensions: mathematical reasoning type, task type, and visual context type. This multi-dimensional classification enables fine-grained analysis of where models succeed and where they fail.
The benchmark encompasses seven distinct categories of mathematical reasoning:
| Reasoning Type | Abbreviation | Description |
|---|---|---|
| Algebraic | ALG | Reasoning about variables, expressions, equations, and functions |
| Arithmetic | ARI | Basic numerical computation including addition, subtraction, multiplication, and division |
| Geometric | GEO | Reasoning about shapes, angles, areas, volumes, and spatial relationships |
| Logical | LOG | Deductive and inductive reasoning, pattern recognition, and logical inference |
| Numeric | NUM | Numeric common sense and estimation in everyday contexts |
| Scientific | SCI | Reasoning about scientific concepts, formulas, and phenomena |
| Statistical | STA | Interpreting data distributions, calculating statistics, and reading charts |
The annotation of reasoning types was performed through a combination of automatic classification and manual labeling. On a random sample of 1,000 examples, automatic annotations agreed with human labels 94.1% of the time, confirming the reliability of the classification process.
Problems in MathVista fall into five main task categories:
Figure Question Answering (FQA): Questions that require statistical reasoning over charts, plots, and figures. These problems typically ask the model to read values from bar charts, identify trends in line plots, or compare data points in scatter plots.
Geometry Problem Solving (GPS): Problems that deal with geometric topics such as computing angles, areas, or side lengths from diagrams. These often come from standardized math tests and textbook exercises.
Math Word Problems (MWP): Arithmetic reasoning problems set in everyday scenarios. The visual component may include synthetic scenes, abstract diagrams, or tables that provide the data needed to solve the problem.
Textbook Question Answering (TQA): Knowledge-intensive reasoning problems based on scientific topics and figures, typically drawn from educational textbooks. These problems may require understanding domain-specific concepts in addition to mathematical reasoning.
Visual Question Answering (VQA): General visual reasoning questions that involve mathematical components, such as counting objects in a natural image, estimating quantities, or performing arithmetic based on visual information.
The benchmark includes a diverse array of visual contexts spanning more than a dozen distinct categories, ranging from geometry diagrams, function plots, and synthetic scenes to bar charts, tables, scientific figures, and natural images.
This diversity ensures that models cannot rely on a single type of visual understanding and must generalize across fundamentally different image types.
Problems span four difficulty tiers based on the educational level required to solve them: elementary school, high school, college, and not applicable (for problems without a defined grade level).
MathVista includes two question types: multiple-choice questions, scored against a fixed set of answer options, and free-form questions, whose answers are integers, floats, or text.
The dataset is available in three languages: English, Chinese (from GeoQA+), and Persian (from ParsVQA-Caps), though the majority of examples are in English.
MathVista is divided into two official splits: testmini, a 1,000-example subset intended for model development and validation, and test, the remaining 5,141 examples, whose answers are withheld.
This design allows researchers to develop and debug their methods on testmini while preventing overfitting to the full test set.
The evaluation pipeline for MathVista consists of three stages: response generation, answer extraction, and score calculation.
During the first stage, each model generates responses given an input query. The query incorporates the task description, the question text, the answer choices (for multiple-choice problems), and relevant metadata. For augmented LLM approaches that cannot directly process images, visual information is converted to text through image captions and OCR (optical character recognition) before being fed to the model.
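Query assembly can be sketched roughly as follows. This is a minimal sketch: the field names and prompt wording are assumptions, not the official template.

```python
def build_query(example, caption=None, ocr_text=None):
    """Assemble the text prompt for one MathVista example.

    `caption` and `ocr_text` are only supplied in augmented-LLM setups,
    where visual content must first be rendered as text.
    """
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if ocr_text:
        parts.append(f"OCR text: {ocr_text}")
    parts.append(f"Question: {example['question']}")
    if example.get("choices"):  # multiple-choice problems include options
        options = ", ".join(
            f"({chr(65 + i)}) {c}" for i, c in enumerate(example["choices"])
        )
        parts.append(f"Choices: {options}")
    parts.append("Answer:")
    return "\n".join(parts)

ex = {"question": "Which year has the highest bar?", "choices": ["2010", "2012"]}
print(build_query(ex, caption="A bar chart of yearly sales."))
```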
Because different models produce answers in different formats, a standardized answer extraction step is necessary. The MathVista team developed an answer extractor based on GPT-4 that parses model outputs and identifies the final answer. A preliminary study on 200 examples showed that GPT-4 can extract the answer text with more than 99.5% accuracy, making this approach highly reliable.
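A rule-based stand-in for this stage might look like the following. The real extractor prompts GPT-4 with the model's response; this regex version only illustrates the interface, and the function name and heuristics are assumptions.

```python
import re

def extract_answer(response, question_type, choices=None):
    """Pull a final answer out of a free-form model response.

    Rule-based stand-in for MathVista's GPT-4-based extractor: for
    multiple-choice questions, look for an option letter; for free-form
    questions, take the last number mentioned in the response.
    """
    if question_type == "multi_choice" and choices:
        m = re.search(r"\(([A-E])\)", response) or re.search(r"\b([A-E])\b", response)
        if m:
            index = ord(m.group(1)) - ord("A")
            if index < len(choices):
                return choices[index]
        return None
    numbers = re.findall(r"-?\d+\.?\d*", response)
    return numbers[-1] if numbers else None

print(extract_answer("The angle is therefore 35 degrees.", "free_form"))         # 35
print(extract_answer("The correct option is (B).", "multi_choice", ["10", "12"]))  # 12
```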
For both multiple-choice and free-form questions, accuracy is used as the evaluation metric. Multiple-choice questions are scored by exact match with the correct answer option, while free-form questions with numerical answers allow for small tolerance based on the specified precision level.
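The scoring rule can be sketched like this. This is a hedged sketch: the parameter names mirror the metadata fields described later in this article, but the exact matching logic lives in the official evaluation code.

```python
def score_answer(prediction, target, answer_type="text", precision=None):
    """Accuracy scoring: exact match for text/choice answers, with a
    tolerance for floats rounded to the example's stated precision."""
    if prediction is None:
        return False
    if answer_type == "integer":
        try:
            return int(float(prediction)) == int(float(target))
        except ValueError:
            return False
    if answer_type == "float":
        try:
            digits = precision if precision is not None else 2
            return round(float(prediction), digits) == round(float(target), digits)
        except ValueError:
            return False
    # Text / multiple-choice answers: case-insensitive exact match.
    return str(prediction).strip().lower() == str(target).strip().lower()

print(score_answer("3.142", "3.14159", answer_type="float", precision=2))  # True
print(score_answer("B", "b"))                                              # True
```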
The original paper evaluated models under several prompting strategies, including zero-shot prompting, few-shot (2-shot) prompting, chain-of-thought (CoT) prompting, and program-of-thought (PoT) prompting.
The original MathVista paper evaluated 12 prominent foundation models on the testmini subset. The results revealed a significant performance gap between the best model and human-level performance.
| Model | Type | Overall Accuracy |
|---|---|---|
| Human Performance | Baseline | 60.3% |
| GPT-4V | Proprietary LMM | 49.9% |
| Multimodal Bard | Proprietary LMM | 34.8% |
| PoT GPT-4 (Caption+OCR) | Augmented LLM | 33.9% |
| CoT GPT-4 (Caption+OCR) | Augmented LLM | 33.2% |
| CoT Claude-2 (Caption+OCR) | Augmented LLM | 33.2% |
| CoT GPT-4 (text-only) | Text-only LLM | 29.2% |
| CoT ChatGPT (2-shot) | Text-only LLM | 26.8% |
| Frequent Guess | Baseline | 26.3% |
| LLaVA (LLaMA-2-13B) | Open-source LMM | 26.1% |
| Zero-shot GPT-4 (text-only) | Text-only LLM | 26.1% |
| InstructBLIP (Vicuna-7B) | Open-source LMM | 25.3% |
| LLaVAR | Open-source LMM | 25.2% |
| LLaMA-Adapter-V2-7B | Open-source LMM | 23.9% |
| miniGPT4 (LLaMA-2-7B) | Open-source LMM | 23.1% |
| mPLUG-Owl (LLaMA-7B) | Open-source LMM | 22.2% |
| IDEFICS-9B-Instruct | Open-source LMM | 19.8% |
| Random Chance | Baseline | 17.9% |
GPT-4V achieved an overall accuracy of 49.9%, substantially outperforming Multimodal Bard, the second-best performer, by 15.1 percentage points. However, GPT-4V still fell short of human performance by 10.4 percentage points, highlighting the remaining challenge of visual mathematical reasoning for AI systems.
GPT-4V's performance varied considerably across different reasoning types and task categories, revealing specific strengths and weaknesses:
| Category | GPT-4V Accuracy | Human Accuracy |
|---|---|---|
| **Task Types** | | |
| Textbook QA (TQA) | 65.2% | 63.2% |
| Math Word Problems (MWP) | 57.5% | 73.0% |
| Geometry Problem Solving (GPS) | 50.5% | 48.4% |
| Figure QA (FQA) | 43.1% | 59.7% |
| Visual QA (VQA) | 38.0% | 55.9% |
| **Reasoning Types** | | |
| Scientific (SCI) | 63.1% | N/A |
| Statistical (STA) | 55.8% | 63.9% |
| Algebraic (ALG) | 53.0% | 50.9% |
| Geometric (GEO) | 51.0% | N/A |
| Arithmetic (ARI) | 49.0% | 59.2% |
| Logical (LOG) | 21.6% | 40.7% |
| Numeric (NUM) | 20.1% | 53.8% |
Notably, GPT-4V surpassed human performance on Geometry Problem Solving (GPS) and Textbook QA (TQA) tasks, as well as on algebraic reasoning. However, the model struggled severely with logical reasoning (21.6% vs. 40.7% human) and numeric common sense (20.1% vs. 53.8% human), indicating that certain types of visual-mathematical reasoning remain particularly difficult for current AI systems.
As newer and more powerful multimodal models have been released, performance on MathVista has improved dramatically. The following table shows selected results from the evolving leaderboard as of early 2026:
| Model | Developer | MathVista Accuracy |
|---|---|---|
| Kimi K2.5 | Moonshot AI | 90.1% (testmini) |
| o3 | OpenAI | 86.8% |
| o4-mini | OpenAI | 84.3% |
| o1 | OpenAI | 73.9% (testmini) / 71.8% |
| Llama 4 Maverick | Meta | 73.7% |
| GPT-4.5 | OpenAI | 72.3% |
| GPT-4.1 | OpenAI | 72.2% |
| Pixtral Large | Mistral AI | 69.4% |
| Grok-2 | xAI | 69.0% |
| Gemini 1.5 Pro | Google | 68.1% |
| Claude 3.5 Sonnet | Anthropic | 67.7% |
| InternVL2-Pro | Shanghai AI Lab | 65.8% |
| GPT-4o (May 2024) | OpenAI | 63.8% |
| Human Performance | N/A | 60.3% |
Multiple models now surpass human-level performance on MathVista. GPT-4o was among the first to cross this threshold, achieving 63.8% compared to the 60.3% human baseline. OpenAI's reasoning-focused models (o1 and o3) have pushed performance even higher, with o3 reaching 86.8% accuracy. These results suggest that while MathVista was extremely challenging at its introduction, rapid progress in multimodal reasoning has closed, and then reversed, the gap between AI and human performance.
One of the most significant findings from the original MathVista study was the prevalence of hallucination in model outputs. In a detailed analysis of 250 predictions from Multimodal Bard, the researchers found that 49.6% of responses contained hallucinations, where the model introduced facts not mentioned in the context of the image or question.
The error analysis revealed several distinct failure categories:
| Error Type | Percentage |
|---|---|
| Incorrect answer with incorrect explanation | 44.6% |
| Wrong calculations despite correct reasoning approach | 19.5% |
| Correct reasoning with hallucinated facts | 18.6% |
| Correct answer with partially or completely incorrect explanation | 14.9% |
| Correct answer obtained through wrong reasons | 8.1% |
The finding that 8.1% of responses arrived at the correct answer through entirely wrong reasoning is particularly noteworthy. It indicates that evaluation based solely on final answers can overestimate a model's true mathematical understanding.
The paper also examined how prompting strategy affects performance: program-of-thought prompting, which generates executable code rather than free-form reasoning, slightly outperformed chain-of-thought prompting for GPT-4 with captions and OCR (33.9% vs. 33.2%).
The evaluation revealed clear patterns across model architectures: proprietary LMMs such as GPT-4V and Multimodal Bard led overall; augmented LLMs that pair GPT-4 or Claude-2 with image captions and OCR came next; and the open-source LMMs of the time clustered between 19.8% and 26.1%, only modestly above the 17.9% random-chance baseline.
The human performance baseline of 60.3% was established through Amazon Mechanical Turk using annotators with a minimum of a high school diploma. Workers were required to complete five questions within 20 minutes. The relatively moderate human accuracy (compared to, for example, near-perfect performance on simpler VQA benchmarks) confirms that MathVista captures genuinely difficult mathematical reasoning tasks.
Human performance varied significantly across task types:
| Task Type | Human Accuracy |
|---|---|
| Math Word Problems (MWP) | 73.0% |
| Textbook QA (TQA) | 63.2% |
| Figure QA (FQA) | 59.7% |
| Visual QA (VQA) | 55.9% |
| Geometry Problem Solving (GPS) | 48.4% |
Human performance also varied across reasoning types:
| Reasoning Type | Human Accuracy |
|---|---|
| Statistical (STA) | 63.9% |
| Arithmetic (ARI) | 59.2% |
| Numeric (NUM) | 53.8% |
| Algebraic (ALG) | 50.9% |
| Logical (LOG) | 40.7% |
The low human accuracy on logical reasoning problems (40.7%) and geometry problem solving (48.4%) indicates that these categories are intrinsically difficult, not just challenging for AI models.
The construction of MathVista involved a careful, multi-stage curation process.
The team reviewed more than 70 publicly available datasets and selected 28 that contained examples requiring mathematical reasoning in visual contexts. For VQA-format datasets, heuristic rules were first applied to automatically identify examples with numeric answers or those containing quantity-related keywords. This yielded 4,949 candidate examples. Three expert annotators then manually reviewed each candidate to determine whether it genuinely involved mathematical reasoning, using majority voting to resolve disagreements. This process yielded 2,739 confirmed examples from VQA sources.
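The two-stage filter (heuristic rules, then manual majority voting) might be sketched as follows. The keyword list and function names are illustrative assumptions, not the authors' actual rules.

```python
# Illustrative keyword list; the paper's heuristics are not published verbatim.
QUANTITY_KEYWORDS = {"how many", "how much", "number of", "total", "difference",
                     "average", "sum", "percent"}

def is_math_candidate(question, answer):
    """Heuristic first pass: keep examples with numeric answers or
    quantity-related phrasing for later manual review."""
    if answer.replace(".", "", 1).lstrip("-").isdigit():
        return True
    q = question.lower()
    return any(kw in q for kw in QUANTITY_KEYWORDS)

def majority_vote(labels):
    """Resolve three annotators' keep/drop decisions by majority."""
    return sum(labels) >= 2

print(is_math_candidate("How many dogs are in the picture?", "a park"))  # True
print(majority_vote([True, False, True]))                                # True
```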
Every example in MathVista was enriched with structured metadata, including question type (multiple-choice or free-form), answer type (integer, float, or text), language, source dataset, category, task type, grade level, visual context type, and mathematical reasoning types. For the reasoning type labels, automatic annotation methods achieved 94.1% exact-match agreement with manual labels on a random sample of 1,000 examples, and individual label accuracy reached 98.79%.
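Under this schema, a single annotated example might look like the following. The field names follow the description above, but the values are invented, and the released dataset may nest some of these fields under a metadata key.

```python
# Illustrative record showing the metadata schema described above;
# field names follow the description, values are invented.
example = {
    "question": "What is the value of the tallest bar?",
    "question_type": "free_form",    # or "multi_choice"
    "answer_type": "integer",        # integer, float, or text
    "answer": "48",
    "language": "english",
    "source": "DVQA",
    "category": "general-vqa",       # vs. "math-targeted-vqa"
    "task": "figure question answering",
    "grade": "not applicable",
    "context": "bar chart",
    "skills": ["statistical reasoning"],
}
print(example["task"], "-", example["context"])  # figure question answering - bar chart
```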
For the three new datasets, three independent reviewers annotated each question. Among the 736 new questions, only 6 exhibited disagreements in the annotated answers before final consensus was reached through discussion. This low disagreement rate of less than 1% reflects the high quality of the annotation process.
MathVista's release inspired a wave of follow-up benchmarks that extend, refine, or complement its evaluation approach.
Published at NeurIPS 2024, MATH-Vision provides 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 mathematical disciplines across 5 difficulty levels, MATH-V focuses on competition-level problems that are considerably more challenging than those in MathVista. Where MathVista draws from diverse educational and everyday settings, MATH-Vision specifically targets the kind of advanced, multi-step reasoning found in mathematical olympiads and standardized competitions.
Presented at ECCV 2024, MathVerse addresses a specific limitation of existing benchmarks: the tendency for textual questions to contain enough information that models can answer correctly without truly interpreting the accompanying diagrams. MathVerse collects 2,612 high-quality multi-subject math problems with diagrams and transforms each problem into six distinct versions that offer varying degrees of information in the text versus the image. This design tests whether multimodal models genuinely extract information from visual inputs or simply rely on textual cues. The resulting 15,000 test samples provide a more equitable evaluation of true visual reasoning.
Accepted at ACL 2025, WE-MATH focuses on the underlying principles of mathematical knowledge acquisition rather than end-to-end performance. It consists of 6,500 visual math problems spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. WE-MATH introduces a four-dimensional evaluation metric that distinguishes between Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), providing deeper insight into how models learn and apply mathematical concepts.
DynaMath focuses on evaluating the robustness of vision-language models by dynamically generating variants of visual math problems. Rather than using static test sets, DynaMath modifies problem parameters (such as numbers, colors, or geometric properties) to test whether models have genuinely learned mathematical reasoning or have merely memorized specific problem-answer pairs.
MathVista has had a considerable impact on the field of multimodal AI evaluation. Since its publication, it has become a standard benchmark reported in the technical papers of nearly every major multimodal model release, including GPT-4o, Gemini, Claude, Llama, Qwen-VL, and InternVL. The benchmark is hosted on Hugging Face (under the AI4Math organization), available on Kaggle for open benchmarking, and its evaluation code is maintained as an open-source repository on GitHub.
The benchmark's multi-dimensional taxonomy, spanning task types, reasoning types, visual contexts, and grade levels, has proven especially valuable. It allows researchers not only to report a single accuracy number but also to identify specific areas of strength and weakness, guiding targeted improvements in model architecture, training data, and reasoning strategies.
The rapid progress from GPT-4V's 49.9% at launch to o3's 86.8% in early 2026 also provides a compelling narrative about the pace of improvement in multimodal AI. At the same time, the detailed error analysis categories introduced in the MathVista paper, particularly around hallucination in mathematical reasoning, have influenced how the field thinks about and evaluates the reliability of AI-generated mathematical solutions.
MathVista is freely available under a CC-BY-SA-4.0 license. Key resources include the dataset on Hugging Face (under the AI4Math organization), the Kaggle release, and the open-source evaluation code on GitHub.
The dataset can be loaded directly using the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("AI4Math/MathVista")
testmini = dataset["testmini"]  # 1,000 examples
test = dataset["test"]          # 5,141 examples
```
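With the splits loaded, the metadata described earlier supports fine-grained slicing. Below is a sketch using a toy stand-in for the real examples; the field layout (a `metadata` dict carrying `task`) is an assumption.

```python
def filter_by_task(examples, task):
    """Select examples whose metadata matches a given task type."""
    return [ex for ex in examples if ex["metadata"]["task"] == task]

# Toy stand-in for dataset["testmini"]; real records carry richer fields.
toy = [
    {"pid": "1", "metadata": {"task": "geometry problem solving"}},
    {"pid": "2", "metadata": {"task": "figure question answering"}},
]
gps = filter_by_task(toy, "geometry problem solving")
print([ex["pid"] for ex in gps])  # ['1']
```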