MathVista is a benchmark for evaluating the mathematical reasoning capabilities of foundation models in visual contexts. It was introduced by Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao from the University of California, Los Angeles, the University of Washington, and Microsoft Research. The benchmark consolidates 6,141 examples drawn from 28 existing multimodal datasets and three newly created datasets. Posted to arXiv in October 2023 (arXiv:2310.02255) and accepted as an oral presentation at ICLR 2024 (one of only 85 orals out of roughly 7,300 submissions, a 1.2% acceptance rate), MathVista has become one of the most widely used benchmarks for measuring how well multimodal AI systems combine visual perception with mathematical problem solving. It is reported in nearly every major frontier model release and forms a standard component of the multimodal evaluation suite alongside MMMU, MathVerse, and MATH-Vision.
MathVista was the first large-scale, unified benchmark to combine challenging visual perception (charts, plots, geometry diagrams, scientific figures, document images, puzzle test patterns) with rigorous mathematical reasoning (algebra, arithmetic, geometry, logic, statistics, scientific computation). Earlier benchmarks tended to test either text-based math (such as GSM8K, MATH, MathQA) or pure visual question answering, but not the compositional skill of reading a diagram and solving a math problem grounded in it. By aggregating 28 existing datasets and creating three new ones, for 31 sources in total, MathVista became a benchmark broad enough to test general multimodal mathematical competence rather than narrow domain skill.
Large language models and large multimodal models have shown strong problem-solving abilities across many tasks, ranging from text generation and code completion to image captioning and visual question answering. Before MathVista, however, the ability of these models to perform mathematical reasoning in visual contexts had not been studied in a systematic, unified way. The lack of a shared benchmark made cross-model comparisons difficult and obscured a critical question: can a model that excels at text math actually solve a geometry problem when the diagram is the primary input?
Existing benchmarks tended to focus on either textual math problems or visual question answering separately, without combining the two in a way that tests genuine compositional reasoning. A model might be able to solve algebraic equations from text or describe the contents of a chart, but the question of whether it could look at a geometry diagram and compute an unknown angle, or interpret a bar chart and perform statistical calculations, remained largely unaddressed at scale.
The MathVista authors identified several specific gaps in the existing evaluation landscape. First, many visual math datasets covered only narrow domains, such as geometry or chart reading, without spanning the full range of visual mathematical tasks. Second, certain visual contexts were missing entirely from existing benchmarks, including puzzle test figures that require inductive reasoning, functional plots that demand algebraic analysis, and scientific figures from academic papers. Third, there was no unified framework for comparing model performance across diverse mathematical reasoning types and visual contexts simultaneously. Fourth, evaluation pipelines were inconsistent across studies, with some teams using exact match, others using free-form scoring, and others using human evaluation, making leaderboard comparisons unreliable.
To address these gaps, the research team designed MathVista as a meta-benchmark that aggregates problems from a broad range of sources and fills the remaining coverage gaps with three entirely new datasets. The resulting benchmark requires fine-grained, deep visual understanding and compositional reasoning, skills that all state-of-the-art foundation models at the time of its release found challenging.
MathVista comprises 6,141 examples sourced from 31 datasets in total: 28 existing multimodal datasets and three newly created ones. The existing datasets are divided into 9 math-targeted question answering (MathQA) datasets and 19 general visual question answering (VQA) datasets. To maintain balance, each source dataset contributes up to 400 examples to MathVista. The final dataset contains 2,666 examples from MathQA sources, 2,739 examples from VQA sources, and 736 examples from the three new datasets.
The following table lists all 31 source datasets incorporated into MathVista, along with their category, primary task type, grade level, and subject focus.
| Dataset | Category | Task Type | Grade Level | Subject |
|---|---|---|---|---|
| GEOS | Math-targeted | Geometry Problem Solving | High School | Geometry |
| GeoQA+ | Math-targeted | Geometry Problem Solving | High School | Geometry |
| Geometry3K | Math-targeted | Geometry Problem Solving | High School | Geometry |
| UniGeo | Math-targeted | Geometry Problem Solving | High School | Geometry |
| CLEVR-Math | Math-targeted | Math Word Problem | Elementary | Arithmetic |
| TabMWP | Math-targeted | Math Word Problem | Elementary/High School | Arithmetic |
| IconQA | Math-targeted | Math Word Problem | Elementary | Miscellaneous |
| SciBench | Math-targeted | Textbook QA | College | Miscellaneous |
| TheoremQA | Math-targeted | Textbook QA | College | Miscellaneous |
| FunctionQA | Math-targeted (New) | Textbook QA | College | Algebra |
| PaperQA | Math-targeted (New) | Figure QA | College | Miscellaneous |
| IQTest | Math-targeted (New) | Figure QA | Elementary | IQ Test |
| PlotQA | General VQA | Figure QA | N/A | Miscellaneous |
| ChartQA | General VQA | Figure QA | N/A | Miscellaneous |
| FigureQA | General VQA | Figure QA | N/A | Miscellaneous |
| DVQA | General VQA | Figure QA | N/A | Data Visualization |
| MapQA | General VQA | Figure QA | High School | Geography |
| DocVQA | General VQA | Figure QA | N/A | Miscellaneous |
| ScienceQA | General VQA | Textbook QA | Elementary/High School | Miscellaneous |
| TQA | General VQA | Textbook QA | High School | Biology |
| AI2D | General VQA | Textbook QA | High School | Miscellaneous |
| VQA2.0 | General VQA | Visual QA | N/A | Miscellaneous |
| VQA-AS | General VQA | Visual QA | N/A | Miscellaneous |
| Super-CLEVR | General VQA | Visual QA | N/A | Miscellaneous |
| A-OKVQA | General VQA | Visual QA | N/A | Miscellaneous |
| TextVQA | General VQA | Visual QA | N/A | Miscellaneous |
| VizWiz | General VQA | Visual QA | N/A | Miscellaneous |
| KVQA | General VQA | Visual QA | N/A | Miscellaneous |
| PMC-VQA | General VQA | Visual QA | College | Medicine |
| VQA-RAD | General VQA | Visual QA | College | Medicine |
| ParsVQA-Caps | General VQA | Visual QA | N/A | Miscellaneous |
The table illustrates one of MathVista's key design choices: rather than restricting itself to clean, math-focused inputs, the benchmark deliberately incorporates messy real-world visual contexts including medical scans (PMC-VQA, VQA-RAD), document scans (DocVQA), natural photographs (VQA2.0, A-OKVQA, TextVQA), and accessibility images (VizWiz). The intuition is that any model claiming general multimodal competence should be able to perform mathematical reasoning even when the visual context is not explicitly designed for math.
To fill specific gaps in existing benchmarks, the MathVista team created three entirely new datasets. All questions in these datasets were manually annotated by graduate students in STEM fields and further refined through a rigorous review process involving three independent reviewers with 99.2% inter-annotation consistency.
IQTest contains 228 examples and evaluates logical and inductive reasoning using puzzle test figures sourced from online learning platforms. The problems require pattern recognition, abstract thinking, and calculations based on visual sequences and matrices. These problems test a form of visual intelligence that is fundamentally different from the domain-specific math tested in most other datasets, and they remain among the hardest items in MathVista even for current frontier models.
FunctionQA contains 400 examples and focuses on algebraic reasoning over functional plots. Problems require the model to analyze graphs of mathematical functions and answer questions involving variables, expressions, equations, and function properties such as domain, range, intercepts, asymptotes, monotonicity, and symmetry. FunctionQA demands subtle visual perception of plot features (axis labels, tick marks, gridlines) combined with formal algebraic knowledge, and was the first dataset to specifically target this combination at scale.
PaperQA contains 107 examples and presents questions derived from informative illustrations in academic papers, including tables, figures, and charts. The source papers were released in August 2023, after the training cutoff of GPT-4V, ensuring that the content is not present in the training data of models evaluated on MathVista at release. PaperQA tests scientific reasoning and the ability to extract and compute with quantitative information from research publications, a skill increasingly relevant for AI assistants used in research workflows.
MathVista organizes its problems along three complementary taxonomic dimensions: mathematical reasoning type, task type, and visual context type. This multi-dimensional classification enables fine-grained analysis of where models succeed and where they fail, rather than collapsing performance into a single accuracy score.
The benchmark encompasses seven distinct categories of mathematical reasoning:
| Reasoning Type | Abbreviation | Description |
|---|---|---|
| Algebraic | ALG | Reasoning about variables, expressions, equations, and functions |
| Arithmetic | ARI | Basic numerical computation including addition, subtraction, multiplication, and division |
| Geometric | GEO | Reasoning about shapes, angles, areas, volumes, and spatial relationships |
| Logical | LOG | Deductive and inductive reasoning, pattern recognition, and logical inference |
| Numeric | NUM | Numeric common sense and estimation in everyday contexts |
| Scientific | SCI | Reasoning about scientific concepts, formulas, and phenomena |
| Statistical | STA | Interpreting data distributions, calculating statistics, and reading charts |
The annotation of reasoning types was performed through a combination of automatic classification and manual labeling. On a random sample of 1,000 examples, the automatic annotations matched the human labels exactly 94.1% of the time, and accuracy for individual reasoning-type labels reached 98.79%, confirming the reliability of the classification process.
Problems in MathVista fall into five main task categories. Figure question answering (FQA) covers questions that require statistical reasoning over charts, plots, and figures: reading values from bar charts, identifying trends in line plots, or comparing data points in scatter plots. Geometry problem solving (GPS) deals with geometric topics such as computing angles, areas, or side lengths from diagrams, and is sourced largely from standardized math tests and textbook exercises. Math word problems (MWP) are arithmetic reasoning problems set in everyday scenarios where the visual component may include synthetic scenes, abstract diagrams, or tables. Textbook question answering (TQA) covers knowledge-intensive reasoning problems based on scientific topics and figures from educational textbooks, often requiring domain-specific concepts in addition to mathematical reasoning. Visual question answering (VQA) refers to general visual reasoning questions that involve mathematical components, such as counting objects in a natural image, estimating quantities, or performing arithmetic based on visual information.
MathVista includes a diverse array of visual contexts spanning more than a dozen distinct categories: geometry diagrams, synthetic scenes (such as CLEVR-style rendered images), bar charts, natural images, scientific figures, tables, function plots, abstract scenes, puzzle test figures, scatter plots, line plots, pie charts, document images, medical images, and map charts. This diversity ensures that models cannot rely on a single type of visual understanding and must generalize across fundamentally different image types. A model that excels at chart reading but cannot interpret a CT scan, or one that handles geometry diagrams but stumbles on hand-drawn puzzle figures, will see its performance differ markedly across the visual context breakdown.
Problems span four difficulty tiers based on the educational level required to solve them: elementary school (basic counting, simple arithmetic, pattern recognition), high school (geometry, algebra, data interpretation), college (calculus, advanced algebra, scientific reasoning, statistical analysis), and daily life or not applicable (practical numeric reasoning and estimation). Many of the most difficult problems come from college-level scientific contexts, particularly the SciBench, TheoremQA, FunctionQA, and PaperQA subsets.
MathVista includes two types of questions. Multiple-choice questions ask the model to select from a set of provided answer options, with between 2 and 8 choices per question. Free-form questions require the model to generate a specific answer, which can be an integer, a floating-point number, or a text string. The mix of question formats is intentional: multiple-choice items can be partly guessed and benefit from process-of-elimination, while free-form items demand a precise answer and provide a stricter test of true reasoning.
The dataset is available in three languages: English, Chinese (from GeoQA+), and Persian (from ParsVQA-Caps), though the majority of examples are in English.
MathVista is divided into two official splits. The testmini split contains 1,000 examples and serves as a development and validation subset designed for fast iteration. It was constructed to be statistically representative of the full test set, with a KL divergence of only 0.008 and a total variation distance of 0.035 from the full distribution. Ground-truth answers for testmini are publicly available, which makes it the de facto standard for reported scores in model release papers.
The test split contains 5,141 examples and is the main evaluation split with private ground-truth labels. Researchers must submit predictions to the MathVista evaluation server (or the Kaggle leaderboard) to receive scores on this split. The hidden test labels prevent overfitting and make the test split a more rigorous benchmark for academic comparison, although in practice most published numbers refer to testmini because of its open ground truth and shorter evaluation time.
This split design allows researchers to develop and debug their methods on testmini while the hidden test labels guard against overfitting to the full test set. In practice, testmini has nonetheless become the more frequently cited figure, partly because of its public ground truth and partly because the 1,000-example size keeps API costs manageable when evaluating expensive proprietary models.
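For readers unfamiliar with the two statistics quoted for testmini, both can be computed directly from the category distributions of the two splits. The sketch below uses made-up category proportions purely for illustration; the official figures come from the real testmini and test distributions.

```python
import numpy as np

# Hypothetical proportions over, e.g., the five task types; stand-ins for
# the real testmini and full-test category distributions.
p_testmini = np.array([0.28, 0.22, 0.20, 0.16, 0.14])
p_full = np.array([0.27, 0.23, 0.21, 0.15, 0.14])

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same categories."""
    return float(np.sum(p * np.log(p / q)))

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * float(np.sum(np.abs(p - q)))

print(kl_divergence(p_testmini, p_full))   # a small value indicates a representative split
print(total_variation(p_testmini, p_full))
```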
The evaluation pipeline for MathVista consists of three stages: response generation, answer extraction, and score calculation.
During the first stage, each model generates responses given an input query. The query incorporates the task description, the question text, the answer choices (for multiple-choice problems), and relevant metadata. For augmented LLM approaches that cannot directly process images, visual information is converted to text through image captions and OCR (optical character recognition) before being fed to the model. This caption-and-OCR augmentation produced the strongest text-only baselines reported in the original paper, narrowing but not closing the gap with native multimodal models.
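A minimal sketch of this query assembly for the caption-and-OCR setting is shown below. The field names and template wording are illustrative assumptions, not the exact prompts from the MathVista repository.

```python
def build_text_only_query(example, caption, ocr_text):
    """Assemble a text-only prompt from image-derived text plus the question.

    `caption` and `ocr_text` stand in for the outputs of an external captioning
    model and an OCR system; the dictionary keys below are illustrative rather
    than the official MathVista schema.
    """
    parts = [
        "Answer the following question about an image.",
        f"Image caption: {caption}",
        f"Text detected in the image (OCR): {ocr_text}",
        f"Question: {example['question']}",
    ]
    if example.get("choices"):  # multiple-choice problems list their options
        options = ", ".join(
            f"({chr(65 + i)}) {choice}" for i, choice in enumerate(example["choices"])
        )
        parts.append(f"Choices: {options}")
    parts.append("Answer:")
    return "\n".join(parts)
```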
Because different models produce answers in different formats, a standardized answer extraction step is necessary. The MathVista team developed an answer extractor based on GPT-4 that parses model outputs and identifies the final answer. A preliminary study on 200 examples showed that GPT-4 can extract the answer text with more than 99.5% accuracy, making this approach highly reliable. The extractor handles a variety of output formats: chain-of-thought traces ending with "the answer is X," multiple-choice selections ("A"), boxed expressions (\boxed{42}), and short numerical free-form answers.
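The official extractor is LLM-based, but the listed output formats can be approximated with simple rules. The sketch below is a rule-based stand-in for the GPT-4 extractor, useful for offline debugging rather than official scoring.

```python
import re

def extract_final_answer(response: str) -> str:
    """Heuristically pull a final answer out of a model response.

    Covers the formats mentioned above: "the answer is X", \\boxed{...},
    a bare option letter, or a trailing number. The real MathVista pipeline
    delegates this step to a GPT-4-based extractor.
    """
    # Chain-of-thought endings such as "... the answer is 42."
    match = re.search(r"answer is\s*:?\s*([^\n.]+)", response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    # LaTeX \boxed{...} expressions
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match:
        return match.group(1).strip()
    # A bare multiple-choice letter, optionally parenthesized
    match = re.search(r"^\(?([A-H])\)?$", response.strip())
    if match:
        return match.group(1)
    # Fall back to the last number in the response, if any
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()
```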
For both multiple-choice and free-form questions, accuracy is used as the evaluation metric. Multiple-choice questions are scored by exact match with the correct answer option, while free-form questions with numerical answers allow for small tolerance based on the specified precision level. There is no partial credit, no chain-of-thought scoring, and no reasoning grade: a problem is either right or wrong. The simplicity is deliberate, since it makes scores reproducible and easy to compare across labs.
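A minimal sketch of this scoring rule is shown below, assuming each record exposes `question_type`, `answer_type`, `precision`, and `answer` fields as described on the public dataset card; treat the exact field names and tolerance handling as assumptions and defer to the official evaluation code.

```python
def is_correct(prediction: str, example: dict) -> bool:
    """Exact-match accuracy with a precision-based tolerance for numeric answers."""
    gold = str(example["answer"]).strip()
    pred = prediction.strip()

    # Multiple-choice: simple normalized string comparison of the chosen option.
    if example["question_type"] == "multi_choice":
        return pred.lower() == gold.lower()

    # Free-form numeric answers: round both sides to the specified precision.
    if example["answer_type"] in ("integer", "float"):
        try:
            digits = int(example.get("precision") or 0)
            return round(float(pred), digits) == round(float(gold), digits)
        except ValueError:
            return False

    # Free-form text answers.
    return pred.lower() == gold.lower()
```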
The original paper evaluated models using several prompting strategies. Zero-shot prompting gives the model only the question and image without any examples. Few-shot prompting provides a small number of solved examples before the target question. Chain-of-thought (CoT) prompts the model to show its reasoning step by step before arriving at an answer. Program-of-thought (PoT) prompts the model to generate executable code (typically Python) that computes the answer, enabling precise mathematical calculations and offloading arithmetic to a deterministic interpreter. The PoT setting was particularly effective for the augmented LLM baselines because it shifted the burden of arithmetic from the language model to a Python interpreter.
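The program-of-thought idea is easy to illustrate: prompt the model for code instead of an answer, execute the code, and read off the result. The template and execution below are a generic sketch of the technique under assumed conventions, not MathVista's official PoT pipeline, and model-generated code should be sandboxed in any real use.

```python
POT_TEMPLATE = """\
Read the problem and write Python code that computes the answer.
Store the final result in a variable named `answer`.

Problem: {problem}

Python code:
"""

def run_program_of_thought(generated_code: str):
    """Execute model-generated code in a scratch namespace and return `answer`."""
    namespace = {}
    exec(generated_code, namespace)  # illustration only; sandbox this in practice
    return namespace.get("answer")

# Hand-written stand-in for a model's code output.
fake_model_output = "angles = [35, 65]\nanswer = 180 - sum(angles)"
print(run_program_of_thought(fake_model_output))  # 80
```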
The original MathVista paper evaluated 12 prominent foundation models on the testmini subset. The results revealed a significant performance gap between the best model and human-level performance.
| Model | Type | Overall Accuracy |
|---|---|---|
| Human Performance | Baseline | 60.3% |
| GPT-4V | Proprietary LMM | 49.9% |
| Multimodal Bard | Proprietary LMM | 34.8% |
| PoT GPT-4 (Caption+OCR) | Augmented LLM | 33.9% |
| CoT GPT-4 (Caption+OCR) | Augmented LLM | 33.2% |
| CoT Claude-2 (Caption+OCR) | Augmented LLM | 33.2% |
| CoT GPT-4 (text-only) | Text-only LLM | 29.2% |
| CoT ChatGPT (2-shot) | Text-only LLM | 26.8% |
| Frequent Guess | Baseline | 26.3% |
| LLaVA (LLaMA-2-13B) | Open-source LMM | 26.1% |
| Zero-shot GPT-4 (text-only) | Text-only LLM | 26.1% |
| InstructBLIP (Vicuna-7B) | Open-source LMM | 25.3% |
| LLaVAR | Open-source LMM | 25.2% |
| LLaMA-Adapter-V2-7B | Open-source LMM | 23.9% |
| miniGPT4 (LLaMA-2-7B) | Open-source LMM | 23.1% |
| mPLUG-Owl (LLaMA-7B) | Open-source LMM | 22.2% |
| IDEFICS-9B-Instruct | Open-source LMM | 19.8% |
| Random Chance | Baseline | 17.9% |
GPT-4V achieved an overall accuracy of 49.9%, substantially outperforming Multimodal Bard, the second-best performer, by 15.1 percentage points. However, GPT-4V still fell short of human performance by 10.4 percentage points, highlighting the remaining challenge of visual mathematical reasoning for AI systems at the time of the paper.
The gap between proprietary models (GPT-4V, Multimodal Bard) and open-source LLaVA-class models was equally striking. The best open-source LMM at release scored 26.1%, slightly below the frequent-guess baseline of 26.3%. Open-source multimodal math reasoning was, in late 2023, essentially at guessing level for many subtasks.
GPT-4V's performance varied considerably across different reasoning types and task categories, revealing specific strengths and weaknesses:
| Category | GPT-4V Accuracy | Human Accuracy |
|---|---|---|
| Task: Textbook QA (TQA) | 65.2% | 63.2% |
| Task: Math Word Problems (MWP) | 57.5% | 73.0% |
| Task: Geometry Problem Solving (GPS) | 50.5% | 48.4% |
| Task: Figure QA (FQA) | 43.1% | 59.7% |
| Task: Visual QA (VQA) | 38.0% | 55.9% |
| Reasoning: Scientific (SCI) | 63.1% | N/A |
| Reasoning: Statistical (STA) | 55.8% | 63.9% |
| Reasoning: Algebraic (ALG) | 53.0% | 50.9% |
| Reasoning: Geometric (GEO) | 51.0% | N/A |
| Reasoning: Arithmetic (ARI) | 49.0% | 59.2% |
| Reasoning: Logical (LOG) | 21.6% | 40.7% |
| Reasoning: Numeric (NUM) | 20.1% | 53.8% |
GPT-4V surpassed human performance on Geometry Problem Solving (GPS) and Textbook QA (TQA) tasks, as well as on algebraic reasoning. The model struggled severely with logical reasoning (21.6% vs. 40.7% human) and numeric common sense (20.1% vs. 53.8% human), indicating that certain types of visual-mathematical reasoning remain particularly difficult for current AI systems. The Logical and Numeric categories largely overlap with the IQTest puzzle figures, where pattern recognition and abstract inference dominate, and where pretraining on web text and standardized math problems offers little help.
As newer and more powerful multimodal models have been released, performance on MathVista has improved dramatically. The following table shows selected results from the evolving leaderboard, drawn from official model release reports and the project's public leaderboard:
| Model | Developer | MathVista Accuracy (testmini) | Approximate Date |
|---|---|---|---|
| Kimi K2.5 | Moonshot AI | ~90.1% | 2026 |
| o3 | OpenAI | ~86.8% | 2025 |
| o4-mini | OpenAI | ~84.3% | 2025 |
| Qwen2.5-VL-72B | Alibaba | 74.8% | Jan 2025 |
| o1 | OpenAI | 73.9% | Sept 2024 |
| Llama 4 Maverick | Meta | 73.7% | 2025 |
| GPT-4.5 | OpenAI | 72.3% | 2025 |
| GPT-4.1 | OpenAI | 72.2% | 2025 |
| Pixtral Large | Mistral AI | 69.4% | 2024 |
| Grok-2 | xAI | 69.0% | Aug 2024 |
| Grok-2 mini | xAI | 68.1% | Aug 2024 |
| Gemini 1.5 Pro | Google | 68.1% | 2024 |
| Claude 3.5 Sonnet | Anthropic | 67.7% | June 2024 |
| LLaVA-OneVision | Open source | 67.5% | Aug 2024 |
| InternVL2 | OpenGVLab | 66.3% | 2024 |
| InternVL2-Pro | OpenGVLab | 65.8% | Sept 2024 |
| InternVL2-8B-MPO | OpenGVLab | 65.7% | Nov 2024 |
| GPT-4o (May 2024) | OpenAI | 63.8% | May 2024 |
| Gemini 1.0 Ultra | Google | 53.0% | Dec 2023 |
| Claude 3 Opus | Anthropic | 50.5% | March 2024 |
| GPT-4V | OpenAI | 49.9% | Oct 2023 |
| Multimodal Bard | Google | 34.8% | Oct 2023 |
| Human Performance | N/A | 60.3% | Oct 2023 |
| Random Chance | N/A | 17.9% | Oct 2023 |
Multiple models now surpass human-level performance on MathVista. GPT-4o was among the first widely deployed models to cross this threshold, achieving 63.8% compared to the 60.3% human baseline. OpenAI's reasoning-focused models (o1 and o3) have pushed performance even higher, and by 2026 several frontier multimodal systems sit comfortably above 80%. The 2024 leaderboard showed at least eight models above the human average, and the 2025 to 2026 generation has effectively saturated the easier portions of MathVista, with most remaining errors concentrated in the IQTest, FunctionQA, and PaperQA subsets and in college-level scientific reasoning.
The progress narrative on MathVista is one of the most striking in the field. From GPT-4V's 49.9% at launch in October 2023 to scores in the high 80s and low 90s by 2026, the benchmark went from "cutting-edge multimodal systems lag humans by 10 points" to "frontier models exceed humans by 25 to 30 points" in roughly two and a half years. Most of that gain came from three sources: native multimodal pretraining replacing OCR-and-caption pipelines, chain-of-thought prompting and inference-time reasoning (especially for o1 and o3), and broader training data covering charts, geometry diagrams, and scientific figures.
One of the most significant findings from the original MathVista study was the prevalence of hallucination in model outputs. In a detailed analysis of 250 predictions from Multimodal Bard, the researchers found that 49.6% of responses contained hallucinations, where the model introduced facts not mentioned in the context of the image or question. Hallucinated content ranged from invented chart labels and fabricated geometric measurements to misattributed paper titles and made-up scientific formulas.
The error analysis revealed several distinct failure categories:
| Error Type | Percentage |
|---|---|
| Incorrect answer with incorrect explanation | 44.6% |
| Wrong calculations despite correct reasoning approach | 19.5% |
| Correct reasoning with hallucinated facts | 18.6% |
| Correct answer with partially or completely incorrect explanation | 14.9% |
| Correct answer obtained through wrong reasons | 8.1% |
The finding that 8.1% of correct answers were reached through entirely wrong reasoning is particularly noteworthy. It indicates that evaluation based solely on final answers can overestimate a model's true mathematical understanding, and it has motivated subsequent benchmarks (such as MathVerse) to design problems that are harder to game with surface-level pattern matching.
The paper also explored several techniques for improving model performance. Self-verification (allowing the model to verify and potentially correct its own answers) enhanced single-inference accuracies by up to 12.7%. Self-consistency sampling (using multiple inference samples and selecting the most common answer) reduced error rates by 19.3% on complex math problems. Multi-turn dialogue with guided human intervention through follow-up prompts allowed models to resolve 85.4% of advanced geometry inquiries, suggesting that interactive approaches could significantly improve performance even when single-shot accuracy was limited. These findings foreshadowed the rise of inference-time reasoning techniques that became central to o1, o3, and other reasoning models in 2024 and 2025.
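Self-consistency in particular is straightforward to sketch: sample several responses at nonzero temperature, extract an answer from each, and keep the most common one. The snippet below assumes a `generate` callable (prompt in, sampled response out) and a deliberately simplified answer extraction; it illustrates the general technique rather than the paper's exact setup.

```python
from collections import Counter

def self_consistency_answer(generate, prompt, n_samples=5):
    """Majority vote over several sampled responses."""
    answers = []
    for _ in range(n_samples):
        response = generate(prompt)
        # Simplified extraction: take the last whitespace-separated token.
        answers.append(response.strip().split()[-1])
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```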
The evaluation revealed clear patterns in where different model architectures excel. Augmented LLMs with Program-of-Thought performed best on tasks requiring precise computation, such as figure question answering and statistical reasoning, because the generated code could perform exact arithmetic. Multimodal Bard showed surprisingly strong performance on geometry problem solving (47.1%), approaching the human level of 48.4%, while struggling with logical reasoning (13.5%). Open-source LMMs like LLaVA (26.1%) showed a substantial gap compared to proprietary alternatives, though this gap has narrowed considerably with subsequent open-source developments such as InternVL2-Pro (65.8%), Qwen2.5-VL-72B (74.8%), and LLaVA-OneVision (67.5%).
The human performance baseline of 60.3% was established through Amazon Mechanical Turk using annotators with a minimum of a high school diploma. Workers were required to complete five questions within 20 minutes. The relatively moderate human accuracy (compared to, for example, near-perfect performance on simpler VQA benchmarks) confirms that MathVista captures genuinely difficult mathematical reasoning tasks. The 60.3% number represents an average lay-person score, not an expert score: PhD mathematicians or graduate students in STEM would likely score much higher.
Human performance varied significantly across task types:
| Task Type | Human Accuracy |
|---|---|
| Math Word Problems (MWP) | 73.0% |
| Textbook QA (TQA) | 63.2% |
| Figure QA (FQA) | 59.7% |
| Visual QA (VQA) | 55.9% |
| Geometry Problem Solving (GPS) | 48.4% |
Human performance also varied across reasoning types:
| Reasoning Type | Human Accuracy |
|---|---|
| Statistical (STA) | 63.9% |
| Arithmetic (ARI) | 59.2% |
| Numeric (NUM) | 53.8% |
| Algebraic (ALG) | 50.9% |
| Logical (LOG) | 40.7% |
The low human accuracy on logical reasoning problems (40.7%) and geometry problem solving (48.4%) indicates that these categories are intrinsically difficult, not just challenging for AI models. Human geometry performance in particular reflects the well-documented gap between recognizing a triangle on a page and computing an unknown angle from the diagram, a skill that depends heavily on prior training rather than general intelligence.
The construction of MathVista involved a careful, multi-stage curation process. The authors emphasized that benchmark quality, not benchmark size, was the primary design goal.
The team reviewed more than 70 publicly available datasets and selected 28 that contained examples requiring mathematical reasoning in visual contexts. For VQA-format datasets, heuristic rules were first applied to automatically identify examples with numeric answers or those containing quantity-related keywords. This yielded 4,949 candidate examples. Three expert annotators then manually reviewed each candidate to determine whether it genuinely involved mathematical reasoning, using majority voting to resolve disagreements. This process yielded 2,739 confirmed examples from VQA sources.
Every example in MathVista was enriched with structured metadata, including question type (multiple-choice or free-form), answer type (integer, float, or text), language, source dataset, category, task type, grade level, visual context type, and mathematical reasoning types. For the reasoning type labels, automatic annotation methods achieved 94.1% exact-match agreement with manual labels on a random sample of 1,000 examples, and individual label accuracy reached 98.79%. The richness of the metadata is one of MathVista's distinguishing features, since it allows researchers to slice the leaderboard along many axes rather than reporting only an aggregate accuracy number.
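That metadata makes per-category breakdowns straightforward to compute. A minimal sketch of slicing accuracy by task type is shown below, assuming each record carries a nested `metadata` dictionary with a `task` key (an assumption about the public schema; adjust the field names to match the actual dataset).

```python
from collections import defaultdict

def accuracy_by_task(examples, correctness):
    """Aggregate accuracy per task type.

    `examples` is an iterable of MathVista records and `correctness` a parallel
    sequence of booleans, one per example.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for example, correct in zip(examples, correctness):
        task = example["metadata"]["task"]  # assumed field layout
        totals[task] += 1
        hits[task] += int(correct)
    return {task: hits[task] / totals[task] for task in totals}
```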
For the three new datasets, three independent reviewers annotated each question. Among the 736 new questions, only 6 exhibited disagreements in the annotated answers before final consensus was reached through discussion. This low disagreement rate of less than 1% reflects the high quality of the annotation process. Subsequent academic studies have noted that MathVista is one of the cleaner multimodal math benchmarks, with fewer ambiguous, mislabeled, or unsolvable items than several competing datasets.
MathVista occupies a specific niche in the broader multimodal evaluation landscape. The following table compares MathVista with several closely related text-only and multimodal math benchmarks:
| Benchmark | Modality | Examples | Focus | Year |
|---|---|---|---|---|
| GSM8K | Text | 8,500 | Grade school math word problems | 2021 |
| MathQA | Text | 37,000 | Multiple-choice text math | 2019 |
| MATH | Text | 12,500 | Competition math | 2021 |
| MMLU | Text | 15,908 | General knowledge (incl. math) | 2020 |
| GPQA | Text | 448 | Graduate-level science | 2023 |
| HumanEval | Text | 164 | Python coding | 2021 |
| ChartQA | Image+Text | 9,608 | Chart question answering | 2022 |
| TabMWP | Image+Text | 38,431 | Table-based math | 2022 |
| GeoQA+ | Image+Text | 5,010 | Geometry problem solving | 2022 |
| MathVista | Image+Text | 6,141 | Multimodal math reasoning | 2023 |
| MMMU | Image+Text | 11,500 | College-level multimodal exam | 2024 |
| MathVerse | Image+Text | 15,672 | Diagram dependence in math | 2024 |
| MATH-Vision | Image+Text | 3,040 | Competition multimodal math | 2024 |
| WE-MATH | Image+Text | 6,500 | Knowledge concept hierarchy | 2024 |
| DynaMath | Image+Text | 5,010 (variants) | Robustness via dynamic variants | 2024 |
| CharXiv | Image+Text | 2,323 | Chart QA in scientific papers | 2024 |
| MathBench | Image+Text | 3,709 | Hierarchical math reasoning | 2024 |
Published at NeurIPS 2024, MATH-Vision provides 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 mathematical disciplines across 5 difficulty levels, MATH-V focuses on competition-level problems that are considerably more challenging than those in MathVista. Where MathVista draws from diverse educational and everyday settings, MATH-Vision specifically targets the kind of advanced, multi-step reasoning found in mathematical olympiads and standardized competitions. Several frontier models that score above 80% on MathVista still struggle to reach 50% on MATH-Vision, illustrating that MathVista does not capture the full difficulty curve of multimodal math.
Presented at ECCV 2024, MathVerse addresses a specific limitation of existing benchmarks: the tendency for textual questions to contain enough information that models can answer correctly without truly interpreting the accompanying diagrams. MathVerse collects 2,612 high-quality multi-subject math problems with diagrams and transforms each problem into six distinct versions that offer varying degrees of information in the text versus the image. This design tests whether multimodal models genuinely extract information from visual inputs or simply rely on textual cues. The resulting 15,672 test samples provide a more equitable evaluation of true visual reasoning, and many models that perform well on MathVista see substantial drops when tested on MathVerse's vision-only or vision-dominant variants.
Accepted at ACL 2025, WE-MATH focuses on the underlying principles of mathematical knowledge acquisition rather than end-to-end performance. It consists of 6,500 visual math problems spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. WE-MATH introduces a four-dimensional evaluation metric that distinguishes between Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), providing deeper insight into how models learn and apply mathematical concepts. WE-MATH and MathVista are often reported together because they probe complementary aspects of multimodal math: aggregate task performance versus underlying knowledge mastery.
DynaMath focuses on evaluating the robustness of vision-language models by dynamically generating variants of visual math problems. Rather than using static test sets, DynaMath modifies problem parameters (such as numbers, colors, or geometric properties) to test whether models have genuinely learned mathematical reasoning or have merely memorized specific problem-answer pairs. This is one of the most direct attempts to address contamination concerns that affect static benchmarks like MathVista.
The Massive Multi-discipline Multimodal Understanding and Reasoning benchmark (MMMU), released in late 2023, evaluates multimodal models on college-exam-level problems across 30 subjects. Although MMMU is broader than MathVista (covering art history, law, medicine, and many other domains in addition to math and science), the two benchmarks are usually reported side by side in frontier model release notes because they jointly cover general multimodal competence (MMMU) and visual mathematical reasoning (MathVista).
MathVista's text-only counterparts include GSM8K (grade school math word problems), MATH (competition math), MathQA (multiple-choice text math), and parts of MMLU. Frontier models often score above 95% on GSM8K, in the high 80s on MATH, and near 90% on MMLU, while their MathVista scores typically lag by 5 to 20 points because of the additional visual perception burden. This gap has narrowed substantially in 2025 to 2026 as native multimodal pretraining has matured.
Despite its broad adoption, MathVista has several recognized limitations.
Multiple-choice items make up a substantial fraction of the dataset, and even strong-looking accuracies can be partly inflated by guessing: with 2 to 8 answer options per multiple-choice question, random chance scores 17.9% overall and a frequent-guess strategy reaches 26.3%. Several published critiques have noted that overall MathVista scores can mask weaker free-form performance, and some authors prefer to report per-task or free-form-only results.
Data contamination is a real concern. Most of the 28 source datasets predate the training cutoff of frontier models released in 2024 and 2025, so direct memorization of test items is plausible. The PaperQA subset is the only portion explicitly designed to lie outside model training data (drawn from August 2023 papers), but for the rest, leakage between training corpora and test items cannot be ruled out. DynaMath was developed in part to address this concern by dynamically generating problem variants.
MathVista is largely English. Although Chinese (GeoQA+) and Persian (ParsVQA-Caps) problems are present, the benchmark cannot be used to assess multilingual multimodal math performance in any rigorous way.
The benchmark is increasingly saturated. With several frontier models above 80% on testmini and humans at 60.3%, the gradient that originally distinguished GPT-4V from open-source baselines has flattened at the top of the leaderboard. MATH-Vision, MathVerse, WE-MATH, and DynaMath have largely taken over as the harder benchmarks for distinguishing the next generation of multimodal reasoners.
The answer extraction step uses GPT-4 as a parser, which itself can fail in edge cases. The original paper estimated more than 99.5% extractor accuracy, and the small residual error is concentrated in long, ambiguous outputs from older models that produce verbose, unstructured responses.
The testmini split has become more popular than the held-out test split, partly because its labels are public. This convenience-driven preference reduces the value of the hidden test split as a contamination guardrail, since results on the public testmini are reported far more often.
Finally, MathVista does not capture every important dimension of visual math reasoning. Long multi-page proofs, interactive problem solving, multi-image reasoning across separate figures, and reasoning over video are all outside its scope. Newer benchmarks such as VideoMath, M3GIA, and MMMath fill some of these gaps.
MathVista has had a considerable impact on multimodal AI evaluation. Since its publication, it has become a standard benchmark reported in the technical papers of nearly every major multimodal model release, including GPT-4V, GPT-4o, GPT-4.1, GPT-4.5, o1, o3, Gemini 1.0/1.5/2.0, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3.2 Vision, Llama 4, Qwen-VL, Qwen2-VL, Qwen2.5-VL, Pixtral, Grok-2, InternVL, and LLaVA-OneVision. The benchmark is hosted on Hugging Face under the AI4Math organization, available on Kaggle for open benchmarking, and its evaluation code is maintained as an open-source repository on GitHub.
The benchmark's multi-dimensional taxonomy, spanning task types, reasoning types, visual contexts, and grade levels, has proven especially valuable for diagnosis. It allows researchers not only to report a single accuracy number but also to identify specific areas of strength and weakness, guiding targeted improvements in model architecture, training data, and reasoning strategies. For example, the consistently low logical-reasoning scores led several teams to add IQ-puzzle-style synthetic data to their training mixes, and the gap on PaperQA highlighted the need for scientific-figure pretraining.
The rapid progress from GPT-4V's 49.9% at launch in October 2023 to scores above 85% on testmini by early 2026 also provides one of the clearest case studies of the pace of improvement in multimodal AI. The detailed error analysis categories introduced in the MathVista paper, particularly around hallucination in mathematical reasoning, have influenced how the field thinks about and evaluates the reliability of AI-generated mathematical solutions, and many subsequent papers cite MathVista's hallucination breakdown as a methodological template.
MathVista has also served as a launchpad for researchers in the field. Lead author Pan Lu, who started the project as a UCLA PhD student under Kai-Wei Chang, was awarded a Bloomberg Data Science PhD Fellowship in part for this work, and several co-authors have gone on to lead multimodal evaluation efforts at industry labs.
In the current evaluation landscape, MathVista is rarely reported alone. The standard "multimodal math suite" for a frontier model release in 2025 to 2026 typically includes MathVista, MMMU, MathVerse, MATH-Vision, and often DocVQA, ChartQA, and AI2D as supplementary chart and document understanding tasks. Reasoning-focused models such as OpenAI's o-series, Anthropic's extended-thinking Claude variants, and Google's Gemini Deep Think branch usually report inference-time scaling curves on MathVista showing how accuracy increases with reasoning budget.
For open-source multimodal models, MathVista remains one of the central public benchmarks. The Qwen-VL series, InternVL series, LLaVA-OneVision series, MiniCPM-V series, and DeepSeek-VL series all report MathVista numbers in their release notes, and the open-source frontier has effectively closed the gap with proprietary models on this benchmark by 2025.
MathVista has also been incorporated into evaluation harnesses such as VLMEvalKit, lm-evaluation-harness's multimodal extension, and EvalScope's VLM benchmarks, making it easy for any new model to produce a directly comparable score. This standardization has helped MathVista persist as a core benchmark even as it approaches saturation, since the cost of evaluation is low and the historical scores stretch back to GPT-4V.
Community practitioners have observed that MathVista performance correlates moderately well with general multimodal reasoning ability but only weakly with specialized scientific or olympiad-level math capability. As of 2026, a high MathVista score is necessary but not sufficient evidence that a multimodal model can serve as a useful research or engineering assistant on quantitative tasks.
MathVista is freely available under a CC BY-SA 4.0 license. The accompanying usage terms state that the dataset is intended as a test set only and should not be used for training, a restriction designed to limit contamination of future model evaluations.
Key resources include the Hugging Face dataset at AI4Math/MathVista, the GitHub repository at lupantech/MathVista, the project website and leaderboard at mathvista.github.io, the paper on arXiv at arXiv:2310.02255, and the ICLR 2024 proceedings.
The dataset can be loaded directly using the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("AI4Math/MathVista")
testmini = dataset["testmini"]  # 1,000 examples
test = dataset["test"]          # 5,141 examples
```
A typical evaluation run on testmini (1,000 examples) takes roughly 30 to 90 minutes for an API-served frontier model, depending on rate limits and prompt strategy. Running the full 5,141-example test split costs proportionally more and requires submitting predictions to the official evaluation server (or Kaggle) since the labels are not publicly distributed.