CharXiv

Updated May 10, 2026

Overview
Full name	Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Abbreviation	CharXiv
Description	An evaluation suite for assessing chart understanding capabilities in multimodal large language models, built from real charts in arXiv papers
Release date	2024-06-26 (arXiv preprint), 2024-07-24 (v1.0 evaluation pipeline)
Latest version	1.0
Benchmark updated	Leaderboard updated 2024-12-25; adopted by frontier labs in 2025-04
Authors	Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
Organization	Princeton Language and Intelligence (PLI), Princeton University; University of Wisconsin-Madison; The University of Hong Kong
Conference	NeurIPS 2024 Datasets and Benchmarks Track; spotlight at NeurIPS 2024 Multimodal Algorithmic Reasoning Workshop
Technical Details
Type	Chart Understanding, Visual Reasoning, Multimodal Evaluation
Modality	Vision, Text
Task format	Open-vocabulary short-answer question answering
Number of question types	2 (descriptive and reasoning)
Total examples	2,323 charts; 1,000 validation charts with 5,000 questions; remaining 1,323 charts in held-out test split
Evaluation metric	Accuracy graded by GPT-4o (LLM-as-judge)
Domains	Scientific charts from arXiv, primarily cs.CV, cs.LG and cs.CL papers
Languages	English
Performance
Human performance	80.5% reasoning, 92.1% descriptive
Best proprietary (paper)	GPT-4o, 47.1% reasoning
Best open-source (paper)	InternVL Chat V1.5, 29.2% reasoning
Leaderboard top (Dec 2024)	Claude 3.5 Sonnet, 60.2% reasoning, 84.3% descriptive
Saturated	No
Resources
Website	charxiv.github.io
Paper	arXiv:2406.18521
GitHub	princeton-nlp/CharXiv
Dataset	Hugging Face: princeton-nlp/CharXiv
License	CC BY-SA 4.0 (annotations), Apache 2.0 (code), original chart copyrights retained by paper authors

CharXiv is a benchmark for evaluating chart understanding in multimodal large language models (MLLMs), built by researchers at Princeton Language and Intelligence with collaborators at the University of Wisconsin-Madison and the University of Hong Kong^[1]^[2]. The paper, "CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs" by Wang et al., was posted to arXiv as 2406.18521 on 26 June 2024 and accepted at the NeurIPS 2024 Datasets and Benchmarks Track on 10 October 2024, with a spotlight at the NeurIPS 2024 Multimodal Algorithmic Reasoning Workshop^[2]^[3].

The benchmark uses 2,323 charts handpicked from real arXiv papers, pairing each chart with four descriptive questions plus one reasoning question. Answers are open-ended short text and grading is done with GPT-4o as judge. The paper reported that GPT-4o, the strongest proprietary model at submission, reached only 47.1% on reasoning while the strongest open-source model, InternVL Chat V1.5, reached 29.2%, both far below human accuracy of 80.5%^[1]^[4].

Background

Chart understanding requires reading text, decoding spatial layouts, parsing legends and axes, and combining numbers across regions of an image. Earlier benchmarks such as ChartQA, PlotQA, FigureQA and DVQA used templated questions and synthetically rendered charts, so models could memorize patterns and post high scores without genuine visual reasoning ability^[1]. The CharXiv authors argue this leads to an over-optimistic measure of progress, and the paper shows that simple stress tests with slightly different charts or questions can drop accuracy by up to 34.5% on those benchmarks^[1]. CharXiv was designed as a corrective: scale, diversity and visual messiness are baked in by sourcing charts from real scientific papers rather than generating them.

The paper has thirteen authors led by Zirui Wang and including Mengzhou Xia, Haotian Liu (known for LLaVA, based at Wisconsin during the project), Sanjeev Arora and Danqi Chen. Most authors are based at Princeton's PLI lab^[3]^[5].

Dataset construction

Source charts

All charts come from arXiv papers, with a strong skew toward computer science subfields, primarily cs.CV (computer vision), cs.LG (machine learning) and cs.CL (computational linguistics)^[5]^[6]. The team converted figures from arXiv vector files to raster images with widths between roughly 399 and 1,024 pixels, then ran a manual review to discard unreadable or ambiguous charts. Each chart's metadata records the paper ID, primary arXiv category and figure index^[6]. A single CharXiv chart can be a bar plot, multi-axis line graph, scatter plot with regression overlays, heatmap, ROC or precision-recall curve, confusion matrix, or multi-panel composite (up to 120 subplots in the extreme), which distinguishes it from previous chart QA datasets where multi-panel composition was rare^[6].

Splits

The benchmark splits 2,323 charts into a 1,000-chart validation set and a 1,323-chart held-out test set^[1]^[6]. Each chart carries four descriptive questions and one reasoning question, so the validation set contains 5,000 questions. Test answers are stored as null to prevent leakage; researchers run their model on the test split locally and submit responses for grading^[6]^[7].

Split	Charts	Descriptive questions	Reasoning questions	Answers public
Validation	1,000	4,000	1,000	Yes
Test	1,323	~5,292	~1,323	No (held out)
Total	2,323	~9,292	~2,323	Mixed
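
The question counts in the table follow from the per-chart design of four descriptive questions and one reasoning question. The arithmetic can be sketched as follows; the variable and function names are chosen here for illustration and are not part of the CharXiv codebase.

```python
# Per-chart question design stated in the article.
DESCRIPTIVE_PER_CHART = 4
REASONING_PER_CHART = 1

# Chart counts per split, taken from the table above.
charts = {"validation": 1000, "test": 1323}


def question_counts(n_charts: int) -> dict:
    """Return descriptive, reasoning and total question counts for n_charts."""
    descriptive = n_charts * DESCRIPTIVE_PER_CHART
    reasoning = n_charts * REASONING_PER_CHART
    return {
        "descriptive": descriptive,
        "reasoning": reasoning,
        "total": descriptive + reasoning,
    }


counts = {split: question_counts(n) for split, n in charts.items()}
total_charts = sum(charts.values())

for split, c in counts.items():
    print(split, c)
print("all charts:", total_charts, question_counts(total_charts))
```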

Every question is hand-written and verified by a second human annotator. No templates were used at any stage; quality was prioritized over scale^[1]^[7].

Question taxonomy

CharXiv's two question types break down further into named categories the leaderboard reports as subscores^[7].

Descriptive questions (4 per chart)

Descriptive questions test whether a model reads the chart correctly without arithmetic or interpretation beyond the visible content. The five subscore categories:

Code	Subcategory	What it tests
INEX	Information Extraction	Locating a specific value, label or annotation
ENUM	Enumeration	Counting categories, panels or named series
PATT	Pattern Recognition	Recognizing shapes (monotonic increase, oscillation, peak)
CNTG	Counting	Counting bars, lines or markers
COMP	Compositionality	Combining two or more retrievals (e.g., "is the red line at x=10 larger than the blue line at x=20?")

Compositionality is the hardest descriptive subcategory: even Claude 3.5 Sonnet drops to 48.66% on COMP while scoring above 80% on the others^[7].

Reasoning questions (1 per chart)

Reasoning questions require integrating chart content with external knowledge or multi-step computation, organized by the source of the answer:

Code	Subcategory	What it tests
TC	Text-in-Chart	Short text answer that appears in or is implied by the chart
TG	Text-in-General	Short text answer requiring external context (e.g., naming the best-performing method)
NC	Number-in-Chart	Numeric answer read from or computed using the chart
NG	Number-in-General	Numeric answer combining the chart with external knowledge

The reasoning split is balanced across these four buckets. Number-in-General is the hardest category for almost every system tested^[1]^[7].
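
The two taxonomies map naturally onto lookup tables, and per-category subscores are then plain accuracies grouped by code. The following is a minimal sketch under stated assumptions: the `(category_code, is_correct)` record shape and the `subscores` helper are hypothetical, not the official output format, and the toy judgments are invented purely to exercise the function.

```python
from collections import defaultdict

# Category codes as listed in the two tables above.
DESCRIPTIVE = {
    "INEX": "Information Extraction",
    "ENUM": "Enumeration",
    "PATT": "Pattern Recognition",
    "CNTG": "Counting",
    "COMP": "Compositionality",
}
REASONING = {
    "TC": "Text-in-Chart",
    "TG": "Text-in-General",
    "NC": "Number-in-Chart",
    "NG": "Number-in-General",
}


def subscores(graded):
    """Compute accuracy (in percent) per category code.

    `graded` is an iterable of (category_code, is_correct) pairs; this
    record shape is an assumption made for illustration.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for code, correct in graded:
        if code not in DESCRIPTIVE and code not in REASONING:
            raise ValueError(f"unknown category code: {code}")
        totals[code] += 1
        hits[code] += int(correct)
    return {code: 100.0 * hits[code] / totals[code] for code in totals}


# Toy, made-up judgments (not real benchmark results).
toy = [("NC", True), ("NC", False), ("NG", False), ("COMP", True)]
scores = subscores(toy)
print(scores)
```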

Evaluation methodology

The official pipeline (v1.0, released 24 July 2024) is a three-step Python flow on the validation split^[8]: generate.py calls the model with each chart-question pair and writes responses; evaluate.py calls the OpenAI API and asks GPT-4o to judge each response; get_stats.py aggregates the overall, descriptive and reasoning scores shown on the leaderboard. The choice of GPT-4o as judge is deliberate: exact-string matching is too strict for open-vocabulary answers, and a strong LLM judge correlates better with human grading^[8]. The leaderboard reports descriptive and reasoning subscores separately, plus per-category breakdowns within each^[7].
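
The three-step flow can be illustrated with a self-contained sketch. This is not the official code: the function names only echo the script names, the record fields are assumptions, and `toy_model` and `toy_judge` are hypothetical stand-ins for the MLLM call and the GPT-4o judge call, so no API is contacted.

```python
from statistics import mean


def generate(examples, model):
    """Step 1: query the model on each chart-question pair."""
    return [{**ex, "response": model(ex["chart"], ex["question"])} for ex in examples]


def evaluate(responses, judge):
    """Step 2: have a judge mark each response 1 (correct) or 0 (incorrect)."""
    return [
        {**r, "score": judge(r["question"], r["answer"], r["response"])}
        for r in responses
    ]


def get_stats(graded):
    """Step 3: aggregate accuracy (percent) overall and per question type."""
    stats = {"overall": 100.0 * mean([g["score"] for g in graded])}
    for qtype in ("descriptive", "reasoning"):
        subset = [g["score"] for g in graded if g["type"] == qtype]
        if subset:
            stats[qtype] = 100.0 * mean(subset)
    return stats


# Hypothetical stand-ins: a canned "model" and a lenient substring "judge".
# The real pipeline calls an MLLM and GPT-4o respectively.
def toy_model(chart, question):
    return {"q1": "3", "q2": "The blue line"}.get(question, "unknown")


def toy_judge(question, answer, response):
    return int(answer.strip().lower() in response.strip().lower())


examples = [
    {"chart": "fig_a.png", "question": "q1", "answer": "3", "type": "descriptive"},
    {"chart": "fig_a.png", "question": "q2", "answer": "blue line", "type": "descriptive"},
    {"chart": "fig_a.png", "question": "q3", "answer": "0.5", "type": "reasoning"},
]

stats = get_stats(evaluate(generate(examples, toy_model), toy_judge))
print(stats)
```

The substring judge accepts "The blue line" for the reference "blue line", which hints at why exact-string matching would be too strict for open-vocabulary answers; an LLM judge generalizes that leniency to paraphrases a substring test cannot catch.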

Results

Original paper results

The team evaluated 34 vision-language models covering proprietary APIs, generalist open models and chart-specific systems. Headline numbers on the validation split:

Model	Reasoning	Descriptive	Notes
Human	80.5%	92.1%	Expert upper bound
GPT-4o	47.1%	84.45%	Best proprietary at submission
Gemini 1.5 Pro	43.3%	71.97%	Best Google at submission
GPT-4V	37.1%	79.92%	Earlier OpenAI model
Claude 3 Sonnet	32.2%	73.65%	Mid-tier Claude 3
Phi-3 Vision	31.6%	60.48%	Microsoft small VLM
Claude 3 Opus	30.2%	71.55%	Pre-3.5 Anthropic flagship
InternVL Chat V1.5	29.2%	58.5%	Best open-source at submission
Reka Core	28.9%	55.6%	Reka flagship
Cambrian 34B	27.3%	59.73%	NYU open-source
Qwen-VL Max	24.7%	41.48%	Alibaba proprietary
LLaVA 1.6 Yi 34B	22.5%	51.05%	Open-source baseline

A random baseline (GPT-4o sampling without seeing the chart) scores 10.8% on reasoning and 19.85% on descriptive, providing a sanity floor^[7]. Two patterns stand out: the descriptive-reasoning gap is large for every model (GPT-4o was within 8 points of humans on descriptive but over 30 points behind on reasoning), and the proprietary-open gap was close to 18 points on reasoning at submission, one of the wider gaps on any 2024 multimodal benchmark^[1].
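
The gaps quoted above are simple percentage-point differences between table rows. A short sketch of the arithmetic, using only figures from the table (the `gap` helper is introduced here for illustration):

```python
# Validation-split accuracies (percent) copied from the table above.
scores = {
    "Human": {"reasoning": 80.5, "descriptive": 92.1},
    "GPT-4o": {"reasoning": 47.1, "descriptive": 84.45},
    "InternVL Chat V1.5": {"reasoning": 29.2, "descriptive": 58.5},
}


def gap(a: str, b: str, split: str) -> float:
    """Percentage-point difference between systems a and b on one split."""
    return round(scores[a][split] - scores[b][split], 2)


human_desc_gap = gap("Human", "GPT-4o", "descriptive")
human_reas_gap = gap("Human", "GPT-4o", "reasoning")
prop_open_gap = gap("GPT-4o", "InternVL Chat V1.5", "reasoning")

print("human vs GPT-4o, descriptive:", human_desc_gap)
print("human vs GPT-4o, reasoning:", human_reas_gap)
print("GPT-4o vs InternVL Chat V1.5, reasoning:", prop_open_gap)
```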

Updated leaderboard (December 2024)

On 25 December 2024 the team posted a leaderboard refresh with newer 2024 systems^[7]:

Model	Reasoning	Descriptive	Released
Claude 3.5 Sonnet	60.2%	84.3%	2024-06
InternVL Chat V2.0 Pro	39.8%	76.83%	2024
InternVL Chat V2.0 76B	38.9%	75.17%	2024
GPT-4o Mini	34.1%	74.92%	2024-07
Gemini 1.5 Flash	33.9%	n/a	2024
InternVL Chat V2.0 26B	33.4%	62.4%	2024
MiniCPM-V2.6 (Upsize+CoT)	31.0%	57.05%	2024

Claude 3.5 Sonnet's 60.2% on reasoning was a step change, opening a 13-point gap over the next-best system, GPT-4o at 47.1%. Newer InternVL V2 models improved on InternVL Chat V1.5 by roughly 10 points on reasoning but still trailed Claude by about 21 points; the gap was much smaller on descriptive questions, where the open-source 76B model came within about 9 points of Claude^[7].
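
Which system counts as runner-up depends on whether the original paper table is merged with the refresh. A small sketch, assuming both tables are pooled and using only reasoning scores listed in this article:

```python
# Reasoning accuracies (percent) pooled from the paper table and the
# December 2024 refresh; pooling the two tables is an assumption made here.
reasoning = {
    "Claude 3.5 Sonnet": 60.2,
    "GPT-4o": 47.1,
    "Gemini 1.5 Pro": 43.3,
    "InternVL Chat V2.0 Pro": 39.8,
    "InternVL Chat V2.0 76B": 38.9,
    "GPT-4o Mini": 34.1,
    "Gemini 1.5 Flash": 33.9,
}

# Rank systems from highest to lowest reasoning accuracy.
ranked = sorted(reasoning.items(), key=lambda kv: kv[1], reverse=True)
(leader, top), (runner_up, second) = ranked[0], ranked[1]
lead = round(top - second, 1)

# Gap between the leader and the largest InternVL V2 model in the refresh.
internvl_gap = round(top - reasoning["InternVL Chat V2.0 76B"], 1)

print(f"{leader} leads {runner_up} by {lead} points")
print(f"{leader} leads InternVL Chat V2.0 76B by {internvl_gap} points")
```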

Adoption by frontier labs

In an update on 14 April 2025, the CharXiv team noted that the benchmark had been formally incorporated into the official evaluation suites of several frontier model releases, including GPT-4.1, Qwen2.5-VL, InternVL2.5 and Doubao 1.5 Pro^[8]. These adoptions converted CharXiv from a third-party academic benchmark into part of the standard set vendors report at launch, alongside MMMU, MathVista and ChartQA. Through 2025 and into 2026, independent leaderboard trackers list scores from later flagship systems substantially above 80%, with some systems augmented by code execution or Python tools crossing 85% on reasoning^[9]. The top tool-augmented scores are thus at or above the 80.5% human reasoning baseline, and the benchmark is treated as unsaturated mainly for models evaluated end-to-end without tools.

Failure modes and robustness

Four recurring failure patterns show up across systems^[1]^[7]: hallucinated values (plausible-looking numbers that do not appear in the chart, especially with log scales); legend confusion (swapping colors or line styles, particularly with more than four series); compositional errors (combining two retrievals fails even when each point is read correctly); and unit and scale errors (misreading exponents, percentages or unit labels, which contributes to low Number-in-General scores). The authors interpret these as evidence that current MLLMs lack a genuine spatial-symbolic representation of charts: they treat figures as text-rich images and lean on OCR-style decoding plus surface pattern matching rather than building a structured internal model of the plot. The paper also shows that perturbing existing chart QA datasets with small changes drops accuracy by up to 34.5%, whereas CharXiv includes this realistic variation by design^[1]^[3].

Comparison with related benchmarks

Benchmark	Charts	Source	Question style
ChartQA	~9,600	Statista, OWID, OECD	Mostly templated
PlotQA	224K	Synthetic from World Bank tables	Templated
FigureQA	1.3M	Synthetic	Yes/no
DVQA	3.5M	Synthetic bar charts	Templated
ChartBench	~67K	Synthetic and web	Mixed
MultiChartQA	~3,500	Mixed	Multi-chart reasoning
CharXiv	2,323	Real arXiv papers	Free-form short answers, hand-authored

CharXiv has roughly a quarter of ChartQA's chart count as listed above, but the charts are much harder: multi-panel layouts, log scales, error bars and dense legends are common. Compared to PlotQA or FigureQA, CharXiv is roughly two to three orders of magnitude smaller, but those datasets are fully synthetic and do not discriminate between strong models. The design choice was deliberate: prioritize realism and difficulty over volume^[1].
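
The size comparisons can be checked from the table's own figures. A brief sketch, treating the approximate counts in the table as exact for the purpose of the calculation:

```python
import math

# Chart counts as listed in the comparison table (approximate where marked "~").
charts = {
    "ChartQA": 9_600,
    "PlotQA": 224_000,
    "FigureQA": 1_300_000,
    "DVQA": 3_500_000,
    "CharXiv": 2_323,
}


def orders_of_magnitude(other: str) -> float:
    """log10 of how many times larger `other` is than CharXiv."""
    return round(math.log10(charts[other] / charts["CharXiv"]), 2)


ratio_to_chartqa = round(charts["CharXiv"] / charts["ChartQA"], 2)
plotqa_orders = orders_of_magnitude("PlotQA")
figureqa_orders = orders_of_magnitude("FigureQA")

print("CharXiv / ChartQA:", ratio_to_chartqa)
print("PlotQA orders of magnitude:", plotqa_orders)
print("FigureQA orders of magnitude:", figureqa_orders)
```

On these figures CharXiv is about a quarter the size of ChartQA, close to two orders of magnitude smaller than PlotQA, and closer to three orders smaller than FigureQA.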

License, reception, and limitations

Annotations are released under CC BY-SA 4.0, evaluation code under Apache 2.0, and chart images remain under the copyright of the original arXiv authors^[8]. The dataset is hosted on Hugging Face at princeton-nlp/CharXiv along with raw responses and graded outputs for all 34 models in the paper, which is useful for fine-grained meta-analysis^[6]. The authors request that CharXiv be used for evaluation rather than training; keeping the test answers hidden enforces this for the test split.

CharXiv was one of the more cited 2024 multimodal evaluation papers. Beyond the spotlight at the NeurIPS 2024 Multimodal Algorithmic Reasoning Workshop, the benchmark has been used in subsequent chart reasoning research, including a follow-up paper from 24 October 2024 (arXiv 2410.18798) proposing training methods that specifically target CharXiv-style charts^[8].

Limitations: CharXiv is English-only; charts come predominantly from arXiv computer science papers, so non-CS, business and journalism charts are underrepresented; and the answer format is restricted to short text. Validation answers are public, so models could indirectly memorize them through web scraping; the test split guards against this but depends on labs honoring the held-out convention^[1]^[6]. LLM-as-judge grading via GPT-4o also couples the benchmark to a specific model, though the team validates GPT-4o judgments against human grading on a sample.

References

1. Wang, Z., Xia, M., He, L., Chen, H., Liu, Y., Zhu, R., Liang, K., Wu, X., Liu, H., Malladi, S., Chevalier, A., Arora, S., and Chen, D. (2024). "CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs." arXiv:2406.18521. https://arxiv.org/abs/2406.18521
2. NeurIPS 2024 Datasets and Benchmarks Track poster page: "CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs." https://neurips.cc/virtual/2024/poster/97598
3. OpenReview discussion and acceptance record. https://openreview.net/forum?id=cy8mq7QYae
4. NeurIPS 2024 Proceedings PDF. https://proceedings.neurips.cc/paper_files/paper/2024/file/cdf6f8e9fd9aeaf79b6024caec24f15b-Paper-Datasets_and_Benchmarks_Track.pdf
5. Princeton Language and Intelligence event listing for CharXiv. https://pli.princeton.edu/events/2024/charxiv-charting-gaps-realistic-chart-understanding-multimodal-llms
6. Hugging Face dataset card for princeton-nlp/CharXiv. https://huggingface.co/datasets/princeton-nlp/CharXiv
7. Official CharXiv project site and leaderboard. https://charxiv.github.io/ (mirror at https://princeton-nlp.github.io/CharXiv/)
8. CharXiv GitHub repository (princeton-nlp/CharXiv), README, news log and license. https://github.com/princeton-nlp/CharXiv
9. Independent CharXiv reasoning leaderboard tracker. https://llm-stats.com/benchmarks/charxiv-r
