CharXiv
Last reviewed
May 10, 2026
Sources
9 citations
Review status
Source-backed
Revision
v2 · 2,479 words
| CharXiv | |
|---|---|
| Overview | |
| Full name | Charting Gaps in Realistic Chart Understanding in Multimodal LLMs |
| Abbreviation | CharXiv |
| Description | An evaluation suite for assessing chart understanding capabilities in multimodal large language models, built from real charts in arXiv papers |
| Release date | 2024-06-26 (arXiv preprint), 2024-07-24 (v1.0 evaluation pipeline) |
| Latest version | 1.0 |
| Benchmark updated | Leaderboard updated 2024-12-25; adopted by frontier labs in 2025-04 |
| Authors | Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen |
| Organization | Princeton Language and Intelligence (PLI), Princeton University; University of Wisconsin-Madison; The University of Hong Kong |
| Conference | NeurIPS 2024 Datasets and Benchmarks Track; spotlight at NeurIPS 2024 Multimodal Algorithmic Reasoning Workshop |
| Technical Details | |
| Type | Chart Understanding, Visual Reasoning, Multimodal Evaluation |
| Modality | Vision, Text |
| Task format | Open-vocabulary short-answer question answering |
| Number of question types | 2 (descriptive and reasoning) |
| Total examples | 2,323 charts; 1,000 validation charts with 5,000 questions; remaining 1,323 charts in held-out test split |
| Evaluation metric | Accuracy graded by GPT-4o (LLM-as-judge) |
| Domains | Scientific charts from arXiv, primarily cs.CV, cs.LG and cs.CL papers |
| Languages | English |
| Performance | |
| Human performance | 80.5% reasoning, 92.1% descriptive |
| Best proprietary (paper) | GPT-4o, 47.1% reasoning |
| Best open-source (paper) | InternVL Chat V1.5, 29.2% reasoning |
| Leaderboard top (Dec 2024) | Claude 3.5 Sonnet, 60.2% reasoning, 84.3% descriptive |
| Saturated | No |
| Resources | |
| Website | charxiv.github.io |
| Paper | arXiv:2406.18521 |
| GitHub | princeton-nlp/CharXiv |
| Dataset | Hugging Face: princeton-nlp/CharXiv |
| License | CC BY-SA 4.0 (annotations), Apache 2.0 (code), original chart copyrights retained by paper authors |
CharXiv is a benchmark for evaluating chart understanding in multimodal large language models (MLLMs), built by researchers at Princeton Language and Intelligence with collaborators at the University of Wisconsin-Madison and the University of Hong Kong[1][2]. The paper, "CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs" by Wang et al., was posted to arXiv as 2406.18521 on 26 June 2024 and accepted at the NeurIPS 2024 Datasets and Benchmarks Track on 10 October 2024, with a spotlight at the NeurIPS 2024 Multimodal Algorithmic Reasoning Workshop[2][3].
The benchmark uses 2,323 charts handpicked from real arXiv papers, pairing each chart with four descriptive questions plus one reasoning question. Answers are open-ended short text and grading is done with GPT-4o as judge. The paper reported that GPT-4o, the strongest proprietary model at submission, reached only 47.1% on reasoning while the strongest open-source model, InternVL Chat V1.5, reached 29.2%, both far below human accuracy of 80.5%[1][4].
Chart understanding requires reading text, decoding spatial layouts, parsing legends and axes, and combining numbers across regions of an image. Earlier benchmarks such as ChartQA, PlotQA, FigureQA and DVQA used templated questions and synthetically rendered charts, so models could memorize patterns and post high scores without genuine visual reasoning ability[1]. The CharXiv authors argue this leads to an over-optimistic measure of progress, and the paper shows that simple stress tests with slightly different charts or questions can drop accuracy by up to 34.5% on those benchmarks[1]. CharXiv was designed as a corrective: scale, diversity and visual messiness are baked in by sourcing charts from real scientific papers rather than generating them.
The paper has thirteen authors led by Zirui Wang and including Mengzhou Xia, Haotian Liu (known for LLaVA, based at Wisconsin during the project), Sanjeev Arora and Danqi Chen. Most authors are based at Princeton's PLI lab[3][5].
All charts come from arXiv papers, with a strong skew toward computer science subfields, primarily cs.CV (computer vision), cs.LG (machine learning) and cs.CL (computation and language)[5][6]. The team converted figures from arXiv vector files to raster images with widths between roughly 399 and 1,024 pixels, then ran a manual review to discard unreadable or ambiguous charts. Each chart's metadata records the paper ID, primary arXiv category and figure index[6]. A single CharXiv chart can be a bar plot, multi-axis line graph, scatter plot with regression overlays, heatmap, ROC or precision-recall curve, confusion matrix, or multi-panel composite (up to 120 subplots in the extreme), which distinguishes it from previous chart QA datasets where multi-panel composition was rare[6].
The benchmark splits 2,323 charts into a 1,000-chart validation set and a 1,323-chart held-out test set[1][6]. Each chart carries four descriptive questions and one reasoning question, so the validation set contains 5,000 questions. Test answers are stored as null to prevent leakage; researchers run their model on the test split locally and submit responses for grading[6][7].
| Split | Charts | Descriptive questions | Reasoning questions | Answers public |
|---|---|---|---|---|
| Validation | 1,000 | 4,000 | 1,000 | Yes |
| Test | 1,323 | ~5,292 | ~1,323 | No (held out) |
| Total | 2,323 | ~9,292 | ~2,323 | Mixed |
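The counts in the table follow from the fixed per-chart question mix of four descriptive questions and one reasoning question. A minimal sketch of that arithmetic is shown here; the constant and function names are chosen for illustration and are not taken from the CharXiv code.

```python
# Illustrative arithmetic only; names are hypothetical, not from the CharXiv repo.
DESCRIPTIVE_PER_CHART = 4
REASONING_PER_CHART = 1

SPLIT_CHARTS = {"validation": 1000, "test": 1323}


def question_counts(num_charts: int) -> dict[str, int]:
    """Return descriptive, reasoning and total question counts for a split."""
    descriptive = num_charts * DESCRIPTIVE_PER_CHART
    reasoning = num_charts * REASONING_PER_CHART
    return {
        "descriptive": descriptive,
        "reasoning": reasoning,
        "total": descriptive + reasoning,
    }


counts = {name: question_counts(n) for name, n in SPLIT_CHARTS.items()}
total_charts = sum(SPLIT_CHARTS.values())

for name, split_counts in counts.items():
    print(name, split_counts)
print("all charts:", total_charts, question_counts(total_charts))
```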
Every question is hand-written and verified by a second human annotator. No templates were used at any stage; quality was prioritized over scale[1][7].
CharXiv's two question types break down further into named categories the leaderboard reports as subscores[7].
Descriptive questions test whether a model reads the chart correctly without arithmetic or interpretation beyond the visible content. The five subscore categories:
| Code | Subcategory | What it tests |
|---|---|---|
| INEX | Information Extraction | Locating a specific value, label or annotation |
| ENUM | Enumeration | Counting categories, panels or named series |
| PATT | Pattern Recognition | Recognizing shapes (monotonic increase, oscillation, peak) |
| CNTG | Counting | Counting bars, lines or markers |
| COMP | Compositionality | Combining two or more retrievals (e.g., "is the red line at x=10 larger than the blue line at x=20?") |
Compositionality is the hardest descriptive subcategory: even Claude 3.5 Sonnet drops to 48.66% on COMP while scoring above 80% on the others[7].
Reasoning questions require integrating chart content with external knowledge or multi-step computation, organized by the source of the answer:
| Code | Subcategory | What it tests |
|---|---|---|
| TC | Text-in-Chart | Short text answer that appears in or is implied by the chart |
| TG | Text-in-General | Short text answer requiring external context (e.g., naming the best-performing method) |
| NC | Number-in-Chart | Numeric answer read from or computed using the chart |
| NG | Number-in-General | Numeric answer combining the chart with external knowledge |
The reasoning split is balanced across these four buckets. Number-in-General is the hardest category for almost every system tested[1][7].
The official pipeline (v1.0, released 24 July 2024) is a three-step Python flow on the validation split[8]: generate.py calls the model with each chart-question pair and writes responses; evaluate.py calls the OpenAI API and asks GPT-4o to judge each response; get_stats.py aggregates the overall, descriptive and reasoning scores shown on the leaderboard. The choice of GPT-4o as judge is deliberate: exact-string matching is too strict for open-vocabulary answers, and a strong LLM judge correlates better with human grading[8]. The leaderboard reports descriptive and reasoning subscores separately, plus per-category breakdowns within each[7].
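The aggregation step can be illustrated with a small sketch. This is not the official get_stats.py: the record fields (`type`, `category`, `score`) and the sample data are assumptions made for this example, and the overall score is computed here as a plain average over all questions, which may differ from how the leaderboard weights it.

```python
from collections import defaultdict

# Hypothetical graded records: one per question, with a binary judge score.
# Field names ("type", "category", "score") are assumptions for this sketch,
# not the schema used by the official get_stats.py.
graded = [
    {"type": "descriptive", "category": "INEX", "score": 1},
    {"type": "descriptive", "category": "COMP", "score": 0},
    {"type": "descriptive", "category": "ENUM", "score": 1},
    {"type": "descriptive", "category": "CNTG", "score": 1},
    {"type": "reasoning", "category": "NC", "score": 0},
    {"type": "reasoning", "category": "TC", "score": 1},
]


def accuracy_by(records, key):
    """Group records by `key` and return percentage accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        correct[rec[key]] += rec["score"]
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


by_type = accuracy_by(graded, "type")          # descriptive vs reasoning
by_category = accuracy_by(graded, "category")  # per-subcategory subscores
# Assumption: overall is the unweighted mean over every graded question.
overall = 100.0 * sum(r["score"] for r in graded) / len(graded)

print(by_type, by_category, round(overall, 2))
```

Keeping the grouping key as a parameter means the same function yields both the headline descriptive and reasoning scores and the per-category breakdowns.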
The team evaluated 34 vision-language models covering proprietary APIs, generalist open models and chart-specific systems. Headline numbers on the validation split:
| Model | Reasoning | Descriptive | Notes |
|---|---|---|---|
| Human | 80.5% | 92.1% | Expert upper bound |
| GPT-4o | 47.1% | 84.45% | Best proprietary at submission |
| Gemini 1.5 Pro | 43.3% | 71.97% | Best Google at submission |
| GPT-4V | 37.1% | 79.92% | Earlier OpenAI model |
| Claude 3 Sonnet | 32.2% | 73.65% | Mid-tier Claude 3 |
| Phi-3 Vision | 31.6% | 60.48% | Microsoft small VLM |
| Claude 3 Opus | 30.2% | 71.55% | Pre-3.5 Anthropic flagship |
| InternVL Chat V1.5 | 29.2% | 58.5% | Best open-source at submission |
| Reka Core | 28.9% | 55.6% | Reka flagship |
| Cambrian 34B | 27.3% | 59.73% | NYU open-source |
| Qwen-VL Max | 24.7% | 41.48% | Alibaba proprietary |
| LLaVA 1.6 Yi 34B | 22.5% | 51.05% | Open-source baseline |
A random baseline (GPT-4o sampling without seeing the chart) scores 10.8% on reasoning and 19.85% on descriptive, providing a sanity floor[7]. Two patterns stand out: the descriptive-reasoning gap is large for every model (GPT-4o was within 8 points of humans on descriptive but over 30 points behind on reasoning), and the proprietary-open gap was close to 18 points on reasoning at submission, one of the wider gaps on any 2024 multimodal benchmark[1].
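The gaps quoted above can be reproduced from the figures in the results table. The sketch below copies three rows from that table; the helper name `gap` is introduced here purely for illustration.

```python
# Validation-split scores (percent) copied from the results table.
scores = {
    "Human": {"reasoning": 80.5, "descriptive": 92.1},
    "GPT-4o": {"reasoning": 47.1, "descriptive": 84.45},
    "InternVL Chat V1.5": {"reasoning": 29.2, "descriptive": 58.5},
}


def gap(a: str, b: str, kind: str) -> float:
    """Percentage-point difference scores[a] - scores[b], rounded to 2 places."""
    return round(scores[a][kind] - scores[b][kind], 2)


human_vs_gpt4o_descriptive = gap("Human", "GPT-4o", "descriptive")
human_vs_gpt4o_reasoning = gap("Human", "GPT-4o", "reasoning")
proprietary_vs_open_reasoning = gap("GPT-4o", "InternVL Chat V1.5", "reasoning")

print(human_vs_gpt4o_descriptive)     # under 8 points
print(human_vs_gpt4o_reasoning)       # over 30 points
print(proprietary_vs_open_reasoning)  # close to 18 points
```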
On 25 December 2024 the team posted a leaderboard refresh with newer 2024 systems[7]:
| Model | Reasoning | Descriptive | Released |
|---|---|---|---|
| Claude 3.5 Sonnet | 60.2% | 84.3% | 2024-06 |
| InternVL Chat V2.0 Pro | 39.8% | 76.83% | 2024 |
| InternVL Chat V2.0 76B | 38.9% | 75.17% | 2024 |
| GPT-4o Mini | 34.1% | 74.92% | 2024-07 |
| Gemini 1.5 Flash | 33.9% | n/a | 2024 |
| InternVL Chat V2.0 26B | 33.4% | 62.4% | 2024 |
| MiniCPM-V2.6 (Upsize+CoT) | 31.0% | 57.05% | 2024 |
Claude 3.5 Sonnet's 60.2% on reasoning was a step change, opening a 13-point gap over GPT-4o (47.1%), the next-best system. Newer InternVL V2 models raised the best open-source reasoning score from 29.2% to 38.9%, but that still left the 76B model about 21 points behind Claude. The gap on descriptive questions was much smaller: the open-source 76B model scored 75.17%, about 9 points behind Claude's 84.3%[7].
In an update on 14 April 2025, the CharXiv team noted that the benchmark had been formally incorporated into the official evaluation suites of several frontier model releases, including GPT-4.1, Qwen2.5-VL, InternVL2.5 and Doubao 1.5 Pro[8]. These adoptions converted CharXiv from a third-party academic benchmark into part of the standard set vendors report at launch, alongside MMMU, MathVista and ChartQA. Through 2025 and into 2026, independent leaderboard trackers list later flagship systems with scores substantially above 80%, and some systems augmented with code execution or Python tools cross 85% on reasoning[9]. Because the highest scores rely on tool use, the benchmark is not regarded as saturated for models evaluated end-to-end without tools.
Four recurring failure patterns show up across systems[1][7]: hallucinated values (plausible-looking numbers that do not appear in the chart, especially on log scales); legend confusion (swapping colors or line styles, particularly with more than four series); compositional errors (failing to combine two retrievals even when each point is read correctly); and unit and scale errors (misreading exponents, percentages or unit labels, which contributes to low Number-in-General scores). The authors interpret these as evidence that current MLLMs lack a genuine spatial-symbolic representation of charts: the models treat figures as text-rich images and lean on OCR-style decoding plus surface pattern matching rather than building a structured internal model of the plot. The paper also shows that small perturbations to existing chart QA datasets drop accuracy by up to 34.5%, whereas CharXiv includes this realistic variation by design[1][3].
| Benchmark | Charts | Source | Question style |
|---|---|---|---|
| ChartQA | ~9,600 | Statista, OWID, OECD | Mostly templated |
| PlotQA | 224K | Synthetic from World Bank tables | Templated |
| FigureQA | 1.3M | Synthetic | Yes/no |
| DVQA | 3.5M | Synthetic bar charts | Templated |
| ChartBench | ~67K | Synthetic and web | Mixed |
| MultiChartQA | ~3,500 | Mixed | Multi-chart reasoning |
| CharXiv | 2,323 | Real arXiv papers | Free-form short answers, hand-authored |
CharXiv has roughly a quarter as many charts as ChartQA, but its charts are much harder: multi-panel layouts, log scales, error bars and dense legends are common. By the counts in the table, CharXiv is about two orders of magnitude smaller than PlotQA and nearly three orders smaller than FigureQA, but those datasets are fully synthetic and do not discriminate between strong models. The design choice was deliberate: prioritize realism and difficulty over volume[1].
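As a rough check on these size comparisons, the sketch below works only from the approximate counts listed in the comparison table. The helper name `orders_of_magnitude` is introduced here for illustration.

```python
import math

# Dataset sizes as listed in the comparison table (approximate).
sizes = {
    "ChartQA": 9_600,
    "PlotQA": 224_000,
    "FigureQA": 1_300_000,
    "CharXiv": 2_323,
}


def orders_of_magnitude(larger: int, smaller: int) -> float:
    """Base-10 logarithm of the size ratio between two datasets."""
    return math.log10(larger / smaller)


charxiv_vs_chartqa = sizes["CharXiv"] / sizes["ChartQA"]
plotqa_orders = orders_of_magnitude(sizes["PlotQA"], sizes["CharXiv"])
figureqa_orders = orders_of_magnitude(sizes["FigureQA"], sizes["CharXiv"])

print(f"CharXiv / ChartQA size ratio: {charxiv_vs_chartqa:.2f}")
print(f"PlotQA is larger by {plotqa_orders:.2f} orders of magnitude")
print(f"FigureQA is larger by {figureqa_orders:.2f} orders of magnitude")
```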
Annotations are released under CC BY-SA 4.0, evaluation code under Apache 2.0, and chart images remain under the copyright of the original arXiv authors[8]. The dataset is hosted on Hugging Face at princeton-nlp/CharXiv along with raw responses and graded outputs for all 34 models in the paper, useful for fine-grained meta-analysis[6]. The authors request CharXiv be used for evaluation rather than training; the test split's hidden answers enforce this.
CharXiv was one of the more cited 2024 multimodal evaluation papers. Beyond the NeurIPS spotlight, the benchmark has been used in subsequent chart reasoning research including a follow-up paper from 24 October 2024 (arXiv 2410.18798) proposing training methods specifically targeting CharXiv-style charts[8].
Limitations: CharXiv is English-only; charts come predominantly from arXiv computer science papers, so non-CS, business and journalism charts are underrepresented; and the answer format is restricted to short text. Validation answers are public, so models could indirectly memorize them through web scraping; the test split guards against this but depends on labs honoring the held-out convention[1][6]. LLM-as-judge grading via GPT-4o also couples the benchmark to a specific model, though the team validates GPT-4o judgments against human grading on a sample.
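One generic way to check an LLM judge against human grading is to compare binary grades on a shared sample. The sketch below is not the procedure used by the CharXiv authors, which is not described here. The grade lists are made up, and Cohen's kappa is added as a common chance-corrected measure.

```python
# Hypothetical binary grades (1 = correct, 0 = incorrect) for the same
# sampled responses; the values are made up for illustration.
judge_grades = [1, 0, 1, 1, 0, 1, 0, 1]
human_grades = [1, 0, 1, 0, 0, 1, 0, 1]


def agreement_rate(a: list[int], b: list[int]) -> float:
    """Fraction of items on which two graders give the same binary grade."""
    if len(a) != len(b) or not a:
        raise ValueError("grade lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement for two binary graders."""
    observed = agreement_rate(a, b)
    p_a = sum(a) / len(a)  # share graded correct by the first grader
    p_b = sum(b) / len(b)  # share graded correct by the second grader
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both graders are constant and identical
    return (observed - expected) / (1 - expected)


rate = agreement_rate(judge_grades, human_grades)
kappa = cohens_kappa(judge_grades, human_grades)
print(rate, kappa)
```

Raw agreement alone can look high when most answers are wrong or most are right, which is why a chance-corrected statistic is often reported alongside it.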