CharXiv
| CharXiv | |
|---|---|
| Overview | |
| Full name | Charting Gaps in Realistic Chart Understanding in Multimodal LLMs |
| Abbreviation | CharXiv |
| Description | A comprehensive evaluation suite for assessing chart understanding capabilities in multimodal large language models |
| Release date | 2024-06-26 |
| Latest version | 1.0 |
| Benchmark updated | 2024-06 |
| Authors | Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen |
| Organization | Princeton NLP, University of Wisconsin-Madison, The University of Hong Kong |
| Technical Details | |
| Type | Chart Understanding, Visual Reasoning, Multimodal Evaluation |
| Modality | Vision, Text |
| Task format | Open-vocabulary question answering |
| Number of tasks | 2 (descriptive and reasoning) |
| Total examples | 2,323 charts with 10,000+ questions |
| Evaluation metric | Accuracy (GPT-based evaluation) |
| Domains | Scientific charts from multiple disciplines |
| Languages | English |
| Performance | |
| Human performance | 80.5% |
| Baseline | Varies by model type |
| SOTA score | 60.2% |
| SOTA model | Claude 3.5 Sonnet |
| SOTA date | 2024 |
| Saturated | No |
| Resources | |
| Website | https://princeton-nlp.github.io/CharXiv/ |
| Paper | arXiv:2406.18521 (https://arxiv.org/abs/2406.18521) |
| GitHub | https://github.com/princeton-nlp/CharXiv |
| Dataset | Download |
| License | CC BY-SA 4.0 (data), Apache 2.0 (code) |
CharXiv is a comprehensive benchmark designed to evaluate chart understanding capabilities in Multimodal Large Language Models (MLLMs). Released on June 26, 2024, by researchers from Princeton NLP, University of Wisconsin-Madison, and The University of Hong Kong[1], CharXiv addresses critical limitations in existing chart understanding benchmarks by providing realistic, challenging charts manually sourced from scientific papers. The benchmark reveals a substantial performance gap between current AI models and human capabilities, with the best model (Claude 3.5 Sonnet) achieving only 60.2% accuracy compared to 80.5% human performance.
Overview
CharXiv represents a paradigm shift in evaluating visual reasoning capabilities of AI systems by focusing on realistic chart understanding scenarios. Unlike existing benchmarks that rely on oversimplified, synthetic, or homogeneous charts, CharXiv features 2,323 high-resolution charts manually extracted from arXiv scientific papers, accompanied by over 10,000 carefully crafted questions. The benchmark evaluates two fundamental aspects of chart comprehension: descriptive understanding (examining basic chart elements) and reasoning capabilities (synthesizing information across complex visual elements)[1].
The benchmark's name, a portmanteau of "Chart" and "arXiv," reflects its foundation in real scientific visualizations. By sourcing charts directly from academic papers, CharXiv ensures that models are tested on the types of complex, nuanced visualizations they would encounter in real-world applications, from scientific research to business analytics.
Significance
CharXiv's importance in the field of AI evaluation stems from several key contributions:
- **Realistic Complexity**: First benchmark to systematically evaluate models on natural, challenging charts from actual scientific papers
- **Quality Assurance**: All charts and questions manually curated and verified by human experts
- **Robustness Testing**: Reveals model fragility with performance drops up to 34.5% from simple variations
- **Performance Gap Exposure**: Uncovers ~20% gap between best models and human performance
- **Evaluation Standard**: Adopted by major AI labs for frontier model assessment
Dataset Composition
Chart Collection and Curation
CharXiv's charts undergo a rigorous selection process[2]:
| Stage | Process | Quality Control |
|---|---|---|
| **Source Selection** | ArXiv papers across disciplines | Ensure diversity |
| **Chart Extraction** | High-resolution image capture | Maintain visual clarity |
| **Manual Review** | Expert examination | Remove low-quality/ambiguous charts |
| **Annotation** | Question generation by experts | Multiple review rounds |
| **Verification** | Answer validation | Ensure correctness |
Dataset Statistics
The benchmark comprises carefully balanced components:
| Component | Quantity | Description |
|---|---|---|
| **Total Charts** | 2,323 | High-resolution scientific visualizations |
| **Validation Set** | 1,000 charts | 5,000 questions for development |
| **Test Set** | 1,323 charts | Hidden test for evaluation |
| **Questions per Chart** | 5 | 4 descriptive + 1 reasoning |
| **Total Questions** | 10,000+ | Open-vocabulary short answers |
| **Unanswerable Questions** | 1 per chart | Tests model calibration |
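The 4 + 1 question structure per chart can be pictured as a simple record. The sketch below is a minimal illustration with hypothetical field names; it is not the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChartEntry:
    """Illustrative record for one CharXiv chart (field names are hypothetical)."""
    chart_id: str
    image_path: str                      # high-resolution figure extracted from an arXiv paper
    descriptive_questions: list[str]     # four descriptive questions per chart
    descriptive_answers: list[str]
    reasoning_question: str              # one reasoning question per chart
    reasoning_answer: str
    unanswerable_index: Optional[int] = None  # which descriptive slot is intentionally unanswerable

entry = ChartEntry(
    chart_id="0001",
    image_path="images/0001.jpg",
    descriptive_questions=["What is the label of the x-axis?"] * 4,
    descriptive_answers=["Epochs"] * 4,
    reasoning_question="Which method converges fastest?",
    reasoning_answer="Method A",
)
```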
Chart Types and Domains
CharXiv encompasses diverse visualization types from multiple scientific fields:
| Chart Category | Examples | Complexity Features |
|---|---|---|
| **Statistical Plots** | Bar charts, histograms, box plots | Multiple series, error bars, annotations |
| **Line Graphs** | Time series, trend lines | Multiple axes, logarithmic scales |
| **Scatter Plots** | Correlations, distributions | Overlapping points, regression lines |
| **Heatmaps** | Matrices, correlations | Color gradients, hierarchical clustering |
| **Specialized** | Confusion matrices, ROC curves | Domain-specific notations |
| **Compound** | Multi-panel figures | Cross-panel relationships |
Question Design
Descriptive Questions
Descriptive questions test basic chart comprehension[1]:
| Question Type | Focus Area | Example |
|---|---|---|
| **Data Retrieval** | Extracting specific values | "What is the value at x=5?" |
| **Element Identification** | Recognizing chart components | "What does the blue line represent?" |
| **Trend Recognition** | Identifying patterns | "Is the trend increasing or decreasing?" |
| **Comparison** | Relating multiple elements | "Which category has the highest value?" |
Reasoning Questions
Reasoning questions require deeper analysis:
| Reasoning Type | Cognitive Requirement | Example Task |
|---|---|---|
| **Synthesis** | Combining multiple data points | "What conclusion can be drawn from comparing trends?" |
| **Inference** | Drawing implications | "What might explain this pattern?" |
| **Calculation** | Performing computations | "What is the percentage change?" |
| **Extrapolation** | Extending beyond data | "What would likely happen if this trend continues?" |
Unanswerable Questions
Each chart includes one intentionally unanswerable question to test model calibration:
- Tests ability to recognize information limitations
- Prevents overconfident responses
- Evaluates uncertainty handling
Evaluation Methodology
Evaluation Framework
CharXiv employs a sophisticated evaluation pipeline[3]:
| Component | Implementation | Purpose |
|---|---|---|
| **Setting** | Zero-shot with natural instructions | Real-world usage simulation |
| **Scoring** | GPT API-based evaluation | Consistent semantic matching |
| **Metrics** | Accuracy percentage | Clear performance measurement |
| **Quality Control** | Human verification samples | Ensure scoring reliability |
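As a rough illustration of the zero-shot setting, the sketch below sends one chart image and one natural-language question to a vision-capable chat model through the OpenAI client. It is a simplified stand-in for the repository's generation script, and the prompt wording is an assumption, not the official instruction.

```python
import base64
from openai import OpenAI  # any vision-capable chat API could be substituted

client = OpenAI()

def ask_about_chart(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Zero-shot query: one chart image plus one natural-language question."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question}\nAnswer with a short phrase or number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```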
Technical Implementation
The evaluation process follows three steps:
```bash
# Step 1: Generate responses
python src/generate.py --model [model_name]

# Step 2: Evaluate responses
python src/evaluate.py --responses [response_file]

# Step 3: Calculate statistics
python src/get_stats.py --evaluation [eval_file]
```
Scoring Criteria
Responses are evaluated based on:
- **Semantic Correctness**: Answer conveys correct information
- **Precision**: Appropriate level of detail
- **Relevance**: Directly addresses the question
- **Format Compliance**: Follows answer format requirements
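A minimal sketch of the LLM-as-judge idea follows: a grading model compares the prediction against the reference answer for semantic rather than exact string match, crediting correct "unanswerable" responses. The prompt wording is illustrative and does not reproduce the official rubric in src/evaluate.py.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a short answer about a chart.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with exactly "correct" if the model answer conveys the same information
as the reference (including correctly stating that the question is unanswerable
when the reference says so), otherwise reply "incorrect"."""

def judge(question: str, reference: str, prediction: str, model: str = "gpt-4o") -> bool:
    """LLM-as-judge scoring: semantic match rather than exact string match."""
    verdict = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
    ).choices[0].message.content.strip().lower()
    return verdict.startswith("correct")

# Accuracy is then the fraction of questions judged correct.
```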
Performance Analysis
Current Leaderboard
Performance as of 2024[1]:
| Rank | Model | Overall Accuracy | Descriptive | Reasoning | Gap to Human |
|---|---|---|---|---|---|
| - | Human Performance | 80.5% | 85.2% | 71.3% | 0% |
| 1 | Claude 3.5 Sonnet | 60.2% | 65.8% | 48.9% | -20.3% |
| 2 | GPT-4o | 47.1% | 52.3% | 36.7% | -33.4% |
| 3 | InternVL Chat V2.0 76B | 38.9% | 43.2% | 30.1% | -41.6% |
| 4 | Gemini-1.5-Pro | 35.7% | 40.1% | 27.2% | -44.8% |
| 5 | InternVL Chat V1.5 | 29.2% | 33.5% | 20.8% | -51.3% |
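The "Gap to Human" column is simply each model's overall accuracy minus the 80.5% human score, for example:

```python
HUMAN = 80.5  # overall human accuracy (%)

leaderboard = {
    "Claude 3.5 Sonnet": 60.2,
    "GPT-4o": 47.1,
    "InternVL Chat V2.0 76B": 38.9,
    "Gemini-1.5-Pro": 35.7,
    "InternVL Chat V1.5": 29.2,
}

for model, acc in leaderboard.items():
    print(f"{model}: gap to human = {acc - HUMAN:+.1f} pts")
# Claude 3.5 Sonnet: gap to human = -20.3 pts
# GPT-4o: gap to human = -33.4 pts
# ...
```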
Performance Insights
Key findings from evaluations reveal:
| Finding | Implication | Research Direction |
|---|---|---|
| **20-40% Human Gap** | Significant room for improvement | Enhanced visual reasoning |
| **Descriptive > Reasoning** | Basic tasks easier than synthesis | Better integration capabilities |
| **Model Fragility** | 34.5% drop from variations | Robustness improvements |
| **Proprietary Advantage** | Closed models outperform open | Open-source advancement needed |
Model Adoption
CharXiv has become a standard evaluation for frontier models:
| Model Family | Adoption Status | Use Case |
|---|---|---|
| GPT-4 variants | Official benchmark | Performance tracking |
| Qwen2.5-VL | Integrated | Development validation |
| InternVL series | Standard eval | Version comparison |
| Llama 3.2 Vision | Included | Capability assessment |
| Molmo, NVLM | Adopted | Multimodal evaluation |
Robustness Analysis
Stress Testing Results
CharXiv includes systematic robustness evaluations[1]:
| Perturbation Type | Average Performance Drop | Most Affected Models |
|---|---|---|
| **Question Rephrasing** | -12.3% | Smaller models |
| **Chart Color Changes** | -8.7% | Vision encoders |
| **Axis Relabeling** | -15.2% | All models |
| **Data Point Removal** | -34.5% | Reasoning tasks |
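The drop figures above can be understood as the difference in accuracy between the original items and a perturbed copy of them. The sketch below is a generic harness under that assumption, not the official stress-testing protocol; `model_fn` and `perturb` are hypothetical callables, and `judge()` is the scoring sketch shown earlier.

```python
from statistics import mean

def accuracy(model_fn, items):
    """Fraction of (image_path, question, reference) items answered correctly."""
    return mean(judge(q, ref, model_fn(img, q)) for img, q, ref in items)

def performance_drop(model_fn, items, perturb):
    """Percentage-point drop in accuracy after perturbing each item.

    `perturb` maps (image_path, question, reference) to a modified triple,
    e.g. a rephrased question or a recolored chart (hypothetical helpers).
    """
    base = accuracy(model_fn, items)
    perturbed = accuracy(model_fn, [perturb(*item) for item in items])
    return 100 * (base - perturbed)
```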
Failure Modes
Common failure patterns identified:
- **Hallucination**: Inventing data points not present
- **Misalignment**: Confusing chart elements
- **Overcounting**: Errors in element enumeration
- **Scale Misreading**: Incorrect axis interpretation
Technical Specifications
Repository Structure
| Directory | Contents | Purpose |
|---|---|---|
| `data/` | QA pairs and metadata | Core dataset |
| `images/` | Chart images | Visual inputs |
| `results/` | Model outputs | Evaluation tracking |
| `src/` | Python evaluation code | Processing pipeline |
| `scripts/` | Utility scripts | Automation tools |
Usage Guidelines
| Aspect | Specification | Rationale |
|---|---|---|
| **Intended Use** | Evaluation only | Prevent overfitting |
| **Training Prohibition** | Not for model training | Maintain test integrity |
| **Data License** | CC BY-SA 4.0 | Academic sharing |
| **Code License** | Apache 2.0 | Open development |
| **Chart Rights** | Original authors | Respect copyright |
Research Impact
Influence on Field
CharXiv has catalyzed several research directions:
| Area | Impact | Active Research |
|---|---|---|
| **Visual Reasoning** | New architectures | Enhanced encoders |
| **Robustness** | Stress testing adoption | Invariance training |
| **Multimodal Integration** | Better fusion methods | Cross-modal attention |
| **Evaluation Standards** | Realistic benchmarks | Domain-specific tests |
Related Benchmarks
| Benchmark | Focus | Relation to CharXiv |
|---|---|---|
| ChartQA | Simple charts | Less complex, synthetic |
| PlotQA | Plot understanding | Narrower scope |
| FigureQA | Figure reasoning | Binary questions only |
| DVQA | Bar chart QA | Single chart type |
| CharXiv | Realistic scientific charts | Comprehensive, natural |
Limitations and Future Work
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| **English Only** | Single language | Limited global applicability |
| **Scientific Focus** | ArXiv papers primarily | May miss business charts |
| **Static Evaluation** | Fixed question set | Potential memorization |
| **Answer Format** | Short answers only | Misses explanations |
Future Directions
Planned and potential extensions:
1. **Multilingual Support**: Expanding to non-English charts
2. **Interactive Evaluation**: Multi-turn chart discussions
3. **Domain Expansion**: Business, journalism, education charts
4. **Explanation Requirements**: Reasoning process evaluation
5. **Dynamic Generation**: Procedural question creation
Significance
CharXiv represents a crucial advancement in evaluating multimodal AI systems' chart understanding capabilities. By introducing realistic, challenging charts from actual scientific papers and revealing a substantial 20-40% performance gap between current models and human performance, it exposes previously hidden weaknesses in visual reasoning systems. The benchmark's stress testing reveals that even top models suffer performance drops up to 34.5% from simple variations, highlighting the fragility of current approaches.
As data visualization becomes increasingly central to communication across science, business, and media, CharXiv provides essential infrastructure for developing AI systems capable of genuine chart comprehension. Its adoption by major AI labs and integration into frontier model evaluations establishes it as a critical milestone for measuring progress toward human-level visual understanding capabilities.
See Also
- Chart Understanding
- Multimodal Large Language Models
- Visual Reasoning
- Princeton NLP
- ChartQA
- Scientific Figure Analysis
- Vision-Language Models
References
1. Wang, Z., Xia, M., He, L., Chen, H., Liu, Y., Zhu, R., Liang, K., Wu, X., Liu, H., Malladi, S., Chevalier, A., Arora, S., & Chen, D. (2024). "CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs". arXiv:2406.18521. https://arxiv.org/abs/2406.18521
2. Princeton NLP. (2024). "CharXiv: Chart Understanding Benchmark". GitHub. https://github.com/princeton-nlp/CharXiv
3. Princeton NLP. (2024). "CharXiv: Official Website". https://princeton-nlp.github.io/CharXiv/