EQ-Bench 3
EQ-Bench 3 is an artificial intelligence benchmark designed to evaluate emotional intelligence (EI) in large language models (LLMs) through challenging role-play scenarios and analytical tasks. Developed by Samuel J. Paech as the third iteration of the original EQ-Bench (2023), it uses Claude 3.7 Sonnet as the default judge model to assess how well AI systems understand complex emotions, social dynamics, and interpersonal relationships.
| EQ-Bench 3 | |
|---|---|
| Overview | |
| Full name | Emotional Intelligence Benchmark Version 3 |
| Abbreviation | EQ-Bench 3 |
| Description | An LLM-judged benchmark testing emotional intelligence through challenging role-plays and analysis tasks |
| Release date | 2025 |
| Latest version | 3.0 |
| Benchmark updated | 2025 |
| Authors | Samuel J. Paech |
| Organization | Independent Research |
| Technical Details | |
| Type | Emotional Intelligence, Social Understanding |
| Modality | Text |
| Task format | Multi-turn dialogues, Analysis tasks |
| Number of tasks | Unspecified (original EQ-Bench had 60-171 questions) |
| Total examples | Unspecified |
| Evaluation metric | Elo rating, Rubric scoring |
| Domains | Relationship dynamics, Workplace conflicts, Social interactions |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | 200 (Llama 3.2-1B) |
| SOTA score | 1500 |
| SOTA model | OpenAI o3 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | (original EQ-Bench) Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Open source |
| Predecessor | EQ-Bench 2 |
Overview
EQ-Bench 3 addresses a critical gap in AI evaluation by focusing on nuanced social skills that are crucial for human-AI interaction but often missed by traditional benchmarks. Unlike standard EQ tests that have become too easy for modern LLMs, EQ-Bench 3 employs difficult, free-form role-plays that effectively discriminate between models' emotional intelligence capabilities.
Motivation
The development of EQ-Bench 3 was driven by several key observations:
- Standard emotional intelligence tests are insufficient for evaluating advanced LLMs
- Existing benchmarks often fail to capture the nuanced social skills essential for meaningful human-AI interaction
- Assessment tools need to go beyond knowledge-based or short-answer questions
- Active EQ skills matter more than passive emotion recognition
The benchmark specifically targets the evaluation of empathy, theory of mind, social dexterity, and psychological insight through realistic, complex scenarios that mirror real-world social interactions.
Technical Architecture
Core Components
EQ-Bench 3 is built from four core components:
| Component | Description | Function |
|---|---|---|
| Scenario Dataset | 45 multi-turn scenarios | Provides diverse test cases covering various social contexts |
| Judge Model System | Claude 3.7 Sonnet (default) | Evaluates responses using rubric and pairwise comparisons |
| Evaluation Pipeline | Automated scoring system | Processes responses and calculates Elo ratings |
| Analysis Framework | Detailed rubric criteria | Assesses eight core dimensions of emotional intelligence |
Evaluation Methodology
Dual-Pass Evaluation System
EQ-Bench 3 employs a sophisticated two-pass evaluation approach:
| Pass Type | Description | Output |
|---|---|---|
| Rubric Pass | Judge model assigns numerical scores for each scenario | Individual scenario scores across 8 EI dimensions |
| Elo Pass | Pairwise comparisons between different models' responses | Overall Elo ranking via TrueSkill algorithm |
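The Elo pass turns individual judge decisions into a ranking. EQ-Bench 3 uses the TrueSkill algorithm for this; the sketch below uses a standard Elo update instead as a simplified stand-in, so the K-factor and starting ratings are illustrative, not the benchmark's actual parameters.

```python
# Simplified Elo update driven by one pairwise judge decision.
# (The real benchmark uses TrueSkill; this is an illustrative stand-in.)

def elo_update(rating_a: float, rating_b: float,
               a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Update two model ratings after the judge prefers one response."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# The judge prefers model A's response in one scenario comparison:
ra, rb = elo_update(1000.0, 1000.0, a_wins=True)
```

Repeating this update over many scenario-level comparisons converges toward a stable ordering, which is then anchored to the benchmark's fixed reference points.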
Eight Core Dimensions
The benchmark evaluates responses across eight fundamental aspects of emotional intelligence:
| Dimension | Description | Weight in Scoring |
|---|---|---|
| Demonstrated Empathy | Ability to understand and share others' feelings | Equal weight |
| Pragmatic EI | Practical application of emotional understanding | Equal weight |
| Depth of Insight | Psychological understanding and analysis quality | Equal weight |
| Social Dexterity | Navigation of complex social situations | Equal weight |
| Emotional Reasoning | Logic applied to emotional contexts | Equal weight |
| Appropriate Validation/Challenge | Knowing when to support vs. question | Equal weight |
| Message Tailoring | Adapting communication to context and recipient | Equal weight |
| Overall EQ | Holistic emotional intelligence assessment | Equal weight |
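Since the table above gives every dimension equal weight, rubric aggregation reduces to a plain average. The snippet below sketches that aggregation; the snake_case dimension keys and the idea of a numeric per-dimension score are assumptions for illustration, not the benchmark's actual schema.

```python
# Equal-weight aggregation across the eight EI dimensions.
# Dimension keys are illustrative renderings of the table above.

DIMENSIONS = [
    "demonstrated_empathy", "pragmatic_ei", "depth_of_insight",
    "social_dexterity", "emotional_reasoning",
    "validation_challenge", "message_tailoring", "overall_eq",
]

def scenario_score(rubric: dict[str, float]) -> float:
    """Average the judge's per-dimension scores with equal weight."""
    return sum(rubric[d] for d in DIMENSIONS) / len(DIMENSIONS)
```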
Test Structure
Scenario Types
EQ-Bench 3 includes two main categories of assessment tasks:
Role-Play Scenarios
The majority of the 45 scenarios are pre-written prompts spanning three turns:
- **Turn 1**: User sets up the scenario context
- **Turn 2**: Introduction of conflict or misdirection
- **Turn 3**: Model must respond in-character while navigating complexity
Example contexts include:
- Relationship conflicts requiring mediation
- Workplace tensions needing resolution
- Parenting challenges demanding empathy
- Social dilemmas requiring nuanced understanding
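One way to picture the three-turn structure described above is as a small record per scenario. The field names below are hypothetical, not the dataset's actual schema; only the turn roles follow the description in the text.

```python
# Hypothetical in-memory representation of one three-turn role-play
# scenario. Field names are illustrative, not the real dataset schema.

scenario = {
    "id": "workplace_conflict_01",  # assumed identifier
    "context": "Workplace tension needing resolution",
    "turns": [
        {"role": "user", "purpose": "set up the scenario context"},
        {"role": "user", "purpose": "introduce conflict or misdirection"},
        {"role": "assistant", "purpose": "respond in-character while navigating complexity"},
    ],
}
```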
Analysis Tasks
Several scenarios require the model to:
- Analyze provided roleplay transcripts
- Identify psychologically compelling aspects
- Demonstrate deep understanding of human dynamics
- Explain emotional subtext and motivations
Response Format
Each response follows a structured format designed to expose the model's reasoning:
| Section | Purpose | Example Prompt |
|---|---|---|
| "I'm thinking & feeling" | Reveals model's internal processing | "Based on the situation, I'm feeling concerned about..." |
| "They're thinking & feeling" | Demonstrates theory of mind | "The other person likely feels frustrated because..." |
| Response | The actual in-character reply | "I understand your perspective, and..." |
Performance Metrics
Scoring System
Elo Rating Methodology
EQ-Bench 3 uses a normalized Elo rating system:
- **Anchor Points**: OpenAI o3 at 1500, Llama 3.2-1B at 200
- **Calculation**: Based on pairwise comparisons using TrueSkill algorithm
- **Update Frequency**: Continuous as new models are evaluated
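Pinning the scale to two anchors is a linear rescaling of whatever raw ratings the pairwise pass produces. The anchor values (200 and 1500) come from the benchmark; the raw input range below is made up for illustration.

```python
# Rescale raw ratings so the baseline and top anchor models land
# exactly on 200 and 1500. Only the anchor values come from the
# benchmark; the raw ratings are illustrative.

def normalize(raw: float, raw_low: float, raw_high: float,
              low: float = 200.0, high: float = 1500.0) -> float:
    """Linearly map a raw rating into the anchored Elo range."""
    return low + (raw - raw_low) * (high - low) / (raw_high - raw_low)
```

For example, if raw TrueSkill-style ratings happened to run from 5.0 (baseline model) to 45.0 (top model), `normalize(45.0, 5.0, 45.0)` gives 1500.0 and `normalize(5.0, 5.0, 45.0)` gives 200.0.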
Informational Abilities Heatmap
The benchmark tracks eleven stylistic traits (not used in scoring):
| Ability | Description | Assessment Focus |
|---|---|---|
| Humanlike | Natural, conversational responses | Response authenticity |
| Safety | Adherence to ethical guidelines | Risk mitigation |
| Assertive | Confident communication style | Communication strength |
| Social IQ | Understanding of social dynamics | Social awareness |
| Warm | Friendly and approachable tone | Emotional warmth |
| Analytic | Logical reasoning application | Analytical thinking |
| Insight | Novel perspective generation | Creative understanding |
| Empathy | Understanding others' emotions | Emotional resonance |
| Compliant | Instruction following ability | Task adherence |
| Moralising | Tendency toward moral judgment | Ethical positioning |
| Pragmatic | Focus on practical solutions | Solution orientation |
Current Performance
Leaderboard Leaders (2025)
| Rank | Model | Elo Score | Organization | Notable Strengths |
|---|---|---|---|---|
| 1 | OpenAI o3 | 1500 | OpenAI | Benchmark anchor, exceptional across all dimensions |
| 2 | DeepSeek R1 | ~1450 | DeepSeek | Strong analytical and reasoning capabilities |
| — | Claude 3.7 Sonnet | Not ranked (judge) | Anthropic | Used as the evaluation standard |
| - | Llama 3.2-1B | 200 | Meta | Baseline anchor model |
Key Performance Insights
- **Wide Performance Range**: 1300-point spread between top and baseline models
- **Correlation with General Intelligence**: Strong correlation (r=0.97) with comprehensive benchmarks like MMLU
- **Consistency**: Highly repeatable results across multiple evaluation runs
- **Discrimination Power**: Effectively differentiates between models with similar general capabilities
Implementation
Installation and Setup
```bash
# Clone the repository
git clone https://github.com/EQ-bench/eqbench3
cd eqbench3

# Install dependencies
pip install -r requirements.txt

# Configure the API key for the judge model
export ANTHROPIC_API_KEY="your-key-here"
```
Running Evaluations
Single Iteration (Rubric Scoring)
```bash
python eqbench3.py --model "your-model" --rubric-only
```
Full Benchmark (With Elo Rating)
```bash
python eqbench3.py --model "your-model" --full-benchmark
```
Data Storage
- **Rubric Scores**: Stored in `eqbench3_runs.json`
- **Elo Results**: Recorded in `elo_results_eqbench3.json`
- **Leaderboard Data**: Synchronized with online leaderboard
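The file names above come from the documentation; their internal layout is not specified, so the loader below assumes a simple JSON object keyed by run identifier, purely for illustration.

```python
# Load stored rubric scores from eqbench3_runs.json.
# The file name follows the docs above; the schema (a dict keyed by
# run ID) is an assumption for illustration.
import json
from pathlib import Path

def load_runs(path: str = "eqbench3_runs.json") -> dict:
    """Return stored run records, or an empty dict if no file exists yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text())
```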
Applications and Impact
Research Applications
| Application Area | Use Case | Impact |
|---|---|---|
| AI Safety | Evaluating social understanding for safe deployment | Risk assessment |
| Model Development | Benchmarking emotional capabilities during training | Performance optimization |
| Human-AI Interaction | Assessing readiness for sensitive conversations | Deployment decisions |
| Psychology Research | Studying machine understanding of human emotions | Scientific insights |
Practical Applications
- **Customer Service AI**: Evaluating empathy and problem-solving abilities
- **Mental Health Support**: Assessing appropriateness for supportive roles
- **Educational Assistants**: Measuring ability to understand student emotions
- **Social Companions**: Determining suitability for companionship applications
Limitations and Considerations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| English Only | Currently limited to English-language scenarios | Reduced global applicability |
| Text-Only | No multimodal emotional cues (voice, visual) | Limited emotional signal |
| Judge Model Dependency | Relies on Claude 3.7 Sonnet for evaluation | Potential evaluation bias |
| Scenario Scope | 45 scenarios may not cover all social contexts | Coverage gaps |
| Cultural Bias | Western-centric scenario design | May not reflect global norms |
Future Directions
1. **Multilingual Extension**: Adaptation to multiple languages and cultures
2. **Multimodal Integration**: Incorporation of voice and visual emotional cues
3. **Dynamic Scenario Generation**: Procedurally generated test cases
4. **Human Baseline**: Establishing human performance benchmarks
5. **Cross-Cultural Validation**: Scenarios reflecting diverse cultural contexts
Related Benchmarks
- EQ-Bench: Original version focusing on emotion prediction
- EQ-Bench 2: Intermediate iteration with expanded scenarios
- Creative Writing v3: Related benchmark for creative text generation
- Longform Creative Writing: Extended creative writing assessment
- Theory of Mind Benchmark: Focused theory of mind evaluation
- SocialIQA: Social intelligence question answering
- EmpatheticDialogues: Empathetic conversation dataset
Significance
EQ-Bench 3 represents a significant advancement in evaluating AI systems' emotional intelligence capabilities. Its strong correlation with general intelligence benchmarks suggests that emotional understanding may be a fundamental aspect of artificial general intelligence. The benchmark's ability to discriminate between models with similar technical capabilities but different social understanding makes it valuable for:
- Identifying models suitable for human-facing applications
- Guiding development of more emotionally aware AI systems
- Understanding the relationship between cognitive and emotional intelligence in AI
- Establishing standards for AI deployment in sensitive contexts