| EQ-Bench 3 | |
|---|---|
| Overview | |
| Full name | Emotional Intelligence Benchmark Version 3 |
| Abbreviation | EQ-Bench 3 |
| Description | An LLM-judged benchmark testing emotional intelligence through challenging role-plays and analysis tasks |
| Release date | 2024 (planned/theoretical) |
| Latest version | 3.0 |
| Benchmark updated | 2025 |
| Authors | Samuel J. Paech |
| Organization | Independent Research |
| Technical Details | |
| Type | Emotional Intelligence, Social Understanding |
| Modality | Text |
| Task format | Multi-turn dialogues, Analysis tasks |
| Number of tasks | Unspecified (original EQ-Bench had 60-171 questions) |
| Total examples | Unspecified |
| Evaluation metric | Elo rating, Rubric scoring |
| Domains | Relationship dynamics, Workplace conflicts, Social interactions |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | 200 (Llama 3.2-1B) |
| SOTA score | 1500 |
| SOTA model | OpenAI o3 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | (original EQ-Bench) Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Open source |
| Predecessor | EQ-Bench 2 |
EQ-Bench 3 is an artificial intelligence benchmark designed to evaluate emotional intelligence (EI) in large language models (LLMs) through challenging role-play scenarios and analysis tasks. Developed by Samuel J. Paech as an evolution of the original EQ-Bench (2023), it is the latest iteration of the emotional intelligence benchmark series and uses Claude 3.7 Sonnet as the default judge model to assess how well AI systems understand complex emotions, social dynamics, and interpersonal relationships.
EQ-Bench 3 addresses a critical gap in AI evaluation by focusing on nuanced social skills that are crucial for human-AI interaction but often missed by traditional benchmarks. Unlike standard EQ tests that have become too easy for modern LLMs, EQ-Bench 3 employs difficult, free-form role-plays that effectively discriminate between models' emotional intelligence capabilities.
Motivated by these shortcomings of earlier evaluations, the benchmark specifically targets empathy, theory of mind, social dexterity, and psychological insight through realistic, complex scenarios that mirror real-world social interactions.
EQ-Bench 3 is built from four primary components:
| Component | Description | Function |
|---|---|---|
| Scenario Dataset | 45 multi-turn scenarios | Provides diverse test cases covering various social contexts |
| Judge Model System | Claude 3.7 Sonnet (default) | Evaluates responses using rubric scoring and pairwise comparisons |
| Evaluation Pipeline | Automated scoring system | Processes responses and calculates Elo ratings |
| Analysis Framework | Detailed rubric criteria | Assesses eight core dimensions of emotional intelligence |
EQ-Bench 3 employs a two-pass evaluation approach:
| Pass Type | Description | Output |
|---|---|---|
| Rubric Pass | Judge model assigns numerical scores for each scenario | Individual scenario scores across 8 EI dimensions |
| Elo Pass | Pairwise comparisons between different models' responses | Overall Elo ranking via TrueSkill algorithm |
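The Elo pass aggregates many pairwise judge verdicts into a single ranking. The leaderboard uses a TrueSkill-style algorithm, which also tracks rating uncertainty; as a simplified illustration of the underlying idea, a minimal logistic Elo update per comparison might look like this (all names and the k-factor are assumptions, not the benchmark's actual code):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 16.0) -> tuple[float, float]:
    """Adjust both ratings after one pairwise judge verdict.
    k controls how far a single comparison can move a rating."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Because each update is zero-sum, the winner gains exactly what the loser gives up; repeating this over all scenario pairings yields the overall ranking.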
The benchmark evaluates responses across eight fundamental aspects of emotional intelligence:
| Dimension | Description | Weight in Scoring |
|---|---|---|
| Demonstrated Empathy | Ability to understand and share others' feelings | Equal weight |
| Pragmatic EI | Practical application of emotional understanding | Equal weight |
| Depth of Insight | Psychological understanding and analysis quality | Equal weight |
| Social Dexterity | Navigation of complex social situations | Equal weight |
| Emotional Reasoning | Logic applied to emotional contexts | Equal weight |
| Appropriate Validation/Challenge | Knowing when to support vs. question | Equal weight |
| Message Tailoring | Adapting communication to context and recipient | Equal weight |
| Overall EQ | Holistic emotional intelligence assessment | Equal weight |
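Because all eight dimensions carry equal weight, per-scenario aggregation reduces to a simple mean. A minimal sketch, assuming dictionary keys derived from the dimension names and a numeric per-dimension score (the key names and scale are illustrative assumptions):

```python
# Dimension keys follow the table above; the exact identifiers are assumptions.
RUBRIC_DIMENSIONS = [
    "demonstrated_empathy", "pragmatic_ei", "depth_of_insight",
    "social_dexterity", "emotional_reasoning",
    "appropriate_validation_challenge", "message_tailoring", "overall_eq",
]

def aggregate_rubric(scores: dict[str, float]) -> float:
    """Equal-weight mean across all eight EI dimensions."""
    missing = set(RUBRIC_DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)
```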
EQ-Bench 3 includes two main categories of assessment task: multi-turn role-plays and analysis tasks. The majority of the 45 scenarios are pre-written role-play prompts spanning three turns, with contexts drawn from domains such as relationship dynamics, workplace conflicts, and everyday social interactions. The remaining scenarios are analysis tasks that require the model to reason explicitly about the interaction rather than reply in character.
Each response follows a structured format designed to expose the model's reasoning:
| Section | Purpose | Example Prompt |
|---|---|---|
| "I'm thinking & feeling" | Reveals model's internal processing | "Based on the situation, I'm feeling concerned about..." |
| "They're thinking & feeling" | Demonstrates theory of mind | "The other person likely feels frustrated because..." |
| Response | The actual in-character reply | "I understand your perspective, and..." |
EQ-Bench 3 uses a normalized Elo rating system, with the Llama 3.2-1B baseline anchored at 200 and OpenAI o3 anchored at 1500.
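Normalization can be implemented as a linear rescaling of raw ratings onto a fixed anchor scale, here assuming Llama 3.2-1B at 200 and OpenAI o3 at 1500 per the performance figures above. The linear scheme itself is an assumption for illustration; the published implementation may normalize differently:

```python
def normalize_elo(raw: float, raw_baseline: float, raw_top: float,
                  anchor_low: float = 200.0,
                  anchor_high: float = 1500.0) -> float:
    """Map a raw rating onto the anchored scale, so that the baseline model
    lands at anchor_low and the top anchor model at anchor_high."""
    span = raw_top - raw_baseline
    return anchor_low + (raw - raw_baseline) * (anchor_high - anchor_low) / span
```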
The benchmark tracks eleven stylistic traits (not used in scoring):
| Ability | Description | Assessment Focus |
|---|---|---|
| Humanlike | Natural, conversational responses | Response authenticity |
| Safety | Adherence to ethical guidelines | Risk mitigation |
| Assertive | Confident communication style | Communication strength |
| Social IQ | Understanding of social dynamics | Social awareness |
| Warm | Friendly and approachable tone | Emotional warmth |
| Analytic | Logical reasoning application | Analytical thinking |
| Insight | Novel perspective generation | Creative understanding |
| Empathy | Understanding others' emotions | Emotional resonance |
| Compliant | Instruction following ability | Task adherence |
| Moralising | Tendency toward moral judgment | Ethical positioning |
| Pragmatic | Focus on practical solutions | Solution orientation |
| Rank | Model | Elo Score | Organization | Notable Strengths |
|---|---|---|---|---|
| 1 | OpenAI o3 | 1500 | OpenAI | Benchmark anchor, exceptional across all dimensions |
| 2 | DeepSeek R1 | ~1450 | DeepSeek | Strong analytical and reasoning capabilities |
| 3 | Claude 3.7 Sonnet | Judge Model | Anthropic | Used as evaluation standard |
| - | Llama 3.2-1B | 200 | Meta | Baseline anchor model |
```bash
git clone https://github.com/EQ-bench/eqbench3
cd eqbench3
pip install -r requirements.txt
export ANTHROPIC_API_KEY="your-key-here"
```
```bash
python eqbench3.py --model "your-model" --rubric-only
```
```bash
python eqbench3.py --model "your-model" --full-benchmark
```
| Application Area | Use Case | Impact |
|---|---|---|
| AI Safety | Evaluating social understanding for safe deployment | Risk assessment |
| Model Development | Benchmarking emotional capabilities during training | Performance optimization |
| Human-AI Interaction | Assessing readiness for sensitive conversations | Deployment decisions |
| Psychology Research | Studying machine understanding of human emotions | Scientific insights |
| Limitation | Description | Impact |
|---|---|---|
| English Only | Currently limited to English-language scenarios | Reduced global applicability |
| Text-Only | No multimodal emotional cues (voice, visual) | Limited emotional signal |
| Judge Model Dependency | Relies on Claude 3.7 Sonnet for evaluation | Potential evaluation bias |
| Scenario Scope | 45 scenarios may not cover all social contexts | Coverage gaps |
| Cultural Bias | Western-centric scenario design | May not reflect global norms |
1. **Multilingual Extension**: Adaptation to multiple languages and cultures
2. **Multimodal Integration**: Incorporation of voice and visual emotional cues
3. **Dynamic Scenario Generation**: Procedurally generated test cases
4. **Human Baseline**: Establishing human performance benchmarks
5. **Cross-Cultural Validation**: Scenarios reflecting diverse cultural contexts
EQ-Bench 3 represents a significant advancement in evaluating AI systems' emotional intelligence. Its strong correlation with general intelligence benchmarks suggests that emotional understanding may be a fundamental aspect of artificial general intelligence, and its ability to discriminate between models with similar technical capabilities but different social understanding makes it valuable for model development, deployment decisions, and safety assessment.