EQ-Bench 3

Overview
Full name	Emotional Intelligence Benchmark, Version 3
Abbreviation	EQ-Bench 3
Description	An LLM-judged benchmark testing emotional intelligence in large language models through multi-turn role-play and analysis tasks
Predecessor lineage	EQ-Bench v1 (2023, 60 questions), EQ-Bench v2 (2024, 171 questions)
Release of v3	2025
Author	Samuel J. Paech (Sam Paech)
Organization	Independent research, eqbench.com
Technical Details
Type	Emotional intelligence, social reasoning, theory of mind
Modality	Text
Task format	Multi-turn dialogues (3 turns) and transcript-analysis tasks
Number of scenarios	45
Default judge model	Claude Sonnet 3.7 (with later runs using Claude Opus 4.6)
Evaluation metrics	Rubric scoring (0 to 100) and pairwise Elo via TrueSkill
Domains	Relationship conflict, parenting, workplace dynamics, mediation, social dilemmas
Languages	English
Performance
Anchor (top)	OpenAI o3 at Elo 1500
Anchor (baseline)	Llama 3.2-1B at Elo 200
Notable top model (Aug 2025)	horizon-alpha at 1568
Resources
Website	eqbench.com
Original paper	arXiv:2312.06281
GitHub (v3)	EQ-bench/eqbench3
GitHub (legacy)	EQ-bench/EQ-Bench
License	MIT

EQ-Bench 3 is an artificial intelligence benchmark that measures the emotional intelligence of large language models through challenging multi-turn role-plays and transcript-analysis tasks, with results graded by an LLM judge.^[1]^[2] The project is the third major iteration of the EQ-Bench series created by independent researcher Samuel J. Paech, whose original benchmark was published as a preprint on arXiv in December 2023.^[3] EQ-Bench 3 is hosted on the eqbench.com leaderboard alongside companion benchmarks for creative writing, longform creative writing, and judge calibration.^[1]

Where the first two versions of EQ-Bench tested a model's ability to predict the intensity of emotions in short dialogues, version 3 takes a different approach. The model is dropped into messy three-turn conversations (parenting fights, ugly breakups, awkward management problems) and has to respond in character, with extra introspection and theory-of-mind sections that expose how it is reasoning about the people in the scene.^[1]^[2] A judge model, by default a Claude model from Anthropic, then scores the transcript on a rubric and compares it head-to-head against other models to compute an Elo rating.^[1]^[4]

Background and motivation

The original EQ-Bench paper argued that traditional knowledge-heavy benchmarks like MMLU miss something important about modern chatbots: their day-to-day usefulness depends on whether they can read a room. Paech's first version asked models to predict the intensity of four emotions felt by characters in a 60-item dialogue set, scoring answers against reference values rated 0 to 10.^[3] Scores on that design correlated strongly with general capability (r = 0.97 against MMLU on a sample of leading models), but the format had a ceiling problem. By 2024, frontier models were saturating the scale, and v2's expansion to 171 items only delayed the inevitable.^[3]^[2]

EQ-Bench 3, released in 2025, is a deliberate reset. Paech moved from numeric prediction to free-form generation, replaced programmatic comparison against reference answers with rubric and pairwise judging by an LLM, and cut the test set from 171 short prompts to 45 long, deliberately uncomfortable scenarios.^[2]^[4]

Methodology

Test set

The v3 test set contains 45 scenarios, most of them pre-written prompts that span three turns.^[2] In a typical scenario, the user opens with a setup that establishes context (a couple is fighting about money, a manager is debating how to deliver bad news to a direct report), and a second user turn injects a complication: the other person becomes more defensive, a new fact emerges, or the user pivots toward an emotionally loaded request.^[2]^[4] The model under test plays a fixed character across all three turns and must keep that role consistent while adapting to whatever the user throws at it.

A smaller subset of items uses an analysis format. Instead of role-playing, the model reads a transcript of an existing emotional exchange and is asked to identify what is psychologically interesting in it, what each party is likely thinking and feeling, and where the conversation is failing or succeeding.^[1]^[2]

Required response structure

Each response is required to follow a fixed scaffold so the judge can see how the model is reasoning, not just what it says.^[1]^[2] The scaffold has four parts:

Section	Purpose
"I'm thinking & feeling"	The model's first-person reaction to the scene, exposing introspection
"They're thinking & feeling"	An attempt at theory of mind, describing the other party's likely state
In-character response	The actual reply the model would send to the user
Self-debrief	A short reflection on the choice the model just made

This structure forces models to lay out their psychological model of the situation before they answer, which makes empty platitudes and shallow validation easier for the judge to spot.^[2]
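As a rough illustration of how a harness might check that a reply follows the scaffold, the sketch below splits a response into its named sections and reports any that are missing. The heading strings, the `# heading` line format, and the function names `split_sections` and `missing_sections` are assumptions made for this example; they are not taken from the EQ-Bench 3 prompts or code.

```python
import re

# Hypothetical heading labels; the real prompt may word or format them differently.
REQUIRED_SECTIONS = [
    "I'm thinking & feeling",
    "They're thinking & feeling",
    "My response",
    "Self-debrief",
]


def split_sections(response: str, headings=REQUIRED_SECTIONS) -> dict[str, str]:
    """Split a response into sections, assuming each starts with '# <heading>' on its own line."""
    pattern = re.compile(r"^#[ \t]*(.+?)[ \t]*$", re.MULTILINE)
    # Keep only heading lines whose text is one of the expected section names.
    matches = [m for m in pattern.finditer(response) if m.group(1) in headings]
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs from the end of its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        sections[m.group(1)] = response[m.end():end].strip()
    return sections


def missing_sections(response: str, headings=REQUIRED_SECTIONS) -> list[str]:
    """Return the expected headings that are absent or have an empty body."""
    found = split_sections(response, headings)
    return [h for h in headings if not found.get(h)]
```

A check like this only verifies that the sections are present; judging their quality is still left to the judge model.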

Judging

EQ-Bench 3 is an LLM-judged benchmark. The default judge is a Claude model from Anthropic, originally Claude Sonnet 3.7 and later Claude Opus 4.6 for headline runs, though any LLM exposed via an OpenAI-compatible endpoint can be plugged in.^[1]^[2] The judge does two passes per run.

The rubric pass assigns numeric scores on a fixed list of criteria. Six criteria count toward the headline rubric score: demonstrated empathy, pragmatic emotional intelligence, depth of insight, social dexterity, emotional reasoning, and message tailoring.^[1]^[4] The same pass also reports a longer list of stylistic readings (warmth, validating, challenging, moralising, compliant, conversational, humanlike, analytical, reactive, safety conscious, boundary setting), but those are descriptive only and do not feed the score.^[1] Per Paech's own cost notes, a single rubric iteration runs about $1.50 in judge-model API spend.^[4]
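The split between scored criteria and descriptive readings can be sketched as follows. The source does not state how the six criteria are combined, so the unweighted mean, the per-criterion scale (`scale_max`), the key names, and the function name `rubric_score` are all assumptions made for illustration.

```python
from statistics import mean

# The six criteria the text says count toward the headline rubric score.
# The snake_case key names are hypothetical.
SCORED_CRITERIA = (
    "demonstrated_empathy",
    "pragmatic_ei",
    "depth_of_insight",
    "social_dexterity",
    "emotional_reasoning",
    "message_tailoring",
)


def rubric_score(judge_scores: dict[str, float], scale_max: float = 20.0) -> float:
    """Average the six scored criteria and rescale the result to 0-100.

    Assumption: each criterion is rated on 0..scale_max and the headline
    score is an unweighted mean. Any other keys (warmth, moralising, ...)
    are treated as descriptive and ignored.
    """
    missing = [c for c in SCORED_CRITERIA if c not in judge_scores]
    if missing:
        raise KeyError(f"judge output lacks criteria: {missing}")
    avg = mean(judge_scores[c] for c in SCORED_CRITERIA)
    return 100.0 * avg / scale_max
```

In this sketch a stylistic reading such as `warmth` can be present in the judge output without changing the result, which mirrors the "descriptive only" behaviour described above.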

The Elo pass is what produces the public leaderboard ranking. The judge is shown two anonymised transcripts at a time, both from the same scenario, and asked to compare them across eight dimensions: demonstrated empathy, pragmatic emotional intelligence, depth of insight, social dexterity, emotional reasoning, appropriate validation/challenging, message tailoring, and overall EQ.^[1]^[4] Win margins on those dimensions are aggregated into a margin-weighted TrueSkill update, and TrueSkill scores are linearly mapped onto an Elo-style scale anchored at OpenAI o3 = 1500 and Llama 3.2-1B = 200.^[1]^[2] A full Elo run costs roughly $10 to $20 in judge calls.^[4]
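The anchoring step is a plain two-point linear map. The sketch below shows the arithmetic, assuming the raw TrueSkill-style ratings of the two anchor models are already known; the function name `make_anchor_mapping` and the raw values in the usage note are hypothetical.

```python
def make_anchor_mapping(raw_top: float, raw_base: float,
                        elo_top: float = 1500.0, elo_base: float = 200.0):
    """Build a function mapping raw ratings onto an anchored Elo-style scale.

    raw_top and raw_base are the raw ratings of the two anchor models
    (o3 and Llama 3.2-1B in the text); they land exactly on elo_top and elo_base.
    """
    if raw_top == raw_base:
        raise ValueError("anchor ratings must differ")
    slope = (elo_top - elo_base) / (raw_top - raw_base)

    def to_elo(raw: float) -> float:
        # Straight line through (raw_base, elo_base) and (raw_top, elo_top).
        return elo_base + slope * (raw - raw_base)

    return to_elo
```

With made-up anchor ratings of 35.0 and 10.0, `to_elo = make_anchor_mapping(35.0, 10.0)` sends 35.0 to 1500 and 10.0 to 200, and any model rated above the top anchor maps above 1500, which is how leaderboard entries can exceed the o3 anchor.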

Bias controls

A recurring criticism of LLM-as-judge setups is that judge models reward verbosity, validation, or models from their own family. EQ-Bench 3 explicitly tries to control for the easier ones.^[1]^[4]

Bias source	Mitigation
Length	Transcripts are truncated to a standardised length before pairwise judging
Position (A vs B)	Each pair is judged twice, once as A then B, once as B then A, and results are averaged
Named participants	Participants are renamed to neutral codes such as "A0488" so the judge cannot infer the model from a self-reference
Adversarial prompting	Tested against "be extremely warm and validating" and "be very concise" instructions; both produced under three percent score change
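The first three mitigations in the table can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the judge is modelled as a callable returning a signed margin, truncation is done by character count, and the names `Judge`, `truncate`, `neutral_code`, and `debiased_margin` are hypothetical.

```python
import random
from typing import Callable

# Assumed judge interface: takes (first_shown, second_shown) and returns a signed
# margin, positive when the first-shown transcript wins, negative when the second does.
Judge = Callable[[str, str], float]


def truncate(text: str, max_chars: int) -> str:
    """Cut a transcript to a standard length (character-based here, as an assumption)."""
    return text[:max_chars]


def neutral_code(rng: random.Random) -> str:
    """Return a neutral participant label such as 'A0488' (a letter plus four digits)."""
    return f"A{rng.randrange(10_000):04d}"


def debiased_margin(judge: Judge, transcript_x: str, transcript_y: str,
                    max_chars: int = 4000) -> float:
    """Judge the pair in both orders and average, from X's point of view."""
    x = truncate(transcript_x, max_chars)
    y = truncate(transcript_y, max_chars)
    x_first = judge(x, y)   # X shown in position A
    y_first = judge(y, x)   # X shown in position B, so the sign is flipped below
    return (x_first - y_first) / 2.0
```

If the judge adds a constant bonus to whichever transcript is shown first, that bonus appears with opposite signs in the two orderings and cancels in the average, which is the point of judging each pair twice.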

The project openly flags what it does not fully fix: judge self-bias and judge family bias (estimated at 0 to 10 percent in Paech's own tests), and broader cultural or stylistic preferences baked into the judge model.^[4] Full transcripts are published so that any reader can sanity-check a contested ranking by hand.

Repeatability

Paech reports that rubric scores are highly stable across re-runs, with a standard deviation of about 0.75 points on a mean of around 77 over ten iterations.^[4] Elo scores need more iterations to settle because TrueSkill needs enough head-to-head matchups to converge, but the published rankings stabilise once a model has been compared against the full slate of anchors and peers.^[1]
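To put the reported figures in proportion, a standard deviation of 0.75 on a mean of 77 is a relative spread of about 1 percent (0.75 / 77 ≈ 0.0097). A minimal sketch of the same summary for one's own repeated runs follows; the function name `summarise_runs` is hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, stdev


def summarise_runs(scores: list[float]) -> dict[str, float]:
    """Summarise repeated rubric runs of one model: mean, sample std, relative spread."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
    return {"mean": m, "stdev": s, "relative": s / m}
```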

Differences from EQ-Bench v1 and v2

Aspect	v1 (2023)	v2 (2024)	v3 (2025)
Format	Numeric emotion-intensity prediction	Same format, expanded set	Multi-turn role-play and analysis
Items	60 dialogue prompts	171 dialogue prompts	45 long scenarios
Scoring	0 to 10 ratings normalised to sum to 10	Full-scale with curved penalties	Rubric (0 to 100) plus pairwise Elo
Evaluator	Programmatic comparison to reference	Programmatic comparison to reference	LLM judge (Claude family by default)
Languages	English	English plus German in v2.1	English
Output style	Single numeric answer per emotion	Single numeric answer per emotion	Free-form, four-section response
Score spread	Saturating among frontier models	Heavily compressed at the top	Spread of more than 1300 Elo points across the field

The move from short-answer numeric prediction to long-form judged dialogue is the headline change. v1 and v2 measured how well a model could rate someone else's emotions; v3 measures whether the model can itself carry an emotionally realistic conversation.^[2]^[3]

Leaderboard and notable results

The public leaderboard is hosted at eqbench.com and updates as Paech adds models. Published rankings as of August 2025 (with the leaderboard anchored at o3 = 1500) included the following top entries:^[5]

Rank	Model	Elo	Organisation
1	horizon-alpha	1568	Unattributed (cloaked release)
2	Kimi-K2-Instruct	1565	Moonshot AI
3	OpenAI o3	1500	OpenAI (anchor)
4	Gemini 2.5 Pro preview-06-05	1470	Google DeepMind
5	chatgpt-4o-latest-2025-03-27	1370	OpenAI
6	gpt-5-chat-latest-2025-08-07	1357	OpenAI
7	chatgpt-4o-latest-2025-04-25	1320	OpenAI
8	GLM-4.5	1311	Zhipu AI
9	OpenAI o4-mini	1291	OpenAI
10	Claude Opus 4	1290	Anthropic

Later leaderboard captures show xAI's Grok 4.1 Thinking and Grok 4.1 sitting near the top of the chart with Elo scores in the high 1580s, as the leaderboard has continued to absorb new releases and judge updates.^[1]^[6]

A few patterns are worth flagging because they cut against the usual story benchmark blogs tell.

Newer is not always better at EQ. GPT-5's chat-tuned August 2025 release (1357) actually scored below the late-March 2025 ChatGPT-4o snapshot (1370), suggesting that whatever changed between those models traded some warmth or insight for other capabilities.^[5]
Open-weight models punch up. Moonshot's Kimi-K2-Instruct edged ahead of OpenAI's o3 anchor, and GLM-4.5 from Zhipu landed inside the top ten, both consistent with the broader 2025 pattern of competitive Chinese open-weight releases.^[5]
Cloaked models do well. "horizon-alpha" topping the August 2025 chart fits a recurring story on EQ-Bench, where unidentified pre-release entries score unusually high before being absorbed into a named launch.^[5]

Because the benchmark is anchored, scores can be read as roughly comparable across runs, but Paech is clear that the spread between any two adjacent models is small relative to judge noise.^[1]

Strengths and limitations

Things EQ-Bench 3 does well:

Free-form output. The benchmark cannot be gamed by guessing a number on a 0 to 10 scale; models have to actually generate convincing dialogue.
Strong discrimination at the top. A spread of more than 1300 Elo points between baseline and frontier models means the benchmark still has headroom even when models like o3 and GPT-5 are saturating older tests.^[1]
Transparent methodology. The full prompts, the judge prompt, the rubric, and a representative selection of model transcripts are all published, which makes the benchmark much easier to critique than closed-source evaluations.^[2]
Cheap to run. A rubric pass for a new model is a single-digit number of dollars in API spend, which is part of why eqbench.com tracks new releases so quickly.^[4]

Limitations Paech and outside readers have noted:

Judge dependence. The default judge is a Claude model, and even with position-bias controls, the choice of judge influences the ranking.^[1]^[4]
English-only and culturally narrow. All 45 scenarios are written in English and lean toward Western, urban, middle-class contexts (couples therapy, knowledge-work conflict, modern parenting).^[2]
Small test set. Forty-five scenarios is enough to discriminate between models, but the benchmark trades coverage for depth.^[2]^[4]
Subjective ground truth. Unlike a math benchmark, there is no objectively correct answer to a relationship dilemma. Paech describes the scores as "roughly indicative but not absolute truth".^[4]

Implementation

EQ-Bench 3 is open source under the MIT license, structured as a Python project that talks to OpenAI-compatible APIs for both the test model and the judge.^[2] A typical setup:

git clone https://github.com/EQ-bench/eqbench3
cd eqbench3
pip install -r requirements.txt
export ANTHROPIC_API_KEY="..."

A rubric-only pass is run with a flag; a full Elo pass requires baseline transcripts to compare against. Run results are saved to eqbench3_runs.json and elo_results_eqbench3.json, and can optionally be uploaded to the public leaderboard. The project supports parallel threading on both the generation and judging side, which is what makes a fresh model evaluable in roughly an afternoon of compute.^[2]

Reception

EQ-Bench 3 has become one of the more cited community benchmarks for chat-style evaluation. The August 2025 leaderboard generated commentary in AI publications and on aggregator sites, partly because horizon-alpha led the chart and partly because GPT-5's chat-tuned snapshot scored below an earlier 4o release.^[5] Its companion benchmarks on eqbench.com (Creative Writing v3, Longform Creative Writing, the Spiral-Bench safety evaluation, and Judgemark, which evaluates judges themselves) form a small ecosystem of LLM-judged tests aimed at qualities that are hard to measure with multiple choice.^[1]

References

[1] EQ-Bench Leaderboard, "EQ-Bench 3", eqbench.com, https://eqbench.com/about.html (accessed 2026).
[2] EQ-bench, "eqbench3 GitHub repository", https://github.com/EQ-bench/eqbench3.
[3] Paech, Samuel J. "EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models", arXiv:2312.06281, December 2023, https://arxiv.org/abs/2312.06281.
[4] EQ-Bench, "About" page describing v3 methodology, judge selection, bias mitigation, and cost notes, https://eqbench.com/about.html.
[5] Communeify, "The Great AI EQ Battle: 2025's Latest EQ-Bench Rankings Revealed", August 2025, https://www.communeify.com/en/blog/2025-ai-eq-bench-ranking-most-emotionally-intelligent-llm/.
[6] LLM-Stats, "EQ-Bench Benchmark Leaderboard", https://llm-stats.com/benchmarks/eq-bench.
