Creative Writing v3

Updated May 10, 2026

Creative Writing v3
Overview
Full name	EQ-Bench Creative Writing Benchmark Version 3
Abbreviation	CW v3
Description	An LLM-judged creative writing benchmark using a hybrid rubric and Elo scoring system for enhanced discrimination between top models
Release date	2025
Latest version	3.0
Authors	Samuel J. Paech
Organization	EQ-Bench (independent research)
Technical Details
Type	Creative Writing, Text Generation
Modality	Text
Task format	Generative writing prompts with rubric scoring and pairwise Elo
Number of tasks	32 prompts (96 iterations total)
Total examples	96
Evaluation metric	Normalized Elo, rubric score, repetition, slop score
Domains	Fiction writing, humor, romance, spatial awareness, unusual first-person perspectives
Languages	English
Performance
Human performance	Not reported
Baseline	DeepSeek R1 anchored to Elo 1500, ministral-3b anchored to 200
SOTA score	1721.9 (normalized Elo, Grok 4.1 Thinking)
SOTA model	Grok 4.1 Thinking
SOTA date	2025-2026
Saturated	No
Resources
Website	Official leaderboard (https://eqbench.com/creative_writing.html)
GitHub	Repository (https://github.com/EQ-bench/creative-writing-bench)
License	Open source
Predecessor	Creative Writing v2

Creative Writing v3 is an artificial intelligence benchmark that evaluates creative writing in large language models (LLMs) using a hybrid framework combining isolated rubric scoring with pairwise Elo comparisons. Released in 2025 by Samuel J. Paech under the EQ-Bench project, the benchmark uses a strong judge model (currently Claude Sonnet 4) to score outputs across thirty-two prompts that target known weak spots of language models: humor, romance, spatial reasoning, and unusual first-person perspectives.^[1]^[2]

The benchmark was built to fix the saturation problem of Creative Writing v2, where the judge could no longer separate top models. The design of v3 is aimed at sharper discrimination throughout, from harder prompts to head-to-head matchups rated with the Glicko-2 system. The leaderboard at eqbench.com is one of the most-cited public references for ranking creative writing ability in modern LLMs.^[1]^[3]

overview

Creative Writing v3 splits assessment into two parts. The judge model first scores each generated piece against a rubric covering coherence, originality, voice, and craft. The judge then compares pairs of outputs from different models in head-to-head matchups, producing an Elo rating that reflects relative quality. The two layers combine into a final ranking the leaderboard normalizes for stability across runs.^[1]^[2]

Paech chose prompts that push models into known weak spots. Humor is a long-standing weakness because most models default to puns rather than genuine comedy. Romance prompts test whether a model can produce emotional depth without sliding into cliche. Spatial awareness prompts expose how often LLMs lose track of who is standing where. First-person prompts request voices that are not the typical helpful narrator, like an unreliable witness or a non-human point of view.^[1]

motivation

The development of v3 was driven by several connected problems with earlier evaluations:

The need for better discrimination between high-performing models, since v2 had saturated at the top.
Limitations of pure rubric scoring once strong models cluster near the ceiling.
Known judge biases such as length, position, and stylistic preference.
The goal of exposing specific weaknesses rather than rewarding generic competence.

The leaderboard is intentionally pessimistic: a model that writes adequately on every prompt will land in the middle, not at the top.^[1]^[2]

technical architecture

core components

Component	Description
Prompt dataset	32 prompts in `creative_writing_prompts_v3.json`
Generation system	Temperature 0.7, min_p 0.1
Judge model	Claude Sonnet 4, recommended for leaderboard parity
Scoring framework	Hybrid rubric plus Elo using Glicko-2
Anchor models	DeepSeek R1 (1500), ministral-3b (200)

evaluation methodology

A full evaluation runs in four stages. The model under test generates three completions for each of the 32 prompts, producing 96 outputs. The judge scores every output against a rubric. Sparse pairwise matchups are then run between the new model and a small set of leaderboard neighbors, giving an initial Elo estimate. Finally, broader pairwise comparisons are performed and the Glicko-2 update is applied, with the resulting Elo score normalized so the anchor models keep their reference values.^[1]^[2]

The normalized Elo score (`elo_norm`) is the primary leaderboard metric. The rubric score is shown alongside it and is more directly interpretable, but less discriminative at the top end.^[1]^[3]
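
The anchoring step can be pictured as a rescaling that pins the two anchor models to their reference values. The sketch below assumes a simple two-point linear map; the repository's actual normalization may differ, and the function name `normalize_elo` and the raw ratings are hypothetical values chosen for illustration.

```python
def normalize_elo(raw, raw_upper, raw_lower, upper_ref=1500.0, lower_ref=200.0):
    """Linearly rescale a raw rating so both anchors land on their reference values.

    raw_upper / raw_lower are the raw ratings of the upper and lower anchor
    models (DeepSeek R1 and ministral-3b in the article).
    """
    if raw_upper == raw_lower:
        raise ValueError("anchor ratings must differ")
    scale = (upper_ref - lower_ref) / (raw_upper - raw_lower)
    return lower_ref + (raw - raw_lower) * scale


# Hypothetical raw ratings, not taken from the leaderboard.
raw_ratings = {"deepseek-r1": 1620.0, "ministral-3b": 980.0, "new-model": 1700.0}

normalized = {
    name: normalize_elo(rating, raw_ratings["deepseek-r1"], raw_ratings["ministral-3b"])
    for name, rating in raw_ratings.items()
}
print(normalized)
```

Under this assumption the anchors always come out at exactly 1500 and 200, and every other model is placed relative to them, which is what keeps scores comparable across leaderboard updates.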

key metrics

Metric	Description
Rubric score	Aggregate across rubric criteria
Elo score (normalized)	Relative ranking from pairwise comparisons
Repetition	Frequency of repeated top words, bigrams, and trigrams
Slop score	Match against curated list of overused LLM phrases
Length	Average output length in characters
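
To make the repetition metric concrete, here is a minimal sketch that counts the most frequent repeated words, bigrams, and trigrams in a piece of text. It assumes a naive regex tokenizer with no stop-word filtering; the benchmark itself relies on NLTK, and its exact counting and scoring rules are not reproduced here. The function names are hypothetical.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_profile(text, top_k=3):
    """List the top_k most frequent repeated words, bigrams, and trigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    profile = {}
    for n, label in ((1, "words"), (2, "bigrams"), (3, "trigrams")):
        counts = Counter(ngrams(tokens, n))
        # Keep only n-grams that occur more than once.
        profile[label] = [
            (" ".join(gram), count)
            for gram, count in counts.most_common(top_k)
            if count > 1
        ]
    return profile


sample = "the rain fell and the rain fell and the night went on"
profile = repetition_profile(sample)
print(profile)
```

A real implementation would likely ignore common function words such as "the" so that the metric reflects stylistic tics rather than ordinary grammar.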

test structure

prompt categories

Category	Example challenge
Humor	Writing genuinely funny content rather than puns
Romance	Authentic emotional connection without cliche
Spatial awareness	Accurate spatial reasoning across a scene
Unique perspectives	Non-standard or non-human narrator voices
Character development	Multi-dimensional personalities under pressure
Plot construction	Coherent story progression in short word counts

The full prompt list is published in the GitHub repository.^[2]

generation parameters

The sampling configuration is fixed for leaderboard parity: 3 generations per prompt (96 outputs total), temperature 0.7, min_p 0.1, and output truncation to 4000 characters before scoring. Using the same sampling settings across models matters because creative output is unusually sensitive to temperature and decoding strategy.^[2]
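
The fixed settings above can be written down as a small configuration, shown here as a sketch. The constant and function names are hypothetical and do not come from the repository; only the numeric values are taken from the text.

```python
# Values from the article; the names are illustrative assumptions.
GENERATION_CONFIG = {"iterations": 3, "temperature": 0.7, "min_p": 0.1}
NUM_PROMPTS = 32
MAX_CHARS = 4000


def truncate_for_judging(text, max_chars=MAX_CHARS):
    """Cut an output to the character limit applied before scoring."""
    return text[:max_chars]


total_outputs = NUM_PROMPTS * GENERATION_CONFIG["iterations"]
print(total_outputs)
print(len(truncate_for_judging("x" * 5000)))
```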

evaluation criteria

rubric dimensions

Category	Criteria examples
Coherence	Logical flow, internal consistency, clarity
Creativity	Originality, unexpected elements, imagination
Style	Voice, tone, prose quality
Technical	Grammar, punctuation, structure
Engagement	Hook, pacing, reader interest
Character	Depth, believability, development
Dialogue	Natural speech, distinct voices
Description	Vivid imagery, sensory details

pairwise judging

In the pairwise stage the judge compares two outputs and picks a winner. The comparison prompt directs the judge to weigh character authenticity, originality, writing quality, plot coherence, instruction adherence, worldbuilding, cliche avoidance, verbosity control, and metaphor appropriateness. Each matchup feeds into the Glicko-2 update, with margin of victory factored into the rating change.^[3]
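
The way a matchup moves two ratings can be illustrated with a plain Elo update. This is a deliberately simplified stand-in, not Glicko-2: Glicko-2 also tracks a rating deviation and a volatility for each model, which this sketch omits. Encoding margin of victory as a fractional score between 0 and 1 is an assumption made for illustration, as are the function names and the K-factor.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Apply one matchup result.

    score_a is in [0, 1]: 1.0 is a decisive win for A, 0.5 a draw, and
    intermediate values stand for narrower margins (an assumed encoding).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; A wins by a moderate margin.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=0.75)
print(new_a, new_b)
```

A narrow win moves the ratings less than a decisive one, which is the intuition behind factoring margin into the rating change.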

bias mitigation

controlled biases

Bias type	Mitigation
Length bias	Output truncation to 4000 characters
Position bias	A vs B and B vs A averaged
Verbosity bias	Judge prompted against padding
Forced metaphor	Rubric criteria penalize incoherent imagery
Anonymous comparison	Models unidentified during pairwise judging
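
The position-bias mitigation in the table (judging A vs B and B vs A, then averaging) can be sketched as follows. The judge here is a toy stub with a built-in bonus for whichever text is shown first; the quality values, the size of the bonus, and all names are hypothetical.

```python
# Hypothetical "true" quality of two drafts, used only by the toy judge.
QUALITY = {"draft X": 0.6, "draft Y": 0.4}


def toy_judge(first, second):
    """Return the preference for `first` in [0, 1], inflated by a position bonus."""
    base = 0.5 + (QUALITY[first] - QUALITY[second]) / 2.0
    return min(1.0, base + 0.1)  # +0.1 simulates a bias toward the first slot


def position_averaged_preference(judge, text_x, text_y):
    """Average the preference for X over both presentation orders."""
    pref_x_first = judge(text_x, text_y)          # X shown first
    pref_x_second = 1.0 - judge(text_y, text_x)   # X shown second
    return (pref_x_first + pref_x_second) / 2.0


result = position_averaged_preference(toy_judge, "draft X", "draft Y")
print(round(result, 3))
```

In this toy setup the bonus inflates X's score in one ordering and deflates it in the other, so the average recovers the unbiased preference.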

uncontrolled biases

The documentation acknowledges remaining biases: judge self-bias (preferring prose in a similar style), positivity or negativity preference, NSFW content aversion ("smut bias"), stylistic preferences inherited from the judge's training, and slop bias rewarding familiar tropes. Paech has run cross-validation experiments using GPT-4.1 as an alternative judge; discrepancies are small but real, which is why the leaderboard recommends running with the same judge for parity.^[3]^[5]

version 3 improvements

key enhancements from v2

Improvement	Impact
Judge upgrade to Claude Sonnet 4	Better discrimination at the top
Metaphor detection in rubric	Catches forced or incoherent imagery
Paragraph scoring scaled for one-sentence paragraphs	Style normalization
Elo integration on top of rubric	Sharpens top-tier differences
Glicko-2 ratings with uncertainty	Robust rankings as new models join
Anchored Elo normalization	Scores comparable across leaderboard updates

slop detection

Creative Writing v3 includes a slop detection layer. Outputs are checked against a master list of phrases that appear unnaturally often in LLM-generated text. The list is maintained in the slop-forensics toolkit and was derived by comparing outputs from ten language models against human baselines. The scoring formula weights three components: roughly 60 percent slop words, 25 percent "not X but Y" patterns, and 15 percent slop trigrams. A high slop score does not directly lower a model's Elo, but it is published alongside the leaderboard so readers can see which models lean on cliche even when they are technically competent.^[4]
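
The three-part weighting can be sketched as a weighted sum. The weights below are the approximate figures quoted above; treating each component as a rate already normalized to the range 0 to 1 is an assumption, and the names `SLOP_WEIGHTS` and `slop_score` are hypothetical rather than taken from the slop-forensics toolkit.

```python
# Approximate weights from the article; component names are illustrative.
SLOP_WEIGHTS = {"slop_words": 0.60, "not_x_but_y": 0.25, "slop_trigrams": 0.15}


def slop_score(components, weights=SLOP_WEIGHTS):
    """Combine per-component rates (assumed to be in [0, 1]) into one score."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical component rates for a single model.
example = {"slop_words": 0.40, "not_x_but_y": 0.20, "slop_trigrams": 0.10}
score = slop_score(example)
print(round(score, 3))
```

Because slop words carry the largest weight, a model that avoids the "not X but Y" construction can still score high if it leans on overused vocabulary.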

performance analysis

top models on the leaderboard

From late 2025 into 2026, the top of the leaderboard has been dominated by Grok variants from xAI and large open-weight models from the Qwen 3 family at Alibaba Cloud. Rankings shift as new models are added, so the table below is a snapshot.^[1]^[3]

Rank	Model	Normalized Elo	Notes
1	Grok 4.1 Thinking	1721.9	Strong on humor and unusual perspectives
2	Grok 4.1	1708.6	Non-thinking variant, still in the top tier
anchor	DeepSeek R1	1500	Anchor model used to fix the scale
anchor	ministral-3b	200	Lower anchor for the Elo scale

The anchors are fixed pegs that hold the scale steady; when new models are evaluated, their Elo scores are normalized against these anchors so 1500 always means roughly the same thing. The llm-stats mirror surfaces additional Qwen 3 entries (Qwen3-235B-A22B-Instruct-2507, Qwen3-VL-235B-A22B variants, the Qwen3-Next-80B-A3B family) that score as strong open-weight competitors without quite reaching the Grok 4.1 line.^[2]^[3]

performance insights

Wide spread between top and bottom of the leaderboard, which was the goal of v3.
Distinct writing personalities; models with similar rubric scores still feel different in pairwise comparisons.
Consistent struggles with humor and spatial reasoning, even at the top.
Reasoning-augmented Thinking variants tend to outperform their non-thinking siblings on creative tasks, which is mildly surprising given that creative writing is not an obvious fit for step-by-step reasoning.

The DeepSeek family was the standout in early 2025 runs and remains the reference point; DeepSeek R1 is the chosen anchor at 1500 because of its consistent performance across both rubric and Elo stages.^[1]^[2]

implementation

The benchmark is run from the creative-writing-bench repository. After cloning, install dependencies with `pip install -r requirements.txt` and download the required NLTK data (`punkt`, `cmudict`). Dependencies include requests, python-dotenv, numpy, scipy, tqdm, glicko2, nltk, and joblib. API keys for the test model and the judge are configured in a `.env` file.^[2]

A standard evaluation run uses Claude Sonnet 4 as the judge for parity with the public leaderboard:

python3 creative_writing_bench.py \
    --test-model "provider/model-name" \
    --judge-model "anthropic/claude-sonnet-4" \
    --runs-file "creative_bench_runs.json" \
    --iterations 3 \
    --threads 500

The `--runs-file` argument controls where intermediate results and matchups are stored. To compare against the public leaderboard, use the file shipped in the repository, because it accumulates pairwise results that are reused across model evaluations. A full evaluation typically costs around ten US dollars in API spend per model.^[2]

applications

Application	Use case
Model development	Tracking creative ability across training runs
Architecture comparison	Evaluating design choices across model families
Prompt engineering	Probing how prompt phrasing changes creative output
Bias studies	Surfacing AI writing patterns and slop tendencies
Judge meta-evaluation	Cross-validating judge models, feeding Judgemark

The leaderboard is used to assess model suitability for fiction or copywriting, vet AI writing assistants, test story generation in games and interactive fiction, and screen AI co-writing tools for slop. Writers and hobbyists also use it to pick models for creative AI projects, since the rankings tend to track subjective impressions.^[1]

challenges and failure modes

Common failure modes the rubric and slop detector are designed to catch include formulaic structure, cliche overuse ("shivers down the spine," "breath she didn't know she was holding"), emotional shallowness, forced creativity that substitutes odd word choices for real originality, and inconsistent tone that drifts mid-piece. Even top models struggle with genuine humor, emotional depth in romance scenes, spatial consistency across a scene, original narrator voice, and sustained complex metaphors.^[1]^[4]

limitations and future directions

Limitation	Description
Subjective nature	Creative quality is inherently subjective
Judge dependency	Relies on a single judge in the canonical run
English only	Prompts and judging are in English
Genre constraints	Limited coverage of poetry and screenwriting
Length limits	4000 character truncation may penalize slow builders
Cost	Around ten US dollars per full run

Future directions discussed by Paech include multi-judge systems with several models voting on each comparison, human baselines from paid writers, genre expansion (already partially handled by Longform Creative Writing), multilingual support, and tests of multi-turn collaboration between writer and model.^[1]^[2]

Creative Writing v3 sits alongside several related benchmarks, most of them from the wider EQ-Bench family:

EQ-Bench 3: emotional intelligence in role-play scenarios.
Longform Creative Writing: extended narrative generation.
Spiral-Bench: a related benchmark by Paech.
BuzzBench: humor using British comedy transcripts.
DiploBench: strategic writing in Diplomacy.
Judgemark: meta-evaluation of LLM judges.
WritingBench: comprehensive writing evaluation.
MAGI-Hard: discriminative subset of MMLU and AGIEval.

significance

Creative Writing v3 has become one of the standard public references for ranking creative writing ability in LLMs, alongside Chatbot Arena. Its hybrid scoring is more discriminative than pure rubric or pure preference voting, and its slop and repetition metrics give a useful diagnostic layer beyond a single number. For a benchmark one person built and maintains, it has had outsized influence on how AI labs talk about writing ability.^[1]^[3]

references

1. Paech, Samuel J. "EQ-Bench Creative Writing v3 Leaderboard." EQ-Bench, 2025. https://eqbench.com/creative_writing.html
2. Paech, Samuel J. "Creative Writing Bench v3 (GitHub Repository)." EQ-Bench organization on GitHub, 2025. https://github.com/EQ-bench/creative-writing-bench
3. "Creative Writing v3 Leaderboard." llm-stats.com benchmarks index, 2025-2026. https://llm-stats.com/benchmarks/creative-writing-v3
4. Paech, Samuel J. "Slop Score." EQ-Bench, 2025. https://eqbench.com/slop-score.html
5. Paech, Samuel J. "About EQ-Bench." EQ-Bench, 2025. https://eqbench.com/about.html
