Longform Creative Writing

Longform Creative Writing is a capability area and benchmark family for large language models that measures how well a system can write coherent, engaging fiction or narrative non-fiction over thousands of words rather than in a single short response. The most cited public artifact under this name is the Longform Creative Writing benchmark on EQ-Bench, authored by Samuel J. Paech, in which a model brainstorms a story, plans an eight-chapter outline, and then writes roughly 8,000 words of continuous fiction that an LLM judge scores chapter by chapter. The broader research field around longform creative writing also includes BooookScore, NovelQA, LongStory, ConStory-Bench, LongGenBench, HelloBench, LongEval, WritingBench and Lech Mazur's LLM Creative Story-Writing Benchmark, each probing a different facet of sustained narrative quality.

Longform Creative Writing
Overview
Full name	Longform Creative Writing Benchmark
Abbreviation	LCW
Description	An LLM-judged benchmark evaluating extended narrative generation across eight chapters of roughly 1,000 words each
Release date	2024
Latest version	v1.11 (interface), generation rubric v3
Benchmark updated	2026-02-19
Authors	Samuel J. Paech
Organization	EQ-Bench
Technical Details
Type	Creative writing, extended narrative
Modality	Text
Task format	Multi-turn story generation with planning and reflection
Number of tasks	1 story per run, 8 chapters
Total examples	8 chapters per evaluation
Evaluation metric	0 to 100 rubric score, degradation, slop, repetition
Domains	Fiction writing, narrative consistency, character development
Languages	English
Performance
Human performance	Not formally reported
Baseline	Variable by model
SOTA score	Approximately 85 to 90 on the rubric
Judge model	Claude Sonnet 4.6 (default since 2026)
Sampling defaults	temperature 0.7, min_p 0.1
Saturated	No
Resources
Website	eqbench.com
Code (long form)	longform-writing-bench
Code (short form)	creative-writing-bench
License	Open source
Predecessor	Longform Creative Writing v2

Definition and scope

For large language model evaluation, the phrase longform creative writing has come to mean any task where the model must produce an original narrative whose length materially stresses planning, memory and prose stamina. There is no canonical word threshold in the literature, but several practical conventions have emerged. The EQ-Bench Longform Creative Writing run targets approximately 8,000 words across eight chapters. HelloBench treats 4,000 words as a soft ceiling that most contemporary models fail to reach with intact quality. LongGenBench evaluates outputs of 16,000 and 32,000 tokens, and the ConStory-Bench paper from 2026 targets stories of 8,000 to 10,000 words. In all of these settings the central question is the same: given everything it has already committed to, does the model still tell a good story near the end of the output?

Longform creative writing is distinct from three adjacent capabilities. It differs from short creative writing, where outputs are short enough that planning and consistency rarely fail. It differs from long context comprehension, where the model must understand long input rather than produce long output, which is the regime explored by long context benchmarks such as NovelQA, Loong, LongBench and RULER. And it differs from structured longform generation of articles or reports, which trades narrative arc for factual coverage and is the focus of WritingBench, LongGenBench and LongEval. A model can be excellent at one of these and weak at the others.

Why longform creative writing matters

The shift from short prompts to longform stories changes the failure surface of an LLM in ways that short benchmarks systematically miss. A model that writes one sparkling paragraph can still drown on chapter five because every additional token adds a new opportunity for inconsistency, for stylistic drift and for the model to repeat phrases it has already used. Three classes of stakeholder have driven the rise of this benchmark family.

Researchers care because longform creative writing exposes weaknesses in long context handling, planning and reward modeling that are invisible at the prompt-response level. Practitioners care because real fiction tools such as Sudowrite, NovelCrafter and the writing rooms inside Claude and ChatGPT deliver value mostly when they can sustain quality over chapters. And model developers care because the rubric scores correlate informally with subscriber sentiment among writers, a vocal segment of consumer LLM users.

Reason	What it captures	Why short benchmarks miss it
Quality degradation	Whether prose decays late in the output	Single-prompt tests never reach the decay zone
Character consistency	Voice, motivation and detail stability	One paragraph cannot contradict itself
Plot tension	Pacing, foreshadowing, payoff	Short prompts are answered, not paced
Memory fidelity	Recall of small details across chapters	No prior chapters to remember
Style stamina	Avoiding slop and verbal tics over thousands of words	Slop is rare in short answers

Core challenges in longform creative writing

Coherence across thousands of tokens

The most consistent failure mode is coherence collapse. A character introduced as left-handed in chapter one fires a pistol right-handed in chapter four. A village that the model called Greythorne becomes Greythorpe. The Microsoft Research paper Lost in Stories: Consistency Bugs in Long Story Generation by LLMs, accepted to ACL 2026, catalogues these errors across five categories and 19 subtypes, finding that they cluster near the middle of narratives and inside passages with higher token-level entropy. The accompanying ConStory-Bench benchmark uses a three-stage pipeline of error extraction, contradiction pairing and evidence chain construction to score 2,000 generated stories with a metric called Consistency Error Density.

Character consistency

Character drift is among the failure modes most visible to human readers. Models routinely change a character's age, occupation or speech register between chapters, particularly after long stretches in which the character was off-page. The EQ-Bench rubric explicitly tracks character consistency as a major component of the overall score and penalises any chapter in which a named character behaves out of line with their established profile.

Plot tension and pacing

Language models trained on internet text often default to a flat narrative voltage in which every paragraph carries roughly equal weight. Real fiction relies on swings of intensity, foreshadowing and deliberate gaps. The EQ-Bench Longform rubric uses dimensions such as compelling plot, emotional engagement and tonal consistency to penalise output that reads like a uniform briefing document rather than a story.

Prose quality and slop

Slop is the EQ-Bench shorthand for distinctive phrasing tics common in LLM output. Sam Paech's slop score, a weighted composite computed by his slop-score tool, assigns 60 percent of its weight to overused individual words, 25 percent to the "not just X but Y" construction and 15 percent to overused trigrams. The same author's auto-antislop repository and the Antislop framework published on arXiv in October 2025 use Final Token Preference Optimization to reduce slop frequency by about 90 percent without harming GSM8K, MMLU or creative writing scores. Paech has also released antislop fine-tunes of Gemma 3 12B and 27B that score well on creative writing leaderboards despite their modest size.
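The 60/25/15 weighting can be illustrated with a minimal Python sketch. Only the three weights come from the description above. The function names, the regular expression and the assumption that each component is first normalised to a value between 0 and 1 are hypothetical and are not taken from the slop-score tool itself.

```python
import re

# Component weights as described in the text above.
WEIGHTS = {"words": 0.60, "not_just_but": 0.25, "trigrams": 0.15}

# Hypothetical pattern for the "not just X but Y" construction;
# the real tool may detect it differently.
NOT_JUST_BUT = re.compile(
    r"\bnot (?:just|only|merely)\b[^.!?]{1,80}?\bbut\b", re.IGNORECASE
)


def count_not_just_but(text: str) -> int:
    """Count sentence-internal 'not just X but Y' style constructions."""
    return len(NOT_JUST_BUT.findall(text))


def slop_composite(word_score: float, pattern_score: float, trigram_score: float) -> float:
    """Combine three component scores, each assumed to lie in [0, 1]."""
    parts = {
        "words": word_score,
        "not_just_but": pattern_score,
        "trigrams": trigram_score,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

Under these assumptions a text that scores 0.5 on overused words, 0.2 on the construction and 0.0 on trigrams receives a composite of 0.35, which shows how strongly the word component dominates.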

Memory and long context fidelity

Longform creative writing places sustained pressure on the model's ability to retrieve and reuse small facts from earlier in its own output. Papers on perplexity in long context, including the LongPPL work in 2024 and 2025, have shown that ordinary perplexity is a poor predictor of long context behaviour because most tokens are easy and only a small fraction of key tokens matter. The same intuition applies to fiction. A model that nails 99 percent of tokens in chapter eight but forgets which character was carrying the locket has failed the reader regardless of its mean perplexity.

Benchmarks for longform creative writing

A family of related benchmarks has emerged since 2023, each emphasising a different slice of the problem. The summary below distinguishes generation benchmarks from comprehension benchmarks that are sometimes lumped together with longform creative writing.

Benchmark	Year	Output length	Focus	Judge / metric
EQ-Bench Longform Creative Writing	2024, v3 in 2025	About 8,000 words, 8 chapters	Sustained narrative quality and degradation	Claude Sonnet 4.6 rubric
EQ-Bench Creative Writing v3	2024, v3 in 2025	Short stories, 32 prompts and 3 iterations	Discriminating top tier creative ability	Sonnet 4 rubric plus Elo
BooookScore	ICLR 2024	Book-length summaries of greater than 100K-token novels	Coherence of long summary chains	LLM-detected error rate
NovelQA	2024	Multiple-choice and generative QA	Comprehension of more than 200K-token novels	GPT-4 judged
LongStory	PAKDD 2024	Up to several thousand words	Length-controlled generation with completeness	Human ratings on coherence and completeness
ConStory-Bench (Lost in Stories)	ACL 2026	8,000 to 10,000 words	Consistency bugs across five error categories	Consistency Error Density and Group Relative Rank
LongGenBench	EMNLP 2024 / ICLR 2025	16K to 32K tokens	Following instructions while generating long text	Task-specific automatic metrics
HelloBench	2024	Over 4,000 words target	Open-ended long text across five subtasks	Calibrated LLM judge
LongEval	2025	Above 2,000 words	Plan-based generation for articles and Wikipedia	Content, structure and information density
WritingBench	NeurIPS 2025	Tens to thousands of words	Six domains, 100 subdomains, query-aware criteria	Fine-tuned critic, 84 percent human alignment
LLM Creative Story-Writing Benchmark (Lech Mazur)	2025, refreshed April 2026	Short creative pieces with 10 mandatory elements	Element integration and prose quality	Pairwise comparison by panel

EQ-Bench Longform Creative Writing

The canonical artifact for the phrase is the EQ-Bench Longform Creative Writing benchmark at eqbench.com. It frames the task as a 13-step pipeline. The first five steps cover brainstorming a story concept, sketching character profiles and producing an eight-chapter outline that the model is allowed to critique and revise. The next eight steps produce the chapters themselves, each targeting about 1,000 words and each conditioned on the full previous narrative. After generation, the judge model scores each chapter on a 14-dimension rubric and produces a holistic rating of the full piece. Scores are then bootstrap resampled 500 times to give 95 percent confidence intervals.

The default judge in 2026 is Claude Sonnet 4.6. The default sampler settings are temperature 0.7 and min_p 0.1. The official command line uses the open source longform-writing-bench repository on GitHub. A full run costs roughly ten US dollars per evaluated model using current Sonnet pricing for judging, plus the cost of the model under test.

The rubric weights forced poetry or incoherent metaphor at five times the normal weight at the 1.7 scale and includes a structural penalty for single-sentence paragraphs, both of which were added in the v3 update in 2025 to reduce two specific failure patterns common in models that try to imitate literary fiction.

EQ-Bench Creative Writing v3

The short-form sister benchmark, Creative Writing v3, runs 32 prompts with three iterations each, for 96 generated items per model. It combines isolated rubric scores with a Glicko-2 based Elo from pairwise comparisons between neighbouring models on the leaderboard. Bias controls include truncating outputs to 4,000 characters to mitigate length bias, running every comparison in both A/B and B/A order to mitigate position bias, and using specific rubric items to penalise verbosity and poetic incoherence. The leaderboard reports both rubric score and Elo, and was retuned in v3 because v2 had saturated at the top.

As of April 2026, the public leaderboard shows Claude Opus 4.7 leading with an Elo rating of approximately 2216, followed by GPT-5.5 at about 2024 Elo and Claude Sonnet 4.6 at about 1991 Elo. On the rubric score scale, Claude Opus 4 is reported at about 73.8, GPT-5 at about 71.4 and Gemini 2.5 Pro at about 70.9. These numbers move week to week as new models are added and the judge is refreshed.

BooookScore

BooookScore, presented at ICLR 2024 by Chang et al., is the canonical benchmark for book-length summarisation. The package generates summaries of books that exceed 100,000 tokens by recursively chunking, merging and compressing, then computes BooookScore as the proportion of sentences that contain none of a set of identified error types. The original paper found that closed-source LLMs such as GPT-4 and Claude 2 produced summaries with higher BooookScore than open-source models, and the v2 release added batched sentence-level annotation to reduce judging cost.
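The metric as described reduces to a simple proportion, sketched below in Python. The function name and the error labels in the usage note are illustrative placeholders, and the sketch assumes that some upstream annotator (an LLM in the original work) has already attached a set of error labels to every summary sentence.

```python
def booookscore(sentence_errors: list[set[str]]) -> float:
    """Fraction of summary sentences that carry no error label.

    sentence_errors holds one set of error labels per sentence;
    an empty set means the annotator found no error in that sentence.
    """
    if not sentence_errors:
        raise ValueError("need at least one annotated sentence")
    clean = sum(1 for errors in sentence_errors if not errors)
    return clean / len(sentence_errors)
```

For example, booookscore([set(), {"omission"}, set(), {"inconsistency", "duplication"}]) returns 0.5, because two of the four sentences are free of errors.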

While BooookScore measures comprehension and compression rather than original storytelling, it is a load-bearing component of the longform writing literature because it shows that even on highly structured tasks, coherence over book-length input is hard to obtain.

NovelQA

NovelQA, introduced on arXiv in March 2024, is a question-answering benchmark drawn from English novels with average context above 200,000 tokens. Questions are split into detail-oriented at 22.2 percent, single-hop at 42.8 percent and multi-hop at 35 percent. The annotators all hold or are pursuing degrees in English Literature. The headline finding is that model performance falls sharply when supporting evidence appears beyond the 100,000-token mark, which is a stronger and more interpretable version of the older lost-in-the-middle phenomenon. Generative answers are judged by GPT-4 with a Cohen's kappa of about 89 percent against human ratings.

LongStory and consistency methods

LongStory, published in PAKDD 2024 by Kyeongman Park and colleagues, is both a method and an associated evaluation for length-controlled long story generation. It introduces a long and short-term context weight calibrator and discourse tokens to mark structural positions. In the authors' reported evaluation it beats PlotMachines and standard LLM baselines on coherence, completeness, relevance and repetitiveness.

The more recent ConStory-Bench from the Microsoft Research paper Lost in Stories: Consistency Bugs in Long Story Generation by LLMs takes a different angle. Rather than designing a new generation method, it standardises evaluation of consistency, providing 2,000 prompts, a five-category error taxonomy with 19 subtypes and the Consistency Error Density and Group Relative Rank metrics. The paper found that errors are concentrated in the middle of narratives and in segments with higher token-level entropy.

LongGenBench, HelloBench and LongEval

These three benchmarks treat longform creative writing as one corner of a broader long-form generation problem.

LongGenBench evaluates ten state-of-the-art LLMs on prompts that demand 16K and 32K tokens of output across four scenarios and three instruction types, and finds that models which do well on RULER-style needle-in-a-haystack tests still struggle to actually generate that much coherent text. HelloBench grounds its design in Bloom's Taxonomy and runs five subtasks including text completion and heuristic generation, observing that most current LLMs cannot reliably produce text longer than 4,000 words at quality. LongEval, published in 2025, focuses on plan-based generation in arXiv-paper, blog and Wikipedia domains with a target above 2,000 words and reports separate scores for content quality, structural coherence and information density.

WritingBench

WritingBench, published at NeurIPS 2025 by the Qwen Team, takes a much broader view of writing. It uses 1,000 queries across six domains, including Literature and Arts as well as Academic and Engineering, Finance and Business, Politics and Law, Education and Advertising and Marketing, with 100 subdomains in total. Its evaluation framework lets the LLM generate instance-specific criteria for each query and uses a fine-tuned critic for scoring, reporting 84 percent agreement with human raters compared with 67 percent and 58 percent for static rubrics. The current public leaderboard is led by Qwen3-235B-A22B-Thinking-2507 with a score of about 0.883.

LLM Creative Story-Writing Benchmark by Lech Mazur

Lech Mazur's benchmark complements the EQ-Bench family by focusing on integration. Each model writes a short fiction piece that must meaningfully incorporate ten mandatory elements: character, object, concept, attribute, action, method, setting, timeframe, motivation and tone. An 18-question rubric covers narrative craft and element integration, and a panel of grader LLMs runs pairwise comparisons in both orders to remove position bias. After the April 29, 2026 refresh, top scores went to GPT-5.5 in extra-high reasoning mode at about 3.0 global comparison score, GPT-5.4 at about 2.8, Claude Opus 4.7 at about 2.4 and Claude Sonnet 4.6 Thinking at about 2.1. Some entries note that Claude Opus 4.7 declined certain prompts and completed 347 of 400 stories, with scores reflecting only completed narratives.

Evaluation methodologies

Longform creative writing has driven significant methodological innovation in LLM evaluation, in part because the things human readers care about are notoriously hard to measure.

LLM-as-judge

The dominant method for longform creative writing in 2025 and 2026 is LLM-as-judge with a structured rubric. The EQ-Bench family, WritingBench, HelloBench, LongEval and ConStory-Bench all use some variant of this approach. The advantages are obvious: a single judge model can score thousands of stories cheaply and consistently. The disadvantages are equally obvious: the judge has its own taste, can be flattered by prose that matches its own style and may be biased by length, position or verbosity.

The EQ-Bench project addresses judge quality directly through the Judgemark benchmark, which scores judges on separability, score stability and correlation with human preferences. The fourth version, Judgemark v4, drives the choice of Claude Sonnet 4.6 as the default judge for both EQ-Bench 3 and the Creative Writing leaderboards.

Pairwise comparison and Elo

Creative Writing v3 was redesigned around pairwise comparisons because the rubric had saturated. The Glicko-2 rating system, applied to head-to-head matchups with neighbouring models in the leaderboard, gives much finer discrimination at the top. Lech Mazur's benchmark uses pairwise comparison as its primary signal rather than a fallback. The downside is that pairwise comparison scales quadratically in the number of models, which is why the EQ-Bench implementation only compares neighbours.
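The cost argument can be made concrete with a short Python sketch that schedules matchups in both presentation orders, either for leaderboard neighbours only or for every pair. The function names and the window parameter are hypothetical; the text above only states that neighbours are compared and that both A/B and B/A orders are run.

```python
from itertools import combinations


def neighbour_matchups(ranked_models: list[str], window: int = 1) -> list[tuple[str, str]]:
    """Pair each model with the next `window` models in the ranking.

    Every pair is scheduled twice, as (A, B) and (B, A), so that
    position bias can be averaged out by the judge.
    """
    matchups = []
    for i, model_a in enumerate(ranked_models):
        for model_b in ranked_models[i + 1 : i + 1 + window]:
            matchups.append((model_a, model_b))
            matchups.append((model_b, model_a))
    return matchups


def all_matchups(models: list[str]) -> list[tuple[str, str]]:
    """Full round robin in both orders, for comparison."""
    return [
        ordered
        for a, b in combinations(models, 2)
        for ordered in ((a, b), (b, a))
    ]
```

With n models and a window of one, the neighbour schedule contains 2(n - 1) judged comparisons, while the full round robin contains n(n - 1). For five models that is 8 against 20, and the gap widens quickly as the leaderboard grows.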

Human preference

Chatbot Arena and various closed user studies still anchor much of the practitioner intuition about which models write best, but they do not isolate longform writing as a sub-task at the scale of the dedicated benchmarks. Sudowrite's blind tests, reported around the launch of Muse 1.5 in mid 2025, found that its fine-tuned model was preferred about twice as often as Claude 3.7 Sonnet on fiction prose. NovelQA, for its part, calibrated its GPT-4 judge against human raters and reported a Cohen's kappa of about 89 percent on generative answers.

Degradation curves

The distinctive contribution of EQ-Bench Longform is the visual degradation sparkline, which plots per-chapter rubric scores across the eight chapters. The benchmark reports a numerical degradation score equal to the drop from the model's best chapter to its weakest chapter. Top-tier models in 2026 keep this number under five points, while weaker models often lose more than fifteen points by the final chapter. The benchmark categorises degradation into archetypes that include the quality cliff after chapter three or four, gradual decay throughout, oscillation between adjacent chapters, final chapter collapse and middle sag.
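As a minimal Python sketch, the degradation score described above is the spread between the best and the weakest chapter. The function name and the sample scores are illustrative and are not taken from the benchmark code.

```python
def degradation_score(chapter_scores: list[float]) -> float:
    """Drop from the best-scoring chapter to the weakest-scoring chapter."""
    if not chapter_scores:
        raise ValueError("need at least one chapter score")
    return max(chapter_scores) - min(chapter_scores)


# Illustrative per-chapter rubric scores for one eight-chapter run.
example_scores = [82, 84, 80, 78, 79, 77, 76, 75]
example_drop = degradation_score(example_scores)  # 84 - 75 = 9
```

Because the definition uses the best and weakest chapters wherever they fall, it captures middle sag and oscillation as well as a late collapse, which a plain first-minus-last difference would miss.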

Automatic structural metrics

Across the literature, automatic metrics complement the LLM judge. The most common are repetition based on n-gram overlap across chapters, slop score from the EQ-Bench slop dictionary, length statistics and structural penalties for excessive single-sentence paragraphs. These metrics are cheap, deterministic and resistant to judge bias, which is why they are reported separately on the leaderboard rather than folded into the rubric score.
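One plausible form of the cross-chapter repetition metric is sketched below in Python. It is an assumption for illustration, not the leaderboard's actual formula: each chapter is scored by the share of its distinct word trigrams that already appeared in an earlier chapter, with whitespace tokenisation and lower-casing as simplifications.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Distinct word n-grams of a text, lower-cased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def cross_chapter_repetition(chapters: list[str], n: int = 3) -> list[float]:
    """For each chapter, the fraction of its n-grams seen in earlier chapters."""
    seen: set[tuple[str, ...]] = set()
    rates = []
    for chapter in chapters:
        grams = ngrams(chapter, n)
        # A chapter too short to contain any n-gram counts as zero repetition.
        rates.append(len(grams & seen) / len(grams) if grams else 0.0)
        seen |= grams
    return rates
```

The first chapter always scores 0.0 because nothing precedes it. A rising curve across later chapters suggests the model is recycling its own phrasing, and the computation is deterministic and needs no judge.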

Why perplexity is a poor fit

For most longform creative writing tasks, perplexity is a misleading signal. Recent work on LongPPL has shown that perplexity computed over all tokens does not track benchmark performance because most tokens in fiction are easy to predict, while the difficult tokens that distinguish good and bad narratives are rare. LongPPL, which restricts perplexity to selected key tokens, correlates strongly with downstream long context benchmarks at about minus 0.96, but the technique is not yet standard in creative writing leaderboards.

Leading models on longform creative writing

The leaderboards in 2026 show a stable top tier dominated by Anthropic's Claude line, with OpenAI's GPT-5 family and Google's Gemini 2.5 and 3.1 Pro models trading the second and third slots depending on the metric. The picture below pools EQ-Bench Creative Writing v3, the Longform Creative Writing leaderboard, Lech Mazur's benchmark and several practitioner surveys from late 2025 and early 2026. Numbers shift with each refresh and should be read as direction rather than gospel.

Model	Provider	EQ-Bench CW v3 rubric (approximate)	Longform strengths	Notable weaknesses
Claude Opus 4.7	Anthropic	top tier	Voice consistency, emotional nuance, sustained chapter quality	Lyrical drift in technical sections
Claude Opus 4	Anthropic	about 73.8	Best documented v3 score for a base model	Conservative tone in some genres
GPT-5.5 (extra-high reasoning)	OpenAI	leads Lech Mazur element integration	Tight plotting and pacing	Can read engineered rather than emotional
GPT-5	OpenAI	about 71.4	Strong all-rounder	Some writers reported regression from GPT-4.5 era
Gemini 3.1 Pro	Google	high	Coherent long drafts, low drift	Less lyrical than Claude on pure fiction
Gemini 2.5 Pro	Google	about 70.9	Strong structured non-fiction	Tonal flatness on emotional scenes
Claude Sonnet 4.6	Anthropic	about 68 to 70	Natural prose, cost-effective default	Below Opus on the most ambitious narratives
DeepSeek V3.2 / V4 Pro	DeepSeek	about 66	Strong open weights baseline for fiction	More slop than the closed leaders
Grok 4.2	xAI	about 65	Distinct voice, opinionated	Lower coherence on long arcs
Kimi K2.5 / K2.6	Moonshot AI	about 63	Competitive on Chinese fiction	Less tested on English longform
Qwen 3.5 / 3.6 Max	Alibaba	about 62	Leads WritingBench in the Thinking variant	English literary tone is uneven
Llama 4 Maverick	Meta	about 59	Best open weights generalist	Significant degradation late in chapters
o3 and o1 reasoning	OpenAI	about 55 to 58	Solid planning capability	Over-schematised prose, lower than chat siblings
Mistral Large 3	Mistral	about 54	Reliable European option	Less expressive on dialogue
Phi-4	Microsoft	about 49	Strong per-parameter writer	Limited stamina past short stories
Mistral Nemo Gutenberg, Llama 3.1 Storm	Community fine-tunes	low to mid	Very low slop, distinct voice	Coherence falls apart on chapter-scale fiction
Gemma 3 27B antislop	Sam Paech fine-tune	competitive vs base	Antislop training reduces tics by about 90 percent	Smaller context budget
Muse 1.5	Sudowrite	not on EQ-Bench	Fine-tuned on published novels, preferred about 2x over Claude 3.7 Sonnet in blind fiction tests	Closed model, narrow domain

Several patterns hold across this table. Reasoning models such as the o-series often rank lower than their general capability tier on creative writing because long structured chains of thought produce over-schematised prose. Community fine-tunes of open weights such as Mistral Nemo Gutenberg and Llama 3.1 Storm achieve the lowest slop scores but lose coherence over chapter-scale fiction. Frontier closed models hold the long-form story together far more reliably than any open weight model in early 2026.

Techniques for longform creative writing

The techniques used in production fiction stacks rarely involve calling a base model once. Almost all of the strongest published systems use some form of planning, decomposition and revision pipeline.

Technique	First described / popularised	What it does	Why it helps longform
Outline first prompting	Common practice since the GPT-3 era	Generate an outline before any prose	Externalises plan so each chapter can be conditioned on it
Recursive reprompting and revision (Re3)	Yang et al., EMNLP 2022	Plan, generate, rerank for plot coherence, edit for factual consistency	Human raters preferred Re3 plots 14 percent more often, premise relevance 20 percent more often
DOC, hierarchical outlining	Yang et al., 2023	Outline a story top-down and expand each node	Pushes structure down before generating prose
Chapter-conditioned generation	EQ-Bench Longform 2024	Feed full prior chapters as context for each new chapter	Preserves character and setting details across an entire book
Multi-agent storytelling (Agents' Room)	OpenReview 2024	Decompose into specialised agents for plot, character, dialogue and prose	Outperforms single-model baselines on long narratives
StoryWriter multi-agent framework	ACM CIKM 2025	Modular open source pipeline using planning, writing and revision agents	Used to generate a 6,000-story dataset averaging 8,000 words
Multi-agent character simulation	ACL workshop 2025	Director agent orchestrates character agents who role-play scenes	Produces richer dialogue and emergent character voice
Antislop sampling and FTPO	Sam Paech, arXiv October 2025	Backtracking sampler and token-level preference optimisation	Reduces slop frequency by about 90 percent without harming general capability
Retrieval over the story so far	SCORE and related systems	RAG over earlier chapters with explicit state tracking	Pushes character item state consistency to about 98 percent
Long context base models	Claude, Gemini, GPT-5 generations from 2024 onwards	One-million-token context windows	Removes the need for chunked summarisation in many fiction workloads

How tools and agents combine these in practice

Fiction-focused tools combine several of these techniques. Sudowrite layers brainstorming, scene-level generation, prose refinement and a proprietary fine-tuned Muse model on top of frontier APIs. NovelCrafter pairs a structured codex for characters, locations, factions and magic systems with prompt orchestration that feeds the codex into each generation. Both tools effectively implement chapter-conditioned generation with retrieval-augmented memory. Open source frameworks such as StoryWriter and Agents' Room formalise the agent layer, and provide replicable baselines for academic comparison.

Notable AI-written and AI-assisted longform works

Although no LLM has yet produced a critically acclaimed full-length novel without heavy human collaboration, several published works illustrate the state of the art in different eras.

1 the Road, published by Jean Boîte Éditions in 2018, is an experimental novel generated by an artificial neural network during a March 2017 road trip by Ross Goodwin from New York to New Orleans, with the model conditioned on sensor inputs and on a corpus of nearly 200 hand-picked books. The text was published unedited as a historical artifact.

Death of an Author by Stephen Marche, writing under the pen name Aidan Marchine, was published in 2023 as one of the first long-form novellas to use extensive AI-generated text, drawing on ChatGPT and Cohere models. It was reviewed in The New York Times and Slate as a serious if uneven experiment in human and machine collaboration.

In 2025 and 2026 the most visible AI-assisted fiction has come from professional novelists using tools such as Sudowrite and NovelCrafter to draft scenes and chapters. The blind tests Sudowrite reported around the Muse 1.5 launch in June 2025 showed that, on fiction prose alone, a domain-tuned model can be preferred over a general frontier model. None of this work is fully autonomous, and even the AI-judged EQ-Bench Longform pipeline depends on a structured planning rubric that human researchers have spent months refining.

EQ-Bench Longform in depth

Because it is the most cited benchmark using the literal phrase longform creative writing, the EQ-Bench Longform pipeline deserves a section of its own. The numbers and parameters below come from the official site and from the longform-writing-bench README.

Generation pipeline

The pipeline consists of 13 generation steps. The first five build the foundation. Step one is a brainstorming step in which the model proposes story concepts. Step two is a critique step in which the model evaluates and refines them. Step three locks the concept and produces character profiles. Step four produces a chapter-by-chapter outline. Step five is a reflection pass that allows the model to revise both characters and outline before committing.

Steps six through 13 produce the eight chapters in order. Each chapter generation receives the concept, the character profiles, the chapter outline and the full text of all previous chapters as context. The default target length is 1,000 words per chapter, and chapters that fall significantly outside that target are penalised through the length and structural metrics.
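The chapter loop in steps six through 13 can be sketched in Python as follows. This is a simplified illustration and not the code in longform-writing-bench: the function name, the prompt wording and the generate callable, which stands in for whatever API client calls the model under test, are all assumptions.

```python
from typing import Callable


def write_chapters(
    generate: Callable[[str], str],
    concept: str,
    characters: str,
    outline: str,
    n_chapters: int = 8,
    target_words: int = 1000,
) -> list[str]:
    """Generate chapters in order, conditioning each on all earlier ones."""
    chapters: list[str] = []
    for number in range(1, n_chapters + 1):
        story_so_far = "\n\n".join(chapters) or "(none)"
        prompt = "\n\n".join(
            [
                f"Concept:\n{concept}",
                f"Characters:\n{characters}",
                f"Outline:\n{outline}",
                f"Story so far:\n{story_so_far}",
                f"Write chapter {number} in about {target_words} words.",
            ]
        )
        chapters.append(generate(prompt))
    return chapters
```

Because every prompt carries the full text of the earlier chapters, the context grows by roughly one chapter per step. This is what lets the model check small details against its own prior output, and also why the final chapters are the most expensive to generate.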

Judging pipeline

After generation, the judge scores each chapter individually against the 14-dimension rubric, and then provides a holistic rating of the full book. Chapter scores are weighted equally and a separate weight is applied to the holistic rating. Final scores are reported with 95 percent confidence intervals from 500 bootstrap resamples. The judge is Claude Sonnet 4.6 by default in 2026.
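A percentile bootstrap of the kind described can be sketched in Python with the standard library. The sketch assumes that the resampled units are individual rubric scores and that the statistic is their mean. The benchmark may resample at a different level, and the function name and seed handling are illustrative.

```python
import random
import statistics


def bootstrap_ci(
    scores: list[float],
    n_resamples: int = 500,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]
```

Calling bootstrap_ci on the eight chapter scores of a run returns the lower and upper bounds of the interval. With only eight values the interval is wide, which is one reason additional iterations tighten the reported uncertainty.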

The 14-dimension rubric

The rubric covers compelling plot, narrative coherence, character consistency, chapter plan adherence, emotional engagement, nuanced characterisation, tonal consistency, prose quality, dialogue naturalness, originality, scene craft, structural integrity, pacing and avoidance of common AI failure modes. The v3 update in 2025 added weighted penalties for forced poetry or incoherent metaphor at five times the normal weight at the 1.7 scale, and a structural penalty for excessive single-sentence paragraphs. Both changes target failure modes that judges had been under-penalising.

Reported failure modes

The EQ-Bench team documents a set of failure modes that appear across model families.

Failure mode	Description	Approximate frequency in tested models
Weak dialogue	Unnatural or stilted conversations	High, around 60 percent
Tell don't show	Excessive exposition over demonstration	High, around 70 percent
Purple prose	Overly ornate language	Medium, around 40 percent
Predictability	Formulaic plot development	High, around 65 percent
Metaphor abuse	Forced or incoherent metaphors	Medium, around 45 percent
Character drift	Inconsistent characterisation	Medium, around 50 percent

Degradation archetypes

The per-chapter score sparkline reveals recurring degradation patterns. The most common is a quality cliff after chapter three or four, often visible as a five to ten point drop in successive ratings. Other archetypes are gradual decay across all eight chapters, oscillation between adjacent chapters, final chapter collapse where the model rushes the ending, and middle sag in chapters four to six where the model loses confidence in its plan.

Running the benchmark

The canonical command from the repository README is:

python3 longform_writing_bench.py \
    --test-model "google/gemini-2.0-flash-001" \
    --judge-model "anthropic/claude-sonnet-4" \
    --runs-file "results/longform_bench_runs.json" \
    --run-id "demo" --threads 12 --iterations 1

Key flags include --skip-generation to re-judge existing outputs, --redo-judging to apply an updated rubric and --iterations to control how many independent runs feed the confidence intervals. All file writes are atomic and locked to support parallel execution and crash recovery. A full single-iteration run on a frontier model takes roughly fifteen to thirty minutes wall clock plus API time, and a typical evaluation costs about ten US dollars at 2026 Sonnet judging rates.

Open problems and future directions

Longform creative writing as a measurement problem is far from solved. Several open problems are actively debated in 2026.

First, there is no consensus human baseline. Professional novelists writing eight 1,000-word chapters under the same constraints would establish a reference point, but the cost and time involved have so far prevented anyone from gathering such data at scale. Without a human anchor, top model scores on the 0 to 100 rubric have no external reference and are hard to interpret.

Second, the dominant evaluation method, LLM-as-judge, relies on a model that has tastes of its own. The Judgemark benchmark partially addresses this by quantifying judge separability and stability, and the choice of Claude Sonnet 4.6 as default reflects its leading Judgemark score. But the fundamental tension in using one model to grade another remains, and it is most acute when the judge and the candidate share a family.

Third, almost every benchmark is English only. ConStory-Bench, WritingBench and the EQ-Bench family include some multilingual coverage but the leaderboards are dominated by English stories. Models that are strong on Chinese, Japanese or Spanish fiction are systematically under-represented.

Fourth, leaderboards and judging dimensions are oriented toward a particular literary aesthetic. The current rubrics reward emotional engagement, character nuance and tonal consistency, all of which favour conventional realist fiction. Experimental, satirical or genre-specific writing styles can score lower simply because the judge is not tuned to them.

Fifth, there is the question of saturation. EQ-Bench Creative Writing v2 saturated within about two years and required a redesign. The Longform variant adds enough degrees of freedom through chapter weighting, slop scoring and degradation tracking to delay this, but the same forces apply, and Sam Paech has already discussed an eventual v4 redesign publicly.

Finally, the relationship between longform creative writing and agentic workflows is increasingly important. With multi-agent storytelling frameworks such as Agents' Room and StoryWriter, the practical question is shifting from which model writes best to which combination of model, prompt and orchestration writes best. Future benchmarks may need to evaluate fiction pipelines rather than fiction models.

Related concepts

EQ-Bench for the parent suite
BooookScore for book-length summarisation
Long context for the comprehension counterpart
Large language model for the underlying technology
LLM evaluation for the broader measurement landscape
AI agent for orchestration frameworks used in story pipelines
Claude Opus for the current creative writing leader
GPT-5 and Gemini for the main competing families
Gemma for the open weights line used in antislop fine-tunes

References

EQ-Bench Longform Creative Writing Leaderboard, eqbench.com. https://eqbench.com/creative_writing_longform.html
EQ-Bench Creative Writing v3 Leaderboard, eqbench.com. https://eqbench.com/creative_writing.html
EQ-Bench About page, eqbench.com. https://eqbench.com/about.html
longform-writing-bench repository, EQ-Bench organisation, GitHub. https://github.com/EQ-bench/longform-writing-bench
creative-writing-bench repository, EQ-Bench organisation, GitHub. https://github.com/EQ-bench/creative-writing-bench
Sam Paech, slop-score repository, GitHub. https://github.com/sam-paech/slop-score
Sam Paech, auto-antislop repository, GitHub. https://github.com/sam-paech/auto-antislop
Sam Paech, gemma-3-27b-it-antislop, Hugging Face. https://huggingface.co/sam-paech/gemma-3-27b-it-antislop
Antislop: A Comprehensive Framework for Identifying and Eliminating Slop in LLMs, arXiv 2510.15061. https://arxiv.org/abs/2510.15061
Chang and Lo, BooookScore: A Systematic Exploration of Book-Length Summarization in the Era of LLMs, ICLR 2024. https://arxiv.org/abs/2310.00785
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens, arXiv 2403.12766. https://arxiv.org/abs/2403.12766
Park et al., LongStory: Coherent, Complete and Length Controlled Long Story Generation, PAKDD 2024, arXiv 2311.15208. https://arxiv.org/abs/2311.15208
Lost in Stories: Consistency Bugs in Long Story Generation by LLMs, ACL 2026, arXiv 2603.05890. https://arxiv.org/abs/2603.05890
ConStory-Bench project page. https://picrew.github.io/constory-bench.github.io/
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs, ICLR 2025, arXiv 2409.02076. https://arxiv.org/abs/2409.02076
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, arXiv 2409.16191. https://arxiv.org/abs/2409.16191
LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm, arXiv 2502.19103. https://arxiv.org/abs/2502.19103
WritingBench: A Comprehensive Benchmark for Generative Writing, NeurIPS 2025, arXiv 2503.05244. https://arxiv.org/abs/2503.05244
Lech Mazur, LLM Creative Story-Writing Benchmark repository, GitHub. https://github.com/lechmazur/writing
Yang et al., Re3: Generating Longer Stories With Recursive Reprompting and Revision, EMNLP 2022, arXiv 2210.06774. https://arxiv.org/abs/2210.06774
Agents' Room: Narrative Generation through Multi-step Collaboration, OpenReview 2024. https://openreview.net/forum?id=HfWcFs7XLR
StoryWriter: A Multi-Agent Framework for Long Story Generation, ACM CIKM 2025. https://dl.acm.org/doi/10.1145/3746252.3761616
Multi-Agent Character Simulation for Story Writing, In2Writing workshop, ACL 2025. https://aclanthology.org/2025.in2writing-1.9.pdf
What is Wrong with Perplexity for Long-context Language Modeling, arXiv 2410.23771. https://arxiv.org/html/2410.23771v5
Sudowrite Muse 1.5, Sudowrite blog. https://sudowrite.com/blog/sudowrite-vs-novelcrafter-the-ultimate-ai-showdown-for-novelists/
Wikipedia, 1 the Road by Ross Goodwin. https://en.wikipedia.org/wiki/1_the_Road
Wikipedia, Death of an Author (novella) by Stephen Marche. https://en.wikipedia.org/wiki/Death_of_an_Author_(novella)
Creative Writing LLM Leaderboard 2026, Awesome Agents. https://awesomeagents.ai/leaderboards/creative-writing-llm-leaderboard/
Best LLMs for Writing in 2026, Intellectual Lead. https://intellectualead.com/best-llm-writing/
BestAI, LLM Longform Creative Writing Benchmark v3 Released. https://bestai.com/news/LLM-longform-creative-writing-benchmark-v3-dcd5c944af
