Longform Creative Writing
Last reviewed
May 16, 2026
Sources
30 citations
Review status
Source-backed
Revision
v2 · 6,011 words
Longform Creative Writing is a capability area and benchmark family for large language models that measures how well a system can write coherent, engaging fiction or narrative non-fiction over thousands of words rather than in a single short response. The most cited public artifact under this name is the Longform Creative Writing benchmark on EQ-Bench, authored by Samuel J. Paech, in which a model brainstorms a story, plans an eight-chapter outline, and then writes roughly 8,000 words of continuous fiction that an LLM judge scores chapter by chapter. The broader research field around longform creative writing also includes BooookScore, NovelQA, LongStory, ConStory-Bench, LongGenBench, HelloBench, LongEval, WritingBench and Lech Mazur's LLM Creative Story-Writing Benchmark, each probing a different facet of sustained narrative quality.
| Longform Creative Writing | |
|---|---|
| Overview | |
| Full name | Longform Creative Writing Benchmark |
| Abbreviation | LCW |
| Description | An LLM-judged benchmark evaluating extended narrative generation across eight chapters of roughly 1,000 words each |
| Release date | 2024 |
| Latest version | v1.11 (interface), generation rubric v3 |
| Benchmark updated | 2026-02-19 |
| Authors | Samuel J. Paech |
| Organization | EQ-Bench |
| Technical Details | |
| Type | Creative writing, extended narrative |
| Modality | Text |
| Task format | Multi-turn story generation with planning and reflection |
| Number of tasks | 1 story per run, 8 chapters |
| Total examples | 8 chapters per evaluation |
| Evaluation metric | 0 to 100 rubric score, degradation, slop, repetition |
| Domains | Fiction writing, narrative consistency, character development |
| Languages | English |
| Performance | |
| Human performance | Not formally reported |
| Baseline | Variable by model |
| SOTA score | Approximately 85 to 90 on the rubric |
| Judge model | Claude Sonnet 4.6 (default since 2026) |
| Sampling defaults | temperature 0.7, min_p 0.1 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Code (long form) | longform-writing-bench |
| Code (short form) | creative-writing-bench |
| License | Open source |
| Predecessor | Longform Creative Writing v2 |
For large language model evaluation, the phrase longform creative writing has come to mean any task where the model must produce an original narrative whose length materially stresses planning, memory and prose stamina. There is no canonical word threshold in the literature, but several practical conventions have emerged. The EQ-Bench Longform Creative Writing run targets approximately 8,000 words across eight chapters. HelloBench treats 4,000 words as a soft ceiling that most contemporary models fail to reach with intact quality. LongGenBench evaluates outputs of 16,000 and 32,000 tokens, and the ConStory-Bench paper from 2026 targets stories of 8,000 to 10,000 words. In all of these settings the central question is the same: does the model still tell a good story near the end of the output, given everything it has already committed to?
Longform creative writing is distinct from three adjacent capabilities. It differs from short creative writing, where outputs are short enough that planning and consistency rarely fail. It differs from long context comprehension, where the model must understand long input rather than produce long output, which is the regime explored by long context benchmarks such as NovelQA, Loong, LongBench and RULER. And it differs from structured longform generation of articles or reports, which trades narrative arc for factual coverage and is the focus of WritingBench, LongGenBench and LongEval. A model can be excellent at one of these and weak at the others.
The shift from short prompts to longform stories changes the failure surface of an LLM in ways that short benchmarks systematically miss. A model that writes one sparkling paragraph can still drown on chapter five because every additional token adds a new opportunity for inconsistency, for stylistic drift and for the model to repeat phrases it has already used. Three classes of stakeholder have driven the rise of this benchmark family.
Researchers care because longform creative writing exposes weaknesses in long context handling, planning and reward modeling that are invisible at the prompt-response level. Practitioners care because real fiction tools such as Sudowrite, NovelCrafter and the writing rooms inside Claude and ChatGPT deliver value mostly when they can sustain quality over chapters. And model developers care because the rubric scores correlate informally with subscriber sentiment among writers, a vocal segment of consumer LLM users.
| Reason | What it captures | Why short benchmarks miss it |
|---|---|---|
| Quality degradation | Whether prose decays late in the output | Single-prompt tests never reach the decay zone |
| Character consistency | Voice, motivation and detail stability | One paragraph cannot contradict itself |
| Plot tension | Pacing, foreshadowing, payoff | Short prompts are answered, not paced |
| Memory fidelity | Recall of small details across chapters | No prior chapters to remember |
| Style stamina | Avoiding slop and verbal tics over thousands of words | Slop is rare in short answers |
The most consistent failure mode is coherence collapse. A character introduced as left-handed in chapter one fires a pistol right-handed in chapter four. A village that the model called Greythorne becomes Greythorpe. The Microsoft Research paper Lost in Stories: Consistency Bugs in Long Story Generation by LLMs, accepted to ACL 2026, catalogues these errors across five categories and 19 subtypes, finding that they cluster near the middle of narratives and inside passages with higher token-level entropy. The accompanying ConStory-Bench benchmark uses a three-stage pipeline of error extraction, contradiction pairing and evidence chain construction to score 2,000 generated stories with a metric called Consistency Error Density.
Character drift is among the failure modes most visible to human readers. Models routinely change a character's age, occupation or speech register between chapters, particularly after long stretches in which the character was off-page. The EQ-Bench rubric explicitly tracks character consistency as a major component of the overall score and penalises any chapter in which a named character behaves out of line with their established profile.
Language models trained on internet text often default to a flat narrative voltage in which every paragraph carries roughly equal weight. Real fiction relies on swings of intensity, foreshadowing and deliberate gaps. The EQ-Bench Longform rubric uses dimensions such as compelling plot, emotional engagement and tonal consistency to penalise output that reads like a uniform briefing document rather than a story.
Slop is the EQ-Bench shorthand for distinctive phrasing tics common in LLM output. Sam Paech's slop score, a weighted composite computed by his slop-score tool, attributes 60 percent of its value to overused individual words, 25 percent to the "not just X but Y" construction and 15 percent to overused trigrams. The same author's auto-antislop repository and the Antislop framework, published on arXiv in October 2025, use Final Token Preference Optimization (FTPO) to reduce slop frequency by about 90 percent without harming GSM8K, MMLU or creative writing scores. Paech has also released antislop fine-tunes of Gemma 3 12B and 27B that score well on creative writing leaderboards despite their modest size.
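The 60/25/15 weighting described above can be illustrated with a short sketch. Only the three weights come from the description; the component names, the assumption that each component is already normalised to the range 0 to 1, and the function itself are hypothetical and are not taken from the slop-score tool.

```python
# Weights as described in the text; everything else here is illustrative.
SLOP_WEIGHTS = {"words": 0.60, "not_x_but_y": 0.25, "trigrams": 0.15}


def composite_slop(components: dict[str, float]) -> float:
    """Weighted sum of three component scores.

    Assumption: each component has already been normalised to 0..1,
    where higher means more slop. The real tool may normalise differently.
    """
    missing = set(SLOP_WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(SLOP_WEIGHTS[name] * components[name] for name in SLOP_WEIGHTS)
```

Under these assumptions, a text that scores 0.5 on overused words and 0 on the other two components receives a composite of 0.3, which shows how heavily the single-word component dominates the total.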
Longform creative writing places sustained pressure on the model's ability to retrieve and reuse small facts from earlier in its own output. Papers on perplexity in long context, including the LongPPL work in 2024 and 2025, have shown that ordinary perplexity is a poor predictor of long context behaviour because most tokens are easy and only a small fraction of key tokens matter. The same intuition applies to fiction. A model that nails 99 percent of tokens in chapter eight but forgets which character was carrying the locket has failed the reader regardless of its mean perplexity.
A family of related benchmarks has emerged since 2023, each emphasising a different slice of the problem. The summary below distinguishes generation benchmarks from comprehension benchmarks that are sometimes lumped together with longform creative writing.
| Benchmark | Year | Output length | Focus | Judge / metric |
|---|---|---|---|---|
| EQ-Bench Longform Creative Writing | 2024, v3 in 2025 | About 8,000 words, 8 chapters | Sustained narrative quality and degradation | Claude Sonnet 4.6 rubric |
| EQ-Bench Creative Writing v3 | 2024, v3 in 2025 | Short stories, 32 prompts and 3 iterations | Discriminating top tier creative ability | Sonnet 4 rubric plus Elo |
| BooookScore | ICLR 2024 | Book-length summaries of novels longer than 100K tokens | Coherence of long summary chains | LLM-detected error rate |
| NovelQA | 2024 | Multiple-choice and generative QA | Comprehension of more than 200K-token novels | GPT-4 judged |
| LongStory | PAKDD 2024 | Up to several thousand words | Length-controlled generation with completeness | Human ratings on coherence and completeness |
| ConStory-Bench (Lost in Stories) | ACL 2026 | 8,000 to 10,000 words | Consistency bugs across five error categories | Consistency Error Density and Group Relative Rank |
| LongGenBench | EMNLP 2024 / ICLR 2025 | 16K to 32K tokens | Following instructions while generating long text | Task-specific automatic metrics |
| HelloBench | 2024 | Over 4,000 words target | Open-ended long text across five subtasks | Calibrated LLM judge |
| LongEval | 2025 | Above 2,000 words | Plan-based generation for articles and Wikipedia | Content, structure and information density |
| WritingBench | NeurIPS 2025 | Tens to thousands of words | Six domains, 100 subdomains, query-aware criteria | Fine-tuned critic, 84 percent human alignment |
| LLM Creative Story-Writing Benchmark (Lech Mazur) | 2025, refreshed April 2026 | Short creative pieces with 10 mandatory elements | Element integration and prose quality | Pairwise comparison by panel |
The canonical artifact for the phrase is the EQ-Bench Longform Creative Writing benchmark at eqbench.com. It frames the task as a 13-step pipeline. The first five steps cover brainstorming a story concept, sketching character profiles and producing an eight-chapter outline that the model is allowed to critique and revise. The next eight steps produce the chapters themselves, each targeting about 1,000 words and each conditioned on the full previous narrative. After generation, the judge model scores each chapter on a 14-dimension rubric and produces a holistic rating of the full piece. Scores are then bootstrap resampled 500 times to give 95 percent confidence intervals.
The default judge in 2026 is Claude Sonnet 4.6. The default sampler settings are temperature 0.7 and min_p 0.1. The official command line uses the open source longform-writing-bench repository on GitHub. A full run costs roughly ten US dollars per evaluated model using current Sonnet pricing for judging, plus the cost of the model under test.
The rubric weights forced poetry or incoherent metaphor at five times the normal weight at the 1.7 scale and includes a structural penalty for single-sentence paragraphs, both of which were added in the v3 update in 2025 to reduce two specific failure patterns common in models that try to imitate literary fiction.
The short-form sister benchmark, Creative Writing v3, runs 32 prompts with three iterations each, for 96 generated items per model. It combines isolated rubric scores with a Glicko-2 based Elo from pairwise comparisons between neighbouring models on the leaderboard. Bias controls include truncating outputs to 4,000 characters to mitigate length bias, running every comparison in both A/B and B/A order to mitigate position bias, and using specific rubric items to penalise verbosity and poetic incoherence. The leaderboard reports both rubric score and Elo, and was retuned in v3 because v2 had saturated at the top.
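Two of the bias controls above, truncation and order swapping, can be sketched in a few lines. The `judge` callable is hypothetical: it is assumed to return a value between 0 and 1 expressing how strongly the first text shown is preferred. The averaging rule is likewise an assumption rather than the benchmark's documented aggregation.

```python
from typing import Callable

MAX_CHARS = 4000  # truncation length stated in the text


def debiased_preference(text_a: str, text_b: str,
                        judge: Callable[[str, str], float]) -> float:
    """Preference for A in [0, 1], averaged over both presentation orders.

    `judge(first, second)` is a hypothetical stand-in for an LLM judge call;
    it returns how strongly the FIRST text shown is preferred.
    """
    a, b = text_a[:MAX_CHARS], text_b[:MAX_CHARS]  # mitigate length bias
    a_shown_first = judge(a, b)   # value is a preference for A
    b_shown_first = judge(b, a)   # value is a preference for B
    # Convert the second result into a preference for A, then average.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0
```

A judge that always favours whichever text it sees first by the same margin yields exactly 0.5 here, so pure position bias cancels out instead of rewarding one model.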
As of April 2026, the public leaderboard shows Claude Opus 4.7 leading with an Elo of approximately 2216, followed by GPT-5.5 at an Elo of about 2024 and Claude Sonnet 4.6 at about 1991. On the rubric score scale, Claude Opus 4 is reported at about 73.8, GPT-5 at about 71.4 and Gemini 2.5 Pro at about 70.9. These numbers move week to week as new models are added and the judge is refreshed.
BooookScore, presented at ICLR 2024 by Chang, Lo, Goyal and Iyyer, is the canonical benchmark for book-length summarisation. The package generates summaries of books that exceed 100,000 tokens by recursively chunking, merging and compressing, then computes BooookScore as the proportion of summary sentences that contain none of a set of identified error types. The original paper found that closed-source LLMs such as GPT-4 and Claude 2 produced summaries with higher BooookScore than open-source models, and the v2 release added batched sentence-level annotation to reduce judging cost.
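The metric itself reduces to a simple ratio once each summary sentence has been annotated. The sketch below assumes the annotations are already available as one list of error labels per sentence; the label strings are placeholders, and the annotation step, which the real package performs with an LLM, is left out.

```python
def booookscore(sentence_errors: list[list[str]]) -> float:
    """Fraction of summary sentences annotated with no error types.

    `sentence_errors[i]` holds the (possibly empty) list of error labels
    assigned to sentence i. Label names are illustrative placeholders.
    """
    if not sentence_errors:
        raise ValueError("summary has no sentences")
    clean = sum(1 for errors in sentence_errors if not errors)
    return clean / len(sentence_errors)
```

For a four-sentence summary in which one sentence carries an error label, the function returns 0.75.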
While BooookScore measures comprehension and compression rather than original storytelling, it is a load-bearing component of the longform writing literature because it shows that even on highly structured tasks, coherence over book-length input is hard to obtain.
NovelQA, introduced on arXiv in March 2024, is a question-answering benchmark drawn from English novels with average context above 200,000 tokens. Questions are split into detail-oriented at 22.2 percent, single-hop at 42.8 percent and multi-hop at 35 percent. The annotators all hold or are pursuing degrees in English Literature. The headline finding is that model performance falls sharply when supporting evidence appears beyond the 100,000-token mark, which is a stronger and more interpretable version of the older lost-in-the-middle phenomenon. Generative answers are judged by GPT-4 with a Cohen's kappa of about 89 percent against human ratings.
LongStory, published in PAKDD 2024 by Kyeongman Park and colleagues, is both a method and an associated evaluation for length-controlled long story generation. It introduces a long and short-term context weight calibrator and discourse tokens to mark structural positions. In the authors' reported evaluation it beats PlotMachines and standard LLM baselines on coherence, completeness, relevance and repetitiveness.
The more recent ConStory-Bench from the Microsoft Research paper Lost in Stories: Consistency Bugs in Long Story Generation by LLMs takes a different angle. Rather than designing a new generation method, it standardises evaluation of consistency, providing 2,000 prompts, a five-category error taxonomy with 19 subtypes and the Consistency Error Density and Group Relative Rank metrics. The paper found that errors are concentrated in the middle of narratives and in segments with higher token-level entropy.
LongGenBench, HelloBench and LongEval treat longform creative writing as one corner of a broader long-form generation problem.
LongGenBench evaluates ten state of the art LLMs on prompts that demand 16K and 32K tokens of output across four scenarios and three instruction types. It finds that models which do well on RULER-style needle-in-a-haystack tests still struggle to generate that much coherent text. HelloBench grounds its design in Bloom's Taxonomy and runs five subtasks including text completion and heuristic generation, observing that most current LLMs cannot reliably produce text longer than 4,000 words at quality. LongEval, published in 2025, focuses on plan-based generation in arXiv-paper, blog and Wikipedia domains with a target above 2,000 words and reports separate scores for content quality, structural coherence and information density.
WritingBench, published at NeurIPS 2025 by the Qwen Team, takes a much broader view of writing. It uses 1,000 queries across six domains, including Literature and Arts as well as Academic and Engineering, Finance and Business, Politics and Law, Education and Advertising and Marketing, with 100 subdomains in total. Its evaluation framework lets the LLM generate instance-specific criteria for each query and uses a fine-tuned critic for scoring, reporting 84 percent agreement with human raters compared with 67 percent and 58 percent for static rubrics. The current public leaderboard is led by Qwen3-235B-A22B-Thinking-2507 with a score of about 0.883.
Lech Mazur's benchmark complements the EQ-Bench family by focusing on integration. Each model writes a short fiction piece that must meaningfully incorporate ten mandatory elements: character, object, concept, attribute, action, method, setting, timeframe, motivation and tone. An 18-question rubric covers narrative craft and element integration, and a panel of grader LLMs runs pairwise comparisons in both orders to remove position bias. After the April 29, 2026 refresh, top scores went to GPT-5.5 in extra-high reasoning mode at a global comparison score of about 3.0, GPT-5.4 at about 2.8, Claude Opus 4.7 at about 2.4 and Claude Sonnet 4.6 Thinking at about 2.1. Some entries note that Claude Opus 4.7 declined certain prompts and completed 347 of 400 stories, with scores reflecting only completed narratives.
Longform creative writing has driven significant methodological innovation in LLM evaluation, in part because the things human readers care about are notoriously hard to measure.
The dominant method for longform creative writing in 2025 and 2026 is LLM-as-judge with a structured rubric. The EQ-Bench family, WritingBench, HelloBench, LongEval and ConStory-Bench all use some variant of this approach. The advantages are obvious: a single judge model can score thousands of stories cheaply and consistently. The disadvantages are equally obvious: the judge has its own taste, can be flattered by prose that matches its own style and may be biased by length, position or verbosity.
The EQ-Bench project addresses judge quality directly through the Judgemark benchmark, which scores judges on separability, score stability and correlation with human preferences. The fourth version, Judgemark v4, drives the choice of Claude Sonnet 4.6 as the default judge for both EQ-Bench 3 and the Creative Writing leaderboards.
Creative Writing v3 was redesigned around pairwise comparisons because the rubric had saturated. The Glicko-2 rating system, applied to head-to-head matchups with neighbouring models in the leaderboard, gives much finer discrimination at the top. Lech Mazur's benchmark uses pairwise comparison as its primary signal rather than a fallback. The downside is that pairwise comparison scales quadratically in the number of models, which is why the EQ-Bench implementation only compares neighbours.
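The scaling argument is easy to make concrete. The sketch below counts judge matchups for an all-pairs tournament versus a scheme that only compares models within a fixed distance on the ranked leaderboard. The `window` parameter is an assumption made for illustration; the text says only that neighbouring models are compared.

```python
def all_pairs(n_models: int) -> int:
    """Number of unordered pairs in a full round-robin: n * (n - 1) / 2."""
    return n_models * (n_models - 1) // 2


def neighbour_pairs(n_models: int, window: int = 1) -> int:
    """Pairs (i, j) on a ranked list with 0 < j - i <= window.

    `window` is a hypothetical knob: 1 means each model is only compared
    with the model directly below it on the leaderboard.
    """
    return sum(min(window, n_models - 1 - i) for i in range(n_models))
```

With 100 models, a full round-robin needs 4,950 matchups, while comparing only adjacent models needs 99, so the neighbour scheme grows linearly rather than quadratically. Each matchup would still be judged in both orders if position bias is controlled as described earlier.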
Chatbot Arena and various closed user studies still anchor much of the practitioner intuition about which models write best, but they do not isolate longform writing as a sub-task at the scale of the dedicated benchmarks. Sudowrite's blind tests, reported around the launch of Muse 1.5 in mid 2025, found their fine-tuned model preferred about twice as often as Claude 3.7 Sonnet on fiction prose. NovelQA reports a Cohen's kappa of about 89 percent between its GPT-4 judge and human raters on generative answers.
The distinctive contribution of EQ-Bench Longform is the visual degradation sparkline, which plots per-chapter rubric scores across the eight chapters. The benchmark reports a numerical degradation score equal to the drop from the model's best chapter to its weakest chapter. Top-tier models in 2026 keep this number under five points, while weaker models often lose more than fifteen points by the final chapter. The benchmark categorises degradation into archetypes that include the quality cliff after chapter three or four, gradual decay throughout, oscillation between adjacent chapters, final chapter collapse and middle sag.
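The degradation score as defined above, the drop from the best chapter to the weakest, is a one-line computation over the per-chapter rubric scores. The helper below is a minimal sketch; the function name and the sample scores are illustrative, not taken from the benchmark code.

```python
def degradation(chapter_scores: list[float]) -> float:
    """Drop from the best-scoring chapter to the weakest-scoring chapter."""
    if not chapter_scores:
        raise ValueError("need at least one chapter score")
    return max(chapter_scores) - min(chapter_scores)


# Illustrative per-chapter scores for one eight-chapter run.
example_run = [82, 81, 80, 74, 73, 72, 71, 70]
print(degradation(example_run))  # prints 12
```

Note that this definition ignores chapter order, so a weak opening followed by recovery produces the same number as a late collapse. The sparkline and the archetype labels carry the ordering information that the single number drops.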
Across the literature, automatic metrics complement the LLM judge. The most common are repetition based on n-gram overlap across chapters, slop score from the EQ-Bench slop dictionary, length statistics and structural penalties for excessive single-sentence paragraphs. These metrics are cheap, deterministic and resistant to judge bias, which is why they are reported separately on the leaderboard rather than folded into the rubric score.
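As an illustration of the repetition metric, the sketch below measures how many of a chapter's distinct word trigrams already appeared in earlier chapters. The tokenisation, the choice of trigrams and the normalisation are all assumptions; the leaderboard's exact formula is not given here.

```python
import re


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Distinct lowercase word n-grams in a text (simple regex tokeniser)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def cross_chapter_repetition(chapters: list[str], n: int = 3) -> list[float]:
    """For each chapter after the first, the share of its distinct n-grams
    that already appeared in any earlier chapter."""
    seen: set[tuple[str, ...]] = set()
    rates: list[float] = []
    for index, chapter in enumerate(chapters):
        grams = ngrams(chapter, n)
        if index > 0:
            rates.append(len(grams & seen) / len(grams) if grams else 0.0)
        seen |= grams
    return rates
```

Because the computation is deterministic and involves no judge model, two runs over the same story always give the same rates, which is the property that makes such metrics useful as a cross-check on rubric scores.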
For most longform creative writing tasks, perplexity is a misleading signal. Recent work on LongPPL has shown that perplexity computed over all tokens does not track benchmark performance because most tokens in fiction are easy to predict, while the difficult tokens that distinguish good and bad narratives are rare. LongPPL, which restricts perplexity to selected key tokens, correlates strongly with downstream long context benchmarks at about minus 0.96, but the technique is not yet standard in creative writing leaderboards.
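The difference between ordinary perplexity and a key-token variant can be shown with a small sketch. It assumes per-token natural-log probabilities are already available and that a boolean mask marks the key tokens. How LongPPL actually selects key tokens is not covered here, so the mask is simply supplied by the caller.

```python
import math


def perplexity(log_probs: list[float]) -> float:
    """exp of the negative mean natural-log probability."""
    if not log_probs:
        raise ValueError("no tokens to score")
    return math.exp(-sum(log_probs) / len(log_probs))


def key_token_perplexity(log_probs: list[float], is_key: list[bool]) -> float:
    """Perplexity restricted to tokens flagged as key.

    The mask is an input; the selection rule is outside this sketch.
    """
    selected = [lp for lp, key in zip(log_probs, is_key) if key]
    return perplexity(selected)
```

If nine easy tokens are each predicted with probability 0.9 and one key token with probability 0.1, the all-token perplexity stays below 2 while the key-token perplexity is 10. The aggregate number hides exactly the token that a reader would notice.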
The leaderboards in 2026 show a stable top tier dominated by Anthropic's Claude line, with OpenAI's GPT-5 family and Google's Gemini 2.5 and 3.1 Pro models trading the second and third slots depending on the metric. The picture below pools EQ-Bench Creative Writing v3, the Longform Creative Writing leaderboard, Lech Mazur's benchmark and several practitioner surveys from late 2025 and early 2026. Numbers shift with each refresh and should be read as direction rather than gospel.
| Model | Provider | EQ-Bench CW v3 rubric (approximate) | Longform strengths | Notable weaknesses |
|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | top tier | Voice consistency, emotional nuance, sustained chapter quality | Lyrical drift in technical sections |
| Claude Opus 4 | Anthropic | about 73.8 | Best documented v3 score for a base model | Conservative tone in some genres |
| GPT-5.5 (extra-high reasoning) | OpenAI | leads Lech Mazur element integration | Tight plotting and pacing | Can read engineered rather than emotional |
| GPT-5 | OpenAI | about 71.4 | Strong all-rounder | Some writers reported regression from GPT-4.5 era |
| Gemini 3.1 Pro | Google | high | Coherent long drafts, low drift | Less lyrical than Claude on pure fiction |
| Gemini 2.5 Pro | Google | about 70.9 | Strong structured non-fiction | Tonal flatness on emotional scenes |
| Claude Sonnet 4.6 | Anthropic | about 68 to 70 | Natural prose, cost-effective default | Below Opus on the most ambitious narratives |
| DeepSeek V3.2 / V4 Pro | DeepSeek | about 66 | Strong open weights baseline for fiction | More slop than the closed leaders |
| Grok 4.2 | xAI | about 65 | Distinct voice, opinionated | Lower coherence on long arcs |
| Kimi K2.5 / K2.6 | Moonshot AI | about 63 | Competitive on Chinese fiction | Less tested on English longform |
| Qwen 3.5 / 3.6 Max | Alibaba | about 62 | Leads WritingBench in the Thinking variant | English literary tone is uneven |
| Llama 4 Maverick | Meta | about 59 | Best open weights generalist | Significant degradation late in chapters |
| o3 and o1 reasoning | OpenAI | about 55 to 58 | Solid planning capability | Over-schematised prose, lower than chat siblings |
| Mistral Large 3 | Mistral | about 54 | Reliable European option | Less expressive on dialogue |
| Phi-4 | Microsoft | about 49 | Strong per-parameter writer | Limited stamina past short stories |
| Mistral Nemo Gutenberg, Llama 3.1 Storm | Community fine-tunes | low to mid | Very low slop, distinct voice | Coherence falls apart on chapter-scale fiction |
| Gemma 3 27B antislop | Sam Paech fine-tune | competitive vs base | Antislop training reduces tics by about 90 percent | Smaller context budget |
| Muse 1.5 | Sudowrite | not on EQ-Bench | Fine-tuned on published novels, preferred about 2x over Claude 3.7 Sonnet in blind fiction tests | Closed model, narrow domain |
Several patterns hold across this table. Reasoning models such as the o-series often rank lower than their general capability tier on creative writing because long structured chains of thought produce over-schematised prose. Community fine-tunes of open weights such as Mistral Nemo Gutenberg and Llama 3.1 Storm achieve the lowest slop scores but lose coherence over chapter-scale fiction. Frontier closed models hold the long-form story together far more reliably than any open weight model in early 2026.
The techniques used in production fiction stacks rarely involve calling a base model once. Almost all of the strongest published systems use some form of planning, decomposition and revision pipeline.
| Technique | First described / popularised | What it does | Why it helps longform |
|---|---|---|---|
| Outline first prompting | Common practice since the GPT-3 era | Generate an outline before any prose | Externalises plan so each chapter can be conditioned on it |
| Recursive reprompting and revision (Re3) | Yang et al., EMNLP 2022 | Plan, generate, rerank for plot coherence, edit for factual consistency | Human raters preferred Re3 plots 14 percent more often, premise relevance 20 percent more often |
| DOC, hierarchical outlining | Yang et al., 2023 | Outline a story top-down and expand each node | Pushes structure down before generating prose |
| Chapter-conditioned generation | EQ-Bench Longform 2024 | Feed full prior chapters as context for each new chapter | Preserves character and setting details across an entire book |
| Multi-agent storytelling (Agents' Room) | OpenReview 2024 | Decompose into specialised agents for plot, character, dialogue and prose | Outperforms single-model baselines on long narratives |
| StoryWriter multi-agent framework | ACM CIKM 2025 | Modular open source pipeline using planning, writing and revision agents | Used to generate a 6,000-story dataset averaging 8,000 words |
| Multi-agent character simulation | ACL workshop 2025 | Director agent orchestrates character agents who role-play scenes | Produces richer dialogue and emergent character voice |
| Antislop sampling and FTPO | Sam Paech, arXiv October 2025 | Backtracking sampler and token-level preference optimisation | Reduces slop frequency by about 90 percent without harming general capability |
| Retrieval over the story so far | SCORE and related systems | RAG over earlier chapters with explicit state tracking | Pushes character item state consistency to about 98 percent |
| Long context base models | Claude, Gemini, GPT-5 generations from 2024 onwards | One-million-token context windows | Removes the need for chunked summarisation in many fiction workloads |
Fiction-focused tools combine several of these techniques. Sudowrite layers brainstorming, scene-level generation, prose refinement and a proprietary fine-tuned Muse model on top of frontier APIs. NovelCrafter pairs a structured codex for characters, locations, factions and magic systems with prompt orchestration that feeds the codex into each generation. Both tools effectively implement chapter-conditioned generation with retrieval-augmented memory. Open source frameworks such as StoryWriter and Agents' Room formalise the agent layer, and provide replicable baselines for academic comparison.
Although no LLM has yet produced a critically acclaimed full-length novel without heavy human collaboration, several published works illustrate the state of the art in different eras.
1 the Road, published by Jean Boîte Éditions in 2018, is an experimental novel generated by an artificial neural network during a March 2017 road trip by Ross Goodwin from New York to New Orleans, with the model conditioned on sensor inputs and on a corpus of nearly 200 hand-picked books. The text was published unedited as a historical artifact.
Death of an Author by Stephen Marche, writing under the pen name Aidan Marchine, was published in 2023 as one of the first long-form novellas to use extensive AI-generated text, drawing on ChatGPT and Cohere models. It was reviewed in The New York Times and Slate as a serious if uneven experiment in human and machine collaboration.
In 2025 and 2026 the most visible AI-assisted fiction has come from professional novelists using tools such as Sudowrite and NovelCrafter to draft scenes and chapters. The blind tests Sudowrite reported around the Muse 1.5 launch in June 2025 showed that, on fiction prose alone, a domain-tuned model can be preferred over a general frontier model. None of this work is fully autonomous, and even the AI-judged EQ-Bench Longform pipeline depends on a structured planning rubric that human researchers have spent months refining.
Because it is the most cited benchmark using the literal phrase longform creative writing, the EQ-Bench Longform pipeline deserves a section of its own. The numbers and parameters below come from the official site and from the longform-writing-bench README.
The pipeline consists of 13 generation steps. The first five build the foundation. Step one is a brainstorming step in which the model proposes story concepts. Step two is a critique step in which the model evaluates and refines them. Step three locks the concept and produces character profiles. Step four produces a chapter-by-chapter outline. Step five is a reflection pass that allows the model to revise both characters and outline before committing.
Steps six through 13 produce the eight chapters in order. Each chapter generation receives the concept, the character profiles, the chapter outline and the full text of all previous chapters as context. The default target length is 1,000 words per chapter, and chapters that fall significantly outside that target are penalised through the length and structural metrics.
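The chapter loop can be sketched as follows. The `generate` callable stands in for a call to the model under test, and the prompt wording and function name are hypothetical; only the set of context items and the 1,000-word default come from the description above. This is not the benchmark's actual prompt template.

```python
from typing import Callable


def write_chapters(concept: str, profiles: str, outline: list[str],
                   generate: Callable[[str], str],
                   target_words: int = 1000) -> list[str]:
    """Generate chapters in order, feeding all earlier chapters back in.

    `generate(prompt)` is a hypothetical stand-in for the model call.
    """
    chapters: list[str] = []
    numbered_outline = "\n".join(f"{i}. {p}" for i, p in enumerate(outline, 1))
    for number, plan in enumerate(outline, start=1):
        story_so_far = "\n\n".join(chapters) or "(nothing yet)"
        prompt = "\n\n".join([
            f"Concept:\n{concept}",
            f"Character profiles:\n{profiles}",
            f"Outline:\n{numbered_outline}",
            f"Story so far:\n{story_so_far}",
            f"Write chapter {number} ({plan}) in about {target_words} words.",
        ])
        chapters.append(generate(prompt))
    return chapters
```

Because every prompt contains the full text of earlier chapters, the context grows with each step, and by the final chapter the model is reading roughly seven chapters of its own prose. That growth is what turns the later chapters into a test of memory fidelity as well as prose quality.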
After generation, the judge scores each chapter individually against the 14-dimension rubric, and then provides a holistic rating of the full book. Chapter scores are weighted equally and a separate weight is applied to the holistic rating. Final scores are reported with 95 percent confidence intervals from 500 bootstrap resamples. The judge is Claude Sonnet 4.6 by default in 2026.
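A percentile bootstrap of the kind described, 500 resamples and a 95 percent interval, can be sketched with the standard library. What the benchmark resamples (chapters, iterations or individual rubric items) is not stated here, so the sketch simply resamples a flat list of scores; the function name and the fixed seed are illustrative.

```python
import random
import statistics


def bootstrap_ci(scores: list[float], n_resamples: int = 500,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score.

    Assumption: `scores` is a flat list of the units being resampled.
    """
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper
```

With only eight chapter scores per run, the interval is necessarily wide, which is one reason the `--iterations` flag matters: more independent runs give the bootstrap more data to resample.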
The rubric covers compelling plot, narrative coherence, character consistency, chapter plan adherence, emotional engagement, nuanced characterisation, tonal consistency, prose quality, dialogue naturalness, originality, scene craft, structural integrity, pacing and avoidance of common AI failure modes. The v3 update in 2025 added weighted penalties for forced poetry or incoherent metaphor at five times the normal weight at the 1.7 scale, and a structural penalty for excessive single-sentence paragraphs. Both changes target failure modes that judges had been under-penalising.
The EQ-Bench team documents a set of failure modes that appear across model families.
| Failure mode | Description | Approximate frequency in tested models |
|---|---|---|
| Weak dialogue | Unnatural or stilted conversations | High, around 60 percent |
| Tell don't show | Excessive exposition over demonstration | High, around 70 percent |
| Purple prose | Overly ornate language | Medium, around 40 percent |
| Predictability | Formulaic plot development | High, around 65 percent |
| Metaphor abuse | Forced or incoherent metaphors | Medium, around 45 percent |
| Character drift | Inconsistent characterisation | Medium, around 50 percent |
The per-chapter score sparkline reveals recurring degradation patterns. The most common is a quality cliff after chapter three or four, often visible as a five to ten point drop in successive ratings. Other archetypes are gradual decay across all eight chapters, oscillation between adjacent chapters, final chapter collapse where the model rushes the ending, and middle sag in chapters four to six where the model loses confidence in its plan.
The canonical command from the repository README is:
python3 longform_writing_bench.py \
--test-model "google/gemini-2.0-flash-001" \
--judge-model "anthropic/claude-sonnet-4" \
--runs-file "results/longform_bench_runs.json" \
--run-id "demo" --threads 12 --iterations 1
Key flags include --skip-generation to re-judge existing outputs, --redo-judging to apply an updated rubric and --iterations to control how many independent runs feed the confidence intervals. All file writes are atomic and locked to support parallel execution and crash recovery. A full single-iteration run on a frontier model takes roughly fifteen to thirty minutes wall clock plus API time, and a typical evaluation costs about ten US dollars at 2026 Sonnet judging rates.
Longform creative writing as a measurement problem is far from solved. Several open problems are actively debated in 2026.
First, there is no consensus human baseline. Professional novelists writing eight 1,000-word chapters under the same constraints would establish a reference point, but the cost and time involved have so far prevented anyone from gathering such data at scale. Without a human anchor, top model scores have nothing to be compared against and are hard to interpret.
Second, the dominant evaluation method, LLM-as-judge, is itself an LLM with taste. The Judgemark benchmark partially addresses this by quantifying judge separability and stability, and the choice of Claude Sonnet 4.6 as default reflects its leading Judgemark score. But the fundamental tension in using one model to grade another remains, and is most acute when the judge and the candidate share a family.
Third, almost every benchmark is English only. ConStory-Bench, WritingBench and the EQ-Bench family include some multilingual coverage, but the leaderboards are dominated by English stories. Models that are strong on Chinese, Japanese or Spanish fiction are systematically under-represented.
Fourth, leaderboards and judging dimensions are oriented toward a particular literary aesthetic. The current rubrics reward emotional engagement, character nuance and tonal consistency, all of which favour conventional realist fiction. Experimental, satirical or genre-specific writing styles can score lower simply because the judge is not tuned to them.
Fifth, there is the question of saturation. EQ-Bench Creative Writing v2 saturated within about two years and required redesign. The Longform variant adds enough degrees of freedom through chapter weighting, slop scoring and degradation tracking to delay this, but the same forces apply, and Sam Paech has already discussed an eventual v4 redesign publicly.
Finally, the relationship between longform creative writing and agentic workflows is increasingly important. Multi-agent storytelling frameworks such as Agents' Room and StoryWriter mean that the practical answer to the question of which model writes best is increasingly the answer to which combination of model, prompt and orchestration writes best. Future benchmarks may need to evaluate fiction pipelines rather than fiction models.