WritingBench is a comprehensive benchmark for evaluating the generative writing capabilities of large language models (LLMs) across diverse real-world writing tasks. Developed by researchers at Alibaba Group (X-PLUG), Renmin University of China, and Shanghai Jiao Tong University, WritingBench addresses a longstanding gap in LLM evaluation: while most benchmarks focus on reasoning, coding, or factual knowledge, few systematically test writing quality across professional domains. The benchmark comprises 1,239 writing queries spanning six primary domains and 100 fine-grained subdomains, paired with a query-dependent evaluation framework that dynamically generates scoring criteria for each individual prompt rather than relying on fixed rubrics.
The paper, authored by Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, and Fei Huang, was first published on arXiv in March 2025 (arXiv:2503.05244). It was subsequently accepted as a poster at the NeurIPS 2025 Datasets and Benchmarks Track, presented in San Diego on December 3, 2025. The benchmark, evaluation tools, trained critic model, and all associated code are released as open source under the Apache 2.0 license.
Evaluating writing quality in LLMs presents distinct challenges compared to evaluating factual accuracy or code correctness. Writing is inherently subjective, context-dependent, and multidimensional. A legal brief requires different qualities than a marketing slogan, and a research abstract demands different skills than a screenplay. Prior benchmarks for writing evaluation suffered from several limitations that WritingBench was designed to address.
Most existing writing benchmarks covered a narrow range of tasks or applied uniform evaluation criteria regardless of the writing domain. The authors identified three key shortcomings in previous work: limited task coverage, short and minimally contextualized inputs, and static evaluation criteria that ignore per-task requirements. The coverage gap is visible in a side-by-side comparison:
| Benchmark | Queries | Domains | Subdomains | Avg. Input Tokens | Max Input Tokens |
|---|---|---|---|---|---|
| EQ-Bench | 241 | 1 | N/A | 130 | 213 |
| LongBench-Write | 120 | 7 | N/A | 87 | 684 |
| HelloBench | 647 | 5 | 38 | 1,210 | 7,766 |
| WritingBench | 1,239 | 6 | 100 | 1,546 | 19,361 |
EQ-Bench, for example, evaluated only creative fiction writing with short prompts averaging 130 tokens. LongBench-Write focused on length compliance with just 120 queries and minimal contextual input. HelloBench offered broader coverage with 647 queries across five domains and 38 subdomains, but still fell short of representing the full spectrum of professional writing scenarios.
WritingBench addressed these gaps by providing substantially more queries (1,239), deeper domain coverage (100 subdomains), and much longer average input contexts (1,546 tokens, with some reaching 19,361 tokens). Critically, it also introduced a fundamentally different approach to evaluation: generating scoring criteria dynamically for each individual query rather than applying static rubrics.
Previous LLM-as-judge approaches typically used one of two strategies: a single set of global criteria applied to all writing tasks, or domain-specific criteria (one set per writing domain). Both approaches had serious alignment problems with human judgment. A global rubric cannot capture the specific requirements of a legal contract versus a poem, while domain-level rubrics still miss the fine-grained differences between, say, a patent application and a technical report within the same Academic and Engineering domain.
WritingBench's central insight is that evaluation criteria should be generated on a per-query basis. Each writing prompt has unique requirements regarding content, style, format, and length, and the evaluation criteria should reflect those specific requirements.
WritingBench organizes its 1,239 queries into six primary writing domains, each subdivided into fine-grained subdomains. The distribution of queries across domains is intentionally uneven, reflecting the varying breadth and complexity of different writing fields.
| Domain | Queries | Avg. Input Tokens | Example Subdomains |
|---|---|---|---|
| Academic and Engineering | 187 | 1,915 | Paper Outline, Abstract, Literature Review, Technical Documentation, Patent, Introduction, Conclusion, Test Report, Defense Presentation, Research Proposal |
| Finance and Business | 238 | 1,762 | Market Analysis, Investment Analysis, Contract, Tender Document, Financial Reports, Business Correspondence, Meeting Minutes, Risk Management, Strategic Planning, Pitch Deck |
| Politics and Law | 226 | 2,274 | Legal Opinion, Case Study, White Paper, Policy Advocacy, Judgment Document, Legal Agreement, Government Speech, Regulatory Analysis |
| Literature and Arts | 242 | 1,133 | Novel Outline, Poetry, Screenplay, Book Review, Character Design, Plot Development, Lyric Writing, Fan Fiction |
| Education | 151 | 1,173 | Lesson Plan, Curriculum Design, Assignment Grading, Class Activity, Coursework, Teaching Materials, Evaluation Comments |
| Advertising and Marketing | 195 | 886 | Social Media Content, Product Description, Brand Story, Sales Letter, Promotional Copy, Slogans, Travel Guide |
Several design choices are worth noting. Politics and Law has the highest average input token count (2,274), reflecting the extensive reference materials and legal context these queries provide. Literature and Arts has the most queries (242) but relatively shorter inputs (1,133 tokens), since creative writing prompts tend to be more open-ended. Advertising and Marketing queries are the shortest on average (886 tokens), as marketing briefs tend to be concise.
The benchmark's queries were constructed through a two-phase process combining LLM-generated initial drafts with systematic human refinement.
Phase 1: Model-Augmented Generation. LLMs generated initial query drafts from domain-specific seed pools. These drafts were then systematically diversified along multiple dimensions: style adjustments, format specifications, length constraints, personalization options, content specificity requirements, and expression optimization. This process ensured broad coverage across the 100 subdomains.
Phase 2: Human-in-the-Loop Refinement. Thirty trained annotators (compensated at $18/hour) collected open-source reference materials and refined the LLM-generated queries. Five experts with LLM experience then performed query adaptation and material pruning, ensuring each query was realistic, well-specified, and representative of actual professional writing tasks. This human curation step was essential for avoiding the circular problem of LLMs evaluating LLM-generated test prompts that might inadvertently favor certain model architectures.
Each query in WritingBench is annotated along three requirement dimensions: style (R1), format (R2), and length (R3).
The distribution of length requirements across the benchmark skews toward shorter outputs, though a substantial portion demands extended generation:
| Length Requirement | Number of Queries |
|---|---|
| Under 1,000 tokens | 727 |
| 1,000 to 3,000 tokens | 341 |
| 3,000 to 5,000 tokens | 94 |
| Over 5,000 tokens | 77 |
These requirement dimensions serve a dual purpose: they make queries more realistic (professional writing tasks almost always come with constraints) and they enable more granular evaluation.
WritingBench includes queries in both Chinese and English, reflecting the bilingual research context of the authoring institutions. Model performance is reported separately for Chinese (ZH) and English (EN) subsets, revealing interesting patterns about how models handle writing tasks in different languages.
The evaluation framework is WritingBench's most significant methodological contribution. Rather than scoring all writing samples against the same criteria, the framework generates five unique evaluation criteria for each individual query, then uses those criteria to score responses.
For each query, an LLM generates five evaluation criteria tailored to the specific writing task. Each criterion includes a short name, a description of what it measures, and guidance for assigning a score on the 1-10 scale.
The criteria generation is guided by structured prompts that consider the query's domain, subdomain, style requirements, format requirements, length requirements, and any reference materials provided. This means a query asking for a legal contract will receive criteria such as "Clause Completeness" and "Legal Terminology Precision," while a query for a brand story might receive criteria like "Narrative Engagement" and "Brand Voice Consistency."
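Mechanically, this step amounts to one structured prompt per query. A minimal sketch of what such per-query criteria generation might look like; the prompt template and the `call_llm` helper are illustrative assumptions, not the released prompts:

```python
import json

# Hypothetical helper: wrap whatever LLM client serves as the criteria
# generator (the released pipeline supports LLM judges or the critic model).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

# Illustrative template -- NOT the paper's released prompt.
CRITERIA_PROMPT = """You are designing an evaluation rubric for a writing task.
Domain: {domain} / Subdomain: {subdomain}
Requirements: style={style}, format={fmt}, length={length}

Writing query:
{query}

Generate exactly 5 evaluation criteria tailored to this specific task.
Return a JSON list: [{{"name": "...", "description": "..."}}, ...]"""

def generate_criteria(query: str, domain: str, subdomain: str,
                      style: str, fmt: str, length: str) -> list[dict]:
    """Build the criteria-generation prompt and parse the judge's JSON reply."""
    prompt = CRITERIA_PROMPT.format(domain=domain, subdomain=subdomain,
                                    style=style, fmt=fmt, length=length,
                                    query=query)
    return json.loads(call_llm(prompt))
```

Because the template interpolates the query's domain, subdomain, and requirement annotations, a legal-contract query and a brand-story query naturally yield different rubrics from the same pipeline.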
Once criteria are generated, an evaluator scores each model response on a 1-10 scale for all five criteria. The evaluator also provides written justifications referencing specific passages in the response. The final score for a given query-response pair is the average across all five criteria scores.
WritingBench supports two evaluator backends: a general-purpose LLM judge (such as Claude 3.5 Sonnet or ChatGPT-4o) and the purpose-built critic model described later in this article.
The dynamic, query-dependent approach achieved substantially better alignment with human judgments than static alternatives. The authors compared three evaluation strategies across multiple judges:
| Evaluation Method | ChatGPT-4o Agreement | Claude 3.5 Sonnet Agreement |
|---|---|---|
| Static Global Criteria | 69% | 65% |
| Static Domain-Specific Criteria | 40% | 59% |
| Dynamic Query-Dependent Criteria | 79% | 87% |
The results are striking. Dynamic criteria improved Claude 3.5 Sonnet's human agreement from 65% (static global) to 87%, a 22 percentage point gain. For ChatGPT-4o, agreement improved from 69% to 79%. The static domain-specific approach actually performed worse than global criteria in some cases (40% for ChatGPT-4o), suggesting that intermediate-level rubrics can be counterproductive if they do not match the specific nuances of individual queries.
To reduce the cost and latency of using large proprietary LLMs as judges, WritingBench introduces a purpose-built critic model: a fine-tuned version of Qwen-2.5-7B-Instruct trained specifically for criteria-aware writing evaluation.
The critic model was trained on 50,000 supervised fine-tuning instances, each consisting of a writing query, a set of five evaluation criteria, a model response, and corresponding scores with justifications. These training samples were drawn from diverse queries and model outputs to ensure the critic model generalized across writing domains.
| Training Parameter | Value |
|---|---|
| Base Model | Qwen-2.5-7B-Instruct |
| Training Instances | 50,000 |
| Optimizer | AdamW |
| Learning Rate | 7e-6 |
| Epochs | 3 |
| Hardware | 8x A100 GPUs |
| Batch Size | 64 (with 8-step gradient accumulation) |
| Input Length Cap | 2,048 tokens |
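Read together, these numbers imply a particular decomposition of the effective batch. A minimal sketch of the configuration as a plain dict, assuming per-device batch size 1 across the 8 GPUs (the paper reports only the totals, so that split is an assumption):

```python
# Critic-model SFT hyperparameters transcribed from the paper's table.
# The per-device batch / accumulation split is an assumed decomposition:
# 1 per device x 8 GPUs x 8 accumulation steps = effective batch 64.
CRITIC_SFT_CONFIG = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "num_train_samples": 50_000,
    "optimizer": "adamw",
    "learning_rate": 7e-6,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,   # assumption, not reported
    "gradient_accumulation_steps": 8,
    "world_size": 8,                    # 8x A100 GPUs
    "max_seq_length": 2048,             # input length cap
}

def effective_batch_size(cfg: dict) -> int:
    """Effective batch = per-device batch x accumulation steps x GPU count."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * cfg["world_size"])

print(effective_batch_size(CRITIC_SFT_CONFIG))  # → 64
```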
The critic model achieved 83% agreement with human evaluators, placing it between ChatGPT-4o (79%) and Claude 3.5 Sonnet (87%) in evaluation quality. Given that it runs on a single GPU as a 7B-parameter model, this represents a significant practical advantage over calling proprietary API endpoints for every evaluation. The model produces both numerical scores and textual justifications, providing explainability for its assessments.
The original paper evaluated 16 models on WritingBench using the critic model as the evaluator. Scores are on a 1-10 scale, averaged across all five criteria per query.
| Model | Overall Avg. | Chinese | English | Academic and Eng. (D1) | Finance and Bus. (D2) | Politics and Law (D3) | Literature and Arts (D4) | Education (D5) | Advertising and Mktg. (D6) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 8.55 | 8.7 | 8.5 | 8.5 | 8.5 | 8.6 | 8.6 | 8.7 | 8.6 |
| Qwen-2.5-7B-filtered | 8.49 | 8.6 | 8.4 | 8.4 | 8.4 | 8.6 | 8.4 | 8.6 | 8.5 |
| Llama-3.1-8B-filtered | 8.49 | 8.6 | 8.4 | 8.5 | 8.4 | 8.6 | 8.4 | 8.6 | 8.5 |
| Qwen-Max | 8.37 | 8.4 | 8.3 | 8.3 | 8.3 | 8.4 | 8.4 | 8.5 | 8.4 |
| ChatGPT-4o-latest | 8.16 | 8.3 | 8.1 | 8.1 | 8.1 | 8.2 | 8.1 | 8.4 | 8.1 |
| o1-Preview | 8.15 | 8.1 | 8.2 | 8.0 | 8.1 | 8.2 | 8.2 | 8.4 | 8.1 |
| DeepSeek-V3 | 7.95 | 8.0 | 7.9 | 7.9 | 7.8 | 8.0 | 7.8 | 8.2 | 8.0 |
| LongWriter | 7.91 | 7.9 | 7.9 | 8.0 | 8.1 | 8.1 | 7.7 | 8.1 | 7.6 |
| Qwen-2.5-72B-Instruct | 7.90 | 8.0 | 7.9 | 8.0 | 7.8 | 8.1 | 7.7 | 8.2 | 7.8 |
| Gemini-1.5-Pro | 7.78 | 7.8 | 7.7 | 7.7 | 7.5 | 7.8 | 7.9 | 8.0 | 7.9 |
| Claude-3.5-Sonnet | 7.71 | 7.7 | 7.7 | 7.6 | 7.5 | 7.6 | 7.7 | 7.9 | 8.0 |
| Mistral-Large-Instruct | 7.64 | 7.6 | 7.7 | 7.7 | 7.6 | 7.8 | 7.3 | 7.9 | 7.6 |
| Qwen-2.5-7B-Instruct | 7.43 | 7.3 | 7.5 | 7.7 | 7.4 | 7.6 | 6.9 | 7.8 | 7.3 |
| Llama-3.3-70B-Instruct | 7.01 | 6.7 | 7.3 | 7.0 | 6.9 | 7.0 | 6.8 | 7.3 | 7.3 |
| Llama-3.1-8B-Instruct | 6.35 | 5.7 | 6.9 | 6.6 | 6.4 | 6.1 | 6.0 | 6.7 | 6.6 |
| Suri | 4.97 | 4.4 | 5.5 | 5.6 | 5.3 | 5.0 | 4.1 | 5.0 | 5.1 |
DeepSeek-R1 led the field. With an overall average of 8.55, DeepSeek-R1 achieved the highest scores among all models tested. Its performance was remarkably consistent across all six domains, never dropping below 8.5 in any category.
Chain-of-thought reasoning helped. Models with chain-of-thought (CoT) capabilities, specifically DeepSeek-R1 and o1-Preview, outperformed their non-CoT counterparts. This finding suggests that planning and reasoning before generating text improves writing quality, particularly for structurally complex tasks.
Education was the easiest domain. Across nearly all models, Education (D5) yielded the highest scores. This likely reflects the relatively standardized nature of educational writing tasks (lesson plans, grading rubrics, teaching materials) compared to more open-ended domains.
Literature and Arts was the hardest domain. D4 consistently produced the lowest scores with the highest variance. Creative writing requires originality, voice, and aesthetic judgment that current models struggle to demonstrate reliably.
Smaller models lagged significantly. The gap between 7B/8B base models and frontier models was substantial. Llama-3.1-8B-Instruct scored only 6.35 overall, 2.2 points behind DeepSeek-R1. However, as the data curation experiments showed, this gap could be largely closed through careful training data selection.
Specialized writing models underperformed. Suri, a model specifically fine-tuned for writing, scored the lowest at 4.97. This counterintuitive result suggests that narrow writing specialization without broad language understanding produces worse outcomes than general-purpose instruction tuning.
Language performance varied. Several models, particularly Llama-3.3-70B-Instruct (6.7 ZH vs. 7.3 EN) and Llama-3.1-8B-Instruct (5.7 ZH vs. 6.9 EN), performed noticeably worse on Chinese queries. This gap reflects the English-centric training data of Llama models.
Performance varied across the three requirement dimensions (style, format, length). Notably, the top models achieved near-perfect scores on length compliance:
| Model | Style (R1) | Format (R2) | Length (R3) |
|---|---|---|---|
| DeepSeek-R1 | 8.7 | 8.9 | 9.0 |
| Qwen-Max | 8.5 | 8.7 | 9.0 |
| Qwen-2.5-7B-filtered | 8.6 | 8.8 | 9.0 |
| Llama-3.1-8B-filtered | 8.6 | 8.8 | 8.9 |
Length requirements (R3) were generally the best-satisfied dimension, while style requirements (R1) proved the most challenging. This makes intuitive sense: following a word count instruction is more mechanical than capturing a specific tone or voice.
One of WritingBench's most practically significant contributions is demonstrating how the evaluation framework can be used for training data curation, not just model assessment.
The researchers started with 24,000 supervised fine-tuning (SFT) samples for writing tasks. They applied WritingBench's criteria generation pipeline to score every sample, then used the critic model to filter out the bottom 50%, retaining only 12,000 high-quality samples.
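Once every sample carries a critic score, the filtering step itself is a simple rank-and-truncate. A minimal sketch of the keep-top-half selection, with `critic_score` standing in for the critic model's per-sample average (the field names are illustrative, not the authors' code):

```python
def filter_sft_data(samples: list[dict], keep_fraction: float = 0.5) -> list[dict]:
    """Rank SFT samples by critic score and keep the top fraction.

    Each sample dict is assumed to already carry a 'critic_score' (the mean
    of the five criterion scores produced by the evaluation pipeline).
    """
    ranked = sorted(samples, key=lambda s: s["critic_score"], reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

# Toy example with made-up scores for six samples:
data = [{"id": i, "critic_score": s}
        for i, s in enumerate([9.2, 4.1, 7.8, 6.0, 8.5, 3.3])]
kept = filter_sft_data(data)
print([d["id"] for d in kept])  # → [0, 4, 2]
```

Applied to the paper's 24,000-sample pool with `keep_fraction=0.5`, this yields the 12,000-sample filtered set used in the experiments below.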
The results were remarkable. Both Qwen-2.5-7B and Llama-3.1-8B, when fine-tuned on the filtered 12,000 samples, achieved 8.49 on WritingBench, approaching DeepSeek-R1's 8.55 score despite being dramatically smaller models. The improvement was validated on an independent benchmark as well:
| Model | WritingBench | LongBench-Write |
|---|---|---|
| DeepSeek-R1 | 8.55 | 4.79 |
| Qwen-2.5-7B (baseline) | 7.43 | 4.39 |
| Qwen-2.5-7B (all 24K data) | 8.46 | 4.69 |
| Qwen-2.5-7B (filtered 12K) | 8.49 | 4.70 |
| Llama-3.1-8B (baseline) | 6.35 | 3.12 |
| Llama-3.1-8B (all 24K data) | 8.45 | 4.65 |
| Llama-3.1-8B (filtered 12K) | 8.49 | 4.65 |
Two findings stand out. First, the filtered 12K dataset consistently outperformed the full 24K dataset, confirming that data quality matters more than quantity for writing tasks. Second, a 7B model trained on carefully selected data can match or exceed the writing performance of GPT-4o (8.16), demonstrating that the barrier to high-quality writing generation is not necessarily model scale but training data quality.
The authors also explored the impact of chain-of-thought reasoning on writing quality through ablation experiments:
| Model Variant | WritingBench (D4) | EQ-Bench |
|---|---|---|
| DeepSeek-R1 | 8.55 | 84.99 |
| Qwen-2.5-32B (baseline) | 7.34 | 48.17 |
| Qwen-2.5-32B with CoT | 8.66 | 82.48 |
| Qwen-2.5-32B without CoT | 8.49 | 79.43 |
Chain-of-thought training improved Qwen-2.5-32B's EQ-Bench score by over 3 points (79.43 to 82.48), with a meaningful improvement on WritingBench's Literature and Arts domain as well. This supports the hypothesis that explicit reasoning steps help models plan and structure creative content more effectively.
WritingBench is fully open source, with all components available on GitHub (X-PLUG/WritingBench). The evaluation pipeline follows a three-step workflow.
Models generate responses to WritingBench queries using standardized generation parameters:
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.8 |
| Top-k | 20 |
| Max Output Length | 16,000 tokens |
These settings balance creativity with consistency, ensuring reproducible evaluations across different models.
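In practice these parameters map directly onto a sampling call. A sketch against an OpenAI-compatible endpoint (such as a locally served vLLM model); note that `top_k` is a server-side extension passed via `extra_body` rather than a core Chat Completions parameter, and the client and model names here are placeholders:

```python
# Generation-step sampling parameters from the WritingBench table.
GENERATION_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_tokens": 16_000,
}

def generate_response(client, model: str, query: str) -> str:
    """Generate one response to a benchmark query with the standard settings.

    `client` is assumed to be an OpenAI-compatible client instance.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        temperature=GENERATION_PARAMS["temperature"],
        top_p=GENERATION_PARAMS["top_p"],
        max_tokens=GENERATION_PARAMS["max_tokens"],
        # top_k is not in the core API schema; many inference servers
        # (e.g. vLLM) accept it as an extra body field.
        extra_body={"top_k": GENERATION_PARAMS["top_k"]},
    )
    return resp.choices[0].message.content
```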
The generated responses are scored using either the LLM-as-judge approach or the critic model. The scoring parameters differ from generation:
| Parameter | Value |
|---|---|
| Temperature | 1.0 |
| Top-p | 0.95 |
| Max Length | 2,048 tokens |
Each response receives five scores (one per criterion) along with textual justifications.
Scores are aggregated hierarchically: per-criterion, per-query (average of five criteria), per-subdomain, per-domain, and overall. Results can be exported as Excel files for detailed analysis.
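The roll-up is a chain of plain averages. A minimal sketch of the per-domain and overall levels (field names are illustrative; the released pipeline also reports per-subdomain and per-criterion views):

```python
from collections import defaultdict
from statistics import mean

def aggregate(results: list[dict]) -> dict:
    """Roll per-query scores (already averaged over five criteria)
    up to per-domain averages and one overall average."""
    by_domain = defaultdict(list)
    for r in results:
        by_domain[r["domain"]].append(r["query_score"])
    domain_avg = {d: round(mean(v), 2) for d, v in by_domain.items()}
    overall = round(mean(r["query_score"] for r in results), 2)
    return {"overall": overall, "per_domain": domain_avg}

# Toy results with made-up per-query scores:
res = [{"domain": "Education", "query_score": 8.4},
       {"domain": "Education", "query_score": 8.0},
       {"domain": "Literature and Arts", "query_score": 7.2}]
print(aggregate(res))
# → {'overall': 7.87, 'per_domain': {'Education': 8.2, 'Literature and Arts': 7.2}}
```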
The initial release (March 2025) contained the full 1,239 queries. A streamlined 1,000-query version was released in April 2025 alongside the public leaderboard. Both versions cover all six domains and 100 subdomains, with the reduced set removing redundant or lower-quality queries.
WritingBench maintains public leaderboards on both Hugging Face and ModelScope. Leaderboard scores are the benchmark's 1-10 scores multiplied by 10, giving a 10-100 scale for easier comparison. As of early 2026, the leaderboard uses Claude Sonnet 4.5 as the default evaluator, replacing the earlier Claude 3.5 Sonnet.
Notable leaderboard scores (as of early 2026) include:
| Model | Score |
|---|---|
| Qwen3-235B-A22B-Thinking | 88.3 |
| Qwen3-Next-80B-A3B-Instruct | 87.3 |
| Qwen3-VL-235B-A22B-Thinking | 86.7 |
| Qwen3-VL-32B-Thinking | 86.2 |
| Qwen3-VL-8B-Thinking | 85.5 |
These scores reflect substantial improvements over the models evaluated in the original paper, likely due to both model architecture advances and improved training data in the intervening months.
WritingBench occupies a distinct position among writing evaluation benchmarks for LLMs. Several other benchmarks target different aspects of writing ability.
EQ-Bench focuses specifically on creative fiction writing and emotional intelligence in generated text. It uses 241 queries in a single domain with short prompts, making it complementary but narrow compared to WritingBench's professional breadth.
LongBench-Write evaluates models' ability to follow length instructions across 120 queries in seven domains. Its emphasis is on length compliance rather than holistic writing quality.
HelloBench is the closest predecessor to WritingBench, with 647 queries across five domains and 38 subdomains. WritingBench extends this approach with nearly double the queries, roughly triple the subdomains, and the addition of dynamic criteria generation.
LitBench (introduced in mid-2025) targets literary and creative writing evaluation with a focus on arena-style pairwise comparisons rather than rubric-based scoring. It takes a different methodological approach but addresses similar concerns about evaluating creative output.
The distinguishing feature of WritingBench among all of these is its query-dependent evaluation criteria. Other benchmarks use fixed rubrics, simple pairwise preferences, or human evaluation (which does not scale). WritingBench's automated, per-query criteria generation provides both scalability and task-specificity.
While WritingBench represents a significant advance in writing evaluation, several limitations should be noted.
Creative writing remains difficult to evaluate. Even with dynamic criteria, the benchmark's evaluation framework struggles with highly subjective aspects of creative writing. Poetry, fiction, and experimental prose involve aesthetic qualities that resist quantification, and the criteria generation process may not capture dimensions like originality, voice, or emotional depth with the same precision as it captures structural and factual requirements.
Evaluator dependence. The benchmark's results depend heavily on the quality of the evaluator (whether LLM judge or critic model). As the human alignment experiments showed, different judges produce different agreement rates, and even the best configuration (Claude 3.5 Sonnet with dynamic criteria) disagreed with human annotators 13% of the time. Shifting from one evaluator version to another (as happened when the leaderboard moved from Claude 3.5 Sonnet to Claude Sonnet 4.5) can change relative model rankings.
Bilingual but not multilingual. WritingBench covers Chinese and English but does not extend to other languages. Professional writing conventions, rhetorical traditions, and quality expectations vary significantly across languages and cultures, limiting the benchmark's generalizability beyond its two supported languages.
Potential creator bias. The benchmark was developed primarily at Alibaba Group, whose Qwen model family performs well on the benchmark. While the open-source release and NeurIPS peer review provide transparency, users should be aware of this potential conflict of interest when interpreting results.
Critic model limitations. The critic model caps input at 2,048 tokens for scoring stability, which means very long model responses may be truncated during evaluation. This could disadvantage models that produce thorough, detailed outputs for complex queries.
Since its release, WritingBench has seen adoption in several contexts. The UK AI Safety Institute (since renamed the AI Security Institute, within DSIT) integrated WritingBench into its Inspect Evals framework for systematic AI evaluation. The benchmark has been used by multiple model developers to evaluate and improve their models' writing capabilities.
The data curation methodology has arguably had as much impact as the benchmark itself. By demonstrating that a 7B model can approach frontier-level writing quality through careful training data selection, the paper provided a practical recipe for improving writing performance without scaling model size.
WritingBench also contributed to the broader conversation about LLM evaluation methodology. The dynamic criteria generation approach has influenced subsequent work on adaptive evaluation frameworks, where scoring rubrics are tailored to the specific task at hand rather than applied uniformly.