WritingBench is a comprehensive benchmark for evaluating the generative writing capabilities of large language models (LLMs) across diverse real-world writing tasks. Developed by researchers at Alibaba Group (X-PLUG), Renmin University of China, and Shanghai Jiao Tong University, WritingBench addresses a longstanding gap in LLM evaluation: while most benchmarks focus on reasoning, coding, or factual knowledge, few systematically test writing quality across professional domains. The benchmark comprises 1,239 writing queries spanning six primary domains and 100 fine-grained subdomains, paired with a query-dependent evaluation framework that dynamically generates scoring criteria for each individual prompt rather than relying on fixed rubrics.
The paper, authored by Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, and Fei Huang, was first published on arXiv in March 2025 (arXiv:2503.05244). It was subsequently accepted as a poster at the NeurIPS 2025 Datasets and Benchmarks Track, presented in San Diego on December 3, 2025. The benchmark, evaluation tools, trained critic model, and all associated code are released as open source under the Apache 2.0 license.
Evaluating writing quality in LLMs presents distinct challenges compared to evaluating factual accuracy or code correctness. Writing is inherently subjective, context-dependent, and multidimensional. A legal brief requires different qualities than a marketing slogan, and a research abstract demands different skills than a screenplay. Prior benchmarks for writing evaluation suffered from several limitations that WritingBench was designed to address.
Most existing writing benchmarks covered a narrow range of tasks or applied uniform evaluation criteria regardless of the writing domain. The authors identified three key shortcomings in previous work: limited task coverage, short and minimally contextualized inputs, and static evaluation criteria that ignore per-task requirements. The coverage gap is visible in a side-by-side comparison:
| Benchmark | Queries | Domains | Subdomains | Avg. Input Tokens | Max Input Tokens |
|---|---|---|---|---|---|
| EQ-Bench | 241 | 1 | N/A | 130 | 213 |
| LongBench-Write | 120 | 7 | N/A | 87 | 684 |
| HelloBench | 647 | 5 | 38 | 1,210 | 7,766 |
| WritingBench | 1,239 | 6 | 100 | 1,546 | 19,361 |
EQ-Bench, for example, evaluated only creative fiction writing with short prompts averaging 130 tokens. LongBench-Write focused on length compliance with just 120 queries and minimal contextual input. HelloBench offered broader coverage with 647 queries across five domains and 38 subdomains, but still fell short of representing the full spectrum of professional writing scenarios.
WritingBench addressed these gaps by providing substantially more queries (1,239), deeper domain coverage (100 subdomains), and much longer average input contexts (1,546 tokens, with some reaching 19,361 tokens). Critically, it also introduced a fundamentally different approach to evaluation: generating scoring criteria dynamically for each individual query rather than applying static rubrics.
Previous LLM-as-judge approaches typically used one of two strategies: a single set of global criteria applied to all writing tasks, or domain-specific criteria (one set per writing domain). Both approaches had serious alignment problems with human judgment. A global rubric cannot capture the specific requirements of a legal contract versus a poem, while domain-level rubrics still miss the fine-grained differences between, say, a patent application and a technical report within the same Academic and Engineering domain.
WritingBench's central insight is that evaluation criteria should be generated on a per-query basis. Each writing prompt has unique requirements regarding content, style, format, and length, and the evaluation criteria should reflect those specific requirements.
WritingBench organizes its 1,239 queries into six primary writing domains, each subdivided into fine-grained subdomains. The distribution of queries across domains is intentionally uneven, reflecting the varying breadth and complexity of different writing fields.
| Domain | Queries | Avg. Input Tokens | Example Subdomains |
|---|---|---|---|
| Academic and Engineering | 187 | 1,915 | Paper Outline, Abstract, Literature Review, Technical Documentation, Patent, Introduction, Conclusion, Test Report, Defense Presentation, Research Proposal |
| Finance and Business | 238 | 1,762 | Market Analysis, Investment Analysis, Contract, Tender Document, Financial Reports, Business Correspondence, Meeting Minutes, Risk Management, Strategic Planning, Pitch Deck |
| Politics and Law | 226 | 2,274 | Legal Opinion, Case Study, White Paper, Policy Advocacy, Judgment Document, Legal Agreement, Government Speech, Regulatory Analysis |
| Literature and Arts | 242 | 1,133 | Novel Outline, Poetry, Screenplay, Book Review, Character Design, Plot Development, Lyric Writing, Fan Fiction |
| Education | 151 | 1,173 | Lesson Plan, Curriculum Design, Assignment Grading, Class Activity, Coursework, Teaching Materials, Evaluation Comments |
| Advertising and Marketing | 195 | 886 | Social Media Content, Product Description, Brand Story, Sales Letter, Promotional Copy, Slogans, Travel Guide |
Several design choices are worth noting. Politics and Law has the highest average input token count (2,274), reflecting the extensive reference materials and legal context these queries provide. Literature and Arts has the most queries (242) but relatively shorter inputs (1,133 tokens), since creative writing prompts tend to be more open-ended. Advertising and Marketing queries are the shortest on average (886 tokens), as marketing briefs tend to be concise.
The benchmark's queries were constructed through a two-phase process combining LLM-generated initial drafts with systematic human refinement.
Phase 1: Model-Augmented Generation. LLMs generated initial query drafts from domain-specific seed pools. These drafts were then systematically diversified along multiple dimensions: style adjustments, format specifications, length constraints, personalization options, content specificity requirements, and expression optimization. This process ensured broad coverage across the 100 subdomains.
Phase 2: Human-in-the-Loop Refinement. Thirty trained annotators (compensated at $18/hour) collected open-source reference materials and refined the LLM-generated queries. Five experts with LLM experience then performed query adaptation and material pruning, ensuring each query was realistic, well-specified, and representative of actual professional writing tasks. This human curation step was essential for avoiding the circular problem of LLMs evaluating LLM-generated test prompts that might inadvertently favor certain model architectures.
Each query in WritingBench is annotated along three requirement dimensions: style (R1), format (R2), and length (R3).
The distribution of length requirements across the benchmark skews toward shorter outputs, though a substantial portion demands extended generation:
| Length Requirement | Number of Queries |
|---|---|
| Under 1,000 tokens | 727 |
| 1,000 to 3,000 tokens | 341 |
| 3,000 to 5,000 tokens | 94 |
| Over 5,000 tokens | 77 |
These requirement dimensions serve a dual purpose: they make queries more realistic (professional writing tasks almost always come with constraints) and they enable more granular evaluation.
WritingBench includes queries in both Chinese and English, reflecting the bilingual research context of the authoring institutions. Model performance is reported separately for Chinese (ZH) and English (EN) subsets, revealing interesting patterns about how models handle writing tasks in different languages.
The evaluation framework is WritingBench's most significant methodological contribution. Rather than scoring all writing samples against the same criteria, the framework generates five unique evaluation criteria for each individual query, then uses those criteria to score responses.
For each query, an LLM generates five evaluation criteria tailored to the specific writing task. Each criterion includes a short name, a description of what it measures, and guidance for assigning a score on the 1-10 scale.
The criteria generation is guided by structured prompts that consider the query's domain, subdomain, style requirements, format requirements, length requirements, and any reference materials provided. This means a query asking for a legal contract will receive criteria such as "Clause Completeness" and "Legal Terminology Precision," while a query for a brand story might receive criteria like "Narrative Engagement" and "Brand Voice Consistency."
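Mechanically, this step amounts to one structured prompt per query. A minimal sketch of what such per-query criteria generation might look like; the prompt template and the `call_llm` helper are illustrative assumptions, not the released prompts:

```python
import json

# Hypothetical helper: wrap whatever LLM client serves as the criteria
# generator (the released pipeline supports LLM judges or the critic model).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

# Illustrative template -- NOT the paper's released prompt.
CRITERIA_PROMPT = """You are designing an evaluation rubric for a writing task.
Domain: {domain} / Subdomain: {subdomain}
Requirements: style={style}, format={fmt}, length={length}

Writing query:
{query}

Generate exactly 5 evaluation criteria tailored to this specific task.
Return a JSON list: [{{"name": "...", "description": "..."}}, ...]"""

def generate_criteria(query: str, domain: str, subdomain: str,
                      style: str, fmt: str, length: str) -> list[dict]:
    """Build the criteria-generation prompt and parse the judge's JSON reply."""
    prompt = CRITERIA_PROMPT.format(domain=domain, subdomain=subdomain,
                                    style=style, fmt=fmt, length=length,
                                    query=query)
    return json.loads(call_llm(prompt))
```

Because the template interpolates the query's domain, subdomain, and requirement annotations, a legal-contract query and a brand-story query naturally yield different rubrics from the same pipeline.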
Once criteria are generated, an evaluator scores each model response on a 1-10 scale for all five criteria. The evaluator also provides written justifications referencing specific passages in the response. The final score for a given query-response pair is the average across all five criteria scores.
WritingBench supports two evaluator backends: a general-purpose LLM judge (such as Claude 3.5 Sonnet or ChatGPT-4o) and the purpose-built critic model described later in this article.
The dynamic, query-dependent approach achieved substantially better alignment with human judgments than static alternatives. The authors compared three evaluation strategies across multiple judges:
| Evaluation Method | ChatGPT-4o Agreement | Claude 3.5 Sonnet Agreement |
|---|---|---|
| Static Global Criteria | 69% | 65% |
| Static Domain-Specific Criteria | 40% | 59% |
| Dynamic Query-Dependent Criteria | 79% | 87% |
The results are striking. Dynamic criteria improved Claude 3.5 Sonnet's human agreement from 65% (static global) to 87%, a 22 percentage point gain. For ChatGPT-4o, agreement improved from 69% to 79%. The static domain-specific approach actually performed worse than global criteria in some cases (40% for ChatGPT-4o), suggesting that intermediate-level rubrics can be counterproductive if they do not match the specific nuances of individual queries.
To reduce the cost and latency of using large proprietary LLMs as judges, WritingBench introduces a purpose-built critic model: a fine-tuned version of Qwen-2.5-7B-Instruct trained specifically for criteria-aware writing evaluation.
The critic model was trained on 50,000 supervised fine-tuning instances, each consisting of a writing query, a set of five evaluation criteria, a model response, and corresponding scores with justifications. These training samples were drawn from diverse queries and model outputs to ensure the critic model generalized across writing domains.
| Training Parameter | Value |
|---|---|
| Base Model | Qwen-2.5-7B-Instruct |
| Training Instances | 50,000 |
| Optimizer | AdamW |
| Learning Rate | 7e-6 |
| Epochs | 3 |
| Hardware | 8x A100 GPUs |
| Batch Size | 64 (with 8-step gradient accumulation) |
| Input Length Cap | 2,048 tokens |
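Read together, these numbers imply a particular decomposition of the effective batch. A minimal sketch of the configuration as a plain dict, assuming per-device batch size 1 across the 8 GPUs (the paper reports only the totals, so that split is an assumption):

```python
# Critic-model SFT hyperparameters transcribed from the paper's table.
# The per-device batch / accumulation split is an assumed decomposition:
# 1 per device x 8 GPUs x 8 accumulation steps = effective batch 64.
CRITIC_SFT_CONFIG = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "num_train_samples": 50_000,
    "optimizer": "adamw",
    "learning_rate": 7e-6,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,   # assumption, not reported
    "gradient_accumulation_steps": 8,
    "world_size": 8,                    # 8x A100 GPUs
    "max_seq_length": 2048,             # input length cap
}

def effective_batch_size(cfg: dict) -> int:
    """Effective batch = per-device batch x accumulation steps x GPU count."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * cfg["world_size"])

print(effective_batch_size(CRITIC_SFT_CONFIG))  # → 64
```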
The critic model achieved 83% agreement with human evaluators, placing it between ChatGPT-4o (79%) and Claude 3.5 Sonnet (87%) in evaluation quality. Given that it runs on a single GPU as a 7B-parameter model, this represents a significant practical advantage over calling proprietary API endpoints for every evaluation. The model produces both numerical scores and textual justifications, providing explainability for its assessments.
The original paper evaluated 16 models on WritingBench using the critic model as the evaluator. Scores are on a 1-10 scale, averaged across all five criteria per query.
| Model | Overall Avg. | Chinese | English | Academic and Eng. (D1) | Finance and Bus. (D2) | Politics and Law (D3) | Literature and Arts (D4) | Education (D5) | Advertising and Mktg. (D6) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 8.55 | 8.7 | 8.5 | 8.5 | 8.5 | 8.6 | 8.6 | 8.7 | 8.6 |
| Qwen-2.5-7B-filtered | 8.49 | 8.6 | 8.4 | 8.4 | 8.4 | 8.6 | 8.4 | 8.6 | 8.5 |
| Llama-3.1-8B-filtered | 8.49 | 8.6 | 8.4 | 8.5 | 8.4 | 8.6 | 8.4 | 8.6 | 8.5 |
| Qwen-Max | 8.37 | 8.4 | 8.3 | 8.3 | 8.3 | 8.4 | 8.4 | 8.5 | 8.4 |
| ChatGPT-4o-latest | 8.16 | 8.3 | 8.1 | 8.1 | 8.1 | 8.2 | 8.1 | 8.4 | 8.1 |
| o1-Preview | 8.15 | 8.1 | 8.2 | 8.0 | 8.1 | 8.2 | 8.2 | 8.4 | 8.1 |
| DeepSeek-V3 | 7.95 | 8.0 | 7.9 | 7.9 | 7.8 | 8.0 | 7.8 | 8.2 | 8.0 |
| LongWriter | 7.91 | 7.9 | 7.9 | 8.0 | 8.1 | 8.1 | 7.7 | 8.1 | 7.6 |
| Qwen-2.5-72B-Instruct | 7.90 | 8.0 | 7.9 | 8.0 | 7.8 | 8.1 | 7.7 | 8.2 | 7.8 |
| Gemini-1.5-Pro | 7.78 | 7.8 | 7.7 | 7.7 | 7.5 | 7.8 | 7.9 | 8.0 | 7.9 |
| Claude-3.5-Sonnet | 7.71 | 7.7 | 7.7 | 7.6 | 7.5 | 7.6 | 7.7 | 7.9 | 8.0 |
| Mistral-Large-Instruct | 7.64 | 7.6 | 7.7 | 7.7 | 7.6 | 7.8 | 7.3 | 7.9 | 7.6 |
| Qwen-2.5-7B-Instruct | 7.43 | 7.3 | 7.5 | 7.7 | 7.4 | 7.6 | 6.9 | 7.8 | 7.3 |
| Llama-3.3-70B-Instruct | 7.01 | 6.7 | 7.3 | 7.0 | 6.9 | 7.0 | 6.8 | 7.3 | 7.3 |
| Llama-3.1-8B-Instruct | 6.35 | 5.7 | 6.9 | 6.6 | 6.4 | 6.1 | 6.0 | 6.7 | 6.6 |
| Suri | 4.97 | 4.4 | 5.5 | 5.6 | 5.3 | 5.0 | 4.1 | 5.0 | 5.1 |
DeepSeek-R1 led the field. With an overall average of 8.55, DeepSeek-R1 achieved the highest scores among all models tested. Its performance was remarkably consistent across all six domains, never dropping below 8.5 in any category.
Chain-of-thought reasoning helped. Models with chain-of-thought (CoT) capabilities, specifically DeepSeek-R1 and o1-Preview, outperformed their non-CoT counterparts. This finding suggests that planning and reasoning before generating text improves writing quality, particularly for structurally complex tasks.
Education was the easiest domain. Across nearly all models, Education (D5) yielded the highest scores. This likely reflects the relatively standardized nature of educational writing tasks (lesson plans, grading rubrics, teaching materials) compared to more open-ended domains.
Literature and Arts was the hardest domain. D4 consistently produced the lowest scores with the highest variance. Creative writing requires originality, voice, and aesthetic judgment that current models struggle to demonstrate reliably.
Smaller models lagged significantly. The gap between 7B/8B base models and frontier models was substantial. Llama-3.1-8B-Instruct scored only 6.35 overall, 2.2 points behind DeepSeek-R1. However, as the data curation experiments showed, this gap could be largely closed through careful training data selection.
Specialized writing models underperformed. Suri, a model specifically fine-tuned for writing, scored the lowest at 4.97. This counterintuitive result suggests that narrow writing specialization without broad language understanding produces worse outcomes than general-purpose instruction tuning.
Language performance varied. Several models, particularly Llama-3.3-70B-Instruct (6.7 ZH vs. 7.3 EN) and Llama-3.1-8B-Instruct (5.7 ZH vs. 6.9 EN), performed noticeably worse on Chinese queries. This gap reflects the English-centric training data of Llama models.
Performance varied across the three requirement dimensions (style, format, length). Notably, the top models achieved near-perfect scores on length compliance:
| Model | Style (R1) | Format (R2) | Length (R3) |
|---|---|---|---|
| DeepSeek-R1 | 8.7 | 8.9 | 9.0 |
| Qwen-Max | 8.5 | 8.7 | 9.0 |
| Qwen-2.5-7B-filtered | 8.6 | 8.8 | 9.0 |
| Llama-3.1-8B-filtered | 8.6 | 8.8 | 8.9 |
Length requirements (R3) were generally the best-satisfied dimension, while style requirements (R1) proved the most challenging. This makes intuitive sense: following a word count instruction is more mechanical than capturing a specific tone or voice.
One of WritingBench's most practically significant contributions is demonstrating how the evaluation framework can be used for training data curation, not just model assessment.
The researchers started with 24,000 supervised fine-tuning (SFT) samples for writing tasks. They applied WritingBench's criteria generation pipeline to score every sample, then used the critic model to filter out the bottom 50%, retaining only 12,000 high-quality samples.
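Once every sample carries a critic score, the filtering step itself is a simple rank-and-truncate. A minimal sketch of the keep-top-half selection, with `critic_score` standing in for the critic model's per-sample average (the field names are illustrative, not the authors' code):

```python
def filter_sft_data(samples: list[dict], keep_fraction: float = 0.5) -> list[dict]:
    """Rank SFT samples by critic score and keep the top fraction.

    Each sample dict is assumed to already carry a 'critic_score' (the mean
    of the five criterion scores produced by the evaluation pipeline).
    """
    ranked = sorted(samples, key=lambda s: s["critic_score"], reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

# Toy example with made-up scores for six samples:
data = [{"id": i, "critic_score": s}
        for i, s in enumerate([9.2, 4.1, 7.8, 6.0, 8.5, 3.3])]
kept = filter_sft_data(data)
print([d["id"] for d in kept])  # → [0, 4, 2]
```

Applied to the paper's 24,000-sample pool with `keep_fraction=0.5`, this yields the 12,000-sample filtered set used in the experiments below.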
The results were remarkable. Both Qwen-2.5-7B and Llama-3.1-8B, when fine-tuned on the filtered 12,000 samples, achieved 8.49 on WritingBench, approaching DeepSeek-R1's 8.55 score despite being dramatically smaller models. The improvement was validated on an independent benchmark as well:
| Model | WritingBench | LongBench-Write |
|---|---|---|
| DeepSeek-R1 | 8.55 | 4.79 |
| Qwen-2.5-7B (baseline) | 7.43 | 4.39 |
| Qwen-2.5-7B (all 24K data) | 8.46 | 4.69 |
| Qwen-2.5-7B (filtered 12K) | 8.49 | 4.70 |
| Llama-3.1-8B (baseline) | 6.35 | 3.12 |
| Llama-3.1-8B (all 24K data) | 8.45 | 4.65 |
| Llama-3.1-8B (filtered 12K) | 8.49 | 4.65 |
Two findings stand out. First, the filtered 12K dataset consistently outperformed the full 24K dataset, confirming that data quality matters more than quantity for writing tasks. Second, a 7B model trained on carefully selected data can match or exceed the writing performance of GPT-4o (8.16), demonstrating that the barrier to high-quality writing generation is not necessarily model scale but training data quality.
The authors also explored the impact of chain-of-thought reasoning on writing quality through ablation experiments:
| Model Variant | WritingBench (D4) | EQ-Bench |
|---|---|---|
| DeepSeek-R1 | 8.55 | 84.99 |
| Qwen-2.5-32B (baseline) | 7.34 | 48.17 |
| Qwen-2.5-32B with CoT | 8.66 | 82.48 |
| Qwen-2.5-32B without CoT | 8.49 | 79.43 |
Chain-of-thought training improved Qwen-2.5-32B's EQ-Bench score by over 3 points (79.43 to 82.48), with a meaningful improvement on WritingBench's Literature and Arts domain as well. This supports the hypothesis that explicit reasoning steps help models plan and structure creative content more effectively.
WritingBench is fully open source, with all components available on GitHub (X-PLUG/WritingBench). The evaluation pipeline follows a three-step workflow.
Models generate responses to WritingBench queries using standardized generation parameters:
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.8 |
| Top-k | 20 |
| Max Output Length | 16,000 tokens |
These settings balance creativity with consistency, ensuring reproducible evaluations across different models.
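In practice these parameters map directly onto a sampling call. A sketch against an OpenAI-compatible endpoint (such as a locally served vLLM model); note that `top_k` is a server-side extension passed via `extra_body` rather than a core Chat Completions parameter, and the client and model names here are placeholders:

```python
# Generation-step sampling parameters from the WritingBench table.
GENERATION_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_tokens": 16_000,
}

def generate_response(client, model: str, query: str) -> str:
    """Generate one response to a benchmark query with the standard settings.

    `client` is assumed to be an OpenAI-compatible client instance.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        temperature=GENERATION_PARAMS["temperature"],
        top_p=GENERATION_PARAMS["top_p"],
        max_tokens=GENERATION_PARAMS["max_tokens"],
        # top_k is not in the core API schema; many inference servers
        # (e.g. vLLM) accept it as an extra body field.
        extra_body={"top_k": GENERATION_PARAMS["top_k"]},
    )
    return resp.choices[0].message.content
```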
The generated responses are scored using either the LLM-as-judge approach or the critic model. The scoring parameters differ from generation:
| Parameter | Value |
|---|---|
| Temperature | 1.0 |
| Top-p | 0.95 |
| Max Length | 2,048 tokens |
Each response receives five scores (one per criterion) along with textual justifications.
Scores are aggregated hierarchically: per-criterion, per-query (average of five criteria), per-subdomain, per-domain, and overall. Results can be exported as Excel files for detailed analysis.
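The roll-up is a chain of plain averages. A minimal sketch of the per-domain and overall levels (field names are illustrative; the released pipeline also reports per-subdomain and per-criterion views):

```python
from collections import defaultdict
from statistics import mean

def aggregate(results: list[dict]) -> dict:
    """Roll per-query scores (already averaged over five criteria)
    up to per-domain averages and one overall average."""
    by_domain = defaultdict(list)
    for r in results:
        by_domain[r["domain"]].append(r["query_score"])
    domain_avg = {d: round(mean(v), 2) for d, v in by_domain.items()}
    overall = round(mean(r["query_score"] for r in results), 2)
    return {"overall": overall, "per_domain": domain_avg}

# Toy results with made-up per-query scores:
res = [{"domain": "Education", "query_score": 8.4},
       {"domain": "Education", "query_score": 8.0},
       {"domain": "Literature and Arts", "query_score": 7.2}]
print(aggregate(res))
# → {'overall': 7.87, 'per_domain': {'Education': 8.2, 'Literature and Arts': 7.2}}
```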
The initial release (March 2025) contained the full 1,239 queries. A streamlined 1,000-query version was released in April 2025 alongside the public leaderboard. Both versions cover all six domains and 100 subdomains, with the reduced set removing redundant or lower-quality queries.
WritingBench maintains public leaderboards on both Hugging Face and ModelScope. Leaderboard scores are the benchmark's 1-10 scores multiplied by 10, giving a 10-100 scale for easier comparison. As of early 2026, the leaderboard uses Claude Sonnet 4.5 as the default evaluator, replacing the earlier Claude 3.5 Sonnet.
Notable leaderboard scores (as of early 2026) include:
| Model | Score |
|---|---|
| Qwen3-235B-A22B-Thinking | 88.3 |
| Qwen3-Next-80B-A3B-Instruct | 87.3 |
| Qwen3-VL-235B-A22B-Thinking | 86.7 |
| Qwen3-VL-32B-Thinking | 86.2 |
| Qwen3-VL-8B-Thinking | 85.5 |
These scores reflect substantial improvements over the models evaluated in the original paper, likely due to both model architecture advances and improved training data in the intervening months.
WritingBench occupies a distinct position among writing evaluation benchmarks for LLMs. Several other benchmarks target different aspects of writing ability.
EQ-Bench focuses specifically on creative fiction writing and emotional intelligence in generated text. It uses 241 queries in a single domain with short prompts, making it complementary but narrow compared to WritingBench's professional breadth.
LongBench-Write evaluates models' ability to follow length instructions across 120 queries in seven domains. Its emphasis is on length compliance rather than holistic writing quality.
HelloBench is the closest predecessor to WritingBench, with 647 queries across five domains and 38 subdomains. WritingBench extends this approach with nearly double the queries, roughly triple the subdomains, and the addition of dynamic criteria generation.
LitBench (introduced in mid-2025) targets literary and creative writing evaluation with a focus on arena-style pairwise comparisons rather than rubric-based scoring. It takes a different methodological approach but addresses similar concerns about evaluating creative output.
The distinguishing feature of WritingBench among all of these is its query-dependent evaluation criteria. Other benchmarks use fixed rubrics, simple pairwise preferences, or human evaluation (which does not scale). WritingBench's automated, per-query criteria generation provides both scalability and task-specificity.
While WritingBench represents a significant advance in writing evaluation, several limitations should be noted.
Creative writing remains difficult to evaluate. Even with dynamic criteria, the benchmark's evaluation framework struggles with highly subjective aspects of creative writing. Poetry, fiction, and experimental prose involve aesthetic qualities that resist quantification, and the criteria generation process may not capture dimensions like originality, voice, or emotional depth with the same precision as it captures structural and factual requirements.
Evaluator dependence. The benchmark's results depend heavily on the quality of the evaluator (whether LLM judge or critic model). As the human alignment experiments showed, different judges produce different agreement rates, and even the best configuration (Claude 3.5 Sonnet with dynamic criteria) disagreed with human annotators 13% of the time. Shifting from one evaluator version to another (as happened when the leaderboard moved from Claude 3.5 Sonnet to Claude Sonnet 4.5) can change relative model rankings.
Bilingual but not multilingual. WritingBench covers Chinese and English but does not extend to other languages. Professional writing conventions, rhetorical traditions, and quality expectations vary significantly across languages and cultures, limiting the benchmark's generalizability beyond its two supported languages.
Potential creator bias. The benchmark was developed primarily at Alibaba Group, whose Qwen model family performs well on the benchmark. While the open-source release and NeurIPS peer review provide transparency, users should be aware of this potential conflict of interest when interpreting results.
Critic model limitations. The critic model caps input at 2,048 tokens for scoring stability, which means very long model responses may be truncated during evaluation. This could disadvantage models that produce thorough, detailed outputs for complex queries.
Since its release, WritingBench has seen adoption in several contexts. The UK AI Safety Institute (since renamed the AI Security Institute, within DSIT) integrated WritingBench into its Inspect Evals framework for systematic AI evaluation. The benchmark has been used by multiple model developers to evaluate and improve their models' writing capabilities.
The data curation methodology has arguably had as much impact as the benchmark itself. By demonstrating that a 7B model can approach frontier-level writing quality through careful training data selection, the paper provided a practical recipe for improving writing performance without scaling model size.
WritingBench also contributed to the broader conversation about LLM evaluation methodology. The dynamic criteria generation approach has influenced subsequent work on adaptive evaluation frameworks, where scoring rubrics are tailored to the specific task at hand rather than applied uniformly.