WebDev Arena

Updated May 10, 2026

WebDev Arena
Overview
Full name	Web Development Arena
Abbreviation	WebDev Arena
Description	A live, community-driven leaderboard that ranks large language models on their ability to build interactive web applications, judged head-to-head by human voters
Release date	December 2024
Latest leaderboard refresh	May 2026
Authors	Aryan Vichare, Anastasios N. Angelopoulos, Wei-Lin Chiang, Kelly Tang, Luca Manolache
Organization	LMArena (formerly LMSYS)
Origin lab	UC Berkeley Sky Computing Lab
Technical Details
Type	Web development, frontend coding, interactive applications
Modality	Code, text, image (for vision-capable models)
Task format	Side-by-side coding battles, paired comparison voting
Stack enforced	React, TypeScript, Tailwind CSS
Sandbox	E2B on AWS Firecracker microVMs
Evaluation metric	Bradley-Terry pairwise model, Arena Score
Vote categories	Model A wins, Model B wins, tie, both bad
Performance
Total votes (May 2026)	288,203 across 77 ranked models
Top model (May 2026)	Claude Opus 4.7 Thinking
Top score (May 2026)	Arena Score 1570
Score range (May 2026)	1091 to 1570
Saturated	No
Resources
Live leaderboard	arena.ai/leaderboard/code
Original site	web.lmarena.ai
Blog post	WebDev Arena: A Live LLM Leaderboard for Web App Development
Public dataset	lmarena-ai/webdev-arena-preference-10k on Hugging Face
Predecessor	Chatbot Arena

WebDev Arena is a live, community-driven leaderboard that ranks large language models on their ability to generate working web applications. Users submit a prompt, two anonymous models build the app side by side inside a sandbox, and the human voter picks the better output. Those votes feed a Bradley-Terry statistical model that produces an Arena Score for each contender^[1]^[2].

The project was launched in December 2024 by the team behind Chatbot Arena, now operating as LMArena (formerly LMSYS). The named authors are Aryan Vichare, Anastasios N. Angelopoulos, Wei-Lin Chiang, Kelly Tang, and Luca Manolache, all from UC Berkeley's Sky Computing Lab^[1]^[3]. The first public blog post and dataset arrived in early 2025 alongside roughly 80,000 community votes; by May 2026 the leaderboard had grown to 288,203 votes across 77 models^[1]^[4]^[5].

Why a separate arena for web development

Most coding benchmarks before WebDev Arena focused on isolated function-level problems. HumanEval, MBPP, and even harder suites like LiveCodeBench score a model on whether a short function passes hidden unit tests. That signal correlates poorly with what people use models for: building UIs, wiring up state, and gluing together packages without breaking the build^[1]. There is no automatic grader for whether a chess board feels nice to play on or whether a Spotify clone is fun to click through, so the arena turns that subjective dimension into the metric^[1]^[2].

How a battle works

A user types a prompt and two anonymous models stream their code in parallel. Each output renders inside its own iframe, so the voter can click through both apps before deciding^[2]^[6].

Step	What happens	Notes
1. Prompt	User types a request such as "Build a chess game"	Freeform, no template
2. Pairing	Two models sampled from the pool	Identities hidden during the vote
3. Generation	Both produce a single-file React component	TypeScript and Tailwind enforced by system prompt
4. Rendering	App boots in an E2B sandbox iframe	Startup roughly 150 ms per app
5. Trial	Voter clicks around in both iframes	Errors visible, broken apps still render
6. Vote	Voter picks Model A, Model B, tie, or both bad	Identities revealed after the vote
7. Update	Result feeds the Bradley-Terry fit	Scores recompute continuously

The four-way vote choice matters. Allowing "tie" and "both bad" gives the system a way to absorb cases where neither output is convincing. According to the WebDev Arena blog post, Model A wins about 36.6% of battles, Model B wins about 37.0%, both bad accounts for roughly 18%, and ties make up about 7.8%^[1]. Almost one in five battles ending with neither output clearing the bar is a real signal that web app generation is still hard, even for the strongest models.
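
The arithmetic behind those shares can be sketched in a few lines. This is only an illustration: the dictionary keys are hypothetical labels, and the percentages are the rounded figures quoted above, so they do not sum to exactly 100.

```python
# Approximate vote shares quoted from the blog post, in percent.
# The key names are illustrative labels, not an official schema.
shares = {"model_a": 36.6, "model_b": 37.0, "both_bad": 18.0, "tie": 7.8}


def decisive_share(shares: dict) -> float:
    """Percent of battles that ended with a clear winner."""
    return shares["model_a"] + shares["model_b"]


def slot_a_split(shares: dict) -> float:
    """Fraction of decisive battles won by the model shown in slot A."""
    return shares["model_a"] / decisive_share(shares)


print(f"decisive battles: {decisive_share(shares):.1f}%")
print(f"slot A share of decisive battles: {slot_a_split(shares):.3f}")
```

Roughly three quarters of battles produce a winner, and the two display slots split those wins almost evenly, which suggests that screen position alone does not strongly sway voters.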

The forced React, TypeScript, and Tailwind stack

A recurring observation from external reviewers is that WebDev Arena is not really a generic web development benchmark. It is a React, TypeScript, and Tailwind CSS benchmark because the system prompt tells every model to produce a single React component in TypeScript styled with Tailwind classes^[6]^[7]. The system prompt explicitly forbids arbitrary Tailwind values such as h-[600px], requires a default export, and pushes the model toward components that are interactive and self-contained^[6]. Simon Willison ran a small experiment shortly after launch in which his prompt asked for plain HTML, CSS, and JavaScript with no React, and both models returned React components anyway^[6].

The technical contract for every generation looks roughly like this:

Requirement	Specification
Framework	React with hooks
Language	TypeScript
Styling	Tailwind CSS, no arbitrary numeric values
Module shape	Single file, default export
Imports	Explicit React imports
State	Local state only, no required props
Dependencies	Declared up front, no edits to the dependency manifest

The tight contract is partly a fairness move (if one model is allowed Tailwind shortcuts and another is not, comparisons stop being meaningful) and partly an infrastructure move, since the sandbox and renderer can boot a known stack quickly.
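
Two of the contract rules above, the default export and the ban on arbitrary Tailwind values, are easy to check mechanically. The sketch below is a heuristic lint written for illustration; `check_contract` and both regular expressions are hypothetical and are not part of WebDev Arena's own tooling.

```python
import re

# Heuristic match for Tailwind arbitrary-value classes such as h-[600px].
ARBITRARY_VALUE = re.compile(r"(?<![\w-])[a-z][a-z0-9:-]*-\[[^\]\s]+\]")
# Heuristic match for a line that starts an ES module default export.
DEFAULT_EXPORT = re.compile(r"^\s*export\s+default\b", re.MULTILINE)


def check_contract(source: str) -> list:
    """Return human-readable contract violations found in a component."""
    problems = []
    if not DEFAULT_EXPORT.search(source):
        problems.append("missing default export")
    for match in ARBITRARY_VALUE.finditer(source):
        problems.append(f"arbitrary Tailwind value: {match.group(0)}")
    return problems


# A made-up component that breaks the arbitrary-value rule.
sample = '''
import React, { useState } from "react";

export default function Counter() {
  const [count, setCount] = useState(0);
  return (
    <button className="h-[600px] p-4" onClick={() => setCount(count + 1)}>
      {count}
    </button>
  );
}
'''

print(check_contract(sample))
```

A regex pass like this can miss violations or flag harmless strings; a real checker would parse the TSX instead of pattern-matching it.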

Sandboxing and rendering

Generations execute in E2B sandboxes built on top of AWS Firecracker microVMs. Each app gets an isolated Linux environment with roughly 150 ms cold-start latency^[1]^[2]. For models without native structured output, WebDev Arena runs a two-stage pipeline: the model writes prose plus code, and a smaller model (Epoch AI's writeup mentions GPT-4o-mini in this role) reformats the response into the JSON schema the renderer expects^[2]. The schema covers commentary, dependencies, the React code itself, and configuration needed for the iframe to boot. Apps that throw errors are not hidden; they render with their broken state visible, which is part of why the "both bad" vote category exists.
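
To make the structured-output step concrete, here is a minimal sketch of validating a reformatted response before it reaches a renderer. The field names (`commentary`, `code`, `dependencies`, `config`) are assumptions based on the four areas the schema is said to cover; the real schema's keys and types are not given in the sources cited here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GeneratedApp:
    """Hypothetical shape of one structured generation."""
    commentary: str
    code: str
    dependencies: list = field(default_factory=list)
    config: dict = field(default_factory=dict)


def parse_generation(raw: str) -> GeneratedApp:
    """Parse a JSON response and raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for key in ("commentary", "code"):
        if not isinstance(data.get(key), str) or not data[key].strip():
            raise ValueError(f"missing or empty field: {key}")
    deps = data.get("dependencies", [])
    if not isinstance(deps, list) or not all(isinstance(d, str) for d in deps):
        raise ValueError("dependencies must be a list of strings")
    config = data.get("config", {})
    if not isinstance(config, dict):
        raise ValueError("config must be an object")
    return GeneratedApp(data["commentary"], data["code"], deps, config)


raw = json.dumps({
    "commentary": "A simple counter.",
    "code": "export default function App() { return null; }",
    "dependencies": ["react"],
})
app = parse_generation(raw)
print(app.dependencies)
```

Rejecting malformed responses at this boundary keeps schema failures separate from runtime errors inside the rendered app, which the arena deliberately leaves visible to voters.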

Scoring with the Bradley-Terry model

WebDev Arena uses the Bradley-Terry model, the same paired comparison framework that underpins Chatbot Arena and several Elo-style ranking systems^[1]^[2]. Each model gets a strength parameter, and the probability that model i beats model j is exp(theta_i) / (exp(theta_i) + exp(theta_j)). Fitting the parameters across all collected votes yields the Arena Score shown on the public leaderboard.
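
The win-probability formula can be evaluated directly. The sketch below uses the algebraically equivalent logistic form, which makes it plain that only the difference between the two strength parameters matters; the function name is illustrative.

```python
import math


def win_probability(theta_i: float, theta_j: float) -> float:
    """P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j))."""
    # Dividing numerator and denominator by exp(theta_i) gives a logistic
    # function of the strength difference theta_i - theta_j.
    return 1.0 / (1.0 + math.exp(theta_j - theta_i))


print(win_probability(0.3, 0.3))  # equal strengths give an even match
print(round(win_probability(1.0, 0.0), 4))
print(win_probability(1.0, 0.0) + win_probability(0.0, 1.0))
```

Shifting every theta by the same constant leaves all probabilities unchanged, so the fitted strengths are only meaningful relative to each other.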

Component	Description
Win probability	`P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j))`
Strength parameter	One scalar per model, fit via maximum likelihood
Update cadence	Continuous, as new votes arrive
Confidence intervals	95% intervals shown next to each score
Tie handling	Ties contribute partial credit; "both bad" subtracts from both

The 95% confidence intervals matter because lower-volume models can swing across several rank positions until enough battles accumulate.
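
A maximum-likelihood fit and a resampled interval can be sketched with the standard library alone. This sketch makes several assumptions: it uses the classic minorization-maximization (MM) update, it counts only decisive votes, and it derives the interval from a percentile bootstrap. The sources cited here do not say which optimizer or interval method the live leaderboard uses, and the battle data below is made up.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Fit centered log-strengths from (winner, loser) pairs via MM updates."""
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
    models = sorted({m for pair in battles for m in pair})
    if not models:
        return {}
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(
                pair_counts.get(frozenset((i, j)), 0.0) / (strength[i] + strength[j])
                for j in models if j != i
            )
            # Clamp so a model with zero wins keeps a finite log-strength.
            updated[i] = max(wins.get(i, 0.0) / denom, 1e-9) if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    thetas = {m: math.log(s) for m, s in strength.items()}
    mean = sum(thetas.values()) / len(thetas)
    return {m: t - mean for m, t in thetas.items()}


def bootstrap_interval(battles, model, rounds=200, seed=0):
    """Percentile 95% interval for one model's centered log-strength."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        thetas = fit_bradley_terry(resample, iterations=50)
        if model in thetas:
            samples.append(thetas[model])
    samples.sort()
    return samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples)) - 1]


# Made-up decisive battles between three placeholder models.
battles = (
    [("a", "b")] * 30 + [("b", "a")] * 10
    + [("b", "c")] * 25 + [("c", "b")] * 15
    + [("a", "c")] * 28 + [("c", "a")] * 12
)
thetas = fit_bradley_terry(battles)
low, high = bootstrap_interval(battles, "a")
print(sorted(thetas, key=thetas.get, reverse=True))
print(low < high)
```

With only 40 battles per pair the interval is wide, which is the same effect that lets low-volume models drift across several ranks on the real board.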

What people actually ask for

The authors classified prompts into 11 buckets and reported the share of total traffic for the largest ones^[1]:

Category	Share of prompts
Website design (portfolios, landing pages, profile sites)	15.3%
Game development (chess, snake, clickers, puzzles)	12.1%
Clone development (recreating existing apps)	11.6%
Interactive tools, data visualization, and other categories	61.0% (remainder)

A handful of specific prompts dominate. The most-requested tasks at the time of the original analysis were a VS Code or Cursor clone (4,189 requests), a WhatsApp clone (3,385 requests), and a chess game (3,154 requests)^[1]. People love asking models to rebuild the apps they already use, which is also where you can spot which model has internalized common UI patterns versus which one is faking it with a colorful div.

Top models on the live leaderboard

The leaderboard at arena.ai/leaderboard/code is live, so any snapshot ages quickly. As of May 2026, after 288,203 votes across 77 ranked models, the top of the board looked like this^[5]:

Rank	Model	Organization	Arena Score	Votes
1	claude-opus-4-7-thinking	Anthropic	1570	3,309
2	claude-opus-4-7	Anthropic	1560	3,333
3	claude-opus-4-6-thinking	Anthropic	1549	6,382
4	claude-opus-4-6	Anthropic	1544	7,325
5	glm-5.1	Z.ai	1531	3,609
6	claude-sonnet-4-6	Anthropic	1524	9,434
7	kimi-k2.6	Moonshot AI	1523	2,651
8	muse-spark	Meta	1509	1,629
9	gpt-5.5-high	OpenAI	1491	2,765
10	claude-opus-4-5-thinking-32k	Anthropic	1490	13,063

Six of the top ten slots belong to Anthropic Claude variants in this snapshot. Claude has been a consistent leader on frontend coding since the December 2024 launch, when Claude 3.5 Sonnet (October 2024 revision) sat at the top of the original leaderboard^[6]. The rest of the board mixes proprietary models from OpenAI and Google DeepMind with open-weights entrants from Z.ai, Moonshot AI, Alibaba (Qwen), and DeepSeek. An open model cracking the top five is a real shift from the early days of the arena, when Qwen2.5-Coder-32B was the only openly licensed model in the top six^[6].
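
Score gaps in the table above are easier to read as head-to-head win rates. The sketch below assumes the Arena Score sits on an Elo-like scale where a 400-point gap means 10-to-1 odds, as Chatbot Arena's published scores do; the sources cited here do not confirm that WebDev Arena uses the same scale, so treat the numbers as illustrative.

```python
def expected_win_rate(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Expected share of decisive battles won by A under an Elo-like scale.

    The 400-point scale is an assumption, not a documented WebDev Arena
    constant.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


# Rank 1 (1570) against rank 10 (1490) from the May 2026 snapshot.
print(round(expected_win_rate(1570, 1490), 3))
# Rank 1 (1570) against rank 2 (1560).
print(round(expected_win_rate(1570, 1560), 3))
```

Under that assumption, the 80-point spread across the whole top ten works out to about a 61% expected win rate for the leader over the tenth-ranked model, and the 10-point gap at the very top is close to a coin flip.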

Earlier leaderboard snapshot

For historical context, the December 2024 launch leaderboard looked very different. Claude 3.5 Sonnet (October 2024) led, followed by Gemini-Exp-1206, Gemini 2.0 Flash, GPT-4o (November 2024), Qwen2.5-Coder-32B, and Gemini 1.5 Pro 002^[6]. By February 2025, Claude 3.7 Sonnet had taken first place with an Arena Score around 1311 to 1358 and a 76% average win rate against the field^[1].

Public dataset

LMArena released a public sample of arena battles called lmarena-ai/webdev-arena-preference-10k, hosted on Hugging Face^[4]. The release packages 10,501 deduplicated battles (around 149 MB) in JSON and Parquet formats. Each row contains both conversations, the two model identities, the winner label (model_a, model_b, tie, or unknown), a Unix timestamp, an anonymization flag, and a UUID question id. The sample covers 13 frontier models including Claude 3.5 Sonnet (October 2024), GPT-4o (November 2024), Gemini 2.0 Flash Thinking (exp-1219), Gemini 1.5 Pro 002, Gemini 2.0 Flash (exp), Qwen2.5-Coder-32B Instruct, DeepSeek V3, DeepSeek V2, o1 (December 2024), and o1-mini (September 2024)^[4]. The license permits research use but prohibits redistribution.
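
A first pass over rows shaped like this might simply tally outcomes. The sketch below works on plain dictionaries and assumes the columns are named `model_a`, `model_b`, and `winner`; those column names and the placeholder rows are assumptions for illustration, so check the dataset card for the actual schema before relying on them.

```python
from collections import Counter


def tally_winners(rows):
    """Count winner labels and decisive wins per model.

    Assumes each row has "model_a", "model_b", and "winner" keys, where
    "winner" is one of model_a, model_b, tie, or unknown.
    """
    labels = Counter()
    model_wins = Counter()
    for row in rows:
        winner = row["winner"]
        labels[winner] += 1
        if winner in ("model_a", "model_b"):
            # The label doubles as the key that holds the winning model's name.
            model_wins[row[winner]] += 1
    return labels, model_wins


# Made-up rows with placeholder model names, not real dataset records.
rows = [
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
    {"model_a": "model-y", "model_b": "model-x", "winner": "model_b"},
    {"model_a": "model-x", "model_b": "model-z", "winner": "tie"},
    {"model_a": "model-z", "model_b": "model-y", "winner": "unknown"},
]
labels, model_wins = tally_winners(rows)
print(labels)
print(model_wins)
```

The decisive rows can then be turned into (winner, loser) pairs for a Bradley-Terry fit, while tie and unknown rows need an explicit policy of their own.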

Comparison with other coding benchmarks

WebDev Arena does not replace function-level benchmarks; it covers a different axis of capability. The contrast is sharpest when you put it next to the older suites:

Benchmark	What it measures	How it grades	Realism for product work
WebDev Arena	Full React, TypeScript, Tailwind apps from a freeform prompt	Human pairwise votes, Bradley-Terry	High for frontend
HumanEval	Single Python functions	Hidden unit tests	Low
MBPP	Short Python tasks	Hidden unit tests	Low
LiveCodeBench	Competitive-style coding problems with fresh contests	Unit tests, contamination guards	Medium
SWE-bench	Real GitHub issues across full repositories	Patch tests on real test suites	High for backend and library work
BigCodeBench	Library-rich function tasks	Unit tests	Medium

WebDev Arena and SWE-bench are the two evaluations that come closest to graded "professional" coding work, and they cover almost disjoint surface area. SWE-bench is about reading a real codebase, finding the bug, and writing a patch. WebDev Arena is about generating a brand new frontend from scratch, judged by whether it feels good to use.

Limitations and known issues

The single-file React component constraint is unrealistic for production work, where apps span dozens of files.
Backend logic, databases, auth, and deployment are out of scope; this is frontend-only.
The forced React, TypeScript, and Tailwind stack means models that excel at Vue, Svelte, plain HTML, or Astro are not measured here^[6]^[7].
Voter preferences skew toward visible polish. A subtler bug or accessibility regression may not show up in a one-minute trial.
Like every preference leaderboard, WebDev Arena is exposed to gaming and style overfitting. A 2025 arXiv preprint on the broader LMArena ecosystem flagged ways that flashy outputs can earn extra votes regardless of correctness^[8].
A non-trivial 18% of battles end in "both bad," a reminder that even leading models still produce broken apps a meaningful share of the time^[1].

Reception and broader context

LMArena attracts roughly one million monthly unique users across 100+ languages, and major labs including OpenAI, Meta, and DeepSeek monitor the leaderboard closely^[3]. Simon Willison's December 16, 2024 writeup framed the launch as "a new leaderboard from the Chatbot Arena team focused on web development" and pointed out the React-and-Tailwind framing within hours of release^[6]. Independent trackers such as Epoch AI reproduced the live leaderboard in their benchmark hub^[2], and researchers have used the public dataset to study preference modeling and the gap between unit-test scores and human-judged frontend quality^[4]. For a developer choosing a model to pair with Cursor, Aider, or a Claude or ChatGPT coding session, the leaderboard is closer to ground truth for "can it build me a UI" than older static benchmarks.

References

[1] Vichare, Aryan; Angelopoulos, Anastasios N.; Chiang, Wei-Lin; Tang, Kelly; Manolache, Luca. "WebDev Arena: A Live LLM Leaderboard for Web App Development." LMArena blog, 2025. https://arena.ai/blog/webdev-arena/
[2] Epoch AI. "WebDev Arena." Benchmark hub entry, 2025. https://epoch.ai/benchmarks/webdev-arena
[3] Sanchez, Andrew. "As companies pour billions into AI, a ranking system by UC Berkeley students has all eyes on it." Berkeley News, May 6, 2025. https://news.berkeley.edu/2025/05/06/as-companies-pour-billions-into-ai-a-ranking-system-by-uc-berkeley-students-has-all-eyes-on-it/
[4] LMArena. "webdev-arena-preference-10k." Hugging Face dataset, 2025. https://huggingface.co/datasets/lmarena-ai/webdev-arena-preference-10k
[5] LMArena. "WebDev AI Leaderboard." Live leaderboard snapshot, May 2026. https://arena.ai/leaderboard/code
[6] Willison, Simon. "WebDev Arena." simonwillison.net, December 16, 2024. https://simonwillison.net/2024/Dec/16/webdev-arena/
[7] LMArena. "WebDev Arena (about page)." https://web.lmarena.ai/about
[8] "Gaming the Arena: AI Model Evaluation and the Viral Capture of Attention." arXiv preprint 2512.15252, 2025. https://arxiv.org/pdf/2512.15252
