WebDev Arena
Last reviewed
May 10, 2026
Sources
8 citations
Review status
Source-backed
Revision
v2 · 2,480 words
| WebDev Arena | |
|---|---|
| Overview | |
| Full name | Web Development Arena |
| Abbreviation | WebDev Arena |
| Description | A live, community-driven leaderboard that ranks large language models on their ability to build interactive web applications, judged head-to-head by human voters |
| Release date | December 2024 |
| Latest leaderboard refresh | May 2026 |
| Authors | Aryan Vichare, Anastasios N. Angelopoulos, Wei-Lin Chiang, Kelly Tang, Luca Manolache |
| Organization | LMArena (formerly LMSYS) |
| Origin lab | UC Berkeley Sky Computing Lab |
| Technical Details | |
| Type | Web development, frontend coding, interactive applications |
| Modality | Code, text, image (for vision-capable models) |
| Task format | Side-by-side coding battles, paired comparison voting |
| Stack enforced | React, TypeScript, Tailwind CSS |
| Sandbox | E2B on AWS Firecracker microVMs |
| Evaluation metric | Bradley-Terry pairwise model, Arena Score |
| Vote categories | Model A wins, Model B wins, tie, both bad |
| Performance | |
| Total votes (May 2026) | 288,203 across 77 ranked models |
| Top model (May 2026) | Claude Opus 4.7 Thinking |
| Top score (May 2026) | Arena Score 1570 |
| Score range (May 2026) | 1091 to 1570 |
| Saturated | No |
| Resources | |
| Live leaderboard | arena.ai/leaderboard/code |
| Original site | web.lmarena.ai |
| Blog post | WebDev Arena: A Live LLM Leaderboard for Web App Development |
| Public dataset | lmarena-ai/webdev-arena-preference-10k on Hugging Face |
| Predecessor | Chatbot Arena |
WebDev Arena is a live, community-driven leaderboard that ranks large language models on their ability to generate working web applications. Users submit a prompt, two anonymous models build the app side by side inside a sandbox, and the human voter picks the better output. Those votes feed a Bradley-Terry statistical model that produces an Arena Score for each contender[1][2].
The project was launched in December 2024 by the team behind Chatbot Arena, now operating as LMArena (formerly LMSYS). The named authors are Aryan Vichare, Anastasios N. Angelopoulos, Wei-Lin Chiang, Kelly Tang, and Luca Manolache, all from UC Berkeley's Sky Computing Lab[1][3]. The first public blog post and dataset arrived in early 2025 alongside roughly 80,000 community votes; by May 2026 the leaderboard had grown to 288,203 votes across 77 models[1][4][5].
Most coding benchmarks before WebDev Arena focused on isolated function-level problems. HumanEval, MBPP, and even harder suites like LiveCodeBench score a model on whether a short function passes hidden unit tests. That signal correlates poorly with what people use models for: building UIs, wiring up state, and gluing together packages without breaking the build[1]. There is no automatic grader for whether a chess board feels nice to play on or whether a Spotify clone is fun to click through, so the arena turns that subjective dimension into the metric[1][2].
A user types a prompt and two anonymous models stream their code in parallel. Each output renders inside its own iframe, so the voter can click through both apps before deciding[2][6].
| Step | What happens | Notes |
|---|---|---|
| 1. Prompt | User types a request such as "Build a chess game" | Freeform, no template |
| 2. Pairing | Two models sampled from the pool | Identities hidden during the vote |
| 3. Generation | Both produce a single-file React component | TypeScript and Tailwind enforced by system prompt |
| 4. Rendering | App boots in an E2B sandbox iframe | Startup roughly 150 ms per app |
| 5. Trial | Voter clicks around in both iframes | Errors visible, broken apps still render |
| 6. Vote | Voter picks Model A, Model B, tie, or both bad | Identities revealed after the vote |
| 7. Update | Result feeds the Bradley-Terry fit | Scores recompute continuously |
The four-way vote choice matters. Allowing "tie" and "both bad" gives the system a way to absorb cases where neither output is convincing. According to the WebDev Arena blog post, Model A wins about 36.6% of battles, Model B wins about 37.0%, both bad accounts for roughly 18%, and ties make up about 7.8%[1]. Almost one in five battles ending with neither output clearing the bar is a real signal that web app generation is still hard, even for the strongest models.
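The arithmetic behind those shares can be checked with a short sketch. The dictionary keys and function names below are illustrative, not taken from the arena's code, and the inputs are the rounded figures quoted above (which is why they sum to 99.4% rather than 100%).

```python
# Vote shares as quoted from the WebDev Arena blog post (percent of battles).
# The key names are illustrative labels chosen for this sketch.
shares = {"model_a": 36.6, "model_b": 37.0, "both_bad": 18.0, "tie": 7.8}


def decisive_share(shares: dict[str, float]) -> float:
    """Percent of battles that ended with a clear winner."""
    return shares["model_a"] + shares["model_b"]


def model_a_rate(shares: dict[str, float]) -> float:
    """Fraction of decisive battles won by the model shown as Model A."""
    return shares["model_a"] / decisive_share(shares)


print(f"total reported: {sum(shares.values()):.1f}%")
print(f"decisive battles: {decisive_share(shares):.1f}%")
print(f"Model A share of decisive battles: {model_a_rate(shares):.3f}")
```

About 73.6% of battles end with a winner, and Model A takes just under half of those. A split that close to even is consistent with little position bias between the two panes, although these aggregate numbers alone do not prove it.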
A recurring observation from external reviewers is that WebDev Arena is not really a generic web development benchmark. It is a React, TypeScript, and Tailwind CSS benchmark, because the system prompt tells every model to produce a single React component in TypeScript styled with Tailwind classes[6][7]. The system prompt explicitly forbids arbitrary Tailwind values such as h-[600px], requires a default export, and pushes the model toward components that are interactive and self-contained[6]. Simon Willison ran a small experiment shortly after launch in which his prompt asked for plain HTML, CSS, and JavaScript with no React, and both models returned React components anyway[6].
The technical contract for every generation looks roughly like this:
| Requirement | Specification |
|---|---|
| Framework | React with hooks |
| Language | TypeScript |
| Styling | Tailwind CSS, no arbitrary numeric values |
| Module shape | Single file, default export |
| Imports | Explicit React imports |
| State | Local state only, no required props |
| Dependencies | Declared up front, no edits to the dependency manifest |
The tight contract is partly a fairness move (if one model is allowed Tailwind shortcuts and another is not, comparisons stop being meaningful) and partly an infrastructure move, since the sandbox and renderer can boot a known stack quickly.
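To make the contract concrete, here is a minimal, heuristic checker for three of the rules in the table: default export, explicit React import, and no arbitrary Tailwind values. It is an illustration only. The arena enforces the contract through its system prompt, and nothing in the sources suggests it runs a linter like this; the function name, regular expressions, and sample components are all assumptions of this sketch.

```python
import re

# Heuristic patterns; a real check would parse the TSX instead of using regexes.
ARBITRARY_VALUE = re.compile(r"\b[a-z][a-z0-9-]*-\[[^\]\s]+\]")  # e.g. h-[600px]
DEFAULT_EXPORT = re.compile(r"^\s*export\s+default\b", re.MULTILINE)
REACT_IMPORT = re.compile(r"^\s*import\s+.*\bfrom\s+['\"]react['\"]", re.MULTILINE)


def contract_violations(source: str) -> list[str]:
    """Return human-readable contract problems found in a component's source."""
    problems = []
    if not DEFAULT_EXPORT.search(source):
        problems.append("missing default export")
    if not REACT_IMPORT.search(source):
        problems.append("missing explicit React import")
    for match in ARBITRARY_VALUE.finditer(source):
        problems.append(f"arbitrary Tailwind value: {match.group(0)}")
    return problems


# Hypothetical components used only to exercise the checker.
good = '''import React, { useState } from "react";

export default function Counter() {
  const [n, setN] = useState(0);
  return <button className="p-4 rounded" onClick={() => setN(n + 1)}>{n}</button>;
}
'''

bad = '''function Box() {
  return <div className="h-[600px] w-full" />;
}
'''

print(contract_violations(good))
print(contract_violations(bad))
```

The first component passes all three checks, while the second is flagged for the missing export, the missing import, and the h-[600px] class. Single-line import matching is a deliberate simplification; multi-line imports would need a proper parser.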
Generations execute in E2B sandboxes built on top of AWS Firecracker microVMs. Each app gets an isolated Linux environment with roughly 150 ms cold-start latency[1][2]. For models without native structured output, WebDev Arena runs a two-stage pipeline: the model writes prose plus code, and a smaller model (Epoch AI's writeup mentions GPT-4o-mini in this role) reformats the response into the JSON schema the renderer expects[2]. The schema covers commentary, dependencies, the React code itself, and configuration needed for the iframe to boot. Apps that throw errors are not hidden; they render with their broken state visible, which is part of why the "both bad" vote category exists.
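The shape of that structured output can be sketched as a small data class. The sources name four kinds of content (commentary, dependencies, code, and configuration) but not the actual field names or types, so every key, type, and default below is a hypothetical stand-in, as is the sample payload.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GeneratedApp:
    """Hypothetical mirror of the renderer's schema; field names are assumed."""
    commentary: str
    code: str
    dependencies: list[str] = field(default_factory=list)
    config: dict[str, object] = field(default_factory=dict)


def parse_generation(raw: str) -> GeneratedApp:
    """Parse the reformatted JSON and reject payloads missing required keys."""
    data = json.loads(raw)
    missing = [key for key in ("commentary", "code") if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return GeneratedApp(
        commentary=str(data["commentary"]),
        code=str(data["code"]),
        dependencies=list(data.get("dependencies", [])),
        config=dict(data.get("config", {})),
    )


# Made-up payload standing in for the output of the reformatting stage.
raw = json.dumps({
    "commentary": "A simple counter app.",
    "dependencies": ["lucide-react"],
    "code": "export default function App() { return null; }",
    "config": {"port": 3000},
})
app = parse_generation(raw)
print(app.dependencies)
```

Treating the code and commentary as required while leaving dependencies and configuration optional is a design choice of this sketch, not something the sources specify.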
WebDev Arena uses the Bradley-Terry model, the same paired comparison framework that underpins Chatbot Arena and several Elo-style ranking systems[1][2]. Each model gets a strength parameter, and the probability that model i beats model j is exp(theta_i) / (exp(theta_i) + exp(theta_j)). Fitting the parameters across all collected votes yields the Arena Score shown on the public leaderboard.
| Component | Description |
|---|---|
| Win probability | P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)) |
| Strength parameter | One scalar per model, fit via maximum likelihood |
| Update cadence | Continuous, as new votes arrive |
| Confidence intervals | 95% intervals shown next to each score |
| Tie handling | Ties contribute partial credit; "both bad" subtracts from both |
The 95% confidence intervals matter because lower-volume models can swing across several rank positions until enough battles accumulate.
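A minimal sketch of a Bradley-Terry fit follows, using the classic minorization-maximization (Zermelo) iteration as one standard way to reach the maximum-likelihood strengths. It is not LMArena's implementation. Counting a tie as half a win for each side is an assumption made here to match the "partial credit" description, "both bad" votes are not modeled, and the function and model names are illustrative.

```python
import math
from collections import defaultdict


def win_probability(theta_i: float, theta_j: float) -> float:
    """P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j))."""
    return 1.0 / (1.0 + math.exp(theta_j - theta_i))


def fit_bradley_terry(battles, iterations: int = 200) -> dict[str, float]:
    """Fit one strength per model from (model_a, model_b, outcome) tuples.

    outcome is "model_a", "model_b", or "tie"; a tie counts as half a win
    for each side (an assumption of this sketch).
    """
    wins = defaultdict(float)   # fractional wins per model
    games = defaultdict(float)  # battles per unordered pair of models
    models = set()
    for a, b, outcome in battles:
        models.update((a, b))
        games[frozenset((a, b))] += 1.0
        if outcome == "model_a":
            wins[a] += 1.0
        elif outcome == "model_b":
            wins[b] += 1.0
        elif outcome == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            raise ValueError(f"unsupported outcome: {outcome}")
    if any(wins[m] == 0 for m in models):
        raise ValueError("every model needs at least a partial win for a finite fit")

    p = {m: 1.0 for m in models}  # p = exp(theta)
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(
                n / (p[i] + p[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom
        # Rescale so the thetas sum to zero; only differences are identifiable.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        p = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return {m: math.log(v) for m, v in p.items()}


# Toy data: "alpha" beats "beta" in 3 of 4 battles.
battles = [("alpha", "beta", "model_a")] * 3 + [("alpha", "beta", "model_b")]
theta = fit_bradley_terry(battles)
print(round(win_probability(theta["alpha"], theta["beta"]), 3))
```

With only two models the fitted win probability should match the observed 3-in-4 win rate, which makes the toy case a quick sanity check. Confidence intervals are left out; one common approach is to refit on bootstrap resamples of the battles, but the sources do not say which method the leaderboard uses.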
The authors classified prompts into 11 buckets and reported the share of total traffic for the largest ones[1]:
| Category | Share of prompts |
|---|---|
| Website design (portfolios, landing pages, profile sites) | 15.3% |
| Game development (chess, snake, clickers, puzzles) | 12.1% |
| Clone development (recreating existing apps) | 11.6% |
| Interactive tools, data visualization, and other categories | 61.0% (remainder) |
A handful of specific prompts dominate. The most-requested tasks at the time of the original analysis were a VS Code or Cursor clone (4,189 requests), a WhatsApp clone (3,385 requests), and a chess game (3,154 requests)[1]. People love asking models to rebuild the apps they already use, which is also where you can spot which model has internalized common UI patterns versus which one is faking it with a colorful div.
The leaderboard at arena.ai/leaderboard/code is live, so any snapshot ages quickly. As of May 2026, after 288,203 votes across 77 ranked models, the top of the board looked like this[5]:
| Rank | Model | Organization | Arena Score | Votes |
|---|---|---|---|---|
| 1 | claude-opus-4-7-thinking | Anthropic | 1570 | 3,309 |
| 2 | claude-opus-4-7 | Anthropic | 1560 | 3,333 |
| 3 | claude-opus-4-6-thinking | Anthropic | 1549 | 6,382 |
| 4 | claude-opus-4-6 | Anthropic | 1544 | 7,325 |
| 5 | glm-5.1 | Z.ai | 1531 | 3,609 |
| 6 | claude-sonnet-4-6 | Anthropic | 1524 | 9,434 |
| 7 | kimi-k2.6 | Moonshot AI | 1523 | 2,651 |
| 8 | muse-spark | Meta | 1509 | 1,629 |
| 9 | gpt-5.5-high | OpenAI | 1491 | 2,765 |
| 10 | claude-opus-4-5-thinking-32k | Anthropic | 1490 | 13,063 |
Six of the top ten slots belong to Anthropic Claude variants in this snapshot. Claude has been a consistent leader on frontend coding since the December 2024 launch, when Claude 3.5 Sonnet (October 2024 revision) sat at the top of the original leaderboard[6]. The rest of the board mixes proprietary models from OpenAI and Google DeepMind with open-weights entrants from Z.ai, Moonshot AI, Alibaba (Qwen), and DeepSeek. An open model cracking the top five is a real shift from the early days of the arena, when Qwen2.5-Coder-32B was the only openly licensed model in the top six[6].
For historical context, the December 2024 launch leaderboard looked very different. Claude 3.5 Sonnet (October 2024) led, followed by Gemini-Exp-1206, Gemini 2.0 Flash, GPT-4o (November 2024), Qwen2.5-Coder-32B, and Gemini 1.5 Pro 002[6]. By February 2025, Claude 3.7 Sonnet had taken first place with an Arena Score around 1311 to 1358 and a 76% average win rate against the field[1].
LMArena released a public sample of arena battles called lmarena-ai/webdev-arena-preference-10k, hosted on Hugging Face[4]. The release packages 10,501 deduplicated battles (around 149 MB) in JSON and Parquet formats. Each row contains both conversations, the two model identities, the winner label (model_a, model_b, tie, or unknown), a Unix timestamp, an anonymization flag, and a UUID question id. The sample covers 13 frontier models including Claude 3.5 Sonnet (October 2024), GPT-4o (November 2024), Gemini 2.0 Flash Thinking (exp-1219), Gemini 1.5 Pro 002, Gemini 2.0 Flash (exp), Qwen2.5-Coder-32B Instruct, DeepSeek V3, DeepSeek V2, o1 (December 2024), and o1-mini (September 2024)[4]. The license permits research use but prohibits redistribution.
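As a rough illustration of how such rows might be analyzed, the sketch below computes per-model win rates over decisive battles. The column names model_a, model_b, and winner are assumed from the description above and should be checked against the dataset card; the sample rows and the model names m1 to m3 are invented. The real data could be loaded with the Hugging Face datasets library, which is left out so the example runs offline.

```python
from collections import Counter


def win_rates(rows) -> dict[str, float]:
    """Per-model win rate over decisive battles; tie and unknown rows are skipped."""
    wins, decisive = Counter(), Counter()
    for row in rows:
        if row["winner"] not in ("model_a", "model_b"):
            continue
        # row["winner"] names the winning side; look up that side's model.
        winner = row[row["winner"]]
        for side in ("model_a", "model_b"):
            decisive[row[side]] += 1
        wins[winner] += 1
    return {model: wins[model] / n for model, n in decisive.items()}


# Invented rows with assumed column names, for illustration only.
rows = [
    {"model_a": "m1", "model_b": "m2", "winner": "model_a"},
    {"model_a": "m2", "model_b": "m1", "winner": "model_b"},
    {"model_a": "m1", "model_b": "m3", "winner": "tie"},
    {"model_a": "m3", "model_b": "m2", "winner": "model_a"},
]
print(win_rates(rows))
```

Raw win rates like these ignore opponent strength, which is the gap the Bradley-Terry fit is meant to close.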
WebDev Arena does not replace function-level benchmarks; it covers a different axis of capability. The contrast is sharpest when you put it next to the older suites:
| Benchmark | What it measures | How it grades | Realism for product work |
|---|---|---|---|
| WebDev Arena | Full React, TypeScript, Tailwind apps from a freeform prompt | Human pairwise votes, Bradley-Terry | High for frontend |
| HumanEval | Single Python functions | Hidden unit tests | Low |
| MBPP | Short Python tasks | Hidden unit tests | Low |
| LiveCodeBench | Competitive-style coding problems with fresh contests | Unit tests, contamination guards | Medium |
| SWE-bench | Real GitHub issues across full repositories | Patch tests on real test suites | High for backend and library work |
| BigCodeBench | Library-rich function tasks | Unit tests | Medium |
WebDev Arena and SWE-bench are the two evaluations that come closest to graded "professional" coding work, and they cover almost disjoint surface area. SWE-bench is about reading a real codebase, finding the bug, and writing a patch. WebDev Arena is about generating a brand new frontend from scratch, judged by whether it feels good to use.
LMArena attracts roughly one million monthly unique users across 100+ languages, and major labs including OpenAI, Meta, and DeepSeek monitor the leaderboard closely[3]. Simon Willison's December 16, 2024 writeup framed the launch as "a new leaderboard from the Chatbot Arena team focused on web development" and pointed out the React-and-Tailwind framing within hours of release[6]. Independent trackers such as Epoch AI reproduced the live leaderboard in their benchmark hub[2], and researchers have used the public dataset to study preference modeling and the gap between unit-test scores and human-judged frontend quality[4]. For a developer choosing a model to pair with Cursor, Aider, or a Claude or ChatGPT coding session, the leaderboard is closer to ground truth for "can it build me a UI" than older static benchmarks.