# WebDev Arena

> Source: https://aiwiki.ai/wiki/webdev_arena
> Updated: 2026-05-10
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

| WebDev Arena | |
| --- | --- |
| Overview |
| Full name | Web Development Arena |
| Abbreviation | WebDev Arena |
| Description | A live, community-driven leaderboard that ranks large language models on their ability to build interactive web applications, judged head-to-head by human voters |
| Release date | December 2024 |
| Latest leaderboard refresh | May 2026 |
| Authors | Aryan Vichare, Anastasios N. Angelopoulos, Wei-Lin Chiang, Kelly Tang, Luca Manolache |
| Organization | [LMArena](/wiki/lmarena) (formerly [LMSYS](/wiki/lmsys)) |
| Origin lab | UC Berkeley Sky Computing Lab |
| Technical Details |
| Type | [Web development](/wiki/web_development), frontend coding, interactive applications |
| Modality | Code, text, image (for vision-capable models) |
| Task format | Side-by-side coding battles, paired comparison voting |
| Stack enforced | [React](/wiki/react), [TypeScript](/wiki/typescript), [Tailwind CSS](/wiki/tailwind_css) |
| Sandbox | [E2B](/wiki/e2b) on AWS Firecracker microVMs |
| Evaluation metric | [Bradley-Terry](/wiki/bradley_terry_model) pairwise model, Arena Score |
| Vote categories | Model A wins, Model B wins, tie, both bad |
| Performance |
| Total votes (May 2026) | 288,203 across 77 ranked models |
| Top model (May 2026) | Claude Opus 4.7 Thinking |
| Top score (May 2026) | Arena Score 1570 |
| Score range (May 2026) | 1091 to 1570 |
| Saturated | No |
| Resources |
| Live leaderboard | [arena.ai/leaderboard/code](https://arena.ai/leaderboard/code) |
| Original site | [web.lmarena.ai](https://web.lmarena.ai/) |
| Blog post | [WebDev Arena: A Live LLM Leaderboard for Web App Development](https://arena.ai/blog/webdev-arena/) |
| Public dataset | [lmarena-ai/webdev-arena-preference-10k on Hugging Face](https://huggingface.co/datasets/lmarena-ai/webdev-arena-preference-10k) |
| Predecessor | [Chatbot Arena](/wiki/chatbot_arena) |

**WebDev Arena** is a live, community-driven leaderboard that ranks [large language models](/wiki/large_language_model) on their ability to generate working web applications. Users submit a prompt, two anonymous models build the app side by side inside a sandbox, and the human voter picks the better output. Those votes feed a [Bradley-Terry](/wiki/bradley_terry_model) statistical model that produces an Arena Score for each contender[1][2].

The project was launched in December 2024 by the team behind [Chatbot Arena](/wiki/chatbot_arena), now operating as [LMArena](/wiki/lmarena) (formerly [LMSYS](/wiki/lmsys)). The named authors are Aryan Vichare, [Anastasios N. Angelopoulos](/wiki/anastasios_angelopoulos), [Wei-Lin Chiang](/wiki/wei_lin_chiang), Kelly Tang, and Luca Manolache, all from UC Berkeley's Sky Computing Lab[1][3]. The first public blog post and dataset arrived in early 2025 alongside roughly 80,000 community votes; by May 2026 the leaderboard had grown to 288,203 votes across 77 models[1][4][5].

## Why a separate arena for web development

Most coding benchmarks before WebDev Arena focused on isolated function-level problems. [HumanEval](/wiki/humaneval), [MBPP](/wiki/mbpp), and even harder suites like [LiveCodeBench](/wiki/livecodebench) score a model on whether a short function passes hidden unit tests. That signal correlates poorly with what people use models for: building UIs, wiring up state, and gluing together packages without breaking the build[1]. There is no automatic grader for whether a chess board feels nice to play on or whether a Spotify clone is fun to click through, so the arena turns that subjective dimension into the metric[1][2].

## How a battle works

A user types a prompt and two anonymous models stream their code in parallel. Each output renders inside its own iframe, so the voter can click through both apps before deciding[2][6].

| Step | What happens | Notes |
| --- | --- | --- |
| 1. Prompt | User types a request such as "Build a chess game" | Freeform, no template |
| 2. Pairing | Two models sampled from the pool | Identities hidden during the vote |
| 3. Generation | Both produce a single-file React component | TypeScript and Tailwind enforced by system prompt |
| 4. Rendering | App boots in an [E2B](/wiki/e2b) sandbox iframe | Startup roughly 150 ms per app |
| 5. Trial | Voter clicks around in both iframes | Errors visible, broken apps still render |
| 6. Vote | Voter picks Model A, Model B, tie, or both bad | Identities revealed after the vote |
| 7. Update | Result feeds the Bradley-Terry fit | Scores recompute continuously |

The four-way vote choice matters. Allowing "tie" and "both bad" gives the system a way to absorb cases where neither output is convincing. According to the WebDev Arena blog post, Model A wins about 36.6% of battles, Model B wins about 37.0%, both bad accounts for roughly 18%, and ties make up about 7.8%[1]. Almost one in five battles ending with neither output clearing the bar is a real signal that web app generation is still hard, even for the strongest models.

## The forced React, TypeScript, and Tailwind stack

A recurring observation from external reviewers is that WebDev Arena is not really a generic web development benchmark. It is a [React](/wiki/react), [TypeScript](/wiki/typescript), and [Tailwind CSS](/wiki/tailwind_css) benchmark because the system prompt tells every model to produce a single React component in TypeScript styled with Tailwind classes[6][7]. The system prompt explicitly forbids arbitrary Tailwind values such as `h-[600px]`, requires a default export, and pushes the model toward components that are interactive and self-contained[6]. Simon Willison ran a small experiment shortly after launch in which his prompt asked for plain HTML, CSS, and JavaScript with no React, and both models returned React components anyway[6].

The technical contract for every generation looks roughly like this:

| Requirement | Specification |
| --- | --- |
| Framework | React with hooks |
| Language | TypeScript |
| Styling | Tailwind CSS, no arbitrary numeric values |
| Module shape | Single file, default export |
| Imports | Explicit React imports |
| State | Local state only, no required props |
| Dependencies | Declared up front, no edits to the dependency manifest |

The tight contract is partly a fairness move (if one model is allowed Tailwind shortcuts and another is not, comparisons stop being meaningful) and partly an infrastructure move, since the sandbox and renderer can boot a known stack quickly.
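
Two of the rules above (no arbitrary Tailwind values, a default export) are mechanical enough to check statically. The following is a minimal sketch of such a check, not the arena's actual validator: the function name `check_contract` and both regular expressions are assumptions made for this example, and a real linter would parse the TSX rather than pattern-match it.

```python
import re

# Matches Tailwind arbitrary-value classes such as h-[600px].
# This pattern is an assumption for illustration, not the arena's own rule.
ARBITRARY_VALUE = re.compile(r"\b[a-z][a-z0-9-]*-\[[^\]\s]+\]")
# Matches a line that starts with "export default".
DEFAULT_EXPORT = re.compile(r"^\s*export\s+default\b", re.MULTILINE)


def check_contract(source: str) -> list[str]:
    """Return human-readable contract violations (empty list if none)."""
    problems = []
    for match in ARBITRARY_VALUE.finditer(source):
        problems.append(f"arbitrary Tailwind value: {match.group(0)}")
    if not DEFAULT_EXPORT.search(source):
        problems.append("missing default export")
    return problems


# Hypothetical component that follows the contract.
good = '''import React, { useState } from "react";
export default function Counter() {
  const [n, setN] = useState(0);
  return <button className="h-64 p-4" onClick={() => setN(n + 1)}>{n}</button>;
}
'''

# Hypothetical component that breaks two rules.
bad = '''import React from "react";
function Board() {
  return <div className="h-[600px] w-full" />;
}
'''

print(check_contract(good))
print(check_contract(bad))
```

The first call reports no problems; the second reports the `h-[600px]` class and the missing default export.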

## Sandboxing and rendering

Generations execute in [E2B](/wiki/e2b) sandboxes built on top of AWS Firecracker microVMs. Each app gets an isolated Linux environment with roughly 150 ms cold-start latency[1][2]. For models without native structured output, WebDev Arena runs a two-stage pipeline: the model writes prose plus code, and a smaller model (Epoch AI's writeup mentions GPT-4o-mini in this role) reformats the response into the JSON schema the renderer expects[2]. The schema covers commentary, dependencies, the React code itself, and configuration needed for the iframe to boot. Apps that throw errors are not hidden; they render with their broken state visible, which is part of why the "both bad" vote category exists.
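
The reformatting stage can be pictured as producing a small JSON payload that is validated before rendering. The sketch below illustrates that idea under stated assumptions: the field names (`commentary`, `code`, `dependencies`, `config`) and the `parse_generation` helper are hypothetical, since the source only lists what the schema covers and not how it is spelled.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Generation:
    # Field names are hypothetical; the source only says the schema covers
    # commentary, dependencies, the React code, and iframe configuration.
    commentary: str
    code: str
    dependencies: list[str] = field(default_factory=list)
    config: dict = field(default_factory=dict)


def parse_generation(raw: str) -> Generation:
    """Parse a reformatted model response and reject incomplete payloads."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    missing = [k for k in ("commentary", "code") if not isinstance(data.get(k), str)]
    if missing:
        raise ValueError(f"missing or non-string fields: {missing}")
    deps = data.get("dependencies", [])
    if not (isinstance(deps, list) and all(isinstance(d, str) for d in deps)):
        raise ValueError("dependencies must be a list of strings")
    config = data.get("config", {})
    if not isinstance(config, dict):
        raise ValueError("config must be an object")
    return Generation(data["commentary"], data["code"], deps, config)


# Made-up payload for illustration only.
raw = json.dumps({
    "commentary": "A simple counter.",
    "code": "export default function App() { return null; }",
    "dependencies": ["react"],
    "config": {"port": 3000},
})
generation = parse_generation(raw)
print(generation.dependencies)
```

Rejecting malformed payloads at this boundary keeps failures in the reformatting step separate from failures in the generated app itself, which are meant to stay visible to the voter.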

## Scoring with the Bradley-Terry model

WebDev Arena uses the [Bradley-Terry model](/wiki/bradley_terry_model), the same paired comparison framework that underpins [Chatbot Arena](/wiki/chatbot_arena) and several Elo-style ranking systems[1][2]. Each model gets a strength parameter, and the probability that model i beats model j is `exp(theta_i) / (exp(theta_i) + exp(theta_j))`. Fitting the parameters across all collected votes yields the Arena Score shown on the public leaderboard.

| Component | Description |
| --- | --- |
| Win probability | `P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j))` |
| Strength parameter | One scalar per model, fit via maximum likelihood |
| Update cadence | Continuous, as new votes arrive |
| Confidence intervals | 95% intervals shown next to each score |
| Tie handling | Ties contribute partial credit; "both bad" subtracts from both |

The 95% confidence intervals matter because lower-volume models can swing across several rank positions until enough battles accumulate.
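
To make the win-probability formula concrete, the sketch below fits strength parameters to a handful of made-up battles. It is an illustration under several assumptions, not LMArena's pipeline: it uses the classic minorization-maximization (MM) update for Bradley-Terry, counts only decisive votes (ties and "both bad" are left out), requires every model to have at least one win, and does not compute confidence intervals or rescale `theta` to the Arena Score.

```python
import math
from collections import defaultdict


def win_probability(theta_i: float, theta_j: float) -> float:
    """P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j))."""
    return math.exp(theta_i) / (math.exp(theta_i) + math.exp(theta_j))


def fit_bradley_terry(battles, iterations=200):
    """Fit log-strengths from (winner, loser) pairs with MM updates.

    Assumes every model has at least one win; otherwise its strength
    would collapse to zero and the logarithm would fail.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    models = sorted(models)

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            # Sum over opponents of (battles played) / (combined strength).
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so the strengths keep a fixed average of 1.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return {m: math.log(s) for m, s in strength.items()}


# Made-up votes: each pair of hypothetical models meets ten times.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
theta = fit_bradley_terry(battles)
for name in sorted(theta, key=theta.get, reverse=True):
    print(name, round(theta[name], 3))
print("P(A beats B):", round(win_probability(theta["A"], theta["B"]), 3))
```

Only differences between `theta` values matter for the predicted win rates, which is why the fit is free to fix the overall scale.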

## What people actually ask for

The authors classified prompts into 11 buckets and reported the share of total traffic for the largest ones[1]:

| Category | Share of prompts |
| --- | --- |
| Website design (portfolios, landing pages, profile sites) | 15.3% |
| Game development (chess, snake, clickers, puzzles) | 12.1% |
| Clone development (recreating existing apps) | 11.6% |
| Interactive tools, data visualization, and other categories | 61.0% (remainder) |

A handful of specific prompts dominate. The most-requested tasks at the time of the original analysis were a VS Code or Cursor clone (4,189 requests), a WhatsApp clone (3,385 requests), and a chess game (3,154 requests)[1]. People love asking models to rebuild the apps they already use, which is also where you can spot which model has internalized common UI patterns versus which one is faking it with a colorful div.

## Top models on the live leaderboard

The leaderboard at [arena.ai/leaderboard/code](https://arena.ai/leaderboard/code) is live, so any snapshot ages quickly. As of May 2026, after 288,203 votes across 77 ranked models, the top of the board looked like this[5]:

| Rank | Model | Organization | Arena Score | Votes |
| --- | --- | --- | --- | --- |
| 1 | claude-opus-4-7-thinking | [Anthropic](/wiki/anthropic) | 1570 | 3,309 |
| 2 | claude-opus-4-7 | [Anthropic](/wiki/anthropic) | 1560 | 3,333 |
| 3 | claude-opus-4-6-thinking | [Anthropic](/wiki/anthropic) | 1549 | 6,382 |
| 4 | claude-opus-4-6 | [Anthropic](/wiki/anthropic) | 1544 | 7,325 |
| 5 | glm-5.1 | [Z.ai](/wiki/zhipu_ai) | 1531 | 3,609 |
| 6 | claude-sonnet-4-6 | [Anthropic](/wiki/anthropic) | 1524 | 9,434 |
| 7 | kimi-k2.6 | [Moonshot AI](/wiki/moonshot_ai) | 1523 | 2,651 |
| 8 | muse-spark | [Meta](/wiki/meta) | 1509 | 1,629 |
| 9 | gpt-5.5-high | [OpenAI](/wiki/openai) | 1491 | 2,765 |
| 10 | claude-opus-4-5-thinking-32k | [Anthropic](/wiki/anthropic) | 1490 | 13,063 |

Six of the top ten slots belong to [Anthropic](/wiki/anthropic) Claude variants in this snapshot. Claude has been a consistent leader on frontend coding since the December 2024 launch, when [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) (October 2024 revision) sat at the top of the original leaderboard[6]. The rest of the board mixes proprietary models from [OpenAI](/wiki/openai) and [Google DeepMind](/wiki/google_deepmind) with open-weights entrants from [Z.ai](/wiki/zhipu_ai), [Moonshot AI](/wiki/moonshot_ai), [Alibaba](/wiki/alibaba) (Qwen), and [DeepSeek](/wiki/deepseek). An open model cracking the top five is a real shift from the early days of the arena, when [Qwen2.5-Coder-32B](/wiki/qwen2_5_coder) was the only openly licensed model in the top six[6].

### Earlier leaderboard snapshot

For historical context, the December 2024 launch leaderboard looked very different. [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) (October 2024) led, followed by Gemini-Exp-1206, Gemini 2.0 Flash, [GPT-4o](/wiki/gpt-4o) (November 2024), Qwen2.5-Coder-32B, and Gemini 1.5 Pro 002[6]. By February 2025, [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) had taken first place with an Arena Score around 1311 to 1358 and a 76% average win rate against the field[1].

## Public dataset

LMArena released a public sample of arena battles called `lmarena-ai/webdev-arena-preference-10k`, hosted on [Hugging Face](/wiki/hugging_face)[4]. The release packages 10,501 deduplicated battles (around 149 MB) in JSON and Parquet formats. Each row contains both conversations, the two model identities, the winner label (`model_a`, `model_b`, `tie`, or `unknown`), a Unix timestamp, an anonymization flag, and a UUID question id. The sample covers 13 frontier models including [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) (October 2024), [GPT-4o](/wiki/gpt-4o) (November 2024), Gemini 2.0 Flash Thinking (exp-1219), [Gemini 1.5 Pro](/wiki/gemini_1_5_pro) 002, Gemini 2.0 Flash (exp), [Qwen2.5-Coder-32B Instruct](/wiki/qwen2_5_coder), [DeepSeek V3](/wiki/deepseek_v3), DeepSeek V2, [o1](/wiki/o1) (December 2024), and o1-mini (September 2024)[4]. The license permits research use but prohibits redistribution.
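
A first pass over rows shaped like this usually starts by tallying the winner labels. The sketch below works on a small in-memory list of made-up rows rather than the real files; the column name `winner` and the helper `winner_shares` are assumptions for illustration, so check the dataset card for the actual field names before adapting it.

```python
from collections import Counter

# The four label values named in the dataset description.
LABELS = ("model_a", "model_b", "tie", "unknown")


def winner_shares(rows, key="winner"):
    """Return the fraction of rows carrying each winner label.

    `key` is a hypothetical column name; adjust it to the real schema.
    """
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts.get(label, 0) / total for label in LABELS}


# Made-up rows for illustration; these are not taken from the dataset.
rows = (
    [{"winner": "model_a"}] * 4
    + [{"winner": "model_b"}] * 3
    + [{"winner": "tie"}] * 2
    + [{"winner": "unknown"}] * 1
)
for label, share in winner_shares(rows).items():
    print(f"{label}: {share:.1%}")
```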

## Comparison with other coding benchmarks

WebDev Arena does not replace function-level benchmarks; it covers a different axis of capability. The contrast is sharpest when you put it next to the older suites:

| Benchmark | What it measures | How it grades | Realism for product work |
| --- | --- | --- | --- |
| WebDev Arena | Full React, TypeScript, Tailwind apps from a freeform prompt | Human pairwise votes, Bradley-Terry | High for frontend |
| [HumanEval](/wiki/humaneval) | Single Python functions | Hidden unit tests | Low |
| [MBPP](/wiki/mbpp) | Short Python tasks | Hidden unit tests | Low |
| [LiveCodeBench](/wiki/livecodebench) | Competitive-style coding problems with fresh contests | Unit tests, contamination guards | Medium |
| [SWE-bench](/wiki/swe_bench) | Real GitHub issues across full repositories | Patch tests on real test suites | High for backend and library work |
| [BigCodeBench](/wiki/bigcodebench) | Library-rich function tasks | Unit tests | Medium |

WebDev Arena and [SWE-bench](/wiki/swe_bench) are the two evaluations that come closest to graded "professional" coding work, and they cover almost disjoint surface area. SWE-bench is about reading a real codebase, finding the bug, and writing a patch. WebDev Arena is about generating a brand new frontend from scratch, judged by whether it feels good to use.

## Limitations and known issues

- The single-file React component constraint is unrealistic for production work, where apps span dozens of files.
- Backend logic, databases, auth, and deployment are out of scope; this is frontend-only.
- The forced React, TypeScript, and Tailwind stack means models that excel at Vue, Svelte, plain HTML, or Astro are not measured here[6][7].
- Voter preferences skew toward visible polish. A subtler bug or accessibility regression may not show up in a one-minute trial.
- Like every preference leaderboard, WebDev Arena is exposed to gaming and style overfitting. A 2025 arXiv preprint on the broader LMArena ecosystem flagged ways that flashy outputs can earn extra votes regardless of correctness[8].
- A non-trivial 18% of battles end in "both bad," a reminder that even leading models still produce broken apps a meaningful share of the time[1].

## Reception and broader context

LMArena attracts roughly one million monthly unique users across 100+ languages, and major labs including OpenAI, Meta, and DeepSeek monitor the leaderboard closely[3]. Simon Willison's December 16, 2024 writeup framed the launch as "a new leaderboard from the Chatbot Arena team focused on web development" and pointed out the React-and-Tailwind framing within hours of release[6]. Independent trackers such as [Epoch AI](/wiki/epoch_ai) reproduced the live leaderboard in their benchmark hub[2], and researchers have used the public dataset to study preference modeling and the gap between unit-test scores and human-judged frontend quality[4]. For a developer choosing a model to pair with [Cursor](/wiki/cursor), [Aider](/wiki/aider), or a Claude or [ChatGPT](/wiki/chatgpt) coding session, the leaderboard is closer to ground truth for "can it build me a UI" than older static benchmarks.

## See also

- [Chatbot Arena](/wiki/chatbot_arena)
- [LMArena](/wiki/lmarena)
- [LMSYS](/wiki/lmsys)
- [Bradley-Terry model](/wiki/bradley_terry_model)
- [HumanEval](/wiki/humaneval)
- [SWE-bench](/wiki/swe_bench)
- [LiveCodeBench](/wiki/livecodebench)
- [E2B](/wiki/e2b)
- [React](/wiki/react)
- [TypeScript](/wiki/typescript)
- [Tailwind CSS](/wiki/tailwind_css)
- [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet)
- [AI code generation](/wiki/ai_code_generation)

## References

1. Vichare, Aryan; Angelopoulos, Anastasios N.; Chiang, Wei-Lin; Tang, Kelly; Manolache, Luca. "WebDev Arena: A Live LLM Leaderboard for Web App Development." LMArena blog, 2025. https://arena.ai/blog/webdev-arena/
2. Epoch AI. "WebDev Arena." Benchmark hub entry, 2025. https://epoch.ai/benchmarks/webdev-arena
3. Sanchez, Andrew. "As companies pour billions into AI, a ranking system by UC Berkeley students has all eyes on it." Berkeley News, May 6, 2025. https://news.berkeley.edu/2025/05/06/as-companies-pour-billions-into-ai-a-ranking-system-by-uc-berkeley-students-has-all-eyes-on-it/
4. LMArena. "webdev-arena-preference-10k." Hugging Face dataset, 2025. https://huggingface.co/datasets/lmarena-ai/webdev-arena-preference-10k
5. LMArena. "WebDev AI Leaderboard." Live leaderboard snapshot, May 2026. https://arena.ai/leaderboard/code
6. Willison, Simon. "WebDev Arena." simonwillison.net, December 16, 2024. https://simonwillison.net/2024/Dec/16/webdev-arena/
7. LMArena. "WebDev Arena (about page)." https://web.lmarena.ai/about
8. "Gaming the Arena: AI Model Evaluation and the Viral Capture of Attention." arXiv preprint 2512.15252, 2025. https://arxiv.org/pdf/2512.15252