Scale SEAL Leaderboards

Overview

The SEAL Leaderboards are a set of expert-curated, contamination-resistant evaluation leaderboards for frontier large language models, produced by the Safety, Evaluations and Alignment Lab (SEAL) at Scale AI. First published on May 29, 2024, the leaderboards rank models on private, held-out datasets that are authored and graded by verified domain experts and are never released publicly, so that model developers cannot train on the questions or otherwise game the results ^[1]^[2]. SEAL grew into one of the more widely cited independent evaluation hubs of the 2024 to 2026 period, covering reasoning, coding, mathematics, multilinguality, agentic tool use, instruction following, visual-language understanding, honesty, and adversarial robustness, and hosting headline benchmarks such as Humanity's Last Exam, EnigmaEval, and MultiChallenge ^[3]^[4].

The project is positioned as a deliberate alternative to two perceived failure modes of AI benchmark practice: public static benchmarks that leak into training data, and crowd-vote arenas such as LMArena whose rankings can be influenced by stylistic preference and targeted optimization. SEAL's stated answer is private data plus credentialed human experts plus neutral, vendor-independent scoring ^[1]^[5].

What the SEAL Leaderboards are

SEAL stands for Safety, Evaluations and Alignment Lab, a research group Scale AI launched in November 2023 ^[1]^[6]. The lab's leaderboards are not a single ranking but a collection of separate, capability-specific evaluations, each built around its own private prompt set and grading methodology. Scale describes the goal as supplying "trustworthy, third-party" rankings of frontier models at a time when many published scores came from the model developers themselves and were difficult to reproduce or verify ^[1]^[2].

Scale reported that during 2025 it added 15 new benchmarks and published more than 450 individual evaluation results across more than 50 models, illustrating the cadence at which the lab refreshed and expanded coverage ^[7]. The leaderboards live at scale.com/leaderboard, which redirects to Scale's "Scale Labs" evaluation hub ^[3].

Methodology (private, expert-curated)

Scale frames the leaderboards around four founding principles ^[1]^[2]:

Principle	What it means
Private, held-out datasets	Prompts and answers are proprietary and never published, so they cannot be incorporated into training corpora or memorized by models.
Verified domain experts	Prompts are authored, and responses graded, by specialists vetted through interviews and domain-specific screening rather than anonymous crowd workers.
Model-provider neutrality	Scale runs the evaluations itself and limits or flags entries from providers that may have seen the prompts (for example through API logging), so a developer does not score its own model.
Periodic refreshes	Leaderboards are updated multiple times per year as new models ship and as datasets are expanded or rotated.

The lab states it built SEAL specifically to address recurring problems in AI evaluation: opacity, dataset contamination, inconsistent reporting, and unverified evaluator expertise ^[2]^[5]. For some benchmarks Scale releases a public split of a dataset together with a paper while retaining a private split for live scoring; the honesty benchmark MASK is an example of this public-paper, private-leaderboard split ^[4]. Where automated grading is used, prompt sets are still expert-written; the ToolComp agentic benchmark, for instance, consists of 485 hand-crafted prompts with final answers and process-supervision labels designed to test dependent, multi-step tool use ^[4].
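
The public-paper, private-leaderboard arrangement described above can be illustrated with a minimal sketch. The source does not say how Scale partitions its datasets, so everything here is an assumption for illustration: the function name `assign_split`, the salt, the 50 percent private fraction, and the item IDs are all hypothetical. The sketch shows one common way to make such a partition deterministic, by hashing each item ID so that an item always lands in the same split.

```python
import hashlib


def assign_split(item_id, private_fraction=0.5, salt="example-salt"):
    """Deterministically map an item ID to 'private' or 'public'.

    Hypothetical helper: not Scale's actual procedure.
    """
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Items whose bucket falls below the threshold are held out.
    threshold = int(private_fraction * 2**64)
    return "private" if bucket < threshold else "public"


# Made-up item IDs for illustration only.
item_ids = [f"q-{i:04d}" for i in range(1000)]
splits = {"public": [], "private": []}
for item_id in item_ids:
    splits[assign_split(item_id)].append(item_id)

print({name: len(ids) for name, ids in splits.items()})
```

Because the assignment depends only on the item ID and the salt, rerunning the script reproduces the same partition, and new items can be added without reshuffling existing ones.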

The leaderboards (HLE, EnigmaEval, and others)

SEAL's evaluations span several capability areas. The most prominent include:

Humanity's Last Exam (HLE): an extremely difficult, broad multi-domain reasoning exam co-developed by Scale AI and the Center for AI Safety (CAIS). The question set was finalized at 2,500 questions in early 2025, with text-only and multimodal variants ^[3]^[8]. On the text-only leaderboard captured in April 2025, OpenAI's o3 and o4-mini reasoning models led, scoring roughly 18 to 21 percent, underscoring how far frontier systems remained from saturating it ^[3].
EnigmaEval: a benchmark of intricate, multi-step reasoning puzzles, on which top models scored only in the low teens (in percent) as of April 2025 ^[3].
MultiChallenge: a multi-turn conversation benchmark introduced in early 2025 on which frontier models initially scored under 50 percent accuracy, testing the ability to track instructions and context across a dialogue ^[9].
MultiNRC and Professional Reasoning: additional reasoning leaderboards covering multilingual non-English reasoning and professional-domain problems ^[7]^[10].
VISTA: a visual-language understanding benchmark requiring models to integrate multiple perception abilities against structured rubrics ^[4]^[10].
MASK: a benchmark measuring model honesty under realistic pressure to be deceptive, built on a private split of the MASK dataset ^[4].
Agentic Tool Use: separate "Chat" and "Enterprise" leaderboards (drawing on work such as ToolComp) that test multi-step planning, correct API and tool invocation, and end-to-end task completion ^[4]^[10].
Adversarial Robustness: ranks models by the number of policy violations across 1,000 adversarial prompts, with fewer violations indicating greater robustness ^[11].
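
The ranking rule described for the Adversarial Robustness leaderboard (fewer policy violations across 1,000 prompts ranks higher) can be sketched as follows. This is a minimal illustration, not Scale's implementation: the class and function names, the alphabetical tie-break, and the sample numbers are all assumptions, and the numbers are made up rather than taken from the leaderboard.

```python
from dataclasses import dataclass

TOTAL_PROMPTS = 1000  # size of the adversarial prompt set stated in the text


@dataclass(frozen=True)
class RobustnessResult:
    model: str       # hypothetical model label
    violations: int  # responses judged to be policy violations


def rank_by_violations(results):
    """Return results sorted so that fewer violations rank first."""
    # Tie-break alphabetically by model name; this rule is an assumption.
    return sorted(results, key=lambda r: (r.violations, r.model))


def violation_rate(result, total=TOTAL_PROMPTS):
    """Fraction of prompts that produced a policy violation."""
    return result.violations / total


# Made-up numbers for illustration only; not real leaderboard scores.
sample = [
    RobustnessResult("model-a", 37),
    RobustnessResult("model-b", 12),
    RobustnessResult("model-c", 85),
]

ranking = rank_by_violations(sample)
for position, result in enumerate(ranking, start=1):
    print(f"{position}. {result.model}: {result.violations} violations "
          f"({violation_rate(result):.1%})")
```

Sorting on the raw violation count rather than a derived score keeps the ordering easy to audit, since every model is evaluated against the same fixed number of prompts.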

Earlier core domains launched in May 2024 included coding, instruction following, mathematics (initially based on the GSM1k dataset), and multilinguality ^[2]. Scale also began issuing annual "Models of the Year" awards summarizing leaderboard performance ^[7].

Position in the evaluation ecosystem

SEAL occupies a distinct niche relative to other evaluation efforts. Against public static benchmarks, its private datasets are meant to defeat contamination, the phenomenon in which test questions appear in training data and inflate scores. Against crowd-vote arenas, most notably LMArena, SEAL substitutes credentialed expert judgment for anonymous public votes, on the argument that votes may reward style or verbosity rather than correctness ^[1]^[5].

On September 22, 2025, Scale extended the family in the opposite direction with SEAL Showdown, a public leaderboard that ranks models on real-world human preferences drawn from millions of conversations across Scale's global contributor network, which spans more than 100 countries and over 70 languages. SEAL Showdown lets users break results down by region, language, age, education level, and profession, and applies anti-gaming safeguards, including a delay before leaderboard data is shared with developers. It was widely described as a direct competitor to LMArena, one that broadens the rater pool beyond the "tech enthusiasts" who dominate existing arenas ^[5]^[12]. Commentators noted the irony that, at a time when LMArena was drawing criticism (for example a 2025 "Leaderboard Illusion" line of critique and reports that xAI optimized Grok specifically to top coding rankings), Scale was itself offering a crowd evaluation, albeit a more demographically representative one, alongside its private expert leaderboards ^[5]^[12].

Reception

The SEAL Leaderboards were generally received as a credible attempt to raise evaluation standards, and their datasets and headline scores were frequently cited by researchers and the press ^[1]^[9]. The dominant criticism concerns the inherent tension of a commercial data vendor grading the models of the customers it sells to. That tension sharpened on June 12, 2025, when Meta took a 49 percent non-voting stake in Scale AI for about $14.3 billion, valuing the company at more than $29 billion. The deal brought founder Alexandr Wang to Meta to help lead its superintelligence efforts, and Jason Droege was promoted to Scale CEO ^[13]^[14].

The deal drew antitrust scrutiny and prompted rival labs, reportedly including OpenAI, Google, and xAI, to scale back or pause work with Scale over data-access and conflict-of-interest concerns ^[12]^[15]. Observers argued the investment undercut the neutrality that the SEAL leaderboards depend on, since a model evaluator part-owned by one frontier lab is hard to read as a disinterested referee ^[5]^[12]. Scale maintained that it operates independently and that the leaderboards remain provider-neutral ^[14].

Leadership turnover also followed. Summer Yue, a former RLHF research lead for Bard at Google DeepMind, had joined Scale to direct SEAL and drove the creation of the leaderboards. In 2025 she announced her departure to join Meta's Superintelligence Labs, where she continued work on AI safety and alignment ^[6]^[9]. As of mid-2026 the SEAL Leaderboards remained active and continued to be updated with new models and benchmarks ^[3]^[7].

References

1. Scale AI, "Scale's SEAL Leaderboards." https://scale.com/blog/leaderboard
2. Scale AI, "Scale's SEAL Leaderboards" (introduction, founding principles, initial domains; published May 29, 2024). https://scale.com/blog/leaderboard
3. Scale Labs, "AI Model Leaderboards & Benchmarks" (leaderboard hub, HLE / EnigmaEval / MultiChallenge results). https://labs.scale.com/leaderboard
4. Scale Labs, leaderboard methodology pages for MASK, VISTA, ToolComp / Agentic Tool Use. https://labs.scale.com/leaderboard
5. WinBuzzer, "Scale AI Launches 'SEAL Showdown' LLM Leaderboard," September 22, 2025. https://winbuzzer.com/2025/09/22/scale-ai-launches-seal-showdown-llm-leaderboard-can-it-dethrone-lmarena-xcxwbn/
6. Scale AI, "SEAL: Scale's Safety, Evaluations and Alignment Lab." https://scale.com/blog/safety-evaluations-alignment-lab
7. Scale AI, "Introducing the 2025 SEAL Models of the Year Awards." https://scale.com/blog/2025-model-awards
8. Scale AI, "Humanity's Last Exam (Text Only)" leaderboard. https://scale.com/leaderboard/humanitys_last_exam_text_only
9. Summer Yue, post introducing MultiChallenge by Scale AI. https://x.com/summeryue0/status/1887202616939323745
10. Scale AI, leaderboard index (reasoning, agentic, visual-language categories). https://scale.com/leaderboard
11. Scale AI, "Adversarial Robustness" leaderboard. https://scale.com/leaderboard/adversarial_robustness
12. Yahoo / AT&T, "LMArena has some competition: Scale AI launches Seal Showdown." https://currently.att.yahoo.com/att/lmarena-competition-scale-ai-launches-173607307.html
13. CNBC, "Scale AI's Alexandr Wang confirms departure for Meta as part of $14.3 billion deal," June 12, 2025. https://www.cnbc.com/2025/06/12/scale-ai-founder-wang-announces-exit-for-meta-part-of-14-billion-deal.html
14. TechCrunch, "Scale AI confirms 'significant' investment from Meta, says CEO Alexandr Wang is leaving," June 13, 2025. https://techcrunch.com/2025/06/13/scale-ai-confirms-significant-investment-from-meta-says-ceo-alexandr-wang-is-leaving/
15. TechCrunch, "Cracks are forming in Meta's partnership with Scale AI," August 29, 2025. https://techcrunch.com/2025/08/29/cracks-are-forming-in-metas-partnership-with-scale-ai/
