Scale SEAL Leaderboards
Last reviewed: Jun 8, 2026
Sources: 15 citations
Review status: Source-backed
Revision: v1 · 1,478 words
The SEAL Leaderboards are a set of expert-curated, contamination-resistant evaluation leaderboards for frontier large language models, produced by the Safety, Evaluations and Alignment Lab (SEAL) at Scale AI. First published on May 29, 2024, the leaderboards rank models on private, held-out datasets that are authored and graded by verified domain experts and are never released publicly, so that model developers cannot train on the questions or otherwise game the results [1][2]. SEAL grew into one of the more widely cited independent evaluation hubs of the 2024 to 2026 period, covering reasoning, coding, mathematics, multilinguality, agentic tool use, instruction following, visual-language understanding, honesty, and adversarial robustness, and hosting headline benchmarks such as Humanity's Last Exam, EnigmaEval, and MultiChallenge [3][4].
The project is positioned as a deliberate alternative to two perceived failure modes of AI benchmark practice: public static benchmarks that leak into training data, and crowd-vote arenas such as LMArena whose rankings can be influenced by stylistic preference and targeted optimization. SEAL's stated answer is private data plus credentialed human experts plus neutral, vendor-independent scoring [1][5].
SEAL stands for Safety, Evaluations and Alignment Lab, a research group Scale AI launched in November 2023 [1][6]. The lab's leaderboards are not a single ranking but a collection of separate, capability-specific evaluations, each built around its own private prompt set and grading methodology. Scale describes the goal as supplying "trustworthy, third-party" rankings of frontier models at a time when many published scores came from the model developers themselves and were difficult to reproduce or verify [1][2].
Scale reported that during 2025 it added 15 new benchmarks and published more than 450 individual evaluation results across more than 50 models, illustrating the cadence at which the lab refreshed and expanded coverage [7]. The leaderboards live at scale.com/leaderboard, which redirects to Scale's "Scale Labs" evaluation hub [3].
Scale frames the leaderboards around four founding principles [1][2]:
| Principle | What it means |
|---|---|
| Private, held-out datasets | Prompts and answers are proprietary and never published, so they cannot be incorporated into training corpora or memorized by models. |
| Verified domain experts | Prompts are authored, and responses graded, by specialists vetted through interviews and domain-specific screening rather than anonymous crowd workers. |
| Model-provider neutrality | Scale runs the evaluations itself, so a developer does not score its own model, and it limits or flags entries from providers that may have seen the prompts (for example through API logging). |
| Periodic refreshes | Leaderboards are updated multiple times per year as new models ship and as datasets are expanded or rotated. |
The lab states it built SEAL specifically to address recurring problems in AI evaluation: opacity, dataset contamination, inconsistent reporting, and unverified evaluator expertise [2][5]. For some benchmarks Scale releases a public split of a dataset together with a paper while retaining a private split for live scoring; the honesty benchmark MASK is an example of this public-paper, private-leaderboard split [4]. Where automated grading is used, prompt sets are still expert-written; the ToolComp agentic benchmark, for instance, consists of 485 hand-crafted prompts with final answers and process-supervision labels designed to test dependent, multi-step tool use [4].
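The public-paper, private-leaderboard arrangement can be illustrated with a minimal sketch. It is not Scale's procedure: the function names, the prompt IDs, and the 80 percent private fraction are all hypothetical, chosen only to show how a dataset can be partitioned deterministically so that the held-out portion stays stable across releases.

```python
import hashlib

def assign_split(prompt_id: str, private_fraction: float = 0.8) -> str:
    """Deterministically map a prompt ID to 'private' or 'public'.

    Hypothetical helper: the 0.8 default is an illustrative assumption,
    not a figure reported for any SEAL benchmark.
    """
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Prompts whose bucket falls below the threshold are held out.
    threshold = int(private_fraction * 2**64)
    return "private" if bucket < threshold else "public"

def split_prompts(prompt_ids, private_fraction: float = 0.8):
    """Partition prompt IDs into a public list and a private list."""
    splits = {"public": [], "private": []}
    for pid in prompt_ids:
        splits[assign_split(pid, private_fraction)].append(pid)
    return splits

# Hypothetical prompt IDs, for illustration only.
prompt_ids = [f"prompt-{i:04d}" for i in range(1000)]
splits = split_prompts(prompt_ids)
print(len(splits["public"]), len(splits["private"]))
```

Hashing the ID rather than sampling at random means a prompt never moves between splits when the dataset grows, so a prompt that has been published cannot later end up in the held-out scoring set.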
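SEAL's evaluations span several capability areas. The most prominent benchmarks include Humanity's Last Exam, EnigmaEval, and MultiChallenge, alongside the MASK honesty benchmark and the ToolComp agentic benchmark [3][4].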
Earlier core domains launched in May 2024 included coding, instruction following, mathematics (initially based on the GSM1k dataset), and multilinguality [2]. Scale also began issuing annual "Models of the Year" awards summarizing leaderboard performance [7].
SEAL occupies a distinct niche relative to other evaluation efforts. Against public static benchmarks, its private datasets are meant to defeat contamination, the phenomenon in which test questions appear in training data and inflate scores. Against crowd-vote arenas, most notably LMArena, SEAL substitutes credentialed expert judgment for anonymous public votes, on the argument that votes may reward style or verbosity rather than correctness [1][5].
On September 22, 2025, Scale extended the family in the opposite direction with SEAL Showdown, a public leaderboard that ranks models on real-world human preferences drawn from millions of conversations across Scale's global contributor network, which spans more than 100 countries and over 70 languages. SEAL Showdown lets users break results down by region, language, age, education level, and profession, and applies anti-gaming safeguards including a delay before sharing leaderboard data with developers. It was widely described as a direct competitor to LMArena, broadening the rater pool beyond the "tech enthusiasts" who dominate existing arenas [5][12]. Commentators noted the irony that Scale, having positioned its private expert leaderboards against crowd-vote arenas, was now offering a crowd evaluation of its own, albeit a more demographically representative one, while LMArena was drawing criticism (for example a 2025 "Leaderboard Illusion" line of critique and reports that xAI optimized Grok specifically to top coding rankings) [5][12].
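The kind of demographic breakdown described above can be sketched as a per-segment win rate over pairwise preference votes. This is an illustrative assumption, not SEAL Showdown's scoring method: the vote records, field names, and model names are hypothetical, and arena-style leaderboards often fit a rating model such as Bradley-Terry rather than reporting raw win rates.

```python
from collections import defaultdict

def win_rates_by_segment(votes, segment_key):
    """Return {segment: {model: win_rate}} from pairwise preference votes.

    Each vote is a dict with 'winner', 'loser', and demographic fields.
    The schema is hypothetical and exists only for this sketch.
    """
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    for vote in votes:
        segment = vote[segment_key]
        # Both models in the comparison played one game in this segment.
        games[segment][vote["winner"]] += 1
        games[segment][vote["loser"]] += 1
        wins[segment][vote["winner"]] += 1
    return {
        segment: {model: wins[segment][model] / n for model, n in models.items()}
        for segment, models in games.items()
    }

# Hypothetical votes, for illustration only.
votes = [
    {"winner": "model-a", "loser": "model-b", "region": "EU", "language": "de"},
    {"winner": "model-b", "loser": "model-a", "region": "EU", "language": "fr"},
    {"winner": "model-a", "loser": "model-b", "region": "APAC", "language": "ja"},
    {"winner": "model-a", "loser": "model-b", "region": "EU", "language": "de"},
]
rates = win_rates_by_segment(votes, "region")
print(rates)
```

Passing `"language"` instead of `"region"` as `segment_key` gives the same breakdown along a different axis, which is the sense in which a single vote log can support several demographic views.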
The SEAL Leaderboards were generally received as a credible attempt to raise evaluation standards, and their datasets and headline scores were frequently cited by researchers and the press [1][9]. The dominant criticism concerns the inherent tension of a commercial data vendor grading the customers it sells to. That tension sharpened on June 12, 2025, when Meta took a 49 percent non-voting stake in Scale AI for about $14.3 billion, valuing the company at more than $29 billion and bringing founder Alexandr Wang to Meta to help lead its superintelligence efforts, with Jason Droege promoted to Scale CEO [13][14].
The deal drew antitrust scrutiny and prompted rival labs, reportedly including OpenAI, Google, and xAI, to scale back or pause work with Scale over data-access and conflict-of-interest concerns [12][15]. Observers argued the investment undercut the neutrality that the SEAL leaderboards depend on, since a model evaluator part-owned by one frontier lab is hard to read as a disinterested referee [5][12]. Scale maintained that it operates independently and that the leaderboards remain provider-neutral [14].
Leadership turnover also followed. Summer Yue, a former RLHF research lead for Bard at Google DeepMind, had joined Scale to direct SEAL and drove the creation of the leaderboards; she announced her departure in 2025 to join Meta's Superintelligence Labs, where she continued work on AI safety and alignment [6][9]. As of mid-2026 the SEAL Leaderboards remained active and continued to be updated with new models and benchmarks [3][7].