Vals AI
Last reviewed
Jun 8, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 1,725 words
Vals AI is an independent, third-party AI model evaluation company based in San Francisco that builds and publishes domain-specific benchmarks measuring how well large language models perform on real professional tasks. Rather than scoring models on generic academic quizzes, Vals AI concentrates on high-stakes knowledge work in fields such as law, tax, finance, and healthcare. It often constructs its evaluation sets with domain experts and, for its legal studies, with practicing lawyers from a consortium of firms. The company maintains public leaderboards that track the leading frontier models over time and runs additional private evaluations for AI labs and enterprise engineering teams. [1][2][3]
Vals AI describes itself as an "independent platform committed to advancing the future of Gen AI through unbiased benchmarks and scalable evaluation infrastructure for labs and engineering teams." [2] It has been covered as an emerging evaluation authority by outlets including Bloomberg and Andrew Ng's The Batch, and it is frequently grouped with other proprietary, expert-graded benchmarking efforts such as Scale AI's SEAL leaderboards as part of a broader shift toward private, harder-to-game model evaluation. [1][4][5]
Vals AI was founded by Rayan Krishnan and Langston Nashold, who left Stanford University's master's program in artificial intelligence to start the company, joined by founding engineer Rez Havaei. [4][6] Krishnan serves as chief executive officer and Nashold as chief technology officer. The company was incorporated and raised its initial outside capital in 2024, and it remains headquartered in San Francisco. [6][7] The founders retained close ties to Stanford researchers and recruited industry specialists in accounting, law, and finance to help design evaluations. [4]
The founders started Vals AI after concluding that the popular public benchmarks used to rank models did not reflect the work that professionals actually do. In the company's framing, three problems undermine conventional leaderboards: many widely cited benchmarks rely on "contrived academic datasets" rather than real industry tasks; public test sets suffer from data contamination once they are absorbed into model training corpora; and vendor-published results lack objectivity because companies can cherry-pick favorable examples. [2] Vals AI's stated answer is to commission custom, professionally grounded benchmarks, keep the underlying datasets private so they cannot leak into training data, and act as a neutral third party that reports results across multiple dimensions including accuracy, cost, and latency. [2]
Reported founding dates vary slightly between sources, with some listing 2023 and others 2024; the company's first outside funding and its earliest published leaderboards both date to 2024. [6][7]
Vals AI runs a growing catalogue of evaluations spanning legal, financial, tax, medical, coding, and general academic domains. A subset are public-facing leaderboards intended as free reference points for model comparison, while the underlying question sets for many of them are held privately to preserve evaluation integrity. The company also publishes composite measures, including a "Vals Index" that weights performance across finance and coding tasks as a proxy for economic impact. [3]
The table below summarizes representative benchmarks by domain. Benchmark line-ups and version numbers change as the company refreshes them.
| Domain | Benchmark | What it tests |
|---|---|---|
| Legal | CaseLaw | Private question-and-answer set drawn from court cases (a later version uses Canadian case law) [3] |
| Legal | ContractLaw | Contract analysis, retrieval, editing, and compliance [1] |
| Legal | LegalBench | An open-source legal-reasoning benchmark that Vals AI tracks on its leaderboard [1] |
| Finance | CorpFin | Reasoning over long-context commercial credit agreements [1][5] |
| Finance | Finance Agent | Entry-level financial-analyst tasks answered from public company filings [3][8] |
| Tax | TaxEval | Tax calculation and accounting knowledge [1] |
| Tax | MortgageTax | Reading and interpreting tax certificates presented as images [3] |
| Healthcare | MedQA | Medical question answering, with attention to model bias [3] |
| Healthcare | MedCode / MedScribe | Whether models can support medical billing and clinical administrative work [3] |
| Coding | SWE-bench Verified, LiveCodeBench, others | Production software-engineering and competitive-programming tasks [3] |
| Academic | GPQA Diamond, MMLU Pro | Graduate-level and broad multiple-choice reasoning [3] |
In the legal domain, Vals AI runs these model-level benchmarks alongside a separate, higher-profile study of commercial legal AI tools: the Vals Legal AI Report, abbreviated VLAIR. VLAIR is a recurring industry study, not a single benchmark dataset. The first VLAIR study, published in February 2025, was developed in partnership with the research firm Legaltech Hub and a consortium of law firms including Reed Smith, Fisher Phillips, McDermott Will and Emery, and Ogletree Deakins, with lawyers from those firms helping to design tasks and grade outputs. [9][10] That first iteration evaluated commercial systems from vendors including Thomson Reuters (CoCounsel), vLex (Vincent AI), Harvey, and Vecflow (Oliver) across tasks such as document question answering, summarization, redlining, and chronology generation. [9][10]
A follow-up VLAIR legal-research study, released in October 2025, drew wide attention for finding that AI systems could outperform human lawyers on legal-research accuracy. In a blind evaluation of 210 questions across nine legal-research task types, the report found human lawyers scored about 71 percent on accuracy while the AI systems scored roughly 79 to 81 percent, with the tools also answering far faster than lawyers' average response time. [11][12] The tested products in that round included Alexi, Counsel Stack, Midpage, and a general-purpose ChatGPT baseline, while several market leaders declined to participate or withdrew before publication, a fact noted by legal-technology commentators. [11][13] Vals AI has said it intends to repeat the legal studies annually, and it issued an open call in 2025 inviting additional vendors to participate. [9][14]
Vals AI's methodology centers on realistic, expert-authored questions and private datasets. Working with independent domain experts, the company assembles multiple-choice and open-ended questions that mirror professional work, then withholds those datasets from public release so they are unlikely to appear in model training data or be optimized against directly. [1][2]
Evaluations are designed to probe more than a single accuracy number. The company reports across multiple axes, including accuracy, latency, cost, and qualitative observations, and its more recent benchmarks increasingly test agentic capabilities such as tool use, multimodal reasoning, and long-context comprehension rather than only static question answering. [2][3] In its professional studies, grading is performed by human experts (lawyers and academics, in the case of the legal reports) using detailed rubrics. Results are frequently framed against a human professional baseline so that model performance can be read relative to the people whose work is being measured. [11][12] The legal-research study, for example, weighted accuracy at 50 percent, authoritativeness at 40 percent, and appropriateness at 10 percent, and explicitly compared AI scores against a panel of practicing lawyers. [11]
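The weighting reported for the legal-research study can be illustrated with a short sketch. Only the three weights (50, 40, and 10 percent) come from the sources cited above. The function name `weighted_score`, the 0 to 100 scale, the simple linear combination, and the example scores are assumptions made for illustration, and the study's actual aggregation procedure may differ.

```python
# Weights as reported for the legal-research study [11].
# Everything else in this sketch is an illustrative assumption.
WEIGHTS = {
    "accuracy": 0.50,
    "authoritativeness": 0.40,
    "appropriateness": 0.10,
}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (assumed 0-100) into one weighted score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    # Linear weighted sum: an assumed aggregation, not a documented one.
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical per-criterion scores, not figures from the report.
example = {"accuracy": 80.0, "authoritativeness": 70.0, "appropriateness": 90.0}
print(f"{weighted_score(example):.1f}")
```

Under this assumed scheme, accuracy contributes half of the final score, so a system that cites weak authorities can still rank well if its answers are mostly correct.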
Because some benchmarks remain proprietary, third parties cannot fully reproduce the underlying datasets; Vals AI positions this trade-off as the price of avoiding contamination, and it publishes methodology documentation and per-model scores to maintain transparency about how rankings are produced. [2][3]
Vals AI raised an early round of outside capital in 2024. Reporting around the company places its total funding at roughly $5 million, with a seed financing announced in mid-2024. [7] Listed backers include Bloomberg Beta, Pear VC, 8VC, and the European fund J12, with early involvement reported from a Sequoia scout investor. [6][7] Valuation figures for the round have not been publicly disclosed. The company has remained relatively small, operating with a lean team while expanding its benchmark catalogue and enterprise evaluation work. [7]
Vals AI is increasingly cited as a credible, independent reference point for organizations deciding which AI systems to deploy for professional work, and its results are referenced by media, enterprises, and the vendors it evaluates. An April 2024 write-up in The Batch highlighted Vals AI's early industry-specific leaderboards and reported that GPT-4 and Claude 3 Opus led its initial legal and financial benchmarks. [1] Bloomberg profiled the startup the same month as an attempt to measure how well AI models actually work on real tasks. [4] The October 2025 legal-research findings were covered extensively across legal-technology publications such as LawSites and Artificial Lawyer, and several vendors publicized their placements in the Vals Legal AI Report as independent validation of their products. [11][13][15]
Within the wider evaluation landscape, Vals AI is commonly discussed alongside Scale AI's SEAL leaderboards and academic efforts as part of a movement toward private, expert-graded benchmarks that are harder to game than open public test sets. [1][5] Its emphasis on regulated, high-stakes professional domains, and on comparing models to human professional baselines, distinguishes its work from general-purpose chatbot rankings and has made it a frequently cited authority in coverage of AI adoption in law, finance, and accounting. [4][11]