Vals AI

Updated Jun 8, 2026

Overview

Vals AI is an independent, third-party AI model evaluation company based in San Francisco that builds and publishes domain-specific benchmarks measuring how well large language models perform on real professional tasks. Rather than scoring models on generic academic quizzes, Vals AI concentrates on high-stakes knowledge work in fields such as law, tax, finance, and healthcare. It often constructs its evaluation sets with domain experts and, for its legal studies, with practicing lawyers from a consortium of firms. The company maintains public leaderboards that track the leading frontier models over time and runs additional private evaluations for AI labs and enterprise engineering teams. ^[1]^[2]^[3]

Vals AI describes itself as an "independent platform committed to advancing the future of Gen AI through unbiased benchmarks and scalable evaluation infrastructure for labs and engineering teams." ^[2] It has been covered as an emerging evaluation authority by outlets including Bloomberg and Andrew Ng's The Batch, and it is frequently grouped with other proprietary, expert-graded benchmarking efforts such as Scale AI's SEAL leaderboards as part of a broader shift toward private, harder-to-game model evaluation. ^[1]^[4]^[5]

Founding and mission

Vals AI was founded by Rayan Krishnan and Langston Nashold, who left Stanford University's master's program in artificial intelligence to start the company, joined by founding engineer Rez Havaei. ^[4]^[6] Krishnan serves as chief executive officer and Nashold as chief technology officer. The company was incorporated and raised its initial outside capital in 2024, and it remains headquartered in San Francisco. ^[6]^[7] The founders retained close ties to Stanford researchers and recruited industry specialists in accounting, law, and finance to help design evaluations. ^[4]

The founders started Vals AI after concluding that the popular public benchmarks used to rank models did not reflect the work that professionals actually do. In the company's framing, three problems undermine conventional leaderboards: many widely cited benchmarks rely on "contrived academic datasets" rather than real industry tasks; public test sets suffer from data contamination once they are absorbed into model training corpora; and vendor-published results lack objectivity because companies can cherry-pick favorable examples. ^[2] Vals AI's stated answer is to commission custom, professionally grounded benchmarks, keep the underlying datasets private so they cannot leak into training data, and act as a neutral third party that reports results across multiple dimensions including accuracy, cost, and latency. ^[2]

Reported founding dates vary slightly between sources, with some listing 2023 and others 2024; the company's first outside funding and its earliest published leaderboards both date to 2024. ^[6]^[7]

The professional benchmarks

Vals AI runs a growing catalogue of evaluations spanning legal, financial, tax, medical, coding, and general academic domains. Some are public-facing leaderboards intended as free reference points for model comparison, although the underlying question sets for many of them are held privately to preserve evaluation integrity. The company also publishes composite measures, including a "Vals Index" that weights performance across finance and coding tasks as a proxy for economic impact. ^[3]

The table below summarizes representative benchmarks by domain. Benchmark line-ups and version numbers change as the company refreshes them.

Domain	Benchmark	What it tests
Legal	CaseLaw	Private question-and-answer set drawn from court cases (a later version uses Canadian case law) ^[3]
Legal	ContractLaw	Contract analysis, retrieval, editing, and compliance ^[1]
Legal	LegalBench	An open-source legal-reasoning benchmark that Vals AI tracks on its leaderboard ^[1]
Finance	CorpFin	Reasoning over long-context commercial credit agreements ^[1]^[5]
Finance	Finance Agent	Entry-level financial-analyst tasks answered from public company filings ^[3]^[8]
Tax	TaxEval	Tax calculation and accounting knowledge ^[1]
Tax	MortgageTax	Reading and interpreting tax certificates presented as images ^[3]
Healthcare	MedQA	Medical question answering, with attention to model bias ^[3]
Healthcare	MedCode / MedScribe	Whether models can support medical billing and clinical administrative work ^[3]
Coding	SWE-bench Verified, LiveCodeBench, others	Production software-engineering and competitive-programming tasks ^[3]
Academic	GPQA Diamond, MMLU Pro	Graduate-level and broad multiple-choice reasoning ^[3]

In the legal domain, Vals AI runs these model-level benchmarks alongside a separate, higher-profile study of commercial legal AI tools: the Vals Legal AI Report, abbreviated VLAIR. VLAIR is a recurring industry study rather than a single benchmark dataset. The first VLAIR study, published in February 2025, was developed in partnership with the research firm Legaltech Hub and a consortium of law firms including Reed Smith, Fisher Phillips, McDermott Will and Emery, and Ogletree Deakins, with lawyers from those firms helping to design tasks and grade outputs. ^[9]^[10] That first iteration evaluated commercial systems from vendors including Thomson Reuters (CoCounsel), vLex (Vincent AI), Harvey, and Vecflow (Oliver) across tasks such as document question answering, summarization, redlining, and chronology generation. ^[9]^[10]

A follow-up VLAIR legal-research study, released in October 2025, drew wide attention for finding that AI systems could outperform human lawyers on legal-research accuracy. In a blind evaluation of 210 questions across nine legal-research task types, the report found human lawyers scored about 71 percent on accuracy while the AI systems scored roughly 79 to 81 percent, with the tools also answering far faster than lawyers' average response time. ^[11]^[12] The tested products in that round included Alexi, Counsel Stack, Midpage, and a general-purpose ChatGPT baseline, while several market leaders declined to participate or withdrew before publication, a fact noted by legal-technology commentators. ^[11]^[13] Vals AI has said it intends to repeat the legal studies annually, and it issued an open call in 2025 inviting additional vendors to participate. ^[9]^[14]

Methodology

Vals AI's methodology centers on realistic, expert-authored questions and private datasets. Working with independent domain experts, the company assembles multiple-choice and open-ended questions that mirror professional work, then withholds those datasets from public release so they are unlikely to appear in model training data or be optimized against directly. ^[1]^[2]

Evaluations are designed to probe more than a single accuracy number. The company reports across multiple axes, including accuracy, latency, cost, and qualitative observations, and its more recent benchmarks increasingly test agentic capabilities such as tool use, multimodal reasoning, and long-context comprehension rather than only static question answering. ^[2]^[3] In its professional studies, grading is performed by human experts (lawyers and academics, in the case of the legal reports) using detailed rubrics. Results are frequently framed against a human professional baseline so that model performance can be read relative to the people whose work is being measured. ^[11]^[12] The legal-research study, for example, weighted accuracy at 50 percent, authoritativeness at 40 percent, and appropriateness at 10 percent, and explicitly compared AI scores against a panel of practicing lawyers. ^[11]
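
The weighting described for the legal-research study can be illustrated with a minimal sketch. Only the three weights (50, 40, and 10 percent) come from the reporting cited above; the function name, the 0 to 100 scale, and the sample inputs are assumptions made for illustration and are not Vals AI's actual scoring code or data.

```python
# Weights as quoted for the legal-research study: accuracy 50 percent,
# authoritativeness 40 percent, appropriateness 10 percent.
WEIGHTS = {
    "accuracy": 0.50,
    "authoritativeness": 0.40,
    "appropriateness": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores into a single weighted score.

    Assumes every criterion is on the same scale (here 0-100); the scale
    is an assumption, not something the source specifies.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical inputs for illustration only; not figures from the report.
# Arithmetic: 0.5 * 80 + 0.4 * 70 + 0.1 * 90 = 77
example = {"accuracy": 80.0, "authoritativeness": 70.0, "appropriateness": 90.0}
print(round(weighted_score(example), 2))
```

Because accuracy carries half the total weight, a ten-point change in accuracy moves the combined score by five points, while the same change in appropriateness moves it by only one.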

Because some benchmarks remain proprietary, third parties cannot fully reproduce the underlying datasets; Vals AI positions this trade-off as the price of avoiding contamination, and it publishes methodology documentation and per-model scores to maintain transparency about how rankings are produced. ^[2]^[3]

Funding

Vals AI raised an early round of outside capital in 2024. Reporting around the company places its total funding at roughly 5 million dollars, with a seed financing announced in mid-2024. ^[7] Listed backers include Bloomberg Beta, Pear VC, 8VC, and the European fund J12, with early involvement reported from a Sequoia scout investor. ^[6]^[7] Exact valuation figures for the round have not been publicly disclosed. The company has remained relatively small, operating with a lean team while expanding its benchmark catalogue and enterprise evaluation work. ^[7]

Standing and reception

Vals AI is increasingly cited as a credible, independent reference point for organizations deciding which AI systems to deploy for professional work, and its results are referenced by media, enterprises, and the vendors it evaluates. An April 2024 write-up in The Batch highlighted Vals AI's early industry-specific leaderboards and reported that GPT-4 and Claude 3 Opus led its initial legal and financial benchmarks. ^[1] Bloomberg profiled the startup the same month as an attempt to measure how well AI models actually work on real tasks. ^[4] The October 2025 legal-research findings were covered extensively across legal-technology publications such as LawSites and Artificial Lawyer, and several vendors publicized their placements in the Vals Legal AI Report as independent validation of their products. ^[11]^[13]^[15]

Within the wider evaluation landscape, Vals AI is commonly discussed alongside Scale AI's SEAL leaderboards and academic efforts as part of a movement toward private, expert-graded benchmarks that are harder to game than open public test sets. ^[1]^[5] Its emphasis on regulated, high-stakes professional domains, and on comparing models to human professional baselines, distinguishes its work from general-purpose chatbot rankings and has made it a frequently cited authority in coverage of AI adoption in law, finance, and accounting. ^[4]^[11]

References

1. The Batch (DeepLearning.AI). "Vals AI Evaluates Large Language Models on Industry-Specific Tasks." https://www.deeplearning.ai/the-batch/vals-ai-evaluates-large-language-models-on-industry-specific-tasks/
2. Vals AI. "About." https://www.vals.ai/about
3. Vals AI. "Benchmarks." https://www.vals.ai/benchmarks
4. Bloomberg. "This Startup is Trying to Test How Well AI Models Actually Work." April 11, 2024. https://www.bloomberg.com/news/newsletters/2024-04-11/this-startup-is-trying-to-test-how-well-ai-models-actually-work
5. The Batch (DeepLearning.AI). "Private Benchmarks for Fairer Tests" (Scale AI SEAL leaderboards). https://www.deeplearning.ai/the-batch/private-benchmarks-for-fairer-tests/
6. Tracxn. "Vals AI - Company Profile, Founders and Board of Directors." https://tracxn.com/d/companies/valsai/__Aeq7C2n56rLfgsWjTrbrPIwiOteKWlKhIWYPbapLuow
7. Crunchbase. "Vals.ai - Company Profile, Funding, Financials and Investors." https://www.crunchbase.com/organization/vals-ai
8. Vals AI. "Finance Agent v2." https://www.vals.ai/benchmarks/fabv2
9. LawSites. "Vals AI Issues Open Call for Vendors to Participate In Its Legal Research and Other Legal AI Benchmarking Studies." May 2025. https://www.lawnext.com/2025/05/vals-ai-issues-open-call-for-vendors-to-participate-in-its-legal-research-and-other-legal-ai-benchmarking-studies.html
10. Artificial Lawyer. "Vals Publishes Results of First Legal AI Benchmark Study." February 27, 2025. https://www.artificiallawyer.com/2025/02/27/vals-publishes-results-of-first-legal-ai-benchmark-study/
11. LawSites. "Vals AI's Latest Benchmark Finds Legal and General AI Now Outperform Lawyers in Legal Research Accuracy." October 2025. https://www.lawnext.com/2025/10/vals-ais-latest-benchmark-finds-legal-and-general-ai-now-outperform-lawyers-in-legal-research-accuracy.html
12. Vals AI. "Vals Legal AI Report (VLAIR)." https://www.vals.ai/vlair
13. Legal IT Insider. "Vals AI's benchmarking report for legal research is out - but the market leaders are absent." October 16, 2025. https://legaltechnology.com/2025/10/16/vals-ais-benchmarking-report-for-legal-research-is-out-but-the-market-leaders-are-absent/
14. Artificial Lawyer. "Vals Legal AI Research Eval - The Aftermath." October 20, 2025. https://www.artificiallawyer.com/2025/10/20/vals-legal-ai-research-eval-the-aftermath/
15. Maryland State Bar Association. "AI vs. Attorneys: Insights from the Vals Legal AI Report." https://www.msba.org/site/site/content/News-and-Publications/News/General-News/AI_vs._Attorneys_Insights_from_the_Vals_Legal_AI_Report.aspx
