# LegalBench

> Source: https://aiwiki.ai/wiki/legalbench
> Updated: 2026-06-10
> Categories: AI Benchmarks, Large Language Models, Legal AI, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**LegalBench** is a collaboratively constructed benchmark for evaluating legal reasoning in [large language models](/wiki/large_language_model) (LLMs). Developed by Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, and 36 additional co-authors from institutions including Stanford University, the University of Chicago, Harvard Law School, Georgetown University, and many others, LegalBench consists of 162 tasks spanning six categories of legal reasoning.[1] The benchmark was published in the [NeurIPS](/wiki/neurips) 2023 Datasets and Benchmarks track[1] and is publicly available on [Hugging Face](/wiki/hugging_face) and GitHub.[6][7]

LegalBench emerged from an open science effort that brought together legal professionals, law professors, computational legal scholars, and legal impact organizations.[2] Unlike benchmarks built exclusively by computer scientists, LegalBench draws directly on the expertise of practitioners who designed tasks that reflect real-world legal reasoning patterns. The benchmark covers a range of legal domains including contracts, evidence, civil procedure, corporate governance, immigration, privacy policy, and statutory interpretation.[1] It provides a common vocabulary for researchers and legal professionals to discuss [AI](/wiki/artificial_intelligence) performance in law.

## Background and Motivation

The application of LLMs to legal tasks has attracted significant attention from both the AI research community and the legal profession. Models such as [GPT-4](/wiki/gpt-4) have demonstrated the ability to pass the Uniform Bar Exam,[4] and law firms have begun exploring the use of [AI assistants](/wiki/ai_agents) for contract review, legal research, and document drafting. However, before LegalBench, there was no comprehensive benchmark specifically designed to measure the kinds of reasoning that lawyers and legal scholars consider central to legal practice.

Existing NLP benchmarks such as [GLUE](/wiki/glue_benchmark), [SuperGLUE](/wiki/superglue), and [MMLU](/wiki/mmlu)[5] include some questions related to law, but they treat legal reasoning as just one subject among many and do not decompose it into the distinct reasoning types that legal education emphasizes. Legal reasoning is not a single monolithic skill. It involves identifying relevant issues, recalling applicable rules, applying those rules to specific facts, drawing conclusions, interpreting statutory and contractual language, and understanding rhetorical strategies in legal arguments.

The creators of LegalBench sought to fill this gap by building a benchmark grounded in the IRAC framework, a foundational methodology in legal education.[1] IRAC stands for Issue, Rule, Application, and Conclusion, and it describes the sequential process by which lawyers analyze legal problems. By mapping benchmark tasks to these components (and adding interpretation and rhetorical understanding as additional categories), LegalBench enables a fine-grained diagnosis of where LLMs succeed and where they fall short in legal reasoning.[1]

LegalBench began as an open prototype. In September 2022, Guha, Nyarko, Ho, and Ré published a short paper that proposed IRAC as an organizing framework for evaluating foundation models on legal reasoning, seeded the project with an initial set of 44 tasks, and invited the legal and computer science communities to contribute additional tasks.[11] The benchmark grew to its full 162 tasks over the following year.[1]

## Construction and Methodology

### Collaborative Task Design

LegalBench was assembled through a distinctive crowdsourcing process within the legal community. Rather than having computer scientists design all tasks, the project invited contributions from 40 individuals spanning diverse legal backgrounds: practicing attorneys, law professors, computational legal researchers, legal impact labs, and law students.[1] Contributors were asked to propose tasks that they found either "interesting" or "practically useful" for evaluating LLMs.[1][6]

This approach produced tasks that reflect the authentic concerns of legal practitioners. Some tasks test whether a model can classify evidence under the Federal Rules of Evidence. Others evaluate whether a model can identify specific clauses in merger agreements or determine whether a privacy policy discloses a particular data practice. The result is a benchmark that covers not just textbook legal questions but also the kinds of practical analysis that lawyers perform daily.

### Task Format and Types

Each LegalBench task pairs an input with an expected output.[1] The benchmark includes several task types:

| Task Type | Description | Example |
|---|---|---|
| Binary Classification | The model must answer "Yes" or "No" to a legal question | Determining whether a statement constitutes hearsay |
| Multi-Class Classification | The model must select from several possible labels | Classifying the type of contractual clause |
| Extraction | The model must identify and extract specific information | Extracting defined terms from Supreme Court opinions |
| Generation | The model must produce a free-text response | Generating the legal rule for a given issue |
| Entailment | The model must determine whether one statement follows from another | Evaluating whether a contract clause entails a particular obligation |

A substantial portion of the tasks are binary classification tasks, which require the LLM to determine whether a piece of legal text possesses a certain attribute.[1] This design choice reflects both the prevalence of yes/no determinations in legal practice and the practical need for tasks that can be scored objectively and at scale.

### Data Sources

LegalBench draws on 36 different data sources.[1] Several established legal NLP datasets were restructured and incorporated into the benchmark:

- **CUAD (Contract Understanding Atticus Dataset):** Originally designed for span-prediction extraction over long-form contracts,[9] CUAD was restructured into binary classification tasks for each type of contractual clause. LegalBench includes 38 tasks derived from CUAD, covering clauses such as non-compete provisions, IP ownership, termination rights, and exclusivity.[1]
- **MAUD (Mergers and Acquisitions Understanding Dataset):** MAUD tasks require answering multiple-choice questions about merger and acquisition agreements. These tasks proved particularly challenging for models, with [GPT-4](/wiki/gpt-4) averaging only 47.8% balanced accuracy on MAUD tasks.[1] LegalBench includes 34 tasks derived from MAUD.[1]
- **Contract NLI:** The Contract Natural Language Inference dataset provides tasks in which a model must determine whether a hypothesis about a confidentiality agreement is entailed, contradicted, or not mentioned by the agreement.[10] LegalBench includes 14 tasks derived from this dataset.[1]
- **OPP-115:** Nine tasks derived from the Online Privacy Policies dataset, which require models to classify privacy policy practices.[1]
- **Learned Hands:** Sixteen tasks drawn from the Learned Hands labeling platform, a collaboration between the Stanford Legal Design Lab and the Suffolk LIT Lab.[1][8] These tasks present short (one to three paragraph) descriptions of legal situations posted by real people on the Reddit forum r/legaladvice and ask the model to identify which area of law is implicated (e.g., immigration, housing, family law, employment).[1]
- **Supply Chain Disclosure:** Ten tasks that require models to classify corporate supply chain disclosures, each one to two pages in length.[1]
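
The CUAD restructuring described in the list above, from span annotations to one binary task per clause type, can be sketched as follows. This is a minimal illustration and not the LegalBench authors' conversion code: the function name `to_binary_task`, the input schema (a clause text paired with the set of clause types annotated on it), and the example clauses are all assumptions made for this sketch.

```python
from typing import Iterable, List, Set, Tuple


def to_binary_task(
    annotated_clauses: Iterable[Tuple[str, Set[str]]],
    target_type: str,
) -> List[Tuple[str, str]]:
    """Build Yes/No examples for a single clause type.

    Each input item pairs a clause's text with the set of clause types
    annotated on it (hypothetical schema). A clause is labeled "Yes" when
    the target type is among its annotations and "No" otherwise.
    """
    return [
        (text, "Yes" if target_type in types else "No")
        for text, types in annotated_clauses
    ]


# Made-up clauses, used only to show the shape of the output.
clauses = [
    ("Neither party may assign this Agreement without prior consent.",
     {"anti-assignment"}),
    ("This Agreement renews automatically for successive one-year terms.",
     {"renewal term"}),
]
anti_assignment_task = to_binary_task(clauses, "anti-assignment")
```

Running the same function once per clause type yields one binary task per type, which mirrors the per-clause structure of the CUAD-derived tasks.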

In addition to these incorporated datasets, many tasks were designed from scratch by legal professionals specifically for LegalBench. These original tasks cover areas such as hearsay classification, statutory interpretation, the definition of legal terms, and the identification of rhetorical strategies in judicial opinions.

### Quality Assurance

Each contributed task underwent a review process in which it was verified for correctness and clarity. Contributors provided annotated examples that served as ground truth labels. The benchmark specifies potential prompts, in-context demonstrations, and evaluation metrics for each task.[1] This standardization ensures that evaluations across different research groups are comparable.

## Task Taxonomy: Six Categories of Legal Reasoning

LegalBench organizes its 162 tasks into six categories, each corresponding to a distinct type of legal reasoning.[1] This taxonomy is grounded in the IRAC framework used in legal education, extended with two additional categories (interpretation and rhetorical understanding) to capture reasoning types that fall outside the traditional IRAC sequence.[1]

### Issue-Spotting (16 Tasks)

Issue-spotting tasks require the model to determine whether a given set of facts raises a particular legal question, implicates a specific area of law, or is relevant to a certain party.[1] In legal education, issue-spotting is often the first step in analyzing a problem: before a lawyer can apply a rule, they must first recognize which legal issues are present.

For example, the Learned Hands tasks fall into this category. Given a short description of a person's legal situation, the model must determine whether the facts are relevant to a particular legal domain such as housing, immigration, employment, or consumer protection.[1] Issue-spotting tasks test a model's ability to map factual scenarios onto the correct legal framework.

### Rule-Recall (5 Tasks)

Rule-recall tasks require the model to generate or identify the correct legal rule on a given issue within a specific jurisdiction.[1] For instance, a task might ask the model to state the rule for hearsay under the U.S. Federal Rules of Evidence, or to answer whether the law in a particular jurisdiction permits or prohibits a specific action.

These tasks evaluate whether a model has internalized the substantive content of legal rules during [pre-training](/wiki/pre-training). Because LLMs are trained on large corpora that include legal texts, statutes, and case law, they may have encoded legal rules in their parameters. Rule-recall tasks measure how accurately they can retrieve this knowledge.

### Rule-Application (16 Tasks)

Rule-application tasks evaluate whether the model can explain its reasoning in a way that demonstrates correct legal inferences.[1] This goes beyond simply knowing the rule or reaching the right conclusion. In legal practice, a lawyer must show their work by demonstrating how the facts of a particular case map onto the elements of a legal rule.

These tasks test whether a model can perform the analytical step of connecting specific facts to the abstract requirements of a legal standard. For example, a rule-application task might ask whether the model can explain why a particular piece of evidence qualifies as hearsay by identifying the relevant elements (an out-of-court statement offered to prove the truth of the matter asserted).

### Rule-Conclusion (16 Tasks)

Rule-conclusion tasks require the model to determine the correct legal outcome when given a set of facts and an applicable rule.[1] This is the final step in the IRAC framework: having identified the issue, recalled the rule, and applied it to the facts, the lawyer must state the conclusion.

These tasks test the model's ability to synthesize information and render a determination. The model must integrate the factual context with the legal standard and reach a reasoned outcome. Rule-conclusion tasks differ from rule-application tasks in that the focus is on the correctness of the final answer rather than the quality of the intermediate reasoning.[1]

### Interpretation (119 Tasks)

Interpretation tasks, the largest category by far, require the model to parse, understand, and classify legal language.[1] These tasks cover a wide range of legal documents including statutes, contracts, judicial opinions, privacy policies, and corporate disclosures.

Many of the tasks derived from CUAD, MAUD, Contract NLI, OPP-115, and Supply Chain Disclosure fall into this category. A typical interpretation task might present a contract clause and ask the model to determine whether it contains a non-compete provision, or present a privacy policy paragraph and ask whether it describes the collection of user location data.

The dominance of interpretation tasks (119 out of 162) reflects the reality of legal practice, where a large portion of a lawyer's work involves reading, understanding, and categorizing legal text.[1] It also reflects the availability of established datasets in this area that could be restructured for inclusion in LegalBench.

### Rhetorical Understanding (10 Tasks)

Rhetorical-understanding tasks require the model to reason about the structure and strategy of legal arguments and analysis.[1] Rather than determining what the law says (interpretation) or what outcome it dictates (rule-conclusion), these tasks evaluate whether the model understands how legal arguments are constructed and what function specific passages serve within a legal document.

For example, a rhetorical-understanding task might present a passage from a judicial opinion and ask the model to identify whether the passage is stating the holding, summarizing the facts, distinguishing a prior case, or presenting a policy rationale. These tasks test a higher-order understanding of legal discourse that goes beyond surface-level comprehension.

### Summary of Task Distribution

| Category | Number of Tasks | Focus |
|---|---|---|
| Issue-Spotting | 16 | Identifying which legal questions a set of facts raises |
| Rule-Recall | 5 | Generating or identifying correct legal rules |
| Rule-Application | 16 | Explaining legal reasoning with correct inferences |
| Rule-Conclusion | 16 | Determining the correct legal outcome |
| Interpretation | 119 | Parsing and classifying legal language and documents |
| Rhetorical Understanding | 10 | Analyzing the structure and strategy of legal arguments |
| **Total (unique tasks)** | **162** | |

These category figures include overlap: a task can be counted under more than one type of reasoning, and the paper notes that the same 12 tasks are used for both rule-application and rule-conclusion, with rule-conclusion scoring only the predicted outcome while rule-application grades the quality of the generated explanation.[1] This is why the per-category counts sum to more than the benchmark's 162 unique tasks.
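
The arithmetic behind this overlap can be checked directly from the counts in the table. The snippet below only restates figures already given in this article; the variable names are chosen for illustration.

```python
# Per-category task counts as listed in the summary table.
category_counts = {
    "issue-spotting": 16,
    "rule-recall": 5,
    "rule-application": 16,
    "rule-conclusion": 16,
    "interpretation": 119,
    "rhetorical-understanding": 10,
}

unique_tasks = 162  # size of the benchmark

# Sum of the per-category counts (tasks in several categories count repeatedly).
listed_total = sum(category_counts.values())

# Number of category memberships beyond one per unique task.
extra_memberships = listed_total - unique_tasks

print(listed_total, extra_memberships)
```

The per-category counts add up to 182, so 20 category memberships are accounted for by tasks that appear under more than one reasoning type.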

## Models Evaluated

The original LegalBench paper presented an empirical evaluation of 20 LLMs from 11 model families.[1] The models spanned a wide range of sizes and architectures, from open-source models with fewer than 3 billion parameters (OPT-2.7B) to the largest commercial models available at the time of evaluation.

### Commercial Models

| Model | Developer | Notes |
|---|---|---|
| [GPT-4](/wiki/gpt-4) | [OpenAI](/wiki/openai) | Largest and best-performing commercial model in the evaluation |
| GPT-3.5 (text-davinci-003) | [OpenAI](/wiki/openai) | Instruction-tuned model from the GPT-3.5 family |
| Claude 1 (v1.3) | [Anthropic](/wiki/anthropic) | First-generation Claude model |

### Open-Source Models

| Model | Developer | Parameter Count |
|---|---|---|
| [Flan-T5](/wiki/t5)-XXL | Google | 11B |
| [Flan-T5](/wiki/t5)-XL | Google | 3B |
| [LLaMA](/wiki/llama)-2-13B | [Meta](/wiki/meta_ai) | 13B |
| [LLaMA](/wiki/llama)-2-7B | [Meta](/wiki/meta_ai) | 7B |
| Vicuna-13B-16k | LMSYS | 13B |
| Vicuna-7B-16k | LMSYS | 7B |
| OPT-13B | [Meta](/wiki/meta_ai) | 13B |
| OPT-6.7B | [Meta](/wiki/meta_ai) | 6.7B |
| OPT-2.7B | [Meta](/wiki/meta_ai) | 2.7B |
| INCITE-Instruct-7B | Together | 7B |
| INCITE-Base-7B | Together | 7B |
| INCITE-Instruct-3B | Together | 3B |
| [Falcon](/wiki/falcon)-7B-Instruct | TII | 7B |
| MPT-7B-8k-Instruct | MosaicML | 7B |
| WizardLM-13B | WizardLM | 13B |
| BLOOM-7B | BigScience | 7B |
| BLOOM-3B | BigScience | 3B |

## Evaluation Methodology

### Prompting Setup

Each model was evaluated using a [few-shot](/wiki/few-shot_learning) learning setup. For each task, the researchers manually wrote instructions and selected between zero and eight examples from the training split to serve as in-context demonstrations.[1] The number of examples varied depending on data availability and input sequence length constraints.
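
A few-shot prompt of this kind can be assembled as in the sketch below. The template (`Text:` / `Answer:` fields), the function name `build_few_shot_prompt`, and the sample instruction are assumptions for illustration; the actual LegalBench prompts are task-specific and are distributed with the benchmark.

```python
from typing import List, Sequence, Tuple

MAX_DEMONSTRATIONS = 8  # the paper used between zero and eight per task


def build_few_shot_prompt(
    instruction: str,
    demonstrations: Sequence[Tuple[str, str]],
    query: str,
) -> str:
    """Join an instruction, labeled demonstrations, and the query into one prompt.

    Demonstrations are (input_text, label) pairs taken from a task's
    training split. The prompt ends with an open "Answer:" field for
    the model to complete.
    """
    if len(demonstrations) > MAX_DEMONSTRATIONS:
        raise ValueError("at most eight in-context demonstrations are expected")

    parts: List[str] = [instruction.strip()]
    for text, label in demonstrations:
        parts.append(f"Text: {text}\nAnswer: {label}")
    parts.append(f"Text: {query}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical instruction and examples, not taken from the benchmark.
prompt = build_few_shot_prompt(
    "Does the text describe hearsay? Respond with Yes or No.",
    [("Example statement A", "Yes"), ("Example statement B", "No")],
    "Example statement C",
)
```

Passing an empty demonstration list produces a zero-shot prompt with the same layout, which keeps the format stable across tasks with different amounts of training data.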

Prompts were designed to be as clear and unambiguous as possible. For binary classification tasks, models were instructed to respond with a single word ("Yes" or "No").[1] However, some models, notably Claude 1, had difficulty following these formatting constraints and would produce paragraph-length explanations instead of single-word answers, complicating automated evaluation.[1]
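
The scoring problem this creates can be illustrated with a strict output parser. The rule below (accept an answer only when the response begins with "Yes" or "No") is an assumption made for this sketch; the article does not describe how the LegalBench authors normalized verbose outputs.

```python
import re
from typing import Optional

# Matches a response whose first word is "yes" or "no" (case-insensitive).
_LEADING_ANSWER = re.compile(r"\s*(yes|no)\b", flags=re.IGNORECASE)


def extract_yes_no(output: str) -> Optional[str]:
    """Return "Yes" or "No" if the output starts with that word, else None.

    Responses that open with an explanation instead of a one-word answer
    return None, so the caller can count them as format failures
    separately from wrong answers.
    """
    match = _LEADING_ANSWER.match(output)
    if match is None:
        return None
    return match.group(1).capitalize()
```

Tracking unparseable responses separately from incorrect ones makes it possible to distinguish a model that reasons poorly from one that simply ignores the formatting instruction.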

### Metrics

The primary evaluation metric for most tasks was balanced accuracy, chosen to account for class imbalance in the binary classification tasks.[1] A few tasks that required the model to produce multiple classes or extract multiple terms per instance used F1 score instead.[1] This standardization of metrics across the benchmark allows for meaningful comparisons between models and across tasks.
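
Balanced accuracy is the mean of per-class recall, so a model that always predicts the majority class scores no better than chance. The sketch below is a plain-Python implementation of that standard definition; the function name and the sample labels are illustrative, and this is not the benchmark's own evaluation code.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the recall obtained on each class present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    correct = defaultdict(int)  # correctly predicted examples per true class
    total = defaultdict(int)    # number of examples per true class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy data: 8 "Yes" and 2 "No"; the model always answers "Yes".
y_true = ["Yes"] * 8 + ["No"] * 2
y_pred = ["Yes"] * 10
score = balanced_accuracy(y_true, y_pred)
```

On this toy data plain accuracy would be 80%, while balanced accuracy is 50%, which is the chance level for a two-class task. For a task with k classes the chance level is 1/k, which is useful context when reading scores on the multi-class tasks.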

### Evaluation Scope

The evaluation focused on measuring model performance in a standardized, reproducible manner. Each task specifies its associated evaluation metric, allowing future researchers to add new models to the comparison without ambiguity about how performance should be measured.[1]

## Key Findings

### GPT-4 Leads but Does Not Dominate

[GPT-4](/wiki/gpt-4) was the strongest overall performer in the original evaluation, outperforming all other models across most task categories.[1] On issue-spotting tasks, GPT-4 led GPT-3.5 and Claude 1 by a margin of over 20 percentage points.[1] However, GPT-4's advantage was not uniform. On certain task groups, the gap between GPT-4 and smaller models narrowed considerably.

### Performance Varies Dramatically Across Tasks

One of the most important findings from LegalBench is that a given model's performance can vary dramatically across different legal tasks. This suggests that aggregate scores across all 162 tasks can be misleading. A model that performs well on contract clause classification may struggle with merger agreement analysis or statutory interpretation.

For example, on Supply Chain Disclosure tasks (requiring classification of one to two page corporate disclosures), large commercial models averaged between 74% and 75% balanced accuracy.[1] But on MAUD tasks (involving multiple-choice questions about merger agreements), GPT-4's balanced accuracy dropped to just 47.8%, barely above chance for some multi-class tasks.[1] In contrast, on certain Learned Hands issue-spotting tasks (such as immigration, traffic, and estate law), GPT-4 achieved balanced accuracy above 95%.[1]

### Open-Source Models Show Surprising Strengths

Despite their smaller parameter counts, some open-source models demonstrated competitive performance on specific categories. Flan-T5-XXL, with 11 billion parameters, outperformed both GPT-3.5 and Claude 1 on two categories: issue-spotting and rhetorical understanding.[1] This finding suggests that, for certain types of legal reasoning, model architecture and training methodology can matter as much as raw scale.

Within model families, larger variants generally outperformed smaller ones. Flan-T5-XXL outperformed Flan-T5-XL across all five evaluated reasoning categories.[1] Similarly, LLaMA-2-13B outperformed LLaMA-2-7B on average across four reasoning categories.[1]

### GPT-3.5 and Claude 1 Perform Similarly

GPT-3.5 and Claude 1 were closely matched in overall performance, separated by an average gap of only about 2 percentage points.[1] This near-parity suggests that at the commercial model tier just below GPT-4, legal reasoning performance was relatively consistent across providers.

### Application and Conclusion Steps Are Hardest

Consistent with the structure of the IRAC framework, models found the "Application" and "Conclusion" stages of legal reasoning more difficult than issue-spotting or rule-recall. Performance sharply degraded on tasks requiring the model to integrate factual details with legal rules and synthesize a reasoned determination. This pattern suggests that LLMs are better at recognizing and retrieving legal knowledge than they are at performing the kind of context-specific analysis and synthesis that characterizes skilled legal reasoning.

### Prompt Sensitivity

Model performance showed notable sensitivity to the choice of in-context examples and the phrasing of task instructions.[1] Using plain-language descriptions of tasks sometimes outperformed more technical legal language.[1] This finding has practical implications for how legal AI tools should be designed: the way a question is framed can significantly affect the quality of the model's response.

## Notable Task Groups

### CUAD Tasks

The Contract Understanding Atticus Dataset tasks make up the largest single source of LegalBench tasks.[1] CUAD was originally designed for contract review as a span-extraction dataset;[9] for LegalBench, it was restructured into binary classification format.[1] Each task focuses on a specific type of contractual clause, such as:

- Anti-assignment provisions
- Audit rights
- Change of control clauses
- Exclusivity provisions
- IP ownership
- Liquidated damages
- Non-compete clauses
- Renewal terms
- Termination for convenience

These tasks are practically relevant because contract review is one of the most common applications of legal AI. The restructured CUAD tasks in LegalBench allow researchers to evaluate how well a model can identify specific provisions without requiring full span prediction.

### MAUD Tasks

The Mergers and Acquisitions Understanding Dataset tasks are among the most challenging in LegalBench. They require models to answer detailed questions about real merger agreements, testing the model's ability to understand complex transactional documents. The low performance of even GPT-4 on these tasks (47.8% average balanced accuracy)[1] highlights the difficulty of M&A document analysis and suggests that this area represents a significant frontier for legal AI.

### Learned Hands Tasks

The Learned Hands tasks bring an access-to-justice dimension to LegalBench. Derived from the Learned Hands platform (a collaboration between the Stanford Legal Design Lab and the Suffolk Legal Innovation and Technology Lab), these tasks use real descriptions posted by individuals seeking legal help.[8] The model must classify each description into the correct legal domain.

These tasks are especially relevant for legal aid organizations and court self-help programs, where automated issue-spotting could help route people to the correct resources more quickly. Performance on Learned Hands tasks varied widely by legal domain, with models achieving over 95% balanced accuracy on some topics (immigration, traffic law) but struggling on others (torts, consumer protection).[1]

### Supply Chain Disclosure Tasks

These ten tasks evaluate whether models can accurately classify corporate supply chain disclosures under laws such as the California Transparency in Supply Chains Act.[1] The documents are typically one to two pages long, making them among the longer inputs in LegalBench. Large commercial models achieved 74% to 75% balanced accuracy on these tasks, indicating room for improvement.[1]

## The IRAC Framework and Legal AI

LegalBench's taxonomy is rooted in the IRAC (Issue, Rule, Application, Conclusion) framework, which is one of the most widely taught methods of legal analysis in law schools. Understanding this connection is important for interpreting LegalBench results.

In legal education, students learn to analyze problems by first identifying the **issue** (what legal question is presented), then stating the **rule** (what law governs the issue), then **applying** the rule to the specific facts, and finally reaching a **conclusion**. This sequential process mirrors how experienced lawyers approach unfamiliar legal problems.

LegalBench maps its task categories onto this framework: issue-spotting corresponds to "Issue," rule-recall corresponds to "Rule," rule-application corresponds to "Application," and rule-conclusion corresponds to "Conclusion."[1] The additional categories of interpretation and rhetorical understanding capture reasoning types that operate alongside or in support of the IRAC process.[1]

This mapping creates a diagnostic tool. When a model performs well on rule-recall but poorly on rule-application, researchers can infer that the model has encoded legal knowledge but struggles to deploy it in context-specific analysis. When a model performs well on issue-spotting but poorly on rule-conclusion, the bottleneck may lie in the synthesis of multiple reasoning steps rather than in any single component.
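
This diagnostic reading can be expressed as a small helper that flags the IRAC stages on which a model trails its own best stage. The function name `lagging_stages`, the 10-point gap threshold, and the scores used below are all hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List

IRAC_STAGES = [
    "issue-spotting",
    "rule-recall",
    "rule-application",
    "rule-conclusion",
]


def lagging_stages(scores: Dict[str, float], gap: float = 0.10) -> List[str]:
    """Return the IRAC stages scoring more than `gap` below the best stage.

    `scores` maps each stage name to a score in [0, 1]. The threshold is an
    arbitrary choice for illustration and should be tuned to the noise
    level of the evaluation.
    """
    best = max(scores[stage] for stage in IRAC_STAGES)
    return [stage for stage in IRAC_STAGES if best - scores[stage] > gap]


# Placeholder scores for a fictional model, used only to show the call.
example_scores = {
    "issue-spotting": 0.80,
    "rule-recall": 0.78,
    "rule-application": 0.55,
    "rule-conclusion": 0.74,
}
weak = lagging_stages(example_scores)
```

With these placeholder numbers only rule-application is flagged, which corresponds to the case described above: knowledge is present, but applying it to specific facts is the bottleneck.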

## Updated Leaderboard Results

Since the original 2023 paper, numerous additional models have been evaluated on LegalBench. The VALS.ai platform maintains an updated leaderboard.[3] As of early 2026, the top-performing models on LegalBench include:

| Rank | Model | Accuracy |
|---|---|---|
| 1 | Gemini 3 Pro | 87.0% |
| 2 | Gemini 3 Flash | 86.9% |
| 3 | GPT-5 | 86.0% |
| 4 | GPT-5.1 | 85.7% |
| 5 | GPT-5 | 84.6% |
| 6 | Gemini 2.5 Pro | 83.6% |
| 7 | Grok 4 | 83.4% |
| 8 | Gemini 2.5 Flash | 82.8% |
| 9 | o3 | 82.5% |
| 10 | GPT-4.1 | 81.9% |

These results demonstrate that legal reasoning performance has continued to improve with newer model generations. However, even the best models in 2026 have not surpassed 90% accuracy across all LegalBench tasks, indicating that the benchmark still presents meaningful challenges.

In the [Vals AI](/wiki/vals_ai) leaderboard update of June 9, 2026, [Anthropic](/wiki/anthropic)'s newly released [Claude Fable 5](/wiki/claude_fable_5) debuted in first place with 88.56% accuracy, ahead of Gemini 3.1 Pro Preview (87.40%), [Gemini 3 Pro](/wiki/gemini_3_pro) (87.02%), [Gemini 3 Flash](/wiki/gemini_3_flash) (86.86%), and [GPT-5.5](/wiki/gpt-5.5) (86.52%).[12] Anthropic announced Claude Fable 5 on June 9, 2026, positioning it as a state-of-the-art model for software engineering, knowledge work, and scientific research.[13]

## Impact and Adoption

### Research Community

LegalBench has become one of the standard benchmarks for evaluating legal AI systems. It is included in the HELM (Holistic Evaluation of Language Models) evaluation suite from Stanford CRFM[14] and is widely cited in papers evaluating legal language models. Researchers have used the benchmark to study topics such as legal [prompt engineering](/wiki/prompt_engineering), domain-specific [fine-tuning](/wiki/fine_tuning), and the development of legal-specific LLMs.

When Stanford CRFM introduced [HELM](/wiki/helm) Lite, a streamlined capability-focused version of HELM, in December 2023, it selected five of the 162 LegalBench tasks to serve as the suite's legal reasoning scenario.[14]

### Legal Profession

For legal technology companies and law firms evaluating AI tools, LegalBench provides a standardized way to compare model capabilities on legally relevant tasks. Rather than relying on anecdotal demonstrations or general-purpose benchmarks, practitioners can examine how a model performs on specific task categories that align with their use cases, whether that is contract review (interpretation tasks), legal research (rule-recall tasks), or case assessment (issue-spotting tasks).

### Access to Justice

The inclusion of Learned Hands tasks and the involvement of legal impact organizations in the benchmark's construction reflect a commitment to ensuring that legal AI evaluation considers access-to-justice applications. This is significant because some of the most promising applications of legal AI involve helping individuals who cannot afford legal representation to navigate the legal system.

### Extensions and Follow-Up Work

LegalBench has inspired follow-up projects, including LegalBench-RAG, a benchmark specifically designed to evaluate [retrieval-augmented generation](/wiki/information_retrieval) in the legal domain.[15] The open science model of LegalBench, in which legal professionals contribute tasks and the benchmark grows over time, has also influenced the design of other domain-specific benchmarks.

LegalBench-RAG was introduced in August 2024 by Nicholas Pipitone and Ghita Houir Alami as the first benchmark targeting the retrieval step of RAG pipelines in the legal domain.[15] It was constructed by retracing the context behind LegalBench queries back to their exact locations in the original source documents, producing 6,858 query-answer pairs over a corpus of more than 79 million characters drawn from four datasets (CUAD, MAUD, ContractNLI, and PrivacyQA), all annotated by legal experts.[15] A lightweight subset, LegalBench-RAG-mini, supports rapid iteration and experimentation.[15]
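
Because each answer is tied to an exact location in a source document, retrieved text can be compared with the reference at the character level. The sketch below shows one way to do this with precision and recall over character offsets. The span representation (half-open `(start, end)` offsets within a single document) and the function names are assumptions for illustration and are not taken from the LegalBench-RAG codebase.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open character range [start, end) in one document


def span_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character offsets they cover."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_precision_recall(retrieved: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans against gold spans."""
    retrieved_chars = span_chars(retrieved)
    gold_chars = span_chars(gold)
    overlap = len(retrieved_chars & gold_chars)
    precision = overlap / len(retrieved_chars) if retrieved_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    return precision, recall


# A retriever returns a 200-character chunk that fully contains a 50-character answer.
precision, recall = span_precision_recall([(0, 200)], [(50, 100)])
```

In this example recall is 1.0 and precision is 0.25: the answer was found, but most of the retrieved text is irrelevant. Measuring both makes the trade-off between large and small retrieval chunks visible.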

## Limitations

While LegalBench represents a major advance in legal AI evaluation, it has several acknowledged limitations:

- **English-only:** LegalBench focuses exclusively on English-language legal reasoning, primarily within the context of U.S. law.[1] Legal systems in other jurisdictions and languages are not represented.
- **Static evaluation:** The benchmark measures model performance at a fixed point in time. It does not capture how models perform on evolving legal questions or newly enacted legislation.
- **Interpretation dominance:** The heavy concentration of tasks in the interpretation category (119 out of 162) means that aggregate scores may disproportionately reflect interpretation performance rather than providing a balanced assessment across all reasoning types.[1]
- **Binary format limitations:** Many tasks are framed as binary classification, which may not capture the full nuance of legal reasoning. Real-world legal analysis often involves degrees of certainty, competing interpretations, and context-dependent judgments.
- **Formatting sensitivity:** Some models, particularly Claude 1, had difficulty producing outputs in the required single-word format.[1] This introduces a confound where poor formatting compliance is conflated with poor legal reasoning ability.
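
The effect of interpretation dominance on aggregate scores, noted in the list above, can be made concrete by comparing two ways of averaging. The sketch below contrasts a task-weighted mean, where each category counts in proportion to its number of tasks, with an unweighted mean over categories. The task counts come from this article; the per-category scores are hypothetical placeholders, and the function names are chosen for illustration.

```python
from typing import Dict


def task_weighted_mean(scores: Dict[str, float], counts: Dict[str, int]) -> float:
    """Average category scores weighted by the number of tasks per category."""
    total_tasks = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total_tasks


def category_mean(scores: Dict[str, float]) -> float:
    """Average category scores with every category weighted equally."""
    return sum(scores.values()) / len(scores)


# Task counts from the article; scores are made up to show the effect.
counts = {"interpretation": 119, "rule-recall": 5}
scores = {"interpretation": 0.90, "rule-recall": 0.50}

weighted = task_weighted_mean(scores, counts)
unweighted = category_mean(scores)
```

With these placeholder scores the task-weighted mean is about 0.88 while the category mean is 0.70, so a weakness in a small category is almost invisible in a task-weighted aggregate. Reporting per-category results alongside any overall figure avoids this problem.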

## Availability and Access

LegalBench is fully open and available to the research community:

- **Dataset:** [Hugging Face](https://huggingface.co/datasets/nguha/legalbench)[7]
- **Code and documentation:** [GitHub](https://github.com/HazyResearch/legalbench)[6]
- **Project website:** [hazyresearch.stanford.edu/legalbench](https://hazyresearch.stanford.edu/legalbench/)[2]
- **Paper:** [arXiv:2308.11462](https://arxiv.org/abs/2308.11462)[1]
- **License:** Creative Commons Attribution 4.0 (individual tasks may have additional licensing requirements from their original dataset creators)[6][7]

The project website includes a "Getting Started" guide for researchers who want to evaluate new models on LegalBench, as well as information on how to contribute new tasks.[2]

## See Also

- [MMLU](/wiki/mmlu) (Massive Multitask Language Understanding)
- [BIG-Bench](/wiki/big_bench) (Beyond the Imitation Game Benchmark)
- [GLUE](/wiki/glue_benchmark) (General Language Understanding Evaluation)
- [GPT-4](/wiki/gpt-4)
- [Prompt Engineering](/wiki/prompt_engineering)
- [Few-Shot Learning](/wiki/few-shot_learning)
- [Natural Language Processing](/wiki/natural_language_processing)

## References

1. Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., et al. (2023). "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models." *Advances in Neural Information Processing Systems (NeurIPS) 2023, Datasets and Benchmarks Track.* arXiv:2308.11462.
2. Guha, N., et al. (2023). "LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning." Stanford Institute for Human-Centered Artificial Intelligence (HAI). Available at: https://hazyresearch.stanford.edu/legalbench/
3. VALS.ai. (2025). "LegalBench Leaderboard." Available at: https://www.vals.ai/benchmarks/legal_bench
4. Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2024). "GPT-4 Passes the Bar Exam." *Philosophical Transactions of the Royal Society A*, 382(2270). doi:10.1098/rsta.2023.0254.
5. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding." *Proceedings of the International Conference on Learning Representations (ICLR).*
6. HazyResearch. (2023). "LegalBench GitHub Repository." Available at: https://github.com/HazyResearch/legalbench
7. Guha, N. (2023). "nguha/legalbench: Datasets at Hugging Face." Available at: https://huggingface.co/datasets/nguha/legalbench
8. Learned Hands Project. Stanford Legal Design Lab and Suffolk LIT Lab. Available at: https://learnedhands.law.stanford.edu/
9. Hendrycks, D., et al. (2021). "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review." *Proceedings of NeurIPS 2021.*
10. Koreeda, Y. & Manning, C. D. (2021). "ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts." *Findings of EMNLP 2021.*
11. Guha, N., Ho, D. E., Nyarko, J., & Ré, C. (2022). "LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning." arXiv:2209.06120. Available at: https://arxiv.org/abs/2209.06120
12. VALS.ai. (2026). "LegalBench Leaderboard" (updated June 9, 2026). Available at: https://www.vals.ai/benchmarks/legal_bench
13. Anthropic. (2026). "Claude Fable 5 and Claude Mythos 5." June 9, 2026. Available at: https://www.anthropic.com/news/claude-fable-5-mythos-5
14. Stanford Center for Research on Foundation Models (CRFM). (2023). "HELM Lite: Lightweight and Broad Capabilities Evaluation." December 19, 2023. Available at: https://crfm.stanford.edu/2023/12/19/helm-lite.html
15. Pipitone, N., & Houir Alami, G. (2024). "LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain." arXiv:2408.10343. Available at: https://arxiv.org/abs/2408.10343

