LegalBench is a collaboratively constructed benchmark for evaluating legal reasoning in large language models (LLMs). Developed by Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, and 36 additional co-authors from institutions including Stanford University, the University of Chicago, Harvard Law School, and Georgetown University, LegalBench consists of 162 tasks spanning six categories of legal reasoning. The benchmark was published in the NeurIPS 2023 Datasets and Benchmarks track and is publicly available on Hugging Face and GitHub.
LegalBench emerged from an open science effort that brought together legal professionals, law professors, computational legal scholars, and legal impact organizations. Unlike benchmarks built exclusively by computer scientists, LegalBench draws directly on the expertise of practitioners who designed tasks that reflect real-world legal reasoning patterns. The benchmark covers a range of legal domains including contracts, evidence, civil procedure, corporate governance, immigration, privacy policy, and statutory interpretation. It provides a common vocabulary for researchers and legal professionals to discuss AI performance in law.
The application of LLMs to legal tasks has attracted significant attention from both the AI research community and the legal profession. Models such as GPT-4 have demonstrated the ability to pass the Uniform Bar Exam, and law firms have begun exploring the use of AI assistants for contract review, legal research, and document drafting. However, before LegalBench, there was no comprehensive benchmark specifically designed to measure the kinds of reasoning that lawyers and legal scholars consider central to legal practice.
Existing NLP benchmarks such as GLUE, SuperGLUE, and MMLU include some questions related to law, but they treat legal reasoning as just one subject among many and do not decompose it into the distinct reasoning types that legal education emphasizes. Legal reasoning is not a single monolithic skill. It involves identifying relevant issues, recalling applicable rules, applying those rules to specific facts, drawing conclusions, interpreting statutory and contractual language, and understanding rhetorical strategies in legal arguments.
The creators of LegalBench sought to fill this gap by building a benchmark grounded in the IRAC framework, a foundational methodology in legal education. IRAC stands for Issue, Rule, Application, and Conclusion, and it describes the sequential process by which lawyers analyze legal problems. By mapping benchmark tasks to these components (and adding interpretation and rhetorical understanding as additional categories), LegalBench enables a fine-grained diagnosis of where LLMs succeed and where they fall short in legal reasoning.
LegalBench was assembled through a distinctive crowdsourcing process within the legal community. Rather than having computer scientists design all tasks, the project invited contributions from 40 individuals spanning diverse legal backgrounds: practicing attorneys, law professors, computational legal researchers, legal impact labs, and law students. Contributors were asked to propose tasks that they found either "intellectually interesting" or "practically useful" for evaluating LLMs.
This approach produced tasks that reflect the authentic concerns of legal practitioners. Some tasks test whether a model can classify evidence under the Federal Rules of Evidence. Others evaluate whether a model can identify specific clauses in merger agreements or determine whether a privacy policy discloses a particular data practice. The result is a benchmark that covers not just textbook legal questions but also the kinds of practical analysis that lawyers perform daily.
Each LegalBench task pairs an input with an expected output. The benchmark includes several task types:
| Task Type | Description | Example |
|---|---|---|
| Binary Classification | The model must answer "Yes" or "No" to a legal question | Determining whether a statement constitutes hearsay |
| Multi-Class Classification | The model must select from several possible labels | Classifying the type of contractual clause |
| Extraction | The model must identify and extract specific information | Extracting defined terms from Supreme Court opinions |
| Generation | The model must produce a free-text response | Generating the legal rule for a given issue |
| Entailment | The model must determine whether one statement follows from another | Evaluating whether a contract clause entails a particular obligation |
A substantial portion of the tasks are binary classification tasks, which require the LLM to determine whether a piece of legal text possesses a certain attribute. This design choice reflects both the prevalence of yes/no determinations in legal practice and the practical need for tasks that can be scored objectively and at scale.
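To make the task format concrete, the sketch below loads a single task from the Hugging Face release and prints one example. It assumes the `datasets` library and the dataset ID `nguha/legalbench`; the `hearsay` configuration name and the `text`/`answer` column names follow the conventions of the public release, but treat the exact field names as assumptions to verify against the repository.

```python
# Minimal sketch: loading one LegalBench task from Hugging Face.
# Assumes the `datasets` library and the dataset ID "nguha/legalbench";
# the "hearsay" config and the text/answer columns follow the public
# release, but field names should be checked against the repo.
from datasets import load_dataset

ds = load_dataset("nguha/legalbench", "hearsay")
print(ds)  # a small "train" split (in-context demos) and a larger "test" split

example = ds["test"][0]
print(example["text"])    # a short factual scenario
print(example["answer"])  # the gold label, "Yes" or "No" for this task
```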
LegalBench draws on 36 different data sources. Several established legal NLP datasets were restructured and incorporated into the benchmark, including:

- **CUAD** (Contract Understanding Atticus Dataset), commercial contracts annotated for specific clause types
- **MAUD** (Merger Agreement Understanding Dataset), annotated public-company merger agreements
- **Contract NLI**, a document-level natural language inference dataset for contracts
- **OPP-115**, a corpus of annotated website privacy policies
- **Supply Chain Disclosure** statements classified under the California Transparency in Supply Chains Act
In addition to these incorporated datasets, many tasks were designed from scratch by legal professionals specifically for LegalBench. These original tasks cover areas such as hearsay classification, statutory interpretation, the definition of legal terms, and the identification of rhetorical strategies in judicial opinions.
Each contributed task underwent a review process in which it was verified for correctness and clarity. Contributors provided annotated examples that served as ground truth labels. The benchmark specifies potential prompts, in-context demonstrations, and evaluation metrics for each task. This standardization ensures that evaluations across different research groups are comparable.
LegalBench organizes its 162 tasks into six categories, each corresponding to a distinct type of legal reasoning. This taxonomy is grounded in the IRAC framework used in legal education, extended with two additional categories (interpretation and rhetorical understanding) to capture reasoning types that fall outside the traditional IRAC sequence.
Issue-spotting tasks require the model to determine whether a given set of facts raises a particular legal question, implicates a specific area of law, or is relevant to a certain party. In legal education, issue-spotting is often the first step in analyzing a problem: before a lawyer can apply a rule, they must first recognize which legal issues are present.
For example, the Learned Hands tasks fall into this category. Given a short description of a person's legal situation, the model must determine whether the facts are relevant to a particular legal domain such as housing, immigration, employment, or consumer protection. Issue-spotting tasks test a model's ability to map factual scenarios onto the correct legal framework.
Rule-recall tasks require the model to generate or identify the correct legal rule on a given issue within a specific jurisdiction. For instance, a task might ask the model to state the rule for hearsay under the U.S. Federal Rules of Evidence, or to answer whether the law in a particular jurisdiction permits or prohibits a specific action.
These tasks evaluate whether a model has internalized the substantive content of legal rules during pre-training. Because LLMs are trained on large corpora that include legal texts, statutes, and case law, they may have encoded legal rules in their parameters. Rule-recall tasks measure how accurately they can retrieve this knowledge.
Rule-application tasks evaluate whether the model can explain its reasoning in a way that demonstrates correct legal inferences. This goes beyond simply knowing the rule or reaching the right conclusion. In legal practice, a lawyer must show their work by demonstrating how the facts of a particular case map onto the elements of a legal rule.
These tasks test whether a model can perform the analytical step of connecting specific facts to the abstract requirements of a legal standard. For example, a rule-application task might ask whether the model can explain why a particular piece of evidence qualifies as hearsay by identifying the relevant elements (an out-of-court statement offered to prove the truth of the matter asserted).
Rule-conclusion tasks require the model to determine the correct legal outcome when given a set of facts and an applicable rule. This is the final step in the IRAC framework: having identified the issue, recalled the rule, and applied it to the facts, the lawyer must state the conclusion.
These tasks test the model's ability to synthesize information and render a determination. The model must integrate the factual context with the legal standard and reach a reasoned outcome. Rule-conclusion tasks differ from rule-application tasks in that the focus is on the correctness of the final answer rather than the quality of the intermediate reasoning.
Interpretation tasks, the largest category by far, require the model to parse, understand, and classify legal language. These tasks cover a wide range of legal documents including statutes, contracts, judicial opinions, privacy policies, and corporate disclosures.
Many of the tasks derived from CUAD, MAUD, Contract NLI, OPP-115, and Supply Chain Disclosure fall into this category. A typical interpretation task might present a contract clause and ask the model to determine whether it contains a non-compete provision, or present a privacy policy paragraph and ask whether it describes the collection of user location data.
The dominance of interpretation tasks (119 out of 162) reflects the reality of legal practice, where a large portion of a lawyer's work involves reading, understanding, and categorizing legal text. It also reflects the availability of established datasets in this area that could be restructured for inclusion in LegalBench.
Rhetorical-understanding tasks require the model to reason about the structure and strategy of legal arguments and analysis. Rather than determining what the law says (interpretation) or what outcome it dictates (rule-conclusion), these tasks evaluate whether the model understands how legal arguments are constructed and what function specific passages serve within a legal document.
For example, a rhetorical-understanding task might present a passage from a judicial opinion and ask the model to identify whether the passage is stating the holding, summarizing the facts, distinguishing a prior case, or presenting a policy rationale. These tasks test a higher-order understanding of legal discourse that goes beyond surface-level comprehension.
| Category | Number of Tasks | Focus |
|---|---|---|
| Issue-Spotting | 16 | Identifying which legal questions a set of facts raises |
| Rule-Recall | 5 | Generating or identifying correct legal rules |
| Rule-Application | 16 | Explaining legal reasoning with correct inferences |
| Rule-Conclusion | 16 | Determining the correct legal outcome |
| Interpretation | 119 | Parsing and classifying legal language and documents |
| Rhetorical Understanding | 10 | Analyzing the structure and strategy of legal arguments |
| Total | 162 | |
The original LegalBench paper presented an empirical evaluation of 20 LLMs from 11 model families. The models spanned a wide range of sizes and architectures, from small open-source models with 3 billion parameters to the largest commercial models available at the time of evaluation.
The three commercial models evaluated were:

| Model | Developer | Notes |
|---|---|---|
| GPT-4 | OpenAI | Largest and best-performing commercial model in the evaluation |
| GPT-3.5 (text-davinci-003) | OpenAI | Instruction-tuned model from the GPT-3.5 family |
| Claude 1 (v1.3) | Anthropic | First-generation Claude model |
The 17 open-source models evaluated were:

| Model | Developer | Parameter Count |
|---|---|---|
| Flan-T5-XXL | Google | 11B |
| Flan-T5-XL | Google | 3B |
| LLaMA-2-13B | Meta | 13B |
| LLaMA-2-7B | Meta | 7B |
| Vicuna-13B-16k | LMSYS | 13B |
| Vicuna-7B-16k | LMSYS | 7B |
| OPT-13B | Meta | 13B |
| OPT-6.7B | Meta | 6.7B |
| OPT-2.7B | Meta | 2.7B |
| INCITE-Instruct-7B | Together | 7B |
| INCITE-Base-7B | Together | 7B |
| INCITE-Instruct-3B | Together | 3B |
| Falcon-7B-Instruct | TII | 7B |
| MPT-7B-8k-Instruct | MosaicML | 7B |
| WizardLM-13B | WizardLM | 13B |
| BLOOM-7B | BigScience | 7B |
| BLOOM-3B | BigScience | 3B |
Each model was evaluated using a few-shot learning setup. For each task, the researchers manually wrote instructions and selected between zero and eight examples from the training split to serve as in-context demonstrations. The number of examples varied depending on data availability and input sequence length constraints.
Prompts were designed to be as clear and unambiguous as possible. For binary classification tasks, models were instructed to respond with a single word ("Yes" or "No"). However, some models, notably Claude 1, had difficulty following these formatting constraints and would produce paragraph-length explanations instead of single-word answers, complicating automated evaluation.
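A rough sketch of this few-shot setup is shown below. It is illustrative only: the instruction wording, the `Text:`/`Answer:` field labels, and the demonstrations are placeholders, not the authors' published templates.

```python
# Illustrative few-shot prompt assembly in the style described above.
# The instruction text, field labels, and demos are placeholders,
# not LegalBench's actual prompt templates.
def build_prompt(instruction: str, demos: list[dict], query: str) -> str:
    """Assemble instruction + k in-context demonstrations + the query."""
    parts = [instruction.strip()]
    for demo in demos:  # zero to eight demos, per the setup above
        parts.append(f"Text: {demo['text']}\nAnswer: {demo['answer']}")
    parts.append(f"Text: {query}\nAnswer:")  # the model completes the label
    return "\n\n".join(parts)

demos = [
    {"text": "A witness repeats her neighbor's claim that the driver ran the light.",
     "answer": "Yes"},
    {"text": "The plaintiff introduces the signed contract itself.",
     "answer": "No"},
]
print(build_prompt(
    "Determine whether the statement involves hearsay. Answer Yes or No.",
    demos,
    "A bystander's out-of-court account is offered to prove what it asserts.",
))
```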
The primary evaluation metric for most tasks was balanced accuracy, chosen to account for class imbalance in the binary classification tasks. A few tasks that required the model to produce multiple classes or extract multiple terms per instance used F1 score instead. This standardization of metrics across the benchmark allows for meaningful comparisons between models and across tasks.
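Balanced accuracy is the unweighted mean of per-class recall, so a model cannot score well by simply predicting the majority class. A minimal sketch follows; scikit-learn's `balanced_accuracy_score` computes the same quantity.

```python
# Balanced accuracy = unweighted mean of per-class recall,
# which corrects for class imbalance in binary tasks.
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# With 90 "No" and 10 "Yes" gold labels, always answering "No" scores
# 90% plain accuracy but only 50% balanced accuracy.
y_true = ["No"] * 90 + ["Yes"] * 10
y_pred = ["No"] * 100
print(balanced_accuracy(y_true, y_pred))  # 0.5
```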
The evaluation focused on measuring model performance in a standardized, reproducible manner. Each task specifies its associated evaluation metric, allowing future researchers to add new models to the comparison without ambiguity about how performance should be measured.
GPT-4 was the strongest overall performer in the original evaluation, outperforming all other models across most task categories. On issue-spotting tasks, GPT-4 led GPT-3.5 and Claude 1 by a margin of over 20 percentage points. However, GPT-4's advantage was not uniform. On certain task groups, the gap between GPT-4 and smaller models narrowed considerably.
One of the most important findings from LegalBench is that a given model's performance can vary dramatically across different legal tasks. This suggests that aggregate scores across all 162 tasks can be misleading. A model that performs well on contract clause classification may struggle with merger agreement analysis or statutory interpretation.
For example, on Supply Chain Disclosure tasks (classifying corporate disclosures that run one to two pages), large commercial models averaged between 74% and 75% balanced accuracy. But on MAUD tasks (multiple-choice questions about merger agreements), GPT-4's balanced accuracy dropped to just 47.8%, barely above chance for some multi-class tasks. In contrast, on certain Learned Hands issue-spotting tasks (such as immigration, traffic, and estate law), GPT-4 achieved balanced accuracy above 95%.
Despite their smaller parameter counts, some open-source models demonstrated competitive performance on specific categories. Flan-T5-XXL, with 11 billion parameters, outperformed both GPT-3.5 and Claude 1 on two categories: issue-spotting and rhetorical understanding. This finding suggests that model architecture and training methodology matter as much as raw scale for certain types of legal reasoning.
Within model families, larger variants generally outperformed smaller ones. Flan-T5-XXL outperformed Flan-T5-XL across all five evaluated reasoning categories. Similarly, LLaMA-2-13B outperformed LLaMA-2-7B on average across four reasoning categories.
GPT-3.5 and Claude 1 were closely matched in overall performance, separated by an average gap of only about 2 percentage points. This near-parity suggests that at the commercial model tier just below GPT-4, legal reasoning performance was relatively consistent across providers.
Consistent with the structure of the IRAC framework, models found the "Application" and "Conclusion" stages of legal reasoning more difficult than issue-spotting or rule-recall. Performance sharply degraded on tasks requiring the model to integrate factual details with legal rules and synthesize a reasoned determination. This pattern suggests that LLMs are better at recognizing and retrieving legal knowledge than they are at performing the kind of context-specific analysis and synthesis that characterizes skilled legal reasoning.
Model performance showed notable sensitivity to the choice of in-context examples and the phrasing of task instructions. Using plain-language descriptions of tasks sometimes outperformed more technical legal language. This finding has practical implications for how legal AI tools should be designed: the way a question is framed can significantly affect the quality of the model's response.
The Contract Understanding Atticus Dataset (CUAD) tasks comprise the largest single source of LegalBench tasks. Originally designed for contract review, these tasks were restructured into binary classification format. Each task focuses on a specific type of contractual clause, such as:

- Anti-assignment provisions
- Audit rights
- Change-of-control clauses
- Non-compete restrictions
- Most-favored-nation terms
- Caps on liability
These tasks are practically relevant because contract review is one of the most common applications of legal AI. The restructured CUAD tasks in LegalBench allow researchers to evaluate how well a model can identify specific provisions without requiring full span prediction.
The Mergers and Acquisitions Understanding Dataset tasks are among the most challenging in LegalBench. They require models to answer detailed questions about real merger agreements, testing the model's ability to understand complex transactional documents. The low performance of even GPT-4 on these tasks (47.8% average balanced accuracy) highlights the difficulty of M&A document analysis and suggests that this area represents a significant frontier for legal AI.
The Learned Hands tasks bring an access-to-justice dimension to LegalBench. Derived from the Learned Hands platform (a collaboration between the Stanford Legal Design Lab and the Suffolk Legal Innovation and Technology Lab), these tasks use real descriptions posted by individuals seeking legal help. The model must classify each description into the correct legal domain.
These tasks are especially relevant for legal aid organizations and court self-help programs, where automated issue-spotting could help route people to the correct resources more quickly. Performance on Learned Hands tasks varied widely by legal domain, with models achieving over 95% accuracy on some topics (immigration, traffic law) but struggling on others (torts, consumer protection).
The ten Supply Chain Disclosure tasks evaluate whether models can accurately classify corporate supply chain disclosures under laws such as the California Transparency in Supply Chains Act. The documents are typically one to two pages long, making them among the longer inputs in LegalBench. Large commercial models achieved 74% to 75% balanced accuracy on these tasks, indicating substantial room for improvement.
LegalBench's taxonomy is rooted in the IRAC (Issue, Rule, Application, Conclusion) framework, which is one of the most widely taught methods of legal analysis in law schools. Understanding this connection is important for interpreting LegalBench results.
In legal education, students learn to analyze problems by first identifying the issue (what legal question is presented), then stating the rule (what law governs the issue), then applying the rule to the specific facts, and finally reaching a conclusion. This sequential process mirrors how experienced lawyers approach unfamiliar legal problems.
LegalBench maps its task categories onto this framework: issue-spotting corresponds to "Issue," rule-recall corresponds to "Rule," rule-application corresponds to "Application," and rule-conclusion corresponds to "Conclusion." The additional categories of interpretation and rhetorical understanding capture reasoning types that operate alongside or in support of the IRAC process.
This mapping creates a diagnostic tool. When a model performs well on rule-recall but poorly on rule-application, researchers can infer that the model has encoded legal knowledge but struggles to deploy it in context-specific analysis. When a model performs well on issue-spotting but poorly on rule-conclusion, the bottleneck may lie in the synthesis of multiple reasoning steps rather than in any single component.
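In practice, this diagnostic reading amounts to grouping per-task scores by reasoning category before averaging, as in the sketch below. The task names, category mapping, and scores here are illustrative placeholders, not published results.

```python
# Sketch: turning per-task scores into the per-category diagnostic
# described above. The mapping and scores are illustrative placeholders.
from collections import defaultdict
from statistics import mean

task_category = {
    "hearsay": "rule-conclusion",
    "learned_hands_housing": "issue-spotting",
    "cuad_non_compete": "interpretation",
}
task_scores = {
    "hearsay": 0.71,
    "learned_hands_housing": 0.95,
    "cuad_non_compete": 0.88,
}

by_category = defaultdict(list)
for task, score in task_scores.items():
    by_category[task_category[task]].append(score)

for category, scores in sorted(by_category.items()):
    print(f"{category:20s} {mean(scores):.2f}")
```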
Since the original 2023 paper, numerous additional models have been evaluated on LegalBench. The VALS.ai platform maintains an updated leaderboard. As of early 2026, the top-performing models on LegalBench include:
| Rank | Model | Accuracy |
|---|---|---|
| 1 | Gemini 3 Pro | 87.0% |
| 2 | Gemini 3 Flash | 86.9% |
| 3 | GPT-5 | 86.0% |
| 4 | GPT-5.1 | 85.7% |
| 5 | GPT-5 | 84.6% |
| 6 | Gemini 2.5 Pro | 83.6% |
| 7 | Grok 4 | 83.4% |
| 8 | Gemini 2.5 Flash | 82.8% |
| 9 | o3 | 82.5% |
| 10 | GPT-4.1 | 81.9% |
These results demonstrate that legal reasoning performance has continued to improve with newer model generations. However, even the best models in 2026 have not surpassed 90% accuracy across all LegalBench tasks, indicating that the benchmark still presents meaningful challenges.
LegalBench has become one of the standard benchmarks for evaluating legal AI systems. It is included in the HELM (Holistic Evaluation of Language Models) evaluation suite from Stanford CRFM and is widely cited in papers evaluating legal language models. The paper has been cited extensively since its publication, and the benchmark has been used by researchers exploring topics such as legal prompt engineering, domain-specific fine-tuning, and the development of legal-specific LLMs.
For legal technology companies and law firms evaluating AI tools, LegalBench provides a standardized way to compare model capabilities on legally relevant tasks. Rather than relying on anecdotal demonstrations or general-purpose benchmarks, practitioners can examine how a model performs on specific task categories that align with their use cases, whether that is contract review (interpretation tasks), legal research (rule-recall tasks), or case assessment (issue-spotting tasks).
The inclusion of Learned Hands tasks and the involvement of legal impact organizations in the benchmark's construction reflect a commitment to ensuring that legal AI evaluation considers access-to-justice applications. This is significant because some of the most promising applications of legal AI involve helping individuals who cannot afford legal representation to navigate the legal system.
LegalBench has inspired follow-up projects, including LegalBench-RAG, a benchmark specifically designed to evaluate retrieval-augmented generation in the legal domain. The open science model of LegalBench, in which legal professionals contribute tasks and the benchmark grows over time, has also influenced the design of other domain-specific benchmarks.
While LegalBench represents a major advance in legal AI evaluation, it has several acknowledged limitations:

- **U.S.-centric scope.** The tasks focus almost exclusively on American law, so results say little about performance in other jurisdictions.
- **English only.** All tasks are in English, leaving multilingual legal reasoning unmeasured.
- **Emphasis on classification.** Most tasks are short classification or extraction problems, which do not capture open-ended legal work such as drafting, negotiation, or multi-step research.
- **Static snapshot.** The law evolves, while the tasks reflect the rules in force when they were written.
- **Possible contamination.** Many tasks derive from publicly available datasets, so some material may appear in models' pre-training data.
LegalBench is fully open and available to the research community:

- **Paper:** "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models" (NeurIPS 2023 Datasets and Benchmarks; arXiv:2308.11462)
- **Code and tasks:** https://github.com/HazyResearch/legalbench
- **Dataset:** https://huggingface.co/datasets/nguha/legalbench
The project website includes a "Getting Started" guide for researchers who want to evaluate new models on LegalBench, as well as information on how to contribute new tasks.