LegalBench is a collaboratively constructed benchmark for evaluating legal reasoning in large language models (LLMs). Developed by Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, and 36 additional co-authors from institutions including Stanford University, the University of Chicago, Harvard Law School, and Georgetown University, LegalBench consists of 162 tasks spanning six categories of legal reasoning. The benchmark was published in the NeurIPS 2023 Datasets and Benchmarks track and is publicly available on Hugging Face and GitHub.
LegalBench emerged from an open science effort that brought together legal professionals, law professors, computational legal scholars, and legal impact organizations. Unlike benchmarks built exclusively by computer scientists, LegalBench draws directly on the expertise of practitioners who designed tasks that reflect real-world legal reasoning patterns. The benchmark covers a range of legal domains including contracts, evidence, civil procedure, corporate governance, immigration, privacy policy, and statutory interpretation. It provides a common vocabulary for researchers and legal professionals to discuss AI performance in law.
The application of LLMs to legal tasks has attracted significant attention from both the AI research community and the legal profession. Models such as GPT-4 have demonstrated the ability to pass the Uniform Bar Exam, and law firms have begun exploring the use of AI assistants for contract review, legal research, and document drafting. However, before LegalBench, there was no comprehensive benchmark specifically designed to measure the kinds of reasoning that lawyers and legal scholars consider central to legal practice.
Existing NLP benchmarks such as GLUE, SuperGLUE, and MMLU include some questions related to law, but they treat legal reasoning as just one subject among many and do not decompose it into the distinct reasoning types that legal education emphasizes. Legal reasoning is not a single monolithic skill. It involves identifying relevant issues, recalling applicable rules, applying those rules to specific facts, drawing conclusions, interpreting statutory and contractual language, and understanding rhetorical strategies in legal arguments.
The creators of LegalBench sought to fill this gap by building a benchmark grounded in the IRAC framework, a foundational methodology in legal education. IRAC stands for Issue, Rule, Application, and Conclusion, and it describes the sequential process by which lawyers analyze legal problems. By mapping benchmark tasks to these components (and adding interpretation and rhetorical understanding as additional categories), LegalBench enables a fine-grained diagnosis of where LLMs succeed and where they fall short in legal reasoning.
LegalBench was assembled through a distinctive crowdsourcing process within the legal community. Rather than having computer scientists design all tasks, the project invited contributions from 40 individuals spanning diverse legal backgrounds: practicing attorneys, law professors, computational legal researchers, legal impact labs, and law students. Contributors were asked to propose tasks that they found either "intellectually interesting" or "practically useful" for evaluating LLMs.
This approach produced tasks that reflect the authentic concerns of legal practitioners. Some tasks test whether a model can classify evidence under the Federal Rules of Evidence. Others evaluate whether a model can identify specific clauses in merger agreements or determine whether a privacy policy discloses a particular data practice. The result is a benchmark that covers not just textbook legal questions but also the kinds of practical analysis that lawyers perform daily.
Each LegalBench task pairs an input with an expected output. The benchmark includes several task types:
| Task Type | Description | Example |
|---|---|---|
| Binary Classification | The model must answer "Yes" or "No" to a legal question | Determining whether a statement constitutes hearsay |
| Multi-Class Classification | The model must select from several possible labels | Classifying the type of contractual clause |
| Extraction | The model must identify and extract specific information | Extracting defined terms from Supreme Court opinions |
| Generation | The model must produce a free-text response | Generating the legal rule for a given issue |
| Entailment | The model must determine whether one statement follows from another | Evaluating whether a contract clause entails a particular obligation |
A substantial portion of the tasks are binary classification tasks, which require the LLM to determine whether a piece of legal text possesses a certain attribute. This design choice reflects both the prevalence of yes/no determinations in legal practice and the practical need for tasks that can be scored objectively and at scale.
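To make the task format concrete, the sketch below loads a single task from the Hugging Face release and prints one example. It assumes the `datasets` library and the dataset ID `nguha/legalbench`; the `hearsay` configuration name and the `text`/`answer` column names follow the conventions of the public release, but treat the exact field names as assumptions to verify against the repository.

```python
# Minimal sketch: loading one LegalBench task from Hugging Face.
# Assumes the `datasets` library and the dataset ID "nguha/legalbench";
# the "hearsay" config and the text/answer columns follow the public
# release, but field names should be checked against the repo.
from datasets import load_dataset

ds = load_dataset("nguha/legalbench", "hearsay")
print(ds)  # a small "train" split (in-context demos) and a larger "test" split

example = ds["test"][0]
print(example["text"])    # a short factual scenario
print(example["answer"])  # the gold label, "Yes" or "No" for this task
```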
LegalBench draws on 36 different data sources. Several established legal NLP datasets were restructured and incorporated into the benchmark, including:

- **CUAD** (Contract Understanding Atticus Dataset), commercial contracts annotated for specific clause types
- **MAUD** (Merger Agreement Understanding Dataset), annotated public-company merger agreements
- **Contract NLI**, a document-level natural language inference dataset for contracts
- **OPP-115**, a corpus of annotated website privacy policies
- **Supply Chain Disclosure** statements classified under the California Transparency in Supply Chains Act
In addition to these incorporated datasets, many tasks were designed from scratch by legal professionals specifically for LegalBench. These original tasks cover areas such as hearsay classification, statutory interpretation, the definition of legal terms, and the identification of rhetorical strategies in judicial opinions.
Each contributed task underwent a review process in which it was verified for correctness and clarity. Contributors provided annotated examples that served as ground truth labels. The benchmark specifies potential prompts, in-context demonstrations, and evaluation metrics for each task. This standardization ensures that evaluations across different research groups are comparable.
LegalBench organizes its 162 tasks into six categories, each corresponding to a distinct type of legal reasoning. This taxonomy is grounded in the IRAC framework used in legal education, extended with two additional categories (interpretation and rhetorical understanding) to capture reasoning types that fall outside the traditional IRAC sequence.
Issue-spotting tasks require the model to determine whether a given set of facts raises a particular legal question, implicates a specific area of law, or is relevant to a certain party. In legal education, issue-spotting is often the first step in analyzing a problem: before a lawyer can apply a rule, they must first recognize which legal issues are present.
For example, the Learned Hands tasks fall into this category. Given a short description of a person's legal situation, the model must determine whether the facts are relevant to a particular legal domain such as housing, immigration, employment, or consumer protection. Issue-spotting tasks test a model's ability to map factual scenarios onto the correct legal framework.
Rule-recall tasks require the model to generate or identify the correct legal rule on a given issue within a specific jurisdiction. For instance, a task might ask the model to state the rule for hearsay under the U.S. Federal Rules of Evidence, or to answer whether the law in a particular jurisdiction permits or prohibits a specific action.
These tasks evaluate whether a model has internalized the substantive content of legal rules during pre-training. Because LLMs are trained on large corpora that include legal texts, statutes, and case law, they may have encoded legal rules in their parameters. Rule-recall tasks measure how accurately they can retrieve this knowledge.
Rule-application tasks evaluate whether the model can explain its reasoning in a way that demonstrates correct legal inferences. This goes beyond simply knowing the rule or reaching the right conclusion. In legal practice, a lawyer must show their work by demonstrating how the facts of a particular case map onto the elements of a legal rule.
These tasks test whether a model can perform the analytical step of connecting specific facts to the abstract requirements of a legal standard. For example, a rule-application task might ask whether the model can explain why a particular piece of evidence qualifies as hearsay by identifying the relevant elements (an out-of-court statement offered to prove the truth of the matter asserted).
Rule-conclusion tasks require the model to determine the correct legal outcome when given a set of facts and an applicable rule. This is the final step in the IRAC framework: having identified the issue, recalled the rule, and applied it to the facts, the lawyer must state the conclusion.
These tasks test the model's ability to synthesize information and render a determination. The model must integrate the factual context with the legal standard and reach a reasoned outcome. Rule-conclusion tasks differ from rule-application tasks in that the focus is on the correctness of the final answer rather than the quality of the intermediate reasoning.
Interpretation tasks, the largest category by far, require the model to parse, understand, and classify legal language. These tasks cover a wide range of legal documents including statutes, contracts, judicial opinions, privacy policies, and corporate disclosures.
Many of the tasks derived from CUAD, MAUD, Contract NLI, OPP-115, and Supply Chain Disclosure fall into this category. A typical interpretation task might present a contract clause and ask the model to determine whether it contains a non-compete provision, or present a privacy policy paragraph and ask whether it describes the collection of user location data.
The dominance of interpretation tasks (119 out of 162) reflects the reality of legal practice, where a large portion of a lawyer's work involves reading, understanding, and categorizing legal text. It also reflects the availability of established datasets in this area that could be restructured for inclusion in LegalBench.
Rhetorical-understanding tasks require the model to reason about the structure and strategy of legal arguments and analysis. Rather than determining what the law says (interpretation) or what outcome it dictates (rule-conclusion), these tasks evaluate whether the model understands how legal arguments are constructed and what function specific passages serve within a legal document.
For example, a rhetorical-understanding task might present a passage from a judicial opinion and ask the model to identify whether the passage is stating the holding, summarizing the facts, distinguishing a prior case, or presenting a policy rationale. These tasks test a higher-order understanding of legal discourse that goes beyond surface-level comprehension.
| Category | Number of Tasks | Focus |
|---|---|---|
| Issue-Spotting | 16 | Identifying which legal questions a set of facts raises |
| Rule-Recall | 5 | Generating or identifying correct legal rules |
| Rule-Application | 16 | Explaining legal reasoning with correct inferences |
| Rule-Conclusion | 16 | Determining the correct legal outcome |
| Interpretation | 119 | Parsing and classifying legal language and documents |
| Rhetorical Understanding | 10 | Analyzing the structure and strategy of legal arguments |
| Total | 162 | |
The original LegalBench paper presented an empirical evaluation of 20 LLMs from 11 model families. The models spanned a wide range of sizes and architectures, from small open-source models with 3 billion parameters to the largest commercial models available at the time of evaluation.
The three commercial models evaluated were:

| Model | Developer | Notes |
|---|---|---|
| GPT-4 | OpenAI | Largest and best-performing commercial model in the evaluation |
| GPT-3.5 (text-davinci-003) | OpenAI | Instruction-tuned model from the GPT-3.5 family |
| Claude 1 (v1.3) | Anthropic | First-generation Claude model |
The 17 open-source models evaluated were:

| Model | Developer | Parameter Count |
|---|---|---|
| Flan-T5-XXL | Google | 11B |
| Flan-T5-XL | Google | 3B |
| LLaMA-2-13B | Meta | 13B |
| LLaMA-2-7B | Meta | 7B |
| Vicuna-13B-16k | LMSYS | 13B |
| Vicuna-7B-16k | LMSYS | 7B |
| OPT-13B | Meta | 13B |
| OPT-6.7B | Meta | 6.7B |
| OPT-2.7B | Meta | 2.7B |
| INCITE-Instruct-7B | Together | 7B |
| INCITE-Base-7B | Together | 7B |
| INCITE-Instruct-3B | Together | 3B |
| Falcon-7B-Instruct | TII | 7B |
| MPT-7B-8k-Instruct | MosaicML | 7B |
| WizardLM-13B | WizardLM | 13B |
| BLOOM-7B | BigScience | 7B |
| BLOOM-3B | BigScience | 3B |
Each model was evaluated using a few-shot learning setup. For each task, the researchers manually wrote instructions and selected between zero and eight examples from the training split to serve as in-context demonstrations. The number of examples varied depending on data availability and input sequence length constraints.
Prompts were designed to be as clear and unambiguous as possible. For binary classification tasks, models were instructed to respond with a single word ("Yes" or "No"). However, some models, notably Claude 1, had difficulty following these formatting constraints and would produce paragraph-length explanations instead of single-word answers, complicating automated evaluation.
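A rough sketch of this few-shot setup is shown below. It is illustrative only: the instruction wording, the `Text:`/`Answer:` field labels, and the demonstrations are placeholders, not the authors' published templates.

```python
# Illustrative few-shot prompt assembly in the style described above.
# The instruction text, field labels, and demos are placeholders,
# not LegalBench's actual prompt templates.
def build_prompt(instruction: str, demos: list[dict], query: str) -> str:
    """Assemble instruction + k in-context demonstrations + the query."""
    parts = [instruction.strip()]
    for demo in demos:  # zero to eight demos, per the setup above
        parts.append(f"Text: {demo['text']}\nAnswer: {demo['answer']}")
    parts.append(f"Text: {query}\nAnswer:")  # the model completes the label
    return "\n\n".join(parts)

demos = [
    {"text": "A witness repeats her neighbor's claim that the driver ran the light.",
     "answer": "Yes"},
    {"text": "The plaintiff introduces the signed contract itself.",
     "answer": "No"},
]
print(build_prompt(
    "Determine whether the statement involves hearsay. Answer Yes or No.",
    demos,
    "A bystander's out-of-court account is offered to prove what it asserts.",
))
```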
The primary evaluation metric for most tasks was balanced accuracy, chosen to account for class imbalance in the binary classification tasks. A few tasks that required the model to produce multiple classes or extract multiple terms per instance used F1 score instead. This standardization of metrics across the benchmark allows for meaningful comparisons between models and across tasks.
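Balanced accuracy is the unweighted mean of per-class recall, so a model cannot score well by simply predicting the majority class. A minimal sketch follows; scikit-learn's `balanced_accuracy_score` computes the same quantity.

```python
# Balanced accuracy = unweighted mean of per-class recall,
# which corrects for class imbalance in binary tasks.
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# With 90 "No" and 10 "Yes" gold labels, always answering "No" scores
# 90% plain accuracy but only 50% balanced accuracy.
y_true = ["No"] * 90 + ["Yes"] * 10
y_pred = ["No"] * 100
print(balanced_accuracy(y_true, y_pred))  # 0.5
```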
The evaluation focused on measuring model performance in a standardized, reproducible manner. Each task specifies its associated evaluation metric, allowing future researchers to add new models to the comparison without ambiguity about how performance should be measured.
GPT-4 was the strongest overall performer in the original evaluation, outperforming all other models across most task categories. On issue-spotting tasks, GPT-4 led GPT-3.5 and Claude 1 by a margin of over 20 percentage points. However, GPT-4's advantage was not uniform. On certain task groups, the gap between GPT-4 and smaller models narrowed considerably.
One of the most important findings from LegalBench is that a given model's performance can vary dramatically across different legal tasks. This suggests that aggregate scores across all 162 tasks can be misleading. A model that performs well on contract clause classification may struggle with merger agreement analysis or statutory interpretation.
For example, on Supply Chain Disclosure tasks (classifying corporate disclosures that run one to two pages), large commercial models averaged between 74% and 75% balanced accuracy. But on MAUD tasks (multiple-choice questions about merger agreements), GPT-4's balanced accuracy dropped to just 47.8%, barely above chance for some multi-class tasks. In contrast, on certain Learned Hands issue-spotting tasks (such as immigration, traffic, and estate law), GPT-4 achieved balanced accuracy above 95%.
Despite their smaller parameter counts, some open-source models demonstrated competitive performance on specific categories. Flan-T5-XXL, with 11 billion parameters, outperformed both GPT-3.5 and Claude 1 on two categories: issue-spotting and rhetorical understanding. This finding suggests that model architecture and training methodology matter as much as raw scale for certain types of legal reasoning.
Within model families, larger variants generally outperformed smaller ones. Flan-T5-XXL outperformed Flan-T5-XL across all five evaluated reasoning categories. Similarly, LLaMA-2-13B outperformed LLaMA-2-7B on average across four reasoning categories.
GPT-3.5 and Claude 1 were closely matched in overall performance, separated by an average gap of only about 2 percentage points. This near-parity suggests that at the commercial model tier just below GPT-4, legal reasoning performance was relatively consistent across providers.
Consistent with the structure of the IRAC framework, models found the "Application" and "Conclusion" stages of legal reasoning more difficult than issue-spotting or rule-recall. Performance sharply degraded on tasks requiring the model to integrate factual details with legal rules and synthesize a reasoned determination. This pattern suggests that LLMs are better at recognizing and retrieving legal knowledge than they are at performing the kind of context-specific analysis and synthesis that characterizes skilled legal reasoning.
Model performance showed notable sensitivity to the choice of in-context examples and the phrasing of task instructions. Using plain-language descriptions of tasks sometimes outperformed more technical legal language. This finding has practical implications for how legal AI tools should be designed: the way a question is framed can significantly affect the quality of the model's response.
The Contract Understanding Atticus Dataset (CUAD) tasks comprise the largest single source of LegalBench tasks. Originally designed for contract review, these tasks were restructured into binary classification format. Each task focuses on a specific type of contractual clause, such as:

- Anti-assignment provisions
- Audit rights
- Change-of-control clauses
- Non-compete restrictions
- Most-favored-nation terms
- Caps on liability
These tasks are practically relevant because contract review is one of the most common applications of legal AI. The restructured CUAD tasks in LegalBench allow researchers to evaluate how well a model can identify specific provisions without requiring full span prediction.
The Mergers and Acquisitions Understanding Dataset tasks are among the most challenging in LegalBench. They require models to answer detailed questions about real merger agreements, testing the model's ability to understand complex transactional documents. The low performance of even GPT-4 on these tasks (47.8% average balanced accuracy) highlights the difficulty of M&A document analysis and suggests that this area represents a significant frontier for legal AI.
The Learned Hands tasks bring an access-to-justice dimension to LegalBench. Derived from the Learned Hands platform (a collaboration between the Stanford Legal Design Lab and the Suffolk Legal Innovation and Technology Lab), these tasks use real descriptions posted by individuals seeking legal help. The model must classify each description into the correct legal domain.
These tasks are especially relevant for legal aid organizations and court self-help programs, where automated issue-spotting could help route people to the correct resources more quickly. Performance on Learned Hands tasks varied widely by legal domain, with models achieving over 95% accuracy on some topics (immigration, traffic law) but struggling on others (torts, consumer protection).
The ten Supply Chain Disclosure tasks evaluate whether models can accurately classify corporate supply chain disclosures under laws such as the California Transparency in Supply Chains Act. The documents are typically one to two pages long, making them among the longer inputs in LegalBench. Large commercial models achieved 74% to 75% balanced accuracy on these tasks, indicating substantial room for improvement.
LegalBench's taxonomy is rooted in the IRAC (Issue, Rule, Application, Conclusion) framework, which is one of the most widely taught methods of legal analysis in law schools. Understanding this connection is important for interpreting LegalBench results.
In legal education, students learn to analyze problems by first identifying the issue (what legal question is presented), then stating the rule (what law governs the issue), then applying the rule to the specific facts, and finally reaching a conclusion. This sequential process mirrors how experienced lawyers approach unfamiliar legal problems.
LegalBench maps its task categories onto this framework: issue-spotting corresponds to "Issue," rule-recall corresponds to "Rule," rule-application corresponds to "Application," and rule-conclusion corresponds to "Conclusion." The additional categories of interpretation and rhetorical understanding capture reasoning types that operate alongside or in support of the IRAC process.
This mapping creates a diagnostic tool. When a model performs well on rule-recall but poorly on rule-application, researchers can infer that the model has encoded legal knowledge but struggles to deploy it in context-specific analysis. When a model performs well on issue-spotting but poorly on rule-conclusion, the bottleneck may lie in the synthesis of multiple reasoning steps rather than in any single component.
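In practice, this diagnostic reading amounts to grouping per-task scores by reasoning category before averaging, as in the sketch below. The task names, category mapping, and scores here are illustrative placeholders, not published results.

```python
# Sketch: turning per-task scores into the per-category diagnostic
# described above. The mapping and scores are illustrative placeholders.
from collections import defaultdict
from statistics import mean

task_category = {
    "hearsay": "rule-conclusion",
    "learned_hands_housing": "issue-spotting",
    "cuad_non_compete": "interpretation",
}
task_scores = {
    "hearsay": 0.71,
    "learned_hands_housing": 0.95,
    "cuad_non_compete": 0.88,
}

by_category = defaultdict(list)
for task, score in task_scores.items():
    by_category[task_category[task]].append(score)

for category, scores in sorted(by_category.items()):
    print(f"{category:20s} {mean(scores):.2f}")
```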
Since the original 2023 paper, numerous additional models have been evaluated on LegalBench. The VALS.ai platform maintains an updated leaderboard. As of early 2026, the top-performing models on LegalBench include:
| Rank | Model | Accuracy |
|---|---|---|
| 1 | Gemini 3 Pro | 87.0% |
| 2 | Gemini 3 Flash | 86.9% |
| 3 | GPT-5 | 86.0% |
| 4 | GPT-5.1 | 85.7% |
| 5 | GPT-5 | 84.6% |
| 6 | Gemini 2.5 Pro | 83.6% |
| 7 | Grok 4 | 83.4% |
| 8 | Gemini 2.5 Flash | 82.8% |
| 9 | o3 | 82.5% |
| 10 | GPT-4.1 | 81.9% |
These results demonstrate that legal reasoning performance has continued to improve with newer model generations. However, even the best models in 2026 have not surpassed 90% accuracy across all LegalBench tasks, indicating that the benchmark still presents meaningful challenges.
LegalBench has become one of the standard benchmarks for evaluating legal AI systems. It is included in the HELM (Holistic Evaluation of Language Models) evaluation suite from Stanford CRFM and is widely cited in papers evaluating legal language models. The paper has been cited extensively since its publication, and the benchmark has been used by researchers exploring topics such as legal prompt engineering, domain-specific fine-tuning, and the development of legal-specific LLMs.
For legal technology companies and law firms evaluating AI tools, LegalBench provides a standardized way to compare model capabilities on legally relevant tasks. Rather than relying on anecdotal demonstrations or general-purpose benchmarks, practitioners can examine how a model performs on specific task categories that align with their use cases, whether that is contract review (interpretation tasks), legal research (rule-recall tasks), or case assessment (issue-spotting tasks).
The inclusion of Learned Hands tasks and the involvement of legal impact organizations in the benchmark's construction reflect a commitment to ensuring that legal AI evaluation considers access-to-justice applications. This is significant because some of the most promising applications of legal AI involve helping individuals who cannot afford legal representation to navigate the legal system.
LegalBench has inspired follow-up projects, including LegalBench-RAG, a benchmark specifically designed to evaluate retrieval-augmented generation in the legal domain. The open science model of LegalBench, in which legal professionals contribute tasks and the benchmark grows over time, has also influenced the design of other domain-specific benchmarks.
While LegalBench represents a major advance in legal AI evaluation, it has several acknowledged limitations:

- **U.S.-centric scope.** The tasks focus almost exclusively on American law, so results say little about performance in other jurisdictions.
- **English only.** All tasks are in English, leaving multilingual legal reasoning unmeasured.
- **Emphasis on classification.** Most tasks are short classification or extraction problems, which do not capture open-ended legal work such as drafting, negotiation, or multi-step research.
- **Static snapshot.** The law evolves, while the tasks reflect the rules in force when they were written.
- **Possible contamination.** Many tasks derive from publicly available datasets, so some material may appear in models' pre-training data.
LegalBench is fully open and available to the research community:

- **Paper:** "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models" (NeurIPS 2023 Datasets and Benchmarks; arXiv:2308.11462)
- **Code and tasks:** https://github.com/HazyResearch/legalbench
- **Dataset:** https://huggingface.co/datasets/nguha/legalbench
The project website includes a "Getting Started" guide for researchers who want to evaluate new models on LegalBench, as well as information on how to contribute new tasks.