FinanceBench
Last reviewed
Jun 8, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 2,203 words
FinanceBench is an AI benchmark for open-book financial question answering, designed to test whether large language models can answer the kinds of questions a financial analyst asks about a publicly traded company when the relevant filings are available. The full benchmark comprises 10,231 questions about publicly traded companies in the United States, each paired with a human-verified answer and an "evidence string" drawn from a real public filing such as a 10-K, 10-Q, 8-K, or earnings report [1][2]. It was introduced in November 2023 by researchers at Patronus AI, together with collaborators from Contextual AI and Stanford University, in the paper "FinanceBench: A New Benchmark for Financial Question Answering" by Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen [1].
The headline result of the original study was that off-the-shelf models performed poorly: a closed-book GPT-4-Turbo answered only 9 percent of a 150-question evaluation sample correctly, and the same model paired with a shared retrieval vector store answered incorrectly or refused to answer 81 percent of those questions [1]. FinanceBench is structured as a test suite rather than a training set. An open subset of 150 cases is publicly released, while the full 10,231-question set is held out; researchers are directed to contact Patronus AI to run evaluations against the complete benchmark [2][3]. Since its release it has become a standard reference point reported by finance-focused large language model and retrieval-augmented generation systems, and it is one of the datasets folded into later financial-AI leaderboards [4].
Finance specialists routinely locate information about companies and industries, summarize it, and reason about it to support investment decisions, financial strategy, and due diligence. The FinanceBench authors argue that this is exactly the kind of labor that large language models might augment or automate, yet that the industry had few rigorous ways to measure whether models are good enough to trust in high-stakes settings [1].
The paper identifies several properties of the financial domain that make it hard for general-purpose models. Models need domain-specific knowledge of financial terminology, companies, and industries; they need up-to-date information because training data lags filings by months or years; financial questions frequently require numerical reasoning, a known weakness of large language models; answers often combine unstructured free text with structured tabular data, which many models handle poorly; and the necessary facts may be scattered across long documents rather than sitting in a single short passage [1]. A 10-K can run to roughly 250 pages, so finding the right figure is itself a retrieval problem.
The authors deliberately frame the task as open-book question answering, which includes a retrieval component, rather than simply handing the model the answer-bearing text. They contend that prior financial datasets were not grounded in the day-to-day activities of analysts, and that strong performance on a generic open-domain benchmark cannot be assumed to transfer to a specialized domain such as finance [1]. To motivate finance-specific evaluation, the paper points to BloombergGPT, a 50-billion-parameter model trained on a large finance-specific corpus and released in early 2023, as evidence of growing industry interest in finance-adapted large language models [1].
FinanceBench is a dataset of 10,231 question-answer-evidence triplets. It covers 40 companies publicly traded in the USA across 361 public filings released between 2015 and 2023, including 10-Ks, 10-Qs, 8-Ks, and earnings reports [1]. Each entry includes the question, the answer, an evidence string containing the information needed to verify that answer, and a page number from the relevant document; some entries also carry a "justification" explaining how a number was calculated. Every entry is labeled with the company name, the company's GICS sector, the document name, the document year, and the document type, which supports fine-grained analysis [1].
The benchmark was built by a multidisciplinary team and contains three kinds of questions, constructed in three different ways [1]:
| Question category | Count | How constructed |
|---|---|---|
| Domain-relevant | 925 | 25 standardized questions per company, posed for 37 of the 40 companies, generically relevant to analyzing any public company (for example, whether it paid a dividend in the last year) |
| Novel-generated | 1,323 | Company-specific questions written by financial analysts, posed for 37 of the 40 companies, averaging 36 per company, intended to be realistic, varied, and to require reasoning rather than pure extraction |
| Metrics-generated | 7,983 | Programmatically generated from 18 base financial metrics extracted from 10-K income statements, balance sheets, and cash flow statements over 2015 to 2022, posed for 32 of the 40 companies, averaging 249 per company |
For the metrics-generated questions, annotators extracted base metrics that could be computed from a single financial statement, then derived additional metrics from them (for example, net income margin from net income and total revenue) and instantiated questions using templates specific to each company, fiscal year, and statement, with phrasing variations to keep the questions varied [1].
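The template-based construction described above can be sketched in a few lines. This is only an illustration of the idea, not the authors' pipeline: the `StatementFacts` record, the template wording, the company name, and the figures are all hypothetical, and only the net income margin formula (net income divided by total revenue) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatementFacts:
    """Base metrics read from one income statement (hypothetical record)."""
    company: str
    fiscal_year: int
    net_income: float      # USD millions
    total_revenue: float   # USD millions


def net_income_margin(facts: StatementFacts) -> float:
    """Derived metric: net income as a fraction of total revenue."""
    return facts.net_income / facts.total_revenue


# Hypothetical phrasing variants; the real templates are not reproduced here.
TEMPLATES = (
    "What is {company}'s FY{year} net income margin? Answer as a percentage.",
    "Using the income statement, compute the FY{year} net income margin for {company}.",
)


def instantiate(facts: StatementFacts) -> list[dict]:
    """Produce one question-answer pair per phrasing variant."""
    answer = f"{net_income_margin(facts) * 100:.1f}%"
    return [
        {
            "question": t.format(company=facts.company, year=facts.fiscal_year),
            "answer": answer,
        }
        for t in TEMPLATES
    ]


# Made-up figures for a made-up company.
example = StatementFacts("ExampleCo", 2022, net_income=120.0, total_revenue=1500.0)
items = instantiate(example)
for item in items:
    print(item["question"], "->", item["answer"])
```

Because every variant shares one computed answer, a single extracted pair of base metrics yields several differently worded questions, which is how a small set of base metrics can scale to thousands of items.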
Separately, the paper applies a taxonomy of reasoning types to the domain-relevant and metrics-generated questions (8,908 questions in total). It classifies 2,493 questions (28 percent) as information extraction, where a specific value or text span is read directly from a filing; 5,897 questions (66 percent) as numerical reasoning, which involve calculations or comparisons; and 518 questions (6 percent) as logical reasoning, which involve qualitative inference, contrast, or judgment [1]. A team of about 20 annotators with finance backgrounds, including treasury analysts, finance MBAs, and junior analysts, built and quality-checked the data, with roughly 10 percent of the domain-relevant and novel-generated questions reviewed in a final quality pass [1].
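As a quick consistency check, the reasoning-type counts quoted above can be summed and converted to rounded shares. The numbers are taken from the paragraph and the category table; the variable names are just for illustration.

```python
# Reasoning-type counts for the domain-relevant and metrics-generated questions.
counts = {
    "information extraction": 2493,
    "numerical reasoning": 5897,
    "logical reasoning": 518,
}

total = sum(counts.values())

# Domain-relevant (925) plus metrics-generated (7,983) should give the same total.
categorized_total = 925 + 7983

# Whole-percent shares of each reasoning type.
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total, categorized_total, shares)
```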
From the full dataset, the authors constructed a balanced sample of 150 cases for human evaluation: 50 domain-relevant questions (stratified evenly across the 25 unique question templates), 50 randomly sampled novel-generated questions, and 50 randomly sampled metrics-generated questions [1]. This 150-case set is what is released openly so that others can reproduce the headline analysis [2][3]. The full 10,231-question benchmark is held out and not released for direct download; researchers are pointed to Patronus AI to evaluate models against the complete set [3].
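The sampling scheme can be sketched as follows. This is a hedged illustration rather than the authors' code: the field names (`category`, `template_id`) and the synthetic records are assumptions, and "stratified evenly" is read here as two questions per template (50 divided by 25).

```python
import random
from collections import Counter, defaultdict


def build_eval_sample(domain_relevant, novel_generated, metrics_generated,
                      per_category=50, seed=0):
    """Draw a balanced sample: stratified for domain-relevant, random for the rest."""
    rng = random.Random(seed)

    # Group domain-relevant questions by the template they were built from.
    by_template = defaultdict(list)
    for q in domain_relevant:
        by_template[q["template_id"]].append(q)

    # Equal allocation per template (50 // 25 = 2 under the assumption above).
    per_template = per_category // len(by_template)

    sample = []
    for template_id in sorted(by_template):
        sample.extend(rng.sample(by_template[template_id], per_template))

    # The other two categories are sampled uniformly at random.
    sample.extend(rng.sample(novel_generated, per_category))
    sample.extend(rng.sample(metrics_generated, per_category))
    return sample


# Synthetic placeholder records sized like the three categories.
domain = [{"category": "domain-relevant", "template_id": t, "company_id": c}
          for t in range(25) for c in range(37)]
novel = [{"category": "novel-generated", "id": i} for i in range(1323)]
metrics = [{"category": "metrics-generated", "id": i} for i in range(7983)]

sample = build_eval_sample(domain, novel, metrics)
per_category = Counter(q["category"] for q in sample)
per_template = Counter(q["template_id"] for q in sample
                       if q["category"] == "domain-relevant")
print(len(sample), dict(per_category))
```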
The study evaluated four models from three providers: OpenAI's GPT-4 and GPT-4-Turbo, Anthropic's Claude 2 (with a 100,000-token context window), and Meta's Llama 2. Not every model was paired with every setup and prompt order; the combinations that were run amount to 16 distinct configurations. The research team manually reviewed each configuration's answer to each of the 150 cases, for a total of 16 × 150 = 2,400 labeled answers (the headline Table 2 reports 8 representative configurations covering 1,200 labels) [1]. The evaluation used five settings designed to span both unrealistic reference points and deployment-like pipelines: closed book (no access to the filings), Oracle (the model is handed the evidence pages), single vector store (a separate store per document), shared vector store (one store across all filings), and long context (the entire filing placed in the context window) [1].
Each model response was labeled as one of three outcomes: correct answer (allowing minor deviations such as small rounding differences), incorrect answer (including calculations that are off or that contradict the evidence), or failure to answer (the model explicitly states it cannot answer, for example because it lacks access to the data) [1]. Refusing to answer is treated as distinct from answering incorrectly, on the view that a refusal carries less risk to a user than a confident wrong answer.
The central finding is that models perform poorly without the right information and remain error-prone even when retrieval works. GPT-4-Turbo in the closed-book setting answered only 9 percent of the 150 cases correctly, was incorrect on 3 percent, and failed to answer the remaining 88 percent [1]. The widely quoted figure that GPT-4-Turbo "incorrectly answered or refused to answer 81 percent of questions" comes from the shared vector store setting, where the model was correct on 19 percent, incorrect on 13 percent, and failed to answer 68 percent: the 13 percent of wrong answers plus the 68 percent of refusals give 81 percent [1].
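The arithmetic behind the 81 percent figure is easy to reproduce from the three-way labels. The sketch below uses the reported percentages directly; the `Outcome` enum and the `wrong_or_refused` helper are illustrative names, not part of any released FinanceBench tooling.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    FAILED = "failed to answer"


# Reported GPT-4-Turbo outcome shares, in percent of the 150 cases.
gpt4_turbo = {
    "closed book": {Outcome.CORRECT: 9, Outcome.INCORRECT: 3, Outcome.FAILED: 88},
    "shared vector store": {Outcome.CORRECT: 19, Outcome.INCORRECT: 13, Outcome.FAILED: 68},
}


def wrong_or_refused(shares: dict) -> int:
    """Percent of cases that were answered incorrectly or not answered at all."""
    if sum(shares.values()) != 100:
        raise ValueError("outcome shares must sum to 100 percent")
    return shares[Outcome.INCORRECT] + shares[Outcome.FAILED]


for setting, shares in gpt4_turbo.items():
    print(setting, wrong_or_refused(shares))
```

Keeping incorrect answers and refusals as separate labels, and only summing them on demand, preserves the distinction the study draws between the two kinds of failure.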
Performance improved substantially as the retrieval problem was made easier, but no configuration was reliable. The table below reproduces the per-configuration results on the 150-case human-evaluation sample [1].
| Model | Configuration | Correct | Incorrect | Failed to answer |
|---|---|---|---|---|
| GPT-4-Turbo | Closed book | 9% | 3% | 88% |
| GPT-4-Turbo | Shared vector store | 19% | 13% | 68% |
| Llama 2 | Shared vector store | 19% | 70% | 11% |
| GPT-4-Turbo | Single vector store | 50% | 11% | 39% |
| Llama 2 | Single vector store | 41% | 54% | 5% |
| Claude 2 | Long context | 76% | 21% | 3% |
| GPT-4-Turbo | Long context | 79% | 17% | 4% |
| GPT-4-Turbo | Oracle | 85% | 15% | 0% |
Several patterns stand out. A per-document vector store (the "single vector store" rows) outperformed one store shared across all filings for both GPT-4-Turbo (50 percent versus 19 percent correct) and Llama 2 (41 percent versus 19 percent), confirming that correct retrieval is critical [1]. Even in the Oracle setting, where the model is handed the evidence pages, GPT-4-Turbo still got 15 percent of answers wrong, showing that reasoning errors persist once the right text is in front of the model [1]. Models also failed in characteristically different ways: GPT-4-Turbo tended to refuse when unsure, whereas Llama 2 far more often produced confident incorrect answers (70 percent and 54 percent incorrect in the two vector-store settings), which the authors flag as a greater hallucination risk [1].
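One way to make the difference in failure style visible is to ask how often a model was wrong on the cases where it did give an answer. The ratio below is a derived illustration computed from the table, not a metric reported in the paper, and the function name is hypothetical.

```python
# (model, configuration): (correct, incorrect, failed to answer), in percent,
# copied from the results table.
results = {
    ("GPT-4-Turbo", "Closed book"): (9, 3, 88),
    ("GPT-4-Turbo", "Shared vector store"): (19, 13, 68),
    ("Llama 2", "Shared vector store"): (19, 70, 11),
    ("GPT-4-Turbo", "Single vector store"): (50, 11, 39),
    ("Llama 2", "Single vector store"): (41, 54, 5),
    ("Claude 2", "Long context"): (76, 21, 3),
    ("GPT-4-Turbo", "Long context"): (79, 17, 4),
    ("GPT-4-Turbo", "Oracle"): (85, 15, 0),
}


def error_rate_when_answering(correct: int, incorrect: int, failed: int) -> float:
    """Share of attempted (non-refused) answers that were incorrect."""
    attempted = correct + incorrect
    return incorrect / attempted


rates = {key: error_rate_when_answering(*row) for key, row in results.items()}

for (model, config), rate in rates.items():
    print(f"{model:12s} {config:20s} {rate:.0%}")
```

On the shared vector store, the two models have the same share of correct answers, but Llama 2 is wrong on a much larger fraction of the answers it attempts, which is the pattern the paragraph above describes.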
The authors caution that the strong long-context and Oracle numbers are not a clean bill of health. Feeding entire filings into a long context window is slow and expensive, cannot accommodate documents larger than the context window, and is therefore unrealistic for many enterprise deployments; the Oracle setting is unrealistic by construction because it assumes perfect retrieval [1]. Prompt order also mattered: in the long-context setting, placing the filing before the question (context-first) lifted GPT-4-Turbo from 25 percent to 78 percent correct, indicating sensitivity to how the prompt is arranged [1]. Qualitatively, the paper documents cases where models produced superficially coherent, well-justified answers with extensive calculations that were nonetheless wrong, a textbook example of hallucination in a high-stakes domain. The overall conclusion is that all models examined exhibit weaknesses that limit their suitability for enterprise financial QA [1].
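The prompt-order effect concerns nothing more than where the filing text sits relative to the question. A minimal sketch of the two arrangements is shown below; the function name, labels, and wording are hypothetical, since the exact prompts used in the study are not reproduced here.

```python
def build_prompt(question: str, context: str, context_first: bool = True) -> str:
    """Assemble a prompt with the filing text before or after the question."""
    context_block = f"Filing:\n{context}"
    question_block = f"Question: {question}"
    if context_first:
        parts = [context_block, question_block]
    else:
        parts = [question_block, context_block]
    return "\n\n".join(parts)


# Placeholder inputs for illustration only.
question = "Did the company pay a dividend in the last year?"
filing_text = "(full text of the filing would go here)"

context_first_prompt = build_prompt(question, filing_text, context_first=True)
context_last_prompt = build_prompt(question, filing_text, context_first=False)
print(context_first_prompt)
```

Both prompts contain identical information, so any accuracy gap between them reflects the model's sensitivity to ordering rather than a difference in available evidence.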
FinanceBench was presented as a first-of-its-kind test suite for open-book financial QA, and its emphasis on grounding answers in real filings, citing primary evidence, and measuring hallucination made it a natural diagnostic for retrieval-augmented generation pipelines applied to finance [1][2]. In the years since, it has been used as a reference benchmark by work on financial RAG retrieval and by finance-adapted embedding and reranking systems, and it has been folded into broader financial-AI leaderboards that aggregate many tasks and datasets [4].
FinanceBench sits within a lineage of financial-NLP benchmarks but differs in scope. FinQA (Chen et al., 2021) is an expert-written dataset of more than 8,000 question-answer pairs that require chains of arithmetic operations over the tables in financial reports. ConvFinQA (Chen et al., 2022) extended FinQA to a conversational setting, with 3,892 conversations comprising 14,115 questions in which later questions can depend on earlier ones. TAT-QA (Zhu et al., 2021) targets numerical reasoning over hybrid tabular and textual content from financial reports, with 16,552 question-answer pairs built from 182 reports [1]. The paper also discusses earlier finance-NLP resources such as FiQA (aspect-based sentiment and opinionated QA) and the FLANG model with its FLUE evaluation suite [1].
The key distinction the authors draw is that FinQA, ConvFinQA, and TAT-QA are largely grounded in specific tables or short provided contexts, whereas FinanceBench is explicitly an open-book test over entire public filings, with a retrieval component that mirrors how analysts actually work [1]. By pairing realistic, ecologically valid questions with full-document evidence and a held-out set of more than 10,000 items, FinanceBench aimed to fill a gap that table-grounded datasets left open. Its public 150-case subset gave the community a reproducible slice for measuring progress on grounded financial question answering.