Deep Research Bench

Overview
Full name	Deep Research Bench
Abbreviation	DRB
Description	A benchmark evaluating LLM agents' web research capabilities using frozen web snapshots for reproducible evaluation
Initial release	6 May 2025
Latest version	Updated continuously on public leaderboard
Authors	Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, Jack Wildman
Organization	FutureSearch
Technical details
Type	Web research, multi-step agent tasks, information retrieval
Modality	Text, HTML, web content
Task format	Multi-step research questions with verified answers
Number of tasks	89 task instances across 8 categories
Evaluation metric	Binary, precision, recall, F1, fuzzy numerical match (task-dependent), composite score scaled 0 to 1
Action budget	50 ReAct actions per task attempt
Languages	English
Domains	General web research, fact-checking, dataset discovery, evidence gathering, claim validation
Performance
Noise ceiling	Approximately 0.8 (estimated max possible score)
SOTA score	0.51 (OpenAI o3, May 2025 launch); 0.55 (Claude Opus 4.6 at high effort, 2026 update)
SOTA model	OpenAI o3 (initial paper); Claude Opus 4.6 (later effort-scaling update)
Saturated	No
Resources
Website	evals.futuresearch.ai
Leaderboard	drb.futuresearch.ai
Paper	arXiv:2506.06287
Contact	evals@futuresearch.ai
License	Tasks and frozen corpora are proprietary; results published openly

Deep Research Bench (DRB) is a benchmark for evaluating large language model agents that perform multi-step web research, created by the artificial intelligence research lab FutureSearch. The benchmark, released on 6 May 2025 with an accompanying paper (arXiv:2506.06287), addresses a long-standing problem in agent evaluation: the live web changes constantly, which makes any web-grounded test non-reproducible and lets newer models look better simply because the indexed information has shifted. DRB solves this with a system called RetroSearch, which serves agents a frozen, previously scraped slice of the internet so that the same task can be re-run on every new model and produce comparable scores. The 89 multi-step tasks span eight categories of real research work, with answers carefully verified by skilled human analysts and a public leaderboard hosted at drb.futuresearch.ai.^[1]^[2]

At launch, FutureSearch reported that the best score on DRB was 0.51 out of 1.0, achieved by OpenAI's o3 reasoning model invoked through ChatGPT with web search, which outperformed the dedicated "Deep Research" products from OpenAI, Perplexity, Anthropic, and Google by a comfortable margin. The team estimated that even a flawless agent would plateau near 0.8 because of irreducible task ambiguity, so 0.51 is best read against that ceiling rather than against 1.0. Later updates have evaluated Claude Opus 4.x, GPT-5, and Gemini 3 variants on the same corpus, with the score gap between top models narrowing to a few percentage points.^[2]^[3]^[4]

Origin and creators

Deep Research Bench was built by FutureSearch, a San Francisco-area startup founded in August 2023 by Dan Schwarz (CEO) and Lawrence Phillips (CTO), both of whom previously worked at Metaculus, the public forecasting platform where Schwarz was CTO and Phillips led the AI team. Schwarz had earlier spent years as a senior software engineer at Google and Waymo and built Google's internal prediction market. FutureSearch raised a seed round in November 2024 and operates as a research and engineering team focused on AI forecasting and evaluation. The company's mission centers on building systems that can predict the future at human or superhuman levels, and rigorous benchmarks are a core part of that program.^[5]^[6]

The paper's full author list reads Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman, all affiliated with FutureSearch. The same group has produced the related Bench to the Future (BTF) benchmark, which applies the same pastcasting methodology to forecasting questions, using a corpus of 15 million documents and 1,417 hard forecasting questions whose resolutions are already known. BTF and DRB share the RetroSearch infrastructure and represent two complementary tests: BTF measures forecasting calibration while DRB measures factual research quality.^[7]

Motivation

Before Deep Research Bench, evaluating a web-browsing AI agent presented several problems that no benchmark fully resolved. The live web changes: pages move, sites go down, content gets edited, and search engines re-rank results. A model that scored 60 percent on a web research task in March might score 40 percent in October because relevant pages have disappeared, not because the model got worse. Conversely, a new model can appear stronger because the answer has become easier to find on Google. Stable comparison across model generations becomes impossible.

A second problem is that most existing research benchmarks reduce knowledge work to short-form question answering, such as MMLU or GPQA, which test what a model already knows from training rather than how well it can find and synthesize new information. Web-research evaluations that did exist either gave models offline corpora that did not look like the open web (so results did not transfer to deployment) or used the live web in non-reproducible ways. As a result, AI labs and customers could not tell which deep research product genuinely worked best, and progress was hard to measure over time.

FutureSearch designed DRB to be the first benchmark that captures real-world web research tasks fully offline while keeping the carefully worked human answers correct as the internet changes. The team chose tasks that take skilled human researchers between roughly 30 minutes and 4 hours to solve, which forces multi-step planning, source evaluation, and synthesis rather than single-query lookup. About 40 percent of the task pool was drawn from FutureSearch's own consulting work for paying clients, which means the benchmark reflects the kinds of questions companies actually pay researchers to answer.^[2]^[3]

RetroSearch architecture

RetroSearch is the technical foundation that makes Deep Research Bench reproducible. For each task, the FutureSearch team scrapes a large set of web pages relevant to the question, typically between 10,000 and 100,000 pages, using tools such as the Serper search API, Playwright for headless browsing, and ScraperAPI for sites that block bots. The most complex "Gather Evidence" tasks ship with as many as 189,000 frozen pages. These pages are stored offline and exposed to agents through an interface that mimics the live Google search API as closely as possible.

The key engineering goal of RetroSearch is to minimize the gap between offline and live agent behavior. When an agent issues a search query, RetroSearch returns ranked results drawn from the frozen corpus that look like Google results. When an agent fetches a URL, it gets the page exactly as it appeared on the scrape date. This lets researchers run the same agent stack against the frozen corpus repeatedly without worrying about flakiness, paywalls, geo-blocking, or rate limits. It also enables "pastcasting": running models on questions whose answers became known after the scrape date, so the model cannot have memorized the resolution from training.

A central empirical finding of the DRB paper is that offline RetroSearch agents perform comparably to live web agents, at least when the live web has not changed dramatically. This validates the methodology and means an offline benchmark is a faithful proxy for real deployment. The same RetroSearch infrastructure underlies the Bench to the Future forecasting benchmark, where it serves frozen views of the world for thousands of past resolution dates.^[1]^[2]^[7]
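
The search-and-fetch behaviour described above can be pictured with a toy in-memory corpus. This is a minimal sketch under stated assumptions, not FutureSearch's implementation: the `FrozenCorpus` class, its `search` and `fetch` methods, the naive term-overlap ranking, the example URLs, and the scrape date are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    title: str
    text: str
    scraped_on: str  # ISO date the snapshot was taken (hypothetical value below)


class FrozenCorpus:
    """Toy stand-in for a frozen web snapshot (hypothetical, not RetroSearch itself)."""

    def __init__(self, pages):
        self._pages = {p.url: p for p in pages}

    def search(self, query, k=3):
        """Return up to k pages ranked by naive query-term overlap."""
        terms = set(query.lower().split())
        scored = []
        for page in self._pages.values():
            words = set((page.title + " " + page.text).lower().split())
            overlap = len(terms & words)
            if overlap:
                scored.append((overlap, page.url))
        # Highest overlap first; the URL breaks ties so ranking is deterministic.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self._pages[url] for _, url in scored[:k]]

    def fetch(self, url):
        """Return the page as scraped, or None if it is not in the snapshot."""
        return self._pages.get(url)


corpus = FrozenCorpus([
    Page("https://example.org/recalls", "Device recalls", "class ii recalls by year", "2025-01-15"),
    Page("https://example.org/ipos", "IPO counts", "annual us ipos by offer price", "2025-01-15"),
])

hits = corpus.search("class ii recalls")
print([p.url for p in hits])
print(corpus.fetch("https://example.org/missing"))
```

Because the corpus never changes, the same query returns the same ranked list on every run, which is the property that makes scores comparable across model generations.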

Task design

Deep Research Bench includes 89 distinct task instances spanning 8 categories. Each task has a natural-language prompt, a frozen corpus of relevant web pages, and one or more carefully worked human answers used as ground truth. Tasks were designed to take a skilled human researcher between 30 minutes and 4 hours, and the answers were validated by FutureSearch's internal research analysts. Tasks span domains including finance, science, regulation, technology trends, government data, and academic literature.

The eight task categories

Category	What it tests	Example	Scoring
Find Number	Locate a single specific number on the web	"How many FDA Class II medical device recalls occurred in a given period?"	Binary (correct or incorrect within tolerance)
Find Dataset	Identify a specific dataset that answers the question	"Find dataset of monthly software developer job postings 2019 to 2023"	Binary or precision-style match
Find Original Source	Trace a claim back to its primary source	"Find the original paper or filing for a quoted statistic"	Binary
Validate Claim	Decide if a public statement is true, false, or unverifiable	"Is ChatGPT 10 times more energy intensive than a Google search?"	Binary with required evidence
Derive Number	Compute a number by combining multiple sources	"Total renewable energy capacity in a country derived from regional reports"	Fuzzy numerical match against ground truth
Gather Evidence	Compile supporting and contradicting evidence around a claim	"Evidence on climate impact on staple crop yields"	Recall-weighted scoring against curated evidence list
Populate Reference Class	Produce a comprehensive list of items meeting a definition	"List all fintech unicorns founded since 2020"	Precision and recall against curated set
Compile Dataset	Build a structured dataset from scattered sources	"Annual count of US IPOs with offer price at least $5.00, 1980 to 2024"	Completeness and accuracy against ground truth dataset

The split between binary tasks (Find Number, Find Original Source, Validate Claim) and quantitative tasks (Derive Number, Gather Evidence, Populate Reference Class, Compile Dataset) is deliberate; Find Dataset sits between the two, scored either as a binary hit or as a precision-style match. The binary tasks reward correct end answers and resemble traditional QA. The quantitative tasks measure how thorough an agent is, which tends to be the harder skill and is where current models lose the most ground to humans.^[2]^[3]^[8]

Real task examples

The accompanying paper and FutureSearch's blog posts disclose a number of representative tasks, which give a sense of the benchmark's flavor:

A Find Number task asking how many FDA Class II medical device recalls occurred during a defined period, where the agent must navigate the FDA's public databases and avoid double-counting amendments.
A Validate Claim task asking whether a widely circulated statistic comparing the energy use of ChatGPT to Google Search is true, which requires locating original utility filings and computing per-query estimates from incomplete data.
A Compile Dataset task requiring annual counts of US IPOs at or above a specified offer price from 1980 to 2024, which forces the agent to combine SEC filings, news archives, and structured databases.
A Find Dataset task asking for a dataset of monthly software developer job postings from 2019 to 2023, which requires multiple labor-market sources and careful definitional choices.

Ground truth for each task was produced by domain experts at FutureSearch, often spending hours per task to verify each fact, and was double-checked against alternative interpretations of ambiguous wording.

Scoring methodology

Every task produces a score in the range 0 to 1. The scoring formula depends on the category:

Scoring type	Categories using it	Description
Binary	Find Number, Find Original Source, Validate Claim	1 if the agent's final answer matches ground truth, 0 otherwise. Find Number uses a tolerance.
Precision	Find Dataset, Populate Reference Class	Correct items returned divided by total items returned
Recall	Gather Evidence, Compile Dataset	Correct items returned divided by total ground-truth items
F1	Combined precision-recall tasks	Harmonic mean of precision and recall, see F1 score
Fuzzy numerical	Derive Number	A bounded distance from the correct value, normalized to 0 to 1
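
The scoring types in the table above can be sketched as small functions. This is an illustrative sketch only: the text does not give the exact tolerance used for Find Number or the exact falloff used for fuzzy numerical matching, so the `rel_tol` and `max_rel_error` parameters and the linear falloff are assumptions, and all function names are hypothetical.

```python
def binary_score(answer, truth, rel_tol=0.0):
    """1.0 if answer matches truth (numbers may use a relative tolerance), else 0.0."""
    if isinstance(answer, (int, float)) and isinstance(truth, (int, float)):
        return 1.0 if abs(answer - truth) <= rel_tol * abs(truth) else 0.0
    return 1.0 if answer == truth else 0.0


def precision(returned, truth):
    """Correct items returned divided by total items returned."""
    returned, truth = set(returned), set(truth)
    return len(returned & truth) / len(returned) if returned else 0.0


def recall(returned, truth):
    """Correct items returned divided by total ground-truth items."""
    returned, truth = set(returned), set(truth)
    return len(returned & truth) / len(truth) if truth else 0.0


def f1(returned, truth):
    """Harmonic mean of precision and recall."""
    p, r = precision(returned, truth), recall(returned, truth)
    return 2 * p * r / (p + r) if p + r else 0.0


def fuzzy_numeric(answer, truth, max_rel_error=1.0):
    """Assumed linear falloff: 1.0 at an exact match, 0.0 at or beyond max_rel_error."""
    if truth == 0:
        return 1.0 if answer == 0 else 0.0
    rel_error = abs(answer - truth) / abs(truth)
    return max(0.0, 1.0 - rel_error / max_rel_error)


returned, truth = {"a", "b", "c", "d"}, {"a", "b", "e"}
print(precision(returned, truth), recall(returned, truth), f1(returned, truth))
print(binary_score(102, 100, rel_tol=0.05), fuzzy_numeric(90, 100))
```

Every function returns a value between 0 and 1, so scores from different categories can be averaged on a common scale.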

The overall DRB score is the average across all 89 tasks (sometimes reported as a percentage, sometimes a 0 to 1 fraction). Because the eight categories vary in difficulty, FutureSearch also reports per-category breakdowns. A model with strong binary QA but weak set-building skills will look very different on Find Number versus Populate Reference Class.

An important calibration is the noise ceiling. FutureSearch estimates that a hypothetical perfect agent would still score only around 0.8 because of task ambiguity, disagreements about which sources count as authoritative, and small errors in ground truth itself. Scores should therefore be interpreted against that ceiling rather than 1.0. The 0.51 achieved by o3 corresponds to roughly 64 percent of the achievable ceiling.^[3]
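
The aggregation and the ceiling adjustment are simple arithmetic, sketched below. The per-task scores are hypothetical placeholders, and the function names are illustrative; only the 0.8 ceiling and the 0.51 launch score come from the text.

```python
from collections import defaultdict
from statistics import mean

NOISE_CEILING = 0.8  # FutureSearch's approximate estimate, per the text


def overall_and_by_category(task_scores):
    """task_scores: list of (category, score) pairs with scores in [0, 1]."""
    by_category = defaultdict(list)
    for category, score in task_scores:
        by_category[category].append(score)
    overall = mean(score for _, score in task_scores)
    return overall, {cat: mean(vals) for cat, vals in by_category.items()}


def fraction_of_ceiling(score, ceiling=NOISE_CEILING):
    """Express a raw score as a share of the estimated achievable maximum."""
    return score / ceiling


# Hypothetical per-task scores, for illustration only.
scores = [("Find Number", 1.0), ("Find Number", 0.0), ("Derive Number", 0.5)]
overall, per_category = overall_and_by_category(scores)
print(round(overall, 3), per_category)

# o3's launch score against the ceiling: 0.51 / 0.8, about 64 percent.
print(round(fraction_of_ceiling(0.51), 4))
```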

Trace-level analytics

Beyond the final answer, DRB analyzes the agent's reasoning trace to produce diagnostic metrics that explain why an agent fails:

Metric	What it measures	Why it matters
Forgetting	How often the agent loses track of prior findings or goals across the trace	Strongest negative predictor of final score, correlation around -0.84
Action hallucination	Frequency of fabricated tool inputs, like inventing URLs that never existed	Indicates poor calibration of the agent's tool-use loop
Repetitive tool use	Identical or near-identical searches issued repeatedly	A sign of strategy collapse
Query mistakes	Over-quoting, missing operators, lazy keyword matching	Shows weak search-engine literacy
Premature stopping	The agent commits to an answer before verification	Common in lower-tier models
Source citation	Whether outputs cite verifiable URLs from the corpus	Tracks reliability

The paper reports that the three most heavily studied failure modes (forgetting, action hallucination, repetitive tool use) explain only between 9 percent and 13 percent of the variance in final scores. This suggests that strategic planning, source judgment, and domain understanding (which are harder to quantify) dominate research performance, which is broadly consistent with what skilled human researchers report about their own work.^[2]^[3]
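
As one illustration of a trace-level metric, repetitive tool use can be approximated by normalizing search queries and counting repeats. This is a sketch under assumptions: the trace format (a list of dicts with `tool` and `input` keys), the normalization rule, and the function names are hypothetical, and the paper's own detection method may differ.

```python
from collections import Counter


def normalize(query):
    """Lowercase, strip quotes, and sort terms so near-identical searches collide."""
    terms = query.lower().replace('"', " ").split()
    return " ".join(sorted(terms))


def repeated_search_rate(trace):
    """Fraction of search actions that repeat an earlier (normalized) search."""
    searches = [normalize(step["input"]) for step in trace if step["tool"] == "search"]
    if not searches:
        return 0.0
    counts = Counter(searches)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(searches)


# Hypothetical trace: the third action re-issues the first search with quotes added.
trace = [
    {"tool": "search", "input": "FDA class II recalls 2023"},
    {"tool": "fetch", "input": "https://example.org/recalls"},
    {"tool": "search", "input": '"class II" FDA recalls 2023'},
    {"tool": "search", "input": "medical device recall database"},
]
print(repeated_search_rate(trace))
```

Here one of the three searches duplicates an earlier one, so the rate is one third. A real analysis would likely need a softer similarity measure than exact match after normalization.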

Models evaluated

The May 2025 launch evaluated 12 LLM backbones running through a ReAct-style agent with web search and document fetch tools, plus 11 commercial "Deep Research" products. The action budget for each task is 50 tool calls, and reasoning effort for thinking models is set manually so that comparisons are not skewed by token-budget differences.
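
A ReAct-style loop with a fixed action budget can be sketched as follows. This is not the FutureSearch harness: the `policy` callable, its return convention, and the scripted stand-in for an LLM are hypothetical; only the 50-action budget comes from the text.

```python
MAX_ACTIONS = 50  # per-task action budget stated in the text


def run_react(policy, tools, question, max_actions=MAX_ACTIONS):
    """Run a think/act/observe loop until the policy answers or the budget runs out.

    `policy(question, history)` is a hypothetical callable returning either
    ("answer", text) or (tool_name, tool_input).
    """
    history = []
    for _ in range(max_actions):
        kind, payload = policy(question, history)
        if kind == "answer":
            return payload, history
        observation = tools[kind](payload)
        history.append((kind, payload, observation))
    return None, history  # budget exhausted without a final answer


# Scripted stand-in for an LLM: search once, then answer from the observation.
def scripted_policy(question, history):
    if not history:
        return "search", question
    return "answer", history[-1][2]


tools = {"search": lambda q: f"top result for: {q}"}
answer, steps = run_react(scripted_policy, tools, "us ipo count 1999")
print(answer, len(steps))
```

Returning `None` when the budget is exhausted keeps a stalled agent distinguishable from one that committed to a wrong answer, which matters when analysing failure modes.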

Custom ReAct agent results (May 2025)

Model	Notes
OpenAI o3 (via ChatGPT with web search)	Top of the leaderboard at 0.51 overall, double-checks own findings frequently
Claude 3.7 Sonnet (Anthropic)	Close second, strong in both thinking and non-thinking modes
Gemini 2.5 Pro (Google)	Strong on tasks requiring structured planning and stepwise reasoning
GPT-4 Turbo (OpenAI)	Solid baseline, lagged thinking models on multi-step tasks
DeepSeek-R1	Best open-weight performer, narrowed the gap to GPT-4 Turbo but more prone to hallucination
Grok (xAI)	Moderate, varied across categories

FutureSearch's qualitative observations on the May 2025 cohort were that newer "thinking" models clearly outperformed earlier non-thinking models and that closed models maintained an edge over open-weight models. Claude 4 Sonnet and Opus surpassed o3 on a continuously updated agent leaderboard that is run separately from the paper figures. As of the May 2025 paper, Claude could not directly read PDFs in the FutureSearch harness, which materially limited its score.^[2]^[3]^[8]

Commercial "Deep Research" products tested

Product	Underlying model(s)	Result class
OpenAI Deep Research	o3-based	Underperformed plain o3 + web search
Perplexity Deep Research	Mixed Perplexity stack	Mid-pack
Anthropic Claude Research with extended thinking	Claude 3.x family	Improved on Anthropic's standard web search variant
Google Gemini Deep Research	Gemini 2.5 family	Improved over plain Gemini search
Grok DeepSearch	xAI Grok	Moderate
DeepSeek + Search	DeepSeek open-weights	Cost-competitive but weaker accuracy

One of the most discussed findings of the paper is that the bespoke "Deep Research" products did not always beat the base reasoning model with plain web search. Plain ChatGPT with o3 and search outperformed OpenAI's own dedicated Deep Research mode, and Perplexity's specialized deep research lagged behind its standard Pro search in several categories. The Anthropic and Google deep research modes did improve over their plain-search counterparts.^[1]^[2]

2026 effort-scaling update

A FutureSearch follow-up post titled "More reasoning tokens helps Claude, but not GPT or Gemini" reported on a 2026 sweep that re-ran DRB with multiple reasoning effort settings. Both Claude Opus 4.6 and Claude Sonnet 4.6 gained meaningfully from higher reasoning budgets, with Sonnet 4.6 rising from 50.4 percent to 54.9 percent and Opus 4.6 reaching 55.0 percent at high effort. GPT-5 scores fell from 49.6 percent to 48.1 percent as effort increased, and Gemini 3 Flash dropped about 2 percentage points across the same range. The takeaway was that effort scaling does not improve every model and that Claude 4.6 benefited the most.^[4]
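
The direction of these effects is easiest to see as percentage-point deltas. The short sketch below uses only the two score pairs quoted above and assumes each pair runs from the lowest to the highest effort setting reported; the helper name `delta_pp` is illustrative.

```python
# (lower-effort %, higher-effort %) pairs as quoted in the text.
effort_scores = {
    "Claude Sonnet 4.6": (50.4, 54.9),
    "GPT-5": (49.6, 48.1),
}


def delta_pp(low, high):
    """Change in percentage points, rounded to one decimal place."""
    return round(high - low, 1)


for model, (low, high) in effort_scores.items():
    print(f"{model}: {delta_pp(low, high):+.1f} pp")
```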

Cost and runtime

FutureSearch also publishes a cost-per-task breakdown using Pareto frontier analysis. As of the 2026 cost post, Gemini 3 Flash at low effort was the cheapest option at roughly $0.05 per task, Claude 4.6 Opus at low effort cost about $0.24 per task with 53.1 percent accuracy and runtime near 130 seconds, and Claude 4.6 Opus at high effort cost about $0.55 per task at 55.0 percent accuracy with runtime around 6 minutes. Most models clustered well under one dollar per task, which FutureSearch summarized as "deep research is surprisingly affordable". Wall-clock times partly reflected the per-provider token rate limits available during evaluation.^[9]
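
A Pareto frontier over cost and accuracy keeps only the configurations that no other configuration beats on both axes at once. The sketch below is illustrative: the two Claude rows use the figures quoted above, while the third row is an invented placeholder included only to show a dominated point, and the function name is hypothetical.

```python
def pareto_frontier(points):
    """Return names of configs not dominated on (lower cost, higher accuracy).

    points: list of (name, cost_usd_per_task, accuracy_pct).
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


configs = [
    ("Claude 4.6 Opus, low effort", 0.24, 53.1),   # from the text
    ("Claude 4.6 Opus, high effort", 0.55, 55.0),  # from the text
    ("hypothetical config X", 0.60, 52.0),         # invented to show domination
]
print(pareto_frontier(configs))
```

Both Opus settings stay on the frontier because the high-effort run buys extra accuracy at extra cost, whereas the placeholder is dropped because the low-effort run is both cheaper and more accurate.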

Findings and lessons

Deep Research Bench produced several findings that have shaped subsequent work on research agents.

Thinking and verification matter

The single most consistent result is that reasoning models that explicitly verify their own intermediate findings outperform models that do not. OpenAI's o3 was notable for actively double-checking findings during its trace, which correlated with a higher final score and lower hallucination rate. Claude models with extended thinking enabled also gained measurably over their non-thinking variants. The implication is that good web research is less a function of raw model size and more a function of an iterative verification loop.

Forgetting is the dominant failure mode

Forgetting, defined as losing track of earlier findings or goals as the trace grows, had the strongest negative correlation with final score in the paper (about -0.84). The benchmark's long traces make this measurable in ways shorter QA benchmarks cannot. Practical implications include the need for explicit scratchpad management, summary checkpoints, and longer effective context, all of which several frontier models have iterated on since the paper.

Tool-equipped agents do not always beat tool-less agents

FutureSearch also tested "toolless" agents (LLMs answering from internal knowledge with no web access) on a subset of tasks. On the Validate Claim category, toolless agents averaged 0.61 while tool-enabled agents averaged 0.62, an essentially flat result. This is a striking finding for tasks where the claim's truth value is widely covered in training data. On tasks that require current information or synthesis from multiple specific sources, like Derive Number and Gather Evidence, tool access was essential and the toolless gap was wide.

Specialized "deep research" products do not always help

The paper's most counterintuitive headline is that some of the most heavily marketed "Deep Research" products underperformed simpler agent stacks built on the same model. Plain o3 with web search beat OpenAI's Deep Research; plain Perplexity Pro often beat Perplexity Deep Research. This pattern, while not universal, suggests that the product-level scaffolding around deep research can sometimes hurt more than it helps, possibly because longer agentic loops accumulate more forgetting and more hallucinated tool calls.

Open versus closed gap is narrowing but real

DeepSeek-R1 was the strongest open-weight contender in May 2025 and roughly matched GPT-4 Turbo, but it lagged the frontier closed models and showed higher hallucination rates. The gap has narrowed in later runs as open-weight reasoning models improved, but as of the most recent leaderboard snapshots Claude Opus 4.x and OpenAI o3 still hold the top spots for accuracy at high effort.^[1]^[2]^[3]

Comparison with other benchmarks

Deep Research Bench sits in a crowded field of agentic and retrieval-augmented benchmarks. The table below compares the most prominent ones.

Benchmark	Organization	Tasks	What it measures	Reproducibility	Notes
Deep Research Bench (DRB)	FutureSearch	89 multi-step, 8 categories	Web research quality with verified human answers	High, frozen RetroSearch corpus	Pastcasting design, focuses on practical research
BrowseComp	OpenAI	1,266 short questions	Persistent web navigation for hard-to-find facts	Lower, depends on live web	OpenAI Deep Research solves about half
Humanity's Last Exam	Center for AI Safety, Scale	2,500 expert-level closed questions	Frontier expert knowledge across many fields	High, static	Multi-modal, no web tool requirement
FRAMES	Google	Multi-document factoid set	Long-context factual QA with reasoning	Static	Often paired with retrieval evaluations
GAIA	Hugging Face and partners	450 general-assistant questions	Reasoning, tool use, multi-modality	Static	Broader scope than DRB but shorter tasks
DeepResearch Bench	Independent academic group	100 PhD-level long-form tasks	Long-form research report quality, RACE and FACT scores	Live web	Different project, similar name; English and Chinese
SimpleQA	OpenAI	Many short factual questions	Closed-book factual recall	Static	No browsing

A few comparative points are worth pulling out. BrowseComp is closest in spirit to DRB but uses single-shot hard-to-find questions on the live web, while DRB uses multi-step tasks on a frozen corpus. Humanity's Last Exam tests what a model already knows; DRB tests how well a model finds and synthesizes information it does not already know. FRAMES is essentially a multi-document QA benchmark without the agent loop. GAIA covers more modalities than DRB but each task is shorter. And the similarly named "DeepResearch Bench" published by an independent academic group focuses on long-form report quality across 100 PhD-level tasks evaluated by the RACE and FACT frameworks, which is a different evaluation philosophy from DRB's per-task numerical scoring; the two are commonly mistaken for each other but are not the same benchmark.^[2]^[10]^[11]^[12]

Reception and use

Deep Research Bench received attention in the practitioner press and academic preprint discussions through the second half of 2025. Coverage in outlets such as Unite.AI, DEV Community, and various AI newsletters emphasized two themes: that current AI research agents are useful but not yet ready to be left unsupervised, and that a benchmark with a noise ceiling near 0.8 is a more honest scorecard than one that pretends a model can score 1.0.

The benchmark has been adopted internally by several agent-building teams. NVIDIA's AI-Q Blueprint, for example, includes a Deep Research Bench evaluation pipeline that runs the NeMo Agent Toolkit against the frozen DRB corpus, formats outputs to a JSONL contract, and submits scores to the official evaluator. The Parallel Web Systems product team uses both DRB and BrowseComp to benchmark the Pareto frontier of their own research stack and competing offerings. These integrations illustrate that DRB has become one of the de facto reference benchmarks for vendors selling "deep research" capabilities.^[8]^[12]^[13]

The FutureSearch effort-scaling and cost analyses, which use DRB as the scoring axis, have also been cited in coverage of new model releases. Claims about a model's research ability are increasingly accompanied by a DRB number rather than (or in addition to) a generic chat-quality figure. The benchmark has thus shifted from a one-off paper into a continuously updated leaderboard used in product positioning.^[4]^[9]

Criticisms and limitations

Deep Research Bench is widely respected, but it has several well-known limitations.

Coverage and size

Eighty-nine tasks is a modest sample. Critics note that some categories contain only a handful of instances each, which means per-category scores carry meaningful variance. The benchmark is also English-only, which limits its applicability to multilingual deployment. Earlier versions of the FutureSearch roadmap suggested expanded language coverage as a future direction, but as of mid-2026 the public release remains English-only.

Task selection bias

About 40 percent of the DRB tasks were sourced from FutureSearch's paying clients. This grounds the benchmark in real demand but may also bias the distribution toward business and finance topics relative to scientific or humanities research. The remaining tasks were authored in-house, and some observers have asked for greater transparency about how the team selected them and how representative they are of the broader space of research questions.

Noise ceiling and disagreement

The estimated 0.8 noise ceiling acknowledges that ground truth itself is not fully crisp. Some tasks have answers that depend on the interpretation of ambiguous wording ("all unicorn startups in fintech" can mean different things depending on the date and the definition of unicorn). When models cluster within a few points of one another, ranking depends heavily on these edge cases. The FutureSearch team has documented this and adjusted ground truth over time, but the issue is inherent to evaluating subjective research.

Snapshot freshness

A frozen corpus loses relevance as the web evolves. Information that mattered in May 2025 may be misleading or obsolete in 2027. FutureSearch refreshes some tasks and adds new ones, but the corpus cannot be re-scraped for every leaderboard refresh without losing comparability across runs. This trade-off between freshness and longitudinal comparability is fundamental to the pastcasting approach.

Limited explanation of failure variance

The paper's own analysis shows that the three quantified failure modes (forgetting, action hallucination, repetitive tool use) account for only 9 to 13 percent of variance in final scores. This is a candid admission that the diagnostic toolkit cannot yet explain why one strong model beats another by 3 points on a given task. The unexplained variance presumably involves harder-to-measure factors like source-quality judgment and strategic planning.

Closed dataset

Deep Research Bench's tasks and frozen RetroSearch corpora are not publicly downloadable. Evaluations are run by FutureSearch or by approved partners who contact evals@futuresearch.ai. This protects the benchmark from being trained against but reduces external reproducibility and slows independent academic study. Several follow-up benchmarks, including "Why Your Deep Research Agent Fails" (DeepHalluBench, arXiv:2601.22984) and Microsoft's LiveDRBench, have responded by releasing more open variants.^[14]

Marketing-versus-reality gap

Independent commentary on the benchmark has emphasized that DRB scores demonstrate a clear gap between vendor marketing and measured performance. Tools described as "AI research analysts" routinely score below 60 percent of the achievable ceiling, deliver overconfident summaries, and miss material information. This is a finding about the field, not about DRB, but it is one of the benchmark's most visible legacies.^[8]

Related benchmarks

Deep Research Bench is one of a family of benchmarks that emerged in 2025 and 2026 to evaluate research-style AI agents.

Bench to the Future (BTF / BTF-2). A FutureSearch follow-up using the same RetroSearch infrastructure to evaluate forecasting calibration on 1,417 already-resolved forecasting questions over a 15-million-document corpus. Reported in arXiv:2506.21558.
LiveDRBench. A Microsoft Research effort to build an objective live-web deep-research benchmark in the same spirit as DRB but with a more open release model.
BrowseComp-Plus (arXiv:2508.06600). A fairer and more transparent evaluation framework for deep-research browsing agents, focused on calibration and citation auditing.
DeepResearch Bench II (arXiv:2601.08536). A follow-up to the similarly named independent academic project, with 132 rubric-graded long-form tasks and 9,430 atomic rubrics for analysis, recall, and presentation.
Why Your Deep Research Agent Fails (arXiv:2601.22984). A trajectory-level hallucination analysis that introduces the PIES taxonomy of planning and summarization errors.
DRACO. Perplexity's internal deep-research evaluation, which the company has used to characterize "in the wild" research performance on user-supplied queries.
ResearchRubrics and DEER. Rubric-driven evaluations of long-form research reports across many domains, complementary to DRB's short-form numerical scoring.

This ecosystem reflects the broader recognition that there is no single "best" way to evaluate a research agent: DRB measures verifiable answers, BrowseComp measures hard-to-find facts, DeepResearch Bench measures long-form report quality, and BTF measures forecasting calibration. Most serious agent developers now report numbers on several of these benchmarks rather than one.^[7]^[14]^[15]^[16]^[17]

Influence on agent design

DRB findings have influenced concrete agent engineering choices. The strong negative correlation between forgetting and final score has pushed teams to invest in scratchpad summarization, episodic memory, and longer effective context windows. The poor performance of some specialized "Deep Research" modes versus plain reasoning models has pushed product teams to simplify their agent loops rather than add scaffolding. The high cost of high-effort runs has spurred work on adaptive effort allocation, where the agent uses minimal effort on simple tasks and reserves long thinking budgets for hard ones.

More broadly, DRB helped legitimize the pastcasting and frozen-corpus methodology for evaluating web-grounded systems, an idea that has since spread to forecasting benchmarks (BTF), live-web variants (LiveDRBench), and citation-audit benchmarks (BrowseComp-Plus). The architectural pattern of separating the agent loop from a controlled retrieval environment is now common in research-agent papers, which is part of DRB's lasting contribution.

Significance

Deep Research Bench was the first widely cited benchmark to combine three properties at once: realistic multi-step web research tasks rather than single-shot QA, human-verified answers that stay correct as the web changes, and a public leaderboard that is updated as new models are released. By doing so it gave the field a stable yardstick for one of the most commercially important AI workflows.

The benchmark's specific numbers (o3 at 0.51, a noise ceiling near 0.8, forgetting as the dominant failure mode) have entered the working vocabulary of agent developers. Its methodology (frozen RetroSearch, ReAct loop with a 50-action budget, trace-level analytics) has been adopted or extended by, or used as a point of contrast in, most subsequent benchmarks in the area. And its public findings (specialized deep-research products do not always beat their underlying models, thinking models verify themselves more, open-weight models are narrowing the gap) have been repeatedly confirmed and refined in follow-on work.

As AI assistants take on more of the work of research, due diligence, journalism, and analysis, the stability of benchmarks like DRB matters beyond a single leaderboard. The benchmark is one of the clearest current answers to the question "how good are these agents really at doing the job they are marketed for?", and the answer, as of mid-2026, is "useful but still firmly assistive rather than autonomous".

References

Bosse, N. I., Evans, J., Gambee, R. G., Hnyk, D., Mühlbacher, P., Phillips, L., Schwarz, D., Wildman, J. (2025). "Deep Research Bench: Evaluating AI Web Research Agents." arXiv:2506.06287. https://arxiv.org/abs/2506.06287
FutureSearch. "Deep Research Bench Leaderboard: LLM Web Research Agent Rankings." https://futuresearch.ai/deep-research-bench/
Unite.AI. "How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report." https://www.unite.ai/how-good-are-ai-agents-at-real-research-inside-the-deep-research-bench-report/
FutureSearch. "More reasoning tokens helps Claude, but not GPT or Gemini." https://futuresearch.ai/effort-scaling/
FutureSearch. "Company." https://futuresearch.ai/company/
FutureSearch. Crunchbase company profile. https://www.crunchbase.com/organization/varuna-ai
Wildman, J. et al. (2025). "Bench to the Future: A Pastcasting Benchmark for Forecasting Agents." arXiv:2506.21558. https://arxiv.org/abs/2506.21558
DEV Community. "The Reality Check." https://dev.to/rawveg/the-reality-check-3jc5
FutureSearch. "How Much Does Deep Research Cost? A Model-by-Model Breakdown." https://futuresearch.ai/blog/cost-of-deep-research/
Wei, J. et al. (2025). "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents." OpenAI. https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf
Parallel Web Systems. "A new pareto-frontier for Deep Research price-performance." https://parallel.ai/blog/deep-research-benchmarks
Epoch AI. "DeepResearchBench." https://epoch.ai/benchmarks/deepresearchbench
NVIDIA. "Deep Research Bench Evaluation of NVIDIA AI-Q Blueprint." https://docs.nvidia.com/aiq-blueprint/1.2.1/evaluation/benchmarks/deep-research-bench.html
Microsoft Research. "LiveDRBench." https://github.com/microsoft/LiveDRBench
BrowseComp-Plus paper. arXiv:2508.06600. https://arxiv.org/abs/2508.06600
"Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory." arXiv:2601.22984. https://arxiv.org/abs/2601.22984
DeepResearch Bench (independent project). https://deepresearch-bench.github.io/
LessWrong. "A Guide For LLM-Assisted Web Research." https://www.lesswrong.com/posts/uAEhvX6scvcZANWwg/a-guide-for-llm-assisted-web-research
Medium. Pranam Shetty. "Can LLMs really do web research? (and why your agent still gets stuck)." https://medium.com/@prxshetty/can-llms-really-do-web-research-and-why-your-agent-still-gets-stuck-d74598b44e45
