Deep Research Bench
Last reviewed: May 16, 2026
Sources: 19 citations
Review status: Source-backed
Revision: v2 · 5,386 words
| Deep Research Bench | |
|---|---|
| Overview | |
| Full name | Deep Research Bench |
| Abbreviation | DRB |
| Description | A benchmark evaluating LLM agents' web research capabilities using frozen web snapshots for reproducible evaluation |
| Initial release | 6 May 2025 |
| Latest version | Updated continuously on public leaderboard |
| Authors | Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, Jack Wildman |
| Organization | FutureSearch |
| Technical details | |
| Type | Web research, multi-step agent tasks, information retrieval |
| Modality | Text, HTML, web content |
| Task format | Multi-step research questions with verified answers |
| Number of tasks | 89 task instances across 8 categories |
| Evaluation metric | Binary, precision, recall, F1, fuzzy numerical match (task-dependent), composite score scaled 0 to 1 |
| Action budget | 50 ReAct actions per task attempt |
| Languages | English |
| Domains | General web research, fact-checking, dataset discovery, evidence gathering, claim validation |
| Performance | |
| Noise ceiling | Approximately 0.8 (estimated max possible score) |
| SOTA score | 0.51 (May 2025 launch); Claude Opus 4.6 high-effort 55.0% (2026 update) |
| SOTA model | OpenAI o3 (initial paper); Claude Opus 4.6 (later effort-scaling update) |
| Saturated | No |
| Resources | |
| Website | evals.futuresearch.ai |
| Leaderboard | drb.futuresearch.ai |
| Paper | arXiv:2506.06287 |
| Contact | evals@futuresearch.ai |
| License | Tasks and frozen corpora are proprietary; results published openly |
Deep Research Bench (DRB) is a benchmark for evaluating large language model agents that perform multi-step web research, created by the artificial intelligence research lab FutureSearch. The benchmark, released on 6 May 2025 with an accompanying paper (arXiv:2506.06287), addresses a long-standing problem in agent evaluation: the live web changes constantly, which makes any web-grounded test non-reproducible and lets newer models look better simply because the indexed information has shifted. DRB solves this with a system called RetroSearch, which serves agents a frozen, previously scraped slice of the internet so that the same task can be re-run on every new model and produce comparable scores. The 89 multi-step tasks span eight categories of real research work, with answers carefully verified by skilled human analysts and a public leaderboard hosted at drb.futuresearch.ai.[1][2]
At launch, FutureSearch reported that the best score on DRB was 0.51 out of 1.0, achieved by OpenAI's o3 reasoning model invoked through ChatGPT with web search, which outperformed the dedicated "Deep Research" products from OpenAI, Perplexity, Anthropic, and Google by a comfortable margin. The team estimated that even a flawless agent would plateau near 0.8 because of irreducible task ambiguity, so 0.51 amounts to roughly 64 percent of the achievable score rather than about half. Later updates have evaluated Claude Opus 4.x, GPT-5, and Gemini 3 variants on the same corpus, with the score gap between top models narrowing to a few percentage points.[2][3][4]
Deep Research Bench was built by FutureSearch, a San Francisco-area startup founded in August 2023 by Dan Schwarz (CEO) and Lawrence Phillips (CTO), both of whom previously worked at Metaculus, the public forecasting platform where Schwarz was CTO and Phillips led the AI team. Schwarz had earlier spent years as a senior software engineer at Google and Waymo and built Google's internal prediction market. FutureSearch raised a seed round in November 2024 and operates as a research and engineering team focused on AI forecasting and evaluation. The company's mission has centered on building systems that can predict the future at human or superhuman levels, and rigorous benchmarks are a core part of that program.[5][6]
The paper's full author list reads Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman, all affiliated with FutureSearch. The same group has produced the related Bench to the Future (BTF) benchmark, which applies the same pastcasting methodology to forecasting questions, using a corpus of 15 million documents and 1,417 hard forecasting questions whose resolutions are already known. BTF and DRB share the RetroSearch infrastructure and represent two complementary tests: BTF measures forecasting calibration while DRB measures factual research quality.[7]
Before Deep Research Bench, evaluating a web-browsing AI agent presented several problems that no benchmark fully resolved. The live web changes: pages move, sites go down, content gets edited, and search engines re-rank results. A model that scored 60 percent on a web research task in March might score 40 percent in October because relevant pages have disappeared, not because the model got worse. Conversely, a new model can appear stronger because the answer has become easier to find on Google. Stable comparison across model generations becomes impossible.
A second problem is that most existing research benchmarks reduce knowledge work to short-form question answering, like MMLU or GPQA, which test what a model already knows from training rather than how well it can find and synthesize new information. Web-research evaluations that did exist either gave models offline corpora that did not look like the open web (so results did not transfer to deployment) or used live web in non-reproducible ways. The result was that AI labs and customers could not tell which deep research product genuinely worked best, and progress was hard to measure over time.
FutureSearch designed DRB to be the first benchmark that captures real-world web research tasks fully offline while keeping the carefully worked human answers correct as the internet changes. The team chose tasks that take skilled human researchers between roughly 30 minutes and 4 hours to solve, which forces multi-step planning, source evaluation, and synthesis rather than single-query lookup. About 40 percent of the task pool was drawn from FutureSearch's own consulting work for paying clients, which means the benchmark reflects the kinds of questions companies actually pay researchers to answer.[2][3]
RetroSearch is the technical foundation that makes Deep Research Bench reproducible. For each task, the FutureSearch team scrapes a large set of web pages relevant to the question, typically between 10,000 and 100,000 pages, using tools such as the Serper search API, Playwright for headless browsing, and ScraperAPI for sites that block bots. The most complex "Gather Evidence" tasks ship with as many as 189,000 frozen pages. These pages are stored offline and exposed to agents through an interface that mimics the live Google search API as closely as possible.
The key engineering goal of RetroSearch is to minimize the gap between offline and live agent behavior. When an agent issues a search query, RetroSearch returns ranked results drawn from the frozen corpus that look like Google results. When an agent fetches a URL, it gets the page exactly as it appeared on the scrape date. This lets researchers run the same agent stack against the frozen corpus repeatedly without worrying about flakiness, paywalls, geo-blocking, or rate limits. It also enables "pastcasting": running models on questions whose answers became known after the scrape date, so the model cannot have memorized the resolution from training.
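The search-and-fetch contract described above can be illustrated with a toy in-memory corpus. This is only a sketch: the `FrozenPage` and `FrozenCorpus` names, the term-overlap ranking, and the example URLs are all hypothetical, and nothing here reflects how RetroSearch actually indexes or ranks pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenPage:
    url: str
    title: str
    text: str  # page content as captured on the scrape date


class FrozenCorpus:
    """Hypothetical stand-in for a frozen-snapshot search backend."""

    def __init__(self, pages):
        self._pages = {page.url: page for page in pages}

    def search(self, query, top_k=10):
        """Rank pages by how many query terms appear in the title or text."""
        terms = set(query.lower().split())
        scored = []
        for page in self._pages.values():
            words = set((page.title + " " + page.text).lower().split())
            overlap = len(terms & words)
            if overlap:
                scored.append((overlap, page.url))
        # Highest overlap first; ties broken by URL so reruns are identical.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [url for _, url in scored[:top_k]]

    def fetch(self, url):
        """Return the frozen page, or None if it was never scraped."""
        return self._pages.get(url)


# Invented example pages, not taken from any DRB corpus.
corpus = FrozenCorpus([
    FrozenPage("https://example.org/a", "Device recalls", "class ii recalls by year"),
    FrozenPage("https://example.org/b", "Job postings", "monthly developer postings"),
])
results = corpus.search("class ii recalls")
```

Because both calls read only from the stored snapshot, repeated runs of the same agent see identical results, which is the property the benchmark relies on.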
A central empirical finding of the DRB paper is that offline RetroSearch agents perform comparably to live web agents, at least when the live web has not changed dramatically. This validates the methodology and means an offline benchmark is a faithful proxy for real deployment. The same RetroSearch infrastructure underlies the Bench to the Future forecasting benchmark, where it serves frozen views of the world for thousands of past resolution dates.[1][2][7]
Deep Research Bench includes 89 distinct task instances spanning 8 categories. Each task has a natural-language prompt, a frozen corpus of relevant web pages, and one or more carefully worked human answers used as ground truth. Tasks were designed to take a skilled human researcher between 30 minutes and 4 hours, and the answers were validated by FutureSearch's internal research analysts. Tasks span domains including finance, science, regulation, technology trends, government data, and academic literature.
| Category | What it tests | Example | Scoring |
|---|---|---|---|
| Find Number | Locate a single specific number on the web | "How many FDA Class II medical device recalls occurred in a given period?" | Binary (correct or incorrect within tolerance) |
| Find Dataset | Identify a specific dataset that answers the question | "Find dataset of monthly software developer job postings 2019 to 2023" | Binary or precision-style match |
| Find Original Source | Trace a claim back to its primary source | "Find the original paper or filing for a quoted statistic" | Binary |
| Validate Claim | Decide if a public statement is true, false, or unverifiable | "Is ChatGPT 10 times more energy intensive than a Google search?" | Binary with required evidence |
| Derive Number | Compute a number by combining multiple sources | "Total renewable energy capacity in a country derived from regional reports" | Fuzzy numerical match against ground truth |
| Gather Evidence | Compile supporting and contradicting evidence around a claim | "Evidence on climate impact on staple crop yields" | Recall-weighted scoring against curated evidence list |
| Populate Reference Class | Produce a comprehensive list of items meeting a definition | "List all fintech unicorns founded since 2020" | Precision and recall against curated set |
| Compile Dataset | Build a structured dataset from scattered sources | "Annual count of US IPOs with offer price at least $5.00, 1980 to 2024" | Completeness and accuracy against ground truth dataset |
The split between binary tasks (Find Number, Find Original Source, Validate Claim) and quantitative tasks (Derive Number, Gather Evidence, Populate Reference Class, Compile Dataset) is deliberate. The binary tasks reward correct end answers and resemble traditional QA. The quantitative tasks measure how thorough an agent is, which tends to be the harder skill and is where current models lose the most ground to humans.[2][3][8]
The accompanying paper and FutureSearch's blog posts disclose a number of representative tasks, which give a sense of the benchmark's flavor.
Ground truth for each task was produced by domain experts at FutureSearch, often spending hours per task to verify each fact, and was double-checked against alternative interpretations of ambiguous wording.
Every task produces a score in the range 0 to 1. The scoring formula depends on the category:
| Scoring type | Categories using it | Description |
|---|---|---|
| Binary | Find Number, Find Original Source, Validate Claim | 1 if the agent's final answer matches ground truth, 0 otherwise. Find Number uses a tolerance. |
| Precision | Find Dataset, Populate Reference Class | Correct items returned divided by total items returned |
| Recall | Gather Evidence, Compile Dataset | Correct items returned divided by total ground-truth items |
| F1 | Combined precision-recall tasks | Harmonic mean of precision and recall |
| Fuzzy numerical | Derive Number | A bounded distance from the correct value, normalized to 0 to 1 |
The overall DRB score is the average across all 89 tasks (sometimes reported as a percentage, sometimes a 0 to 1 fraction). Because the eight categories vary in difficulty, FutureSearch also reports per-category breakdowns. A model with strong binary QA but weak set-building skills will look very different on Find Number versus Populate Reference Class.
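The scoring types in the table can be sketched in a few lines of Python. The function names are illustrative, and two details are assumptions rather than the benchmark's published formulas: the binary tolerance is treated as a relative tolerance, and the fuzzy numerical score is taken to be one minus the relative error, clipped to the range 0 to 1.

```python
def binary_score(answer, truth, tolerance=0.0):
    """1.0 if the answer is within a relative tolerance of truth, else 0.0."""
    if truth == 0:
        return 1.0 if abs(answer) <= tolerance else 0.0
    return 1.0 if abs(answer - truth) / abs(truth) <= tolerance else 0.0


def precision_recall_f1(returned, ground_truth):
    """Set-based precision, recall, and their harmonic mean (F1)."""
    returned, ground_truth = set(returned), set(ground_truth)
    correct = len(returned & ground_truth)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def fuzzy_numeric_score(answer, truth):
    """Assumed form: 1 minus relative error, clipped to [0, 1]."""
    if truth == 0:
        return 1.0 if answer == 0 else 0.0
    return max(0.0, 1.0 - abs(answer - truth) / abs(truth))


def overall_score(task_scores):
    """Unweighted mean over per-task scores, each already in [0, 1]."""
    return sum(task_scores) / len(task_scores)


# Invented example: four items returned, three in the curated set, two overlap.
precision, recall, f1 = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
```

In this invented example precision is 2/4, recall is 2/3, and F1 falls between them, which shows why a model that over-lists items can keep recall high while losing precision.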
An important calibration is the noise ceiling. FutureSearch estimates that a hypothetical perfect agent would still score only around 0.8 because of task ambiguity, disagreements about which sources count as authoritative, and small errors in ground truth itself. Scores should therefore be interpreted against that ceiling rather than 1.0. The 0.51 achieved by o3 corresponds to roughly 64 percent of the achievable ceiling.[3]
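The ceiling-relative reading of a score is simple division. The helper below is an illustrative sketch; the function name is hypothetical, and 0.8 is the approximate ceiling quoted above, not an exact constant.

```python
NOISE_CEILING = 0.8  # approximate ceiling estimated by FutureSearch


def fraction_of_ceiling(score, ceiling=NOISE_CEILING):
    """Express a raw 0-to-1 score as a share of the estimated ceiling."""
    return score / ceiling


o3_share = fraction_of_ceiling(0.51)  # about 0.64
```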
Beyond the final answer, DRB analyzes the agent's reasoning trace to produce diagnostic metrics that explain why an agent fails:
| Metric | What it measures | Why it matters |
|---|---|---|
| Forgetting | How often the agent loses track of prior findings or goals across the trace | Strongest negative predictor of final score, correlation around -0.84 |
| Action hallucination | Frequency of fabricated tool inputs, like inventing URLs that never existed | Indicates poor calibration of the agent's tool-use loop |
| Repetitive tool use | Identical or near-identical searches issued repeatedly | A sign of strategy collapse |
| Query mistakes | Over-quoting, missing operators, lazy keyword matching | Shows weak search-engine literacy |
| Premature stopping | The agent commits to an answer before verification | Common in lower-tier models |
| Source citation | Whether outputs cite verifiable URLs from the corpus | Tracks reliability |
The paper reports that the three most heavily studied failure modes (forgetting, action hallucination, repetitive tool use) explain only between 9 percent and 13 percent of the variance in final scores. This suggests that strategic planning, source judgment, and domain understanding (which are harder to quantify) dominate research performance, which is broadly consistent with what skilled human researchers report about their own work.[2][3]
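As an illustration of how a trace-level metric such as repetitive tool use might be computed, the sketch below counts search queries that repeat an earlier one after light normalization. The normalization rules and the `repetition_rate` definition are assumptions made for this example; the paper's own operationalization may differ.

```python
from collections import Counter


def normalize_query(query):
    """Lowercase, drop double quotes, and sort tokens so near-duplicates match."""
    tokens = query.lower().replace('"', " ").split()
    return " ".join(sorted(tokens))


def repetition_rate(queries):
    """Fraction of queries that duplicate an earlier normalized query."""
    if not queries:
        return 0.0
    counts = Counter(normalize_query(q) for q in queries)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(queries)


# Invented trace: the first three queries are the same search in disguise.
trace_queries = [
    "fda class II recalls 2023",
    '"FDA Class II" recalls 2023',
    "recalls 2023 fda class ii",
    "fintech unicorns since 2020",
]
rate = repetition_rate(trace_queries)
```

Here two of the four queries add nothing new, so the rate is 0.5.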
The May 2025 launch evaluated 12 LLM backbones running through a ReAct-style agent with web search and document fetch tools, plus 11 commercial "Deep Research" products. The action budget for each task is 50 tool calls, and reasoning effort for thinking models is set manually so that comparisons are not skewed by token-budget differences.
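A budgeted ReAct-style loop can be sketched as follows. This is a minimal illustration, not the FutureSearch harness: the `policy` callable stands in for the LLM, and the tool names and the `"finish"` action are hypothetical conventions chosen for the example.

```python
ACTION_BUDGET = 50  # tool calls allowed per task attempt


def run_agent(policy, tools, task, budget=ACTION_BUDGET):
    """Alternate between choosing an action and observing its result.

    policy(task, history) returns (action_name, argument); the special
    action "finish" ends the episode with the argument as the answer.
    """
    history = []
    for _ in range(budget):
        action, argument = policy(task, history)
        if action == "finish":
            return argument, history
        observation = tools[action](argument)
        history.append((action, argument, observation))
    return None, history  # budget exhausted without a final answer


def toy_policy(task, history):
    """Search once, then answer with whatever came back."""
    if not history:
        return "search", task
    return "finish", history[-1][2]


def stubborn_policy(task, history):
    """Never commits to an answer, so it burns the whole budget."""
    return "search", task


tools = {"search": lambda query: f"results for {query}"}
answer, trace = run_agent(toy_policy, tools, "class ii recalls")
no_answer, long_trace = run_agent(stubborn_policy, tools, "class ii recalls")
```

Capping the loop by tool calls rather than by tokens keeps comparisons across models tied to the same number of research steps.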
| Model | Notes |
|---|---|
| OpenAI o3 (via ChatGPT with web search) | Top of the leaderboard at 0.51 overall, double-checks own findings frequently |
| Claude 3.7 Sonnet (Anthropic) | Close second, strong in both thinking and non-thinking modes |
| Gemini 2.5 Pro (Google) | Strong on tasks requiring structured planning and stepwise reasoning |
| GPT-4 Turbo (OpenAI) | Solid baseline, lagged thinking models on multi-step tasks |
| DeepSeek-R1 | Best open-weight performer, narrowed the gap to GPT-4 Turbo but more prone to hallucination |
| Grok (xAI) | Moderate, varied across categories |
FutureSearch's qualitative observations on the May 2025 cohort were that newer "thinking" models clearly outperformed earlier non-thinking models, that closed models maintained an edge over open-weight models, and that Claude 4 Sonnet and Opus surpassed o3 on a continuously updated agent leaderboard run separately from the paper figures. As of the May 2025 paper, Claude could not directly read PDFs in the FutureSearch harness, which materially limited its score.[2][3][8]
| Product | Underlying model(s) | Result class |
|---|---|---|
| OpenAI Deep Research | o3-based | Underperformed plain o3 + web search |
| Perplexity Deep Research | Mixed Perplexity stack | Mid-pack |
| Anthropic Claude Research with extended thinking | Claude 3.x family | Improved on Anthropic's standard web search variant |
| Google Gemini Deep Research | Gemini 2.5 family | Improved over plain Gemini search |
| Grok DeepSearch | xAI Grok | Moderate |
| DeepSeek + Search | DeepSeek open-weights | Cost-competitive but weaker accuracy |
One of the most discussed findings of the paper is that the bespoke "Deep Research" products did not always beat the base reasoning model with plain web search. Plain ChatGPT with o3 and search outperformed OpenAI's own dedicated Deep Research mode, and Perplexity's specialized deep research lagged behind its standard Pro search in several categories. The Anthropic and Google deep research modes did improve over their plain-search counterparts.[1][2]
A FutureSearch follow-up post titled "More reasoning tokens helps Claude, but not GPT or Gemini" reported on a 2026 sweep that re-ran DRB with multiple reasoning effort settings. Both Claude Opus 4.6 and Claude Sonnet 4.6 gained meaningfully from higher reasoning budgets, with Sonnet 4.6 rising from 50.4 percent to 54.9 percent and Opus 4.6 reaching 55.0 percent at high effort. GPT-5 scores fell from 49.6 percent to 48.1 percent as effort increased, and Gemini 3 Flash dropped about 2 percentage points across the same range. The takeaway was that effort scaling does not improve every model and that Claude 4.6 benefited the most.[4]
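The direction of the effort effect is easiest to see as a difference in percentage points. The sketch below uses only the two score pairs quoted above; the dictionary layout and function name are illustrative.

```python
# (score at the lower end of the sweep, score at the higher end), in percent
effort_sweep = {
    "Claude Sonnet 4.6": (50.4, 54.9),
    "GPT-5": (49.6, 48.1),
}


def effort_delta(low_effort_score, high_effort_score):
    """Change in percentage points when reasoning effort is raised."""
    return round(high_effort_score - low_effort_score, 1)


deltas = {
    model: effort_delta(low, high)
    for model, (low, high) in effort_sweep.items()
}
```

Sonnet 4.6 gains 4.5 points while GPT-5 loses 1.5, so the same knob moves the two models in opposite directions.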
FutureSearch also publishes a cost-per-task breakdown using Pareto frontier analysis. As of the 2026 cost post, Gemini 3 Flash at low effort was the cheapest option at roughly $0.05 per task, Claude Opus 4.6 at low effort cost about $0.24 per task with 53.1 percent accuracy and runtime near 130 seconds, and Claude Opus 4.6 at high effort cost about $0.55 per task at 55.0 percent accuracy with runtime around 6 minutes. Most models clustered well under one dollar per task, which FutureSearch summarized as "deep research is surprisingly affordable". Wall-clock times partly reflected the per-provider token rate limits available during evaluation.[9]
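A Pareto frontier over cost and accuracy keeps only the configurations that no other configuration beats on both axes. The sketch below is a generic implementation, not FutureSearch's analysis code. The two Claude rows use the figures quoted above, while "hypothetical model X" is an invented data point included only to show a dominated configuration.

```python
def pareto_frontier(configs):
    """Return configs not dominated on (lower cost, higher accuracy).

    Each config is (name, cost_per_task_usd, accuracy_percent). A config is
    dominated if another one is at least as good on both axes and strictly
    better on at least one.
    """
    frontier = []
    for name, cost, accuracy in configs:
        dominated = any(
            other_cost <= cost
            and other_acc >= accuracy
            and (other_cost < cost or other_acc > accuracy)
            for _, other_cost, other_acc in configs
        )
        if not dominated:
            frontier.append((name, cost, accuracy))
    return sorted(frontier, key=lambda item: item[1])


configs = [
    ("Claude Opus 4.6, low effort", 0.24, 53.1),   # figures from the text
    ("Claude Opus 4.6, high effort", 0.55, 55.0),  # figures from the text
    ("hypothetical model X", 0.40, 52.0),          # invented for illustration
]
frontier = pareto_frontier(configs)
frontier_names = [name for name, _, _ in frontier]
```

Both Claude settings stay on the frontier because each buys something the other does not: low effort is cheaper, high effort is more accurate. The invented model is dropped because low-effort Opus is both cheaper and more accurate.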
Deep Research Bench produced several findings that have shaped subsequent work on research agents.
The single most consistent result is that reasoning models that explicitly verify their own intermediate findings outperform models that do not. OpenAI's o3 was notable for actively double-checking findings during its trace, which correlated with a higher final score and lower hallucination rate. Claude models with extended thinking enabled also gained measurably over their non-thinking variants. The implication is that good web research is less a function of raw model size and more a function of an iterative verification loop.
Forgetting, defined as losing track of earlier findings or goals as the trace grows, had the strongest negative correlation with final score in the paper (about -0.84). The benchmark's long traces make this measurable in ways shorter QA benchmarks cannot. Practical implications include the need for explicit scratchpad management, summary checkpoints, and longer effective context, all of which several frontier models have iterated on since the paper.
FutureSearch also tested "toolless" agents (LLMs answering from internal knowledge with no web access) on a subset of tasks. On the Validate Claim category, toolless agents averaged 0.61 while tool-enabled agents averaged 0.62, an essentially flat result. The likely explanation is that the truth value of many such claims is already widely covered in training data, so web access adds little. On tasks that require current information or synthesis from multiple specific sources, like Derive Number and Gather Evidence, tool access was essential and the toolless gap was wide.
The paper's most counterintuitive headline is that some of the most heavily marketed "Deep Research" products underperformed simpler agent stacks built on the same model. Plain o3 with web search beat OpenAI's Deep Research; plain Perplexity Pro often beat Perplexity Deep Research. This pattern, while not universal, suggests that the product-level scaffolding around deep research can sometimes hurt more than it helps, possibly because longer agentic loops accumulate more forgetting and more hallucinated tool calls.
DeepSeek-R1 was the strongest open-weight contender in May 2025 and roughly matched GPT-4 Turbo, but it lagged the frontier closed models and showed higher hallucination rates. The gap has narrowed in later runs as open-weight reasoning models improved, but as of the most recent leaderboard snapshots Claude Opus 4.x and OpenAI o3 still hold the top spots for accuracy at high effort.[1][2][3]
Deep Research Bench sits in a crowded field of agentic and retrieval-augmented benchmarks. The table below compares the most prominent ones.
| Benchmark | Organization | Tasks | What it measures | Reproducibility | Notes |
|---|---|---|---|---|---|
| Deep Research Bench (DRB) | FutureSearch | 89 multi-step, 8 categories | Web research quality with verified human answers | High, frozen RetroSearch corpus | Pastcasting design, focuses on practical research |
| BrowseComp | OpenAI | 1,266 short questions | Persistent web navigation for hard-to-find facts | Lower, depends on live web | OpenAI Deep Research solves about half |
| Humanity's Last Exam | Center for AI Safety, Scale | 2,500 expert-level closed questions | Frontier expert knowledge across many fields | High, static | Multi-modal, no web tool requirement |
| FRAMES | | Multi-document factoid set | Long-context factual QA with reasoning | Static | Often paired with retrieval evaluations |
| GAIA | Hugging Face and partners | 450 general-assistant questions | Reasoning, tool use, multi-modality | Static | Broader scope than DRB but shorter tasks |
| DeepResearch Bench | Independent academic group | 100 PhD-level long-form tasks | Long-form research report quality, RACE and FACT scores | Live web | Different project, similar name; English and Chinese |
| SimpleQA | OpenAI | Many short factual questions | Closed-book factual recall | Static | No browsing |
A few comparative points are worth pulling out. BrowseComp is closest in spirit to DRB but uses single-shot hard-to-find questions on the live web, while DRB uses multi-step tasks on a frozen corpus. Humanity's Last Exam tests what a model already knows; DRB tests how well a model finds and synthesizes information it does not already know. FRAMES is essentially a multi-document QA benchmark without the agent loop. GAIA covers more modalities than DRB but each task is shorter. And the similarly named "DeepResearch Bench" published by an independent academic group focuses on long-form report quality across 100 PhD-level tasks evaluated by the RACE and FACT frameworks, which is a different evaluation philosophy from DRB's per-task numerical scoring; the two are commonly mistaken for each other but are not the same benchmark.[2][10][11][12]
Deep Research Bench received attention in the practitioner press and academic preprint discussions through the second half of 2025. Coverage in outlets such as Unite.AI, DEV Community, and various AI newsletters emphasized two themes: that current AI research agents are useful but not yet ready to be left unsupervised, and that a benchmark with a noise ceiling near 0.8 is a more honest scorecard than one that pretends a model can score 1.0.
The benchmark has been adopted internally by several agent-building teams. NVIDIA's AI-Q Blueprint, for example, includes a Deep Research Bench evaluation pipeline that runs the NeMo Agent Toolkit against the frozen DRB corpus, formats outputs to a JSONL contract, and submits scores to the official evaluator. The Parallel Web Systems product team uses both DRB and BrowseComp to benchmark the Pareto frontier of their own research stack and competing offerings. These integrations illustrate that DRB has become one of the de facto reference benchmarks for vendors selling "deep research" capabilities.[8][12][13]
The FutureSearch effort-scaling and cost analyses, which use DRB as the scoring axis, have also been cited in coverage of new model releases. Claims about a model's research ability are increasingly accompanied by a DRB number rather than (or in addition to) a generic chat-quality figure. The benchmark has thus shifted from a one-off paper into a continuously updated leaderboard used in product positioning.[4][9]
Deep Research Bench is widely respected, but several limitations are well known.
Eighty-nine tasks is a modest sample. Critics note that some categories contain only a handful of instances each, which means per-category scores carry meaningful variance. The benchmark is also English-only, which limits its applicability to multilingual deployment. Earlier versions of the FutureSearch roadmap suggested expanded language coverage as a future direction, but as of mid-2026 the public release remains English-only.
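The effect of sample size on score uncertainty can be approximated with a binomial standard error. This is a rough sketch under strong assumptions: it treats every task as an independent 0-or-1 outcome, which does not hold for DRB's graded categories, and the figure of 11 tasks per category is simply 89 divided by 8 and rounded, not the benchmark's actual split.

```python
import math


def binary_standard_error(mean_score, n_tasks):
    """Standard error of a mean of n independent 0/1 outcomes."""
    return math.sqrt(mean_score * (1.0 - mean_score) / n_tasks)


full_benchmark_se = binary_standard_error(0.5, 89)  # roughly 0.05
single_category_se = binary_standard_error(0.5, 11)  # roughly 0.15
```

Under these assumptions a per-category score is nearly three times noisier than the overall score, which is why gaps of a few points within a single category should be read cautiously.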
About 40 percent of the DRB tasks were sourced from FutureSearch's paying clients. This grounds the benchmark in real demand but may also bias the distribution toward business and finance topics relative to scientific or humanities research. The remaining tasks were authored in-house, and some observers have asked for greater transparency about how the team selected them and how representative they are of the broader space of research questions.
The estimated 0.8 noise ceiling acknowledges that ground truth itself is not fully crisp. Some tasks have answers that depend on the interpretation of ambiguous wording ("all unicorn startups in fintech" can mean different things depending on the date and the definition of unicorn). When models cluster within a few points of one another, ranking depends heavily on these edge cases. The FutureSearch team has documented this and adjusted ground truth over time, but the issue is inherent to evaluating subjective research.
A frozen corpus loses relevance as the web evolves. Information that mattered in May 2025 may be misleading or obsolete in 2027. FutureSearch refreshes some tasks and adds new ones, but the corpus cannot be re-scraped for every leaderboard refresh without losing comparability across runs. This trade-off between freshness and longitudinal comparability is fundamental to the pastcasting approach.
The paper's own analysis shows that the three quantified failure modes (forgetting, action hallucination, repetitive tool use) account for only 9 to 13 percent of variance in final scores. This is a candid admission that the diagnostic toolkit cannot yet explain why one strong model beats another by 3 points on a given task. The unexplained variance presumably involves harder-to-measure factors like source-quality judgment and strategic planning.
Deep Research Bench's tasks and frozen RetroSearch corpora are not publicly downloadable. Evaluations are run by FutureSearch or by approved partners who contact evals@futuresearch.ai. This protects the benchmark from being trained against but reduces external reproducibility and slows independent academic study. Several follow-up benchmarks, including "Why Your Deep Research Agent Fails" (DeepHalluBench, arXiv:2601.22984) and Microsoft's LiveDRBench, have responded by releasing more open variants.[14]
Independent commentary on the benchmark has emphasized that DRB scores demonstrate a clear gap between vendor marketing and measured performance. Tools described as "AI research analysts" routinely score below 60 percent of the achievable ceiling, deliver overconfident summaries, and miss material information. This is a finding about the field, not about DRB, but it is one of the benchmark's most visible legacies.[8]
Deep Research Bench is one of a family of benchmarks that emerged in 2025 and 2026 to evaluate research-style AI agents.
This ecosystem reflects the broader recognition that there is no single "best" way to evaluate a research agent: DRB measures verifiable answers, BrowseComp measures hard-to-find facts, DeepResearch Bench measures long-form report quality, and BTF measures forecasting calibration. Most serious agent developers now report numbers on several of these benchmarks rather than one.[7][14][15][16][17]
DRB findings have influenced concrete agent engineering choices. The strong negative correlation between forgetting and final score has pushed teams to invest in scratchpad summarization, episodic memory, and longer effective context windows. The poor performance of some specialized "Deep Research" modes versus plain reasoning models has pushed product teams to simplify their agent loops rather than add scaffolding. The high cost of high-effort runs has spurred work on adaptive effort allocation, where the agent uses minimal effort on simple tasks and reserves long thinking budgets for hard ones.
More broadly, DRB helped legitimize the pastcasting and frozen-corpus methodology for evaluating web-grounded systems, an idea that has since spread to forecasting benchmarks (BTF), live-web variants (LiveDRBench), and citation-audit benchmarks (BrowseComp-Plus). The architectural pattern of separating the agent loop from a controlled retrieval environment is now common in research-agent papers, which is part of DRB's lasting contribution.
Deep Research Bench was the first widely cited benchmark to combine three properties at once: realistic multi-step web research tasks rather than single-shot QA, human-verified answers that stay correct as the web changes, and a public leaderboard that is updated as new models are released. By doing so it gave the field a stable yardstick for one of the most commercially important AI workflows.
The benchmark's specific numbers (o3 at 0.51, a noise ceiling near 0.8, dominant failure mode of forgetting) have entered the working vocabulary of agent developers. Its methodology (frozen RetroSearch, ReAct loop with a 50-action budget, trace-level analytics) has been adopted, extended, or contrasted with by most subsequent benchmarks in the area. And its public findings (specialized deep-research products do not always beat their underlying models, thinking models verify themselves more, open-weight models are narrowing the gap) have been repeatedly confirmed and refined in follow-on work.
As AI assistants take on more of the work of research, due diligence, journalism, and analysis, the stability of benchmarks like DRB matters beyond a single leaderboard. The benchmark is one of the clearest current answers to the question "how good are these agents really at doing the job they are marketed for?", and the answer, as of mid-2026, is "useful but still firmly assistive rather than autonomous".