Deep Research refers to a category of AI agent capabilities that autonomously conduct multi-step research across the internet on behalf of users. These systems browse the web, read and interpret sources, evaluate credibility, synthesize findings, and produce comprehensive research reports with citations. Unlike simple search queries or single-turn large language model interactions, deep research agents execute extended reasoning chains that may span several minutes to over an hour, iteratively refining their search strategies based on information discovered during the research process.
The term gained widespread use in late 2024 and early 2025 as major AI companies released products under this label, beginning with Google's Gemini Deep Research in December 2024 and followed by OpenAI's Deep Research in February 2025. These tools represent a shift from conversational AI toward agentic AI, where the system operates with a degree of autonomy to accomplish complex tasks that would traditionally require hours of manual effort by a human researcher.
Deep research systems combine several core capabilities: query planning, web browsing, source evaluation, iterative reasoning, and report synthesis. While each implementation differs in technical details, the general workflow follows a consistent pattern.
When a user submits a complex research question, the system first formulates a detailed research plan. It breaks the overarching question into smaller, manageable sub-tasks that can be investigated individually. For example, a question about the competitive landscape of electric vehicle battery technology might be decomposed into sub-queries about current market leaders, recent patent filings, cost benchmarks, emerging chemistries, and supply chain constraints.
This planning phase distinguishes deep research from standard information retrieval. Rather than executing a single search query and returning top results, the system creates a structured investigation strategy that guides its subsequent actions.
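The decomposition step described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not any vendor's implementation: in a real agent the facets would be proposed by the model itself, while here they are supplied by the caller to keep the sketch deterministic. The names `ResearchPlan` and `plan_research` are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchPlan:
    """A research question broken into independently searchable sub-tasks."""
    question: str
    sub_queries: list[str] = field(default_factory=list)

def plan_research(question: str, facets: list[str]) -> ResearchPlan:
    """Expand a question into one focused sub-query per facet."""
    return ResearchPlan(
        question=question,
        sub_queries=[f"{question}: {facet}" for facet in facets],
    )

# The electric-vehicle battery example from the text:
plan = plan_research(
    "EV battery technology competitive landscape",
    ["market leaders", "recent patent filings", "cost benchmarks",
     "emerging chemistries", "supply chain constraints"],
)
```

Each sub-query can then be investigated on its own, which is what lets the agent parallelize searches and track which parts of the question remain unanswered.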
The agent then executes its plan by searching the web, clicking through results, scrolling pages, and reading full documents. Deep research systems can process a wide variety of content types including web pages, PDFs, academic papers, spreadsheets, and images. During a single research session, these agents typically perform dozens of searches and read hundreds of individual sources.
A key capability is the ability to follow links across multiple hops. If an initial search reveals a promising lead, the agent can navigate to the referenced source, read it, and continue following references deeper into the topic. This multi-hop browsing mirrors how a human researcher would trace citations back to primary sources.
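Multi-hop browsing amounts to a bounded traversal of the web's link graph. The sketch below uses a toy, entirely synthetic link graph and a breadth-first walk with a hop budget; real agents decide which links to follow using the model's judgment rather than visiting everything.

```python
from collections import deque

# A toy link graph standing in for the web; keys are pages,
# values are the links found on each page. Entirely synthetic.
LINKS = {
    "search:ev-batteries": ["review-article", "press-release"],
    "review-article": ["primary-study", "dataset"],
    "press-release": [],
    "primary-study": ["dataset"],
    "dataset": [],
}

def crawl(start: str, max_hops: int) -> list[str]:
    """Breadth-first traversal bounded by a hop budget, mirroring how
    an agent follows references toward primary sources."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        order.append(page)
        if hops == max_hops:
            continue  # hop budget exhausted along this path
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return order
```

With a budget of one hop the agent only reads the pages a search surfaced directly; raising the budget lets it reach the primary study two links away, which is the behavior the text describes.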
Perhaps the most important aspect of deep research is its ability to adapt its strategy in real time. Rather than following a predetermined search plan rigidly, the agent reasons about what it has found and dynamically decides which additional sources to consult, which claims to verify, and which data points require deeper investigation. If the agent encounters contradictory information, it can formulate new queries to resolve the discrepancy.
This iterative process relies on the underlying model's reasoning capabilities. Models trained with reinforcement learning on browsing and reasoning tasks learn to plan and execute multi-step trajectories, backtracking when necessary and pivoting in response to new information.
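The adaptive behavior described above is, at its core, a plan-act-reflect loop. The sketch below is a deliberately simplified stand-in: where a production agent would ask the model what to do next, this version uses one hard-coded rule (spawn a verification query whenever a contradiction is detected). `research_loop` and its callback signature are assumptions for illustration.

```python
def research_loop(initial_query, search, max_steps=10):
    """Iterative search loop. `search` is any callable returning
    (findings, contradiction_flag). Contradictions trigger a pivot:
    a follow-up query aimed at resolving the discrepancy."""
    queue = [initial_query]
    findings = []
    steps = 0
    while queue and steps < max_steps:
        query = queue.pop(0)
        result, contradiction = search(query)
        findings.append((query, result))
        steps += 1
        if contradiction:
            queue.append(f"verify: {query}")
    return findings

# A fake search backend that reports conflicting data once:
def fake_search(query):
    if query == "cell cost trends":
        return ("conflicting cost figures", True)
    return ("consistent", False)

trace = research_loop("cell cost trends", fake_search)
```

The `max_steps` budget matters in practice: without it, an agent that keeps finding new leads would never stop, which is why commercial systems cap sessions at fixed time limits.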
After gathering sufficient information, the agent synthesizes its findings into a structured report. These reports typically include an executive summary, organized sections addressing different aspects of the research question, data tables where appropriate, and inline citations pointing to specific sources. The citation mechanism allows users to verify claims by following links back to the original material.
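The citation mechanism can be modeled as claims paired with their sources, rendered with numbered inline markers and a matching source list. This is a generic sketch of the output format, not any product's actual renderer; the `Claim` type and markdown layout are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_url: str

def render_report(title: str, claims: list[Claim]) -> str:
    """Render claims as a markdown section with numbered inline
    citations and a source list, so each statement can be traced
    back to its origin."""
    body = " ".join(f"{c.text} [{i}]" for i, c in enumerate(claims, 1))
    refs = "\n".join(f"[{i}] {c.source_url}" for i, c in enumerate(claims, 1))
    return f"## {title}\n\n{body}\n\n### Sources\n\n{refs}\n"

report = render_report("Findings", [
    Claim("LFP chemistry's market share is growing.", "https://example.com/study"),
    Claim("Pack costs fell year over year.", "https://example.com/benchmark"),
])
```

Keeping claims and sources paired in the data model, rather than appending a bibliography after the fact, is what makes per-sentence citation (as in OpenAI's implementation) possible.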
The entire process, from receiving the user's query to delivering the final report, typically takes between 2 and 30 minutes depending on the complexity of the question and the specific platform. Some systems can spend up to 45 minutes on especially complex investigations.
Google announced Gemini Deep Research on December 11, 2024, as part of its broader Gemini 2.0 launch event. It was initially available exclusively to Gemini Advanced subscribers.
At launch, Deep Research was powered by Gemini 1.5 Pro. Google subsequently upgraded the underlying model to Gemini 2.0 Flash Thinking (experimental), a reasoning-capable variant of the Gemini 2.0 Flash model. According to Google, the thinking model's innate capacity for self-reflection and planning made it well suited for long-running agentic tasks. The upgrade improved both the quality and the serving efficiency of the product, enabling broader access. Google later further enhanced Deep Research with Gemini 3 capabilities.
When a user submits a query, Gemini Deep Research presents a research plan that the user can review and modify before execution. The system then autonomously browses the web, gathering and analyzing sources before compiling a structured report. Reports can be exported to Google Docs for further editing.
As of 2025, Google offers Deep Research across multiple tiers. Free users receive 5 deep research reports per month. Google One AI Premium subscribers ($19.99/month in the US) receive substantially higher limits. The product is available in over 45 languages globally.
OpenAI launched Deep Research on February 2, 2025, initially available to ChatGPT Pro subscribers ($200/month). The feature was described as a "next-generation" agentic capability powered by a version of the o3 model optimized for web browsing and data analysis.
OpenAI's Deep Research was trained using end-to-end reinforcement learning on challenging browsing and reasoning tasks across a range of domains. Through this training, the model learned core browsing capabilities (searching, clicking, scrolling, interpreting files), how to use a Python tool in a sandboxed environment for calculations, data analysis, and graph plotting, and how to reason through and synthesize a large number of websites to find specific pieces of information or write comprehensive reports.
The system can browse user-uploaded files, generate and iterate on graphs using its Python tool, embed both generated graphs and images from websites in its responses, and cite specific sentences or passages from its sources. A typical research session takes between 5 and 30 minutes.
OpenAI expanded access over the following months. As of April 24, 2025, Pro users receive 250 queries per month; Plus, Team, Enterprise, and Edu users receive 25 queries per month; and Free users receive 5 queries per month. A lightweight version powered by a variant of o4-mini was introduced for more cost-efficient queries; once users reach their limit for the full version, queries automatically switch to this lightweight alternative.

On June 26, 2025, OpenAI released the Deep Research API with two models: o3-deep-research and o4-mini-deep-research, extending the capability to developers for integration into custom applications.
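A developer calling the Deep Research API would submit a long-running request naming one of the two models. The sketch below only constructs the request body as a plain dictionary; the field names follow the general shape of OpenAI's Responses API as of mid-2025, but treat the exact schema (especially `background` and the tool types) as an assumption and check the current API reference before relying on it.

```python
# Hypothetical request body for the Deep Research API. Field names
# are assumptions modeled on OpenAI's Responses API, not verified
# against the live service.
request = {
    "model": "o3-deep-research",  # or "o4-mini-deep-research" for cheaper runs
    "input": "Survey solid-state battery commercialization timelines.",
    "background": True,  # long-running task; the client polls for completion
    "tools": [
        {"type": "web_search_preview"},   # grants the agent web access
    ],
}
```

Because research sessions run for minutes rather than seconds, the API is designed around asynchronous, background execution rather than a blocking request-response cycle.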
Perplexity launched its Deep Research feature on February 14, 2025. Unlike the subscription-first approach of OpenAI, Perplexity made Deep Research available to all users with a free tier of 5 queries per day, while Pro subscribers ($20/month) receive 500 queries daily.
When a user submits a Deep Research question, Perplexity performs dozens of searches, reads hundreds of sources, and reasons through the material to deliver a comprehensive report. The system completes most research tasks in under 3 minutes, making it notably faster than competing implementations.
Perplexity's Deep Research is built on the company's Sonar model family, which uses Llama 3.3 70B as its base and runs on Cerebras's inference platform, generating answers at approximately 1,200 tokens per second. The system excels at tasks spanning finance, marketing, product research, and competitive analysis.
Perplexity later released an Advanced Deep Research update that introduced improved accuracy, expanded capabilities, a redesigned interface, and an API (Sonar Deep Research) for programmatic access.
Anthropic launched its Research feature for Claude on April 15, 2025. The feature allows Claude to conduct in-depth investigations by searching across the web and Google Workspace (Gmail, Google Calendar, Google Docs) to deliver comprehensive, citation-backed answers.
When Research is activated, Claude operates agentically, conducting multiple searches that build on each other. The system determines what to investigate next, explores different angles of the question automatically, and works through open questions systematically. Research sessions can last from 5 to 45 minutes depending on complexity.
On May 1, 2025, Anthropic expanded the Research feature with Integrations support, allowing Claude to pull information from connected third-party services including Atlassian Jira, Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid.
Research is available on Claude Max, Team, and Enterprise plans. During its initial rollout, it was limited to users in the United States, Japan, and Brazil.
xAI introduced DeepSearch as part of the Grok 3 launch on February 17, 2025. DeepSearch scans the internet and the X (formerly Twitter) platform to generate detailed summaries in response to complex queries. The integration with X data gives DeepSearch a distinctive capability for analyzing real-time social discourse and trending topics.
DeepSearch is available to X Premium+ subscribers and SuperGrok subscribers ($30/month). xAI also released DeeperSearch, an enhanced version that uses extended search and additional reasoning for more thorough investigations.
| Feature | OpenAI Deep Research | Google Gemini Deep Research | Perplexity Deep Research | Anthropic Claude Research | xAI DeepSearch |
|---|---|---|---|---|---|
| Launch Date | February 2, 2025 | December 11, 2024 | February 14, 2025 | April 15, 2025 | February 17, 2025 |
| Underlying Model | o3 (optimized for browsing) | Gemini 2.0 Flash Thinking | Sonar (Llama 3.3 70B base) | Claude (unspecified variant) | Grok 3 |
| Typical Processing Time | 5 to 30 minutes | Under 15 minutes | Under 3 minutes | 5 to 45 minutes | Varies |
| Free Tier | 5 queries/month | 5 reports/month | 5 queries/day | Not available | Limited trial |
| Paid Tier Pricing | $20/month (Plus) to $200/month (Pro) | $19.99/month (AI Premium) | $20/month (Pro) | Max/Team/Enterprise plans | $30/month (SuperGrok) |
| Paid Tier Query Limit | 25/month (Plus) to 250/month (Pro) | Adaptive daily limits | 500/day (Pro) | Not publicly specified | Not publicly specified |
| File Upload Support | Yes (PDFs, images, spreadsheets) | Limited | Limited | Via Google Workspace | No |
| API Access | Yes (since June 2025) | Yes (Gemini API) | Yes (Sonar Deep Research) | Not publicly available | Not publicly available |
| Unique Strength | Depth of analysis, benchmark performance | Google ecosystem integration, multilingual | Speed, citation precision | Third-party integrations, workspace search | Real-time X/Twitter data |
Deep research systems are evaluated on benchmarks designed to test complex reasoning, multi-step information retrieval, and factual accuracy.
Humanity's Last Exam is a comprehensive benchmark consisting of over 3,000 questions across more than 100 academic subjects, ranging from rocket science to analytic philosophy. It was designed to be extremely difficult for AI systems, with the best traditional models achieving only around 9% accuracy.
Deep research agents showed significant improvements over standard models on this benchmark:
| System | HLE Accuracy |
|---|---|
| OpenAI Deep Research (o3) | 26.6% |
| Perplexity Deep Research | 21.1% |
| OpenAI o3-mini (high) | 13.0% |
| OpenAI o3-mini | 10.5% |
| OpenAI o1 | ~9% |
| DeepSeek R1 | ~9% |
OpenAI's Deep Research achieved the highest score, representing a nearly threefold improvement over the previous best models. The largest gains appeared on questions related to chemistry, humanities and social sciences, and mathematics.
BrowseComp is a benchmark released by OpenAI in April 2025, specifically designed to evaluate AI browsing agents. It contains 1,266 challenging problems that require agents to persistently navigate through multiple websites to retrieve difficult-to-find, entangled information.
| System | BrowseComp Accuracy (Single Attempt) |
|---|---|
| OpenAI Deep Research | 51.5% |
| OpenAI o1 | 9.9% |
| GPT-4o with browsing | 1.9% |
| GPT-4.5 | ~0% |
Deep Research significantly outperformed all other models, solving roughly half of the problems on single attempts. With 64 sampled outputs and majority voting, performance improved to 78%. The near-zero scores from GPT-4o and GPT-4.5 highlight that without strong reasoning and tool use, models fail to retrieve the kinds of obscure, multi-hop facts the benchmark targets.
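The 78% figure comes from sampling many independent attempts and keeping the most common answer. OpenAI has not published the aggregation details beyond "majority voting," so the sketch below shows only the generic mechanism, with synthetic sample data.

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common answer among independently sampled
    attempts at the same question."""
    return Counter(samples).most_common(1)[0][0]

# 64 simulated attempts at a hard lookup: even though the agent is
# right less than half the time, no single wrong answer is as common
# as the correct one, so the vote recovers it.
samples = ["1887"] * 30 + ["1891"] * 20 + ["unknown"] * 14
```

This is why best-of-n sampling lifts accuracy well above the single-attempt rate: errors tend to scatter across many different wrong answers, while correct attempts converge on one.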
GAIA (General AI Assistants) is a benchmark containing 450 questions that require fundamental abilities including reasoning, multi-modality handling, web browsing, and tool-use proficiency. OpenAI's Deep Research achieved a score of 67.36% on the GAIA validation set, setting a new state of the art. For comparison, GPT-4 without an agentic setup scored below 7% on the same benchmark.
SimpleQA is a factuality benchmark consisting of several thousand straightforward questions. Perplexity Deep Research scored 93.9% accuracy on SimpleQA, substantially exceeding the performance of leading standalone models and demonstrating strong factual reliability for direct questions.
Deep research agents are applied across a wide range of professional and academic contexts.
Businesses use deep research tools to analyze competitive landscapes, track industry trends, evaluate potential partners or acquisition targets, and compile market sizing estimates. The ability to synthesize information from dozens of sources into a single coherent report saves analysts hours of manual work.
Researchers use deep research to survey existing literature on a topic, identify key papers and their findings, map the relationships between different research threads, and identify gaps in current knowledge. While deep research agents do not replace thorough academic review, they serve as an effective starting point for orienting researchers in unfamiliar fields.
Investors and technology teams use these tools to evaluate technical claims, review patent landscapes, assess the maturity of specific technologies, and compare vendor offerings. The structured report format with citations makes it straightforward to share findings with colleagues.
Policy researchers and compliance teams use deep research to track regulatory changes across jurisdictions, understand the implications of proposed legislation, and compile summaries of government guidance on specific topics.
Consumers and procurement teams use deep research for detailed product comparisons, gathering specifications, reviews, pricing information, and availability data from across the web into organized reports.
Scientists use deep research tools to explore interdisciplinary connections, gather data from multiple published studies, and identify methodological approaches used by researchers in adjacent fields.
Despite their capabilities, deep research systems face several significant limitations that users should understand.
Like all systems built on large language models, deep research agents can generate factually incorrect information. OpenAI acknowledged in its Deep Research system card (published February 25, 2025) that the model may produce factually incorrect content, and that its chain-of-thought reasoning occasionally hallucinates about access to tools or capabilities it does not actually have. While most hallucinations can be caught by checking provided references, the system card noted that misinformation by omission is also possible: the tool could miss crucial details because they did not appear in the searches it conducted.
OpenAI's own research has demonstrated that hallucinations in language models stem from fundamental mathematical properties of the training process, including epistemic uncertainty when information appears rarely in training data, model limitations where tasks exceed current architectures' representational capacity, and computational intractability. These factors mean that hallucinations cannot be entirely eliminated through engineering improvements alone.
Deep research agents rely on web search results, which means they are susceptible to the biases inherent in search engine rankings. Content that is heavily optimized for search engines may be prioritized over more authoritative but less visible sources. Google's Gemini Deep Research has been noted as being more susceptible to SEO bias, sometimes citing sources that are not directly relevant to the query.
Information that is very recent, behind paywalls, or not indexed by major search engines may be inaccessible to deep research agents. The systems can only analyze content that is publicly available on the web at the time of the search. Events occurring after the search session or information locked behind authentication barriers will be missed.
While deep research agents can cover a broad range of subtopics, they may lack the depth of a domain expert's analysis on any single point. The reports they produce are comprehensive overviews rather than expert-level analyses, and they may miss nuances that would be apparent to a specialist.
Access to the most capable deep research systems requires paid subscriptions, and even paid users face query limits. OpenAI's Pro plan at $200/month provides 250 queries, while the Plus plan at $20/month offers only 25. These constraints can limit the utility of deep research for users who need to conduct many investigations.
Although deep research reports include citations, users bear the responsibility of verifying critical claims. The presence of a citation does not guarantee that the source actually supports the claim as stated. Careful verification remains necessary, particularly for high-stakes applications such as legal research, medical information, or financial analysis.
The release of commercial deep research products spurred significant open-source development. Hugging Face developed Open Deep Research using its smolagents framework, which achieved the top rank among open submissions on the GAIA benchmark leaderboard, and a number of other open-source projects have followed.
These open-source efforts are broadening access to deep research capabilities and enabling researchers and developers to customize the technology for specialized domains.
OpenAI's Deep Research system card classified the model as medium risk overall, including medium risk assessments for cybersecurity, persuasion, CBRN (chemical, biological, radiological, nuclear), and model autonomy. Before launching, OpenAI conducted safety testing focused on several areas specific to browsing agents.
The agentic nature of deep research systems introduces safety considerations that go beyond those of standard conversational AI. Because these systems browse the web autonomously and may encounter adversarial content, they must be robust against manipulation attempts while still being able to extract useful information from diverse sources.
Deep research technology is evolving rapidly. Several trends are shaping its development: