Academic Research
Last reviewed
May 13, 2026
Sources
30 citations
Review status
Source-backed
Revision
v2 · 4,880 words
See also: Academic Research ChatGPT Plugins
Academic research is among the fields most visibly reshaped by artificial intelligence since the late 2010s. The change has come from two directions at once. On one side, scientific results themselves have started to depend on AI systems, with AlphaFold predicting protein structures, AlphaProof writing formal mathematical proofs, and graph neural networks proposing new materials. On the other side, the day-to-day mechanics of doing research (finding papers, summarising them, drafting manuscripts, and reviewing the work of peers) have been absorbed into a fast-growing market of large language model tools such as Elicit, Consensus, Scite, SciSpace, Semantic Scholar, ResearchRabbit, Undermind, and PaperQA2.
This shift has produced both productivity gains and a substantial backlash. Journals have rewritten authorship policies, conferences have banned the use of AI in peer review, and Retraction Watch has logged a growing pile of papers withdrawn for ChatGPT-related errors, including hallucinated citations, garbled figure captions, and even AI-generated images of impossible anatomy. The phrase "vegetative electron microscopy", a 2024 marker of AI-translated nonsense, has become shorthand for the kind of slop that now leaks into the scientific record.
AI in academic research can be split into roughly five overlapping uses.
Each category has produced its own controversies. Writing tools created the first wave of authorship rows in early 2023. Reading and synthesis tools sparked a debate about citation hallucination through 2023 and 2024. Peer-review use was effectively banned at several flagship conferences in 2024 after the Stanford study by Liang and colleagues estimated that between 6.5 percent and 16.9 percent of reviews at major machine learning conferences, including ICLR 2024, showed signs of substantial LLM modification.
Long before the current generation of chatbots, academic computing produced its own research aids. Bibliographic databases such as MEDLINE, the Web of Science, and Scopus already used machine indexing in the 1990s and 2000s. The launch of Google Scholar in November 2004 made full-text search of journal articles a normal part of literature review.
The move toward modern AI assistants began with the founding of the Allen Institute for AI in 2014, which released Semantic Scholar in 2015 and added the TLDR neural summariser to paper pages in 2020. Smart Citations, a richer way of classifying citations as supporting, contrasting, or merely mentioning, were introduced by Scite (founded 2018). Connected Papers launched in 2019 as a visualisation tool that turned a single seed paper into a graph of similar work.
Elicit was created inside the nonprofit research lab Ought in 2020 and was later spun out as Elicit.com. Consensus was founded in 2022 by Eric Olson and Christian Salem in the United States, with a product focused on extracting claim-level evidence from biomedical and social science literature. ResearchRabbit, SciSpace (formerly Typeset), and Undermind appeared between 2021 and 2023.
The arrival of ChatGPT in November 2022 changed the audience for these tools from a few thousand specialists to the general scientific community almost overnight. By early 2023, ChatGPT was being listed as a co-author on multiple preprints and on at least one published paper, which forced the major journal families to rewrite their authorship guidelines. From 2023 onward, almost every large bibliographic database, including Web of Science, Scopus, and Dimensions, has added its own retrieval-augmented generation feature.
The table below lists the most widely used assistants, in order of public release. "Focus" describes the main user-facing capability.
| Tool | Launch year | Organisation | Focus |
|---|---|---|---|
| Semantic Scholar | 2015 | Allen Institute for AI | Search, citation graph, TLDR neural summaries |
| Iris.ai | 2015 | Iris.ai AS, Norway | Concept search, systematic review automation |
| Scite | 2018 | Scite Inc., New York | Smart Citations classifying supporting and contrasting evidence |
| Scholarcy | 2018 | Scholarcy Ltd, UK | Flash card style summaries of individual papers |
| Connected Papers | 2019 | Connected Papers | Visual map of related literature from a seed paper |
| Elicit | 2020 | Ought (now Elicit.com) | Question answering, claim extraction, systematic review support |
| ResearchRabbit | 2021 | Human Intelligence Technologies | Citation tree exploration, collaborative collections |
| SciSpace | 2021 | Cactus Communications (formerly Typeset.io) | Paper explanations, formatting, Copilot chat |
| R Discovery | 2021 | Cactus Communications | Personalised reading feed and audio summaries |
| Inciteful.xyz | 2021 | Independent | Free citation network exploration |
| Consensus | 2022 | Consensus Inc. | Claim-level evidence search, "Consensus Meter" |
| Undermind | 2023 | Undermind, Inc. | Deep search agent that explores literature iteratively |
| PaperQA2 | 2024 | FutureHouse | Open source retrieval-augmented question answering on papers |
| FutureHouse Crow, Falcon, Phoenix, Owl | September 2024 | FutureHouse | Agent platform for literature, chemistry, and search tasks |
Several of these tools share infrastructure. SciSpace and R Discovery are both products of Cactus Communications, a Mumbai-based publishing services company. Elicit and PaperQA2 both grew out of the AI-for-science effort funded in part by Open Philanthropy. Consensus relies on Semantic Scholar's corpus for its underlying paper index, as do many other RAG tools.
FutureHouse, a nonprofit founded by chemist Andrew White and others with funding from Eric Schmidt's philanthropic operations, released its full agent platform on 25 September 2024. The four named agents handle different roles: Crow searches the literature, Falcon performs deep literature review, Phoenix plans chemistry experiments, and Owl checks whether a research question has already been answered.
A second class of tools focuses on the writing side of the research process, including drafting, language polishing, formatting, and translation. The most established are listed below.
| Tool | Year | Organisation | Notes |
|---|---|---|---|
| Trinka AI | 2020 | Crimson AI (Crimson Interactive) | Grammar and style checker tuned for academic English |
| Paperpal | 2022 | Cactus Communications | Academic writing assistant; integrates with Word and Overleaf |
| Jenni AI | 2021 | Jenni AI Ltd | LLM-driven essay and paper writing assistant |
| Manuscripts.app | 2020 | Atypon | Collaborative manuscript drafting |
| Writefull | 2017 | Writefull (acquired by Digital Science 2021) | Language checking, paraphrasing for academic writing |
| Grammarly | 2009 | Grammarly Inc. | General writing assistance, widely used by students |
These tools tend to position themselves as language correctors rather than author replacements, which avoids the worst of the authorship arguments. Paperpal is the most visible example because of its integration with Microsoft Word and its sale through institutional licences. Jenni AI is more controversial because it markets itself directly to students writing essays, and it has been criticised by writing centres for producing fluent but shallow drafts.
General-purpose chatbots, especially ChatGPT, Claude, and Gemini, are now used informally for almost every part of paper drafting. Surveys published in Nature in 2023 and 2024 suggested that roughly a third of researchers had used a chatbot to write or edit at least part of a paper. The number was higher in computer science and lower in biomedical fields.
The single most cautionary episode of the LLM era for science came not from a retraction but from a product launch. On 15 November 2022, Meta AI released Galactica, a 120 billion parameter language model trained on 48 million papers, textbooks, lecture notes, and encyclopedia entries. The public demo allowed anyone to generate scientific text, complete with equations and citations. Meta framed Galactica as "a new language model that can store, combine and reason about scientific knowledge".
Within hours, researchers were posting examples of Galactica producing confident-sounding articles on subjects that did not exist, including a Wikipedia-style entry for "the benefits of eating crushed glass" and a fluent literature review citing fabricated papers. Critics including Gary Marcus and Michael Black argued that the model presented hallucinations in a register designed to fool readers into trusting them. Meta took the demo down on 17 November 2022, after about three days online.
Galactica became a fixed reference point in later debates. It demonstrated, before ChatGPT was even released, that the combination of fluent academic style and unreliable factual content was particularly dangerous in scientific contexts. Meta's chief AI scientist Yann LeCun defended the model on Twitter, arguing that the criticism was unfair, but the company did not relaunch the product.
Many of the patterns that later showed up in retractions, including invented citations, plausible but wrong author lists, and authoritative summaries of nonexistent fields, were already present in the three-day Galactica demo.
While the writing and reading tools have grabbed most of the media attention, the deepest changes to academic output have come from AI systems that produce scientific results themselves.
DeepMind released the first version of AlphaFold in 2018. The second version won the CASP14 protein structure prediction competition in November 2020 with a median Global Distance Test score above 90, comparable to experimental accuracy for most categories. The 2021 release of AlphaFold 2's source code and the launch of the AlphaFold Protein Structure Database, a partnership with the European Bioinformatics Institute, gave free access to predicted structures for over 200 million proteins, including the entire human proteome. Demis Hassabis and John Jumper shared one half of the 2024 Nobel Prize in Chemistry for this work; the other half went to David Baker for computational protein design.
AlphaFold 3, released in May 2024, extended the system to interactions between proteins, nucleic acids, small molecules, and ions. Unlike AlphaFold 2, the new model was not initially open sourced, although DeepMind released the source code in November 2024 under a noncommercial licence.
In September 2023, DeepMind published AlphaMissense, a system that classifies missense mutations as likely benign or likely pathogenic. The accompanying paper in Science predicted effects for 71 million possible human missense variants. AlphaMissense is widely used in clinical genetics as a prioritisation tool, although its predictions are not on their own sufficient evidence for variant classification under American College of Medical Genetics and Genomics (ACMG) guidelines.
In November 2023, DeepMind released GNoME (Graph Networks for Materials Exploration), which it claimed had predicted 2.2 million new crystal structures, of which 380,000 were considered likely to be stable. The work was published in Nature alongside a paper from Lawrence Berkeley National Laboratory using A-Lab, an autonomous synthesis lab, to attempt to make the predicted compounds. Subsequent reviews raised questions about the novelty rate and the experimental validation of the synthesised materials, although nobody has disputed the basic claim that machine learning has expanded the catalogue of plausible inorganic materials by an order of magnitude.
In July 2024, DeepMind announced that a combined system of AlphaProof and AlphaGeometry 2 had reached silver medal level on the 2024 International Mathematical Olympiad problems. The system used Lean as its formal verification environment, with a reinforcement learning loop generating proofs that Lean could check. Of the six 2024 IMO problems, the combined system solved four, although DeepMind reported that some solutions took up to three days of computation, well beyond the competition time limits.
This result built on a long line of work in machine-assisted theorem proving. The Lean community had already used neural networks to suggest tactics through tools such as LeanDojo and ReProver. OpenAI's earlier GPT-f experiments on the Metamath library and Anthropic's work on formal verification with Claude are also part of this thread.
AI has also been applied to weather forecasting (GraphCast and FourCastNet), galaxy classification, particle physics event reconstruction at CERN, drug repositioning, and synthetic chemistry route planning. In each of these areas, AI systems have moved from experimental curiosities in 2020 to standard parts of the research toolkit by 2024.
Academic search has its own history of disruption. Microsoft Academic, a successor to Microsoft Academic Search, was retired on 31 December 2021. Its open metadata was used to seed OpenAlex, launched by the nonprofit OurResearch in January 2022. OpenAlex now indexes more than 240 million scholarly works and is the most widely used open replacement for Microsoft Academic Graph.
The big commercial providers have all bolted LLM assistants onto their existing products.
| Service | AI assistant | Released | Provider |
|---|---|---|---|
| Web of Science | Web of Science Research Assistant | September 2023 | Clarivate |
| Scopus | Scopus AI | January 2024 (beta), August 2024 (general) | Elsevier |
| Dimensions | Dimensions AI Assistant | 2023 | Digital Science |
| Crossref | Crossref Labs experiments | ongoing | Crossref |
These assistants are retrieval-augmented systems built on top of the publisher's existing metadata and full-text indexes. They aim to combine the trustworthiness of curated bibliographic data with the natural-language interface of a chatbot. Coverage varies. Scopus AI is restricted to papers in Scopus, which excludes many preprints. Web of Science Research Assistant is similarly limited.
The most public early controversy in AI and academia was the question of whether a chatbot could be an author.
In January 2023, several preprints appeared with ChatGPT listed as a co-author. The biggest publishers responded quickly. Nature published an editorial on 24 January 2023 setting out two rules: large language models cannot be credited as authors, because authorship implies accountability for the work; and any use of an LLM must be documented in the methods or acknowledgements section. The Springer Nature group applied the same rules across all of its journals.
Science followed on 26 January 2023 with an even stricter line. The journal's editor in chief Holden Thorp wrote that text generated by ChatGPT "or any other AI tools" could not be used in a paper at all, and that figures, images, and graphics could not be produced by AI. This was later revised in November 2023 to allow AI-assisted editing with disclosure, bringing Science closer to Nature's position.
Other journal families adopted variations of the same template.
| Publisher or journal | Policy date | Position on LLM authorship |
|---|---|---|
| Nature, Springer Nature | 24 January 2023 | Not authors; use must be disclosed |
| Science, AAAS | 26 January 2023 | Initially banned; revised 2023 to allow disclosed use |
| JAMA Network | 31 January 2023 | Not authors; use must be disclosed in methods |
| The Lancet | 2023 | Not authors; use must be disclosed |
| Elsevier | February 2023 | Not authors; AI tools may be used to improve language with disclosure |
| Wiley | 2023 | Not authors; AI-generated content not permitted without permission |
| Taylor & Francis | February 2023 | Not authors; use must be disclosed |
| ICMJE recommendations | May 2023 update | Not authors; clear disclosure required |
A recurring requirement across all of these statements is that responsibility for the content rests with the human authors. ChatGPT and similar systems cannot consent to authorship, cannot agree to ICMJE conditions, and cannot be sued.
A more difficult question is whether AI should be used by reviewers, rather than by authors. Conferences and journals depend on a free supply of reviewer labour, which has been under strain for years. In 2023 and 2024, several large machine learning conferences explicitly considered whether reviewers could use LLMs to write critiques.
The most cited empirical study of AI in peer review is "Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews" by Weixin Liang, Zachary Izzo, Yaohui Zhang, James Zou, and colleagues at Stanford, posted to arXiv on 11 March 2024. The authors developed a statistical method that estimates the fraction of a corpus of reviews that has been substantially modified by an LLM, using shifts in the frequency of stylistic markers.
The headline result was that between 6.5 percent and 16.9 percent of reviews at ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023 showed evidence of substantial LLM modification, with ICLR 2024 at the higher end. Reviews for Nature portfolio journals in 2023 were below 2 percent in the same analysis. The fraction was higher for reviews written close to the deadline and for reviewers with less confidence in their own scores.
The paper does not claim that any specific review was written by ChatGPT. The method is a population-level estimator, not a detector for individual cases. But it provided the first quantitative evidence that LLM use in peer review had become common.
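The population-level idea can be illustrated with a small mixture model. The sketch below is a simplified illustration, not the authors' implementation: it treats each review as a set of marker words, assumes the per-word occurrence probabilities for human and LLM text are already known (the names `human_probs` and `llm_probs` and all numbers are hypothetical), and grid-searches for the mixing fraction that maximises the corpus likelihood.

```python
import math


def doc_likelihood(doc_tokens, marker_probs):
    """Probability of a document's marker pattern, treating each marker
    word as an independent present/absent event."""
    present = set(doc_tokens)
    likelihood = 1.0
    for word, prob in marker_probs.items():
        likelihood *= prob if word in present else (1.0 - prob)
    return likelihood


def estimate_llm_fraction(docs, human_probs, llm_probs, grid=1000):
    """Grid-search maximum-likelihood estimate of alpha in the mixture
    (1 - alpha) * P_human + alpha * P_llm over the whole corpus."""
    # Pre-compute each document's likelihood under both sources.
    pairs = [
        (doc_likelihood(doc, human_probs), doc_likelihood(doc, llm_probs))
        for doc in docs
    ]
    best_alpha, best_ll = 0.0, -math.inf
    for step in range(grid + 1):
        alpha = step / grid
        log_likelihood = sum(
            math.log((1.0 - alpha) * p + alpha * q) for p, q in pairs
        )
        if log_likelihood > best_ll:
            best_alpha, best_ll = alpha, log_likelihood
    return best_alpha


# Hypothetical marker probabilities; all values must lie strictly between 0 and 1.
human_probs = {"commendable": 0.01, "meticulous": 0.01}
llm_probs = {"commendable": 0.5, "meticulous": 0.5}

# Toy corpus: 80 reviews with no markers, 20 reviews with both markers.
corpus = [["solid", "paper"]] * 80 + [["commendable", "meticulous"]] * 20
alpha_hat = estimate_llm_fraction(corpus, human_probs, llm_probs)
```

The estimate describes the corpus as a whole. No individual review is labelled, which mirrors the distinction between a population-level estimator and a per-document detector.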
Andrew Ng wrote in his newsletter The Batch in March 2024 that AI assistance in peer review was "increasingly hard to avoid" and that the community needed clearer norms rather than an outright ban. He argued that a reviewer using an LLM to polish language or check for missed references was very different from a reviewer asking ChatGPT to write the entire review.
NeurIPS 2024 published a reviewer code of conduct in mid-2024 that explicitly banned the use of LLMs to generate review text. Reviewers were allowed to use AI for limited tasks such as language polishing of their own draft, but feeding submitted manuscripts into a third-party LLM was prohibited on confidentiality grounds. ICLR 2024 and 2025 adopted very similar language, as did ACL 2024 and CVPR 2024.
The confidentiality argument turns out to matter as much as the originality argument. Papers under review are confidential. Pasting them into ChatGPT or Claude exposes them to the operator of the model, which most authors did not consent to.
Retraction Watch began tracking ChatGPT-related retractions in 2023. By the end of 2024 the count was in the hundreds, with several recurring patterns.
In 2024, science integrity researchers noticed that the phrase "vegetative electron microscopy" had appeared in multiple papers, especially from authors in China and Iran. The phrase is nonsensical in English. Tracing the term back through translation tools showed it was a mistranslation of a Farsi or Chinese expression that probably referred to scanning electron microscopy. The fact that the same garbled term appeared in dozens of papers strongly suggested that authors and translators were running text through AI systems without checking the output. By mid 2025, more than twenty papers containing the phrase had been corrected or retracted.
In February 2024, Frontiers in Cell and Developmental Biology published a review article that included an obviously AI-generated figure showing a rat with anatomically impossible reproductive organs, several times larger than the rat itself, along with labels written in nonsense English ("dck", "iollotte sserotgomar cell"). Social media reaction was immediate. The paper was retracted within three days. The episode embarrassed Frontiers and led to commitments to strengthen peer review and image review across its titles.
A distinct genre of ChatGPT artefact involves passages that begin with chatbot pleasantries such as "Certainly, here is", "As an AI language model", or "I'm sorry, but I cannot". These appear when an author has copied and pasted the chatbot's response without removing the conversational framing. In March 2024 a paper in Surfaces and Interfaces, published by Elsevier, contained the introductory phrase "Certainly, here is a possible introduction for your topic". Several other Elsevier journals were later found to contain similar tells. Many were corrected or retracted.
A related pattern appears when an author asks a chatbot to summarise a topic and includes the chatbot's hedging language about being unable to access the literature. Multiple 2023 and 2024 papers have been retracted after the phrase "As an AI language model, I do not have access to recent literature" appeared in their introductions.
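A simple screen for these tells is a phrase search over the manuscript text. The sketch below is illustrative only: the phrase list contains just the three examples quoted above, the function name `find_chatbot_tells` is hypothetical, and a match is a prompt for human inspection rather than proof of misconduct.

```python
import re

# Phrases quoted in the text above; illustrative, not an exhaustive list.
TELLTALE_PHRASES = [
    "Certainly, here is",
    "As an AI language model",
    "I'm sorry, but I cannot",
]


def find_chatbot_tells(text):
    """Return (phrase, character offset) pairs for each telltale phrase
    found in the text, matched case-insensitively, in order of position."""
    # Replace curly apostrophes so that "I’m" matches "I'm"; the swap is
    # one character for one character, so offsets are unchanged.
    normalised = text.replace("\u2019", "'")
    hits = []
    for phrase in TELLTALE_PHRASES:
        for match in re.finditer(re.escape(phrase), normalised, flags=re.IGNORECASE):
            hits.append((phrase, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


sample = "Introduction. Certainly, here is a possible introduction for your topic."
tells = find_chatbot_tells(sample)
```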
The Retraction Watch database now includes a tag for ChatGPT-related retractions. Editor Ivan Oransky has been a leading public voice in arguing that the volume of such cases is a sign of broader problems with quality control, not just a few rogue authors.
A persistent and well-documented failure mode of LLMs in academic work is citation hallucination. Asked for sources to support a claim, a chatbot may produce author names, journal titles, years, and DOIs that look plausible but do not exist.
The most quoted early study is by Bhattacharyya, Miller, Bhattacharyya, and Miller, published in Cureus in 2023. The authors asked ChatGPT to write a short medical article and then checked every citation. They found that 47 percent of the references were entirely fabricated, with another 46 percent containing significant errors. Only 7 percent were both real and correctly described.
Similar results have been reported in legal scholarship, social science, and the humanities. The 2023 Mata v. Avianca case in the US District Court for the Southern District of New York, although not strictly academic, has become the standard cautionary tale: attorneys at Levidow, Levidow and Oberman submitted a brief citing six entirely fictitious cases that ChatGPT had invented. Judge P. Kevin Castel sanctioned the lawyers in June 2023.
Later models have improved on this. Retrieval-augmented systems such as Elicit, Consensus, Scite Assistant, and PaperQA2 limit themselves to citations from a curated corpus, which mostly eliminates fabrication. But using ChatGPT or Gemini on its own, without a retrieval layer, to draft a literature review still produces invented references at a non-trivial rate. A 2023 study by Walters and Wilder in Scientific Reports found that GPT-4 hallucinated about 18 percent of references it produced, down from 55 percent for GPT-3.5 but still well above zero.
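The corpus restriction can be pictured as a membership check: a citation is accepted only if its identifier appears in a trusted index. The sketch below is a minimal illustration under that assumption, not a description of how any named product works. The function names are hypothetical, and the DOIs are placeholders under the `10.5555` test prefix.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
    return doi


def split_citations(cited_dois, corpus_dois):
    """Partition cited DOIs into those found in the curated corpus and
    those that are not. Original strings are preserved in the output."""
    corpus = {normalise_doi(doi) for doi in corpus_dois}
    verified, unverified = [], []
    for doi in cited_dois:
        target = verified if normalise_doi(doi) in corpus else unverified
        target.append(doi)
    return verified, unverified


# Placeholder identifiers for illustration only.
curated_corpus = ["10.5555/example.1", "10.5555/example.2"]
draft_citations = ["https://doi.org/10.5555/EXAMPLE.1", "10.5555/made.up"]
verified, unverified = split_citations(draft_citations, curated_corpus)
```

An unverified entry is not necessarily fabricated, since it may simply fall outside the corpus, so the second list is best treated as a queue for manual checking.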
The demand for AI text detection ran ahead of the technology. By mid 2023 dozens of detectors were on sale to universities, including GPTZero, Turnitin's AI writing detector, Originality.ai, and Copyleaks.
OpenAI itself launched an "AI Text Classifier" in January 2023 and quietly retired it in July 2023, citing its "low rate of accuracy". The withdrawal was widely read as an admission that reliable detection of LLM output is fundamentally hard.
A Stanford paper by Liang, Yuksekgonul, Mao, Wu, and Zou, published in Patterns in July 2023, found that GPT detectors were systematically biased against non-native English writers. The authors fed essays by non-native English students into seven commercial detectors and found that more than half of the essays were falsely flagged as AI-generated. Native English essays were rarely flagged. Trinka and similar tools, which suggest language fixes, can therefore push a writer's text into the same statistical zone as LLM output, raising false positive rates further.
The consensus among researchers studying detection is that there is no reliable, low-false-positive way to identify whether a given piece of text was written by a human or by a chatbot, especially after light editing. This has weakened the case for universities punishing students based on detector output, and many institutions in the US and UK now advise against relying on AI detection scores as evidence in disciplinary cases.
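The weakness of a detector score as evidence follows from base-rate arithmetic. The sketch below applies Bayes' rule to purely hypothetical numbers; the prior, the true positive rate, and the function name are assumptions chosen for illustration. The 50 percent false positive rate loosely echoes the "more than half" figure reported for non-native writers.

```python
def prob_ai_given_flag(prior_ai, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability that a flagged text is AI-written, given the
    share of AI-written texts and the detector's error rates."""
    flagged_ai = prior_ai * true_positive_rate
    flagged_human = (1.0 - prior_ai) * false_positive_rate
    total_flagged = flagged_ai + flagged_human
    if total_flagged == 0:
        raise ValueError("the detector never flags anything with these rates")
    return flagged_ai / total_flagged


# Hypothetical scenario: 10% of essays are AI-written and the detector
# catches 90% of them.
high_fp = prob_ai_given_flag(0.10, 0.90, 0.50)  # half of human essays flagged
low_fp = prob_ai_given_flag(0.10, 0.90, 0.01)   # 1% of human essays flagged
```

Under these assumed numbers, a flag implies only about a 17 percent chance of AI authorship when half of human essays are falsely flagged, against about 91 percent when the false positive rate is 1 percent.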
Large-scale evidence synthesis is one of the more promising application areas. The Cochrane Collaboration has run several pilots since 2023 using LLMs to screen titles and abstracts in systematic reviews. Covidence, the most widely used systematic review platform, integrated an AI title and abstract screener in 2024. Most of these tools position AI as a second reviewer working alongside human judges, rather than as a replacement.
Elicit's systematic review workflow, the FutureHouse Falcon agent, and Iris.ai's review automation all aim at the same target: shrinking the four-to-six-month timeline of a typical systematic review without lowering its quality. Independent validation studies through 2024 found that AI screening agreement with human reviewers was high for clear inclusion or exclusion cases but worse on borderline ones, which suggests human review of contested cases will remain necessary for some time.
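Agreement between an AI screener and a human reviewer is often summarised with a chance-corrected statistic. The sketch below computes Cohen's kappa for two lists of include/exclude decisions. The choice of kappa is an assumption, since the text does not say which measure the validation studies used, and the example labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented screening decisions for ten abstracts.
human = ["include"] * 3 + ["exclude"] * 7
ai = ["include", "include", "exclude"] + ["exclude"] * 6 + ["include"]
kappa = cohens_kappa(human, ai)
```

Here the raters agree on 8 of 10 abstracts, but because both exclude most items the chance agreement is 0.58, so kappa comes out near 0.52 rather than 0.8.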
A loose set of norms has emerged across journals, conferences, and funding agencies, even where formal policies vary.
These norms are still moving. The UK Research Excellence Framework, the European Research Council, and several US funding agencies issued guidance in 2024 stating that they would not accept proposals "written by AI" without disclosure, while leaving the threshold for what counts as written by AI deliberately vague. The National Institutes of Health prohibited the use of generative AI tools in its peer review of grant applications in June 2023.