Health

See also: Health ChatGPT Plugins

Health, in the context of this encyclopedia, covers the use of artificial intelligence in medicine, biomedical research, and the broader business of caring for sick people. It is one of the oldest application areas in AI (the first medical expert systems date to the early 1970s) and, after a long stretch of hype and disappointment, has become one of the most commercially active in the years since the deep learning revolution. The U.S. Food and Drug Administration's running list of authorized AI/ML-enabled medical devices passed 1,000 entries by the end of 2024, with roughly three quarters of those clearances coming after 2020. For a more concept-focused entry on the same subject, see AI in healthcare. The treatment here emphasizes named systems, named studies, and named clinical deployments.

Overview

Work on AI in medicine clusters into a few distinct activities. There is the discovery side, where machine learning is used to predict protein structures, score molecules, and pick targets. There is the clinical side, where AI reads images alongside radiologists, pathologists, and ophthalmologists, or is used to scribe a visit, suggest a differential, or screen for sepsis. There is the population health and operations side, which covers triage chatbots, scheduling, prior authorization, and disease surveillance. Each has its own regulatory regime, its own publication culture, and a different relationship to the patient.

The largest research efforts run out of Google DeepMind (AlphaFold, AlphaMissense, Med-PaLM, AMIE), Microsoft Research and its Nuance subsidiary, OpenAI (HealthBench, clinical evaluations of GPT-4), Anthropic, NVIDIA (Clara, MONAI), and a long list of focused companies including Tempus, Hippocratic AI, Abridge, Aidoc, Viz.ai, PathAI, Paige, Insilico Medicine, Recursion, Owkin, and BenevolentAI. Major academic centers including Mayo Clinic, Stanford, Johns Hopkins, and Moorfields Eye Hospital host ongoing trials and deployments.

History

Expert systems era

The first serious attempt at clinical AI was MYCIN, a backward-chaining, rule-based expert system built at Stanford in the early 1970s by Edward Shortliffe under Bruce Buchanan. MYCIN encoded about 600 if-then rules elicited from infectious-disease specialists and recommended antibiotic regimens for bacteremia and meningitis, with the dose adjusted for body weight. In published evaluations it matched or beat junior faculty on test cases, with an acceptability rating of about 65% (versus 42.5% to 62.5% for five Stanford faculty members). It was never used clinically, partly because terminals were rare and integration into hospital workflows did not exist, and partly because the medico-legal questions about a computer prescribing drugs had no answers.

MYCIN was followed by INTERNIST-1 at the University of Pittsburgh, the DXplain decision support system at Massachusetts General Hospital (released in 1986 and still used in some teaching settings), and a wave of clinical decision support tools through the 1980s. None of them stuck in routine practice. By the late 1990s the expert systems movement in medicine was largely seen as a failure, for reasons later catalogued by Edward Shortliffe and others: knowledge acquisition bottlenecks, brittleness outside the cases the rules were designed for, and a lack of integration with electronic records that mostly did not yet exist.

IBM Watson Health

The most expensive chapter in this history belongs to IBM. After Watson won Jeopardy in 2011, IBM announced a health business and signed a flagship $62.5 million partnership with the MD Anderson Cancer Center in 2013 to build an Oncology Expert Advisor for leukemia. MD Anderson cancelled the project in 2017 after a critical audit by PricewaterhouseCoopers; the system was never used on a patient. STAT News, in a 2017 investigation, reported that the Memorial Sloan Kettering team training Watson for Oncology had been feeding the model synthetic patient cases rather than real records, producing recommendations that hospitals in other countries found unsafe.

IBM is reported to have spent roughly $4 billion on acquisitions for Watson Health (including the 2016 purchase of Truven Health Analytics for $2.6 billion) while pulling in only about $1 billion a year in revenue. In January 2022 IBM sold the Watson Health assets to the private equity firm Francisco Partners for a reported sum of more than $1 billion. The buyer renamed the unit Merative. Watson Health became the canonical cautionary tale in medical AI: a model trained on the wrong data, applied to the wrong problem, sold to hospitals on terms the technology could not meet.

Deep learning era

The deep learning era in medicine starts with two landmark papers in 2016 and 2017.

In December 2016, Varun Gulshan and colleagues at Google published "Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs" in JAMA. The team trained a convolutional neural network on 128,175 retinal fundus images graded by 54 U.S. ophthalmologists. On the EyePACS-1 validation set the model achieved sensitivity of 90.3% and specificity of 98.1% for referable diabetic retinopathy; on the Messidor-2 set it reached 87.0% and 98.5%. That paper, more than any other, established that off-the-shelf deep learning could match specialist performance on a real diagnostic task with sufficient labeled data.

In February 2017, Andre Esteva and colleagues at Stanford published "Dermatologist-level classification of skin cancer with deep neural networks" in Nature, using an Inception v3 CNN fine-tuned on 129,450 clinical images covering 2,032 different skin diseases. The system reached dermatologist-level accuracy against 21 board-certified dermatologists on two binary tasks (keratinocyte carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi).

Within a few years the same idea (transfer learning a vision CNN on labeled medical images, then validating against specialists) had been applied to radiology, pathology, cardiology, gastroenterology, and ophthalmology. The 2018 paper by De Fauw and colleagues at DeepMind and Moorfields Eye Hospital, published in Nature Medicine, showed a referral system matching retinal specialists across more than 50 sight-threatening conditions.

Foundation model era

The foundation model era is best dated to late 2022 and 2023. ChatGPT, released in November 2022, was found within weeks to pass the USMLE at or near the boundary of competent performance, although the published evaluations soon disagreed about exactly which thresholds had been cleared. A March 2023 Microsoft Research paper by Harsha Nori, Nicholas King, Scott McKinney, Dean Carignan, and Eric Horvitz, "Capabilities of GPT-4 on Medical Challenge Problems," reported that GPT-4 without specialized prompting scored above 85% on USMLE practice materials, more than 20 points above the passing threshold and more than 30 points above ChatGPT.

Google's Med-PaLM, described in the December 2022 preprint and the July 2023 Nature paper "Large language models encode clinical knowledge" by Karan Singhal and colleagues, reached 67.2% on MedQA, becoming the first AI system to surpass the 60% USMLE-style passing mark. Med-PaLM 2, presented at Google Health's The Check Up event in March 2023 and published in Nature Medicine, raised that to 86.5% on MedQA.

Foundation models in pathology, radiology, and genomics followed quickly. Microsoft's GigaPath (2024) and Paige's Virchow (2024) trained on tens of thousands of whole-slide images. Stanford's CheXzero and Google's CXR Foundation and Med-Gemini variants pushed the same idea into radiology. By 2024 the field had pivoted from task-specific CNNs to general-purpose pretrained models adapted with relatively small amounts of supervised data.

Medical imaging

Medical imaging is the most regulated and most commercially mature corner of clinical AI. The FDA's list of AI/ML-enabled medical devices, which it began publishing in 2021, grew to 691 entries by August 2023, more than 1,000 by the end of 2024, and roughly 1,250 by the end of 2025, with radiology accounting for about 75% of authorizations.

The defining moment for autonomous AI diagnostics was April 11, 2018, when the FDA granted De Novo authorization (DEN180001) to IDx-DR, a screening device for more-than-mild diabetic retinopathy developed by IDx Technologies in Coralville, Iowa (the company is now Digital Diagnostics; the product was renamed LumineticsCore in 2022). IDx-DR uses retinal fundus photographs from the Topcon NW400 camera. The pivotal trial in 900 patients across 10 primary care sites reported sensitivity of 87.4% and specificity of 89.5%. It was the first FDA-cleared device authorized to make a clinical decision without specialist review.

Selected FDA-cleared imaging AI products

Product	Company	Indication	First clearance	Pathway
IDx-DR / LumineticsCore	Digital Diagnostics	Diabetic retinopathy screening	April 2018 (DEN180001)	De Novo
ContaCT	Viz.ai	Large vessel occlusion stroke triage	February 2018 (DEN170073)	De Novo
Aidoc ICH	Aidoc	Intracranial hemorrhage on head CT	August 2018	510(k)
Aidoc PE	Aidoc	Pulmonary embolism on chest CT	May 2019	510(k)
Aidoc LVO	Aidoc	Large vessel occlusion on head CTA	2020	510(k)
HealthCCSng	Bunkerhill Health/Nanox	Coronary calcium scoring	2021	510(k)
HeartFlow FFRct	HeartFlow	Coronary CT fractional flow reserve	2014	510(k)
Paige Prostate Detect	Paige	Prostate cancer on biopsy slides	September 2021 (DEN200080)	De Novo
AutoLung Nodule	RadAI / various	Lung nodule detection on chest CT	various	510(k)
OsteoDetect	Imagen	Wrist fracture detection	May 2018 (DEN180005)	De Novo
Caption Guidance	Caption Health (acquired by GE)	AI-guided cardiac ultrasound	February 2020 (DEN190040)	De Novo
EchoGo Pro	Ultromics	Echocardiographic stress imaging analysis	2020	510(k)
Critical Care Suite	GE Healthcare	Pneumothorax flagging on chest X-ray	2019	510(k)

Aidoc, an Israeli company founded in 2016, holds the largest number of FDA authorizations of any single radiology AI vendor (more than a dozen by 2024) covering intracranial hemorrhage, pulmonary embolism, large vessel occlusion, aortic dissection, vertebral compression fractures, and intra-abdominal free gas. Viz.ai, founded the same year, focused initially on stroke and won the first FDA clearance for an AI-based stroke triage product (ContaCT) in 2018; the system pages a neuro-interventionalist when an LVO is detected on a CT angiogram, shortening door-to-needle and door-to-puncture times in published cohort studies.

Paige Prostate Detect, cleared by the FDA in September 2021 (DEN200080), was the first AI-based pathology product authorized for clinical use in the United States. The pivotal study had 16 pathologists examine 527 prostate biopsy slides; using Paige Prostate Detect, average cancer detection rose from 89.5% to 96.8%, with a 70% reduction in false negative rate. PathAI, founded by Andrew Beck and Aditya Khosla, focuses on assistive pathology for clinical trials and runs the largest pathologist labeling operation in the field.
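The two figures quoted for the pivotal study are consistent with each other, which is easy to check. The short sketch below treats the reported detection rates as sensitivities and derives the relative drop in the false negative rate; the function name is illustrative and nothing here comes from the study's own analysis code.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop when a rate falls from `before` to `after`."""
    return (before - after) / before


# Detection rates quoted in the text (fraction of cancers detected).
sensitivity_unaided = 0.895
sensitivity_assisted = 0.968

# A false negative is a cancer that was not detected.
fn_unaided = 1.0 - sensitivity_unaided
fn_assisted = 1.0 - sensitivity_assisted

fn_reduction = relative_reduction(fn_unaided, fn_assisted)

print(f"False negative rate: {fn_unaided:.1%} -> {fn_assisted:.1%}")
print(f"Relative reduction: {fn_reduction:.0%}")
```

A 7.3-point gain in detection looks modest, but because the unaided miss rate was only 10.5%, it removes roughly seven in ten of the misses.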

HeartFlow FFRct, cleared by the FDA in 2014, uses computational fluid dynamics applied to coronary CT angiograms to estimate fractional flow reserve, the same hemodynamic measure used to decide whether a coronary lesion requires stenting. The technology was incorporated into the 2021 ACC/AHA Chest Pain Guideline as a class IIa recommendation and is reimbursed by Medicare under CPT code 0503T.

Drug discovery

The headline result in computational structural biology was AlphaFold 2, presented in November 2020 at CASP14 and described in detail in Nature on July 15, 2021 by John Jumper, Demis Hassabis, and colleagues at DeepMind. The system reached a median Global Distance Test score of 92.4 across all CASP14 targets (87.0 on the hardest free-modeling targets), a level the protein folding community had been chasing for half a century. The companion AlphaFold Protein Structure Database, hosted with EMBL-EBI, opened in July 2021 and expanded a year later to cover 200+ million predicted structures, essentially every protein sequence in the UniProt database. By late 2024 it held about 214 million predicted structures.
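For readers unfamiliar with the metric, the Global Distance Test total score (GDT_TS) is the average, over four distance cutoffs, of the percentage of residues whose predicted position lies within that cutoff of the experimental one. The sketch below is a simplification: it assumes the two structures have already been superimposed and takes the per-residue distances as given, whereas the full procedure searches over superpositions. The example distances are invented for illustration.

```python
from typing import Sequence

# Standard GDT_TS cutoffs, in angstroms.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float]) -> float:
    """Simplified GDT_TS from per-residue distances after superposition.

    Returns a score from 0 to 100.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff.
    fractions = [sum(d <= cutoff for d in distances) / n for cutoff in GDT_TS_CUTOFFS]
    return 100.0 * sum(fractions) / len(fractions)


# Hypothetical distances (angstroms) for an eight-residue toy structure.
example_distances = [0.4, 0.8, 1.5, 1.9, 3.0, 3.5, 6.0, 9.5]
score = gdt_ts(example_distances)
print(f"GDT_TS = {score:.1f}")
```

On this scale a score above roughly 90 is generally regarded as comparable to experimental accuracy, which is why the CASP14 result drew so much attention.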

In September 2023, DeepMind released AlphaMissense in Science. AlphaMissense is a variant of AlphaFold fine-tuned to score the pathogenicity of missense mutations, the single amino-acid substitutions responsible for many genetic diseases. The model assigned likely benign or likely pathogenic labels to 89% of the roughly 71 million possible missense variants in the human proteome. Only about 0.1% had been clinically classified before the model was released, making it a meaningful expansion of what is interpretable in clinical genetics, particularly for rare disease diagnosis.

In May 2024, DeepMind and Isomorphic Labs (the drug discovery spinout led by Hassabis) released AlphaFold 3, described in Nature by Josh Abramson and colleagues. AlphaFold 3 generalizes AlphaFold 2 to complexes including proteins paired with DNA, RNA, ligands, ions, and post-translationally modified residues. On internal benchmarks it outperformed specialized docking software on protein-ligand interaction prediction. The AlphaFold Server, a free interface, opened to academics on the same day, although AlphaFold 3's code and weights were not released until November 2024, a decision that drew complaints from the open-science community.

Isomorphic Labs, founded in 2021, has partnerships with Novartis and Eli Lilly (both announced in January 2024, with up to $1.7 billion in milestone payments from Lilly and up to $1.2 billion from Novartis) and has stated that it intends to put its first internally-discovered candidates into clinical trials in 2025-2026.

AI-discovered drugs in clinical trials

The number of compounds in clinical development whose discovery involved generative AI or large-scale ML has grown from roughly five in 2020 to more than 75 by 2024 according to the BiopharmaTrend AI Pipeline tracker, though the precise definition of "AI-discovered" varies.

Compound	Sponsor	Indication	Status (as of 2025)	Notes
ISM001-055 / rentosertib	Insilico Medicine	Idiopathic pulmonary fibrosis	Phase 2a results published June 2025 in Nature Medicine	TNIK inhibitor, 60 mg arm showed mean FVC gain of 98.4 mL vs placebo decline of 62.3 mL
INS019_055	Insilico Medicine	COVID-19 / inflammation	Phase 1
REC-994	Recursion Pharmaceuticals	Cerebral cavernous malformation	Phase 2 readout 2024, discontinued 2025	Phase 2 missed efficacy on long-term extension
REC-2282	Recursion	NF2-mutated meningioma	Phase 2/3
REC-4881	Recursion	Familial adenomatous polyposis	Phase 2
BEN-2293	BenevolentAI	Atopic dermatitis	Phase 2a failed (2023)	Triggered restructuring and layoffs at BenevolentAI
Baricitinib	Eli Lilly / BenevolentAI	COVID-19	Approved (FDA EUA, then full approval)	Repurposing identified by BenevolentAI's knowledge graph, March 2020
Atomwise programs	Atomwise / partners	Various	Mostly preclinical/Phase 1	AtomNet platform
Exscientia DSP-1181	Exscientia / Sumitomo	OCD	Phase 1 (discontinued 2022)	First AI-designed drug to enter human trials (2020)
Exscientia EXS-21546	Exscientia / Evotec	Oncology	Phase 1	A2A receptor antagonist

The Insilico Medicine readout in 2024 and 2025 is, so far, the strongest published evidence that an end-to-end generative pipeline can find a molecule that produces a meaningful clinical signal. ISM001-055, also called rentosertib, was discovered by Insilico's Pharma.AI platform, which used generative chemistry from the Chemistry42 module to optimize a small-molecule inhibitor of TNIK, a target proposed by the company's PandaOmics target identification system. The phase 2a was a 12-week, four-arm, placebo-controlled trial in 71 IPF patients across 21 sites in China (NCT05938920). The 60 mg once-daily arm showed a mean FVC improvement of 98.4 mL versus a 62.3 mL decline in placebo. The trial was published in Nature Medicine in June 2025.

Exscientia (now part of Recursion following a 2024 merger) ran the first AI-designed molecule into clinical trials in 2020 (DSP-1181, an OCD candidate co-developed with Sumitomo Dainippon Pharma). That program was discontinued in 2022 for insufficient Phase 1 activity, a reminder that AI-discovered does not mean AI-validated.

Other discovery applications

Generative chemistry tools include Insilico's Chemistry42, MIT/Liverpool's ChemBERTa, Atomwise's AtomNet, and academic systems including REINVENT and JT-VAE. Target identification platforms include BenevolentAI's knowledge graph, Causaly, and Owkin's federated models trained across academic medical centers. Pharma companies have built internal AI groups (Novartis's Biomedical Research AI team, AstraZeneca's Discovery Sciences AI group, Roche's Genentech Computational Sciences). NVIDIA's BioNeMo framework, released in 2023, supplies large pretrained models for proteins (ESM-style), molecules, and genomics for industrial use.

Clinical large language models

The arrival of large language models capable of answering medical questions changed the conversation about clinical AI in 2022 and 2023. Three model families have been the most thoroughly evaluated.

Med-PaLM and Med-PaLM 2

Med-PaLM, from Google Research, was the first AI system to clear the USMLE-style 60% threshold (67.2% on MedQA in late 2022). The paper, "Large language models encode clinical knowledge," appeared in Nature in July 2023 with Karan Singhal as first author. Med-PaLM 2, introduced at Google's The Check Up event in March 2023, scored 86.5% on MedQA, and a blind panel of physicians preferred its answers to those of clinicians on eight of nine axes in a head-to-head consumer-question evaluation. Google later packaged the Med-PaLM 2 weights as MedLM, a family of healthcare foundation models available on Vertex AI starting December 2023.

GPT-4 and the OpenAI HealthBench

Microsoft's March 2023 paper by Harsha Nori and colleagues showed GPT-4 scoring above 85% on USMLE practice materials with no specialized prompting. A subsequent paper, "Can Generalist Foundation Models Outcompete Special-Purpose Tuning?", showed that GPT-4 with the Medprompt prompting strategy could match or exceed specialized medical models on every benchmark in the MultiMedQA suite. OpenAI's HealthBench, released in May 2025, is a benchmark of 5,000 realistic multi-turn health conversations with 48,562 unique grading criteria, built by 262 physicians from 60 countries across 26 specialties. GPT-3.5 Turbo scored 16% on HealthBench; GPT-4o scored 32%; OpenAI's o3 scored 60%. In a separate experiment, OpenAI reported that physician edits to o3 responses no longer improved the answers.
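Rubric-based grading of this kind can be sketched in a few lines. The version below assumes each criterion carries a positive or negative point value, that a grader has already decided whether the response meets it, and that the score is earned points divided by the maximum achievable positive points, clipped to the range 0 to 1. That follows the general shape of the published HealthBench scoring rule, but the class, the function, and the example criteria are all hypothetical and the details should be treated as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    points: int  # positive = desirable behavior, negative = undesirable
    met: bool    # grader's judgment; supplied by hand in this sketch


def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points over maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Invented rubric for a single conversation.
example_rubric = [
    Criterion("Advises urgent care for chest pain with shortness of breath", 10, True),
    Criterion("Asks how long the symptoms have lasted", 5, False),
    Criterion("States a definitive diagnosis without examination", -8, True),
]

score = rubric_score(example_rubric)
print(f"Rubric score: {score:.2f}")
```

Negative criteria explain why benchmark scores can sit well below what a multiple-choice exam would suggest: a response that gives the right advice but also does something a physician flagged as harmful loses most of its credit.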

AMIE

AMIE (Articulate Medical Intelligence Explorer), a research system from Google DeepMind, is built specifically for diagnostic dialogue. The first AMIE paper appeared on arXiv in January 2024 and was published in Nature in April 2025. In a randomized double-blind crossover study using simulated patients (OSCE-style), AMIE outscored primary care physicians on 30 of 32 axes by specialist judges and 25 of 26 axes by the patient-actors, on text-based consultations. A multimodal version, AMIE-V, presented in May 2025, requests and reasons about images during the conversation. AMIE is a research system, not a product; Google has stated it has no near-term plans to commercialize it.

Other clinical LLM efforts

Hippocratic AI, founded in 2023 by Munjal Shah, releases a constellation of specialized models marketed as Polaris (versions 1.0, 2.0, and 3.0). Polaris 3.0, announced in March 2025, comprises 22 cooperating LLMs totaling about 4.2 trillion parameters and is positioned as a safety-focused system for telephonic patient outreach. Hippocratic claims a 99.38% clinical accuracy rate on internal evaluations and has signed deployments with WellSpan Health, Cincinnati Children's, and others. Anthropic's Claude is used inside Epic for chart summarization, ambient note generation, and patient question routing at multiple academic medical centers, though the work is mostly described in conference talks and Epic UGM materials rather than peer review.

Ambient AI scribes

Ambient documentation has become the highest-volume clinical use of generative AI. The premise is straightforward: a microphone in the exam room records the doctor-patient conversation, and an LLM produces a structured note in the SOAP format that the physician edits and signs. Adoption was minimal before 2023 and is now in the millions of visits per quarter.

Microsoft's Nuance Dragon Ambient eXperience (DAX) Copilot, announced in 2023 and rolled into the renamed Dragon Copilot in March 2025, is the most widely deployed product. DAX Copilot is embedded directly in Epic's workflow; Microsoft says more than 150 health systems have deployed it, including Mass General Brigham, Stanford Health Care, and Atrium Health. Published evaluations report 50% reductions in time on documentation and roughly 70% reductions in self-reported burnout among users.

Abridge, founded in 2018 by cardiologist Shiv Rao at the University of Pittsburgh Medical Center, has emerged as the leading independent in the category. Abridge passed $100 million in annual recurring revenue in May 2025 and raised a $300 million Series E led by Andreessen Horowitz in June 2025 at a $5.3 billion valuation, after a $250 million Series D in February 2025 at $2.75 billion. Major deployments include Kaiser Permanente (about 24,600 physicians across 40 hospitals and 600 clinics), Mayo Clinic (more than 2,000 physicians, expanding to nursing pilots), Johns Hopkins, Duke Health, UPMC, Yale New Haven, and Sutter Health.

Suki AI, founded in 2017 by Punit Soni (a former Flipkart executive), raised a $70 million Series D in 2024 and has integrations with Epic, Cerner/Oracle Health, Meditech, and athenahealth. A 2024 Phyx Primary Care evaluation reported a 41% reduction in documentation time per note and a 27% reduction in self-reported documentation burden. DeepScribe, Augmedix (acquired by Commure in 2024), Heidi Health, and Freed AI compete in the same category.

The economics are unusually favorable for vendors: most deployments price at roughly $1,500 to $3,000 per physician per year, on top of an EHR fee that is unchanged. KLAS has rated Abridge top of the segment for two consecutive years (2025 and 2026). The unresolved questions are about safety (hallucinated facts in the note, especially in pediatrics and psychiatry) and about who bears liability for an error introduced by the model.

Diagnostics, triage, and risk prediction

Triage chatbots

The consumer-facing triage chatbot category, hot through the late 2010s, has had a hard decade. Babylon Health, founded by Ali Parsa in 2013, went public via SPAC in October 2021 at a $4.2 billion valuation. By 2022 the company had abandoned two NHS contracts; the U.S. business filed for Chapter 7 liquidation in August 2023, and the UK business entered administration on September 11, 2023, with the GP at Hand service and other assets sold to eMed Healthcare for about £500,000. Ada Health (Berlin, founded 2011) and K Health (New York, founded 2016) continue to operate as symptom checkers tied to telehealth services, and Buoy Health serves as a routing layer for several US employers.

Sepsis prediction and the Epic Sepsis Model

Epic Systems' Sepsis Prediction Model (ESM), bundled into the Epic EHR since 2017 and used at hundreds of US hospitals, became the highest-profile case study of a clinical AI failing in deployment. In June 2021, Andrew Wong and colleagues at the University of Michigan published an external validation in JAMA Internal Medicine across 27,697 patients and 38,455 hospitalizations at Michigan Medicine. ESM showed a sensitivity of 33%, specificity of 83%, positive predictive value of 12%, and an AUC of 0.63 (Epic had reported an AUC range of 0.76 to 0.83). The model missed 67% of sepsis cases and generated alerts on 18% of all hospitalized patients, prompting concerns about alert fatigue. Epic released a redesigned predictive model in 2022, and subsequent validations have been more cautious about generalizing claims across sites.
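The reported figures hang together through Bayes' rule: when a condition is uncommon, even a moderately specific model produces mostly false alarms. The sketch below recomputes the positive predictive value and the overall alert rate from the quoted sensitivity and specificity. The sepsis prevalence of about 6.6% of hospitalizations is an assumption made for illustration, since the text above does not state it.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability that an alerted hospitalization truly involves sepsis."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


def alert_rate(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of all hospitalizations that trigger an alert."""
    return sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)


SENSITIVITY = 0.33  # quoted in the text
SPECIFICITY = 0.83  # quoted in the text
PREVALENCE = 0.066  # assumed for illustration, not given in the text

ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, PREVALENCE)
alerts = alert_rate(SENSITIVITY, SPECIFICITY, PREVALENCE)
missed = 1.0 - SENSITIVITY

print(f"PPV: {ppv:.0%}, alert rate: {alerts:.0%}, missed cases: {missed:.0%}")
```

Under that assumed prevalence the calculation lands near the published 12% positive predictive value and 18% alert rate, so roughly seven of every eight alerts would be false alarms.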

Other risk prediction

The Stanford-Penn team's Deterioration Index (Epic-based, but tuned per site) and various NEWS2 reimplementations dominate inpatient risk scoring. CMS supports several AI-enabled programs in cardiology and oncology under the CMS Innovation Center. Tempus, founded by Eric Lefkofsky in 2015 and listed on Nasdaq in June 2024 (TEM), runs a precision oncology platform combining tumor sequencing with proprietary AI for therapy matching. Tempus reports more than 60 deployed cardiology algorithms across 80+ hospitals. Glass Health, founded in 2021, is an AI clinical decision support tool aimed at differential diagnosis and care plan generation, used primarily for ambulatory referral and resident education.

Genomics and precision medicine

Genomics has been one of the steadier success stories. DeepVariant (Google, 2017) is a CNN-based variant caller that won the precisionFDA Truth Challenge in 2016 and is now in routine use in many sequencing labs. AlphaMissense (described above) applies deep learning to a related problem, scoring the likely pathogenicity of missense variants once they have been called. The Broad Institute's gnomAD database, combined with population-scale ML, supports clinical-grade variant interpretation tools including Franklin (Genoox) and Mastermind (Genomenon).

In cancer genomics, Foundation Medicine (acquired by Roche in 2018) and Tempus dominate commercial tumor profiling. The COSMIC database, AACR's Project GENIE, and the NCI's Cancer Research Data Commons supply the supervised data for most ML approaches. Federated learning firms including Owkin and Lifebit have built pan-hospital cohorts for biomarker discovery without exfiltrating individual patient data; the largest published examples are Owkin's MELLODDY consortium with EFPIA (2020-2022) and the Sanofi partnership announced in 2021 (a $180 million equity investment plus a $90 million co-development partnership in oncology).

Microsoft's GigaPath, Paige's Virchow, and the Mahmood Lab's UNI foundation model (all 2024) are early examples of vision foundation models for pathology, trained on tens of thousands of whole-slide images and adapted to a wide variety of downstream tasks with limited labels. These systems are mostly still research, but the underlying datasets (TCGA, the PANDA challenge, CAMELYON17) are now public.

Public health and epidemiology

BlueDot, a Toronto-based company founded by infectious disease physician Kamran Khan, issued the earliest documented warning about the cluster of pneumonia cases in Wuhan on December 31, 2019, more than a week before the WHO's January 9 statement. BlueDot's natural language processing pipeline scrapes about 100,000 sources daily including news, official notices, livestock health bulletins, and IATA flight data; using the same approach, the company correctly identified Bangkok, Hong Kong, Tokyo, Taipei, Phuket, Seoul, and Singapore as the seven cities most likely to receive infected travelers from Wuhan in early January 2020.

Metabiota played a similar role, and the U.S. CDC's COVID-19 Forecast Hub, run by Nicholas Reich at UMass Amherst, aggregated dozens of models including ML approaches throughout the pandemic. Google's COVID-19 Forecasting model was published as a preprint in 2020. After the pandemic, the EpiCenter consortium (CDC, with collaborators) launched in 2021 and now supports several ML-based outbreak detection programs in the U.S. Public Health Surveillance Network.

AI's pandemic record was uneven. The Turing Institute's 2021 review of ML for COVID-19 found that the papers published in 2020 on AI for COVID detection from chest imaging were not clinically usable, citing methodological flaws (data leakage, mixing pediatric controls with adult cases, small samples) in all 415 papers reviewed. This was a watershed moment in medical AI methodology, leading to the CLAIM, TRIPOD-AI, and STARD-AI reporting standards.

Mental health and digital therapeutics

Woebot (Woebot Health, founded 2017 by Alison Darcy at Stanford) and Wysa (Touchkin eServices, India, founded 2015) are the two most studied conversational mental health agents. Woebot Health received an FDA Breakthrough Device Designation in May 2021 for WB001, a postpartum depression digital therapeutic. Wysa received a similar Breakthrough Device Designation in May 2022 for chronic musculoskeletal pain with comorbid depression and anxiety. Multiple RCTs and pre-post studies have reported reductions in PHQ-9 and GAD-7 scores; effect sizes are typically smaller than in-person CBT and most studies do not have active comparators.

In 2024 and 2025, several jurisdictions began regulating AI use in mental health more directly. Utah passed HB 452 in March 2025, requiring AI mental health chatbots to disclose their non-human nature, restricting data sale, and banning embedded advertising. Illinois passed the Wellness and Oversight for Psychological Resources Act in August 2025, prohibiting AI systems from performing or advertising therapy unless tied to direct licensed-clinician oversight; New York's 2025 companion-AI rules require persistent in-conversation disclosure that the user is talking to a machine. In June 2025, Woebot Health announced it was winding down its consumer-facing app, citing the changing regulatory landscape and the difficulty of running an evidence-based service in a market crowded with unregulated competitors.

The Character.AI lawsuit filed by Megan Garcia in October 2024, alleging that the chatbot contributed to her son's suicide, has been cited in several of these state laws. The FTC opened an inquiry into companion chatbots in 2025.

Regulation

United States

The FDA regulates AI in medicine primarily as Software as a Medical Device (SaMD), the category for software that performs a medical function without being part of a hardware medical device and that typically runs on general-purpose computing platforms. Authorizations follow three pathways: 510(k) clearance (substantial equivalence to a predicate device), De Novo classification (novel, low-to-moderate risk), and Premarket Approval (PMA, for the highest-risk devices). The FDA's 2019 "Proposed Regulatory Framework for Modifications to AI/ML-Based SaMD" discussion paper introduced the idea of a Predetermined Change Control Plan (PCCP) to allow algorithms to be updated post-market without a new submission. The PCCP guidance was finalized in December 2024.

The FDA's Action Plan for AI/ML-Based Software as a Medical Device, published in January 2021, set five goals including a tailored regulatory framework, good machine learning practice principles (released with Health Canada and the UK MHRA in October 2021), patient-centered transparency, regulatory science methods to evaluate bias, and real-world performance monitoring. The agency now publishes a public-facing list of AI/ML-enabled medical devices, updated several times per year, that anyone can search.

European Union

In the EU, AI medical devices are regulated under two overlapping frameworks. The Medical Device Regulation (MDR, in force since May 2021) and the In Vitro Diagnostic Regulation (IVDR, May 2022) cover safety, clinical performance, and conformity assessment by notified bodies. The EU AI Act, in force since August 1, 2024, designates AI systems that are themselves medical devices or safety components of medical devices in MDR risk class IIa and above as high-risk AI systems. High-risk obligations begin to apply on August 2, 2026, with full enforcement on August 2, 2027.

High-risk AI systems under the AI Act must meet requirements for risk management, training data quality, transparency and information for users, human oversight, accuracy, robustness, cybersecurity, and conformity assessment. The AI Act layers on top of MDR rather than replacing it. Notified bodies will assess both frameworks. Annex III also captures AI systems used for emergency call triage and AI systems used to evaluate eligibility for healthcare benefits, regardless of whether they are medical devices.

Other jurisdictions

The UK MHRA published a Software and AI as a Medical Device Change Programme in 2021 and an updated roadmap in 2024. The MHRA, Health Canada, and the FDA jointly published Good Machine Learning Practice guiding principles in October 2021 and a Predetermined Change Control Plans paper in 2023. The WHO published Ethics and Governance of AI for Health in June 2021 and updated guidance on large multi-modal models in 2024. The Coordination Group on AI in Health, convened by the OECD and the WHO, has tried to coordinate cross-border evaluation criteria.

HIPAA in the U.S. and GDPR in the EU continue to govern the processing of patient data for AI training and inference. Most U.S. health systems require Business Associate Agreements with any vendor that touches protected health information, including LLM vendors. This requirement has been a friction point for the deployment of consumer-grade frontier models such as ChatGPT and the public Claude API.

Bias, equity, and limitations

The canonical paper on bias in deployed clinical AI is Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan's "Dissecting racial bias in an algorithm used to manage the health of populations," published in Science in October 2019. The team studied a widely used algorithm (later identified as Optum's Impact Pro, deployed across millions of patients) and found that at any given algorithmic risk score, Black patients were substantially sicker than White patients: at the 97th percentile of risk score, Black patients had about 26% more chronic conditions. The reason was straightforward in hindsight: the model predicted future healthcare costs, which were lower for Black patients at any given level of health because they had less access to care. Changing the prediction target from future costs to future health needs reduced bias in selected outcomes by about 84%.
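The label-choice mechanism described above can be illustrated with a toy audit. The sketch below is not the Obermeyer team's code or data: the cohort is synthetic, the group labels "A" and "B" are hypothetical, and the assumption that group B generates 30% less cost at the same illness burden is invented purely to show the effect. It ranks the same patients once by cost and once by number of chronic conditions, then compares who gets flagged for extra care.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Patient:
    group: str               # hypothetical group label, "A" or "B"
    chronic_conditions: int  # stand-in for true health need
    annual_cost: float       # stand-in for what a cost-trained model predicts


def top_fraction(patients, score, fraction):
    """Return the highest-scoring patients (the set flagged for extra care)."""
    ranked = sorted(patients, key=score, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


def share_of_group(patients, group):
    """Fraction of the given patients who belong to `group`."""
    return sum(p.group == group for p in patients) / len(patients)


def mean_conditions(patients, group):
    """Average number of chronic conditions among `group` members."""
    return mean(p.chronic_conditions for p in patients if p.group == group)


# Synthetic cohort (assumption): both groups have identical illness burdens,
# but group B generates 30% less cost at every level of illness.
cohort = []
for n in range(10):
    cohort.append(Patient("A", n, 1000.0 * n))
    cohort.append(Patient("B", n, 700.0 * n))

# Flag the top 20% under each choice of label.
by_cost = top_fraction(cohort, lambda p: p.annual_cost, 0.2)
by_need = top_fraction(cohort, lambda p: p.chronic_conditions, 0.2)

print("share of B flagged, cost label:", share_of_group(by_cost, "B"))
print("share of B flagged, need label:", share_of_group(by_need, "B"))
print("mean conditions of flagged A (cost label):", mean_conditions(by_cost, "A"))
print("mean conditions of flagged B (cost label):", mean_conditions(by_cost, "B"))
```

In this toy cohort, ranking by cost flags one group B patient out of four, while ranking by need flags two out of four, and the group B patient who is flagged under the cost label is sicker than the flagged group A patients. The numbers are artifacts of the synthetic data and say nothing about the magnitude of the effect in any real system.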

The Obermeyer paper has been cited more than 4,000 times and is the most-discussed concrete example of how proxy variables can encode social inequities into clinical decision support. The New York Attorney General opened an inquiry into UnitedHealth Group (Optum's parent) shortly after publication; the AG's 2020 report demanded transparency and ongoing testing.

Other recurring concerns:

Dataset shift: An imaging model trained at one health system frequently underperforms at another because of differences in scanner manufacturers, patient demographics, and labeling conventions. The 2018 Zech and Badgeley study at Mount Sinai showed that a CNN trained to detect pneumonia could identify the originating hospital from the image itself and exploit each site's pneumonia prevalence, rather than relying only on the radiologic finding.
Calibration drift: Models drift over time as care patterns and case mix change. There is no consistent regulatory requirement for ongoing recalibration once a device is cleared.
Bias by absence: Many models are trained on cohorts that skew white, male, and middle-aged. The All of Us Research Program at the NIH has tried to address this for U.S. genomics, but most non-NIH datasets remain skewed.
Hallucination in clinical LLMs: A 2024 study by Asma Ben Abacha and colleagues found that GPT-4 hallucinated medication names or doses in 10% to 20% of summarization tasks depending on the prompt. Ambient scribes have shown similar rates.
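Dataset shift and calibration drift are usually detected by comparing predicted risk with observed outcomes on local data. The following is a minimal sketch of two common summary checks, the observed-to-expected ratio and a binned expected calibration error. The function names, the bin count, and the toy predictions and outcomes are all illustrative assumptions, not part of any regulatory requirement or vendor tool.

```python
def calibration_by_bin(probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns (count, mean predicted risk, observed event rate) per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for items in bins:
        if items:
            mean_pred = sum(p for p, _ in items) / len(items)
            observed = sum(y for _, y in items) / len(items)
            rows.append((len(items), mean_pred, observed))
    return rows


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Size-weighted average gap between predicted and observed risk."""
    n = len(probs)
    return sum(
        count / n * abs(mean_pred - observed)
        for count, mean_pred, observed in calibration_by_bin(probs, outcomes, n_bins)
    )


def observed_to_expected(probs, outcomes):
    """Ratio of observed events to events the model expected (1.0 is ideal)."""
    return sum(outcomes) / sum(probs)


# Toy data (assumption): the same 16 predictions scored against outcomes at
# the development site and at a new site with a higher event rate.
probs = [0.25] * 8 + [0.5] * 8
dev_outcomes = [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0, 0, 0, 0]
new_outcomes = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0]

for name, outcomes in [("development", dev_outcomes), ("new site", new_outcomes)]:
    print(
        name,
        "O/E =", round(observed_to_expected(probs, outcomes), 3),
        "ECE =", round(expected_calibration_error(probs, outcomes), 3),
    )
```

On the toy data the model is perfectly calibrated at the development site and under-predicts risk at the new site. Running a check like this on a schedule, with an alert threshold chosen locally, is one way a health system might watch for drift, though appropriate thresholds and bin counts depend on sample size and clinical context.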

In 2023, the Coalition for Health AI (CHAI), a multistakeholder group, published an Assurance Standards Guide and the Quality Health AI Framework, attempting to standardize ongoing assurance practices for deployed clinical AI. Adoption is voluntary.

Notable incidents and controversies

IBM Watson for Oncology: Trained on synthetic cases at Memorial Sloan Kettering, the system generated treatment recommendations that other hospitals found unsafe. STAT News reported in 2017 that IBM's own internal documents acknowledged unsafe outputs.
DeepMind Streams and the Royal Free NHS Trust: In 2015, DeepMind began processing identifiable records on 1.6 million patients to support a kidney injury alerting app. The UK Information Commissioner's Office ruled in July 2017 that the Royal Free had violated the Data Protection Act by sharing the data without an adequate legal basis. The Streams app continued operating, but Google moved the project to its main healthcare unit.
Epic Sepsis Model: See Wong et al. 2021 above.
Babylon Health: Claims about the safety of the GP at Hand triage chatbot were criticized in 2018 and 2019 by Hamish Fraser at Brown University and others, after Babylon-published evaluation papers used unrealistic vignettes. The company's collapse is described above.
Character.AI mental health complaints: Multiple state attorneys general opened inquiries in 2024-2025 into companion chatbots after several teen suicides were associated with Character.AI use.
ChatGPT in clinical settings: Several U.S. health systems have published incident reports of physicians copy-pasting ChatGPT outputs into notes, including hallucinated citations to PubMed articles that did not exist.
Med-PaLM 2 hospital trial: A leaked February 2024 Wall Street Journal report described early Med-PaLM 2 evaluations at HCA Healthcare and one other U.S. system, in which physicians flagged a small but non-zero rate of hallucinated facts in radiology report drafts. Google said the system was not yet deployed at the time of those evaluations.

Selected academic medical center programs

Mayo Clinic Platform: Launched in 2020, with deep partnerships with Google Cloud (announced 2019, expanded multiple times) and Cerebras Systems (announced at the J.P. Morgan Healthcare Conference in January 2024, for a foundation model trained on Mayo's notes and images). Mayo's diagnostic radiology AI program operates several FDA-cleared models in production for chest imaging and stroke.
Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI): Founded in 2018 by Curtis Langlotz and Matthew Lungren. Stanford's CheXpert and CheXzero are widely used research benchmarks.
Massachusetts General Hospital CCDS (Clinical Decision Support): Originated in the QPID program.
Johns Hopkins inHealth Precision Medicine Initiative.
University of California San Francisco Center for Digital Health Innovation.
NHS AI Lab: Funded with £250 million from 2019 to 2023, it supports a portfolio of validation studies under the AI Award program (now the AI in Health and Care Award, run with NIHR).
Kaiser Permanente Division of Research: Its predictive analytics group developed the Advanced Alert Monitor sepsis-and-deterioration system, deployed in 21 hospitals.

References

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115-118.
Gulshan, V., Peng, L., Coram, M., et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22), 2402-2410.
De Fauw, J., Ledsam, J. R., Romera-Paredes, B., et al. (2018). Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24, 1342-1350.
Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.
Abramson, J., Adler, J., Dunger, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493-500.
Cheng, J., Novati, G., Pan, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381(6664), eadg7492.
Singhal, K., Azizi, S., Tu, T., et al. (2023). Large language models encode clinical knowledge. Nature, 620, 172-180.
Singhal, K., Tu, T., Gottweis, J., et al. (2024). Toward expert-level medical question answering with large language models. Nature Medicine.
Tu, T., Schaekermann, M., Palepu, A., et al. (2025). Towards conversational diagnostic artificial intelligence (AMIE). Nature, 642, 442-450.
Nori, H., King, N., McKinney, S. M., Carignan, D., and Horvitz, E. (2023). Capabilities of GPT-4 on Medical Challenge Problems. arXiv:2303.13375.
Arora, R. K., Wei, J., Hicks, R. S., et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. OpenAI technical report.
Wong, A., Otles, E., Donnelly, J. P., et al. (2021). External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine, 181(8), 1065-1070.
Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
Roberts, M., Driggs, D., Thorpe, M., et al. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3, 199-217.
Zech, J. R., Badgeley, M. A., Liu, M., et al. (2018). Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 15(11), e1002683.
Buchanan, B. G., and Shortliffe, E. H. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
Strickland, E. (2019). How IBM Watson Overpromised and Underdelivered on AI Health Care. IEEE Spectrum.
Ross, C., and Swetlitz, I. (2017). IBM pitched its Watson supercomputer as a revolution in cancer care. It's nowhere close. STAT News, September 5, 2017.
UK Information Commissioner's Office (2017). Royal Free, Google DeepMind trial failed to comply with data protection law. ICO press release, July 3, 2017.
Ren, F., Aliper, A., et al. (2025). A generative AI-discovered TNIK inhibitor for idiopathic pulmonary fibrosis: a randomized phase 2a trial. Nature Medicine.
FDA Center for Devices and Radiological Health (2024). Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices. FDA.gov device list, updated periodically.
FDA (2021). Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan.
European Commission (2024). Regulation (EU) 2024/1689 on Artificial Intelligence (the AI Act).
Wysa press release (May 12, 2022). Wysa Receives FDA Breakthrough Device Designation. BusinessWire.
Woebot Health press release (May 2021). FDA Breakthrough Device Designation for WB001 postpartum depression therapeutic.
Khan, K., et al. BlueDot blog and Wired magazine reporting on BlueDot's December 30, 2019 alert and subsequent COVID-19 travel forecasting.
Office of the Illinois Governor (2025). Pritzker signs Wellness and Oversight for Psychological Resources Act. August 4, 2025 press release.
Utah HB 452 (2025), Artificial Intelligence Amendments.
KLAS Research (2025-2026). Best in KLAS: Ambient Speech.
Sacra, FierceHealthcare, and STAT News reporting on Abridge funding rounds and customer announcements (2024-2025).
