Basecamp Research
Last reviewed
Jun 8, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 2,161 words
Basecamp Research is a London-based artificial intelligence and biotechnology company that has assembled what it describes as the world's largest and most diverse database of biological sequences, sampled directly from nature, and uses that proprietary data to train protein design and biological foundation models. Founded in 2019 by Glen Gowers and Oliver Vince, the company collects DNA from extreme and biodiverse environments worldwide under benefit-sharing agreements aligned with the Nagoya Protocol, and argues that the diversity of its data, rather than the size of its models alone, is the decisive advantage for AI in biology. [1][2][3]
Basecamp's central thesis is that public sequence repositories such as UniProt are too narrow to support frontier generative models: by the company's accounting, roughly 70 percent of existing public sequence data derives from only about ten heavily studied species. Its response has been to build a proprietary biological dataset, branded BaseData and structured as a knowledge graph called BaseGraph, that it says is more than ten times larger than all public databases combined and that catalogues over one million previously unknown species. The company has used this resource to launch a protein structure model (BaseFold) and a family of generative biology models (EDEN), and to enter partnerships including a genetic medicine collaboration with the laboratory of David R. Liu at the Broad Institute of MIT and Harvard. [1][3][4][5]
Basecamp Research operates at the intersection of biodiversity discovery, data infrastructure, and machine learning. Its business model treats curated, contextualized, and ethically sourced biological data as a defensible asset, or "data moat," that can be used both to train the company's own models and to supply data and tools to pharmaceutical, industrial biotechnology, and AI partners. Chief executive Glen Gowers has framed the company's position by contrasting it with conventional AI drug discovery, arguing that "AI is not magic. It's a pattern recognition tool that has creativity, but within the limits of what it's already seen," and that better data, not just larger models, expands those limits. [2][6]
The company is headquartered in London and has built a presence in the Boston area to support its U.S. partnerships. Beyond the two co-founders, its leadership includes chief technology officer Phil Lorenz and chief commercial officer Anupama Hoey, a biopharma executive hired in 2024 who had previously held roles at companies including Sutro Biopharma and Second Genome. [2][4][5]
Basecamp Research traces its origins to an off-grid DNA sequencing expedition led by Gowers and Vince, who met as researchers at Imperial College London. In 2019 the pair ran a fully self-contained, solar-powered sequencing operation on the Vatnajökull icecap in Iceland, sequencing microbial DNA in the field without grid power or internet connectivity. They found that roughly two-thirds of the organisms they sampled had never been recorded in any database, which convinced them that the planet's microbial diversity was vastly under-sampled and commercially untapped. The company was incorporated in 2019 around this insight. Gowers holds a doctorate in bioengineering and Vince a doctorate in synthetic biology; they serve as co-chief executives. [1][3][7]
Basecamp's core asset is its proprietary data. The company sends sampling expeditions to biodiverse and extreme environments, including acidic hot springs, Antarctic soils, and even World War II shipwrecks, and sequences the genetic material it recovers. Each data point is tagged with rich environmental and contextual metadata, such as the conditions and location where an organism was found, which the company says is critical for teaching models how proteins behave in their native context. The data is organized as a knowledge graph (BaseGraph) of sequences and the relationships between them. [3][8]
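The idea of sequences tagged with environmental metadata and linked by relationships can be illustrated with a minimal sketch. This is not Basecamp's schema: the class names, fields (`location`, `temperature_c`, `ph`), the `similar_to` relation, and the example sequences are all hypothetical, chosen only to show how a knowledge graph pairs context-rich nodes with typed edges.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SequenceRecord:
    """One sequence plus the context it was sampled in (hypothetical fields)."""
    seq_id: str
    sequence: str
    location: str
    temperature_c: float
    ph: float


@dataclass
class SequenceGraph:
    """Toy knowledge graph: records are nodes, typed relationships are edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_record(self, record):
        self.nodes[record.seq_id] = record

    def relate(self, source_id, relation, target_id):
        # Refuse edges that point at unknown nodes so the graph stays consistent.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((source_id, relation, target_id))

    def neighbors(self, seq_id, relation=None):
        """Return records linked from seq_id, optionally filtered by relation type."""
        return [
            self.nodes[dst]
            for src, rel, dst in self.edges
            if src == seq_id and (relation is None or rel == relation)
        ]


# Illustrative data only; the sequences and measurements are made up.
graph = SequenceGraph()
graph.add_record(SequenceRecord("seq-1", "MKTAYIAKQR", "acidic hot spring", 78.0, 2.5))
graph.add_record(SequenceRecord("seq-2", "MKSAYLAKQR", "Antarctic soil", -4.0, 6.8))
graph.relate("seq-1", "similar_to", "seq-2")
```

Calling `graph.neighbors("seq-1")` returns the Antarctic record together with its sampling context, which is the kind of lookup a model-training pipeline might use to pair a sequence with the conditions its relatives were found in.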
By June 2025 the company reported that its BaseData dataset contained 9.8 billion novel protein sequences and that, after removing redundant entries, it was over ten times larger than all public databases combined; it credited this work with the discovery of more than one million new species. The associated BaseGraph contained on the order of 5.5 to 6 billion biological relationships, which Basecamp describes as the largest such graph in existence. The company has consistently positioned this scale against the concentration of public data, noting that the bulk of public sequence information comes from a handful of model organisms. [3][5][9]
A defining feature of Basecamp's approach is that it claims to source 100 percent of its data ethically, under access and benefit-sharing arrangements consistent with the Nagoya Protocol to the Convention on Biological Diversity. Every sample is linked to the consent of the relevant landowner or authority and to an agreement under which the country or community of origin, which the company calls "guardians" or biodiversity partners, receives a share of revenue or royalties when the resulting digital sequence information is used commercially or to train AI models. Basecamp distributed its first wave of royalties in 2024 to 37 communities and organizations across 13 countries, a figure it later reported growing to roughly 60 organizations across 21 countries. In August 2024, ahead of the COP16 UN biodiversity conference in Cali, Colombia, Basecamp and the government of Cameroon announced a benefit-sharing deal that the company described as the first such digital-sequence-information agreement with a central African nation, pairing royalty payments with scientific training and laboratory funding. [3][8][10][11]
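The traceability requirement described above, in which every sample points to a documented agreement and commercial use triggers a payment to the partner of origin, can be sketched as a small data model. Everything here is an assumption for illustration: the `Agreement` and `Sample` classes, the partner names, and especially the pro-rata split by sample count, since the source does not describe how royalties are actually calculated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agreement:
    """Hypothetical benefit-sharing agreement with a biodiversity partner."""
    agreement_id: str
    partner: str
    country: str


@dataclass(frozen=True)
class Sample:
    """A sample that must reference the agreement it was collected under."""
    sample_id: str
    agreement_id: str


def royalty_split(samples_used, agreements, royalty_pool):
    """Split a royalty pool across partners in proportion to samples used.

    The pro-rata rule is an assumption made for this sketch only.
    """
    counts = {}
    for sample in samples_used:
        if sample.agreement_id not in agreements:
            # Traceability check: reject any sample without a documented agreement.
            raise ValueError(f"sample {sample.sample_id} has no documented agreement")
        partner = agreements[sample.agreement_id].partner
        counts[partner] = counts.get(partner, 0) + 1
    total = sum(counts.values())
    return {partner: royalty_pool * n / total for partner, n in counts.items()}


# Illustrative records only; partners, countries, and amounts are made up.
agreements = {
    "A1": Agreement("A1", "Partner X", "Country X"),
    "A2": Agreement("A2", "Partner Y", "Country Y"),
}
used = [Sample("s1", "A1"), Sample("s2", "A1"), Sample("s3", "A2")]
shares = royalty_split(used, agreements, 900.0)
```

The design point is that the consent check happens in the same code path as the payout, so a sample with no linked agreement cannot contribute to a commercial result without raising an error.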
The table below summarizes self-reported scale figures, which have grown over time and should be read as company claims rather than independently audited statistics.
| Metric | Reported figure | Approximate date |
|---|---|---|
| Novel protein sequences (BaseData) | 9.8 billion | June 2025 |
| Biological relationships (BaseGraph) | 5.5 to 6 billion | 2024 to 2025 |
| New species identified | over 1 million | June 2025 |
| Sampling locations | 150-plus | 2025 to 2026 |
| Countries sampled | 26 to 28 | 2025 to 2026 |
| Community and organization partners | 125-plus (later 152-plus) | 2025 to 2026 |
| Size vs. all public databases combined | over 10 times larger | June 2025 |
Basecamp has released two notable model lines built on its proprietary data.
In March 2024 the company introduced BaseFold, a protein structure prediction model created by augmenting AlphaFold2 with BaseGraph data. In results posted to bioRxiv and evaluated against CASP15 and CAMEO benchmark proteins, Basecamp reported up to a sixfold improvement in structural accuracy over AlphaFold2 for certain large, complex proteins and up to a threefold improvement in modeling small-molecule (ligand) interactions with protein targets. CTO Phil Lorenz described BaseGraph as "the core driver of our advances in AI." [4][12]
The EDEN models are a family of generative biology foundation models developed in collaboration with NVIDIA. The largest disclosed member, EDEN-28B, is described as a GPT-4-scale model trained on roughly 9.7 trillion biological (nucleotide) tokens spanning about 10 billion genes drawn from more than one million species. In January 2026, Basecamp announced that it had used the EDEN models to achieve programmable gene insertion, the placement of large therapeutic DNA sequences at precise locations in the genome, going beyond the small edits typical of CRISPR. The company reported designing insertion tools active at over 10,000 disease-relevant genomic sites, demonstrating CAR-T integration with greater than 90 percent tumor-cell clearance in laboratory assays, and achieving a 97 percent (32 of 33) functional success rate for AI-designed antimicrobial peptides against pathogens on the World Health Organization's critical-priority list. The gene-insertion work built in part on technology licensed from Tome Biosciences. Vince characterized the milestone by saying, "You can debate whether it's perfect or not, but it will change how we develop medicine." [13][14]
In October 2024, alongside its Series B financing, Basecamp announced a multi-year genetic medicine collaboration with the laboratory of David R. Liu, a Howard Hughes Medical Institute investigator and core member of the Broad Institute of MIT and Harvard who is known for pioneering base editing and prime editing. The collaboration pairs the Liu Lab's gene editing and wet-lab expertise with Basecamp's proprietary datasets and in-house AI models to invent novel fusion proteins and other large molecules for "programmable" genetic medicines. A rationale cited for the partnership is that gene-editing tools can be mined from the natural "warfare" between bacteria and viruses, a domain that Basecamp's biodiversity sampling is well placed to capture. [1][2][6]
Basecamp's CEO has said the company holds partnerships with around 15 organizations in the biological sciences, including three large drugmakers. In March 2026 the company unveiled a "Trillion Gene Atlas" initiative, a moonshot to expand catalogued genetic diversity roughly 100-fold by gathering genomic data from more than 100 million species across thousands of sites, undertaken with Anthropic, sequencing firms Ultima Genomics and PacBio, and NVIDIA's AI infrastructure. The company has compared the effort in ambition to the Human Genome Project. It has also worked with Microsoft and NVIDIA on cloud and accelerated-computing infrastructure for its data pipeline. [6][8][15]
Basecamp Research has raised approximately $85 million in venture capital since its founding. The company closed a $20 million Series A round in December 2022, led by Systemiq Capital, with participation from investors including True Ventures and Hummingbird Ventures. In October 2024 it announced a $60 million Series B led by the Paris-based firm Singular, with participation from S32, redalpine, and several prominent individual investors, including Roche vice-chairman André Hoffmann, Royal Philips chair and former DSM chief executive Feike Sijbesma, and former Unilever chief executive Paul Polman, alongside returning investors True Ventures and Hummingbird Ventures. NVIDIA has separately made a strategic investment in the company to accelerate its AI work. [1][2][7][16]
| Round | Amount | Date | Lead investor(s) |
|---|---|---|---|
| Series A | $20 million | December 2022 | Systemiq Capital |
| Series B | $60 million | October 2024 | Singular |
| Total raised | approximately $85 million | as of 2024 to 2026 | includes NVIDIA strategic investment |
Basecamp Research is frequently cited as a leading example of the thesis that proprietary, high-quality biological data is the binding constraint, and therefore the most durable competitive moat, for AI in the life sciences, much as web-scale text is for large language models. Where many computational biology efforts train on the same public repositories, Basecamp's wager is that systematically expanding the diversity and contextual richness of training data unlocks designs (enzymes, proteins, and gene-editing tools) that lie outside anything existing models have seen. [2][6][8]
The company is equally notable for attempting to make large-scale biodiscovery ethical and traceable by design. Its insistence on linking every data point to documented consent and a benefit-sharing agreement is presented as a template for how commercial use of genetic resources can comply with the Nagoya Protocol and emerging digital-sequence-information rules, including those negotiated under the high-seas BBNJ agreement, while still returning value to the countries and communities where the biology originates. Whether Basecamp's self-reported scale claims and model performance translate into clinically and commercially validated products remains to be demonstrated, but its partnerships with the Broad Institute, NVIDIA, and Anthropic have made it one of the more closely watched companies positioning data, rather than model architecture alone, as the frontier of AI for biology. [3][8][10]