FunSearch
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,544 words
FunSearch is a method from Google DeepMind that pairs a large language model with an automated evaluator to discover new solutions to hard problems in mathematics and computer science. The name is short for "searching in the function space," because the system does not look for answers directly. Instead it searches for short computer programs, written as functions, that generate or score those answers. DeepMind introduced FunSearch in a paper titled "Mathematical discoveries from program search with large language models," published in the journal Nature on 14 December 2023.[1][2] The work was widely reported as the first time an LLM had been used to make a verifiable new discovery on an open problem in the mathematical sciences.[2][3]
LLMs are fluent generators of plausible text and code, but they also confabulate: they produce statements that read convincingly yet are simply wrong. That tendency is a serious obstacle to using them for scientific discovery, where a single fabricated claim can poison a result. A model that invents a "proof" or a "construction" is useless if nobody can tell which of its outputs are real.[1][4]
FunSearch was designed around that weakness rather than against it. The insight is that creativity and correctness can be separated into two cooperating parts. The LLM supplies a flood of ideas, most of them junk, while a separate, deterministic evaluator throws out everything that does not actually work. Because only programs that have been executed and scored against the real problem are kept, hallucinations cannot survive the loop. As one description put it, FunSearch shows that LLMs can make discoveries if they are coaxed carefully and if you discard the large majority of what they come up with.[3]
The approach builds on DeepMind's broader line of work using AI for mathematics and algorithm design, which also includes systems such as AlphaTensor for matrix multiplication. FunSearch later served as a direct precursor to AlphaEvolve, DeepMind's 2025 evolutionary coding agent, which generalized the same evolve-and-evaluate idea to whole codebases and a wider range of problems.[2]
A user starts by describing a problem in code. They supply three things: an evaluation function that scores any candidate solution, a "solve" routine that uses a smaller helper function (the piece to be improved), and an initial version of that helper, called the seed program. The seed can be deliberately trivial, even a function that returns a constant.[1]
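A minimal sketch of such a specification, using the cap set problem (points in a space over three elements with no three on a line, covered later in this article) as the example. The names `priority`, `solve`, and `evaluate`, the greedy strategy, and the `blocked` bookkeeping are illustrative assumptions, not the published code. Only `priority` would be rewritten by the search.

```python
from itertools import product


def priority(element, n):
    """Seed program: the helper the search is asked to improve.

    Deliberately trivial; every point gets the same score.
    """
    return 0.0


def solve(n):
    """Greedy skeleton: add points in priority order, skipping any point
    that would complete a line with two points already chosen."""
    candidates = sorted(product(range(3), repeat=n),
                        key=lambda el: priority(el, n), reverse=True)
    cap_set, blocked = [], set()
    for el in candidates:
        if el in blocked:
            continue
        # Over three elements, the third point on the line through q and el
        # is -(q + el) mod 3, so that point can never be added later.
        for q in cap_set:
            blocked.add(tuple((-(a + b)) % 3 for a, b in zip(q, el)))
        cap_set.append(el)
    return cap_set


def evaluate(n):
    """Score a candidate `priority` by the size of the set it produces."""
    return len(solve(n))
```

With the constant seed, ties fall back to the enumeration order of the points, so the score reflects only the greedy skeleton. A better `priority` changes the order in which points are tried and therefore the size that `evaluate` reports.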
From there the system runs an evolutionary loop. At each step it samples one or more high-scoring programs from a database, builds them into a prompt, and asks the LLM to write a new and hopefully better version of the function to be improved. The candidate is then run and scored by the evaluator, and if it survives it is added back to the database. Over many iterations the population of programs "evolves," with good ideas getting recombined and refined into constructions that beat anything the seed could reach.[1][2]
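The loop can be sketched on a toy problem. Everything here is an assumption made for illustration: `mock_llm` is a random stand-in for the language model, candidates are tiny programs rendered from `TEMPLATE`, the objective is to match the line y = 3x + 1, and parents are picked by a small tournament. It is meant to show the sample, generate, evaluate, store cycle, not the real system.

```python
import random

# Hypothetical program template; the two constants are what "evolves".
TEMPLATE = "def f(x):\n    return {a} * x + {b}\n"


def evaluate(source):
    """Execute a candidate program and score it (higher is better)."""
    namespace = {}
    try:
        exec(source, namespace)
        f = namespace["f"]
        # Toy objective: negative squared error against y = 3x + 1.
        return -float(sum((f(x) - (3 * x + 1)) ** 2 for x in range(5)))
    except Exception:
        return float("-inf")  # programs that fail to run are rejected


def mock_llm(parent, rng):
    """Stand-in for the LLM: proposes a small variation of the parent."""
    a, b = parent
    return (a + rng.choice([-1, 0, 1]), b + rng.choice([-1, 0, 1]))


def evolve(steps=500, seed=0):
    rng = random.Random(seed)
    seed_params = (0, 0)  # trivial seed program: f(x) = 0
    database = [(evaluate(TEMPLATE.format(a=0, b=0)), seed_params)]
    for _ in range(steps):
        # Sample a high-scoring parent: best of three random entries.
        contenders = [rng.choice(database) for _ in range(3)]
        _, parent = max(contenders)
        child = mock_llm(parent, rng)
        score = evaluate(TEMPLATE.format(a=child[0], b=child[1]))
        if score > float("-inf"):  # keep only programs that ran
            database.append((score, child))
    return max(database)


best_score, best_params = evolve()
```

Because the seed stays in the database, `best_score` can never be worse than the seed's score. The evaluator, not the generator, decides what is kept.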
Two design choices explain why this works at scale. First, the LLM in the original system was Codey, a model built on top of the PaLM 2 family that had been fine-tuned on a large corpus of code and was accessed through an API.[1] The exact model is not critical: the system was later shown to run with other LLMs, including Gemini models. Second, the program database does not keep a single population. It splits candidates across multiple "islands" that evolve independently, which preserves diversity and stops the search from collapsing onto one mediocre idea too early.[1]
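One way to picture the island idea is a database that keeps a separate list per island and only ever samples within one island. The class name `IslandDatabase`, its methods, and in particular the `reset_weakest_half` step (wiping weak islands and reseeding them from stronger ones) are assumptions of this sketch and are not described above.

```python
import random


class IslandDatabase:
    """Hypothetical program store split into independent islands."""

    def __init__(self, num_islands, seed_program, seed_score, rng=None):
        self.rng = rng or random.Random(0)
        # Every island starts from the same seed program.
        self.islands = [[(seed_score, seed_program)]
                        for _ in range(num_islands)]

    def register(self, island_id, program, score):
        """Store a scored program on the island it was generated for."""
        self.islands[island_id].append((score, program))

    def sample(self, island_id, k=2):
        """Return up to k top programs from a single island, best first."""
        ranked = sorted(self.islands[island_id],
                        key=lambda entry: entry[0], reverse=True)
        return [program for _, program in ranked[:k]]

    def best(self, island_id):
        return max(self.islands[island_id], key=lambda entry: entry[0])

    def reset_weakest_half(self):
        """Assumed maintenance step: restart the weaker islands from the
        best program of a randomly chosen stronger island."""
        order = sorted(range(len(self.islands)),
                       key=lambda i: self.best(i)[0])
        half = len(order) // 2
        weak, strong = order[:half], order[half:]
        for i in weak:
            donor = self.rng.choice(strong)
            self.islands[i] = [self.best(donor)]


db = IslandDatabase(num_islands=4, seed_program="seed", seed_score=0.0)
db.register(0, "program_a", 5.0)
db.register(1, "program_b", 3.0)
parents = db.sample(0)  # only island 0 contributes to this prompt
```

Since prompts are built from one island at a time, a strong idea on one island cannot immediately crowd out different ideas developing elsewhere.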
The system runs asynchronously and in parallel, with many evaluators working at once, because a single run can require millions of LLM samples before a strong program emerges.[2][3] A key payoff of searching over programs rather than raw answers is interpretability: the output is human-readable code that reveals how a solution is built, not just an opaque list of numbers. The mathematician Jordan Ellenberg, a co-author, said the programs FunSearch produced were conceptually richer than a list of numbers and that studying them taught him something.[2][5]
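As a rough illustration of scoring many candidates at once, the sketch below uses a thread pool from Python's standard library. The `evaluate` function and the integer "candidates" are placeholders, and a real deployment would more plausibly isolate untrusted generated code in separate processes or machines. That is an assumption, not a description of DeepMind's infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate(candidate):
    """Placeholder scorer: candidates closer to 10 score higher."""
    return -abs(candidate - 10)


def score_in_parallel(candidates, max_workers=4):
    """Score candidates concurrently, collecting results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate, c): c for c in candidates}
        for future in as_completed(futures):
            # Results arrive in completion order, not submission order.
            results[futures[future]] = future.result()
    return results


scores = score_in_parallel(range(20))
best_candidate = max(scores, key=scores.get)
```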
The first headline application was the cap set problem, a long-studied question in extremal combinatorics that the Fields medalist Terence Tao has called one of his favorite open problems. A cap set is a collection of points in a high-dimensional space over three elements such that no three of the points lie on a line. The question is how large such a set can be.[2][3]
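The definition is easy to check mechanically. Over three elements, three distinct points lie on a line exactly when their coordinates sum to zero modulo 3, which gives the short test below. The function name `is_cap_set` and the small two-dimensional examples are chosen here for illustration.

```python
from itertools import combinations, product


def is_cap_set(points):
    """Return True if no three distinct points lie on a common line."""
    points = [tuple(p) for p in points]
    if len(set(points)) != len(points):
        return False  # repeated points are not allowed
    for a, b, c in combinations(points, 3):
        # Three distinct points are collinear iff they sum to 0 mod 3
        # in every coordinate.
        if all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c)):
            return False
    return True


square = [(0, 0), (0, 1), (1, 0), (1, 1)]   # a cap set of size 4
line = [(0, 0), (0, 1), (0, 2)]             # three points on one line

# Brute force over the 9 points of the 2-dimensional space: does any
# 5-point subset avoid lines?
plane = list(product(range(3), repeat=2))
any_size_five = any(is_cap_set(s) for s in combinations(plane, 5))
```

Brute force of this kind stops being feasible quickly, since the number of points grows as 3 to the power of the dimension. That is what makes large dimensions hard.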
FunSearch found new cap set constructions that beat the best previously known. In dimension 8, it discovered a cap set of size 512, larger than any previously known in that dimension.[1][6] More important than any single dimension was the effect on the asymptotic bound, the rate at which the largest cap set can grow as the dimension increases. Using FunSearch to build a large "admissible set," a related structure that yields cap sets in high dimensions, the team improved the lower bound on the cap set capacity from a prior value of 2.2180 to 2.2184, and a further construction pushed it to 2.2202.[1][6] DeepMind described this as the largest increase in the size of cap sets in about twenty years, and reported that FunSearch outperformed state-of-the-art computational solvers on the problem.[3]
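For reference, the link between a single construction and the capacity can be written out. The symbols $c_n$ and $C$ are notation assumed here, and the definition of the capacity as a supremum is the standard one, not something stated in the text above.

```latex
% c_n : size of the largest cap set in dimension n (assumed notation)
% C   : cap set capacity
C = \sup_{n \ge 1} c_n^{1/n},
\qquad
A \subseteq \mathbb{F}_3^n \text{ a cap set}
\;\Longrightarrow\;
C \ge |A|^{1/n}.

% Applied to the dimension 8 construction:
512^{1/8} = 2^{9/8} \approx 2.1810 < 2.2180.
```

On its own, the size-512 cap set therefore implies a capacity bound of only about 2.1810, which is consistent with the improved asymptotic bounds coming from the admissible set constructions and not from the dimension 8 result directly.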
To show the method was not a one-off, the team applied it to a practical algorithmic problem: online bin packing. The task is to pack a stream of items of varying sizes into as few fixed-capacity bins as possible, deciding where each item goes the moment it arrives, with no chance to repack later. It underpins real workloads such as loading containers or scheduling compute jobs in data centers.[2][3]
The standard hand-designed rules for this problem are "first fit," which puts each item in the first bin with enough room, and "best fit," which puts it in the fullest bin that still has room. FunSearch evolved its own heuristic that beat both. The gains, measured as the percentage of excess bins used relative to a lower bound on the optimal number of bins (lower is better), were consistent across the OR-Library benchmark instances.[1]
| Dataset | First fit | Best fit | FunSearch |
|---|---|---|---|
| OR1 | 6.42% | 5.81% | 5.30% |
| OR2 | 6.45% | 6.06% | 4.19% |
| OR3 | 5.74% | 5.37% | 3.11% |
| OR4 | 5.23% | 4.94% | 2.47% |
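The two baseline rules are short enough to write out. This is a plain sketch, not the benchmark code: the item list is made up, and `simple_lower_bound` (total size divided by capacity, rounded up) is an assumption that may differ from the bound used for the published figures.

```python
import math


def first_fit(items, capacity):
    """Place each item in the first open bin with enough room."""
    bins = []  # remaining space in each open bin
    for item in items:
        for i, space in enumerate(bins):
            if item <= space:
                bins[i] = space - item
                break
        else:
            bins.append(capacity - item)  # nothing fits: open a new bin
    return len(bins)


def best_fit(items, capacity):
    """Place each item in the fullest open bin that still has room."""
    bins = []
    for item in items:
        fitting = [i for i, space in enumerate(bins) if item <= space]
        if fitting:
            # The fullest bin is the one with the least remaining space.
            i = min(fitting, key=lambda j: bins[j])
            bins[i] -= item
        else:
            bins.append(capacity - item)
    return len(bins)


def simple_lower_bound(items, capacity):
    """No packing can use fewer bins than total size / capacity."""
    return math.ceil(sum(items) / capacity)


def excess_fraction(bins_used, lower_bound):
    """Fraction of bins used beyond the lower bound (lower is better)."""
    return (bins_used - lower_bound) / lower_bound


items, capacity = [4, 7, 3, 6], 10
bound = simple_lower_bound(items, capacity)
ff_excess = excess_fraction(first_fit(items, capacity), bound)
bf_excess = excess_fraction(best_fit(items, capacity), bound)
```

On this small stream, first fit places the 3 next to the 4 and then needs a third bin for the 6, while best fit places the 3 next to the 7 and finishes in two bins.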
On data drawn from a Weibull distribution the discovered heuristic was even stronger, scaling gracefully to large instances and landing only about 0.03% above the optimum on a problem with 100,000 items.[1] Because the result was a readable program rather than a black box, the researchers could see the strategy it had learned, such as avoiding leaving bins with only a small amount of remaining space.[1][2]
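To make the idea of "avoiding small leftovers" concrete, the sketch below expresses a packing rule as a priority function over candidate bins. Note that `avoid_slivers_priority` is a hypothetical rule written for this example and is not the heuristic FunSearch discovered. The `sliver` threshold, the penalty of 100, and the item stream are likewise assumptions.

```python
def pack_online(items, capacity, priority):
    """Send each arriving item to the feasible bin with the highest priority."""
    bins = []  # remaining space in each open bin
    for item in items:
        # Options: every open bin that fits, plus one fresh empty bin.
        choices = [(i, s) for i, s in enumerate(bins) if item <= s]
        choices.append((len(bins), capacity))
        index, _ = max(choices, key=lambda c: priority(item, c[1]))
        if index == len(bins):
            bins.append(capacity)
        bins[index] -= item
    return bins


def best_fit_priority(item, space):
    """Best fit as a priority: prefer the smallest leftover space."""
    return -(space - item)


def avoid_slivers_priority(item, space, sliver=1):
    """Hypothetical variant: like best fit, but penalize placements that
    leave a small, hard-to-use gap (an exact fit is still ideal)."""
    leftover = space - item
    score = -leftover
    if 0 < leftover <= sliver:
        score -= 100  # illustrative penalty, not a tuned value
    return score


stream = [7, 5, 2, 3, 3]
bins_best_fit = len(pack_online(stream, 10, best_fit_priority))
bins_variant = len(pack_online(stream, 10, avoid_slivers_priority))
```

On this hand-picked stream the variant uses two bins where best fit uses three, because it declines to leave a gap of 1 in the first bin. A single example like this says nothing about average performance; it only shows how such a strategy can be read directly off a priority function.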
The framing that drew the most attention was that FunSearch produced genuinely new, verifiable knowledge. DeepMind called it the first time a new discovery had been made for an open problem in the mathematical sciences using LLMs, and MIT Technology Review described it as the first time a large language model had been used to discover a solution to a long-standing scientific puzzle that yielded verifiable and valuable new information.[1][2][3] DeepMind's research lead Pushmeet Kohli stressed that the cap set construction was not in the training data: "it wasn't even known."[3] Ellenberg argued the more exciting prospect was the new mode of human-machine collaboration: FunSearch generates a program that finds a solution, and a program is something a person can read, interpret, and build on for the next problem.[2][5]
That said, the result invited pushback, and the limits are worth stating plainly. The LLM did not solve anything on its own. It acted more like a creative mutation operator inside an evolutionary algorithm, and the heavy lifting of framing the problem (the evaluator, the program skeleton, the seed) had to be done by humans for each new task.[4] FunSearch works best on problems with two properties: a fast, reliable way to score candidate solutions automatically, and a search that can be expressed as improving a short piece of code. Problems without a cheap automatic check, or where the answer is not naturally a program, do not fit the mold. Critics also noted that the framing of "solving the unsolvable" oversold what is, at bottom, a clever combination of genetic programming and language models rather than autonomous machine reasoning.[4] Even so, the basic recipe of generating broadly with an LLM and filtering ruthlessly with an evaluator proved durable enough to carry forward into later DeepMind systems.