AlphaEvolve is an evolutionary coding agent developed by Google DeepMind and announced on May 14, 2025. The system uses an ensemble of Gemini large language models paired with automated evaluators and an evolutionary search framework to discover and optimize algorithms across mathematics, hardware design, and computing infrastructure. AlphaEvolve marked the first AI-driven improvement over Strassen's 1969 matrix multiplication algorithm for 4×4 complex-valued matrices, reducing the required scalar multiplications from 49 to 48 after 56 years without progress on that specific problem.
AlphaEvolve belongs to a line of Google DeepMind systems that apply AI to algorithm discovery and mathematical reasoning. Understanding where it fits requires tracing the research thread back through several earlier projects.
AlphaCode (2022) demonstrated that large language models could compete in programming competitions at a level roughly comparable to the median human contestant. That work established that LLMs could generate syntactically correct, logically coherent code for well-specified problems, but it targeted competitive programming rather than open-ended algorithm discovery.
AlphaTensor (October 2022) tackled a specific and long-standing algorithmic problem: finding fast matrix multiplication algorithms. DeepMind framed the problem as a three-player game, then used a reinforcement learning agent to search for winning sequences of moves. AlphaTensor found algorithms for dozens of matrix sizes that beat previously known methods, including improvements over the standard 50-year-old approaches for certain small matrices. For 4×4 matrices over finite fields of characteristic two, it found an algorithm using 47 multiplications, but the analogous result over complex numbers (characteristic zero) remained at Strassen's 49.
FunSearch (December 2023) introduced the idea of pairing LLMs with evolutionary search to discover short, executable programs. The name came from "searching in the function space." FunSearch used relatively small language models trained primarily on code, generating candidate Python functions and scoring them with automated evaluators. It made progress on the cap set problem (finding large subsets of $\mathbb{Z}_3^n$ with no three elements in arithmetic progression) and discovered new bin-packing heuristics that outperformed known methods.
FunSearch established several principles that AlphaEvolve would later extend: LLMs as mutation operators rather than end-to-end solvers, automated evaluation to avoid hallucination risks, and an evolutionary database to manage the population of candidate programs. The main limitation was scope: FunSearch evolved single Python functions and used small code-specialized models, which constrained the complexity of algorithms it could find.
AlphaProof (2024) took a different direction, combining language models with formal proof verification to solve competition mathematics problems including International Mathematical Olympiad problems. AlphaProof worked in the domain of symbolic proof rather than algorithmic code, but it reinforced DeepMind's broader strategy of using LLMs in combination with automated verification rather than relying on LLM outputs alone.
AlphaEvolve operates as an asynchronous pipeline built around four main components: a prompt sampler, an LLM ensemble, a program database, and an evaluation system. The system is written to maximize throughput rather than minimize latency on individual tasks, running many candidate evaluations in parallel.
The prompt sampler constructs inputs for the language models by drawing on customizable templates that incorporate solutions sampled from the program database. Users can add explicit instructions, stochastic formatting, evaluation results from previous runs, and optional meta-prompt evolution (where the prompts themselves are subject to optimization). This flexibility lets the system adapt its search behavior over time as it learns which kinds of instructions produce useful mutations.
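The prompt-assembly step can be sketched in a few lines. The template text, field names, and sampling strategy below are illustrative assumptions, not AlphaEvolve's actual format:

```python
import random

# Hypothetical prompt template; the real system uses customizable
# templates with richer context and optional meta-prompt evolution.
TEMPLATE = """\
You are improving a program. Previous attempts and their scores:
{history}

Current program:
{parent}

{instructions}
Propose a targeted edit as a diff."""

def build_prompt(parent, database, instructions=""):
    # Sample a few prior solutions so the model sees what has worked.
    sampled = random.sample(database, k=min(3, len(database)))
    history = "\n".join(f"score={s:.2f}:\n{prog}" for s, prog in sampled)
    return TEMPLATE.format(history=history, parent=parent,
                           instructions=instructions)
```

In a full system the `instructions` field is where user guidance and evolved meta-prompts would be injected.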
AlphaEvolve uses two Gemini models with complementary roles: Gemini 2.0 Flash, a fast model that generates the bulk of candidate mutations at high throughput, and Gemini 2.0 Pro, a more capable model called on less often to supply higher-quality suggestions when the search needs them.
Both models generate code changes in the form of diff specifications, identifying specific blocks of the current program to replace. Rather than rewriting entire programs from scratch each iteration, the system makes targeted edits, which preserves working structure while exploring variations.
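A minimal sketch of applying one such targeted edit, assuming a search/replace-style diff block (the exact diff format is an assumption here):

```python
def apply_diff_block(program: str, search: str, replace: str) -> str:
    """Replace the first occurrence of `search` in `program` with `replace`.

    Raises ValueError if the block is not found, so malformed LLM output
    is rejected rather than silently ignored.
    """
    if search not in program:
        raise ValueError("search block not found in current program")
    return program.replace(search, replace, 1)

parent = "def score(x):\n    return x * 2\n"
child = apply_diff_block(parent, "return x * 2", "return x * x")
# child now differs from parent only in the targeted block.
```

Rejecting unmatched search blocks matters in practice: it keeps a bad generation from corrupting a working parent program.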
The program database stores all candidate solutions along with their evaluation scores and metadata. Its structure is inspired by MAP-elites (a quality-diversity evolutionary algorithm) combined with island-based population models. MAP-elites maintains a grid of solutions indexed by behavioral characteristics, ensuring diversity rather than convergence to a single local optimum. Island models keep separate sub-populations that evolve somewhat independently, periodically exchanging their best solutions.
This combination balances exploration (trying genuinely different approaches) against exploitation (refining approaches that are already working). The database provides context to the LLM during prompt construction, so the model can see what has worked before and generate mutations informed by that history.
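A toy version of such a database, with MAP-elites-style cells and island migration; the structure and names are illustrative, not DeepMind's implementation:

```python
import random

class ProgramDatabase:
    """Toy MAP-elites-style store: one best program per behavior cell,
    partitioned into islands that evolve semi-independently."""

    def __init__(self, num_islands: int = 4):
        # islands[i] maps a behavior descriptor (e.g. a program-length
        # bucket) to the best (score, program) seen in that cell.
        self.islands = [dict() for _ in range(num_islands)]

    def add(self, island: int, descriptor, score: float, program: str):
        cell = self.islands[island].get(descriptor)
        if cell is None or score > cell[0]:      # keep only the elite
            self.islands[island][descriptor] = (score, program)

    def sample_parent(self, island: int):
        # Bias toward higher-scoring elites while preserving diversity.
        elites = list(self.islands[island].values())
        weights = [max(s, 1e-9) for s, _ in elites]
        return random.choices(elites, weights=weights, k=1)[0]

    def migrate(self):
        # Periodically copy each island's best elite to the next island.
        for i, isl in enumerate(self.islands):
            if not isl:
                continue
            best_desc, best = max(isl.items(), key=lambda kv: kv[1][0])
            self.add((i + 1) % len(self.islands), best_desc, *best)
```

Keeping one elite per behavior cell is what prevents the population from collapsing onto a single local optimum.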
Every problem posed to AlphaEvolve requires an evaluation function that maps a candidate program to one or more scalar scores. This is a hard requirement of the system: if you cannot automatically grade a solution, AlphaEvolve cannot run. The system supports cascading evaluation, starting with simpler test cases before moving to expensive ones, which reduces wasted compute on clearly poor candidates. It also supports parallel evaluation across multiple simultaneous metrics, enabling multi-objective optimization where earlier systems like FunSearch only handled single objectives.
Beyond numerical scoring, the evaluation component can incorporate LLM-generated feedback, where a language model reads the output of a candidate program and produces a qualitative assessment that the system can use alongside quantitative metrics.
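The cascading idea can be sketched directly; the stage structure and thresholds here are assumptions for illustration:

```python
def cascade_evaluate(program, stages):
    """Run evaluation stages from cheapest to most expensive, aborting as
    soon as a stage's score falls below its threshold. Returns the scores
    accumulated so far.

    `stages` is a list of (evaluate_fn, threshold) pairs, ordered by cost.
    """
    scores = []
    for evaluate, threshold in stages:
        score = evaluate(program)
        scores.append(score)
        if score < threshold:
            break   # don't spend compute on a clearly poor candidate
    return scores

# Hypothetical stages: a fast smoke test, then a stand-in for a real benchmark.
stages = [
    (lambda p: 1.0 if "return" in p else 0.0, 0.5),  # cheap sanity check
    (lambda p: float(len(p)) / 100.0, 0.0),          # "expensive" stage
]
```

A candidate like `"def f(x): pass"` fails the cheap stage and never reaches the expensive one, which is exactly the compute saving the cascade buys.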
The table below summarizes the main architectural and capability differences between FunSearch and AlphaEvolve.
| Dimension | FunSearch (2023) | AlphaEvolve (2025) |
|---|---|---|
| Code scope | Single Python functions | Entire codebases, hundreds of lines |
| Language models | Small, code-specialized LLMs | Frontier models (Gemini 2.0 Flash + Pro) |
| Natural language use | Minimal | Rich natural-language context and feedback |
| Optimization criteria | Single objective | Multi-objective optimization supported |
| Sample efficiency | Millions of samples per run | Thousands of samples per run |
| User controls | Fixed configuration | Multiple configurable parameters |
| Open access | Released to research community | Early access only (as of 2025) |
| Problem domains demonstrated | Cap set, bin packing | Matrix multiplication, data centers, hardware design, 50+ math problems |
The most significant architectural difference is that AlphaEvolve treats the LLM as an intelligent mutation operator rather than a function generator. Because Gemini models are trained on broad natural language and code, they implicitly know standard genetic operators (crossover, mutation, selection) and apply them through their world knowledge rather than through hand-coded procedures. This means AlphaEvolve does not need explicit evolutionary operators in its code; the LLM decides how and where to modify each candidate based on everything it learned during pretraining.
The workflow for using AlphaEvolve requires the user to supply two things: an initial program that represents a starting-point solution, and an evaluation function that scores programs against the target metric. The system then runs the evolutionary loop autonomously until improvement plateaus.
This is a notably different interface from typical LLM coding tools, where a human iterates interactively with the model. AlphaEvolve runs for hours or days without human oversight, accumulating a population of programs and gradually ratcheting up performance. The human role shifts from pair programmer to problem framer.
For each iteration, the system samples a parent solution from the database (biased toward higher-scoring programs but maintaining diversity), constructs a prompt with context from previous solutions, calls the LLM ensemble to generate a diff, applies that diff to the parent program, evaluates the result, and stores any program that achieves a new high score or a score above a quality threshold in a new region of the behavioral space. The loop continues asynchronously, with Flash handling the bulk of rapid-fire iterations and Pro called in when the search appears stuck.
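The loop above can be condensed into a single-threaded sketch; the real system runs asynchronously with an LLM ensemble, and `propose_diff` here is a stand-in for the Gemini call:

```python
import random

def evolve(initial_program, evaluate, propose_diff, iterations=1000):
    """Minimal sketch of the evolutionary loop: sample a parent, mutate,
    evaluate, and keep improvements. Illustrative, not the real system."""
    population = [(evaluate(initial_program), initial_program)]
    for _ in range(iterations):
        # Tournament selection: biased toward high scores, still diverse.
        parent_score, parent = max(
            random.sample(population, k=min(3, len(population))))
        child = propose_diff(parent, population)   # LLM-generated edit
        score = evaluate(child)
        if score > parent_score:                   # keep improvements
            population.append((score, child))
    return max(population)
```

With a trivial toy problem (programs are integer strings, the "LLM" just increments), the loop steadily climbs; the real search space is of course vastly harder.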
AlphaEvolve needs only thousands of sample evaluations to converge, compared to the millions FunSearch required. This improvement in sample efficiency reflects the higher capability of frontier Gemini models: each suggestion is more likely to be useful, reducing wasted evaluations on dead ends.
The most mathematically significant result from AlphaEvolve is an algorithm that multiplies two 4×4 complex-valued matrices using 48 scalar multiplications, one fewer than Strassen's 1969 algorithm.
Volker Strassen's 1969 paper introduced a recursive approach to matrix multiplication that reduced the naive $n^3$ operation count. For 2×2 matrices, Strassen showed that 7 multiplications suffice instead of 8. Applying this recursively gives an algorithm for 4×4 matrices using $7^2 = 49$ multiplications. For more than five decades, mathematicians and computer scientists could not improve on 49 multiplications for 4×4 matrices over fields of characteristic zero (which includes the real and complex numbers).
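The recursive count can be checked in a few lines:

```python
def strassen_mults(n: int) -> int:
    """Scalar multiplications used by recursive 2x2 Strassen on an
    n x n matrix (n a power of two): each level replaces the 8 naive
    block products with 7."""
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)

print(strassen_mults(2))   # 7
print(strassen_mults(4))   # 49, the count that stood until AlphaEvolve's 48
```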
AlphaTensor (2022) had found 47-multiplication algorithms for 4×4 matrices over fields of characteristic two (finite fields), but that result did not carry over to characteristic zero because of different algebraic rules. The challenge for complex matrices specifically remained at Strassen's 49.
AlphaEvolve found a solution that uses complex numbers in a way human researchers had not tried, creating algebraic cancellations that reduce the count to 48. The algorithm works over non-commutative rings, meaning it applies not just to scalar complex numbers but also to matrices of complex numbers (block matrix multiplication). Independent verification confirmed the result is correct. The fact that AlphaEvolve, a general-purpose system not specifically designed for matrix multiplication, outperformed AlphaTensor, which was purpose-built for that problem, demonstrated the flexibility of the evolutionary coding approach.
Practical impact scales non-linearly. A 16×16 matrix multiplication built by applying the 4×4 algorithm recursively (treating the matrix as a 4×4 grid of 4×4 blocks) requires $48^2 = 2304$ scalar multiplications rather than $49^2 = 2401$, and the advantage compounds at each further level of recursion. For the very large matrix multiplications that appear in neural network training and inference, even small improvements to the fundamental algorithm can accumulate into meaningful reductions in compute cost.
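The compounding can be tabulated directly, counting levels of recursion over the 4×4 base case:

```python
def recursive_mults(base_mults: int, levels: int) -> int:
    """Scalar multiplications when a 4x4 algorithm with `base_mults`
    block products is applied recursively `levels` times
    (matrix size 4**levels)."""
    return base_mults ** levels

for levels in range(1, 4):
    n = 4 ** levels
    old, new = recursive_mults(49, levels), recursive_mults(48, levels)
    print(f"{n}x{n}: {old} -> {new} ({100 * (old - new) / old:.1f}% fewer)")
# 4x4: 49 -> 48 (2.0% fewer)
# 16x16: 2401 -> 2304 (4.0% fewer)
# 64x64: 117649 -> 110592 (6.0% fewer)
```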
AlphaEvolve also found improvements for 14 other matrix sizes beyond 4×4, suggesting that 56 years of human work had left a number of optimization opportunities unexplored.
Google's data centers run a task scheduling system called Borg, which allocates computing jobs across large clusters of machines. Scheduling is a combinatorially hard problem: given thousands of jobs with different resource requirements and priorities, find an assignment that maximizes utilization and minimizes latency.
AlphaEvolve developed a new scheduling heuristic for Borg that consistently recovers approximately 0.7% of Google's worldwide compute resources by finding better task placements that human engineers and prior automated approaches had missed. The heuristic outperformed solutions from deep reinforcement learning, which was the previous state of the art for this problem.
Recovering 0.7% of Google's global compute amounts to a large absolute saving. Google operates millions of servers across dozens of data centers, and that percentage corresponds to computational capacity worth tens of millions of dollars per year. The heuristic code is human-readable and was deployed to production after review by Google engineers, which is notable: the system produced code that engineers could understand, audit, and maintain rather than an opaque function they had to take on trust.
AlphaEvolve proposed a simplification to the Verilog code for a matrix multiplication arithmetic circuit in Google's Tensor Processing Unit (TPU) hardware. Verilog is the hardware description language used to specify digital circuits before they are fabricated.
The optimization involved removing unnecessary bits from a highly optimized arithmetic circuit that performs matrix multiplications on the TPU. AlphaEvolve identified that certain bit positions in the circuit were redundant given the constraints of the inputs the circuit would actually receive. Removing those bits produces a functionally equivalent but simpler circuit.
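The verification idea can be illustrated with a toy stand-in: show that narrowing a datapath whose upper bits are never exercised by valid inputs leaves the circuit functionally unchanged. The "circuit" below is hypothetical, not the TPU's:

```python
WIDTH = 8           # original datapath width
VALID_BITS = 6      # inputs are known never to use the top two bits

def original_circuit(a: int, b: int) -> int:
    # Full-width multiplier, masked to the original output width.
    return (a * b) & ((1 << 2 * WIDTH) - 1)

def simplified_circuit(a: int, b: int) -> int:
    # Narrower multiplier: just enough bits for all valid inputs.
    return (a * b) & ((1 << 2 * VALID_BITS) - 1)

# Exhaustive check over the valid input space (feasible at toy widths;
# real hardware verification uses formal methods instead of enumeration).
assert all(
    original_circuit(a, b) == simplified_circuit(a, b)
    for a in range(1 << VALID_BITS)
    for b in range(1 << VALID_BITS)
)
```

The key point the toy captures is that equivalence holds only under the input constraint; over the full 8-bit range the two circuits would differ.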
The proposal passed robust hardware verification, confirming that the modified circuit behaves identically to the original for all valid inputs. Google's TPU team reviewed and accepted the change, which was integrated into the design of an upcoming TPU generation. Engineers noted that the simplification was the kind of finding that synthesis tools would eventually catch during the chip design flow, but AlphaEvolve identified it earlier, potentially shortening the design cycle.
A key factor in adoption was that AlphaEvolve communicated its findings in Verilog rather than in some intermediate representation. Hardware engineers could read the proposed change directly in the language they use daily, which reduced the verification burden and built trust in the result.
AlphaEvolve improved two separate kernel operations in the computational stack used to train Gemini models.
The first improvement addressed a matrix multiplication kernel that handles tiling: the problem of dividing a large matrix multiplication into subproblems that fit in fast on-chip memory. AlphaEvolve found a smarter tiling strategy that sped up this kernel by 23%. Because this kernel runs throughout Gemini training, the improvement reduced total training time for Gemini by approximately 1%. At the scale of multi-month training runs on tens of thousands of accelerators, a 1% reduction in training time is a substantial saving.
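Tiling itself is easy to sketch: the tile size controls how much of each operand the inner loops touch at once, which in a real kernel determines what fits in fast on-chip memory. A pure-Python illustration of the concept (not the Gemini kernel):

```python
def tiled_matmul(A, B, n, tile):
    """Multiply two n x n matrices (lists of lists) one tile-sized
    subproblem at a time. Illustrates the loop structure a tiling
    strategy chooses; performance only matters on real accelerators."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # One tile x tile x tile subproblem.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Choosing `tile` (and the loop order around it) is exactly the kind of discrete design space an evolutionary search can explore.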
The second improvement targeted the FlashAttention kernel, which computes scaled dot-product attention in transformer architectures more efficiently than naive attention by fusing operations to reduce memory bandwidth. AlphaEvolve modified the compiler-generated XLA (Accelerated Linear Algebra) intermediate representation for the attention kernel, achieving up to a 32.5% speedup on that specific operation. This optimization was reported separately from the Gemini training improvement and reflects a different part of the LLM inference and training stack.
Both results were verified by running the modified kernels against the original and confirming correctness and performance on representative workloads.
Beyond the high-profile matrix multiplication result, AlphaEvolve was evaluated on a diverse set of over 50 open mathematical problems drawn from analysis, combinatorics, and geometry. It rediscovered the best known constructions in the majority of cases and improved on the previously best known solutions in roughly 20% of the problems.
One notable result was the kissing number problem. The kissing number in dimension $d$ is the maximum number of non-overlapping unit spheres that can simultaneously touch a central unit sphere. For most dimensions above four, only upper and lower bounds are known, not exact values. AlphaEvolve improved the lower bound for the kissing number in 11 dimensions from 592 to 593, advancing the state of knowledge in a problem that has resisted progress for decades.
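Results of this kind are mechanically checkable: unit spheres touch a central unit sphere when their centers lie at distance 2 from the origin, and they avoid overlap when centers are pairwise at distance at least 2. A sketch of such a checker, shown on the trivial dimension-1 configuration (kissing number 2):

```python
import math

def is_valid_kissing_config(centers, tol=1e-9):
    """Check that each center is at distance 2 from the origin and that
    centers are pairwise at distance >= 2 (non-overlapping unit spheres)."""
    dim = len(centers[0])
    for c in centers:
        if abs(math.dist(c, [0.0] * dim) - 2.0) > tol:
            return False
    return all(
        math.dist(centers[i], centers[j]) >= 2.0 - tol
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )

print(is_valid_kissing_config([[2.0], [-2.0]]))   # True
```

Verifying a claimed 593-sphere configuration in 11 dimensions is the same check at scale, which is why lower-bound results like AlphaEvolve's are straightforward to confirm independently.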
For context on what AlphaEvolve competes against: the kissing number record was later surpassed by the mathematician Mikhail Ganzhinov, who found new records in dimensions 10, 11, and 14 using classical methods; in dimension 11, his bound exceeded AlphaEvolve's 593. This back-and-forth illustrates a genuine dynamic in the field: AI-driven search and human mathematical insight are currently competitive with each other, and advances on one side motivate responses from the other.
Other problems in the benchmark included variants of the Fourier analysis problem, the minimum overlap problem, and various extremal combinatorics questions. The team also made available a Google Colab notebook demonstrating the mathematical results for independent verification.
AlphaEvolve operates as an internal tool at Google, deployed across several teams and problem domains.
| Domain | Application | Result |
|---|---|---|
| Data centers | Borg task scheduling heuristic | 0.7% of global compute recovered |
| Hardware | TPU Verilog circuit simplification | Integrated into upcoming TPU generation |
| AI training | Gemini matrix multiplication kernel | 23% kernel speedup, 1% training time reduction |
| AI training | FlashAttention kernel (XLA) | Up to 32.5% speedup |
| Mathematics | Matrix multiplication (4×4 complex) | 48 multiplications, beating 56-year record |
| Mathematics | Open problems benchmark (50+ problems) | 20% of problems improved beyond prior best |
The scheduling and hardware results are production deployments, meaning AlphaEvolve's output is running in Google's live infrastructure. The AI training optimizations were applied to actual Gemini training runs. This distinguishes AlphaEvolve from many research systems that demonstrate capability in controlled benchmarks without deployment.
All three production results (scheduling, TPU circuit, training kernels) share a common property: the outputs are interpretable. The scheduling heuristic is readable code that engineers reviewed before deployment. The Verilog simplification is expressed in standard hardware description language. The kernel improvements are modifications to existing compiler output that engineers could inspect. The team has emphasized this interpretability as a design goal, arguing that autonomous systems producing unreadable black-box changes would face higher adoption barriers in engineering contexts.
As of mid-2025, AlphaEvolve has not been released publicly. Google DeepMind announced an early access program for selected academic researchers, with an application form for those wishing to participate. The accompanying paper (arXiv:2506.13131, submitted June 16, 2025) provides detailed architectural descriptions but does not include code or model weights.
Several open-source implementations appeared in the research community following the announcement. OpenEvolve, available on GitHub and PyPI, is a community implementation of the AlphaEvolve approach that supports distributed algorithms, multi-language programs, and GPU kernel optimization. CodeEvolve is a separate open-source project that implements the high-level principles of LLM-driven evolutionary search in a reproducible framework. These implementations allow researchers to experiment with the methodology without waiting for official access.
The closed nature of AlphaEvolve has drawn some criticism. Reviewers noted that prior DeepMind releases like AlphaFold 2 shipped without training scripts, and AlphaGeometry contained bugs that the community had to patch. Whether AlphaEvolve follows a similar trajectory remains to be seen.
AlphaEvolve has several meaningful constraints that shape where it can and cannot be applied.
The most fundamental limitation is the requirement for automatic evaluation. Every problem must have an evaluation function that scores candidate programs without human intervention. This rules out large classes of potentially valuable problems: drug discovery requires wet-lab validation, materials science needs physical synthesis and testing, and many engineering design problems require human judgment about trade-offs that are difficult to encode in a scalar metric. DeepMind has acknowledged this constraint and described it as a core architectural dependency rather than a temporary limitation.
AlphaEvolve also provides limited theoretical insight. The system finds algorithms that work, but it does not explain why they work. The matrix multiplication algorithm using 48 multiplications is verified correct, but no human has yet constructed a theoretical proof of why such an algorithm exists or what mathematical structure it exploits. For pure mathematics, where the goal is understanding rather than just finding solutions, this is a significant gap.
The system requires substantial compute to run effectively. While it uses thousands of evaluations rather than millions, each evaluation involves LLM calls and program execution, and running the system for days across a complex problem accumulates real compute costs. This resource requirement limits accessibility for smaller research groups even if the methodology were openly available.
Finally, AlphaEvolve inherits the limitations of the LLMs it uses. Gemini models can generate syntactically incorrect code or logically flawed algorithms, and the evolutionary framework depends on the evaluator catching those errors. For domains where subtle bugs are hard to catch automatically, the system may produce solutions that appear to pass evaluation but fail in edge cases.