RFM-1 (Robotics Foundation Model)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,580 words
RFM-1 (Robotics Foundation Model 1) is an 8 billion parameter multimodal transformer for robotic manipulation announced by Covariant on March 11, 2024 at the MODEX 2024 trade show in Atlanta.[^1][^2] The model is configured as an any-to-any sequence model that ingests text, images, video, robot actions, and physical sensor measurements such as joint angles, gripper state, and force readings, and autoregressively predicts the next token in any of those modalities.[^1][^3] Covariant trained RFM-1 on tens of millions of warehouse picking trajectories collected from its globally deployed Covariant Brain robot fleet, supplemented with internet-scale text, image, and video data.[^1][^4] In late August 2024, Amazon entered into a non-exclusive license for Covariant's foundation models and hired the company's three co-founders (Pieter Abbeel, Peter Chen, and Rocky Duan) along with roughly a quarter of Covariant's staff, in a transaction widely described as a "reverse acquihire."[^5][^6] Subsequent development of the model continued at both Amazon and a smaller remaining Covariant under new leadership.[^7][^8]
| Field | Value |
|---|---|
| Developer | Covariant |
| Announced | March 11, 2024[^1] |
| Venue | MODEX 2024 trade show, Atlanta, Georgia[^2] |
| Parameters | 8 billion[^1][^3] |
| Architecture | Multimodal any-to-any autoregressive transformer[^1][^3] |
| Modalities | Text, images, video, robot actions, joint angles, gripper state, force, suction[^1][^3] |
| Primary training data | Covariant Brain warehouse trajectories plus internet text, image, and video[^1][^4] |
| Successor / status | Continued development under Amazon after the August 2024 license deal[^5][^6] |
| Update | High-fidelity scene prediction release, April 25, 2024[^9] |
Covariant was founded in 2017 under the original name Embodied Intelligence by four researchers from the University of California, Berkeley: Pieter Abbeel, Peter Chen, Rocky Duan, and Tianhao Zhang.[^10] Abbeel was at the time (and remains) a professor of electrical engineering and computer sciences at Berkeley, director of the Berkeley Robot Learning Lab, and co-director of the Berkeley AI Research lab; he received his PhD from Stanford University in 2008 under his doctoral advisor, Andrew Ng.[^11] Chen, Duan, and Zhang were former Berkeley graduate students of Abbeel.[^10][^11] Abbeel, Chen, and Duan had also worked together at OpenAI in 2016 as part of its founding research staff, with research focused on reinforcement learning, imitation learning, and meta-learning; Zhang previously did research at Microsoft.[^10][^11]
The founders' stated objective was to combine techniques from imitation learning and reinforcement learning so that conventional robot arms could autonomously handle a wider range of manipulation tasks than was possible with rule-based programming.[^10] Their early bet was that a single AI system, which the company would later call the Covariant Brain, could be deployed across many different robot hardware configurations and many different warehouse SKU sets without bespoke programming for each new site. The company chose to focus initially on the e-commerce piece-picking problem, a domain that combines high SKU diversity, deformable and irregular objects, and a clear economic incentive (labor shortages in fulfillment centers), all of which favored a learning-based approach over hand-engineered grasping heuristics.[^10][^12]
In its early years the company stayed in stealth, emerging in January 2020 with the commercial launch of the Covariant Brain, an AI platform initially focused on warehouse pick-and-place.[^10] In a 2020 evaluation organized by industrial robotics vendor ABB across 20 piece-picking AI startups facing 26 real-world tasks (half of which were undisclosed in advance), Covariant was the only entrant to clear every test, an outcome that led to a commercial partnership announced in February 2020.[^12] The first integrated ABB-Covariant deployment was at Active Ants, an e-commerce fulfillment provider in the Netherlands.[^12]
Between 2017 and 2023 Covariant raised approximately $222 million across a seed round, three priced rounds, and a later extension. According to public reports, the financing comprised a $7 million seed led by Amplify Partners; a $20 million Series A; a $40 million Series B led by Index Ventures in May 2020; an $80 million Series C led by Index Ventures with Amplify Partners and Radical Ventures in July 2021; and a $75 million extension in April 2023 co-led by Radical Ventures and Index Ventures with participation from the Canada Pension Plan Investment Board, Amplify Partners, Gates Frontier Holdings, AIX Ventures, and Northgate Capital.[^13][^14] Reporting on a 2025 whistleblower disclosure later valued the company at $625 million as of the April 2023 round.[^15]
By the time RFM-1 was announced, Covariant Brain robots were operating in production warehouses across 15 countries at dozens of customers, including KNAPP installations for Würth and other large logistics operators.[^4][^16] This installed base was a deliberate strategic choice: as Abbeel framed it in 2024, "by building a valuable picking robot that's deployed across 15 countries with dozens of customers, we essentially have a data collection machine."[^4]
By 2023, large language models such as GPT and multimodal systems such as Google DeepMind's RT-2 and PaLM-E had demonstrated that scaling transformer architectures with tokenized multimodal data could yield strong generalization in language and vision.[^17] Robotics, by contrast, had remained data-bound. Whereas a language model can ingest hundreds of billions of tokens scraped from the public web, a robot policy needs trajectories that pair observations with actions in a specific embodiment, and these have historically been collected one task at a time via teleoperation or scripted policies. The Open X-Embodiment collaboration, released in late 2023, was the largest cross-lab response to this data bottleneck: it assembled roughly one million teleoperated trajectories from 22 robot embodiments contributed by 34 research labs, and was used to train the RT-X family of policies as a proof of concept that a single transformer could absorb data from many heterogeneous robots.[^17]
Covariant's pitch for RFM-1 was that its commercial fleet generated trajectories at a substantially higher rate than open research collaborations could match, allowing scaling experiments in the foundation model regime to be done with proprietary data. The company described this advantage in terms borrowed from autonomous vehicles: just as a deployed self-driving fleet generates more miles than any test program, a deployed picking fleet generates more pick attempts (and more interesting failures) than any teleoperation lab.[^4] Several recurring themes drove the company toward a foundation-model framing rather than a policy-per-task framing: (1) the long-tail distribution of warehouse SKUs requires models that generalize beyond the training set; (2) language-conditioned interfaces allow operators to redirect robot behavior without writing code; and (3) a learned world model of robot-object interactions enables planning at inference time rather than requiring exhaustive pre-training on every task.[^1][^4]
RFM-1 was unveiled at the MODEX 2024 supply chain trade event in Atlanta, where it ran live in Covariant's booth from March 11 through March 14, 2024.[^1][^2] An accompanying technical write-up and product page on covariant.ai described the model's architecture, training data composition, and demonstrated capabilities, and was authored by an in-house team that included Andrew Sohn, Anusha Nagabandi, Carlos Florensa, Daniel Adelberg, Di Wu, Hassan Farooq, Ignasi Clavera, Jeremy Welborn, Juyue Chen, Nikhil Mishra, Peter Chen, Peter Qian, Pieter Abbeel, Rocky Duan, Varun Vijay, and Yang Liu.[^16] On April 25, 2024, Covariant published an update describing a scaling pass that increased the resolution of the model's generated video frames by roughly four times (a "400% higher resolution" claim attributed to scaling up compute, data, and model size), which the company said reduced visual hallucinations during world-model rollouts.[^9]
RFM-1 is presented as a single decoder-style autoregressive transformer trained with a next-token prediction objective over a shared discrete vocabulary that spans multiple modalities.[^1][^3] The architectural choice is broadly the same one that underpins large language models: condition on a prefix, predict the next token, and unify diverse input and output types by mapping them all into a common token stream. Covariant has not released a paper specifying the exact tokenizer, context length, or layer counts, but the public documentation makes clear that the design intent is "any-to-any," meaning that the same model can be conditioned on any subset of supported modalities and made to produce any other subset.[^1][^3]
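The next-token objective described above can be written in the standard autoregressive form used for language models. The notation here (a token sequence $x_{1:T}$, a shared vocabulary $\mathcal{V}$, and parameters $\theta$) is introduced only for illustration and does not come from Covariant's materials:

$$
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad x_t \in \mathcal{V}
$$

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$

Under this formulation, each $x_t$ may be a text, image, video, action, or sensor token, so the same loss covers every modality in the stream.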
Each input modality is tokenized into the same sequence representation:
Because all of these channels share a common token space, the same trained model can be conditioned on any subset of modalities and made to produce any other subset. Covariant documents three concrete inference patterns enabled by this design: (1) generating control actions from images plus a natural-language instruction; (2) producing video predictions of how a scene will evolve under a given action sequence, which the company calls a learned world model; and (3) cross-modal grounding tasks such as answering text questions about images or describing the contents of a robotic workspace.[^1][^3][^9]
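One way to picture a shared token space is to give each modality its own disjoint range of IDs inside a single vocabulary. The sketch below is illustrative only: Covariant has not published its tokenizer, and the modality names, vocabulary sizes, and helper functions (`build_offsets`, `to_shared`, `from_shared`, `build_prefix`) are all hypothetical.

```python
# Hypothetical modality layout: names and vocabulary sizes are illustrative
# assumptions, not values published by Covariant.
MODALITY_VOCAB_SIZES = {
    "text": 32000,
    "image": 8192,
    "video": 8192,
    "action": 256,
    "sensor": 256,
}


def build_offsets(sizes):
    """Assign each modality a disjoint ID range inside one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, SHARED_VOCAB_SIZE = build_offsets(MODALITY_VOCAB_SIZES)


def to_shared(modality, local_ids):
    """Map modality-local token IDs into the shared token space."""
    size = MODALITY_VOCAB_SIZES[modality]
    if any(not 0 <= i < size for i in local_ids):
        raise ValueError(f"token id out of range for {modality}")
    return [OFFSETS[modality] + i for i in local_ids]


def from_shared(token_id):
    """Recover (modality, local_id) from a shared-vocabulary token ID."""
    for name, start in OFFSETS.items():
        if start <= token_id < start + MODALITY_VOCAB_SIZES[name]:
            return name, token_id - start
    raise ValueError("token id outside shared vocabulary")


def build_prefix(segments):
    """Concatenate (modality, local_ids) segments into one conditioning prefix."""
    prefix = []
    for modality, local_ids in segments:
        prefix.extend(to_shared(modality, local_ids))
    return prefix


# Pattern (1) from the text: condition on image tokens plus an instruction.
# A trained model would then continue this prefix with action-range tokens.
prefix = build_prefix([("image", [5, 17, 4096]), ("text", [101, 2057])])
```

With this layout, the three inference patterns differ only in which segments go into the prefix and which ID range the generated tokens are expected to fall in; the model and vocabulary stay the same.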
The any-to-any framing differentiates RFM-1 from earlier robot policies, which typically had a fixed input shape (RGB image plus proprioception) and a fixed output (an action vector). It is closer in spirit to the multimodal sequence models for vision and language described in the multimodal AI research literature, with the added twist that two of the modalities (robot actions and proprioceptive sensor readings) are tightly coupled to the embodiment that produced them.[^1][^16]
For control, RFM-1 takes an image of the workspace, the robot's current proprioceptive state, and a text instruction, and autoregressively predicts a sequence of action tokens that decode to joint and gripper commands.[^1][^3] The model can be steered by natural-language prompts of the kind Covariant documents, such as instructions to pick a particular item from a tote or to sort items into the correct bins.[^1][^16] In the company's MODEX 2024 demonstrations, robots executed multi-step picking tasks while a human operator typed plain-English directions instead of writing a conventional motion plan.[^1][^2] Because the action stream is tokenized, the model can in principle generate at variable horizons and emit either a single command or a longer plan. The public materials do not give explicit numbers for control frequency, but the underlying Covariant Brain stack runs on industrial robotic arms whose nominal control rates are in the tens of hertz range typical of warehouse manipulation.[^1][^16]
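To make the idea of action tokens that "decode to joint and gripper commands" concrete, the following sketch uses uniform binning, a common way to discretize continuous commands. This is an assumption for illustration: Covariant has not disclosed its action tokenizer, and the bin count, limits, and function names (`encode_action`, `decode_action`) are hypothetical.

```python
NUM_BINS = 256  # assumed number of bins per action dimension


def encode_action(values, low, high, num_bins=NUM_BINS):
    """Quantize each continuous command into an integer bin index."""
    tokens = []
    for v, lo, hi in zip(values, low, high):
        clipped = min(max(v, lo), hi)        # keep the command inside its limits
        frac = (clipped - lo) / (hi - lo)    # normalize to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def decode_action(tokens, low, high, num_bins=NUM_BINS):
    """Map each bin index back to the centre of its bin."""
    return [
        lo + (t + 0.5) * (hi - lo) / num_bins
        for t, lo, hi in zip(tokens, low, high)
    ]


# Hypothetical 3-dimensional command: two joint targets and a gripper opening.
LOW = [-1.0, -1.0, 0.0]
HIGH = [1.0, 1.0, 1.0]
command = [0.0, -1.0, 1.0]

tokens = encode_action(command, LOW, HIGH)
recovered = decode_action(tokens, LOW, HIGH)
```

The round trip is lossy by design: each recovered value lies within half a bin width of the clipped input, so a finer bin count trades a larger action vocabulary for lower quantization error.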
A distinctive capability is video prediction. Given an initial image and a candidate action sequence, RFM-1 generates future video tokens that depict how the scene is expected to evolve under those actions.[^9][^1] Covariant frames this as a "world model that understands physics," in contrast to traditional analytic physics simulators that rely on hand-engineered contact, friction, and finite-element models.[^9][^4] In a 2024 interview with IEEE Spectrum, Abbeel argued that the world model is "effectively a learned physics engine" induced from real warehouse interactions, and noted that it handles difficult-to-simulate materials (such as deformable or "floppy" packaging) without explicit hand-tuning because the training data already contains those distributions.[^4]
This video prediction capability can also be used at inference time for action selection: the agent rolls out alternative candidate actions, compares the predicted outcomes, and chooses the trajectory that best satisfies the task specification, an approach broadly consistent with model-based planning in reinforcement learning.[^9][^4] The April 2024 update specifically focused on improving the fidelity of these rollouts by raising the spatial resolution of generated frames.[^9]
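The roll-out-and-compare loop described above can be sketched generically. The code below is not Covariant's implementation: `predict` stands in for a learned world model, `score` for a task-specific evaluation of the predicted outcome, and the toy versions of both are hypothetical placeholders chosen so that the example runs on its own.

```python
def plan_with_world_model(state, candidates, predict, score):
    """Return the candidate action sequence whose predicted outcome scores best."""
    best_plan, best_score = None, float("-inf")
    for plan in candidates:
        predicted = state
        for action in plan:
            predicted = predict(predicted, action)  # one imagined step
        plan_score = score(predicted)
        if plan_score > best_score:
            best_plan, best_score = plan, plan_score
    return best_plan, best_score


# Toy stand-ins (hypothetical): the state is a 2-D gripper position and the
# "world model" simply adds each action's displacement to it.
def toy_predict(state, action):
    return (state[0] + action[0], state[1] + action[1])


GOAL = (3, 2)


def toy_score(state):
    # Negative Manhattan distance to the goal: higher is better.
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))


candidates = [
    [(1, 0), (1, 0)],          # predicted end (2, 0), score -3
    [(1, 1), (2, 1)],          # predicted end (3, 2), score 0
    [(0, 1), (0, 1), (1, 0)],  # predicted end (1, 2), score -2
]

best_plan, best_score = plan_with_world_model(
    (0, 0), candidates, toy_predict, toy_score
)
```

In a system like the one the article describes, `predict` would generate future video tokens and `score` would judge them against the instruction, but the selection logic would keep the same shape: imagine each candidate, rank the outcomes, execute the best.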
Because the same model has been exposed to large quantities of text and image data alongside warehouse trajectories, Covariant describes RFM-1 as capable of "reasoning" about a workspace in natural language: explaining a failed pick, asking a human operator for help, accepting a strategy suggestion in plain English, and applying that suggestion on subsequent attempts.[^1][^16] These language-conditioned behaviors mirror the goals of contemporaneous vision-language-action models such as RT-2 and OpenVLA, with the difference that RFM-1's language grounding is trained jointly with the warehouse-scale proprioceptive and action data.[^1][^16]
Covariant has not published a paper enumerating the precise dataset mix, but its public descriptions identify three components:
Covariant has consistently emphasized that the warehouse component is the differentiating asset relative to research-grade datasets, citing growth on the order of "a million trajectories every few weeks" from its operating fleet during the period leading up to the 2024 announcement.[^4]
Public demonstrations and Covariant's own communications describe several capability areas. Specific quantitative benchmarks have not been released, so the following list captures qualitative claims observed across the company's product page, the MODEX 2024 demonstrations, the high-fidelity scene prediction update, and third-party coverage in IEEE Spectrum and trade publications:
These demonstrations were shown in person at MODEX 2024 and in subsequent video releases on Covariant's site through 2024.[^2][^9] Trade publications such as Robotics 24/7 and Supply Chain 24/7 covered live demos at MODEX, and IEEE Spectrum published an in-depth interview with Abbeel that included a discussion of the world model and floppy-object simulation.[^3][^4]
RFM-1 was announced into a rapidly developing landscape of foundation model approaches to robotics. The following table compares the publicly disclosed parameters of several systems frequently discussed alongside RFM-1.
| Model | Developer | Announced | Parameters | Primary training data | Output |
|---|---|---|---|---|---|
| RFM-1 | Covariant | March 2024[^1] | 8 billion[^1] | Covariant Brain warehouse trajectories plus internet text, image, video[^1] | Multimodal tokens including actions and video[^1] |
| RT-2 | Google DeepMind | July 2023[^17] | Up to 55 billion (PaLI-X variant) | RT-1 robot data plus web vision-language; the later RT-2-X variant used Open X-Embodiment[^17] | Action tokens via VLA |
| OpenVLA | Stanford / UC Berkeley / Toyota Research Institute consortium | June 2024 | 7 billion[^18] | Open X-Embodiment subset (~970k trajectories)[^18] | Action tokens via VLA |
| π0 (pi-zero) | Physical Intelligence | October 2024[^19] | 3 billion (full); 470M (small variant)[^19] | Open X-Embodiment plus proprietary 8-platform dexterous data[^19] | Continuous actions via flow matching[^19] |
| Helix | Figure AI | February 2025[^20] | 7B (System 2) + 80M (System 1)[^20] | ~500 hours teleoperation on Figure humanoids[^20] | High-rate continuous control (200 Hz) for full humanoid upper body[^20] |
Several differences are worth noting. RT-2 and OpenVLA are explicitly framed as vision-language-action models layered on top of an existing vision-language model backbone. OpenVLA relies primarily on the public Open X-Embodiment dataset for robot trajectories, whereas RT-2, which predates that dataset, was trained on Google's earlier RT-1 robot data.[^17][^18] π0 starts from a pretrained vision-language model and uses flow matching, a continuous-action variant of diffusion, rather than discrete action tokens.[^19] Helix separates a 7B vision-language reasoning module ("System 2") that runs at 7 to 9 Hz from an 80M low-level controller ("System 1") that runs at 200 Hz to drive a humanoid's joints.[^20] RFM-1, by contrast, is a single autoregressive transformer in which both control actions and video frames are emitted from the same token stream, and is trained on a proprietary corpus drawn from an operating commercial fleet rather than from a research dataset.[^1][^4]
For reference on dataset scale, Open X-Embodiment in 2023 contained roughly one million trajectories across 22 robot embodiments; Covariant reported that its fleet had collected tens of millions of warehouse pick trajectories by early 2024 and was adding new data on the order of a million every few weeks.[^4][^17]
On the evening of Friday, August 30, 2024, Amazon disclosed that it had entered a non-exclusive license to Covariant's foundation models, including RFM-1, and had hired Covariant's three co-founders, Pieter Abbeel, Peter Chen, and Rocky Duan, along with about 25% of the startup's employees.[^5][^6] The structure followed a pattern that had appeared in several 2024 deals (Microsoft's relationship with Inflection AI in March 2024 at $650 million, and Amazon's separate June 2024 hiring of Adept AI personnel at $330 million in licensing fees) in which a hyperscaler obtains a license and key personnel without acquiring the startup outright, often described in press coverage as a "reverse acquihire."[^21] Other transactions of the period (Google's $2.7 billion licensing deal for Character.AI in August 2024 being the most prominent) further popularized the structure.[^21]
The motivation reported in coverage of the deal was twofold. First, Amazon already operates a very large warehouse robotics organization through Amazon Robotics (previously Kiva Systems), and adding a robotics foundation model team allowed it to integrate generative-AI manipulation capabilities into its fulfillment centers. Second, by structuring the transaction as a license and acquihire rather than an outright acquisition, Amazon was widely understood to be seeking to avoid the heightened regulatory scrutiny that had attended other 2024 hyperscaler acquisitions.[^5][^6][^21]
The financial terms of the Covariant transaction were not disclosed at announcement.[^5][^6] A whistleblower complaint filed in 2025, summarized in subsequent reporting, characterized the structure as a $380 million upfront license payment plus a $20 million payment due one year after closing, against Covariant's April 2023 valuation of approximately $625 million.[^15]
Covariant continued to operate after the transaction. Ted Stinson, formerly chief operating officer, became chief executive officer, and co-founder Tianhao Zhang remained with the company; the surviving entity stated that it would focus on Covariant Brain deployments in industries such as apparel, health and beauty, grocery, and pharmaceuticals.[^7][^22] Public-facing activity from Covariant was nonetheless described in 2025 reporting as much reduced relative to pre-deal levels.[^15]
Abbeel was later appointed, in December 2025, to lead Amazon's large language model efforts within the company's AGI organization while continuing to work on robotics, an indication that the personnel transfer had been positioned as broader than robotics alone.[^11] Chen and Duan likewise took senior technical roles inside Amazon's robotics and AI organization, where they were tasked with adapting RFM-1's underlying methods to Amazon's larger fleet of warehouse robots; specific product timelines for those internal efforts had not been publicly disclosed as of mid-2026.[^11][^8]
The deal also drew interest from regulators on both sides of the Atlantic. The United States Federal Trade Commission opened a study of reverse-acquihire patterns in AI in late 2024, and European Commission officials commented on the structure as a way for big technology companies to avoid traditional merger review.[^21]
RFM-1 was among the first multibillion-parameter foundation models explicitly trained for general robot manipulation using a commercial robot fleet's operational data, rather than research datasets or teleoperation studios.[^1][^4] In the 2024 to 2025 wave of generalist robotics models, it is typically grouped with π0 from Physical Intelligence and Helix from Figure AI as examples of commercially backed systems with stated ambitions to scale across embodiments and tasks.[^19][^20] It also informed the wider industry argument that proprietary deployment data could be a structural advantage over research consortia such as Open X-Embodiment when training robot foundation models.[^4][^17]
The Amazon transaction made RFM-1 part of one of the most consequential 2024 reverse-acquihire deals in robotics and contributed to the regulatory discussion in the United States and the European Union about how non-acquisition licensing arrangements should be treated for antitrust purposes.[^5][^6][^21]
A subtler significance is the way RFM-1 framed the role of world models in robot control. Rather than treating physical simulation as an offline engineering tool used to generate synthetic training data, Covariant integrated video prediction directly into the policy network so that the model can imagine alternative outcomes at inference time. This integration mirrors developments in language model planning (where chain-of-thought reasoning sits inside the same model that produces the final answer) and points toward a more unified view of perception, action, and prediction in embodied AI.[^9][^4] Several subsequent systems, including π0 and Helix, adopt different architectural choices but share the high-level commitment to a single large model that owns perception, language, and control together.[^19][^20]
From a commercial perspective, RFM-1 also represented an early test of whether warehouse-scale, deployment-derived data could rival cross-lab research datasets for training robot foundation models. Although Covariant has not released benchmark numbers that would allow a clean comparison, the publicly observable result was that an 8 billion parameter model trained primarily on its proprietary fleet data was considered valuable enough by Amazon to support the nine-figure license described in subsequent reporting, alongside the hiring of the founding team.[^5][^15]
Several limitations and open questions have been noted about RFM-1, either by Covariant itself or in third-party reporting.