# RFM-1 (Robotics Foundation Model)

> Source: https://aiwiki.ai/wiki/rfm_1
> Updated: 2026-06-07
> Categories: AI Models, Embodied AI, Robotics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**RFM-1** (Robotics Foundation Model 1) is an 8 billion parameter multimodal transformer for robotic manipulation announced by [Covariant](/wiki/covariant) on March 11, 2024 at the MODEX 2024 trade show in Atlanta.[^1][^2] The model is configured as an any-to-any sequence model that ingests text, images, video, robot actions, and physical sensor measurements such as joint angles, gripper state, and force readings, and autoregressively predicts the next token in any of those modalities.[^1][^3] Covariant trained RFM-1 on tens of millions of warehouse picking trajectories collected from its globally deployed Covariant Brain robot fleet, supplemented with internet-scale text, image, and video data.[^1][^4] In late August 2024, [Amazon](/wiki/amazon) entered into a non-exclusive license for Covariant's foundation models and hired three of the company's four co-founders ([Pieter Abbeel](/wiki/pieter_abbeel), Peter Chen, and Rocky Duan) along with roughly a quarter of Covariant's staff, in a transaction widely described as a "reverse acquihire."[^5][^6] Subsequent development of the model continued at both Amazon and a smaller remaining Covariant under new leadership.[^7][^8]

## Infobox

| Field | Value |
|---|---|
| Developer | [Covariant](/wiki/covariant) |
| Announced | March 11, 2024[^1] |
| Venue | MODEX 2024 trade show, Atlanta, Georgia[^2] |
| Parameters | 8 billion[^1][^3] |
| Architecture | Multimodal any-to-any autoregressive transformer[^1][^3] |
| Modalities | Text, images, video, robot actions, joint angles, gripper state, force, suction[^1][^3] |
| Primary training data | Covariant Brain warehouse trajectories plus internet text, image, and video[^1][^4] |
| Successor / status | Development continued under [Amazon](/wiki/amazon) after the August 2024 license deal[^5][^6] |
| Update | High-fidelity scene prediction release, April 25, 2024[^9] |

## Background

### Covariant and the Berkeley AI lineage

Covariant was founded in 2017 under the original name Embodied Intelligence by four researchers from the [University of California, Berkeley](/wiki/uc_berkeley): [Pieter Abbeel](/wiki/pieter_abbeel), Peter Chen, Rocky Duan, and Tianhao Zhang.[^10] Abbeel was at the time (and remains) a professor of electrical engineering and computer sciences at Berkeley, director of the Berkeley Robot Learning Lab, and co-director of the Berkeley AI Research lab; he received his PhD from [Stanford University](/wiki/stanford_university) in 2008 with [Andrew Ng](/wiki/andrew_ng) as his doctoral advisor.[^11] Chen, Duan, and Zhang were former Berkeley graduate students of Abbeel.[^10][^11] Abbeel, Chen, and Duan had also worked together at [OpenAI](/wiki/openai) in 2016 as part of its founding research staff, with research focused on [reinforcement learning](/wiki/reinforcement_learning), [imitation learning](/wiki/imitation_learning), and meta-learning; Zhang previously did research at Microsoft.[^10][^11]

The founders' stated objective was to combine techniques from [imitation learning](/wiki/imitation_learning) and reinforcement learning so that conventional robot arms could autonomously handle a wider range of manipulation tasks than possible with rule-based programming.[^10] Their early bet was that a single AI system, what the company would later call the Covariant Brain, could be deployed across many different robot hardware configurations and many different warehouse SKU sets without bespoke programming for each new site. The company chose to focus initially on the e-commerce piece-picking problem, a domain that combines high SKU diversity, deformable and irregular objects, and a clear economic incentive (labor shortages in fulfillment centers), all of which favored a learning-based approach over hand-engineered grasping heuristics.[^10][^12]

In its early years the company stayed in stealth, emerging in January 2020 with the commercial launch of the Covariant Brain, an AI platform initially focused on warehouse pick-and-place.[^10] In a 2020 evaluation organized by industrial robotics vendor ABB across 20 piece-picking AI startups facing 26 real-world tasks (half of which were undisclosed in advance), Covariant was the only entrant to clear every test, an outcome that led to a commercial partnership announced in February 2020.[^12] The first integrated ABB-Covariant deployment was at Active Ants, an e-commerce fulfillment provider in the Netherlands.[^12]

### Funding and customers prior to RFM-1

Between 2017 and 2023 Covariant raised approximately $222 million across a seed round, three priced rounds, and an extension. According to public reports, these comprised a $7 million seed led by Amplify Partners; a $20 million Series A; a $40 million Series B led by Index Ventures in May 2020; an $80 million Series C in July 2021 led by Index Ventures, with Amplify Partners and Radical Ventures participating; and a $75 million extension in April 2023 co-led by Radical Ventures and Index Ventures with participation from the Canada Pension Plan Investment Board, Amplify Partners, Gates Frontier Holdings, AIX Ventures, and Northgate Capital.[^13][^14] Later reporting on a 2025 whistleblower disclosure put the company's valuation at $625 million as of the April 2023 round.[^15]

By the time RFM-1 was announced, Covariant Brain robots were operating in production warehouses across 15 countries at dozens of customers, including KNAPP installations for Würth and other large logistics operators.[^4][^16] This installed base was a deliberate strategic choice: as Abbeel framed it in 2024, "by building a valuable picking robot that's deployed across 15 countries with dozens of customers, we essentially have a data collection machine."[^4]

### The case for a robotics foundation model

By 2023 large language models such as [GPT](/wiki/gpt_generative_pre-trained_transformer) and multimodal systems such as Google DeepMind's [RT-2](/wiki/rt_2) and [PaLM-E](/wiki/palm-e_an_embodied_multimodal_language_model) had demonstrated that scaling [transformer](/wiki/transformer) architectures with [tokenized](/wiki/tokenization) multimodal data could yield strong generalization in language and vision.[^17] Robotics had remained data-bound. Whereas a language model can ingest hundreds of billions of tokens scraped from the public web, a robot policy needs trajectories that pair observations with actions in a specific embodiment, and these have historically been collected one task at a time via teleoperation or scripted policies. The [Open X-Embodiment](/wiki/open_x_embodiment) collaboration, released in late 2023, was the largest cross-lab response to this data bottleneck: it assembled roughly one million teleoperated trajectories from 22 robot embodiments contributed by 34 research labs, and was used to train the RT-X family of policies as a proof of concept that a single transformer could absorb data from many heterogeneous robots.[^17]

Covariant's pitch for RFM-1 was that its commercial fleet generated trajectories at a substantially higher rate than open research collaborations could match, allowing scaling experiments in the [foundation model](/wiki/foundation_model) regime to be done with proprietary data. The company described this advantage in terms borrowed from autonomous vehicles: just as a deployed self-driving fleet generates more miles than any test program, a deployed picking fleet generates more pick attempts (and more interesting failures) than any teleoperation lab.[^4] Several recurring themes drove the company toward a foundation-model framing rather than a policy-per-task framing: (1) the long-tail distribution of warehouse SKUs requires models that generalize beyond the training set; (2) language-conditioned interfaces allow operators to redirect robot behavior without writing code; and (3) a learned [world model](/wiki/world_model) of robot-object interactions enables planning at inference time rather than requiring exhaustive pre-training on every task.[^1][^4]

## Announcement and Release

RFM-1 was unveiled at the MODEX 2024 supply chain trade event in Atlanta, where it ran live in Covariant's booth from March 11 through March 14, 2024.[^1][^2] An accompanying technical write-up and product page on covariant.ai described the model's architecture, training data composition, and demonstrated capabilities, and was authored by an in-house team that included Andrew Sohn, Anusha Nagabandi, Carlos Florensa, Daniel Adelberg, Di Wu, Hassan Farooq, Ignasi Clavera, Jeremy Welborn, Juyue Chen, Nikhil Mishra, Peter Chen, Peter Qian, [Pieter Abbeel](/wiki/pieter_abbeel), Rocky Duan, Varun Vijay, and Yang Liu.[^16] On April 25, 2024, Covariant published an update describing a scaling pass that increased the resolution of the model's generated video frames by roughly four times (a "400% higher resolution" claim attributed to scaling up compute, data, and model size), which the company said reduced visual hallucinations during world-model rollouts.[^9]

## Technical Details

### Multimodal any-to-any sequence modeling

RFM-1 is presented as a single decoder-style autoregressive transformer trained with a next-token prediction objective over a shared discrete vocabulary that spans multiple modalities.[^1][^3] The architectural choice is broadly the same one that underpins large language models: condition on a prefix, predict the next token, and unify diverse input and output types by mapping them all into a common token stream. Covariant has not released a paper specifying the exact tokenizer, context length, or layer counts, but the public documentation makes clear that the design intent is "any-to-any," meaning that the same model can be conditioned on any subset of supported modalities and made to produce any other subset.[^1][^3]
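
For reference, the next-token objective described above has a standard textbook form. The notation here is chosen for illustration and is not Covariant's: $x_1, \dots, x_T$ denotes the tokens of one interleaved multimodal sequence, $x_{<t}$ the tokens before position $t$, and $\theta$ the model parameters. The model factorizes the sequence probability into per-token conditionals and is trained by minimizing the negative log-likelihood:

$$
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$

Under this formulation, "any-to-any" behavior amounts to choosing which modalities appear in the prefix $x_{<t}$ and which are left for the model to generate.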

Each input modality is tokenized into the same sequence representation:

- Natural language instructions and prompts are tokenized as text.[^1][^3]
- Single images and video frames are encoded into image tokens; multi-frame video is treated as a sequence of these tokens.[^1][^3]
- Robot proprioceptive state (joint angles), end-effector configuration (gripper state, including suction strength on the typical Covariant gripper), and force readings are discretized into numerical-sensor tokens.[^1][^3]
- Robot actions (joint and end-effector commands) are likewise tokenized so they can be predicted as outputs.[^1][^3]

Because all of these channels share a common token space, the same trained model can be conditioned on any subset of modalities and made to produce any other subset. Covariant documents three concrete inference patterns enabled by this design: (1) generating control actions from images plus a natural-language instruction; (2) producing video predictions of how a scene will evolve under a given action sequence, which the company calls a learned world model; and (3) cross-modal grounding tasks such as answering text questions about images or describing the contents of a robotic workspace.[^1][^3][^9]
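
One common way to place continuous sensor readings in a shared discrete vocabulary is uniform binning with a reserved ID range per channel. Covariant has not published its tokenizer, so the following Python sketch is only an illustration of that general technique: the `ChannelSpec` class, the bin count, the value ranges, and the ID offsets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelSpec:
    """Value range and token-ID offset for one sensor channel (all values hypothetical)."""
    low: float
    high: float
    offset: int      # first token ID reserved for this channel
    bins: int = 256  # number of discrete levels


def encode(value: float, spec: ChannelSpec) -> int:
    """Map a continuous reading to a token ID by uniform binning."""
    clipped = min(max(value, spec.low), spec.high)
    frac = (clipped - spec.low) / (spec.high - spec.low)
    index = min(int(frac * spec.bins), spec.bins - 1)
    return spec.offset + index


def decode(token: int, spec: ChannelSpec) -> float:
    """Map a token ID back to the centre of its bin."""
    index = token - spec.offset
    width = (spec.high - spec.low) / spec.bins
    return spec.low + (index + 0.5) * width


# Assumed layout: text and image tokens occupy IDs below 50_000,
# and each sensor channel gets its own non-overlapping block of 256 IDs.
JOINT_ANGLE = ChannelSpec(low=-3.1416, high=3.1416, offset=50_000)
GRIPPER_FORCE = ChannelSpec(low=0.0, high=50.0, offset=50_256)

tokens = [encode(0.5, JOINT_ANGLE), encode(12.0, GRIPPER_FORCE)]
```

Because each channel owns a disjoint ID range, a single output layer can predict text, image, sensor, or action tokens, and decoding a sensor token recovers the original reading to within half a bin width.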

The any-to-any framing differentiates RFM-1 from earlier robot policies, which typically had a fixed input shape (RGB image plus proprioception) and a fixed output (an action vector). It is closer in spirit to multimodal sequence models in vision and language such as those described in the [Multimodal AI](/wiki/multimodal_ai) research literature, with the added twist that two of the modalities (robot actions and proprioceptive sensor readings) are tightly coupled to the embodiment that produced them.[^1][^16]

### Action generation

For control, RFM-1 takes an image of the workspace, the robot's current proprioceptive state, and a text instruction, and autoregressively predicts a sequence of action tokens that decode to joint and gripper commands.[^1][^3] The model can be steered by natural-language prompts of the kind Covariant documents, such as instructions to pick a particular item from a tote or to sort items into the correct bins.[^1][^16] In the company's MODEX 2024 demonstrations, robots executed multi-step picking tasks while a human operator typed plain-English directions instead of writing a conventional motion plan.[^1][^2] Because the action stream is tokenized, the model can in principle generate at variable horizons and emit either a single command or a longer plan; the public materials do not give explicit numbers for control frequency, but the underlying Covariant Brain stack runs on industrial robotic arms whose nominal control rates are in the tens of hertz range typical of warehouse manipulation.[^1][^16]
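
The autoregressive action generation described above can be sketched as a greedy decoding loop. This is a generic illustration, not Covariant's implementation: `generate_actions`, the `END_OF_ACTION` token, and the `toy_model` stand-in for the transformer are all hypothetical, and a production system may well sample rather than decode greedily.

```python
from typing import Callable, List, Sequence

# Hypothetical stop token; the real vocabulary is not public.
END_OF_ACTION = 0


def generate_actions(
    prefix: Sequence[int],
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    max_new_tokens: int = 16,
) -> List[int]:
    """Greedy autoregressive decoding: repeatedly append the highest-scoring token."""
    sequence = list(prefix)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        logits = next_token_logits(sequence)
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == END_OF_ACTION:
            break  # the model signalled that the plan is complete
        generated.append(token)
        sequence.append(token)
    return generated


def toy_model(sequence: Sequence[int]) -> List[float]:
    """Stand-in for the transformer: prefers (last token + 1), then stops after token 4."""
    vocab_size = 6
    last = sequence[-1]
    target = END_OF_ACTION if last >= 4 else last + 1
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


# The prefix would hold image, proprioception, and instruction tokens; here it is one token.
actions = generate_actions(prefix=[1], next_token_logits=toy_model)
```

The `max_new_tokens` cap and the stop token together give the variable horizon mentioned above: the same loop can emit a single command or a longer plan.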

### Video prediction and learned world model

A distinctive capability is video prediction. Given an initial image and a candidate action sequence, RFM-1 generates future video tokens that depict how the scene is expected to evolve under that policy.[^9][^1] Covariant frames this as a "world model that understands physics," in contrast to traditional analytic physics simulators that rely on hand-engineered contact, friction, and finite-element models.[^9][^4] In a 2024 interview with IEEE Spectrum, Abbeel argued that the world model is "effectively a learned physics engine" induced from real warehouse interactions, and noted that it handles difficult-to-simulate materials (such as deformable or "floppy" packaging) without explicit hand-tuning because the training data already contains those distributions.[^4]

This video prediction head can also be used at inference time for action selection: the agent rolls out alternative candidate actions, compares the predicted outcomes, and chooses the trajectory that best satisfies the task specification, an approach broadly consistent with model-based planning in [reinforcement learning](/wiki/reinforcement_learning).[^9][^4] The April 2024 update specifically focused on improving the fidelity of these rollouts by raising the spatial resolution of generated frames.[^9]
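
The rollout-and-compare procedure described above follows the general shape of model-based planning. A minimal Python sketch is given below, under loud assumptions: `plan_by_rollout`, `toy_predict`, and `toy_score` are hypothetical names, and the "scene" is reduced to a single number where RFM-1 would predict video tokens.

```python
from typing import Callable, Sequence, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def plan_by_rollout(
    state: State,
    candidates: Sequence[Sequence[Action]],
    predict: Callable[[State, Action], State],
    score: Callable[[State], float],
) -> Tuple[Sequence[Action], float]:
    """Roll each candidate action sequence through the model and return the best one."""
    best_plan, best_score = None, float("-inf")
    for plan in candidates:
        predicted = state
        for action in plan:
            predicted = predict(predicted, action)  # one imagined step
        value = score(predicted)
        if value > best_score:
            best_plan, best_score = plan, value
    if best_plan is None:
        raise ValueError("no candidate plans supplied")
    return best_plan, best_score


# Toy stand-ins: the scene is a gripper position on a line and the goal is position 5.
def toy_predict(position: float, move: float) -> float:
    return position + move


def toy_score(position: float) -> float:
    return -abs(position - 5.0)


plan, value = plan_by_rollout(
    state=0.0,
    candidates=[[1.0, 1.0], [2.0, 3.0], [4.0, 4.0]],
    predict=toy_predict,
    score=toy_score,
)
```

In this framing the quality of the chosen action depends directly on the fidelity of `predict`, which is consistent with the April 2024 update's focus on rollout resolution.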

### Reasoning and language interface

Because the same model has been exposed to large quantities of text and image data alongside warehouse trajectories, Covariant describes RFM-1 as capable of "reasoning" about a workspace in natural language: explaining a failed pick, asking a human operator for help, accepting a strategy suggestion in plain English, and applying that suggestion on subsequent attempts.[^1][^16] These language-conditioned behaviors mirror the goals of contemporaneous [vision-language-action](/wiki/vla) models such as [RT-2](/wiki/rt_2) and [OpenVLA](/wiki/openvla), with the difference that RFM-1's language grounding is trained jointly with the warehouse-scale proprioceptive and action data.[^1][^16]

### Training data composition

Covariant has not published a paper enumerating the precise dataset mix, but its public descriptions identify three components:

1. **Internal warehouse trajectories.** Tens of millions of pick-and-place episodes collected automatically from Covariant Brain robots deployed in production warehouses around the world. These contain synchronized camera frames, proprioceptive state, suction or gripper readings, and the executed motor commands.[^1][^4][^16]
2. **Public internet text and images.** Used to give the model general language and visual grounding similar to that of a large multimodal model.[^1][^16]
3. **Public internet video.** Used to expose the model to physical dynamics outside the warehouse domain.[^1][^16]

Covariant has consistently emphasized that the warehouse component is the differentiating asset relative to research-grade datasets, citing growth on the order of "a million trajectories every few weeks" from its operating fleet during the period leading up to the 2024 announcement.[^4]

## Demonstrated Capabilities

Public demonstrations and Covariant's own communications describe several capability areas. Specific quantitative benchmarks have not been released, so the following list captures qualitative claims observed across the company's product page, the MODEX 2024 demonstrations, the high-fidelity scene prediction update, and third-party coverage in IEEE Spectrum and trade publications:

- **Novel-object grasping.** RFM-1 picks items it has not seen in training, including deformable objects, transparent containers, and packaging with irregular geometry, which Covariant frames as a key requirement for warehouse e-commerce SKUs.[^1][^16]
- **Language-conditioned task execution.** Operators can issue instructions such as picking a particular colored garment or sorting items by category, with the model translating language to action without a hand-authored task script.[^1][^16]
- **Video prediction as physical simulation.** Given an initial image and a proposed action, the model generates a multi-frame rollout of how the scene is expected to evolve, including object dynamics and contact-rich interactions.[^9][^4]
- **Embodied reasoning and human-in-the-loop interaction.** The model can be asked, in natural language, why a particular attempt failed, can request human help, and can be given a textual hint that it then applies to future attempts.[^16]
- **Cross-modal transfer.** Because all modalities share a token space, the same model can also answer text questions grounded in workspace images or describe a video of a robot action in natural language.[^1][^16]
- **Operator-in-the-loop strategy revision.** Covariant has shown clips in which an operator describes a better picking strategy in text, and the model applies it on subsequent attempts; this is positioned as analogous to giving instructions to a new human warehouse employee rather than reprogramming a fixed automation system.[^1][^16]
- **Generalization across SKUs and tote configurations.** Because the model was trained on tens of millions of pick attempts spanning many customer warehouses, Covariant argues that it handles new tote arrangements, lighting conditions, and SKU mixes without site-specific retraining, in contrast to vision systems that require per-deployment data collection.[^4][^16]

These demonstrations were shown in person at MODEX 2024 and in subsequent video releases on Covariant's site through 2024.[^2][^9] Trade publications such as Robotics 24/7 and Supply Chain 24/7 covered live demos at MODEX, and IEEE Spectrum published an in-depth interview with Abbeel that included a discussion of the world model and floppy-object simulation.[^3][^4]

## Comparison with Other Robotics Foundation Models

RFM-1 was announced into a rapidly developing landscape of [foundation model](/wiki/foundation_model) approaches to robotics. The following table compares the publicly disclosed parameters of several systems frequently discussed alongside RFM-1.

| Model | Developer | Announced | Parameters | Primary training data | Output |
|---|---|---|---|---|---|
| RFM-1 | [Covariant](/wiki/covariant) | March 2024[^1] | 8 billion[^1] | Covariant Brain warehouse trajectories plus internet text, image, video[^1] | Multimodal tokens including actions and video[^1] |
| [RT-2](/wiki/rt_2) | Google DeepMind | July 2023[^17] | Up to 55 billion (PaLI-X variant) | RT-1 robot demonstration data plus web vision-language data[^17] | Action tokens via VLA |
| [OpenVLA](/wiki/openvla) | Stanford / UC Berkeley / Toyota Research Institute consortium | June 2024 | 7 billion[^18] | [Open X-Embodiment](/wiki/open_x_embodiment) subset (~970k trajectories)[^18] | Action tokens via VLA |
| [π0](/wiki/pi0) (pi-zero) | [Physical Intelligence](/wiki/physical_intelligence) | October 2024[^19] | 3 billion (full); 470M (small variant)[^19] | Open X-Embodiment plus proprietary 8-platform dexterous data[^19] | Continuous actions via flow matching[^19] |
| [Helix](/wiki/helix_vla) | [Figure AI](/wiki/figure_ai) | February 2025[^20] | 7B (System 2) + 80M (System 1)[^20] | ~500 hours teleoperation on Figure humanoids[^20] | High-rate continuous control (200 Hz) for full humanoid upper body[^20] |

Several differences are worth noting. [RT-2](/wiki/rt_2) and [OpenVLA](/wiki/openvla) are explicitly framed as [vision-language-action](/wiki/vla) models layered on top of an existing vision-language model backbone. OpenVLA draws its robot trajectories primarily from the public [Open X-Embodiment](/wiki/open_x_embodiment) dataset, whereas RT-2, which predates that dataset's release, was trained on Google's earlier RT-1 robot data.[^17][^18] [π0](/wiki/pi0) starts from a pretrained vision-language model and uses flow matching, a continuous-action variant of diffusion, rather than discrete action tokens.[^19] [Helix](/wiki/helix_vla) separates a 7B vision-language reasoning module ("System 2") that runs at 7 to 9 Hz from an 80M low-level controller ("System 1") that runs at 200 Hz to drive a humanoid's joints.[^20] RFM-1, by contrast, is a single autoregressive transformer in which both control actions and video frames are emitted from the same token stream, and is trained on a proprietary corpus drawn from an operating commercial fleet rather than from a research dataset.[^1][^4]

For reference on dataset scale, [Open X-Embodiment](/wiki/open_x_embodiment) in 2023 contained roughly one million trajectories across 22 robot embodiments; Covariant reported that its fleet had collected tens of millions of warehouse pick trajectories by early 2024 and was adding new data on the order of a million every few weeks.[^4][^17]

## Acquisition by Amazon

On the evening of Friday, August 30, 2024, [Amazon](/wiki/amazon) disclosed that it had entered a non-exclusive license to Covariant's foundation models, including RFM-1, and had hired three of Covariant's co-founders, [Pieter Abbeel](/wiki/pieter_abbeel), Peter Chen, and Rocky Duan, along with about 25% of the startup's employees.[^5][^6] The structure followed a pattern seen in several earlier 2024 deals in which a hyperscaler obtains a license and key personnel without acquiring the startup outright, often described in press coverage as a "reverse acquihire." Examples include Microsoft's arrangement with [Inflection AI](/wiki/inflection_ai) in March 2024 at $650 million and Amazon's separate June 2024 hiring of Adept AI personnel at $330 million in licensing fees.[^21] Google's $2.7 billion licensing deal for Character.AI, announced earlier in August 2024, was the most prominent instance and further popularized the structure.[^21]

The motivation reported in coverage of the deal was twofold. First, Amazon already operates a very large warehouse robotics organization through [Amazon Robotics](/wiki/amazon_robotics) (previously Kiva Systems), and adding a robotics foundation model team allowed it to integrate generative-AI manipulation capabilities into its fulfillment centers. Second, by structuring the transaction as a license and acquihire rather than an outright acquisition, Amazon was widely understood to be seeking to avoid the heightened regulatory scrutiny that had attended other 2024 hyperscaler acquisitions.[^5][^6][^21]

The financial terms of the Covariant transaction were not disclosed at announcement.[^5][^6] A whistleblower complaint filed in 2025, summarized in subsequent reporting, characterized the structure as a $380 million upfront license payment plus a $20 million payment due one year after closing, against Covariant's April 2023 valuation of approximately $625 million.[^15]

Covariant continued to operate after the transaction. Ted Stinson, formerly chief operating officer, became chief executive officer, and co-founder Tianhao Zhang remained with the company; the surviving entity stated that it would focus on Covariant Brain deployments in industries such as apparel, health and beauty, grocery, and pharmaceuticals.[^7][^22] Public-facing activity from Covariant was nonetheless described in 2025 reporting as much reduced relative to pre-deal levels.[^15]

Abbeel was later appointed, in December 2025, to lead Amazon's large language model efforts within the company's AGI organization while continuing to work on robotics, an indication that the personnel transfer had been positioned as broader than robotics alone.[^11] Chen and Duan likewise took senior technical roles inside Amazon's robotics and AI organization, where they were tasked with adapting RFM-1's underlying methods to Amazon's larger fleet of warehouse robots; specific product timelines for those internal efforts had not been publicly disclosed as of mid-2026.[^11][^8]

The deal also drew interest from regulators on both sides of the Atlantic. The United States Federal Trade Commission opened a study of reverse-acquihire patterns in AI in late 2024, and European Commission officials commented on the structure as a way for big technology companies to avoid traditional merger review.[^21]

## Significance

RFM-1 was among the first multibillion-parameter [foundation models](/wiki/foundation_model) explicitly trained for general [robot manipulation](/wiki/robot_manipulation) using a commercial robot fleet's operational data, rather than research datasets or teleoperation studios.[^1][^4] In the 2024 to 2025 wave of generalist robotics models, it is typically grouped with [π0](/wiki/pi0) from [Physical Intelligence](/wiki/physical_intelligence) and [Helix](/wiki/helix_vla) from [Figure AI](/wiki/figure_ai) as examples of commercially backed systems with stated ambitions to scale across embodiments and tasks.[^19][^20] It also informed the wider industry argument that proprietary deployment data could be a structural advantage over research consortia such as [Open X-Embodiment](/wiki/open_x_embodiment) when training [robot foundation models](/wiki/robot_foundation_model).[^4][^17]

The Amazon transaction made RFM-1 part of one of the most consequential 2024 reverse-acquihire deals in robotics and contributed to the regulatory discussion in the United States and the European Union about how non-acquisition licensing arrangements should be treated for antitrust purposes.[^5][^6][^21]

A subtler significance is the way RFM-1 framed the role of [world models](/wiki/world_model) in robot control. Rather than treating physical simulation as an offline engineering tool used to generate synthetic training data, Covariant integrated video prediction directly into the policy network so that the model can imagine alternative outcomes at inference time. This integration mirrors developments in language model planning (where chain-of-thought reasoning sits inside the same model that produces the final answer) and points toward a more unified view of perception, action, and prediction in [embodied AI](/wiki/embodied_ai).[^9][^4] Several subsequent systems, including [π0](/wiki/pi0) and [Helix](/wiki/helix_vla), adopt different architectural choices but share the high-level commitment to a single large model that owns perception, language, and control together.[^19][^20]

From a commercial perspective, RFM-1 also represented an early test of whether warehouse-scale, deployment-derived data alone could rival cross-lab research datasets for training [robot foundation models](/wiki/robot_foundation_model). Although Covariant has not released benchmark numbers that would allow a clean comparison, the publicly observable result was that an 8 billion parameter model trained primarily on its proprietary fleet data was considered valuable enough by [Amazon](/wiki/amazon) to support the nine-figure license described in subsequent reporting, alongside the hiring of most of the founding team.[^5][^15]

## Limitations and Criticisms

Several limitations and open questions have been noted about RFM-1, either by Covariant itself or in third-party reporting.

- **Limited public technical disclosure.** Covariant published product pages, demonstrations, and blog posts about RFM-1, but did not release a peer-reviewed paper, weights, or quantitative benchmarks against [Open X-Embodiment](/wiki/open_x_embodiment), [RT-2](/wiki/rt_2), or other contemporaneous policies.[^1][^16] This makes independent reproduction and comparative evaluation impossible from public materials.
- **Domain concentration.** The model's training data is heavily weighted toward warehouse pick-and-place using suction-based grippers on industrial arms. Generalization to humanoid embodiments, mobile manipulation, or dexterous bimanual tasks (the regimes targeted by [Helix](/wiki/helix_vla), [Mobile ALOHA](/wiki/mobile_aloha), and [π0](/wiki/pi0)) is not directly addressed in publicly available demonstrations.[^1][^16][^19][^20]
- **Closed-source proprietary status.** Unlike [π0](/wiki/pi0), which [Physical Intelligence](/wiki/physical_intelligence) open-sourced in part, RFM-1's weights and architecture details have not been released, limiting external research engagement.[^19]
- **Visual hallucinations in early video predictions.** The April 2024 high-fidelity update explicitly framed earlier video rollouts as having had insufficient resolution and a tendency to hallucinate scene details, motivating the scaling-up effort.[^9]
- **Public uncertainty post-acquisition.** A 2025 whistleblower complaint described the surviving Covariant as having greatly reduced commercial activity since the August 2024 deal and characterized parts of the licensing arrangement as below the company's previous valuation, raising questions about the long-term independence of post-deal RFM-1 development outside [Amazon](/wiki/amazon).[^15]

## Related Work

- [RT-2](/wiki/rt_2): Google DeepMind's 2023 vision-language-action model, an early demonstration that web-scale VLM pretraining could be adapted to action outputs.[^17]
- [OpenVLA](/wiki/openvla): An open-source 7B VLA from a Stanford, Berkeley, and Toyota Research Institute consortium trained on [Open X-Embodiment](/wiki/open_x_embodiment) data.[^18]
- [Open X-Embodiment](/wiki/open_x_embodiment): 2023 academic consortium dataset aggregating roughly one million trajectories across 22 embodiments, used as the principal training source for many open VLAs.[^17]
- [π0](/wiki/pi0): [Physical Intelligence](/wiki/physical_intelligence)'s October 2024 generalist policy using flow matching over actions.[^19]
- [Helix](/wiki/helix_vla): [Figure AI](/wiki/figure_ai)'s February 2025 dual-system VLA for humanoid upper-body control on [Figure 02](/wiki/figure_02) and later [Figure 03](/wiki/figure_03) platforms.[^20]
- [PaLM-E](/wiki/palm-e_an_embodied_multimodal_language_model): Google's 2023 embodied multimodal language model that demonstrated transferring language-model knowledge to robot control.[^17]
- [Mobile ALOHA](/wiki/mobile_aloha): Stanford bimanual mobile manipulation system using teleoperated imitation learning.
- [World model](/wiki/world_model): The broader research direction of learned simulators for planning, of which RFM-1's video-prediction head is an instance.

## See also

- [Covariant](/wiki/covariant)
- [Pieter Abbeel](/wiki/pieter_abbeel)
- [Amazon](/wiki/amazon)
- [Amazon Robotics](/wiki/amazon_robotics)
- [University of California, Berkeley](/wiki/uc_berkeley)
- [Andrew Ng](/wiki/andrew_ng)
- [OpenAI](/wiki/openai)
- [Foundation model](/wiki/foundation_model)
- [Robot foundation model](/wiki/robot_foundation_model)
- [Robot manipulation](/wiki/robot_manipulation)
- [Robot learning](/wiki/robot_learning)
- [Warehouse robot](/wiki/warehouse_robot)
- [Imitation learning](/wiki/imitation_learning)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [VLA](/wiki/vla)
- [Multimodal AI](/wiki/multimodal_ai)
- [Transformer](/wiki/transformer)
- [Tokenization](/wiki/tokenization)
- [World model](/wiki/world_model)
- [Embodied AI](/wiki/embodied_ai)
- [RT-2](/wiki/rt_2)
- [OpenVLA](/wiki/openvla)
- [Open X-Embodiment](/wiki/open_x_embodiment)
- [π0](/wiki/pi0)
- [Physical Intelligence](/wiki/physical_intelligence)
- [Helix (VLA model)](/wiki/helix_vla)
- [Figure AI](/wiki/figure_ai)
- [Figure 02](/wiki/figure_02)
- [Figure 03](/wiki/figure_03)
- [PaLM-E](/wiki/palm-e_an_embodied_multimodal_language_model)
- [Mobile ALOHA](/wiki/mobile_aloha)
- [Inflection AI](/wiki/inflection_ai)

## References

[^1]: Covariant, "Introducing RFM-1: Giving robots human-like reasoning capabilities", Covariant Insights, 2024-03-11. https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/. Accessed 2026-05-20.
[^2]: Covariant, "Covariant Introduces RFM-1 to Give Robots the Human-like Ability to Reason", BusinessWire press release, 2024-03-11. https://www.businesswire.com/news/home/20240311948570/en/Covariant-Introduces-RFM-1-to-Give-Robots-the-Human-like-Ability-to-Reason. Accessed 2026-05-20.
[^3]: Robotics 24/7 staff, "MODEX 2024: Covariant introduces RFM-1 to give robots human-like ability to reason", Robotics 24/7, 2024-03-11. https://www.robotics247.com/article/modex_2024_covariant_introduces_rfm_1_to_give_robots_human_like_ability_to_reason/. Accessed 2026-05-20.
[^4]: Evan Ackerman, "Covariant Announces a Universal AI Platform for Robots", IEEE Spectrum, 2024-03-11. https://spectrum.ieee.org/covariant-foundation-model. Accessed 2026-05-20.
[^5]: Brian Heater, "Amazon hires the founders of AI robotics startup Covariant", TechCrunch, 2024-08-31. https://techcrunch.com/2024/08/31/amazon-hires-the-founders-of-robotics-ai-startup-covariant/. Accessed 2026-05-20.
[^6]: Todd Bishop, "Amazon hires Covariant founders, inks licensing deal with robotics AI startup in latest 'reverse acquihire' deal", GeekWire, 2024-08-31. https://www.geekwire.com/2024/amazon-hires-covariant-founders-inks-licensing-deal-with-robotics-ai-startup-in-latest-reverse-acquihire-deal/. Accessed 2026-05-20.
[^7]: Covariant, "Introducing the next phase of our AI Robotics journey", Covariant Insights, 2024-08-30. https://covariant.ai/insights/introducing-the-next-phase-of-our-ai-robotics-journey/. Accessed 2026-05-20.
[^8]: Eugene Demaitre, "Unpacking Amazon's unique Covariant AI acquisition", The Robot Report, 2024-09-04. https://www.therobotreport.com/unpacking-amazons-unique-covariant-ai-acquisition/. Accessed 2026-05-20.
[^9]: Covariant, "RFM-1 update: High-fidelity scene prediction", Covariant Insights, 2024-04-25. https://covariant.ai/insights/rfm-1-update-high-fidelity-scene-prediction/. Accessed 2026-05-20.
[^10]: Covariant, "About Covariant", covariant.ai. https://covariant.ai/about-us/. Accessed 2026-05-20.
[^11]: Wikipedia contributors, "Pieter Abbeel", Wikipedia, 2026 (continuously updated). https://en.wikipedia.org/wiki/Pieter_Abbeel. Accessed 2026-05-20.
[^12]: ABB and Covariant, "ABB and Covariant Partner to Deploy Integrated AI Robotic Solutions", ABB News Center, 2020-02-25. https://new.abb.com/news/detail/57457/abb-and-covariant-partner-to-deploy-integrated-ai-robotic-solutions. Accessed 2026-05-20.
[^13]: Brian Heater, "Covariant's robotic picking AI nabs another $75M", TechCrunch, 2023-04-04. https://techcrunch.com/2023/04/04/covariants-robotic-picking-at-nabs-another-75m/. Accessed 2026-05-20.
[^14]: Index Ventures, "Congratulations to Covariant on their $75M Series C extension", Index Ventures Perspectives, 2023-04-04. https://www.indexventures.com/perspectives/congratulations-to-covariant-on-their-75m-series-c/. Accessed 2026-05-20.
[^15]: Sunset HQ, "Covariant Acquisition: Key Details, Impact, and What Comes Next", Sunset HQ Blog, 2025. https://www.sunsethq.com/blog/covariant-acquisition. Accessed 2026-05-20.
[^16]: Radical Ventures, "Giving Robots Human-like Reasoning Capabilities: Introducing RFM-1", Radical Ventures, 2024-03-11. https://radical.vc/giving-robots-human-like-reasoning-capabilities-introducing-rfm-1/. Accessed 2026-05-20.
[^17]: Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv preprint, 2023-10-13. https://arxiv.org/abs/2310.08864. Accessed 2026-05-20.
[^18]: Moo Jin Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv preprint, 2024-06-13. https://arxiv.org/abs/2406.09246. Accessed 2026-05-20.
[^19]: Kevin Black et al., "π0: A Vision-Language-Action Flow Model for General Robot Control", Physical Intelligence, 2024-10-31. https://www.pi.website/blog/pi0. Accessed 2026-05-20.
[^20]: Figure AI, "Helix: A Vision-Language-Action Model for Generalist Humanoid Control", Figure News, 2025-02-20. https://www.figure.ai/news/helix. Accessed 2026-05-20.
[^21]: Anjana Susarla, "Reverse acquihires: how big tech is rewriting M&A", American Action Forum (AAF) / industry coverage, 2024-09. https://www.americanactionforum.org/insight/ftc-eyes-reverse-acquihires-in-ai-sector/. Accessed 2026-05-20.
[^22]: Mike Murphy, "Amazon hires three of the founders of AI robotics company Covariant, licenses its technology", Modern Materials Handling, 2024-09-03. https://www.mmh.com/article/amazon_hires_the_founders_of_ai_robotics_company_covariant_licenses_its_technology. Accessed 2026-05-20.

