Helix (VLA model)
Last reviewed
May 7, 2026
Sources
17 citations
Review status
Source-backed
Revision
v1 · 4,250 words
Helix is a Vision-Language-Action (VLA) model developed by Figure AI for controlling humanoid robots. First announced on February 20, 2025, Helix runs entirely onboard Figure 02 robots and was later updated to power the Figure 03 platform. The model takes camera images and natural language commands as inputs and produces continuous joint-level control signals at 200 Hz, enabling dexterous, whole-body manipulation of objects the robot has never previously encountered.
Helix is notable for several technical firsts among VLA models: it was the first VLA to output continuous high-rate control across a 35-degree-of-freedom humanoid upper body, the first to operate simultaneously on two robots sharing identical weights, and the first production VLA to run without cloud connectivity on embedded low-power GPUs. In January 2026, Figure released Helix 02, an extension that added full lower-body locomotion control through a third neural subsystem called System 0.
Figure AI was founded in 2022 by entrepreneur Brett Adcock with the goal of building general-purpose humanoid robots for commercial work environments. The company's first platform, Figure 01, demonstrated basic teleoperated tasks and served as a proof-of-concept for the hardware. Its second, Figure 02, was designed with commercial deployment in mind and entered a pilot program at BMW Group Plant Spartanburg in South Carolina in early 2024.
The early Figure 02 robots ran under conventional control architectures written in C++: hand-engineered state machines that specified motions explicitly for each task. Adding a new task meant writing new code. Adding a new object type meant adjusting perception pipelines. These systems were brittle and required manual reprogramming whenever the environment changed or a new object was introduced. The company's researchers concluded that this approach would not scale to the variety of tasks required in real workplaces, let alone homes.
The broader robotics field had been wrestling with the same problem. Task-specific behavioral cloning, where a model learns to imitate recorded demonstrations for one specific task, produced policies that worked reliably in their training environment but failed unpredictably when objects were moved, lighting changed, or unfamiliar items appeared. The emergence of large language models and vision-language models between 2021 and 2023 opened a different path: if a model pretrained on internet data already contained broad knowledge about objects and actions, that knowledge might transfer to physical manipulation with far less robot-specific training.
Beginning in mid-2024, Figure's AI team built Helix as a replacement: a single neural network that could handle perception, language understanding, and motor control together. Brett Adcock previewed the effort in a February 4, 2025 social media post, writing that the company would show something "never seen on a humanoid" within 30 days. On February 20, 2025, Figure published a detailed technical blog post and a demonstration video showing two Figure 02 robots collaborating to store grocery items, handling thousands of household objects they had never encountered in training.
The introduction of Helix coincided with a broader reorganization at Figure. The company restructured its engineering teams around the model, treating Helix as the central product rather than the robot hardware. Later in 2025, Figure announced the Helix Lab, a dedicated data collection and training facility built to feed a continuously growing corpus of robot demonstrations to the model. The lab was designed to integrate data collection, simulation, and on-robot learning into a single pipeline in preparation for broader commercial deployment of Figure 03.
Helix uses a dual-system design loosely inspired by the cognitive science concept of fast and slow thinking. The two subsystems are called System 2 (S2) and System 1 (S1). They operate at very different timescales and serve different functions, but exchange information through a latent vector held in shared memory.
System 2 is a 7-billion-parameter vision-language model (VLM) that processes monocular camera images and natural language task instructions. It runs at 7 to 9 Hz, updating its internal representation of the scene on every cycle. The model was initialized from an open-source, open-weight VLM pretrained on internet-scale text and image data, then further trained on robot teleoperation data.
S2's primary job is scene understanding: identifying objects, parsing instructions, and encoding the current behavioral intent into a compact latent vector. This latent vector is written to shared memory and read continuously by S1. Because S2 is grounded in a large internet-pretrained model, it generalizes across a wide variety of objects, textures, and language phrasings without requiring specific training examples for every item the robot will encounter.
To run a 7B-parameter model on embedded hardware, Figure quantized the model to 4-bit precision and implemented model parallelism across the robot's dual onboard GPUs. The result is a 23x reduction in computational overhead compared to a naive cloud-based implementation, and the system stays under 60 watts while maintaining sub-100ms latency from image capture to latent update.
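The memory effect of 4-bit quantization can be illustrated with a back-of-the-envelope calculation. The sketch below is not Figure's code. It assumes a 16-bit baseline (not stated in the source) and counts weight storage only, ignoring quantization scales, activations, and caches. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


S2_PARAMS = 7e9  # parameter count stated for System 2

# Assumption: the unquantized model would store weights in 16 bits.
fp16_gb = weight_memory_gb(S2_PARAMS, 16)

# 4-bit quantized weights, ignoring per-group scale and zero-point overhead.
int4_gb = weight_memory_gb(S2_PARAMS, 4)
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB. The 23x overhead figure quoted above is Figure's own and is not derived from this arithmetic.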
System 1 is an 80-million-parameter transformer-based visuomotor policy. It runs at 200 Hz, consuming both raw camera images and the latent vector produced by S2 to output continuous joint-level commands. The S1 architecture uses a fully convolutional, multi-scale vision backbone pretrained in simulation, combined with a cross-attention encoder-decoder transformer that fuses visual and latent inputs into action predictions.
S1 controls the robot's full upper-body action space: wrist poses in six degrees of freedom, finger flexion and abduction for each digit, torso orientation, and head gaze direction. This 35-DoF action space is output as a continuous stream rather than discrete waypoints, which is what allows Helix to produce fluid, reactive motion rather than the step-by-step movements typical of earlier robot manipulation systems.
A critical design decision was the temporal offset between S1 and S2 inputs during training. Because S2 operates at roughly 7-9 Hz while S1 runs at 200 Hz, there is an inherent latency between when S2 updates its latent and when S1 acts on it. Figure deliberately replicated this latency offset during training so the model would not learn behaviors that depend on information S2 cannot provide in time during deployment. This train-inference alignment prevents a class of compounding errors that plagued earlier dual-system robot controllers.
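One way to picture this train-inference alignment is as an index shift when building training pairs: each S1 step is paired with the S2 input from some number of steps earlier. The following is a minimal sketch under stated assumptions, not Figure's pipeline. The function `paired_indices` and the 25-step lag (125 ms at 200 Hz, roughly one S2 period at 8 Hz) are illustrative choices.

```python
from typing import List, Tuple

# Assumption: a lag of 25 S1 steps, i.e. 125 ms at the stated 200 Hz rate.
S2_LAG_STEPS = 25


def paired_indices(num_steps: int, s2_lag_steps: int) -> List[Tuple[int, int]]:
    """Pair each S1 step t with the S2 observation index t - lag.

    Indices before the start of the recording are clamped to 0, so the
    earliest S1 steps all see the first available S2 observation.
    """
    return [(t, max(0, t - s2_lag_steps)) for t in range(num_steps)]


# Example: with a lag of 2 steps, S1 step 4 trains against S2 input 2.
example = paired_indices(5, 2)
```

Training on pairs built this way means S1 never sees a latent that is fresher than what S2 could deliver at deployment time.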
S2 runs as an asynchronous background process, continuously writing updated latents to shared memory. S1 reads the most recent available latent on every 5-millisecond control cycle. The result is a system that can think at two timescales simultaneously: S2 reasons about what the robot is supposed to be doing while S1 handles the moment-to-moment physics of manipulation.
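The asynchronous hand-off can be mimicked with a deterministic simulation in which S1 always reads the latest latent written at or before the current time. This sketch assumes S2 writes exactly every 125 ms (8 Hz, inside the stated 7 to 9 Hz range) with no jitter and no compute delay. Real timing would vary, and `simulate` is a hypothetical helper.

```python
S1_PERIOD_MS = 5    # 200 Hz control cycle, as stated in the text
S2_PERIOD_MS = 125  # assumption: 8 Hz, within the stated 7-9 Hz range


def simulate(duration_ms: int):
    """Return (time_ms, latent_version) for every S1 control cycle.

    latent_version counts S2 writes, with version 0 written at t = 0.
    S1 never waits for S2; it reuses the most recent version available.
    """
    reads = []
    for t in range(0, duration_ms, S1_PERIOD_MS):
        latent_version = t // S2_PERIOD_MS  # latest write at or before t
        reads.append((t, latent_version))
    return reads


reads = simulate(1000)  # one second of simulated control
```

In this idealized setting each latent is reused for 25 consecutive S1 cycles before the next one arrives.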
Both subsystems run on NVIDIA Jetson Orin modules embedded in the robot chassis, with no reliance on cloud compute. Inference is split across two GPUs using model parallelism: S2 loads onto one GPU while S1 uses the other, and the latent vector in shared memory bridges them. This onboard execution is what allows Helix to operate in environments with no network connectivity and to respond with the sub-100ms latency required for safe physical interaction.
Running the 7B-parameter S2 on a Jetson Orin required significant compression: the 4-bit quantization and model parallelism described earlier, which Figure credits with a 23x reduction in computational overhead relative to a cloud-hosted inference approach. The under-60-watt total is a meaningful constraint for a mobile humanoid robot, because the entire inference budget must fit within the robot's power envelope alongside motor controllers, sensors, and safety systems.
Helix was trained on roughly 500 hours of teleoperated demonstrations collected by a team of human operators controlling Figure 02 robots through a teleoperation interface. The dataset was gathered using multiple robots and multiple operators to ensure behavioral diversity, and care was taken to filter out slow or unsuccessful demonstrations while keeping examples that showed corrective behavior when failures arose from environmental randomness rather than operator error.
Figure's team described this training set as "a small fraction of the size of previously collected VLA datasets," a point they attributed to data quality rather than data volume. Working closely with teleoperators to standardize and refine manipulation strategies produced measurable improvements in policy quality. The company also applied hindsight instruction labeling: a separate auto-labeling VLM processed recorded demonstrations and generated natural language descriptions of each behavior after the fact, producing a paired dataset of (video, instruction) without requiring operators to dictate commands during collection.
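The hindsight labeling step can be sketched as a simple mapping from recorded clips to generated instructions. The code below is an illustration only. The `Clip` and `LabeledClip` types, the `hindsight_label` function, and the `toy_labeler` stand-in are all hypothetical, and a real pipeline would call a vision-language model where the stub returns a placeholder string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    clip_id: str
    frames: List[str]  # stand-in for recorded image data


@dataclass
class LabeledClip:
    clip: Clip
    instruction: str  # natural language description generated after the fact


def hindsight_label(clips: List[Clip],
                    labeler: Callable[[Clip], str]) -> List[LabeledClip]:
    """Attach an instruction to each clip after recording, using `labeler`."""
    return [LabeledClip(clip=c, instruction=labeler(c)) for c in clips]


def toy_labeler(clip: Clip) -> str:
    """Placeholder for an auto-labeling VLM; returns a dummy instruction."""
    return f"placeholder instruction for {clip.clip_id}"


clips = [Clip("demo-001", ["frame0", "frame1"])]
dataset = hindsight_label(clips, toy_labeler)
```

The point of the design is that operators record behavior without narrating it, and the language side of the dataset is produced afterward.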
Training used a standard regression loss from raw pixels and text tokens to continuous action vectors. There were no task-specific fine-tuning stages or separate action heads for different behaviors. The single set of neural network weights handles picking, placing, drawer operation, refrigerator use, multi-robot handovers, and other manipulation behaviors without any specialization per task.
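The source does not say which regression loss was used. As an illustration, the sketch below assumes mean squared error between a predicted and a demonstrated action vector; the function name `mse_loss` and the example values are hypothetical.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length action vectors."""
    if len(predicted) != len(target):
        raise ValueError("action vectors must have the same length")
    if not predicted:
        raise ValueError("action vectors must be non-empty")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared) / len(squared)


# Example with a 2-dimensional action; a real Helix action has 35 dimensions.
loss = mse_loss([0.0, 1.0], [0.0, 3.0])
```

Because the same loss applies to every behavior, no per-task action head is needed: the only thing that changes between tasks is the input image and instruction.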
Objects used during training were explicitly excluded from the evaluation protocol, so all published results on novel object generalization reflect true out-of-distribution performance.
The most widely discussed capability of Helix at launch was its ability to handle objects the model had never seen during training. In the February 2025 demonstration, Figure 02 robots successfully picked up and stowed thousands of distinct household items, responding to natural language commands such as "Pick up the pasta box" or "Place the cereal in the cabinet." Because S2 is grounded in a large internet-pretrained VLM, it can identify and reason about arbitrary objects based on visual appearance and category knowledge, even without prior robot interaction examples.
Figure specifically excluded training objects from the evaluation demonstrations, meaning no demonstration in the training set contained the exact items shown in the video. This out-of-distribution generalization was a direct consequence of the S2 architecture and its internet pretraining.
Helix is the first VLA demonstrated running simultaneously across two robots sharing identical model weights. In the February 2025 demonstration, two Figure 02 robots coordinated to complete a grocery storage task, with one robot handing items to the other and both adapting their behavior based on the shared task context. No separate coordination protocol or inter-robot communication channel was required: both robots received the same natural language prompt and used S2's scene understanding to infer appropriate roles from the visual context.
Prior to Helix, VLA models in the humanoid space typically controlled only end-effectors, delegating wrist orientation, torso posture, and head movement to separate lower-level controllers. Helix integrates all 35 upper-body degrees of freedom into a single action output, which means the neural policy can learn to use the entire upper body as part of its manipulation strategy. This produces more natural motion and enables tasks that require coordinating hand position with torso lean or head gaze.
Running S2 and S1 entirely on embedded Jetson Orin hardware, without cloud connectivity, was a stated commercial priority at launch. Figure framed this as the model being "immediately ready for commercial deployment" without infrastructure dependencies. The combination of 4-bit quantization, model parallelism, and the 80M-parameter S1 keeps total power consumption low enough for a mobile robot that must manage its own energy budget.
The onboard-only constraint also has implications for safety and latency. A system that relies on cloud inference introduces network latency into the control loop and has a single point of failure: if the network connection drops, the robot stops. Edge-native deployment eliminates both concerns. Helix's 200 Hz control loop and sub-100ms end-to-end latency are only achievable because all inference happens on the same hardware that drives the motors.
Six days after the initial announcement, Figure published a follow-on technical post describing a specialized deployment of Helix for warehouse logistics. The logistics variant added two modifications to the core System 1 visuomotor policy.
First, Figure added implicit stereo vision with multi-scale feature extraction. The stereo upgrade gave S1 a richer 3D understanding of package positions, particularly useful for estimating depth when placing items on moving conveyor belts. Benchmarks showed a 60% throughput increase over the non-stereo baseline, and the stereo model generalized to flat envelopes that had never appeared in the training corpus.
Second, Figure implemented learned visual proprioception, which allows S1 to calibrate itself to individual robot bodies without manual recalibration. This cross-robot transfer capability was important for scaling the policy across a fleet where each unit has slightly different mechanical tolerances.
The logistics training set was notably small: just 8 hours of curated demonstration data produced a functional policy. Curated, high-quality data outperformed a three-times-larger uncurated dataset by 40% on throughput, reinforcing Figure's emphasis on data quality. A "Sport Mode" feature enabled test-time speedup by linearly resampling action chunks, allowing the robot to execute motions 20 to 50% faster than the human demonstrators while maintaining high success rates.
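The text describes Sport Mode only as linearly resampling action chunks. One plausible reading, sketched below, is to interpolate a chunk of N action vectors down to roughly N / speedup vectors and play them at the unchanged control rate. The function `resample_chunk` and its rounding rule are assumptions, not Figure's implementation.

```python
from typing import List


def resample_chunk(chunk: List[List[float]], speedup: float) -> List[List[float]]:
    """Linearly resample an action chunk to about len(chunk) / speedup steps.

    The first and last actions are preserved, so the motion covers the same
    trajectory in fewer control cycles.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    n_in = len(chunk)
    if n_in < 2:
        raise ValueError("chunk needs at least two actions")
    n_out = max(2, int(n_in / speedup + 0.5))  # round half up
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append([a + frac * (b - a) for a, b in zip(chunk[lo], chunk[hi])])
    return out


# Seven one-dimensional actions resampled for a 1.5x speedup (50% faster).
chunk = [[float(k)] for k in range(7)]
fast = resample_chunk(chunk, 1.5)
```

Here seven steps become five, so the same motion finishes in five control cycles instead of seven.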
Figure released Helix 02 with the Figure 03 platform, introducing a third subsystem called System 0 (S0) that extended Helix from upper-body manipulation to full-body loco-manipulation.
System 0 is a 10-million-parameter neural network that runs at 1,000 Hz, one level below S1 in the control hierarchy. It replaces the hand-engineered C++ balance and locomotion controllers that Figure's earlier robots relied on. S0 takes full-body joint state and base motion commands as input and outputs joint-level actuator commands at 1 kHz, handling contact, balance, and coordination across legs, torso, and arms simultaneously.
S0 was trained entirely in simulation across more than 200,000 parallel environments, using reinforcement learning against a corpus of over 1,000 hours of retargeted human motion capture data. The simulation-to-real transfer succeeded well enough that Figure could delete the final 109,504 lines of hand-engineered C++ from the robot's control stack, a milestone CEO Brett Adcock described as reaching "Software 2.0" status.
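The three-tier hierarchy in Helix 02 is, from top to bottom: System 2 for scene and language understanding, System 1 for visuomotor control, and System 0 for balance and joint-level actuation at 1 kHz.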
The flagship demonstration for Helix 02 showed a Figure 03 robot completing a dishwasher loading and unloading task: a four-minute, end-to-end autonomous operation with no resets and no human intervention. The task required 61 sequential loco-manipulation actions, including walking across the kitchen, opening the dishwasher door, grasping dishes from various positions, placing them in the rack, and closing the door. The robot also demonstrated improvised whole-body moves, closing a drawer with its hip and lifting the dishwasher door with its foot when its hands were occupied.
Other demonstrated tasks included unscrewing bottle caps, extracting pills from blister packs, dispensing liquid from a syringe, and sorting small metal components. These tasks were designed to show that tactile feedback and palm cameras enabled manipulation beyond the limits of vision-only policies.
Figure 03 was designed alongside Helix 02 to provide the sensory data the upgraded policy required. The robot's vision system delivers twice the frame rate, one-quarter the latency, and 60% wider field of view per camera compared to Figure 02. Palm-mounted cameras in each hand provide close-range visual feedback during grasping. Custom fingertip tactile sensors detect forces as small as 3 grams, roughly the weight of a paperclip. These sensing upgrades feed directly into S1 and S0, enabling manipulation behaviors that were out of reach for the original Helix running on Figure 02 hardware.
Figure 02 is a humanoid robot standing approximately 5 feet 6 inches tall and weighing around 70 kilograms. It was the primary platform for Helix at launch and for the BMW deployment. The robot's upper-body configuration, with multi-finger hands and a mobile base, was designed to fit within industrial environments sized for human workers.
Figure 03 was announced on October 9, 2025, as a consumer and home-oriented platform for Helix 02. Standing 5 feet 8 inches tall and weighing 61 kilograms (9% lighter than Figure 02 despite adding new sensors), Figure 03 includes wireless inductive charging through coils in its feet, allowing the robot to autonomously dock and recharge. Its actuators run at twice the speed of Figure 02's with improved torque density, and the chassis incorporates multi-density foam padding and a soft textile exterior for safety in proximity to people. Figure 03's 10 Gbps mmWave data offload capability allows each unit in the field to continuously upload operational data for fleet-level model updates.
Figure priced Figure 03 at approximately $20,000 for consumers and targeted initial home deployments in late 2026. As of early 2026, the company was producing units at its BotQ manufacturing facility in California at a rate approaching one robot every 90 minutes, with a stated annual capacity of approximately 50,000 units.
The most extensively documented real-world deployment of Helix-equipped robots was the Figure 02 pilot at BMW Group Plant Spartanburg in South Carolina. Figure and BMW announced the collaboration in early 2024. The project ran for 11 months, with robots on active production lines from month 10 onward, running 10-hour shifts Monday through Friday.
During the deployment, Figure 02 robots performed sheet-metal loading operations: picking components from racks and placing them into welding fixtures within a 5-millimeter tolerance in a 2-second window.
The robots targeted three KPIs per shift: 84-second cycle time (37 seconds for the loading phase), greater than 99% placement accuracy, and zero human interventions required.
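The cycle-time target can be put in context with simple arithmetic. The sketch below assumes a robot runs uninterrupted for the full 10-hour shift, which is an idealization: the source does not report actual uptime. The helper names are hypothetical.

```python
SHIFT_HOURS = 10     # shift length stated for the deployment
CYCLE_TIME_S = 84    # target cycle time
LOADING_TIME_S = 37  # loading phase within each cycle


def cycles_per_shift(shift_hours: float, cycle_time_s: float) -> int:
    """Whole cycles that fit in one shift, assuming no downtime."""
    return int(shift_hours * 3600 // cycle_time_s)


def loading_share(loading_s: float, cycle_s: float) -> float:
    """Fraction of each cycle spent in the loading phase."""
    return loading_s / cycle_s


max_cycles = cycles_per_shift(SHIFT_HOURS, CYCLE_TIME_S)
share = loading_share(LOADING_TIME_S, CYCLE_TIME_S)
```

At the target pace this gives an upper bound of 428 cycles per shift, with loading taking about 44% of each cycle.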
The program surfaced a critical hardware reliability issue: the forearm was the top failure point across the fleet, attributed to tight mechanical packaging, three degrees of freedom in a small volume, and thermal constraints. This finding directly shaped the redesign of Figure 03's wrist electronics, which eliminated the distribution board and dynamic cabling and moved each wrist motor controller to direct communication with the main computer, reducing complexity and improving thermal management.
BMW subsequently expanded its program. An initial deployment at BMW Group Plant Leipzig in Germany began in December 2025, with a further test phase planned for spring 2026 ahead of a full pilot launch in summer 2026.
Helix belongs to a generation of VLA models that emerged between 2023 and 2025 for robot control. The models differ substantially in architecture, parameter count, control frequency, and target robot platforms. Each reflects different tradeoffs between generality, speed, deployment constraints, and the type of robot it was designed for.
| Model | Developer | Release | Parameters | Control rate | Architecture type | Target platform |
|---|---|---|---|---|---|---|
| RT-2 | Google DeepMind | July 2023 | 55B (PaLI-X) | ~1-3 Hz | Single end-to-end VLA | Robot arm |
| pi0 | Physical Intelligence | Oct 2024 | 3.3B | 10 Hz | VLM + action expert (flow matching) | Multiple robot arms |
| Helix (S1+S2) | Figure AI | Feb 2025 | 7B (S2) + 80M (S1) | 200 Hz | Dual-system (VLM + visuomotor) | Humanoid upper body |
| Isaac GR00T N1 | NVIDIA | Mar 2025 | 2B | 120 Hz | Dual-system (VLM + diffusion) | Humanoid |
| GR00T N1.7 | NVIDIA | 2025 | 3B | 120 Hz | Dual-system + reasoning | Humanoid |
RT-2 was the seminal demonstration that a VLM pretrained on internet data could be co-finetuned on robot demonstrations to produce a general-purpose robot policy. Its primary contribution was establishing the transfer of web knowledge to robot control. However, RT-2 controls a 6-DoF arm at roughly 1-3 Hz and was not designed for humanoid whole-body control. The 55-billion-parameter variant, built on PaLI-X, also required cloud compute and was not deployable onboard a mobile robot.
pi0 from Physical Intelligence uses a 3.3-billion-parameter architecture that combines a PaliGemma VLM backbone with a flow-matching action expert. It operates at 10 Hz and is primarily evaluated on arm manipulation tasks across multiple physical robot platforms. Physical Intelligence subsequently released pi0.5 and pi0.6, progressively refining the architecture with improved reasoning and faster execution. A key difference from Helix is that pi0 targets a broader range of robot form factors and is made available to external researchers through an open-source release, whereas Helix is proprietary to Figure's robots.
Isaac GR00T N1 from NVIDIA uses a dual-system design that parallels Helix's S1/S2 architecture: a VLM module for scene understanding coupled with a diffusion transformer for action generation, running at 120 Hz. GR00T N1 was trained on a mixture of real teleoperation data, human video, and NVIDIA-generated synthetic data from Cosmos and Isaac Lab. It was released as an open foundation model that third-party humanoid robot manufacturers can adapt, in contrast to Helix, which is tightly integrated with Figure's hardware and not publicly licensed.
The core architectural distinction between Helix and most other VLAs is its 200 Hz continuous control rate and its 35-DoF whole-body action space. Most VLA models target arm manipulation with 6-7 DoF at lower control frequencies. Helix's high-rate output is what allows it to handle the reactive, real-time adjustments required for humanoid whole-body manipulation. The tradeoff is that the design is specific to Figure's robot hardware and not straightforwardly portable to other platforms.
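The difference in control rates is easier to compare as the time between successive commands. The sketch below converts the rates listed in the comparison table into control periods; the dictionary and the `period_ms` helper are illustrative, and the RT-2 entry uses the approximate 1-3 Hz range given above.

```python
# Control rates in Hz as (low, high); a single rate repeats the same value.
CONTROL_RATES_HZ = {
    "RT-2": (1.0, 3.0),
    "pi0": (10.0, 10.0),
    "Isaac GR00T N1": (120.0, 120.0),
    "Helix": (200.0, 200.0),
}


def period_ms(rate_hz: float) -> float:
    """Time between consecutive control commands, in milliseconds."""
    return 1000.0 / rate_hz


# Shortest period per model, i.e. the period at the highest listed rate.
periods = {name: period_ms(high) for name, (_, high) in CONTROL_RATES_HZ.items()}
```

At 200 Hz a new command is issued every 5 ms, compared with 100 ms at 10 Hz, which is the gap the paragraph above refers to when it describes reactive, real-time adjustment.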
Another notable difference is the training data profile. Pi0 and GR00T N1 both incorporate large quantities of synthetic simulation data alongside real robot data. Helix v1 was trained exclusively on real teleoperated demonstrations, relying on data quality rather than simulation-generated volume. Helix 02 introduced simulation-trained System 0, bringing Figure's approach closer to the sim-to-real paradigm used by competitors, but S1 and S2 continued to rely on human demonstration data.
Several limitations in Helix's capabilities and deployment context were identified through Figure's own technical disclosures and the BMW program findings.
The original Helix controlled only the upper body. The lower body ran under separate hand-coded controllers, meaning the robot could not simultaneously walk and manipulate with the same neural policy. This was the primary gap that Helix 02 addressed.
System 1, in the original design, relied mainly on proprioception for base stabilization, giving the robot limited environmental awareness outside its upper-body workspace. Navigation in complex or unstructured environments, including stairs and uneven terrain, required separate systems or manual switching prior to the System 0 integration in Helix 02.
The BMW deployment revealed that hardware reliability, not the AI policy itself, was the binding constraint in sustained industrial use. The forearm failure rate required intervention and ultimately drove a full redesign of the wrist subsystem for Figure 03. Industrial deployment at scale exposed failure modes that controlled laboratory testing did not.
Figure's training approach requires large quantities of high-quality human demonstration data. The company was explicit that teaching a new behavior still requires either extensive teleoperation or programming. The 8-hour dataset figure for the logistics variant was held up as an achievement of data efficiency, but it still represents a meaningful collection effort per task category. Generalizing to entirely new task domains beyond the existing training distribution remains a research problem.
In November 2025, Figure's former head of product safety filed a lawsuit alleging the company had terminated her employment after she raised concerns that the robots posed injury risks, specifically that they were capable of generating forces sufficient to fracture a human skull. The company disputed the characterization. The case highlighted broader industry questions about safety evaluation standards for humanoid robots deployed in proximity to workers.
Figure 03 targeted home deployment in late 2026, but as of early 2026 no commercially available unit had been placed in a private residence. Adcock publicly stated that he would not release the robot for unsupervised home use until it met safety thresholds, which he had not yet fully specified.