Mobile ALOHA
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,333 words
Mobile ALOHA is an open-source, low-cost, whole-body bimanual mobile manipulation platform developed at Stanford University by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Released publicly on 4 January 2024 alongside the arXiv preprint Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation, the system extends the original ALOHA bimanual teleoperation rig with an AgileX Tracer differential-drive base, a whole-body teleoperation interface in which the operator is tethered to the base while holding the two leader arms, and a behavior-cloning recipe in which an ACT (Action Chunking with Transformers) policy is co-trained on a mixture of newly collected mobile data and the static ALOHA dataset released the previous year.[1][2] The paper reports a materials cost of roughly $32,000 USD for the complete platform, including all four robot arms, the mobile base, battery, cameras, and an Nvidia RTX 3070 Ti laptop for onboard compute.[1]
The project was widely amplified on social media in January 2024 because of striking video clips that showed the robot performing household chores such as sauteing shrimp, calling an elevator, watering plants, opening a wall cabinet, and rinsing a pan. Some of these clips were autonomous policy rollouts from the paper, while others (in particular several cooking demonstrations widely shared on X/Twitter) were human teleoperated data-collection sessions, a distinction that the authors and IEEE Spectrum subsequently emphasized.[3] The viral wave nonetheless helped popularize low-cost mobile bimanual imitation learning as a research direction and seeded a commercial product line at Trossen Robotics, including the Aloha Mobile kit.[4][5]
This article describes the origin and authorship of Mobile ALOHA, the hardware bill of materials, the teleoperation interface, the ACT algorithm and the co-training recipe, the autonomous tasks reported in the paper along with their measured success rates, comparisons with related systems including ALOHA 2, the Universal Manipulation Interface (UMI), and DexCap, the open-source software and hardware releases, the Trossen Robotics commercialization, and the platform's lasting influence on later vision-language-action work such as Physical Intelligence's pi0 model and the LeRobot ecosystem.
Mobile ALOHA was developed in the Stanford IRIS Lab (Intelligence through Robotic Interaction at Scale), led by assistant professor Chelsea Finn, in collaboration with the broader Stanford AI Lab and the Stanford Robotics Center. The project's two student co-leads, Zipeng Fu and Tony Z. Zhao, had previously contributed to neighboring lines of work: Zhao authored the original ALOHA system and the ACT algorithm in 2023 (Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, arXiv:2304.13705), and Fu had worked extensively on legged locomotion and whole-body control.[6] Their goal was to combine the dexterity advantages of the ALOHA teleoperation interface with the reach and household relevance of a mobile platform, without falling into the cost trap of prior mobile manipulator research, which had typically required PR2 or TIAGo-class robots costing on the order of $200,000.[1]
The team released the arXiv preprint, an extensive project page, two GitHub repositories (one for hardware and teleoperation, one for the ACT++ training code), a Google Drive containing several hundred teleoperated episodes, and high-resolution videos of both teleoperation and autonomous rollouts on 4 January 2024. Within seventy-two hours the project had crossed several million video views on X/Twitter, YouTube Shorts, and TikTok, prompting press coverage from MIT Technology Review, VentureBeat, TechCrunch, and Stanford Report.[7][8][9]
Mobile ALOHA inherits the four-arm leader-follower architecture of ALOHA but mounts the two follower arms on a wheeled chassis instead of a fixed tabletop. The two leader arms sit on a teleoperation frame behind the robot, and the human operator, tethered to the base at the waist, walks the whole rig around the workspace.
The system uses four Trossen Robotics ViperX 300 robot arms in total: two as the followers mounted on the mobile base and two as leaders on the teleoperation frame. Each ViperX 300 has six degrees of freedom plus a gripper (treated as a seventh actuated joint), uses Dynamixel XM/XH series servos, and has a reach of roughly 750 mm with a payload of about 750 g at full extension.[10] The arms are mounted with their bases at approximately shoulder height for an average adult operator, which the paper notes is a deliberate ergonomic choice: it gives good reach across countertops, sinks, and elevator buttons at the cost of access to floor-level appliances such as ovens and dishwashers.
The mobile base is the AgileX Tracer AGV, a low-profile differential-drive platform marketed for warehouse logistics. The Tracer has a footprint of roughly 569 mm by 445 mm, a top speed of 1.6 m/s (comparable to a brisk walking pace), a continuous payload of 100 kg, and velocity control over a CAN bus; with the arms and teleoperation frame added, the overall robot footprint is about 90 cm by 135 cm.[1][11] The paper notes that the Tracer's low height of approximately 17 mm permits placing a heavy lithium battery near the floor, which acts as ballast and significantly improves tip-over stability when the arms apply forces overhead.
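A differential-drive base such as the Tracer turns a commanded linear velocity and angular velocity into left and right wheel speeds. The following is a minimal sketch of that conversion, not the vendor's firmware logic. The `track_width` value is a hypothetical placeholder rather than a published Tracer specification, and the 1.6 m/s limit is the top speed quoted above.

```python
def wheel_speeds(v, omega, track_width=0.4, v_max=1.6):
    """Convert a body twist (v in m/s, omega in rad/s) to wheel rim speeds.

    track_width is a hypothetical wheel separation in metres, not a Tracer spec.
    v_max is the 1.6 m/s top speed quoted in the text.
    """
    left = v - omega * track_width / 2.0
    right = v + omega * track_width / 2.0
    # Scale both wheels together so the faster one respects the speed limit
    # while the commanded turning radius is preserved.
    fastest = max(abs(left), abs(right))
    if fastest > v_max:
        scale = v_max / fastest
        left, right = left * scale, right * scale
    return left, right


def body_twist(left, right, track_width=0.4):
    """Inverse mapping: wheel rim speeds back to (v, omega), as odometry would."""
    return (left + right) / 2.0, (right - left) / track_width
```

The inverse mapping is the same relationship that wheel odometry relies on when it reports the base's linear and angular velocity.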
A custom-built 1.26 kWh lithium-ion battery weighing approximately 14 kg supplies power to all four arms, the mobile base, cameras, and laptop. The paper reports roughly two to three hours of continuous teleoperation on a single charge. Compute is provided by a consumer-grade laptop with an Intel i7-12800H CPU and an Nvidia RTX 3070 Ti laptop GPU (8 GB VRAM), which is sufficient to run the ACT policy at roughly 50 Hz inference. The laptop is rigid-mounted on the frame and tethered to the operator's harness for cable management.[1]
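The battery capacity and runtime figures imply an average power draw, which can be checked with simple arithmetic. This sketch assumes the pack is fully drained over the reported runtime and ignores conversion losses, so the result is only a rough estimate.

```python
BATTERY_KWH = 1.26           # capacity stated in the text
RUNTIME_HOURS = (2.0, 3.0)   # reported teleoperation time per charge


def average_power_watts(capacity_kwh, hours):
    """Average draw that would empty the pack in the given time (ignores losses)."""
    return capacity_kwh * 1000.0 / hours


high_draw = average_power_watts(BATTERY_KWH, RUNTIME_HOURS[0])
low_draw = average_power_watts(BATTERY_KWH, RUNTIME_HOURS[1])
print(f"implied average draw: {low_draw:.0f} to {high_draw:.0f} W")
```

Under these assumptions the whole system would average roughly 420 to 630 W during teleoperation.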
Mobile ALOHA carries three Logitech C922x USB RGB webcams: one wrist-mounted on each follower arm and one third-person "top" camera looking down at the workspace from a fixed rigid mast. The cameras record at 480 by 640 resolution at 50 Hz. A fourth camera is sometimes mounted on the chassis for visualization during data collection but is not used by the policy network. The robot itself has no LIDAR, depth sensors, force/torque sensors, or tactile skin; localization is purely visual through the three RGB streams, and proprioception comes from the Dynamixel servo encoders and the Tracer's wheel odometry.[1]
The following table summarizes the materials cost reported in the original paper. Prices reflect early 2024 list prices in USD and have shifted somewhat since.
| Component | Description | Approximate cost (USD) |
|---|---|---|
| 4 x ViperX 300 robot arm | 6-DOF arms with grippers (2 leader, 2 follower) | $19,200 |
| AgileX Tracer mobile base | Differential drive AGV | $7,000 |
| 1.26 kWh lithium battery | Onboard power and ballast | $2,000 |
| Nvidia RTX 3070 Ti laptop | Onboard compute | $1,500 |
| 3 x Logitech C922x cameras | Two wrist plus one top RGB webcam | $300 |
| Teleoperation frame | 3D-printed and aluminum extrusion | $1,500 |
| Miscellaneous cables, mounts | USB hubs, CANBUS adapters, harness | $500 |
| Total | | approximately $32,000 |
The paper presents this as roughly 16 percent of the cost of a PR2-class research platform with comparable mobile bimanual capability, and the authors frame this cost ratio as the principal hardware contribution of the project.[1]
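The total and the 16 percent ratio can be reproduced from the table. The sketch below copies the line items as listed and uses the $200,000 order-of-magnitude figure quoted earlier for PR2-class platforms; that comparison price is an approximation, not an exact list price.

```python
bom_usd = {
    "4 x ViperX 300 robot arm": 19_200,
    "AgileX Tracer mobile base": 7_000,
    "1.26 kWh lithium battery": 2_000,
    "Nvidia RTX 3070 Ti laptop": 1_500,
    "3 x Logitech C922x cameras": 300,
    "Teleoperation frame": 1_500,
    "Miscellaneous cables, mounts": 500,
}
PR2_CLASS_USD = 200_000  # order-of-magnitude figure quoted earlier in the article

total = sum(bom_usd.values())
ratio = total / PR2_CLASS_USD
print(f"total: ${total:,}  ratio: {ratio:.0%}")
```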
The teleoperation interface is a direct extension of the leader-follower "puppeteering" scheme introduced in the original ALOHA. The operator stands behind the robot, harnessed to the mobile base at the waist with a rigid tether that is long enough to allow comfortable stepping but short enough that the operator's translation is mechanically coupled to the base. As the operator walks the rig around the workspace, the wheel encoders of the Tracer record linear and angular velocity, which are stored as part of the demonstration. Simultaneously, the operator grips the two leader arms, and the joint angles of these arms are streamed at 50 Hz to the two follower arms, which mirror the motion in real time. Triggers on the leader arm grippers control the open/close state of the follower grippers.[1]
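The data flow described above, in which leader joints are mirrored to the followers and the base velocity is logged each tick, can be sketched as a small recorder. This is an illustration only: `DemoStep`, `DemoRecorder`, and `tick` are hypothetical names, not part of the released Mobile ALOHA code, and real hardware I/O is replaced by plain function arguments.

```python
from dataclasses import dataclass, field
from typing import List

CONTROL_HZ = 50  # streaming rate stated in the text


@dataclass
class DemoStep:
    follower_targets: List[float]  # 14 joint targets (two arms, 6 joints + gripper each)
    base_velocity: List[float]     # [linear m/s, angular rad/s] from wheel odometry


@dataclass
class DemoRecorder:
    steps: List[DemoStep] = field(default_factory=list)

    def tick(self, left_leader, right_leader, base_linear, base_angular):
        """One control tick: mirror leader joints and log the base velocity."""
        if len(left_leader) != 7 or len(right_leader) != 7:
            raise ValueError("expected 7 values per arm (6 joints + gripper)")
        targets = list(left_leader) + list(right_leader)
        self.steps.append(DemoStep(targets, [base_linear, base_angular]))
        return targets  # what would be sent to the follower arms

    def duration_seconds(self):
        """Length of the recorded demonstration at the fixed control rate."""
        return len(self.steps) / CONTROL_HZ


recorder = DemoRecorder()
for _ in range(100):
    recorder.tick([0.0] * 7, [0.1] * 7, base_linear=0.3, base_angular=0.0)
print(recorder.duration_seconds())
```

Because leader and follower share the same joint layout, the mirrored targets need no retargeting step; the recorder simply concatenates the two arms.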
This design has two notable properties relative to alternatives such as virtual reality controllers, motion capture suits, or haptic exoskeletons. First, it requires no calibration of the operator's body kinematics, no battery-powered controllers, and no external tracking system, which makes data collection portable to arbitrary kitchens, hallways, and offices. Second, because the leader arms are mechanically identical to the follower arms (in DOF count, joint limits, and rough scale), the operator's commanded joint trajectory is automatically feasible on the follower, eliminating the kinematic re-targeting problems that affect VR teleoperation of robots with different morphology.
The paper reports a user study with eight participants in which task completion time decreased by 39 to 52 percent within the first five trials of practice, suggesting the interface has a relatively short learning curve. Sustained teleoperation sessions in the paper's data collection regime ran approximately 30 to 60 minutes before operator fatigue became significant.[1]
The policy that runs autonomously on the robot is trained with Action Chunking with Transformers (ACT), the same algorithm used by the original ALOHA, lightly extended to predict the base velocity along with the joint positions. The training recipe also incorporates a co-training step with the static ALOHA dataset, which is the paper's principal algorithmic finding.
ACT was introduced by Tony Z. Zhao and colleagues in April 2023 in the paper Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (arXiv:2304.13705).[6] At a high level, ACT is a conditional variational autoencoder over short chunks of future actions rather than over individual actions. During training, a transformer encoder takes the current proprioception together with the demonstrated action chunk and emits a latent style code z. The decoder, which is the policy used at deployment, is a second transformer that takes z, the current proprioception, and the images from the wrist and top cameras, and predicts a sequence of the next k actions (the chunk). At deployment time the encoder is discarded and z is set to the mean of the prior (zero). The policy can be queried at every timestep, and the overlapping chunk predictions for each timestep are blended with a temporal ensemble (an exponentially weighted average) to produce smooth trajectories.[6]
Action chunking dramatically reduces compounding error in behavior cloning by amortizing each policy decision over many timesteps, and it allows the policy to commit to multi-step plans (such as the full reach-grasp-lift sequence of a pan handle) without being derailed by short-term noise in the observation stream.
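Temporal ensembling, as described in the ACT paper, averages all chunk predictions that cover the same timestep, with weights exp(-m * i) where i = 0 is the oldest prediction. The sketch below shows that blending step in isolation. The dict-based storage and the function name are hypothetical, and the decay rate `m` is an illustrative value rather than a confirmed default.

```python
import numpy as np


def temporal_ensemble(chunks, t, m=0.01):
    """Blend every stored prediction for timestep t.

    chunks: dict mapping the step at which a chunk was predicted to an
            array of shape (k, action_dim); row j is the action for step s + j.
    m:      decay rate; weights are exp(-m * i) with i = 0 for the oldest
            prediction, so older predictions count slightly more.
    """
    preds = []
    for s in sorted(chunks):  # oldest prediction first
        offset = t - s
        if 0 <= offset < len(chunks[s]):
            preds.append(chunks[s][offset])
    if not preds:
        raise ValueError(f"no chunk covers timestep {t}")
    preds = np.stack(preds)
    weights = np.exp(-m * np.arange(len(preds)))
    weights /= weights.sum()
    return weights @ preds


# Two overlapping chunks of length 3 with 2-dimensional actions.
chunks = {0: np.zeros((3, 2)), 1: np.ones((3, 2))}
print(temporal_ensemble(chunks, 1))
```

A small `m` keeps the weights nearly uniform, which smooths the trajectory; a larger `m` concentrates weight on the oldest prediction.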
For Mobile ALOHA, the action space is extended from the 14-dimensional joint vector of the original ALOHA (7 DOF per arm, including gripper) to a 16-dimensional vector that adds linear and angular base velocity. The default chunk size is 45 timesteps at 50 Hz (roughly 0.9 seconds of look-ahead), the encoder has 4 transformer layers, the decoder has 7 transformer layers, the hidden dimension is 512, and the learning rate is 2e-5 with the Adam optimizer.[1]
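The dimensions above can be made concrete with a small packing helper. This is a sketch under one assumption that the text does not confirm: the ordering left arm, right arm, base velocity within the 16-dimensional vector. The helper names are hypothetical.

```python
import numpy as np

ARM_DOF = 7    # 6 joints + gripper per arm, as stated in the text
BASE_DOF = 2   # linear and angular base velocity
ACTION_DIM = 2 * ARM_DOF + BASE_DOF   # 16
CHUNK_SIZE = 45
CONTROL_HZ = 50
CHUNK_SECONDS = CHUNK_SIZE / CONTROL_HZ  # look-ahead covered by one chunk


def pack_action(left_arm, right_arm, base_velocity):
    """Concatenate arm targets and base velocity (ordering is an assumption)."""
    action = np.concatenate([left_arm, right_arm, base_velocity]).astype(np.float32)
    if action.shape != (ACTION_DIM,):
        raise ValueError(f"expected {ACTION_DIM} values, got {action.shape[0]}")
    return action


def unpack_action(action):
    """Split a 16-dim action back into (left arm, right arm, base velocity)."""
    return action[:ARM_DOF], action[ARM_DOF:2 * ARM_DOF], action[2 * ARM_DOF:]


a = pack_action(np.zeros(7), np.ones(7), [0.3, -0.1])
print(a.shape, CHUNK_SECONDS)
```

One predicted chunk is therefore an array of shape (45, 16), covering 45 / 50 = 0.9 seconds of motion.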
The paper also benchmarks Diffusion Policy (Chi et al., 2023) and VINN (Visual Imitation through Nearest Neighbors) as alternatives. Diffusion Policy uses a chunk size of 64 with a DDIM scheduler (50 training steps, 10 inference steps) and a learning rate of 1e-4. VINN uses a chunk size of 100 and a weighted nearest-neighbor lookup over state and visual features. ACT achieves the strongest results on the mobile tasks, although Diffusion Policy is competitive on several of them.[1]
The paper's signature methodological contribution is co-training with a much larger pre-existing static manipulation dataset, specifically the 825-episode, 12-task corpus that was released with the original ALOHA. For each gradient step, with equal probability the training procedure samples either a Mobile ALOHA episode for the target task or a static ALOHA episode from the broader corpus. Because the static episodes do not include base motion, the action vector is zero-padded along the two base velocity dimensions for those samples, which makes the dataset shapes compatible with the same 16-dimensional ACT output head.[1]
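The sampling and zero-padding steps described above can be sketched as follows. This is an illustration, not the released ACT++ data loader: the function names are hypothetical, actions are drawn individually rather than as full episodes, and the padded base dimensions are assumed to be the last two entries of the vector.

```python
import random
import numpy as np

ACTION_DIM = 16   # mobile action: 14 joint targets + 2 base velocities
STATIC_DIM = 14   # static ALOHA action: joint targets only


def pad_static_action(action14):
    """Append zero base velocities to a 14-dim static ALOHA action."""
    action14 = np.asarray(action14, dtype=np.float32)
    padding = np.zeros(ACTION_DIM - STATIC_DIM, dtype=np.float32)
    return np.concatenate([action14, padding])


def sample_action(mobile_actions, static_actions, rng, p_static=0.5):
    """Pick the source dataset first, then one action from it."""
    if rng.random() < p_static:
        chosen = static_actions[rng.randrange(len(static_actions))]
        return "static", pad_static_action(chosen)
    chosen = mobile_actions[rng.randrange(len(mobile_actions))]
    return "mobile", np.asarray(chosen, dtype=np.float32)


rng = random.Random(0)
mobile = [np.ones(ACTION_DIM)]
static = [np.ones(STATIC_DIM)]
source, action = sample_action(mobile, static, rng)
print(source, action.shape)
```

Choosing the dataset before choosing the sample keeps the mix at the chosen ratio regardless of how much larger the static corpus is than the per-task mobile set.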
The intuition is that the static dataset, despite covering different tasks, is rich in scene diversity and in fine-grained bimanual manipulation primitives. Sharing the encoder and decoder across both datasets prevents the small per-task mobile dataset (often only 20 to 50 episodes) from overfitting, and exposes the policy to a wider distribution of grippers, lighting, and table configurations than a single mobile task would cover.
The paper reports that this co-training boost is largest on visually demanding tasks (the Call Elevator task improves from 0 percent to 95 percent success), is robust to the static-to-mobile sampling ratio anywhere between 30 percent and 70 percent, and provides a roughly 20 percent absolute improvement on most tasks at fixed dataset size. The authors highlight this as the first published demonstration that static manipulation data can transfer beneficially to a mobile manipulation policy without any explicit domain adaptation.[1]
The paper reports autonomous evaluations on a suite of seven tasks, each with 50 (or in two cases 20) teleoperated demonstrations as the training set. The following table summarizes the autonomous success rates with and without co-training. All numbers are taken from the paper.
| Task | Demonstrations | ACT with co-training | ACT without co-training |
|---|---|---|---|
| Wipe Wine (clean a wine spill on a table) | 50 | 95% | 50% |
| Call Elevator (press button and enter cabin) | 50 | 95% | 0% |
| Use Cabinet (open two-door cabinet, store pot) | 50 | 85% | not reported |
| High Five (greet a person) | 20 | 85% | not reported |
| Rinse Pan (rinse used pan under faucet) | 50 | 80% | not reported |
| Push Chairs (push in several chairs) | 50 | 80% | not reported |
| Cook Shrimp (saute shrimp in pan) | 20 | 40% | not reported |
The Wipe Wine and Call Elevator tasks are the two for which the paper presents the full co-training ablation, and the gap between with and without co-training is dramatic in both cases: the policy trained only on its 50 task demonstrations cannot reliably localize the elevator button and never succeeds, while the co-trained policy succeeds in nineteen of twenty trials. The Cook Shrimp task, the one that produced most of the viral video coverage, is also the most difficult: it involves an entire multi-stage cooking workflow (pick up shrimp from a plate, place into pan, stir, transfer to serving plate) and achieves only 40 percent success even with co-training, a fact that the paper is explicit about but that did not always survive the social-media compression.[1][3]
The paper also reports data-efficiency results: a co-trained policy with 35 mobile demonstrations achieves 70 percent on Wipe Wine, which exceeds the 50 percent achieved by a non-co-trained policy with 50 mobile demonstrations. The authors interpret this as evidence that the principal value of co-training is sample efficiency, not just a final-performance boost.[1]
In the first week of January 2024, several short clips from the Mobile ALOHA project page accumulated tens of millions of views across X/Twitter, TikTok, and Chinese-language video platforms. The most-shared clip showed the robot stir-frying shrimp in a wok, and a second showed it cracking an egg into a bowl. Both clips were filmed in the kitchen of a Stanford lab and presented without on-screen captions in many of the social reshares.[3]
The robotics community quickly noted that several of the most-shared cooking clips were teleoperated data-collection sessions, not autonomous policy rollouts. The Mobile ALOHA project page itself labels each video clearly with either "autonomous" or "teleoperated" tags, and the paper is unambiguous that cooking shrimp autonomously is only a 40 percent success-rate behavior, but the labels were stripped when the clips were reposted without context. IEEE Spectrum's Evan Ackerman published a piece titled That Awesome Robot Demo Could Have a Human in the Loop on 8 January 2024 that used Mobile ALOHA as a case study in the broader problem of teleoperation versus autonomy in viral robot videos, arguing that descriptive captions and autonomy banners should be burned into the video frame rather than placed only in metadata.[3]
The authors responded by adding an explicit note to the Mobile ALOHA project page distinguishing the teleoperated and autonomous videos, and by releasing additional autonomous footage of the seven evaluated tasks. The autonomous shrimp-cooking rollouts reported in the paper (40 percent success) remained impressive, but they were not the same footage as the viral teleoperated clips. The episode is now frequently cited as a turning point in how the robotics community labels demo videos, and as part of why the LeRobot and Physical Intelligence ecosystems later adopted clearer in-video provenance labels.[3]
Mobile ALOHA sits in a fast-moving cluster of low-cost data-collection systems released in 2023 and 2024. The following table compares several common reference points, restricted to facts that are directly verifiable from the cited primary sources.
| System | Year | Hardware approach | Approx. cost (USD) | Mobile? | Bimanual? |
|---|---|---|---|---|---|
| ALOHA | 2023 | 2 ViperX + 2 WidowX, fixed table | $20k | no | yes |
| ALOHA 2 | 2024 | Improved 2 ViperX + 2 WidowX | $25-30k | no | yes |
| Mobile ALOHA | 2024 | 4 ViperX on AgileX Tracer + frame | $32k | yes | yes |
| UMI | 2024 | Handheld grippers + GoPro, no robot in data collection | under $500 per gripper | n/a | yes |
| DexCap | 2024 | Wearable glove with Manus VR + camera | not published | n/a | yes |
ALOHA, the predecessor system, established the leader-follower bimanual puppeteering interface and the ACT algorithm. It is fixed to a tabletop and has no mobility. ALOHA 2, released by a Google DeepMind team in May 2024 (arXiv:2405.02292), kept the tabletop form factor, introduced improved grippers and gravity compensation on the leader arms, and was open-sourced together with a MuJoCo Menagerie simulation model.[12] Mobile ALOHA reuses ALOHA's leader-follower manipulation setup but adds the mobile base and the tethered whole-body teleoperation frame; ALOHA 2 has not been mobilized in a comparable open-source way as of mid-2026, though several follow-on labs have integrated ALOHA 2 arms with their own mobile bases.
UMI (Universal Manipulation Interface), introduced by Cheng Chi and colleagues at Columbia, Stanford, and Toyota Research Institute in February 2024 (arXiv:2402.10329), takes a deliberately different approach: it dispenses with the robot during data collection entirely. The operator holds a 3D-printed soft gripper rig with a GoPro camera on the back, and demonstrations are recorded as in-the-wild human video, then post-processed into deployable policies by aligning the gripper with a downstream manipulator. UMI excels at portability (it can be carried into any environment) and makes data collection much cheaper, but it loses the proprioceptive richness of a robot demonstration and is fundamentally a single-arm or bimanual short-horizon manipulation system, not a mobile one.[13]
DexCap, introduced by Chen Wang and colleagues at Stanford in March 2024, takes yet another route by using a wearable Manus VR glove combined with cameras to capture high-degree-of-freedom finger trajectories for dexterous manipulation. DexCap is oriented toward hand-level dexterity (such as in-hand manipulation of small objects), where the ALOHA-class parallel-jaw grippers are inadequate. It does not address mobility.[14]
In terms of capability per dollar, Mobile ALOHA remains, as of mid-2026, the canonical low-cost reference for mobile bimanual research; UMI is the canonical reference for in-the-wild data collection without a robot; ALOHA 2 is the canonical reference for static bimanual manipulation; and DexCap occupies the dexterous-hand niche.
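The Mobile ALOHA team open-sourced all major artifacts under permissive licenses simultaneously with the paper: the hardware and teleoperation repository, the ACT++ training repository, and the teleoperated demonstration dataset.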
The hardware bill of materials is published as part of the paper's supplementary tutorial, including specific part numbers, vendor links, and assembly instructions, an approach that proved important for reproducibility because several follow-on academic labs assembled functioning Mobile ALOHA copies within months of the release.
The robotics company Trossen Robotics, which manufactures the ViperX 300 and WidowX arms that ALOHA and Mobile ALOHA use, introduced a commercial product line called The Aloha Project in late 2024 that productizes both the static and mobile variants. As of mid-2026 the product page lists three configurations:
The Aloha Mobile kit is sold built-to-order through Trossen's sales channel. Trossen also offers the Mobile AI package, which bundles the Aloha Mobile hardware with pre-installed software, a curated Nvidia workstation, and access to the LeRobot-compatible ACT and Diffusion Policy training stacks.[17] The commercial release has significantly lowered the engineering effort required for academic labs to acquire a working Mobile ALOHA, although as of mid-2026 it remains a research-grade product without industrial certifications.
Mobile ALOHA's influence on robotics research between January 2024 and mid-2026 has been considerable, both as a piece of accessible hardware and as a methodological exemplar.
Imitation learning and behavior cloning. Mobile ALOHA helped re-validate behavior cloning as a viable approach for complex mobile bimanual tasks at a time when many groups were focused on reinforcement learning or sim-to-real transfer. The co-training recipe (mixing a small task-specific dataset with a large generic dataset, with zero-padding for action-space mismatch) was adopted in subsequent vision-language-action models, including the data-mixing strategies described in Physical Intelligence's pi0 (October 2024) and pi0.5.[18]
Hardware popularization. By demonstrating a $32,000 mobile bimanual platform that could perform recognizable household tasks, Mobile ALOHA established a price point that pushed subsequent academic and industrial labs to keep their hardware in the low five figures rather than the low six. Many subsequent VLA papers, including OpenVLA-OFT, RDT-1B, and several entries in the LeRobot ecosystem, benchmark on Mobile ALOHA or ALOHA 2 setups precisely because the platform is reproducible.[18]
Influence on data collection practice. Mobile ALOHA's emphasis on whole-body teleoperation by a tethered operator influenced the data collection style adopted by Physical Intelligence's pi0 and subsequent generalist policies, which rely heavily on ALOHA and Trossen ALOHA Mobile rigs for real-world data.[18][19] The viral video controversy and the resulting community pressure toward clearer labeling of autonomous versus teleoperated demonstrations also reshaped norms for how robotics demos are presented on the web.
Citations and academic uptake. The paper accumulated several thousand citations on Google Scholar within eighteen months of release, placing it among the most-cited robot learning papers of 2024. It is now a standard reference in survey papers on bimanual manipulation, imitation learning, and low-cost robotics.
Limitations and follow-on work. The authors acknowledge several limitations: the 90 cm by 135 cm footprint is too wide for some doorways and corridors, the fixed arm-mounting height excludes floor-level tasks, the policy is single-task with no autonomous improvement loop, and the absence of force feedback or tactile sensors limits dexterous contact-rich behavior. Follow-on academic work has addressed several of these by integrating Mobile ALOHA hardware with telescoping torsos, adding tactile sensors, exploring multi-task and language-conditioned variants, and combining Mobile ALOHA data with foundation-model-based VLAs such as OpenVLA and pi0.[18]