Mobile ALOHA


Updated Jun 27, 2026


Mobile ALOHA is an open-source, low-cost system for collecting bimanual mobile manipulation data and learning household tasks from it, developed at Stanford University by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn and released on 4 January 2024.^[1] It mounts two robot arms on a wheeled base and adds a whole-body teleoperation interface, so a human can puppeteer both arms while walking the robot around a room. The complete platform costs roughly $32,000 USD in parts, and with about 50 human demonstrations per task plus co-training on a static ALOHA dataset it can autonomously saute shrimp, call an elevator, open a wall cabinet, and rinse a pan.^[1]

Mobile ALOHA was published alongside the arXiv preprint Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. It extends the original ALOHA (A Low-cost Open-source HArdware system for bimanual teleoperation) rig with an AgileX Tracer differential-drive base, a whole-body teleoperation interface that lets the operator drive the base while holding both leader arms, and a behavior-cloning recipe in which the learned policy uses the ACT (Action Chunking with Transformers) architecture co-trained on a mixture of newly collected mobile data and the static ALOHA dataset released the previous year.^[1]^[2] The paper states the cost directly: "We build Mobile ALOHA with a $32k budget, comparable to a single industrial cobot such as the Franka Emika Panda."^[1] That budget covers all four robot arms, the mobile base, batteries, cameras, and an Nvidia RTX 3070 Ti laptop for onboard compute.^[1]

The project was widely amplified on social media in January 2024 because of striking video clips that showed the robot performing household chores such as sauteing shrimp, calling an elevator, watering plants, opening a wall cabinet, and rinsing a pan. Some of these clips were autonomous policy rollouts from the paper, while others (in particular several cooking demonstrations widely shared on X/Twitter) were human teleoperated data-collection sessions, a distinction that the authors and IEEE Spectrum subsequently emphasized.^[3] The viral wave nonetheless helped popularize low-cost mobile bimanual imitation learning as a research direction and seeded a commercial product line at Trossen Robotics, including the Aloha Mobile kit.^[4]^[5]

This article describes the origin and authorship of Mobile ALOHA, the hardware bill of materials, the teleoperation interface, the ACT algorithm and the co-training recipe, the autonomous tasks reported in the paper along with their measured success rates, comparisons with related systems including ALOHA 2, the Universal Manipulation Interface (UMI), and DexCap, the open-source software and hardware releases, the Trossen Robotics commercialization, and the platform's lasting influence on later vision-language-action work such as Physical Intelligence's pi0 model and the LeRobot ecosystem.

What is Mobile ALOHA?

Mobile ALOHA is two things at once: a piece of low-cost hardware for collecting whole-body bimanual manipulation data, and a learning recipe that turns a few dozen of those demonstrations into an autonomous policy. The paper's abstract frames the gap it addresses plainly: "most results focus on table-top manipulation, lacking the mobility and dexterity necessary for generally useful tasks."^[1] Mobile ALOHA answers that by augmenting the static ALOHA bimanual teleoperation rig "with a mobile base, and a whole-body teleoperation interface," then performing supervised behavior cloning on the collected data.^[1]

The headline result, stated in the abstract, is that "with 50 demonstrations for each task, co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks such as sauteing and serving a piece of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan using a kitchen faucet."^[1] The combination of an accessible price point (around $32,000 versus the roughly $200,000 of prior mobile bimanual research robots) and a simple, reproducible learning method is what made the system influential in robot learning.^[1]

Origin and authorship

Mobile ALOHA was developed in Stanford's IRIS lab (Intelligence through Robotic Interaction at Scale), led by assistant professor Chelsea Finn, in collaboration with the broader Stanford AI Lab and the Stanford Robotics Center. The project's two student co-leads, Zipeng Fu and Tony Z. Zhao, had previously contributed to neighboring lines of work: Zhao was lead author of the original ALOHA system and the ACT algorithm in 2023 (Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, arXiv:2304.13705), and Fu had worked extensively on legged locomotion and whole-body control.^[6] Their goal was to combine the dexterity advantages of the ALOHA teleoperation interface with the reach and household relevance of a mobile platform, without falling into the cost trap of prior mobile manipulator research, which had typically required PR2 or TIAGo-class robots costing on the order of $200,000.^[1]

The team released the arXiv preprint, an extensive project page, two GitHub repositories (one for hardware and teleoperation, one for the ACT++ training code), a Google Drive containing several hundred teleoperated episodes, and high-resolution videos of both teleoperation and autonomous rollouts on 4 January 2024. Within seventy-two hours the project had crossed several million video views on X/Twitter, YouTube Shorts, and TikTok, prompting press coverage from MIT Technology Review, VentureBeat, TechCrunch, and Stanford Report.^[7]^[8]^[9]

What hardware does Mobile ALOHA use?

Mobile ALOHA inherits the four-arm leader-follower architecture of ALOHA but mounts the two follower arms on a wheeled chassis instead of a fixed tabletop. The two leader arms ride on the same frame, facing the human operator, who is tethered to the robot at the waist and moves it around the workspace by walking.

Robot arms

The system uses four Trossen Robotics ViperX 300 robot arms in total: two as the followers mounted on the mobile base and two as leaders on the teleoperation frame. Each ViperX 300 has six degrees of freedom with a gripper (treated as a seventh actuated joint), uses Dynamixel XM/XH series servos, has a reach of roughly 750 mm and a payload of about 750 g at full extension.^[10] The arms are mounted with their bases at approximately shoulder height for an average adult operator, which the paper notes is a deliberate ergonomic choice that balances reachability across countertops, sinks, and elevator buttons at the cost of access to floor-level appliances such as ovens and dishwashers.

Mobile base

The mobile base is the AgileX Tracer AGV, a low-profile differential-drive platform marketed for warehouse logistics. The Tracer itself has a footprint of roughly 569 mm by 445 mm; with the arms and frame added, the overall robot footprint is about 90 cm by 135 cm. The base has a top speed of 1.6 m/s (comparable to a brisk walking pace), a continuous payload of 100 kg, and CANBUS-based velocity control.^[1]^[11] The paper notes that the Tracer's low height of approximately 17 mm permits placing a heavy lithium battery near the floor, which acts as ballast and significantly improves tip-over stability when the arms apply forces overhead.

Onboard power and compute

A custom-built 1.26 kWh lithium-ion battery weighing approximately 14 kg supplies power to all four arms, the mobile base, cameras, and laptop. The paper reports roughly two to three hours of continuous teleoperation on a single charge. Compute is provided by a consumer-grade laptop with an Intel i7-12800H CPU and an Nvidia RTX 3070 Ti laptop GPU (8 GB VRAM), which is sufficient to run the ACT policy at roughly 50 Hz inference. The laptop is rigid-mounted on the frame and tethered to the operator's harness for cable management.^[1]

Cameras and sensors

Mobile ALOHA carries three Logitech C922x USB RGB webcams: one wrist-mounted on each follower arm and one third-person "top" camera looking down at the workspace from a fixed rigid mast. The cameras record at 480 by 640 resolution at 50 Hz. A fourth camera is sometimes mounted on the chassis for visualization during data collection but is not used by the policy network. The robot itself has no LIDAR, depth sensors, force/torque sensors, or tactile skin; localization is purely visual through the three RGB streams, and proprioception comes from the Dynamixel servo encoders and the Tracer's wheel odometry.^[1]

Bill of materials

The following table summarizes the materials cost reported in the original paper. Prices reflect early 2024 list prices in USD and have shifted somewhat since.

Component	Description	Approximate cost (USD)
4 x ViperX 300 robot arm	6-DOF arms with grippers (2 leader, 2 follower)	$19,200
AgileX Tracer mobile base	Differential drive AGV	$7,000
1.26 kWh lithium battery	Onboard power and ballast	$2,000
Nvidia RTX 3070 Ti laptop	Onboard compute	$1,500
3 x Logitech C922x cameras	Two wrist plus one top RGB webcam	$300
Teleoperation frame	3D-printed and aluminum extrusion	$1,500
Miscellaneous cables, mounts	USB hubs, CANBUS adapters, harness	$500
Total		approximately $32,000

The paper presents this as roughly 16 percent of the cost of a PR2-class research platform with comparable mobile bimanual capability, and the authors frame this cost ratio as the principal hardware contribution of the project.^[1]
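The total and the 16 percent figure can be checked with a few lines of arithmetic. The sketch below simply re-adds the component costs from the table and divides by the roughly $200,000 reference price quoted elsewhere in this article; the names `BOM_USD`, `bom_total`, and `cost_ratio` are illustrative and do not come from the paper or its repositories.

```python
# Approximate component costs in USD, copied from the table above.
BOM_USD = {
    "4 x ViperX 300 robot arm": 19_200,
    "AgileX Tracer mobile base": 7_000,
    "1.26 kWh lithium battery": 2_000,
    "Nvidia RTX 3070 Ti laptop": 1_500,
    "3 x Logitech C922x cameras": 300,
    "Teleoperation frame": 1_500,
    "Miscellaneous cables, mounts": 500,
}

# Rough price of a PR2-class platform, as quoted in this article.
REFERENCE_PLATFORM_USD = 200_000


def bom_total(bom: dict) -> int:
    """Sum the per-component costs."""
    return sum(bom.values())


def cost_ratio(total: float, reference: float) -> float:
    """Fraction of the reference platform's price."""
    return total / reference


total = bom_total(BOM_USD)
ratio = cost_ratio(total, REFERENCE_PLATFORM_USD)
print(f"total: ${total:,}")   # total: $32,000
print(f"ratio: {ratio:.0%}")  # ratio: 16%
```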

How does the whole-body teleoperation interface work?

The teleoperation interface is a direct extension of the leader-follower "puppeteering" scheme introduced in the original ALOHA. The operator stands behind the robot, harnessed to the mobile base at the waist with a rigid tether that is long enough to allow comfortable stepping but short enough that the operator's translation is mechanically coupled to the base. As the paper describes it, "the user is then physically tethered to the system and backdrives the wheels to enable base movement. This allows for independent movement of the base while the user has both hands controlling ALOHA."^[1] As the operator walks the rig around the workspace, the wheel encoders of the Tracer record linear and angular velocity, which are stored as part of the demonstration. Simultaneously, the operator grips the two leader arms, and the joint angles of these arms are streamed at 50 Hz to the two follower arms, which mirror the motion in real time. Triggers on the leader arm grippers control the open/close state of the follower grippers.^[1]
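As a rough illustration of what one recorded timestep might contain, the sketch below converts left and right wheel speeds into linear and angular base velocity using standard differential-drive kinematics, then bundles them with the two leader arms' joint readings. It is a minimal sketch under stated assumptions, not the project's recording code: `TRACK_WIDTH_M` is a made-up wheel separation (not a Tracer specification), and `DemoStep`, `base_velocity`, and `record_step` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

TRACK_WIDTH_M = 0.40  # hypothetical wheel separation, NOT a Tracer spec
CONTROL_HZ = 50       # streaming rate stated in the article


def base_velocity(v_left: float, v_right: float,
                  track_width: float = TRACK_WIDTH_M) -> Tuple[float, float]:
    """Differential-drive forward kinematics: wheel speeds -> (linear, angular)."""
    linear = (v_right + v_left) / 2.0
    angular = (v_right - v_left) / track_width
    return linear, angular


@dataclass
class DemoStep:
    t: float                          # seconds since the episode started
    arm_action: List[float]           # 14 values: 7 per leader arm, gripper included
    base_action: Tuple[float, float]  # (linear m/s, angular rad/s)


def record_step(step_index: int,
                left_leader: Sequence[float],
                right_leader: Sequence[float],
                v_left: float, v_right: float) -> DemoStep:
    """Bundle one timestep of leader-arm joints and base velocity."""
    if len(left_leader) != 7 or len(right_leader) != 7:
        raise ValueError("expected 7 joint values per arm (6 joints + gripper)")
    return DemoStep(
        t=step_index / CONTROL_HZ,
        arm_action=list(left_leader) + list(right_leader),
        base_action=base_velocity(v_left, v_right),
    )


# Walking straight ahead at 0.5 m/s, one second into the episode.
step = record_step(50, [0.0] * 7, [0.1] * 7, v_left=0.5, v_right=0.5)
print(step.t, step.base_action)
```

Storing the base command as a velocity pair rather than a pose keeps each timestep independent of where the episode started, which matches the article's description of velocities being stored with the demonstration.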

This design has two notable properties relative to alternatives such as virtual reality controllers, motion capture suits, or haptic exoskeletons. First, it requires no calibration of the operator's body kinematics, no battery-powered controllers, and no external tracking system, which makes data collection portable to arbitrary kitchens, hallways, and offices. Second, because the leader arms are mechanically identical to the follower arms (in DOF count, joint limits, and rough scale), the operator's commanded joint trajectory is automatically feasible on the follower, eliminating the kinematic re-targeting problems that affect VR teleoperation of robots with different morphology.

The paper reports a user study with eight participants in which task completion time decreased by 39 to 52 percent within the first five trials of practice, suggesting the interface has a relatively short learning curve. Sustained teleoperation sessions in the paper's data collection regime ran approximately 30 to 60 minutes before operator fatigue became significant.^[1]

How does Mobile ALOHA learn tasks?

The policy that runs autonomously on the robot is trained with Action Chunking with Transformers (ACT), the same algorithm used by the original ALOHA, lightly extended to predict the base velocity along with the joint positions. The training recipe also incorporates a co-training step with the static ALOHA dataset, which is the paper's principal algorithmic finding. The whole pipeline is supervised imitation learning (behavior cloning): there is no reward function, no reinforcement learning, and no simulation in the loop for the real-world tasks.

ACT (Action Chunking with Transformers)

ACT was introduced by Tony Z. Zhao and colleagues in April 2023 in the paper Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (arXiv:2304.13705).^[6] At a high level, ACT is a conditional variational autoencoder over short chunks of future actions rather than over individual actions. The CVAE encoder is a transformer used only during training: it takes the current proprioception and the demonstrated action chunk and emits a latent style code z. The decoder, which is the policy that runs on the robot, is a second transformer that takes z, the current proprioception, and the images from the wrist and top cameras, and predicts a sequence of the next k actions (the chunk); at test time z is fixed to the mean of the prior (zero). At deployment time, the policy can be queried at every timestep and the overlapping chunks are blended with a temporal ensemble, an exponentially weighted average of all predictions made for the current timestep, to produce smooth trajectories.^[6]

Action chunking dramatically reduces compounding error in behavior cloning by amortizing each policy decision over many timesteps, and it allows the policy to commit to multi-step plans (such as the full reach-grasp-lift sequence of a pan handle) without being derailed by short-term noise in the observation stream.
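The blending of overlapping chunks at deployment time can be sketched in a few lines. The example below is a simplified, dependency-free illustration of temporal ensembling, following the exponential weighting w_i = exp(-m * i) described in the ACT paper, where i = 0 is the oldest prediction for the current step. The function name `temporal_ensemble_rollout`, the `policy` callable, and the default `m = 0.01` are illustrative choices, not the released implementation.

```python
import math
from typing import Callable, List

Action = List[float]


def temporal_ensemble_rollout(policy: Callable[[int], List[Action]],
                              num_steps: int,
                              chunk_size: int,
                              m: float = 0.01) -> List[Action]:
    """Query the policy every step and blend all predictions made for that step.

    `policy(t)` is assumed to return `chunk_size` actions for steps
    t, t+1, ..., t+chunk_size-1.
    """
    # predictions[s] holds every action proposed for step s, oldest query first.
    predictions: List[List[Action]] = [[] for _ in range(num_steps)]
    executed: List[Action] = []
    for t in range(num_steps):
        chunk = policy(t)[:chunk_size]
        for offset, action in enumerate(chunk):
            if t + offset < num_steps:
                predictions[t + offset].append(action)

        candidates = predictions[t]
        # Weight exp(-m * i): i = 0 is the oldest prediction for this step.
        weights = [math.exp(-m * i) for i in range(len(candidates))]
        norm = sum(weights)
        dim = len(candidates[0])
        blended = [
            sum(w * a[d] for w, a in zip(weights, candidates)) / norm
            for d in range(dim)
        ]
        executed.append(blended)
    return executed


# Toy policy: every query predicts "action at step s equals s".
def toy_policy(t: int, k: int = 5) -> List[Action]:
    return [[float(t + i)] for i in range(k)]


trajectory = temporal_ensemble_rollout(toy_policy, num_steps=8, chunk_size=5)
print([round(a[0], 3) for a in trajectory])
```

A small m weights old and new predictions almost equally, which favors smoothness; a large m leans on the oldest prediction for each step and reacts more slowly to new observations.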

For Mobile ALOHA, the action space is extended from the 14-dimensional joint vector of the original ALOHA (7 DOF per arm, including gripper) to a 16-dimensional vector that adds linear and angular base velocity. The default chunk size is 45 timesteps at 50 Hz (roughly 0.9 seconds of look-ahead), the encoder has 4 transformer layers, the decoder has 7 transformer layers, the hidden dimension is 512, and the learning rate is 2e-5 with the Adam optimizer.^[1]
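The hyperparameters and the action layout above can be collected into a small configuration object. This is a sketch for orientation only: the class `ActConfig` and the helper `split_action` are hypothetical names, and the ordering of the 16 values (left arm, right arm, then base velocity) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ActConfig:
    # Values as reported in the article for Mobile ALOHA's ACT policy.
    chunk_size: int = 45
    control_hz: int = 50
    enc_layers: int = 4
    dec_layers: int = 7
    hidden_dim: int = 512
    learning_rate: float = 2e-5
    arm_dims: int = 14   # 7 per arm, gripper included
    base_dims: int = 2   # linear and angular velocity

    @property
    def action_dim(self) -> int:
        return self.arm_dims + self.base_dims

    @property
    def lookahead_seconds(self) -> float:
        return self.chunk_size / self.control_hz


def split_action(action: Sequence[float],
                 cfg: ActConfig) -> Tuple[List[float], List[float], List[float]]:
    """Split one action into (left arm, right arm, base); ordering is assumed."""
    if len(action) != cfg.action_dim:
        raise ValueError(f"expected {cfg.action_dim} values, got {len(action)}")
    half = cfg.arm_dims // 2
    return (list(action[:half]),
            list(action[half:cfg.arm_dims]),
            list(action[cfg.arm_dims:]))


cfg = ActConfig()
print(cfg.action_dim, cfg.lookahead_seconds)  # 16 0.9
left, right, base = split_action(list(range(16)), cfg)
print(base)  # [14, 15]
```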

The paper also benchmarks Diffusion Policy (Chi et al., 2023) and VINN (visual nearest neighbor) as alternatives. Diffusion Policy uses a chunk size of 64 with a DDIM scheduler (50 training steps, 10 inference steps) and a learning rate of 1e-4. VINN uses a chunk size of 100 and a weighted nearest-neighbor lookup over state and visual features. ACT achieves the strongest results on the mobile tasks, although Diffusion Policy is competitive on several of them.^[1]

Co-training with static ALOHA data

The paper's signature methodological contribution is co-training with a much larger pre-existing static manipulation dataset, specifically the 825-episode, 12-task corpus that was released with the original ALOHA. For each gradient step, with equal probability the training procedure samples either a Mobile ALOHA episode for the target task or a static ALOHA episode from the broader corpus. Because the static episodes do not include base motion, the action vector is zero-padded along the two base velocity dimensions for those samples, which makes the dataset shapes compatible with the same 16-dimensional ACT output head.^[1]
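The sampling and zero-padding step can be sketched as follows. The example is deliberately simplified: each "episode" is reduced to a single action vector, the coin flip is made per sample, and the names `pad_static_action` and `sample_cotraining_batch` are hypothetical; the released data loader may organize this differently.

```python
import random
from typing import List, Sequence, Tuple

ARM_DIMS = 14   # static ALOHA actions: 7 values per arm
BASE_DIMS = 2   # linear and angular base velocity added by Mobile ALOHA


def pad_static_action(arm_action: Sequence[float]) -> List[float]:
    """Append zero base velocity so a static action fits the 16-dim head."""
    if len(arm_action) != ARM_DIMS:
        raise ValueError(f"expected {ARM_DIMS} values, got {len(arm_action)}")
    return list(arm_action) + [0.0] * BASE_DIMS


def sample_cotraining_batch(mobile_actions: Sequence[Sequence[float]],
                            static_actions: Sequence[Sequence[float]],
                            batch_size: int,
                            rng: random.Random,
                            mobile_prob: float = 0.5
                            ) -> List[Tuple[str, List[float]]]:
    """Draw each sample from the mobile or static pool with the given probability."""
    batch = []
    for _ in range(batch_size):
        if rng.random() < mobile_prob:
            batch.append(("mobile", list(rng.choice(mobile_actions))))
        else:
            batch.append(("static", pad_static_action(rng.choice(static_actions))))
    return batch


rng = random.Random(0)
mobile_pool = [[0.1] * 16]   # toy stand-in for mobile demonstrations
static_pool = [[0.2] * 14]   # toy stand-in for static ALOHA demonstrations
batch = sample_cotraining_batch(mobile_pool, static_pool, batch_size=8, rng=rng)
for source, action in batch:
    print(source, len(action), action[-2:])
```

Because padding happens in the data loader, the model itself needs no branch for the two data sources: every target it sees has the same 16-dimensional shape.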

The intuition is that the static dataset, despite covering different tasks, is rich in scene diversity and in fine-grained bimanual manipulation primitives. Sharing the encoder and decoder across both datasets prevents the small per-task mobile dataset (often only 20 to 50 episodes) from overfitting, and exposes the policy to a wider distribution of grippers, lighting, and table configurations than a single mobile task would cover.

The paper quantifies the effect precisely: "With co-training, we are able to achieve over 80% success on these tasks with only 50 human demonstrations per task, with an average of 34% absolute improvement compared to no co-training."^[1] The peak gain is larger still: the abstract reports that co-training can increase success rates "by up to 90%," and the Call Elevator task improves from 0 percent to 95 percent. The boost is robust to the static-to-mobile sampling ratio anywhere between 30 percent and 70 percent. The authors highlight this as the first published demonstration that static manipulation data can transfer beneficially to a mobile manipulation policy without any explicit domain adaptation.^[1]

What tasks can Mobile ALOHA do, and how well?

The paper reports autonomous evaluations on a suite of seven tasks, each with 50 (or in two cases 20) teleoperated demonstrations as the training set. The following table summarizes the autonomous success rates with and without co-training, taken from Table 1 of the paper.

Task	Demonstrations	ACT with co-training	ACT without co-training
Wipe Wine (clean a wine spill on a table)	50	100%	50%
Call Elevator (press button and enter cabin)	50	95%	0%
Use Cabinet (open two-door cabinet, store pot)	50	85%	not reported
High Five (greet a person)	20	85%	not reported
Rinse Pan (rinse used pan under faucet)	50	80%	not reported
Push Chairs (push in several chairs)	50	80%	not reported
Cook Shrimp (saute shrimp in pan)	20	40%	not reported

The Wipe Wine and Call Elevator tasks are the two for which the paper presents the full co-training ablation, and the gap between with and without co-training is dramatic in both cases: the policy trained only on its 50 task demonstrations cannot reliably localize the elevator button and never succeeds, while the co-trained policy succeeds in nineteen of twenty trials. The Cook Shrimp task, the one that produced most of the viral video coverage, is also the most difficult: it involves an entire multi-stage cooking workflow (pick up shrimp from a plate, place into pan, stir, transfer to serving plate) and achieves only 40 percent success even with co-training, a fact that the paper is explicit about but that did not always survive the social-media compression.^[1]^[3]

The paper also reports data-efficiency results: a co-trained policy with 35 mobile demonstrations achieves 70 percent on Wipe Wine, which exceeds the 50 percent achieved by a non-co-trained policy with 50 mobile demonstrations. The authors interpret this as evidence that the principal value of co-training is sample efficiency, not just a final-performance boost.^[1]
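The figures in this section are absolute success rates, so the gains from co-training are best read as percentage-point differences. The short sketch below recomputes the two gains that the table above fully specifies and the 19-of-20 trial rate; it cannot reproduce the paper's 34 percent average, because the table lists the no-co-training baseline for only two tasks. The names `RESULTS`, `success_rate`, and `absolute_gain` are illustrative.

```python
from typing import Dict, Optional, Tuple

# (with co-training, without co-training) in percent, from the table above.
# None marks baselines the table lists as "not reported".
RESULTS: Dict[str, Tuple[int, Optional[int]]] = {
    "Wipe Wine": (100, 50),
    "Call Elevator": (95, 0),
    "Use Cabinet": (85, None),
    "High Five": (85, None),
    "Rinse Pan": (80, None),
    "Push Chairs": (80, None),
    "Cook Shrimp": (40, None),
}


def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent."""
    return 100.0 * successes / trials


def absolute_gain(with_co: int, without_co: Optional[int]) -> Optional[int]:
    """Percentage-point gain from co-training, or None if no baseline is listed."""
    if without_co is None:
        return None
    return with_co - without_co


gains = {task: absolute_gain(*pair) for task, pair in RESULTS.items()}
reported = {task: g for task, g in gains.items() if g is not None}

print(success_rate(19, 20))  # 95.0
print(reported)              # {'Wipe Wine': 50, 'Call Elevator': 95}
```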

Were the viral cooking videos real?

In the first week of January 2024, several short clips from the Mobile ALOHA project page accumulated tens of millions of views across X/Twitter, TikTok, and Chinese-language video platforms. The most-shared clip showed the robot stir-frying shrimp in a wok, and a second showed it cracking an egg into a bowl. Both clips were filmed in the kitchen of a Stanford lab and presented without on-screen captions in many of the social reshares.^[3]

The robotics community quickly noted that several of the most-shared cooking clips were teleoperated data-collection sessions, not autonomous policy rollouts. The Mobile ALOHA project page itself separates an "Autonomous Skills" section from a "Teleoperation" section, and the paper is unambiguous that cooking shrimp autonomously is only a 40 percent success-rate behavior, but that context was stripped when the clips were reposted. IEEE Spectrum's Evan Ackerman published a piece titled That Awesome Robot Demo Could Have a Human in the Loop on 8 January 2024 that used Mobile ALOHA as a case study in the broader problem of teleoperation versus autonomy in viral robot videos, arguing that descriptive captions and autonomy banners should be burned into the video frame rather than placed only in metadata.^[3]

The authors responded by adding an explicit note to the Mobile ALOHA project page distinguishing the teleoperated and autonomous videos, and by releasing additional autonomous footage of the seven evaluated tasks. The cooking demonstrations that were autonomous (40 percent success on shrimp, as reported in the paper) remained legitimately impressive but were not the same artifact as the viral clips. The episode is now frequently cited as a turning point in how the robotics community labels demo videos, and as part of why the LeRobot and Physical Intelligence ecosystems later adopted clearer in-video provenance labels.^[3]

How does Mobile ALOHA compare to ALOHA, UMI, and DexCap?

Mobile ALOHA sits in a fast-moving cluster of low-cost data-collection systems released in 2023 and 2024. The following table compares several common reference points, restricted to facts that are directly verifiable from the cited primary sources.

System	Year	Hardware approach	Approx. cost (USD)	Mobile?	Bimanual?
ALOHA	2023	2 ViperX + 2 WidowX, fixed table	$20k	no	yes
ALOHA 2	2024	Improved 2 ViperX + 2 WidowX	$25-30k	no	yes
Mobile ALOHA	2024	4 ViperX on AgileX Tracer + frame	$32k	yes	yes
UMI	2024	Handheld grippers + GoPro, no robot in data collection	under $500 per gripper	n/a	yes
DexCap	2024	Wearable glove with Manus VR + camera	not published	n/a	yes

ALOHA, the predecessor system, established the leader-follower bimanual puppeteering interface and the ACT algorithm. It is fixed to a tabletop and has no mobility. ALOHA 2, released by a Google DeepMind team in May 2024 (arXiv:2405.02292), kept the tabletop form factor but introduced improved grippers and gravity compensation on the leader arms, and was open-sourced together with a MuJoCo Menagerie simulation model.^[12] Mobile ALOHA reuses ALOHA's manipulation hardware almost unchanged but adds the mobile base and the whole-body teleoperation frame; ALOHA 2 has not been mobilized in a comparable open-source way as of mid-2026, though several follow-on labs have integrated ALOHA 2 arms with their own mobile bases.

UMI (Universal Manipulation Interface), introduced by Cheng Chi and colleagues at Columbia, Stanford, and Toyota Research Institute in February 2024 (arXiv:2402.10329), is a deliberately different approach: it dispenses with the robot during data collection entirely. The operator holds a 3D-printed soft gripper rig with a GoPro camera on the back, and demonstrations are recorded as in-the-wild human video, then post-processed into deployable policies by aligning the gripper with a downstream manipulator. UMI excels at portability (it can be carried into any environment) and is much cheaper to collect data with, but it loses the proprioceptive richness of a robot demonstration and is fundamentally a single-handed-or-bimanual short-horizon manipulation system, not a mobile one.^[13]

DexCap, introduced by Chen Wang and colleagues at Stanford in March 2024, takes yet another route by using a wearable Manus VR glove combined with cameras to capture high-degree-of-freedom finger trajectories for dexterous manipulation. DexCap is oriented toward hand-level dexterity (such as in-hand manipulation of small objects), where the ALOHA-class parallel-jaw grippers are inadequate. It does not address mobility.^[14]

In terms of capability per dollar, Mobile ALOHA remains, as of mid-2026, the canonical low-cost reference for mobile bimanual research; UMI is the canonical reference for in-the-wild data collection without a robot; ALOHA 2 is the canonical reference for static bimanual manipulation; and DexCap occupies the dexterous-hand niche.

Is Mobile ALOHA open source?

Yes. The Mobile ALOHA team open-sourced all major artifacts under permissive licenses simultaneously with the paper:

Hardware repository (https://github.com/MarkFzp/mobile-aloha) holds the teleoperation and data-collection code, the ROS 1 Noetic launch files for the four arms and three cameras, udev rules for persistent device naming, CANBUS configuration for the Tracer, and the 3D-printed STL files for the teleoperation frame. It is tested on Ubuntu 18.04 and 20.04. The repository had crossed 4,400 stars on GitHub by mid-2025.^[2]
ACT++ training repository (https://github.com/MarkFzp/act-plus-plus) holds the imitation learning code, including ACT, Diffusion Policy, and VINN implementations, the co-training data loader that supports zero-padding the static ALOHA actions to 16 dimensions, and two MuJoCo simulation environments (Transfer Cube and Bimanual Insertion) for sanity checks.^[15]
Dataset is hosted on Google Drive and consists of the per-task teleoperated demonstrations used in the paper, totalling several hundred episodes recorded as HDF5 files containing synchronized camera frames, joint states, and base velocities.

The hardware bill of materials is published as part of the paper's supplementary tutorial, including specific part numbers, vendor links, and assembly instructions, an approach that proved important for reproducibility because several follow-on academic labs assembled functioning Mobile ALOHA copies within months of the release.

How much does Mobile ALOHA cost, and can you buy one?

The research build costs approximately $32,000 USD in parts, as the paper states verbatim: "We build Mobile ALOHA with a $32k budget, comparable to a single industrial cobot such as the Franka Emika Panda."^[1] For teams that prefer not to assemble the rig themselves, the robotics company Trossen Robotics, which manufactures the ViperX 300 and WidowX arms that ALOHA and Mobile ALOHA use, introduced a commercial product line called The Aloha Project in late 2024 that productizes both the static and mobile variants. As of mid-2026 the product page lists three configurations:

Aloha Solo, a portable single-arm leader-follower kit with Intel RealSense cameras and a tripod mount, launched November 2024 at a base price of $8,999.95.^[16]
Aloha Stationary, a tabletop bimanual kit corresponding to the original ALOHA / ALOHA 2 form factor.
Aloha Mobile, the productized Mobile ALOHA kit, built around the AgileX Tracer base and the four-arm leader-follower frame.^[4]

The Aloha Mobile kit is sold built-to-order through Trossen's sales channel. Trossen also offers the Mobile AI package, which bundles the Aloha Mobile hardware with pre-installed software, a curated Nvidia workstation, and access to the LeRobot-compatible ACT and Diffusion Policy training stacks.^[17] The commercial release has significantly lowered the engineering effort required for academic labs to acquire a working Mobile ALOHA, although as of mid-2026 it remains a research-grade product without industrial certifications.

What impact has Mobile ALOHA had?

Mobile ALOHA's influence on robotics research between January 2024 and mid-2026 has been considerable, both as a piece of accessible hardware and as a methodological exemplar.

Imitation learning and behavior cloning. Mobile ALOHA helped re-validate behavior cloning as a viable approach for complex mobile bimanual tasks at a time when many groups were focused on reinforcement learning or sim-to-real transfer. The co-training recipe (mixing a small task-specific dataset with a large generic dataset, with zero-padding for action-space mismatch) was adopted in subsequent vision-language-action models, including the data-mixing strategies described in Physical Intelligence's pi0 (October 2024) and pi0.5.^[18]

Hardware popularization. By demonstrating a $32,000 mobile bimanual platform that could perform recognizable household tasks, Mobile ALOHA established a price point that pushed subsequent academic and industrial labs to keep their hardware in the low five figures rather than the low six. Many subsequent VLA papers, including OpenVLA-OFT, RDT-1B, and several entries in the LeRobot ecosystem, benchmark on Mobile ALOHA or ALOHA 2 setups precisely because the platform is reproducible.^[18]

Influence on data collection practice. Mobile ALOHA's emphasis on whole-body teleoperation by a tethered operator influenced the data collection style adopted by Physical Intelligence's pi0 and subsequent generalist policies, which rely heavily on ALOHA and Trossen ALOHA Mobile rigs for real-world data.^[18]^[19] The viral video controversy and the resulting community pressure toward clearer labeling of autonomous versus teleoperated demonstrations also reshaped norms for how robotics demos are presented on the web.

Citations and academic uptake. The paper has accumulated several hundred citations (438 indexed on Semantic Scholar as of mid-2026, of which 22 are flagged as highly influential), placing it among the most-cited robot learning papers of 2024.^[20] It is now a standard reference in survey papers on bimanual manipulation, imitation learning, and low-cost robotics.

Limitations and follow-on work. The authors acknowledge several limitations: the 90 cm by 135 cm footprint is too wide for some doorways and corridors, the fixed arm-mounting height excludes floor-level tasks, the policy is single-task with no autonomous improvement loop, and the absence of force feedback or tactile sensors limits dexterous contact-rich behavior. Follow-on academic work has addressed several of these by integrating Mobile ALOHA hardware with telescoping torsos, adding tactile sensors, exploring multi-task and language-conditioned variants, and combining Mobile ALOHA data with foundation-model-based VLAs such as OpenVLA and pi0.^[18]

References

1. Fu, Zipeng; Zhao, Tony Z.; Finn, Chelsea. *Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation*. arXiv:2401.02117, 4 January 2024. https://arxiv.org/abs/2401.02117. Accessed 2026-06-27.
2. Mobile ALOHA hardware and teleoperation code repository. https://github.com/MarkFzp/mobile-aloha. Accessed 2026-06-27.
3. Ackerman, Evan. *That Awesome Robot Demo Could Have a Human in the Loop*. IEEE Spectrum, 8 January 2024. https://spectrum.ieee.org/robot-teleoperation-autonomy. Accessed 2026-06-27.
4. Trossen Robotics, *Aloha Mobile* product page. https://www.trossenrobotics.com/aloha-mobile. Accessed 2026-06-27.
5. Trossen Robotics, *The Aloha Project* overview. https://www.trossenrobotics.com/the-aloha-project. Accessed 2026-06-27.
6. Zhao, Tony Z.; Kumar, Vikash; Levine, Sergey; Finn, Chelsea. *Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware*. arXiv:2304.13705, 23 April 2023. https://arxiv.org/abs/2304.13705. Accessed 2026-06-27.
7. Mobile ALOHA project page. https://mobile-aloha.github.io/. Accessed 2026-06-27.
8. Stanford Report, *Meet the robot that can saute shrimp*, April 2024. https://news.stanford.edu/stories/2024/04/meet-robot-that-can-saute-shrimp. Accessed 2026-06-27.
9. MIT Technology Review, *Watch this robot cook shrimp and clean autonomously*, 15 January 2024. https://www.technologyreview.com/2024/01/15/1086592/watch-this-robot-cook-shrimp-and-clean-autonomously/. Accessed 2026-06-27.
10. Trossen Robotics, *ViperX 300 Robot Arm* product specifications. https://www.trossenrobotics.com/viperx-300-robot-arm.aspx. Accessed 2026-06-27.
11. AgileX Robotics, *Tracer mini differential AGV* product page. https://global.agilex.ai/products/tracer-mini. Accessed 2026-06-27.
12. ALOHA 2 Team. *ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation*. arXiv:2405.02292, May 2024. https://arxiv.org/abs/2405.02292. Accessed 2026-06-27.
13. Chi, Cheng; et al. *Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots*. arXiv:2402.10329, February 2024. https://arxiv.org/abs/2402.10329. Accessed 2026-06-27.
14. Wang, Chen; et al. *DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation*. arXiv:2403.07788, March 2024. https://arxiv.org/abs/2403.07788. Accessed 2026-06-27.
15. ACT++ training repository for Mobile ALOHA. https://github.com/MarkFzp/act-plus-plus. Accessed 2026-06-27.
16. Yahoo Finance / AccessNewswire, *Trossen Robotics Launches Aloha Solo: Affordable and Portable Machine Learning Lab for Advanced Robotics Research*, November 2024. https://finance.yahoo.com/news/trossen-robotics-launches-aloha-solo-100000059.html. Accessed 2026-06-27.
17. Trossen Robotics, *Mobile AI* product bundle. https://www.trossenrobotics.com/mobile-ai. Accessed 2026-06-27.
18. Physical Intelligence, *Our First Generalist Policy (pi0)*. https://www.physicalintelligence.company/blog/pi0. Accessed 2026-06-27.
19. The Robot Report, *Physical Intelligence open-sources Pi0 robotics foundation model*. https://www.therobotreport.com/physical-intelligence-open-sources-pi0-robotics-foundation-model/. Accessed 2026-06-27.
20. Semantic Scholar, *Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation* (citation record). https://www.semanticscholar.org/paper/fc3819a50705fc3cf90ab92f2a206b858fef3b19. Accessed 2026-06-27.
