ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) is an open-source robotic platform designed for collecting bimanual manipulation demonstrations and training imitation learning policies. Developed at Stanford University by Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn, the system was introduced in a 2023 paper published at the Robotics: Science and Systems (RSS) conference. ALOHA pairs two leader robot arms with two follower robot arms, allowing a human operator to teleoperate both follower arms simultaneously by physically moving the leaders. The complete system costs under $20,000, roughly one-fifth to one-tenth the cost of comparable research-grade bimanual setups. Alongside the hardware, the team introduced Action Chunking with Transformers (ACT), an imitation learning algorithm that predicts sequences of future actions rather than single timesteps, which reduces compounding errors in learned policies.
Since its release, ALOHA has become one of the most widely adopted platforms in robot learning research. It has spawned several follow-up systems: Mobile ALOHA (2024), which adds a mobile base for whole-body manipulation; ALOHA 2 (2024), an enhanced hardware revision by Google DeepMind; and ALOHA Unleashed (2024), which combines ALOHA 2 with large-scale diffusion policy training to achieve complex dexterous tasks like tying shoelaces. The hardware designs, software, and simulation models are all publicly available, and commercial kits are sold by Trossen Robotics.
Bimanual manipulation, where a robot uses two arms to coordinate on a single task, is needed for many real-world activities such as folding laundry, cooking, and assembling parts. However, research in bimanual robot learning has historically been limited by the high cost of hardware. Industrial-grade dual-arm setups with force-torque sensors and high-precision actuators can easily exceed $100,000 to $200,000, placing them out of reach for most academic labs. Even when hardware is available, collecting demonstration data for imitation learning often requires specialized teleoperation interfaces (such as VR controllers or exoskeletons) that add cost and complexity.
The ALOHA project was motivated by the idea that affordable, open-source hardware could democratize bimanual manipulation research. By using off-the-shelf hobby-grade robot arms and a simple "puppeteering" teleoperation approach (where the operator physically backdrives smaller leader arms that are kinematically matched to the larger follower arms), the developers aimed to eliminate the need for expensive sensors, calibration procedures, or specialized teleoperation equipment.
The original ALOHA system uses four robot arms from Trossen Robotics: two ViperX 300 arms as the followers (the arms that actually perform tasks) and two WidowX 250 arms as the leaders (the arms the human operator moves). The WidowX leader arms share the same kinematic structure as the ViperX followers but in a smaller, lighter form factor, which makes them easy to backdrive by hand.
| Component | Specification |
|---|---|
| Follower arms | 2x Trossen Robotics ViperX 300, 6 DoF each |
| Leader arms | 2x Trossen Robotics WidowX 250, 6 DoF each |
| Grippers | Parallel-jaw grippers on all four arms |
| Control frequency | 50 Hz joint position synchronization |
| Cameras | 4 RGB cameras (480 x 640 resolution): 2 stationary, 2 wrist-mounted |
| Total cost | Under $20,000 |
| Communication | USB-based, synchronized via ROS |
During teleoperation, the joint positions of the leader arms are read at 50 Hz and sent as position commands to the corresponding follower arms. The operator simply grasps the leader arms and moves them; the followers mirror the motion in real time. This approach requires no force sensors, no motion capture, and no calibration beyond the initial mechanical setup. The grippers on the leader arms are mechanically linked to the follower grippers, so opening and closing the leader gripper directly controls the follower gripper.
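The leader-follower loop described above can be sketched in a few lines. This is a minimal stand-in, not the real driver code: the `Arm` class and its two methods are hypothetical placeholders for the DYNAMIXEL servo interface that the actual system accesses over USB via the Interbotix/ROS stack.

```python
RATE_HZ = 50                      # joint-position sync rate from the paper
DT = 1.0 / RATE_HZ

# Hypothetical in-memory stand-in for a 6-DoF arm; the real system reads and
# writes servo registers over USB through the Interbotix drivers.
class Arm:
    def __init__(self, n_joints=6):
        self.q = [0.0] * n_joints
    def read_joint_positions(self):          # leader: backdriven by hand
        return list(self.q)
    def command_joint_positions(self, q):    # follower: position control
        self.q = list(q)

leader_left, follower_left = Arm(), Arm()
leader_right, follower_right = Arm(), Arm()

def teleop_step():
    """One 50 Hz cycle: each follower mirrors its leader's joint positions."""
    for leader, follower in ((leader_left, follower_left),
                             (leader_right, follower_right)):
        follower.command_joint_positions(leader.read_joint_positions())

# Data-collection loop (real code would also log camera frames each tick):
#   while collecting: teleop_step(); sleep(DT)
```

Because the followers receive raw joint positions rather than Cartesian targets, no inverse kinematics or calibration is involved; the kinematic match between leader and follower arms does all the work.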
The four cameras (two stationary, providing overhead and front views, plus two wrist-mounted cameras on the follower arms) capture RGB images during teleoperation. These images, along with the joint positions, form the observation data used to train learned policies.
ALOHA 2 is a redesigned version of the original system, developed by a team at Google DeepMind and Stanford University. The paper, authored by the ALOHA 2 Team including Jorge Aldaco, Travis Armstrong, Robert Baruch, and others, was released in May 2024. ALOHA 2 retains the same core concept (two ViperX followers, two WidowX leaders) but introduces several hardware improvements targeting ergonomics, durability, and data quality.
| Improvement area | Original ALOHA | ALOHA 2 |
|---|---|---|
| Gripper mechanism | Scissor-type design | Low-friction linear rail design |
| Leader gripper force to operate | 14.68 N | 0.84 N |
| Follower gripper output force | 12.8 N | 27.9 N |
| Gravity compensation | Rubber bands | Passive adjustable hanging retractors |
| Cameras | Logitech webcams | 4x Intel RealSense D405 (848 x 480, global shutter, depth) |
| Gripper material | Standard plastic | Carbon fiber nylon with polyurethane gripping tape |
| Frame | Full aluminum extrusion cage | Simplified 20x20 mm aluminum extrusion frame |
| Workspace table | Variable | Standardized 48 x 30 inch table |
| Software stack | ROS 1 | ROS 2 |
The gripper redesign is one of the most impactful changes. The original scissor-type grippers required significant force to operate, causing operator fatigue during long data collection sessions. The new linear rail mechanism reduces the operating force from 14.68 N to just 0.84 N for the leader grippers, while simultaneously doubling the follower gripper output force from 12.8 N to 27.9 N. The gripper fingers are now 3D-printed in carbon fiber nylon and coated with polyurethane gripping tape on both inner and outer surfaces, improving grip reliability and wear resistance.
The passive gravity compensation system replaces the original rubber band approach with commercially available hanging retractors. These can be adjusted by the operator to balance the weight of the leader arms, reducing fatigue. User studies showed that operators using the passive gravity compensation system could insert 1.38 shapes per minute versus 0.97 shapes per minute without it.
ALOHA 2 also upgraded the camera system from consumer webcams to four Intel RealSense D405 cameras, providing RGB and depth data with global shutter capability at 848 x 480 resolution. The four viewpoints are overhead, worm's-eye (looking up), left wrist, and right wrist. Custom 3D-printed mounts keep the cameras compact and reduce the overall footprint of the follower arms.
The frame was simplified by removing the vertical side panels from the original design, creating more open workspace for human-robot collaboration and larger objects. The 20x20 mm aluminum extrusion frame still provides rigid mounting points for cameras and the gravity compensation system.
The team also released a MuJoCo Menagerie simulation model of ALOHA 2 with system identification. They collected 11 real-world trajectories using the leader arms and minimized the residuals between real and simulated trajectories, tuning proportional gain, damping, armature, joint friction, and torque limits. This allows researchers to develop and test policies in simulation before deploying on real hardware.
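The fitting procedure can be illustrated with a toy one-parameter version. The real system identification tunes gains, damping, armature, friction, and torque limits in MuJoCo against 11 teleoperated trajectories; the sketch below fits a single damping coefficient of a first-order joint model by grid search over the trajectory residual, purely to show the shape of the optimization.

```python
import numpy as np

def simulate(damping, q0=1.0, dt=0.02, steps=100):
    """Toy first-order joint model: position decays toward zero."""
    q, traj = q0, []
    for _ in range(steps):
        q -= damping * q * dt
        traj.append(q)
    return np.array(traj)

# Pretend this trajectory was logged on the real hardware.
real_traj = simulate(damping=3.0)

# Minimize the mean squared residual between real and simulated trajectories.
candidates = np.linspace(0.5, 6.0, 56)
residuals = [np.mean((simulate(d) - real_traj) ** 2) for d in candidates]
best = candidates[int(np.argmin(residuals))]
```

The actual pipeline optimizes several coupled parameters at once and uses the leader-arm recordings as ground truth, but the principle is the same: pick simulator parameters that make simulated rollouts reproduce logged real-world motion.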
All hardware designs, CAD files, assembly tutorials, and the simulation model were open-sourced through the project website.
The original ALOHA paper introduced ACT (Action Chunking with Transformers), an imitation learning algorithm designed to address two fundamental problems in learning manipulation policies from demonstrations: compounding errors and multimodal action distributions.
In standard behavioral cloning, a policy is trained to predict a single action at each timestep given the current observation. Small errors in individual action predictions can accumulate over time, causing the robot to drift into states that were never seen during training. ACT mitigates this by predicting a "chunk" of k future actions at once (for example, the next 50 or 100 joint position targets). Because each chunk covers multiple timesteps, the effective decision horizon of the policy is reduced by a factor of k, giving errors fewer opportunities to compound.
Different human operators may perform the same task in different ways. For instance, when picking up an object, one operator might approach from the left while another approaches from the right. A standard regression-based policy would average these different strategies, producing actions that do not match any real strategy. ACT handles this by using a Conditional Variational Autoencoder (CVAE) framework. During training, a CVAE encoder compresses the ground-truth action sequence and current joint positions into a low-dimensional latent variable z, representing the "style" of the demonstration. The policy (CVAE decoder) takes the current observations along with a sampled z and predicts an action chunk. At test time, the encoder is discarded, and z is sampled from the learned prior distribution, allowing the policy to commit to one coherent strategy per rollout.
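The train/test asymmetry of the CVAE can be made concrete with a deliberately tiny sketch. The linear maps below stand in for ACT's transformer-based encoder and decoder (which also consume camera images), and the dimensions are arbitrary; only the information flow, reparameterized z at training time, prior-sampled z at test time, mirrors the real algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; real ACT uses transformers over image features)
obs_dim, act_dim, chunk, z_dim = 4, 2, 5, 3

# Hypothetical linear "networks" standing in for the CVAE encoder/decoder
W_enc = rng.normal(size=(act_dim * chunk + obs_dim, 2 * z_dim))  # -> (mu, logvar)
W_dec = rng.normal(size=(obs_dim + z_dim, act_dim * chunk))

def encode(obs, actions):
    h = np.concatenate([obs, actions.ravel()]) @ W_enc
    return h[:z_dim], h[z_dim:]            # mu, logvar

def decode(obs, z):
    return (np.concatenate([obs, z]) @ W_dec).reshape(chunk, act_dim)

# Training: z comes from the encoder (reparameterization trick)
obs = rng.normal(size=obs_dim)
demo_actions = rng.normal(size=(chunk, act_dim))
mu, logvar = encode(obs, demo_actions)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
pred = decode(obs, z)
recon_loss = np.abs(pred - demo_actions).mean()          # L1 reconstruction
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))  # KL to N(0, I)
loss = recon_loss + 10.0 * kl   # KL weight is a hyperparameter

# Test time: encoder discarded; z taken from the prior (here, its mean 0)
pred_test = decode(obs, np.zeros(z_dim))
```

Using the prior's mean at test time yields a deterministic policy that commits to one demonstrated "style" instead of averaging across them.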
The ACT policy architecture consists of three main components: ResNet image backbones that encode each camera view, a transformer encoder that fuses the visual features with the current joint positions and the latent variable z, and a transformer decoder that outputs the full action chunk in a single forward pass.
When executing an action chunk, the robot does not wait for the entire chunk to finish before querying the policy again. Instead, the policy is queried at every timestep, producing overlapping chunks that predict values for the same future timesteps. These overlapping predictions are combined using an exponential weighting scheme called temporal ensembling, with weights w_i = exp(−m·i), where w_0 corresponds to the oldest prediction; earlier predictions therefore receive the highest weight, and the hyperparameter m controls how quickly newer predictions are discounted. This smooths the executed trajectory and reduces jerkiness at chunk boundaries.
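A minimal numpy sketch of temporal ensembling follows. The `policy` function is a hypothetical stand-in that returns deterministic joint targets; the weighting follows the ACT paper's w_i = exp(−m·i) with w_0 assigned to the oldest overlapping prediction.

```python
import numpy as np

k = 4             # chunk size (the real system uses ~50-100)
m = 0.1           # ensembling decay; higher m discounts newer chunks faster

def policy(t):
    # Stand-in for the learned policy: predicts 2-D joint targets
    # for timesteps t .. t+k-1.
    return np.array([[t + i, -(t + i)] for i in range(k)], dtype=float)

chunks = {}       # start timestep -> predicted chunk
executed = []
for t in range(8):
    chunks[t] = policy(t)                        # query policy every step
    # Gather all chunk predictions that cover timestep t, oldest chunk first
    preds = [chunks[s][t - s] for s in sorted(chunks) if s <= t < s + k]
    w = np.exp(-m * np.arange(len(preds)))       # w_0 -> oldest prediction
    action = (w[:, None] * np.asarray(preds)).sum(axis=0) / w.sum()
    executed.append(action)
```

Because every executed action is a weighted blend of up to k overlapping predictions, an outlier from any single chunk is averaged away rather than executed directly.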
Using only about 50 demonstrations per task (roughly 10 minutes of teleoperation data), ACT achieved 80 to 90 percent success rates across six challenging bimanual manipulation tasks.
| Task | Success rate | Number of demos |
|---|---|---|
| Open translucent condiment cup | 96% | ~50 |
| Slot a battery | 84% | ~50 |
| Thread a zip tie | High | ~50 |
| Juggle a ping pong ball | High | ~50 |
| Assemble NIST board chain | High | ~50 |
| Prepare tape | High | ~50 |
These results were notable because the tasks involve fine-grained precision (millimeter-level accuracy for battery insertion), dynamic motions (juggling), and complex contact patterns (threading), all achieved with low-cost hardware and minimal demonstration data.
Mobile ALOHA, introduced in January 2024 by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn at Stanford University, extends the original stationary ALOHA system with a mobile base, enabling the robot to navigate environments and perform whole-body mobile manipulation tasks. The paper was published at the Conference on Robot Learning (CoRL) 2024.
The motivation behind Mobile ALOHA was straightforward: many useful household and workplace tasks require the robot to move around, not just manipulate objects on a fixed table. Cooking, cleaning, organizing, and navigating between rooms all require coordinated locomotion and bimanual manipulation.
Mobile ALOHA mounts the ALOHA bimanual system onto an AgileX Tracer AGV (automated guided vehicle), a differential-drive mobile base originally designed for warehouse logistics.
| Component | Specification |
|---|---|
| Mobile base | AgileX Tracer AGV |
| Base cost | ~$7,000 (roughly one-fifth the cost of comparable Clearpath AGVs) |
| Total system cost | ~$32,000 (including onboard power and compute) |
| Maximum speed | 1.6 m/s |
| Payload capacity | 100 kg |
| Battery | 1.26 kWh, 14 kg (doubles as counterweight) |
| Arms | 2x ViperX 300, 6 DoF each |
| Arm reach from base | 100 cm |
| Vertical reach | 65 cm to 200 cm |
| Lift capacity per arm | 1.5 kg |
| Pull force | 100 N at 1.5 m height |
| Cameras | 3x Logitech C922x (480 x 640, 50 Hz): 2 wrist-mounted, 1 forward-facing |
| Onboard GPU | NVIDIA RTX 3070 Ti (8 GB VRAM) |
| Onboard CPU | Intel i7-12800H |
| Total DoF controlled | 16 (14 arm joints/grippers + 2 base velocities) |
The total cost of approximately $32,000 is comparable to a single industrial Franka Emika Panda arm, yet Mobile ALOHA provides bimanual manipulation, mobility, and onboard compute. For context, comparable high-quality bimanual mobile manipulators have historically cost $200,000 or more.
The 14 kg battery is placed at the base of the robot, serving a dual purpose: providing power for several hours of continuous operation and acting as a counterweight to prevent the robot from tipping over when the arms are extended. Ground clearance is 30 mm, and the base can handle obstacles up to 10 mm in height and slopes up to 8 degrees.
One of the key contributions of Mobile ALOHA is the co-training approach, which demonstrated that data from existing static (tabletop) ALOHA tasks can be combined with new Mobile ALOHA task data to significantly improve performance.
The approach works as follows. The team had previously collected 825 episodes across 12 tabletop tasks on the stationary ALOHA system. When training a policy for a new Mobile ALOHA task, they combined the 50 new mobile demonstrations with the 825 existing static demonstrations, sampling from each dataset with equal probability during training. Because the static demonstrations lack base velocity commands, those action dimensions are zero-padded with [0, 0] to match the 16-dimensional Mobile ALOHA action space. Action normalization statistics are computed using only the mobile task data.
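The data-mixing recipe can be sketched directly from the description above. The arrays below are random placeholders for the actual demonstration datasets; the padding, normalization, and 50/50 sampling mirror the stated procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

ARM_DIM, BASE_DIM = 14, 2     # 14 arm/gripper targets + 2 base velocities

# Hypothetical datasets: mobile demos are 16-D, static tabletop demos 14-D.
mobile_actions = rng.normal(size=(50, ARM_DIM + BASE_DIM))
static_actions = rng.normal(size=(825, ARM_DIM))

# Zero-pad static actions with [0, 0] base velocities to match 16-D.
static_padded = np.concatenate(
    [static_actions, np.zeros((len(static_actions), BASE_DIM))], axis=1)

# Normalization statistics come from the mobile task data only.
mean, std = mobile_actions.mean(axis=0), mobile_actions.std(axis=0) + 1e-8

def sample_batch(batch_size=8):
    """Draw each example from the mobile or static pool with p = 0.5."""
    batch = []
    for _ in range(batch_size):
        pool = mobile_actions if rng.random() < 0.5 else static_padded
        batch.append((pool[rng.integers(len(pool))] - mean) / std)
    return np.stack(batch)
```

Equal-probability sampling means the 50 mobile demonstrations are heavily upweighted relative to their share of the combined dataset, so the policy still sees the target task often despite the much larger static pool.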
The co-training approach produced striking improvements on several tasks.
| Task | Demonstrations | Success with co-training | Success without co-training |
|---|---|---|---|
| Wipe wine | 50 | 95% | 50% |
| Call elevator | 50 | 95% | 0% |
| Use cabinet | 50 | 85% | 85% |
| Rinse pan | 50 | 80% | 95% |
| Push chairs | 50 | 100% | 100% |
| Cook shrimp | 20 | 40% | 20% |
| High five | 20 | 85% | 85% |
The most dramatic improvement was on the "call elevator" task, where co-training raised the success rate from 0% to 95%. On "wipe wine," co-training nearly doubled the success rate from 50% to 95%. The researchers also showed that co-trained policies using only 35 demonstrations outperformed non-co-trained policies using 50 demonstrations by 20 percentage points on the wine-wiping task, demonstrating meaningful data efficiency gains.
The team tested co-training with three different policy architectures: ACT, Diffusion Policy, and VINN (a retrieval-based method). ACT showed the strongest overall performance. Diffusion Policy also benefited from co-training, with a 30 percentage point improvement on the wine-wiping task. VINN showed mixed results, with co-training helping on one task but slightly hurting on another.
Human operators achieved 39 to 52 percent reductions in task completion time after just five practice trials with the teleoperation interface, indicating that the system is relatively easy to learn.
ALOHA Unleashed, published by Google DeepMind in October 2024, investigates how far imitation learning can be pushed for challenging dexterous bimanual tasks. The authors (Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid) describe the approach as "a simple recipe": large-scale data collection on the ALOHA 2 hardware combined with expressive diffusion policy models.
ALOHA Unleashed uses a Transformer encoder-decoder architecture trained with a diffusion loss. The system operates on the ALOHA 2 platform with its four camera viewpoints.
| Architecture component | Details |
|---|---|
| Vision backbone | ResNet-50, processing 4 RGB images (480 x 640 x 3) |
| Encoder | 85M parameters, bidirectional attention |
| Decoder (diffusion denoiser) | 55M parameters, iterative action refinement |
| Total parameters (Base) | 217M |
| Total parameters (Small, ablations) | 150M |
| Action chunk size | 50 actions (1-second trajectories) |
| Action dimensions | 14 DoF (12 joint positions + 2 gripper values) |
| Diffusion steps (training) | 50, squared cosine noise schedule |
| Inference sampling | DDIM |
The diffusion-based approach is analogous to how image generation models like Imagen work: during training, noise is progressively added to the ground-truth action sequences, and the model learns to denoise them. At inference time, the model starts from random noise and iteratively refines it into a coherent action sequence.
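The forward (noising) half of this process is easy to write down. The sketch below implements the squared-cosine cumulative noise schedule and applies it to a dummy action chunk matching the table's dimensions; the reverse process (the learned denoiser and DDIM sampler) is only indicated in comments, since it requires the trained model.

```python
import numpy as np

T = 50  # diffusion steps, as in ALOHA Unleashed

def squared_cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction under the squared-cosine schedule;
    decays from 1 at t=0 toward 0 at t=T."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(50, 14))   # dummy 1-second chunk of 14-D actions

# Forward process: noise the clean action chunk at step t.
t = 25
ab = squared_cosine_alpha_bar(t, T)
noise = rng.normal(size=x0.shape)
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * noise

# Training target: predict `noise` from (xt, t, camera observations).
# Inference: start from pure noise and iteratively denoise with DDIM.
```

The key property is that a single closed-form expression produces the noised chunk at any step t, so training can sample t uniformly rather than simulating the chain step by step.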
The data collection effort for ALOHA Unleashed was unprecedented for a bimanual manipulation platform. A total of 35 human operators collected over 26,000 demonstrations across 10 ALOHA 2 robots in 2 buildings over approximately 8 months.
| Task | Number of demonstrations |
|---|---|
| Shirt hanging | 8,658 |
| Robot finger replacement | 5,247 |
| Shoelace tying | 5,133 |
| Gear insertion | 4,005 |
| Random kitchen stacking | 3,198 |
| Total | 26,241 |
The diversity in operators, robots, and environments introduced natural variation in teleoperation strategies and hardware conditions, which proved beneficial for policy robustness.
Success rates were evaluated over 20 trials per task.
| Task | Success rate |
|---|---|
| Shirt (easy configuration) | 75% |
| Shirt (messy configuration) | 70% |
| Shoelace (easy) | 70% |
| Shoelace (messy) | 40% |
| Robot finger replacement | 75% |
| Gear insertion (all 3 gears) | 40% |
| Random kitchen (all items) | 25% |
These results represent several firsts in robot learning. ALOHA Unleashed produced the first end-to-end learned policy that can autonomously tie shoelaces and the first that can hang t-shirts on a rack. The robot finger replacement task requires millimeter-precision insertion. All of these behaviors were learned purely from visual observations on uncalibrated hardware, with no explicit state estimation or task-specific engineering.
The paper directly compared the diffusion-based approach against ACT (trained with L1 regression loss) using the same 150M parameter model. On the ShirtMessy task, the diffusion policy achieved 70% success compared to 25% for ACT. The authors concluded that "non-diffusion based architectures are incapable of solving some of our tasks," suggesting that the expressiveness of diffusion models is necessary for highly dexterous, multimodal manipulation.
Ablation experiments, run with the smaller 150M-parameter model, probed the sensitivity of these results to individual design choices.
The ALOHA project has had a broad impact on the robotics research community because of its fully open-source nature. The original ALOHA repository on GitHub provides complete hardware assembly instructions, bill of materials, CAD files, and software for teleoperation and policy training. ALOHA 2 extended this with detailed tutorials and a MuJoCo simulation model. This openness has allowed labs worldwide to replicate and build upon the platform without starting from scratch.
Trossen Robotics, the manufacturer of the ViperX and WidowX arms used in ALOHA, began offering pre-assembled commercial kits based on the ALOHA design. These include the ALOHA Stationary kit (the bimanual tabletop setup), the ALOHA Mobile kit (with the AgileX mobile base), and the ALOHA Solo (a single-arm configuration). The ALOHA Solo starts at approximately $9,000. These commercial offerings lower the barrier to entry for research groups that lack the time or expertise to assemble the system from individual parts.
The Hugging Face LeRobot library, launched in 2024 and led by former Tesla robotics lead Remi Cadene, has adopted the ALOHA platform as one of its primary supported hardware configurations. LeRobot provides a unified PyTorch-based framework for training imitation learning, reinforcement learning, and vision-language-action (VLA) policies, with native support for ALOHA environments and datasets. The Hugging Face Hub hosts multiple ALOHA-related datasets (including aloha_mobile_cabinet and others) and pretrained ACT models.
Within its first twelve months, the LeRobot GitHub repository grew to over 12,000 stars, with an active community of builders sharing tutorials, modifications, and trained models on YouTube and Discord. LeRobot later partnered with The Robot Studio to release the SO-100 arm, a $100 robotic arm designed for accessibility, and NVIDIA announced GR00T N1, an open foundation model for humanoid robots, fine-tuned to run on the LeRobot SO-100 arm.
The open-source nature of ALOHA has inspired community members to build derivative platforms. One notable example is AlohaMini, an open-source dual-arm mobile robot with a motorized vertical lift (0 to 60 cm travel for floor-to-table reach) and a 5-camera perception system. AlohaMini is designed to be fully 3D-printable and can be assembled at home in approximately 60 minutes. It integrates with LeRobot for policy training and deployment, and in late 2025 it gained ManiSkill3 simulation support and a deployment guide for the Pi 0.5 foundation model.
Other community variants include the AgileX COBOT Magic, which builds on the Mobile ALOHA concept using AgileX's own robotics platform, and various university-developed modifications that adapt the ALOHA design for specific research needs.
The ALOHA project has been shaped by a small group of researchers, primarily from Stanford University and Google DeepMind.
| Researcher | Affiliation | Role in ALOHA project |
|---|---|---|
| Tony Z. Zhao | Stanford (formerly); co-founder and CEO of Sunday Robotics | Lead developer of original ALOHA and ACT; co-author of Mobile ALOHA and ALOHA Unleashed |
| Chelsea Finn | Stanford University (IRIS Lab) | Faculty advisor for ALOHA, Mobile ALOHA, and ALOHA Unleashed |
| Zipeng Fu | Stanford University | Co-lead of Mobile ALOHA |
| Vikash Kumar | Meta | Co-author of original ALOHA paper |
| Sergey Levine | UC Berkeley | Co-author of original ALOHA paper |
| Ayzaan Wahid | Google DeepMind | Co-lead of ALOHA Unleashed |
| Jonathan Tompson | Google DeepMind | Co-author of ALOHA Unleashed |
| Danny Driess | Google DeepMind | Co-author of ALOHA Unleashed |
Tony Z. Zhao, who was a computer science PhD student at Stanford under Chelsea Finn and held the Stanford Robotics Fellowship for 2022-23, left Stanford to co-found Sunday Robotics (sunday.ai) with Cheng Chi. Sunday Robotics is developing a home robot called Memo and raised $35 million in initial funding from Benchmark and Conviction. In March 2026, Sunday Robotics raised an additional $165 million and announced plans to launch its first autonomous robots by Thanksgiving 2026. The company's approach builds directly on the data-driven manipulation techniques pioneered in the ALOHA project, including a large-scale glove-based data collection program.
Zipeng Fu, a Stanford AI and Robotics PhD student supported by the Stanford Graduate Fellowship, led the Mobile ALOHA project alongside Tony Zhao.
| Date | Event |
|---|---|
| April 2023 | Original ALOHA paper submitted to arXiv (2304.13705) |
| July 2023 | ALOHA paper presented at RSS 2023 |
| January 2024 | Mobile ALOHA paper released; system goes viral on social media |
| May 2024 | ALOHA 2 hardware paper released (arXiv 2405.02292) |
| September 2024 | Google DeepMind announces ALOHA Unleashed and DemoStart |
| October 2024 | ALOHA Unleashed paper released (arXiv 2410.13126) |
| 2024 | Mobile ALOHA paper published at CoRL 2024 |
| 2024 | Trossen Robotics launches commercial ALOHA kits |
| 2024 | Hugging Face LeRobot library launches with ALOHA support |
| November 2025 | AlohaMini CAD files released |
| December 2025 | AlohaMini gains ManiSkill3 simulation integration |
| February 2026 | AlohaMini Pi 0.5 deployment guide released |
ALOHA's contribution to robotics and embodied AI research is primarily practical rather than theoretical. The system did not introduce fundamentally new concepts in robot learning or teleoperation; bimanual manipulation, imitation learning from demonstrations, and leader-follower teleoperation all predate ALOHA. What the project did was package these ideas into a system that was cheap enough for most labs to afford, simple enough to assemble and use, and open enough for others to modify and build upon.
The result has been a proliferation of bimanual manipulation research that would not have been feasible at previous hardware price points. Before ALOHA, collecting bimanual manipulation demonstrations typically required either expensive industrial hardware or custom-built research platforms that were difficult to replicate. ALOHA showed that commodity robot arms costing a few thousand dollars each, combined with a straightforward puppeteering interface, could produce demonstration data of sufficient quality to train effective manipulation policies.
The co-training insight from Mobile ALOHA, where static manipulation data collected on a tabletop system improves the performance of mobile manipulation policies, suggests that the broader ALOHA community's growing pool of shared demonstration data could have compounding benefits. As more labs collect and share ALOHA-format demonstrations through platforms like the Hugging Face Hub, the value of the shared data pool increases for everyone.
ALOHA Unleashed's results further demonstrated that, given enough data and sufficiently expressive models, learned policies can achieve dexterous manipulation capabilities that were previously only possible with carefully engineered, task-specific controllers. The fact that a single architecture (diffusion transformer) trained on teleoperation data can tie shoelaces, hang shirts, and perform millimeter-precision insertions, all on the same hardware, represents a meaningful step toward general-purpose robotic manipulation.