ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) is an open-source robotic platform designed for collecting bimanual manipulation demonstrations and training imitation learning policies. Developed at Stanford University by Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn, the system was introduced in a 2023 paper published at the Robotics: Science and Systems (RSS) conference. ALOHA pairs two leader robot arms with two follower robot arms, allowing a human operator to teleoperate both follower arms simultaneously by physically moving the leaders. The complete system costs under $20,000, roughly one-fifth to one-tenth the cost of comparable research-grade bimanual setups. Alongside the hardware, the team introduced Action Chunking with Transformers (ACT), an imitation learning algorithm that predicts sequences of future actions rather than single timesteps, which reduces compounding errors in learned policies.
Since its release, ALOHA has become one of the most widely adopted platforms in robot learning research. It has spawned several follow-up systems: Mobile ALOHA (2024), which adds a mobile base for whole-body manipulation; ALOHA 2 (2024), an enhanced hardware revision by Google DeepMind; and ALOHA Unleashed (2024), which combines ALOHA 2 with large-scale diffusion policy training to achieve complex dexterous tasks like tying shoelaces. The hardware designs, software, and simulation models are all publicly available, and commercial kits are sold by Trossen Robotics.
Bimanual manipulation, where a robot uses two arms to coordinate on a single task, is needed for many real-world activities such as folding laundry, cooking, and assembling parts. However, research in bimanual robot learning has historically been limited by the high cost of hardware. Industrial-grade dual-arm setups with force-torque sensors and high-precision actuators can easily exceed $100,000 to $200,000, placing them out of reach for most academic labs. Even when hardware is available, collecting demonstration data for imitation learning often requires specialized teleoperation interfaces (such as VR controllers or exoskeletons) that add cost and complexity.
The ALOHA project was motivated by the idea that affordable, open-source hardware could democratize bimanual manipulation research. By using off-the-shelf hobby-grade robot arms and a simple "puppeteering" teleoperation approach (where the operator physically backdrives smaller leader arms that are kinematically matched to the larger follower arms), the developers aimed to eliminate the need for expensive sensors, calibration procedures, or specialized teleoperation equipment.
The original ALOHA system uses four robot arms from Trossen Robotics: two ViperX 300 arms as the followers (the arms that actually perform tasks) and two WidowX 250 arms as the leaders (the arms the human operator moves). The WidowX leader arms share the same kinematic structure as the ViperX followers but in a smaller, lighter form factor, which makes them easy to backdrive by hand.
| Component | Specification |
|---|---|
| Follower arms | 2x Trossen Robotics ViperX 300, 6 DoF each |
| Leader arms | 2x Trossen Robotics WidowX 250, 6 DoF each |
| Grippers | Parallel-jaw grippers on all four arms |
| Control frequency | 50 Hz joint position synchronization |
| Cameras | 4 RGB cameras (480 x 640 resolution): 2 stationary, 2 wrist-mounted |
| Total cost | Under $20,000 |
| Communication | USB-based, synchronized via ROS |
During teleoperation, the joint positions of the leader arms are read at 50 Hz and sent as position commands to the corresponding follower arms. The operator simply grasps the leader arms and moves them; the followers mirror the motion in real time. This approach requires no force sensors, no motion capture, and no calibration beyond the initial mechanical setup. The grippers on the leader arms are mechanically linked to the follower grippers, so opening and closing the leader gripper directly controls the follower gripper.
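The leader-follower loop described above can be sketched in a few lines. This is a minimal stand-in, not the real driver code: the `Arm` class and its two methods are hypothetical placeholders for the DYNAMIXEL servo interface that the actual system accesses over USB via the Interbotix/ROS stack.

```python
RATE_HZ = 50                      # joint-position sync rate from the paper
DT = 1.0 / RATE_HZ

# Hypothetical in-memory stand-in for a 6-DoF arm; the real system reads and
# writes servo registers over USB through the Interbotix drivers.
class Arm:
    def __init__(self, n_joints=6):
        self.q = [0.0] * n_joints
    def read_joint_positions(self):          # leader: backdriven by hand
        return list(self.q)
    def command_joint_positions(self, q):    # follower: position control
        self.q = list(q)

leader_left, follower_left = Arm(), Arm()
leader_right, follower_right = Arm(), Arm()

def teleop_step():
    """One 50 Hz cycle: each follower mirrors its leader's joint positions."""
    for leader, follower in ((leader_left, follower_left),
                             (leader_right, follower_right)):
        follower.command_joint_positions(leader.read_joint_positions())

# Data-collection loop (real code would also log camera frames each tick):
#   while collecting: teleop_step(); sleep(DT)
```

Because the followers receive raw joint positions rather than Cartesian targets, no inverse kinematics or calibration is involved; the kinematic match between leader and follower arms does all the work.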
The four cameras (two stationary, providing overhead and front views, plus two wrist-mounted cameras on the follower arms) capture RGB images during teleoperation. These images, along with the joint positions, form the observation data used to train learned policies.
ALOHA 2 is a redesigned version of the original system, developed by a team at Google DeepMind and Stanford University. The paper, authored by the ALOHA 2 Team including Jorge Aldaco, Travis Armstrong, Robert Baruch, and others, was released in May 2024. ALOHA 2 retains the same core concept (two ViperX followers, two WidowX leaders) but introduces several hardware improvements targeting ergonomics, durability, and data quality.
| Improvement area | Original ALOHA | ALOHA 2 |
|---|---|---|
| Gripper mechanism | Scissor-type design | Low-friction linear rail design |
| Leader gripper force to operate | 14.68 N | 0.84 N |
| Follower gripper output force | 12.8 N | 27.9 N |
| Gravity compensation | Rubber bands | Passive adjustable hanging retractors |
| Cameras | Logitech webcams | 4x Intel RealSense D405 (848 x 480, global shutter, depth) |
| Gripper material | Standard plastic | Carbon fiber nylon with polyurethane gripping tape |
| Frame | Full aluminum extrusion cage | Simplified 20x20 mm aluminum extrusion frame |
| Workspace table | Variable | Standardized 48 x 30 inch table |
| Software stack | ROS 1 | ROS 2 |
The gripper redesign is one of the most impactful changes. The original scissor-type grippers required significant force to operate, causing operator fatigue during long data collection sessions. The new linear rail mechanism reduces the operating force from 14.68 N to just 0.84 N for the leader grippers, while simultaneously doubling the follower gripper output force from 12.8 N to 27.9 N. The gripper fingers are now 3D-printed in carbon fiber nylon and coated with polyurethane gripping tape on both inner and outer surfaces, improving grip reliability and wear resistance.
The passive gravity compensation system replaces the original rubber band approach with commercially available hanging retractors. These can be adjusted by the operator to balance the weight of the leader arms, reducing fatigue. User studies showed that operators using the passive gravity compensation system could insert 1.38 shapes per minute versus 0.97 shapes per minute without it.
ALOHA 2 also upgraded the camera system from consumer webcams to four Intel RealSense D405 cameras, providing RGB and depth data with global shutter capability at 848 x 480 resolution. The four viewpoints are overhead, worm's-eye (looking up), left wrist, and right wrist. Custom 3D-printed mounts keep the cameras compact and reduce the overall footprint of the follower arms.
The frame was simplified by removing the vertical side panels from the original design, creating more open workspace for human-robot collaboration and larger objects. The 20x20 mm aluminum extrusion frame still provides rigid mounting points for cameras and the gravity compensation system.
The team also released a MuJoCo Menagerie simulation model of ALOHA 2 with system identification. They collected 11 real-world trajectories using the leader arms and minimized the residuals between real and simulated trajectories, tuning proportional gain, damping, armature, joint friction, and torque limits. This allows researchers to develop and test policies in simulation before deploying on real hardware.
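The fitting procedure can be illustrated with a toy one-parameter version. The real system identification tunes gains, damping, armature, friction, and torque limits in MuJoCo against 11 teleoperated trajectories; the sketch below fits a single damping coefficient of a first-order joint model by grid search over the trajectory residual, purely to show the shape of the optimization.

```python
import numpy as np

def simulate(damping, q0=1.0, dt=0.02, steps=100):
    """Toy first-order joint model: position decays toward zero."""
    q, traj = q0, []
    for _ in range(steps):
        q -= damping * q * dt
        traj.append(q)
    return np.array(traj)

# Pretend this trajectory was logged on the real hardware.
real_traj = simulate(damping=3.0)

# Minimize the mean squared residual between real and simulated trajectories.
candidates = np.linspace(0.5, 6.0, 56)
residuals = [np.mean((simulate(d) - real_traj) ** 2) for d in candidates]
best = candidates[int(np.argmin(residuals))]
```

The actual pipeline optimizes several coupled parameters at once and uses the leader-arm recordings as ground truth, but the principle is the same: pick simulator parameters that make simulated rollouts reproduce logged real-world motion.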
All hardware designs, CAD files, assembly tutorials, and the simulation model were open-sourced through the project website.
The original ALOHA paper introduced ACT (Action Chunking with Transformers), an imitation learning algorithm designed to address two fundamental problems in learning manipulation policies from demonstrations: compounding errors and multimodal action distributions.
In standard behavioral cloning, a policy is trained to predict a single action at each timestep given the current observation. Small errors in individual action predictions can accumulate over time, causing the robot to drift into states that were never seen during training. ACT mitigates this by predicting a "chunk" of k future actions at once (for example, the next 50 or 100 joint position targets). Because each chunk covers multiple timesteps, the effective decision horizon of the policy is reduced by a factor of k, giving errors fewer opportunities to compound.
Different human operators may perform the same task in different ways. For instance, when picking up an object, one operator might approach from the left while another approaches from the right. A standard regression-based policy would average these different strategies, producing actions that do not match any real strategy. ACT handles this by using a Conditional Variational Autoencoder (CVAE) framework. During training, a CVAE encoder compresses the ground-truth action sequence and current joint positions into a low-dimensional latent variable z, representing the "style" of the demonstration. The policy (CVAE decoder) takes the current observations along with a sampled z and predicts an action chunk. At test time, the encoder is discarded, and z is sampled from the learned prior distribution, allowing the policy to commit to one coherent strategy per rollout.
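The train/test asymmetry of the CVAE can be made concrete with a deliberately tiny sketch. The linear maps below stand in for ACT's transformer-based encoder and decoder (which also consume camera images), and the dimensions are arbitrary; only the information flow, reparameterized z at training time, prior-sampled z at test time, mirrors the real algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; real ACT uses transformers over image features)
obs_dim, act_dim, chunk, z_dim = 4, 2, 5, 3

# Hypothetical linear "networks" standing in for the CVAE encoder/decoder
W_enc = rng.normal(size=(act_dim * chunk + obs_dim, 2 * z_dim))  # -> (mu, logvar)
W_dec = rng.normal(size=(obs_dim + z_dim, act_dim * chunk))

def encode(obs, actions):
    h = np.concatenate([obs, actions.ravel()]) @ W_enc
    return h[:z_dim], h[z_dim:]            # mu, logvar

def decode(obs, z):
    return (np.concatenate([obs, z]) @ W_dec).reshape(chunk, act_dim)

# Training: z comes from the encoder (reparameterization trick)
obs = rng.normal(size=obs_dim)
demo_actions = rng.normal(size=(chunk, act_dim))
mu, logvar = encode(obs, demo_actions)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
pred = decode(obs, z)
recon_loss = np.abs(pred - demo_actions).mean()          # L1 reconstruction
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))  # KL to N(0, I)
loss = recon_loss + 10.0 * kl   # KL weight is a hyperparameter

# Test time: encoder discarded; z taken from the prior (here, its mean 0)
pred_test = decode(obs, np.zeros(z_dim))
```

Using the prior's mean at test time yields a deterministic policy that commits to one demonstrated "style" instead of averaging across them.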
The ACT policy architecture consists of three main components: ResNet image backbones that encode each camera view, a transformer encoder that fuses the visual features with the current joint positions and the latent variable z, and a transformer decoder that outputs the full action chunk in a single forward pass.
When executing an action chunk, the robot does not wait for the entire chunk to finish before querying the policy again. Instead, the policy is queried at every timestep, producing overlapping chunks that predict values for the same future timesteps. These overlapping predictions are combined using an exponential weighting scheme called temporal ensembling, with weights w_i = exp(−m·i), where w_0 corresponds to the oldest prediction; earlier predictions therefore receive the highest weight, and the hyperparameter m controls how quickly newer predictions are discounted. This smooths the executed trajectory and reduces jerkiness at chunk boundaries.
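A minimal numpy sketch of temporal ensembling follows. The `policy` function is a hypothetical stand-in that returns deterministic joint targets; the weighting follows the ACT paper's w_i = exp(−m·i) with w_0 assigned to the oldest overlapping prediction.

```python
import numpy as np

k = 4             # chunk size (the real system uses ~50-100)
m = 0.1           # ensembling decay; higher m discounts newer chunks faster

def policy(t):
    # Stand-in for the learned policy: predicts 2-D joint targets
    # for timesteps t .. t+k-1.
    return np.array([[t + i, -(t + i)] for i in range(k)], dtype=float)

chunks = {}       # start timestep -> predicted chunk
executed = []
for t in range(8):
    chunks[t] = policy(t)                        # query policy every step
    # Gather all chunk predictions that cover timestep t, oldest chunk first
    preds = [chunks[s][t - s] for s in sorted(chunks) if s <= t < s + k]
    w = np.exp(-m * np.arange(len(preds)))       # w_0 -> oldest prediction
    action = (w[:, None] * np.asarray(preds)).sum(axis=0) / w.sum()
    executed.append(action)
```

Because every executed action is a weighted blend of up to k overlapping predictions, an outlier from any single chunk is averaged away rather than executed directly.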
Using only about 50 demonstrations per task (roughly 10 minutes of teleoperation data), ACT achieved 80 to 90 percent success rates across six challenging bimanual manipulation tasks.
| Task | Success rate | Number of demos |
|---|---|---|
| Open translucent condiment cup | 96% | ~50 |
| Slot a battery | 84% | ~50 |
| Thread a zip tie | High | ~50 |
| Juggle a ping pong ball | High | ~50 |
| Assemble NIST board chain | High | ~50 |
| Prepare tape | High | ~50 |
These results were notable because the tasks involve fine-grained precision (millimeter-level accuracy for battery insertion), dynamic motions (juggling), and complex contact patterns (threading), all achieved with low-cost hardware and minimal demonstration data.
Mobile ALOHA, introduced in January 2024 by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn at Stanford University, extends the original stationary ALOHA system with a mobile base, enabling the robot to navigate environments and perform whole-body mobile manipulation tasks. The paper was published at the Conference on Robot Learning (CoRL) 2024.
The motivation behind Mobile ALOHA was straightforward: many useful household and workplace tasks require the robot to move around, not just manipulate objects on a fixed table. Cooking, cleaning, organizing, and navigating between rooms all require coordinated locomotion and bimanual manipulation.
Mobile ALOHA mounts the ALOHA bimanual system onto an AgileX Tracer AGV (automated guided vehicle), a differential-drive mobile base originally designed for warehouse logistics.
| Component | Specification |
|---|---|
| Mobile base | AgileX Tracer AGV |
| Base cost | ~$7,000 (roughly one-fifth the cost of comparable Clearpath AGVs) |
| Total system cost | ~$32,000 (including onboard power and compute) |
| Maximum speed | 1.6 m/s |
| Payload capacity | 100 kg |
| Battery | 1.26 kWh, 14 kg (doubles as counterweight) |
| Arms | 2x ViperX 300, 6 DoF each |
| Arm reach from base | 100 cm |
| Vertical reach | 65 cm to 200 cm |
| Lift capacity per arm | 1.5 kg |
| Pull force | 100 N at 1.5 m height |
| Cameras | 3x Logitech C922x (480 x 640, 50 Hz): 2 wrist-mounted, 1 forward-facing |
| Onboard GPU | NVIDIA RTX 3070 Ti (8 GB VRAM) |
| Onboard CPU | Intel i7-12800H |
| Total DoF controlled | 16 (14 arm joints/grippers + 2 base velocities) |
The total cost of approximately $32,000 is comparable to a single industrial Franka Emika Panda arm, yet Mobile ALOHA provides bimanual manipulation, mobility, and onboard compute. For context, comparable high-quality bimanual mobile manipulators have historically cost $200,000 or more.
The 14 kg battery is placed at the base of the robot, serving a dual purpose: providing power for several hours of continuous operation and acting as a counterweight to prevent the robot from tipping over when the arms are extended. Ground clearance is 30 mm, and the base can handle obstacles up to 10 mm in height and slopes up to 8 degrees.
One of the key contributions of Mobile ALOHA is the co-training approach, which demonstrated that data from existing static (tabletop) ALOHA tasks can be combined with new Mobile ALOHA task data to significantly improve performance.
The approach works as follows. The team had previously collected 825 episodes across 12 tabletop tasks on the stationary ALOHA system. When training a policy for a new Mobile ALOHA task, they combined the 50 new mobile demonstrations with the 825 existing static demonstrations, sampling from each dataset with equal probability during training. Because the static demonstrations lack base velocity commands, those action dimensions are zero-padded with [0, 0] to match the 16-dimensional Mobile ALOHA action space. Action normalization statistics are computed using only the mobile task data.
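The data-mixing recipe can be sketched directly from the description above. The arrays below are random placeholders for the actual demonstration datasets; the padding, normalization, and 50/50 sampling mirror the stated procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

ARM_DIM, BASE_DIM = 14, 2     # 14 arm/gripper targets + 2 base velocities

# Hypothetical datasets: mobile demos are 16-D, static tabletop demos 14-D.
mobile_actions = rng.normal(size=(50, ARM_DIM + BASE_DIM))
static_actions = rng.normal(size=(825, ARM_DIM))

# Zero-pad static actions with [0, 0] base velocities to match 16-D.
static_padded = np.concatenate(
    [static_actions, np.zeros((len(static_actions), BASE_DIM))], axis=1)

# Normalization statistics come from the mobile task data only.
mean, std = mobile_actions.mean(axis=0), mobile_actions.std(axis=0) + 1e-8

def sample_batch(batch_size=8):
    """Draw each example from the mobile or static pool with p = 0.5."""
    batch = []
    for _ in range(batch_size):
        pool = mobile_actions if rng.random() < 0.5 else static_padded
        batch.append((pool[rng.integers(len(pool))] - mean) / std)
    return np.stack(batch)
```

Equal-probability sampling means the 50 mobile demonstrations are heavily upweighted relative to their share of the combined dataset, so the policy still sees the target task often despite the much larger static pool.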
The co-training approach produced striking improvements on several tasks.
| Task | Demonstrations | Success with co-training | Success without co-training |
|---|---|---|---|
| Wipe wine | 50 | 95% | 50% |
| Call elevator | 50 | 95% | 0% |
| Use cabinet | 50 | 85% | 85% |
| Rinse pan | 50 | 80% | 95% |
| Push chairs | 50 | 100% | 100% |
| Cook shrimp | 20 | 40% | 20% |
| High five | 20 | 85% | 85% |
The most dramatic improvement was on the "call elevator" task, where co-training raised the success rate from 0% to 95%. On "wipe wine," co-training nearly doubled the success rate from 50% to 95%. The researchers also showed that co-trained policies using only 35 demonstrations outperformed non-co-trained policies using 50 demonstrations by 20 percentage points on the wine-wiping task, demonstrating meaningful data efficiency gains.
The team tested co-training with three different policy architectures: ACT, Diffusion Policy, and VINN (a retrieval-based method). ACT showed the strongest overall performance. Diffusion Policy also benefited from co-training, with a 30 percentage point improvement on the wine-wiping task. VINN showed mixed results, with co-training helping on one task but slightly hurting on another.
Human operators achieved 39 to 52 percent reductions in task completion time after just five practice trials with the teleoperation interface, indicating that the system is relatively easy to learn.
ALOHA Unleashed, published by Google DeepMind in October 2024, investigates how far imitation learning can be pushed for challenging dexterous bimanual tasks. The authors (Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid) describe the approach as "a simple recipe": large-scale data collection on the ALOHA 2 hardware combined with expressive diffusion policy models.
ALOHA Unleashed uses a Transformer encoder-decoder architecture trained with a diffusion loss. The system operates on the ALOHA 2 platform with its four camera viewpoints.
| Architecture component | Details |
|---|---|
| Vision backbone | ResNet-50, processing 4 RGB images (480 x 640 x 3) |
| Encoder | 85M parameters, bidirectional attention |
| Decoder (diffusion denoiser) | 55M parameters, iterative action refinement |
| Total parameters (Base) | 217M |
| Total parameters (Small, ablations) | 150M |
| Action chunk size | 50 actions (1-second trajectories) |
| Action dimensions | 14 DoF (12 joint positions + 2 gripper values) |
| Diffusion steps (training) | 50, squared cosine noise schedule |
| Inference sampling | DDIM |
The diffusion-based approach is analogous to how image generation models like Imagen work: during training, noise is progressively added to the ground-truth action sequences, and the model learns to denoise them. At inference time, the model starts from random noise and iteratively refines it into a coherent action sequence.
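The forward (noising) half of this process is easy to write down. The sketch below implements the squared-cosine cumulative noise schedule and applies it to a dummy action chunk matching the table's dimensions; the reverse process (the learned denoiser and DDIM sampler) is only indicated in comments, since it requires the trained model.

```python
import numpy as np

T = 50  # diffusion steps, as in ALOHA Unleashed

def squared_cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction under the squared-cosine schedule;
    decays from 1 at t=0 toward 0 at t=T."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(50, 14))   # dummy 1-second chunk of 14-D actions

# Forward process: noise the clean action chunk at step t.
t = 25
ab = squared_cosine_alpha_bar(t, T)
noise = rng.normal(size=x0.shape)
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * noise

# Training target: predict `noise` from (xt, t, camera observations).
# Inference: start from pure noise and iteratively denoise with DDIM.
```

The key property is that a single closed-form expression produces the noised chunk at any step t, so training can sample t uniformly rather than simulating the chain step by step.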
The data collection effort for ALOHA Unleashed was unprecedented for a bimanual manipulation platform. A total of 35 human operators collected over 26,000 demonstrations across 10 ALOHA 2 robots in 2 buildings over approximately 8 months.
| Task | Number of demonstrations |
|---|---|
| Shirt hanging | 8,658 |
| Robot finger replacement | 5,247 |
| Shoelace tying | 5,133 |
| Gear insertion | 4,005 |
| Random kitchen stacking | 3,198 |
| Total | 26,241 |
The diversity in operators, robots, and environments introduced natural variation in teleoperation strategies and hardware conditions, which proved beneficial for policy robustness.
Success rates were evaluated over 20 trials per task.
| Task | Success rate |
|---|---|
| Shirt (easy configuration) | 75% |
| Shirt (messy configuration) | 70% |
| Shoelace (easy) | 70% |
| Shoelace (messy) | 40% |
| Robot finger replacement | 75% |
| Gear insertion (all 3 gears) | 40% |
| Random kitchen (all items) | 25% |
These results represent several firsts in robot learning. ALOHA Unleashed produced the first end-to-end learned policy that can autonomously tie shoelaces and the first that can hang t-shirts on a rack. The robot finger replacement task requires millimeter-precision insertion. All of these behaviors were learned purely from visual observations on uncalibrated hardware, with no explicit state estimation or task-specific engineering.
The paper directly compared the diffusion-based approach against ACT (trained with L1 regression loss) using the same 150M parameter model. On the ShirtMessy task, the diffusion policy achieved 70% success compared to 25% for ACT. The authors concluded that "non-diffusion based architectures are incapable of solving some of our tasks," suggesting that the expressiveness of diffusion models is necessary for highly dexterous, multimodal manipulation.
Ablation experiments, run with the smaller 150M-parameter model, probed the sensitivity of these results to individual design choices.
The ALOHA project has had a broad impact on the robotics research community because of its fully open-source nature. The original ALOHA repository on GitHub provides complete hardware assembly instructions, bill of materials, CAD files, and software for teleoperation and policy training. ALOHA 2 extended this with detailed tutorials and a MuJoCo simulation model. This openness has allowed labs worldwide to replicate and build upon the platform without starting from scratch.
Trossen Robotics, the manufacturer of the ViperX and WidowX arms used in ALOHA, began offering pre-assembled commercial kits based on the ALOHA design. These include the ALOHA Stationary kit (the bimanual tabletop setup), the ALOHA Mobile kit (with the AgileX mobile base), and the ALOHA Solo (a single-arm configuration). The ALOHA Solo starts at approximately $9,000. These commercial offerings lower the barrier to entry for research groups that lack the time or expertise to assemble the system from individual parts.
The Hugging Face LeRobot library, launched in 2024 and led by former Tesla robotics lead Remi Cadene, has adopted the ALOHA platform as one of its primary supported hardware configurations. LeRobot provides a unified PyTorch-based framework for training imitation learning, reinforcement learning, and vision-language-action (VLA) policies, with native support for ALOHA environments and datasets. The Hugging Face Hub hosts multiple ALOHA-related datasets (including aloha_mobile_cabinet and others) and pretrained ACT models.
Within its first twelve months, the LeRobot GitHub repository grew to over 12,000 stars, with an active community of builders sharing tutorials, modifications, and trained models on YouTube and Discord. LeRobot later partnered with The Robot Studio to release the SO-100 arm, a $100 robotic arm designed for accessibility, and NVIDIA announced GR00T N1, an open foundation model for humanoid robots, fine-tuned to run on the LeRobot SO-100 arm.
The open-source nature of ALOHA has inspired community members to build derivative platforms. One notable example is AlohaMini, an open-source dual-arm mobile robot with a motorized vertical lift (0 to 60 cm travel for floor-to-table reach) and a 5-camera perception system. AlohaMini is designed to be fully 3D-printable and can be assembled at home in approximately 60 minutes. It integrates with LeRobot for policy training and deployment, and in late 2025 it gained ManiSkill3 simulation support and a deployment guide for the Pi 0.5 foundation model.
Other community variants include the AgileX COBOT Magic, which builds on the Mobile ALOHA concept using AgileX's own robotics platform, and various university-developed modifications that adapt the ALOHA design for specific research needs.
The ALOHA project has been shaped by a small group of researchers, primarily from Stanford University and Google DeepMind.
| Researcher | Affiliation | Role in ALOHA project |
|---|---|---|
| Tony Z. Zhao | Stanford (formerly); co-founder and CEO of Sunday Robotics | Lead developer of original ALOHA and ACT; co-author of Mobile ALOHA and ALOHA Unleashed |
| Chelsea Finn | Stanford University (IRIS Lab) | Faculty advisor for ALOHA, Mobile ALOHA, and ALOHA Unleashed |
| Zipeng Fu | Stanford University | Co-lead of Mobile ALOHA |
| Vikash Kumar | Meta | Co-author of original ALOHA paper |
| Sergey Levine | UC Berkeley | Co-author of original ALOHA paper |
| Ayzaan Wahid | Google DeepMind | Co-lead of ALOHA Unleashed |
| Jonathan Tompson | Google DeepMind | Co-author of ALOHA Unleashed |
| Danny Driess | Google DeepMind | Co-author of ALOHA Unleashed |
Tony Z. Zhao, who was a computer science PhD student at Stanford under Chelsea Finn and held the Stanford Robotics Fellowship for 2022-23, left Stanford to co-found Sunday Robotics (sunday.ai) with Cheng Chi. Sunday Robotics is developing a home robot called Memo and raised $35 million in initial funding from Benchmark and Conviction. In March 2026, Sunday Robotics raised an additional $165 million and announced plans to launch its first autonomous robots by Thanksgiving 2026. The company's approach builds directly on the data-driven manipulation techniques pioneered in the ALOHA project, including a large-scale glove-based data collection program.
Zipeng Fu, a Stanford AI and Robotics PhD student supported by the Stanford Graduate Fellowship, led the Mobile ALOHA project alongside Tony Zhao.
| Date | Event |
|---|---|
| April 2023 | Original ALOHA paper submitted to arXiv (2304.13705) |
| July 2023 | ALOHA paper presented at RSS 2023 |
| January 2024 | Mobile ALOHA paper released; system goes viral on social media |
| May 2024 | ALOHA 2 hardware paper released (arXiv 2405.02292) |
| September 2024 | Google DeepMind announces ALOHA Unleashed and DemoStart |
| October 2024 | ALOHA Unleashed paper released (arXiv 2410.13126) |
| 2024 | Mobile ALOHA paper published at CoRL 2024 |
| 2024 | Trossen Robotics launches commercial ALOHA kits |
| 2024 | Hugging Face LeRobot library launches with ALOHA support |
| November 2025 | AlohaMini CAD files released |
| December 2025 | AlohaMini gains ManiSkill3 simulation integration |
| February 2026 | AlohaMini Pi 0.5 deployment guide released |
ALOHA's contribution to robotics and embodied AI research is primarily practical rather than theoretical. The system did not introduce fundamentally new concepts in robot learning or teleoperation; bimanual manipulation, imitation learning from demonstrations, and leader-follower teleoperation all predate ALOHA. What the project did was package these ideas into a system that was cheap enough for most labs to afford, simple enough to assemble and use, and open enough for others to modify and build upon.
The result has been a proliferation of bimanual manipulation research that would not have been feasible at previous hardware price points. Before ALOHA, collecting bimanual manipulation demonstrations typically required either expensive industrial hardware or custom-built research platforms that were difficult to replicate. ALOHA showed that commodity robot arms costing a few thousand dollars each, combined with a straightforward puppeteering interface, could produce demonstration data of sufficient quality to train effective manipulation policies.
The co-training insight from Mobile ALOHA, where static manipulation data collected on a tabletop system improves the performance of mobile manipulation policies, suggests that the broader ALOHA community's growing pool of shared demonstration data could have compounding benefits. As more labs collect and share ALOHA-format demonstrations through platforms like the Hugging Face Hub, the value of the shared data pool increases for everyone.
ALOHA Unleashed's results further demonstrated that, given enough data and sufficiently expressive models, learned policies can achieve dexterous manipulation capabilities that were previously only possible with carefully engineered, task-specific controllers. The fact that a single architecture (diffusion transformer) trained on teleoperation data can tie shoelaces, hang shirts, and perform millimeter-precision insertions, all on the same hardware, represents a meaningful step toward general-purpose robotic manipulation.