Action Chunking with Transformers (ACT)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,385 words
Action Chunking with Transformers (ACT) is an imitation learning algorithm for fine-grained robotic manipulation introduced by Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn in the April 2023 paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware," presented at Robotics: Science and Systems (RSS) 2023.[^1] The method trains a conditional variational autoencoder (CVAE) whose decoder is a transformer that predicts a temporally extended sequence (a "chunk") of K future actions from RGB camera observations and proprioceptive joint states, then executes those actions while temporally ensembling overlapping chunks to suppress jitter.[^2][^3] ACT was paired with ALOHA (A Low-cost Open-source Hardware system for bimanual teleoperATion), a roughly $20,000 dual-arm teleoperation rig built on Trossen Robotics ViperX 300 arms, and the combination demonstrated 80 to 96 percent success on fine-grained tasks such as battery slotting and ziploc opening from only 50 human demonstrations per task.[^1][^4] ACT has since become one of the canonical baselines and reference implementations for visuomotor policy learning, alongside diffusion policy, and underpins follow-on work including Mobile ALOHA, ALOHA Unleashed, and the action-chunked decoders used in Physical Intelligence's π0 family of generalist policies.[^5][^6][^7]
Behavior cloning from RGB images has been studied for decades, but practitioners had long observed that single-step policies trained on human demonstrations struggle to chain together the dozens of millimeter-accurate motions required for tasks like inserting a battery or threading a zip tie. Two failure modes are particularly damaging. First, demonstrations collected by human teleoperators are non-Markovian and frequently contain pauses, hesitations, or small corrective movements that map identical observations to different actions, which forces a single-step regressor to average across modes.[^2][^8] Second, single-step policies suffer from covariate shift: small per-timestep errors accumulate over long horizons, drifting the robot out of the demonstration distribution faster than it can recover.[^9]
The classical mitigation, DAgger and related interactive imitation learning methods, requires the expert to remain in the loop relabeling states, which is expensive and impractical for fine-grained bimanual tasks performed by teleoperators.[^9] Reinforcement learning offers another path but typically requires either accurate simulators (which are hard to build for contact-rich, deformable, or visually subtle manipulation) or large amounts of real-world reward signal.[^1]
Zhao and collaborators framed their contribution as a pragmatic engineering question: can a small team using inexpensive, commercially available arms collect a few dozen minutes of teleoperated data and learn robust policies for precision tasks, end-to-end from pixels? Two co-designed components addressed this: a teleoperation rig (ALOHA) that made high-quality demonstrations cheap to collect, and a learning method (ACT) that exploited the temporal structure of those demonstrations.[^1]
ALOHA is a leader-follower bimanual teleoperation system documented on its project page at tonyzhaozh.github.io/aloha and released as open-source hardware under the MIT license.[^4][^10] The follower workcell consists of two Trossen Robotics ViperX 300 6-DOF arms; each arm has roughly 75 cm horizontal reach, a 750 g payload, and is driven by Robotis Dynamixel XM540 and XM430 servos that expose position, velocity, current, and PID parameters over a U2D2 USB bus.[^11] The leader arms are two WidowX 250 6-DOF arms, chosen because their lower mass (0.53 kg) and smaller workspace make them comfortable for an operator to hold during multi-hour data collection sessions.[^11]
| Item | Specification |
|---|---|
| Total parts cost | Approximately $20,000 (USD) |
| Follower arms | 2x Trossen ViperX 300 6-DOF |
| Leader arms | 2x Trossen WidowX 250 6-DOF |
| Degrees of freedom | 7 per arm (6 joint + 1 gripper) = 14 total |
| Cameras | 4 RGB at 480x640 (2 wrist-mounted, 2 overhead) |
| Control rate | 50 Hz |
| Workspace | Bimanual tabletop |
| License | MIT (hardware and code) |
| Original paper | RSS 2023[^1] |
The control interface joins the leader and follower arms in joint-space: the operator backdrives the leader, and joint positions are streamed at 50 Hz to the follower as targets. Each demonstration episode records four synchronized RGB streams (two wrist cameras and two overhead cameras), the 14-dimensional follower joint state, and the 14-dimensional leader command, which serves as the action label for supervised learning.[^1][^4] The original release shipped with assembly tutorials, a ROS Noetic software stack, episode recording and replay scripts, and a MuJoCo-based simulation environment for the benchmark tasks.[^10]
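The recorded quantities can be pictured as a simple per-timestep record. The following is a minimal sketch of such a layout, not the format used by the ALOHA release: the class names, the camera labels, and the validation rules are all hypothetical and chosen only to mirror the dimensions described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical camera labels: two overhead and two wrist-mounted views.
CAMERA_NAMES = ("overhead_a", "overhead_b", "wrist_left", "wrist_right")
STATE_DIM = 14    # 2 arms x (6 joints + 1 gripper)
CONTROL_HZ = 50   # recording and control rate


@dataclass
class Step:
    """One recorded timestep of a teleoperated demonstration."""
    images: Dict[str, object]  # camera name -> 480x640x3 RGB frame (any array type)
    qpos: List[float]          # follower joint state (the observation)
    action: List[float]        # leader command (the supervised label)

    def __post_init__(self) -> None:
        if set(self.images) != set(CAMERA_NAMES):
            raise ValueError("expected exactly one frame per camera")
        if len(self.qpos) != STATE_DIM or len(self.action) != STATE_DIM:
            raise ValueError("qpos and action must be 14-dimensional")


@dataclass
class Episode:
    """An ordered list of steps recorded at CONTROL_HZ."""
    steps: List[Step] = field(default_factory=list)

    def duration_seconds(self) -> float:
        return len(self.steps) / CONTROL_HZ
```

Keeping the leader command separate from the follower state makes explicit that the policy is supervised on what the operator commanded, not on where the follower arm actually ended up.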
The core observation behind ACT is that the policy should not have to commit to one action at a time. If a teleoperator's hand pauses for half a second to align a zip tie, the policy that reproduces that pause must somehow know it is in the middle of a pause rather than at the start. A standard single-step policy only sees the current observation and cannot disambiguate identical frames, so it averages and produces a half-speed motion. Predicting a horizon of K future actions in one forward pass sidesteps this: the chunk implicitly carries the relative-time information that a single-step output lacks.[^2][^8]
Action chunking also addresses covariate shift directly. If errors compound at rate ε per step, committing to K-step sequences shrinks the effective decision horizon of an episode by a factor of K, so there are K times fewer points at which an independent error can push the agent away from the demonstration distribution.[^8] Zhao et al. use K = 100 (two seconds of motion at 50 Hz) in the main experiments, executing the chunks with the temporal ensembling described below.[^1][^3]
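To make the chunking idea concrete, the sketch below builds a K-step supervision target from a recorded action sequence and counts how many policy queries an episode needs when each chunk is executed fully open loop. The function names are hypothetical, and the zero-padding with a mask for chunks that run past the end of an episode is an assumption of this sketch rather than something stated above.

```python
from typing import List, Tuple


def action_chunk(
    actions: List[List[float]], t: int, k: int
) -> Tuple[List[List[float]], List[bool]]:
    """Return the k-step target chunk starting at step t, plus a padding mask."""
    dim = len(actions[0])
    chunk: List[List[float]] = []
    is_pad: List[bool] = []
    for i in range(t, t + k):
        if i < len(actions):
            chunk.append(list(actions[i]))
            is_pad.append(False)
        else:
            # Assumption: positions past the episode end are zero-filled and
            # flagged so that a loss function could ignore them.
            chunk.append([0.0] * dim)
            is_pad.append(True)
    return chunk, is_pad


def open_loop_queries(episode_len: int, k: int) -> int:
    """Policy queries needed if every chunk is executed to completion."""
    return -(-episode_len // k)  # ceiling division


if __name__ == "__main__":
    demo = [[float(i)] for i in range(5)]  # toy 1-D action sequence
    print(action_chunk(demo, t=3, k=4))
    print(open_loop_queries(400, 1), open_loop_queries(400, 100))
```

For a 400-step episode, single-step control makes 400 decisions while K = 100 makes 4, which is the K-fold reduction in effective horizon described above.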
ACT learns a generative model p(a_{t:t+K} | o_t) over action chunks conditioned on the current observation o_t, where the observation is the bundle of four RGB images and current joint state. The model is structured as a conditional variational autoencoder.[^1][^3]
During training the CVAE encoder is a BERT-style transformer that ingests a sequence consisting of a learned [CLS] token, the current joint positions, and the demonstrated K-step action sequence. It produces the parameters of a diagonal-Gaussian posterior q(z | a_{t:t+K}, joints) over a 32-dimensional "style" latent z that captures variation across demonstrations (for example, different ways an operator might approach the same object).[^3][^12] At inference time z is set to zero, equivalent to taking the mean of the prior, which yields a deterministic policy; the encoder is discarded.[^3]
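The difference between training-time and inference-time handling of z can be sketched as follows. This assumes the standard VAE reparameterization (z = μ + σ·ε with ε drawn from a unit Gaussian), which the text above does not spell out; the function name and the log-variance parameterization are likewise assumptions of this sketch.

```python
import math
import random
from typing import List, Optional

LATENT_DIM = 32  # size of the style latent z


def sample_style_latent(
    mu: Optional[List[float]] = None,
    logvar: Optional[List[float]] = None,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Training: reparameterized sample from the posterior. Inference: zeros."""
    if mu is None or logvar is None:
        # No encoder output available (deployment): use the prior mean.
        return [0.0] * LATENT_DIM
    rng = rng or random.Random()
    # z = mu + sigma * eps, with sigma = exp(0.5 * logvar)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
```

Because the deployed policy always receives the same z, its output depends only on the observation, which is what makes it deterministic.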
The CVAE decoder is the policy itself: it takes the four ResNet-18 image feature maps, the current joint positions, and z, then produces the K-step action chunk. Each 480x640x3 RGB image is processed by a ResNet-18 backbone to a 15x20x512 feature map; the four maps are flattened and concatenated into roughly 1200 image tokens at 512 dimensions, joined by projected joint and z tokens. A transformer encoder (four layers, eight attention heads, hidden size 512, feed-forward size up to roughly 3200) aggregates these tokens; a transformer decoder (seven layers) then uses cross-attention to produce K predicted actions.[^3][^12] The decoder architecture is adapted from DETR, the set-prediction object-detection transformer from Facebook AI Research; ACT's GitHub repository explicitly notes that its model code is modified from DETR.[^13][^14]
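The token count follows from the backbone's downsampling factor. A ResNet-18 reduces spatial resolution by a total stride of 32 before its pooling layer, so the arithmetic can be checked with a short sketch; the function names are hypothetical, and the sketch assumes input sizes that divide evenly by the stride.

```python
from typing import Tuple


def feature_grid(height: int, width: int, stride: int = 32) -> Tuple[int, int]:
    """Spatial size of the final conv feature map for a given total stride."""
    if height % stride or width % stride:
        raise ValueError("sketch assumes sizes divisible by the stride")
    return height // stride, width // stride


def image_token_count(height: int, width: int, num_cameras: int, stride: int = 32) -> int:
    """Tokens contributed by all cameras once each feature map is flattened."""
    rows, cols = feature_grid(height, width, stride)
    return rows * cols * num_cameras


if __name__ == "__main__":
    print(feature_grid(480, 640))          # (15, 20)
    print(image_token_count(480, 640, 4))  # 1200
```

Each camera contributes 15 × 20 = 300 tokens, so four cameras give the 1200 image tokens quoted above, before the joint and z tokens are appended.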
ACT is trained with an L1 reconstruction loss on the K-step action chunk plus a KL-divergence regularizer between the encoder posterior q(z | a, joints) and a standard Gaussian prior, weighted by a hyperparameter β. The official implementation uses β = 10. L1 (rather than L2) loss was found to better preserve sharp, precise actions, an observation later corroborated by other manipulation policies.[^3][^12] Training takes roughly five hours on a single NVIDIA RTX 2080 Ti for a single task with 50 demonstrations.[^12]
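The objective can be written out in a few lines. The sketch below is an illustration under stated assumptions, not the official implementation: it uses a mean reduction for the L1 term, sums the KL term over latent dimensions, parameterizes the posterior by its log-variance, and ignores batching and padding. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def l1_chunk_loss(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over every action dimension in the K-step chunk."""
    total, count = 0.0, 0
    for p_step, t_step in zip(pred, target):
        for p, t in zip(p_step, t_step):
            total += abs(p - t)
            count += 1
    return total / count


def kl_to_standard_normal(mu: List[float], logvar: List[float]) -> float:
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def act_loss(pred, target, mu, logvar, beta: float = 10.0) -> float:
    """L1 reconstruction plus beta-weighted KL regularizer."""
    return l1_chunk_loss(pred, target) + beta * kl_to_standard_normal(mu, logvar)
```

When the posterior already equals the standard Gaussian prior (mu = 0, logvar = 0) the KL term vanishes and only the reconstruction error remains, so β controls how strongly the encoder is discouraged from packing information into z.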
A naive deployment of action chunking would simply execute all K actions of a chunk in open loop, then query the policy again. This minimizes inference compute but produces visible jerk at chunk boundaries because consecutive chunks were generated from different observations. ACT instead re-queries the policy at every timestep, generating overlapping K-step chunks; at any given timestep t the agent has up to K predicted actions for that timestep, coming from the chunks that started K-1, ..., 1, 0 steps earlier.[^2][^3] These predictions are combined by a weighted average with weights w_i ∝ exp(-m·i), where w_0 is the weight on the oldest prediction and m is a decay constant that sets how quickly newer predictions are incorporated (a smaller m incorporates them faster). The resulting executed action smoothly blends older and more recent predictions, suppressing high-frequency jitter while still responding when fresh observations contradict older plans.[^2][^3] Despite querying every step, ACT inference runs in roughly 0.01 seconds on the deployed GPU, so per-step querying fits within the 20 ms period of the 50 Hz control loop.[^12]
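The weighted average can be sketched in a few lines of plain Python. The sketch orders the overlapping predictions oldest first, so that index 0 receives the largest weight, following the convention described in the original paper; the function names and the default value of m are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def ensemble_weights(n: int, m: float) -> List[float]:
    """Normalized weights w_i proportional to exp(-m * i), i = 0 being the oldest."""
    raw = [math.exp(-m * i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def temporal_ensemble(predictions: Sequence[Sequence[float]], m: float = 0.01) -> List[float]:
    """Blend all predictions made for the current timestep.

    predictions[i] is the action that the i-th overlapping chunk proposed for
    this timestep, ordered from the oldest chunk to the newest.
    """
    weights = ensemble_weights(len(predictions), m)
    dim = len(predictions[0])
    return [sum(w * p[d] for w, p in zip(weights, predictions)) for d in range(dim)]


if __name__ == "__main__":
    # Three chunks disagree slightly about one joint target.
    print(temporal_ensemble([[0.10], [0.12], [0.20]], m=0.01))
```

With m = 0 the result is a plain mean of all overlapping predictions; as m grows the oldest prediction dominates, so m trades smoothness against responsiveness to new observations.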
The RSS 2023 paper evaluates ACT on six real-world fine-grained tasks performed on ALOHA, plus simulated transfer-cube and bimanual-insertion benchmarks.[^1] The real tasks include slide-ziploc (open a translucent zip-top bag), slot-battery (insert a 9V battery into a slot with millimeter tolerance), open-cup (twist open a translucent condiment cup), thread-velcro (thread a cable tie through a small loop), prep-tape (peel and apply a piece of tape), and put-on-shoe (slip a shoe onto a manikin foot).[^1][^4] Demonstrations are around 50 per task, totaling roughly 10 to 15 minutes of teleoperation per task.[^4]
Reported success rates include 88 percent on slide-ziploc and 96 percent on slot-battery, with overall results spanning roughly 64 to 96 percent across the four headline tasks and lower success on the most precise variants such as thread-velcro.[^4][^12] In simulation, ACT solves the bimanual transfer-cube benchmark (where one arm picks up a cube and hands it to the other) and the bimanual insertion benchmark at success rates that substantially exceed earlier behavior-cloning baselines.[^1] The paper compares ACT against BC-ConvMLP (a standard convolutional behavior-cloning baseline), Behavior Transformers (BeT), RT-1, and the Visual Imitation through Nearest Neighbors (VINN) method; ACT outperforms all four, with the largest gaps on the precision-critical real tasks.[^1]
The paper also reports ablations isolating the contribution of each component: action chunking, temporal ensembling, and the CVAE-encoded style latent z each contribute measurable success-rate improvements, with chunking and the CVAE being the largest contributors and temporal ensembling providing additional smoothing.[^1]
ACT and ALOHA together kicked off a small ecosystem of follow-on hardware and policy research, much of which has retained "ALOHA" in its name as a reference to the original platform.
Mobile ALOHA (arXiv:2401.02117, January 2024) by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn extends ALOHA with an AGILEX Tracer mobile base and a whole-body teleoperation interface so the operator can drive the robot through a home while controlling the arms.[^5] The system is built around the same ViperX 300 follower arms and WidowX 250 leaders, plus the mobile base (roughly $7,000), an onboard 1.26 kWh battery, and a consumer laptop, bringing the total parts cost to approximately $32,000.[^5][^15] Mobile ALOHA demonstrated supervised behavior cloning, often using ACT with chunked predictions, on cooking and housekeeping tasks including sauteing and serving shrimp, opening a two-door wall cabinet to store heavy pots, calling and entering an elevator, and rinsing a pan in a kitchen sink. The paper showed that co-training mobile-manipulation data with the original ALOHA static datasets boosted success rates by up to 90 percent on the new mobile tasks with just 50 demonstrations each.[^5][^16]
ALOHA 2 (arXiv:2405.02292, May 2024) is a hardware refresh by a 24-author team primarily from Google DeepMind together with Stanford University and Hoku Labs.[^17] Key changes include a low-friction linear-rail gripper that drops the leader-side operating force from 14.68 N to 0.84 N while doubling the follower's output force from 12.8 N to 27.9 N, a passive gravity-compensation mechanism that replaces the original rubber bands, smaller Intel RealSense D405 depth cameras with global shutter, and a simplified aluminum-extrusion frame. ALOHA 2 also shipped a MuJoCo Menagerie model with higher physical and visual fidelity to support large-scale simulated data collection.[^17] Commercial kits are sold by Trossen Robotics in Solo, Stationary, and Mobile configurations.[^18]
ALOHA Unleashed (arXiv:2410.13126, October 2024, CoRL 2024) by Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid at Google DeepMind is the most ambitious application of the ALOHA platform to date.[^7][^19] Twenty-six thousand demonstrations were collected by 35 teleoperators across 10 ALOHA 2 robots over eight months, then used to train a 217-million-parameter transformer encoder-decoder policy with a diffusion policy head over chunked action sequences (50 timesteps, equivalent to 1 second of motion).[^7] The system solves long-horizon, deformable, and contact-rich tasks including shirt hanging (70 to 75 percent), shoelace tying (40 to 70 percent depending on initialization), multi-gear insertion, and replacing a damaged finger on another ALOHA robot.[^7] An ablation showed that a tuned ACT (action chunking with L1 regression) baseline at the same parameter scale achieved 25 percent on the messy-shirt-hanging task, versus 70 percent for the diffusion variant, indicating that at large data scales the multimodal action distribution captured by diffusion outperforms unimodal regression while keeping action chunking as a common ingredient.[^7]
ACT arrived in spring 2023 alongside Diffusion Policy (Chi et al., arXiv:2303.04137, RSS 2023), which independently used an action-chunked predictor but generated chunks by diffusion model denoising rather than CVAE decoding.[^6] The two methods became the standard reference baselines for visuomotor imitation learning in 2023 to 2025. Practitioner comparisons typically place ACT as faster (millisecond-scale inference for a chunk, within the 20 ms per-step budget of a 50 Hz control loop on commodity hardware), simpler to train and tune, and competitive with 50 or so demonstrations, while diffusion policies generally scale better with hundreds or thousands of demonstrations and handle highly multimodal action distributions more gracefully at the cost of multi-step denoising.[^20][^21]
The action-chunking decoder has since been incorporated into the LeRobot open-source library from Hugging Face, where ACT is recommended as the first policy class for newcomers because of its 80-million-parameter footprint, multi-hour single-GPU training, and strong performance with small datasets.[^3] LeRobot ships pretrained ACT checkpoints for the ALOHA simulated transfer-cube and insertion tasks, and the policy interface is reused for new low-cost arms such as the SO-100/SO-101.[^3]
The idea of predicting temporally extended action chunks rather than single steps has since been adopted by generalist robot foundation models. Physical Intelligence's π0 (arXiv:2410.24164, October 2024) builds a vision-language-action model on top of a pre-trained vision-language backbone (PaliGemma) and produces action chunks via flow matching at 50 Hz, with the model trained on roughly 10,000 hours of data from seven robot platforms across 68 tasks.[^22] The π0 architecture explicitly cites chunked-action prediction in the ACT and diffusion-policy lineage as a key design choice for smooth high-frequency control.[^22] The follow-on π0.5 (arXiv:2504.16054, April 2025) extends the recipe to open-world generalization, and π* variants continue to use chunked action prediction.[^22]
Several limitations of ACT are well documented in the original paper and subsequent practitioner literature.[^1][^20][^21]
The chunk size K is a sensitive hyperparameter. Too small a chunk recovers the original compounding-error problem; too large a chunk forces the policy to commit to obsolete plans when the world changes. Reported recommended ranges are 50 to 100 timesteps at 50 Hz, which corresponds to 1 to 2 seconds of motion, with K = 100 used in the original paper but K = 50 sometimes preferred in LeRobot recipes.[^3][^21]
ACT inherits the limitations of supervised behavior cloning. Its ability to recover from out-of-distribution states is bounded by the corrective behavior present in its demonstrations, and its CVAE policy with z set to zero at inference time is effectively unimodal at deployment, so it does not capture the full multimodality of human demonstrations the way diffusion policies do.[^7][^21] At large data scales this unimodality becomes a measurable disadvantage: the ALOHA Unleashed ablation explicitly attributed the gap of roughly 45 absolute percentage points on messy shirt hanging (25 versus 70 percent) to this limitation.[^7]
The policy is purely visuomotor; the original ACT does not condition on natural-language goals, so multi-task generalization requires either a multi-task variant or downstream conditioning, which subsequent work has addressed.[^23] Finally, ACT inherits ALOHA's calibration and embodiment dependence: a policy trained on one ALOHA can transfer poorly to a sibling robot with different camera placement or backlash, motivating later large-scale, multi-robot training recipes such as those in ALOHA Unleashed and π0.[^7][^22]
| Method | Year | Action representation | Inference cost | Notes |
|---|---|---|---|---|
| BC-ConvMLP | classic | single step | very low | baseline, suffers from compounding error[^1] |
| Behavior Transformers (BeT) | 2022 | single step (discretized) | low | k-means action bins[^1] |
| RT-1 | 2022 | single step (discretized) | medium | discretized actions, language-conditioned[^1] |
| ACT | 2023 | K-step chunk via CVAE | low (one forward pass) | L1 loss, temporal ensembling[^1][^3] |
| Diffusion Policy | 2023 | K-step chunk via denoising | medium-high (multi-step) | multimodal, more data hungry[^6] |
| OpenVLA | 2024 | single step (discretized tokens) | medium | VLM backbone, action tokens[^24] |
| π0 | 2024 | K-step chunk via flow matching | medium | VLM backbone, generalist[^22] |
| ALOHA Unleashed | 2024 | 50-step chunk via diffusion | high | trained on 26k demos, 217M params[^7] |