Action Chunking with Transformers (ACT)
Last reviewed
Sources
24 citations
Review status
Source-backed
Revision
v3 · 3,836 words
Action Chunking with Transformers (ACT) is an imitation learning algorithm for fine-grained robotic manipulation that predicts a short sequence (a "chunk") of future actions at once instead of a single next action, which shrinks the effective task horizon and reduces the compounding-error problem of behavioral cloning [1][2]. It was introduced by Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn in the April 2023 paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware," presented at Robotics: Science and Systems (RSS) 2023 [1]. ACT is implemented as a conditional variational autoencoder (CVAE) whose decoder is a transformer encoder-decoder that maps RGB camera images and proprioceptive joint states to a chunk of K future actions, and it uses a technique called "temporal ensembling" to smooth overlapping chunk predictions at deployment [1][3]. Paired with the roughly $20,000 ALOHA teleoperation rig, ACT learned six fine bimanual tasks (such as slotting a battery and opening a translucent condiment cup) at 80 to 90 percent success from only about 50 human demonstrations, equivalent to roughly 10 to 20 minutes of data per task [1][4]. ACT has since become one of the canonical baselines and reference implementations for visuomotor policy learning, alongside diffusion policy, and underpins follow-on work including Mobile ALOHA, ALOHA Unleashed, and the action-chunked decoders used in Physical Intelligence's pi0 family of generalist policies [5][6][7].
ACT (Action Chunking with Transformers) is a method for teaching robots precise manipulation skills by imitation learning, specifically by behavioral cloning from human teleoperation demonstrations. Its defining idea is action chunking: rather than predicting one action per timestep, the policy predicts a temporally extended block of K future actions in a single forward pass. The original paper describes ACT as a method that "learns a generative model over action sequences" [1].
ACT was introduced together with ALOHA (A Low-cost Open-source Hardware system for bimanual teleoperATion), a dual-arm teleoperation rig built on robot learning hardware from Trossen Robotics. The paper frames the system as a low-cost route to fine manipulation: "Fine manipulation tasks ... remain challenging for robots ... We develop a low-cost system that performs ... fine manipulation tasks, for example slotting a battery or opening a condiment cup" [1]. The two co-designed components are the ALOHA teleoperation rig, which makes high-quality demonstrations cheap to collect, and the ACT learning algorithm, which exploits the temporal structure of those demonstrations.
Behavior cloning from RGB images has been studied for decades, but practitioners had long observed that single-step policies trained on human demonstrations struggle to chain together the dozens of millimeter-accurate motions required for tasks like inserting a battery or threading a zip tie. Two failure modes are particularly damaging. First, demonstrations collected by human teleoperators are non-Markovian and frequently contain pauses, hesitations, or small corrective movements that map identical observations to different actions, which forces a single-step regressor to average across modes. The paper notes that "errors in the policy can compound over time, and human demonstrations can be non-stationary" [1][2][8]. Second, single-step policies suffer from covariate shift: small per-timestep errors accumulate over long horizons, drifting the robot out of the demonstration distribution faster than it can recover [9].
The classical mitigation, DAgger and related interactive imitation learning methods, requires the expert to remain in the loop relabeling states, which is expensive and impractical for fine-grained bimanual tasks performed by teleoperators [9]. Reinforcement learning offers another path but typically requires either accurate simulators (which are hard to build for contact-rich, deformable, or visually subtle manipulation) or large amounts of real-world reward signal [1].
Zhao and collaborators framed their contribution as a pragmatic engineering question: can a small team using inexpensive, commercially available arms collect a few dozen minutes of teleoperated data and learn robust policies for precision tasks, end-to-end from pixels? Two co-designed components addressed this: a teleoperation rig (ALOHA) that made high-quality demonstrations cheap to collect, and a learning method (ACT) that exploited the temporal structure of those demonstrations [1].
ALOHA is a leader-follower bimanual teleoperation system documented on its project page at tonyzhaozh.github.io/aloha and released as open-source hardware under the MIT license [4][10]. The paper states the team "build the bimanual teleoperation setup ALOHA within a 20k USD budget" [1]. The follower workcell consists of two Trossen Robotics ViperX 300 6-DOF arms; each arm has roughly 75 cm horizontal reach, a 750 g payload, and is driven by Robotis Dynamixel XM540 and XM430 servos that expose position, velocity, current, and PID parameters over a U2D2 USB bus [11]. The leader arms are two WidowX 250 6-DOF arms, chosen because their lower mass (0.53 kg) and smaller workspace make them comfortable for an operator to hold during multi-hour data collection sessions [11].
| Item | Specification |
|---|---|
| Total parts cost | Approximately $20,000 (USD) [1] |
| Follower arms | 2x Trossen ViperX 300 6-DOF |
| Leader arms | 2x Trossen WidowX 250 6-DOF |
| Degrees of freedom | 7 per arm (6 joint + 1 gripper) = 14 total |
| Cameras | 4 RGB at 480x640 (2 wrist-mounted, 2 overhead) |
| Control rate | 50 Hz |
| Workspace | Bimanual tabletop |
| License | MIT (hardware and code) |
| Original paper | RSS 2023 [1] |
The control interface joins the leader and follower arms in joint-space: the operator backdrives the leader, and joint positions are streamed at 50 Hz to the follower as targets. Each demonstration episode records four synchronized RGB streams (two wrist cameras and two overhead cameras), the 14-dimensional follower joint state, and the 14-dimensional leader command, which serves as the action label for supervised learning [1][4]. The original release shipped with assembly tutorials, a ROS Noetic software stack, episode recording and replay scripts, and a MuJoCo-based simulation environment for the benchmark tasks [10].
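The per-timestep record described above can be sketched as a small data structure. This is only an illustration of the layout, not the format used by the ALOHA recording scripts: the camera key names, the `Timestep` class, and the `steps_in_episode` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

CONTROL_HZ = 50   # streaming and logging rate described above
STATE_DIM = 14    # 2 arms x (6 joints + 1 gripper)
# Hypothetical camera keys; the text only says two wrist and two overhead views.
CAMERAS = ("wrist_left", "wrist_right", "overhead_1", "overhead_2")


@dataclass
class Timestep:
    images: Dict[str, object]    # one RGB frame (480x640x3) per camera key
    follower_qpos: List[float]   # observation: follower joint state
    leader_qpos: List[float]     # action label: leader joint command

    def validate(self) -> None:
        """Raise ValueError if the record does not match the layout above."""
        if set(self.images) != set(CAMERAS):
            raise ValueError("expected exactly the four camera streams")
        if len(self.follower_qpos) != STATE_DIM:
            raise ValueError("follower state must be 14-dimensional")
        if len(self.leader_qpos) != STATE_DIM:
            raise ValueError("leader command must be 14-dimensional")


def steps_in_episode(seconds: float, hz: int = CONTROL_HZ) -> int:
    """Number of logged timesteps for an episode of the given duration."""
    return round(seconds * hz)
```

Under these assumptions a 10-second demonstration holds `steps_in_episode(10)` = 500 such records, each pairing the follower observation with the leader command used as its supervised label.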
The core observation behind ACT is that the policy should not have to commit to one action at a time. If a teleoperator's hand pauses for half a second to align a zip tie, the policy that reproduces that pause must somehow know it is in the middle of a pause rather than at the start. A standard single-step policy only sees the current observation and cannot disambiguate identical frames, so it averages and produces a half-speed motion. Predicting a horizon of K future actions in one forward pass sidesteps this: the chunk implicitly carries the relative-time information that a single-step output lacks [2][8].
Action chunking also addresses covariate shift directly. By predicting and partly executing K-step sequences, the policy makes far fewer independent decisions per episode. As the paper puts it, action chunking "reduces the effective horizon of the task by k-fold, mitigating compounding errors" [1]. Zhao et al. use K = 100 (two seconds of motion at 50 Hz) in the main experiments, executing each chunk in a temporally ensembled fashion described below [1][3]. To see the scale of the reduction, consider a 5 to 10 second task: a single-step policy needs 250 to 500 decisions at 50 Hz, whereas open-loop execution of 100-step chunks needs only 3 to 5 chunk predictions, a reduction by the chunk size K [21].
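The horizon arithmetic can be checked with a minimal sketch of open-loop chunk execution. The `policy` callable here is a hypothetical stand-in that receives the current timestep instead of a real observation, and the function names are illustrative rather than taken from any ACT codebase.

```python
import math
from typing import Callable, List, Tuple


def open_loop_queries(episode_steps: int, chunk_size: int) -> int:
    """Policy queries needed when every chunk is executed fully in open loop."""
    return math.ceil(episode_steps / chunk_size)


def run_open_loop(
    policy: Callable[[int], List[float]],
    episode_steps: int,
    chunk_size: int,
) -> Tuple[List[float], int]:
    """Execute chunks back to back and count how often the policy is queried."""
    executed: List[float] = []
    queries = 0
    while len(executed) < episode_steps:
        chunk = policy(len(executed))[:chunk_size]
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        queries += 1
        remaining = episode_steps - len(executed)
        executed.extend(chunk[:remaining])  # truncate the last chunk if needed
    return executed, queries


# Hypothetical policy: the "action" is just the timestep it is meant for.
dummy_policy = lambda t: [float(t + i) for i in range(100)]
actions, n_queries = run_open_loop(dummy_policy, episode_steps=250, chunk_size=100)
```

With a 250-step episode and 100-step chunks the loop queries the policy three times, compared with 250 queries for a single-step policy (`chunk_size=1`).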
ACT learns a generative model p(a_{t:t+K} | o_t) over action chunks conditioned on the current observation o_t, where the observation is the bundle of four RGB images and current joint state. The model is structured as a conditional variational autoencoder. The paper describes the design directly: "We implement action chunking policy with Transformers, an architecture designed for sequence modeling, and train it as a conditional VAE (CVAE) to capture the variability in human data" [1][3].
During training the CVAE encoder is a BERT-style transformer that ingests a sequence consisting of a learned [CLS] token, the current joint positions, and the demonstrated K-step action sequence. It produces the parameters of a diagonal-Gaussian posterior q(z | a_{t:t+K}, joints) over a 32-dimensional "style" latent z that captures variation across demonstrations (for example, different ways an operator might approach the same object) [3][12]. At inference time z is set to zero, equivalent to taking the mean of the prior, which yields a deterministic policy; the encoder is discarded [3].
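The difference between the training-time and inference-time latent can be sketched in a few lines. This assumes the encoder outputs a mean and a log-variance per latent dimension and that sampling uses the usual VAE reparameterization trick; both are assumptions of this sketch, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

LATENT_DIM = 32  # size of the style latent z described above


def sample_latent(
    mu: Sequence[float], logvar: Sequence[float], rng: random.Random
) -> List[float]:
    """Training: draw z = mu + sigma * eps from the diagonal-Gaussian posterior."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


def inference_latent(dim: int = LATENT_DIM) -> List[float]:
    """Deployment: z is fixed to the prior mean (all zeros), so no encoder is needed."""
    return [0.0] * dim


rng = random.Random(0)
z_train = sample_latent([0.0] * LATENT_DIM, [0.0] * LATENT_DIM, rng)
z_test = inference_latent()
```

Because `inference_latent` never consults the action sequence, the decoder becomes a deterministic function of the observation at deployment.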
The CVAE decoder is the policy itself: it takes the four ResNet-18 image feature maps, the current joint positions, and z, then produces the K-step action chunk. Each 480x640x3 RGB image is processed by a ResNet-18 backbone to a 15x20x512 feature map; the four maps are flattened and concatenated into 1200 image tokens (4 x 15 x 20) of dimension 512, joined by one projected joint-position token and one projected z token. A transformer encoder (four layers, eight attention heads, hidden size 512, feed-forward size of roughly 3200) aggregates these tokens; a transformer decoder (seven layers) then uses cross-attention over the encoder output to produce K predicted actions [3][12]. The architecture is adapted from DETR, the object-detection transformer from Facebook AI Research that frames detection as set prediction; ACT's GitHub repository explicitly notes that its model code is modified from DETR [13][14].
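The token count follows from the feature-map shape. The sketch below assumes the ResNet-18 backbone downsamples each spatial dimension by a total factor of 32, which is consistent with the 15x20 map quoted above; the helper name is hypothetical.

```python
from typing import Tuple


def encoder_tokens(
    height: int = 480,
    width: int = 640,
    stride: int = 32,        # assumed total downsampling of the backbone
    num_cameras: int = 4,
    extra_tokens: int = 2,   # one joint-position token and one z token
) -> Tuple[int, int, int, int]:
    """Return (feature rows, feature cols, image tokens, total encoder tokens)."""
    rows, cols = height // stride, width // stride
    image_tokens = rows * cols * num_cameras
    return rows, cols, image_tokens, image_tokens + extra_tokens


rows, cols, image_tokens, total_tokens = encoder_tokens()
```

For the default 480x640 input this gives a 15x20 grid per camera, 1200 image tokens across four cameras, and 1202 tokens once the joint and z tokens are appended.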
ACT is trained with an L1 reconstruction loss on the K-step action chunk plus a KL-divergence regularizer between the encoder posterior q(z | a, joints) and a standard Gaussian prior, weighted by a hyperparameter beta. The official implementation uses beta = 10. L1 (rather than L2) loss was found to better preserve sharp, precise actions, an observation later corroborated by other manipulation policies [3][12]. Training takes roughly five hours on a single NVIDIA RTX 2080 Ti for a single task with 50 demonstrations [12].
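The objective can be written out as a small, dependency-free sketch. It assumes the posterior is parameterized by a mean and a log-variance, uses the closed-form KL divergence between a diagonal Gaussian and a standard Gaussian, and averages the L1 term over every action dimension in the chunk; the function names are illustrative, and a real implementation would use batched tensors.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over all timesteps and action dimensions of a chunk."""
    diffs = [abs(p - t) for ps, ts in zip(pred, target) for p, t in zip(ps, ts)]
    return sum(diffs) / len(diffs)


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def act_loss(pred, target, mu, logvar, beta: float = 10.0) -> float:
    """Reconstruction term plus beta-weighted KL regularizer (beta = 10 as stated above)."""
    return l1_loss(pred, target) + beta * kl_to_standard_normal(mu, logvar)


# Tiny example: a 1-step, 2-dimensional "chunk" and a 32-dimensional latent.
example = act_loss(
    pred=[[1.0, 2.0]],
    target=[[1.0, 4.0]],
    mu=[1.0] + [0.0] * 31,
    logvar=[0.0] * 32,
)
```

In this example the L1 term is 1.0 and the KL term is 0.5, so the total is 1.0 + 10 x 0.5 = 6.0.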
Temporal ensembling is the inference-time technique ACT uses to convert overlapping action-chunk predictions into one smooth command stream. A naive deployment of action chunking would simply execute all K actions of a chunk in open loop, then query the policy again. This minimizes inference compute but produces visible jerk at chunk boundaries because consecutive chunks were generated from different observations. ACT instead re-queries the policy at every timestep, generating overlapping K-step chunks; at any given timestep t the agent has up to K predicted actions for that timestep, one from each chunk that started 0, 1, ..., K-1 steps earlier [2][3]. As the paper summarizes, "we query the policy more frequently and averages across the overlapping action chunks" [1]. These predictions are combined by a weighted average with weights w_i proportional to exp(-m*i), where i = 0 indexes the oldest prediction still available for that timestep and m is a coefficient controlling the decay rate; a smaller m gives newer predictions relatively more weight, so new observations are incorporated faster [1][2][3]. The executed action thus blends recent and older predictions, suppressing high-frequency jitter while still responding when fresh observations contradict older plans [2][3]. The paper notes the scheme "incurs no additional training cost, only extra inference-time computation" [1], and ACT inference runs in roughly 0.01 seconds on the deployed GPU, so querying the policy at every step fits within the 20 ms period of the 50 Hz control loop [12].
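A minimal sketch of the ensembling loop follows. The indexing convention is an assumption of this sketch: predictions are ordered oldest chunk first and weighted by exp(-m*i) in that order, so reversing the list gives the opposite convention. The default m = 0.01 is an illustrative value, and the class and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Action = Sequence[float]


def ensemble_weights(n: int, m: float = 0.01) -> List[float]:
    """Normalized weights proportional to exp(-m * i) for i = 0 .. n-1."""
    raw = [math.exp(-m * i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def ensemble_action(predictions: Sequence[Action], m: float = 0.01) -> List[float]:
    """Weighted average of all predictions made for one timestep (index 0 first)."""
    weights = ensemble_weights(len(predictions), m)
    dim = len(predictions[0])
    return [sum(w * p[d] for w, p in zip(weights, predictions)) for d in range(dim)]


class TemporalEnsembler:
    """Keeps recent chunks and blends every prediction that covers the current step."""

    def __init__(self, m: float = 0.01) -> None:
        self.m = m
        self.t = 0
        self.chunks: List[Tuple[int, Sequence[Action]]] = []  # (start step, chunk)

    def step(self, new_chunk: Sequence[Action]) -> List[float]:
        """Register the chunk predicted at the current step and return the action to run."""
        self.chunks.append((self.t, new_chunk))
        # Drop chunks that ended before the current timestep.
        self.chunks = [(s, c) for s, c in self.chunks if self.t - s < len(c)]
        # Oldest chunk first; each contributes its prediction for step self.t.
        preds = [c[self.t - s] for s, c in self.chunks]
        action = ensemble_action(preds, self.m)
        self.t += 1
        return action
```

In use, the control loop would call `step` once per timestep with the freshly predicted K-step chunk and send the returned action to the robot. Setting `m=0.0` reduces the scheme to a plain average over the overlapping predictions.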
The RSS 2023 paper evaluates ACT on six real-world fine-grained tasks performed on ALOHA, plus simulated transfer-cube and bimanual-insertion benchmarks [1]. The real tasks include slide-ziploc (open a translucent zip-top bag), slot-battery (insert a battery into a slot with millimeter tolerance), open-cup (twist open a translucent condiment cup), thread-velcro (thread a cable tie through a small loop), prep-tape (peel and apply a piece of tape), and put-on-shoe (slip a shoe onto a manikin foot) [1][4]. The team "record 50 demonstrations for each task, except for Thread Velcro which has 100," totaling roughly 10 to 20 minutes of teleoperation per task [1][4].
Reported success rates span roughly 80 to 90 percent across the headline tasks, including high success on slide-ziploc and slot-battery, with lower success on the most precise tasks such as thread-velcro [1][4][12]. In simulation, ACT solves the bimanual transfer-cube benchmark (where one arm picks up an object and hands it to the other) and the bimanual insertion benchmark at success rates that substantially exceed earlier behavior-cloning baselines [1]. The paper compares ACT against BC-ConvMLP (a standard convolutional behavior-cloning baseline), Behavior Transformers (BeT), RT-1, and the Visual Imitation through Nearest Neighbors (VINN) method; ACT outperforms all four, with the largest gaps on the precision-critical real tasks [1].
The paper also reports ablations isolating the contribution of each component: action chunking, temporal ensembling, and the CVAE-encoded style latent z each contribute measurable success-rate improvements. As the authors conclude, "we find both action chunking and temporal ensembling to be important for the success of ACT" [1].
ACT and ALOHA together kicked off a small ecosystem of follow-on hardware and policy research, much of which has retained "ALOHA" in its name as a reference to the original platform.
Mobile ALOHA (arXiv:2401.02117, January 2024) by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn extends ALOHA with an AgileX Tracer mobile base and a whole-body teleoperation interface so the operator can drive the robot through a home while controlling the arms [5]. The system is built around the same ViperX 300 follower arms and WidowX 250 leaders, plus the mobile base (roughly $7,000), an onboard 1.26 kWh battery, and a consumer laptop, bringing the total parts cost to approximately $32,000 [5][15]. Mobile ALOHA demonstrated supervised behavior cloning, often using ACT with chunked predictions, on cooking and housekeeping tasks including sauteing and serving shrimp, opening a two-door wall cabinet to store heavy pots, calling and entering an elevator, and rinsing a pan in a kitchen sink. The paper showed that co-training on the new mobile-manipulation data together with the original static ALOHA datasets boosted success rates by up to 90 percent on the mobile tasks with just 50 demonstrations each [5][16].
ALOHA 2 (arXiv:2405.02292, May 2024) is a hardware refresh by a 24-author team primarily from Google DeepMind together with Stanford University and Hoku Labs [17]. Key changes include a low-friction linear-rail gripper that drops the leader-side operating force from 14.68 N to 0.84 N while doubling the follower's output force from 12.8 N to 27.9 N, a passive gravity-compensation mechanism that replaces the original rubber bands, smaller Intel RealSense D405 depth cameras with global shutter, and a simplified aluminum-extrusion frame. ALOHA 2 also shipped a MuJoCo Menagerie model with higher physical and visual fidelity to support large-scale simulated data collection [17]. Commercial kits are sold by Trossen Robotics in Solo, Stationary, and Mobile configurations [18].
ALOHA Unleashed (arXiv:2410.13126, October 2024, CoRL 2024) by Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid at Google DeepMind is the most ambitious application of the ALOHA platform to date [7][19]. Twenty-six thousand demonstrations were collected by 35 teleoperators across 10 ALOHA 2 robots over eight months, then used to train a 217-million-parameter transformer encoder-decoder policy with a diffusion policy head over chunked action sequences (50 timesteps, equivalent to 1 second of motion) [7]. The system solves long-horizon, deformable, and contact-rich tasks including shirt hanging (70 to 75 percent), shoelace tying (40 to 70 percent depending on initialization), multi-gear insertion, and replacing a damaged finger on another ALOHA robot [7]. An ablation showed that a tuned ACT (action chunking with L1 regression) baseline at the same parameter scale achieved 25 percent on the messy-shirt-hanging task, versus 70 percent for the diffusion variant, indicating that at large data scales the multimodal action distribution captured by diffusion outperforms unimodal regression while keeping action chunking as a common ingredient [7].
ACT arrived in spring 2023 alongside Diffusion Policy (Chi et al., arXiv:2303.04137, RSS 2023), which independently used an action-chunked predictor but generated chunks by diffusion model denoising rather than CVAE decoding [6]. The two methods became the standard reference baselines for visuomotor imitation learning in 2023 to 2025. Practitioner comparisons typically place ACT as faster (millisecond-scale inference for a chunk, within the 20 ms per-step budget of a 50 Hz control loop on commodity hardware), simpler to train and tune, and competitive with 50 or so demonstrations. Diffusion policies generally scale better with hundreds or thousands of demonstrations and handle highly multimodal action distributions more gracefully, at the cost of multi-step denoising at inference time [20][21].
The action-chunking decoder has since been incorporated into the LeRobot open-source library from Hugging Face, where ACT is recommended as the first policy class for newcomers because of its 80-million-parameter footprint, multi-hour single-GPU training, and strong performance with small datasets [3]. LeRobot ships pretrained ACT checkpoints for the ALOHA simulated transfer-cube and insertion tasks, and the policy interface is reused for new low-cost arms such as the SO-100/SO-101 [3].
The idea of predicting temporally extended action chunks rather than single steps has since been adopted by generalist robot foundation models. Physical Intelligence's pi0 (arXiv:2410.24164, October 2024) builds a vision-language-action model on top of a pre-trained vision-language backbone (PaliGemma) and produces action chunks via flow matching at 50 Hz, with the model trained on roughly 10,000 hours of data from seven robot platforms across 68 tasks [22]. The pi0 architecture explicitly cites chunked-action prediction in the ACT and diffusion-policy lineage as a key design choice for smooth high-frequency control [22]. The follow-on pi0.5 (arXiv:2504.16054, April 2025) extends the recipe to open-world generalization, and pi* variants continue to use chunked action prediction [22].
Several limitations of ACT are well documented in the original paper and subsequent practitioner literature [1][20][21].
The chunk size K is a sensitive hyperparameter. Too small a chunk recovers the original compounding-error problem; too large a chunk forces the policy to commit to obsolete plans when the world changes. Reported recommended ranges are 50 to 100 timesteps at 50 Hz, which corresponds to 1 to 2 seconds of motion, with K = 100 used in the original paper but K = 50 sometimes preferred in LeRobot recipes [3][21].
ACT inherits the limitations of supervised behavior cloning. It cannot recover from out-of-distribution states better than the demonstrations it was trained on, and its CVAE policy with z set to zero at inference time is effectively unimodal at deployment, which means it does not capture the full multimodality of human demonstrations the way diffusion policies do [7][21]. At large data scales the unimodal collapse becomes a measurable disadvantage: the ALOHA Unleashed ablation explicitly attributed roughly 45 absolute percentage points of success-rate gap on messy shirt hanging to this limitation [7].
The policy is purely visuomotor; the original ACT does not condition on natural-language goals, so multi-task generalization requires either a multi-task variant or downstream conditioning, which subsequent work has addressed [23]. Finally, ACT inherits ALOHA's calibration and embodiment dependence: a policy trained on one ALOHA can transfer poorly to a sibling robot with different camera placement or backlash, motivating later large-scale, multi-robot training recipes such as those in ALOHA Unleashed and pi0 [7][22].
| Method | Year | Action representation | Inference cost | Notes |
|---|---|---|---|---|
| BC-ConvMLP | classic | single step | very low | baseline, suffers from compounding error [1] |
| Behavior Transformers (BeT) | 2022 | single step (discretized) | low | k-means action bins [1] |
| RT-1 | 2022 | single step (discretized) | medium | discretized actions, language-conditioned [1] |
| ACT | 2023 | K-step chunk via CVAE | low (one forward pass) | L1 loss, temporal ensembling [1][3] |
| Diffusion Policy | 2023 | K-step chunk via denoising | medium-high (multi-step) | multimodal, more data hungry [6] |
| OpenVLA | 2024 | single step (discretized tokens) | medium | VLM backbone, action tokens [24] |
| pi0 | 2024 | K-step chunk via flow matching | medium | VLM backbone, generalist [22] |
| ALOHA Unleashed | 2024 | 50-step chunk via diffusion | high | trained on 26k demos, 217M params [7] |