See also: Robotics, Robot learning, Embodied AI, Computer vision
Robot manipulation refers to the ability of a robotic system to physically interact with objects in its environment through grasping, pushing, pulling, inserting, placing, and other contact-based actions. It is one of the central problems in robotics and a prerequisite for deploying autonomous robots in unstructured real-world settings such as homes, hospitals, warehouses, and construction sites. While industrial robots have performed repetitive pick-and-place operations in structured factory environments since the 1960s, enabling robots to manipulate novel objects in cluttered, unpredictable environments remains an open research challenge.
The difficulty of robot manipulation stems from the interplay of perception, planning, and control under uncertainty. A robot must perceive the geometry and physical properties of objects using sensors such as cameras, depth sensors, and tactile arrays. It must plan a sequence of motions that achieve a desired goal while respecting kinematic constraints, collision avoidance, and contact dynamics. Finally, it must execute those motions with closed-loop control that adapts to disturbances, modeling errors, and object slip. Advances in deep learning, reinforcement learning, computer vision, and simulation have dramatically expanded the capabilities of manipulation systems in recent years, particularly with the emergence of foundation models and vision-language-action models in 2024 and 2025.
The history of robot manipulation can be traced through several distinct eras, each defined by the dominant technology and control paradigm.
In 1961, Unimate became the first industrial robot, developed by George Devol and Joseph Engelberger. Installed at the General Motors Ternstedt plant in Trenton, New Jersey, Unimate performed tasks such as transporting die-castings and welding parts onto car bodies. The robot could move along the X and Y axes, possessed a rotatable pincer-like gripper, and could follow a program of up to 200 stored movements. These early systems operated in highly structured environments with precisely known object positions, relying entirely on pre-programmed trajectories with no sensory feedback.
The development of force/torque sensors and early computer vision systems enabled closed-loop manipulation, where robots could adapt their behavior based on sensory input. The Stanford/JPL Hand, developed in the early 1980s, is considered the first dexterous robotic hand built to investigate fine manipulation. It featured three independently controlled fingers equipped with force feedback. The Utah/MIT Dextrous Hand, developed around the same period, pushed the boundaries of human-like articulation with 16 degrees of freedom driven by pneumatic actuators. These systems demonstrated that multi-fingered manipulation was mechanically feasible, though the computational tools to control them intelligently were still limited.
The convergence of deep learning, large-scale simulation, and affordable hardware sparked a revolution in manipulation research. Convolutional neural networks enabled robots to detect grasp poses directly from raw sensor data. Reinforcement learning allowed robots to discover manipulation strategies through trial and error. Large-scale simulation environments made it possible to train policies on millions of episodes before transferring them to physical hardware. In 2019, OpenAI demonstrated that a five-fingered Shadow Dexterous Hand could solve a Rubik's Cube using reinforcement learning policies trained entirely in simulation, marking a milestone in dexterous sim-to-real transfer.
The end effector is the device attached to the end of a robotic arm that physically interacts with objects. The choice of end effector fundamentally shapes what manipulation tasks a robot can perform. End effectors range from simple two-finger grippers to anthropomorphic hands with dozens of degrees of freedom.
| Gripper Type | Mechanism | Strengths | Limitations | Typical Applications |
|---|---|---|---|---|
| Parallel jaw | Two opposing flat jaws move in parallel to clamp objects | Simple, robust, high force | Limited to objects that fit between jaws; poor with irregular shapes | Assembly, machine tending, bin picking |
| Suction/vacuum | Vacuum pump creates suction through cups or channels | Handles flat, smooth surfaces well; fast cycle times | Requires smooth, non-porous surfaces; struggles with heavy or irregularly shaped objects | Packaging, palletizing, sheet handling |
| Soft gripper | Flexible elastomeric fingers conform to object shape | Gentle on fragile objects; adapts to irregular geometry | Lower gripping force; limited precision | Food handling, agricultural picking, medical devices |
| Multi-finger dexterous | Three or more actuated fingers with multiple joints | Versatile; can perform in-hand manipulation and tool use | Complex control; high cost; mechanical fragility | Research, dexterous tasks, humanoid robots |
| Electromagnetic | Magnetic field holds ferrous objects | Fast attach/detach; no surface damage | Only works with ferromagnetic materials | Metal sheet handling, automotive manufacturing |
| Needle/pin | Needles penetrate porous material to secure grip | Works on soft, porous materials | Damages the object surface; narrow material range | Textile and fabric handling |
Dexterous robotic hands aim to replicate the versatility of the human hand, which has approximately 27 degrees of freedom. Notable examples include the Stanford/JPL Hand, the Utah/MIT Dextrous Hand, and the Shadow Dexterous Hand.
Grasping is the most fundamental manipulation skill: the ability to securely hold an object so it can be lifted, transported, or repositioned. Research on grasping spans analytical methods rooted in classical mechanics and data-driven methods that leverage machine learning.
Grasp analysis draws on the mechanics of contact to evaluate whether a given set of contact points can securely restrain an object. Two foundational concepts are force closure, in which the contacts can resist any external wrench by exploiting friction, and form closure, in which the contact geometry alone immobilizes the object regardless of friction.
The quality of a grasp can be measured by metrics such as the largest minimum resisting wrench, the volume of the grasp wrench space, or task-specific criteria. Computing optimal grasps under these metrics is generally NP-hard, so practical algorithms rely on heuristics, sampling, or optimization.
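The largest-minimum-resisting-wrench metric (the Ferrari-Canny metric) can be sketched concretely for a planar object. The following toy code, a sketch rather than a production grasp analyzer, discretizes each friction cone into primitive unit wrenches and measures the radius of the largest origin-centered ball inside their convex hull, using SciPy's convex hull routine; a value of zero means the grasp is not force closure.

```python
import numpy as np
from scipy.spatial import ConvexHull

def friction_cone_wrenches(p, n, mu, num_edges=8):
    # Discretize the planar friction cone at contact point p (inward
    # unit normal n, friction coefficient mu) into unit forces, and map
    # each force to a wrench [fx, fy, tau] about the object origin.
    t = np.array([-n[1], n[0]])                    # surface tangent
    wrenches = []
    for a in np.linspace(-np.arctan(mu), np.arctan(mu), num_edges):
        f = np.cos(a) * n + np.sin(a) * t
        tau = p[0] * f[1] - p[1] * f[0]            # planar cross product
        wrenches.append([f[0], f[1], tau])
    return wrenches

def epsilon_quality(contacts, mu=0.5):
    # Ferrari-Canny style metric: radius of the largest origin-centered
    # ball contained in the convex hull of the primitive contact
    # wrenches. Zero means the grasp is not force closure.
    W = []
    for p, n in contacts:
        W.extend(friction_cone_wrenches(np.asarray(p, float),
                                        np.asarray(n, float), mu))
    hull = ConvexHull(np.array(W))
    # Facet equations satisfy a @ x + b <= 0 for interior points, with
    # unit normals a, so -b is the origin-to-facet distance when the
    # origin lies inside the hull.
    return max(0.0, (-hull.equations[:, -1]).min())
```

For two antipodal contacts on a unit disk the metric is positive (force closure); for two contacts on adjacent sides it drops to zero, matching the classical planar analysis.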
Analytical methods compute grasps by reasoning about object geometry, contact models, and physical constraints. Given a known 3D model of the object and the gripper kinematics, these methods search for contact point configurations that satisfy force or form closure conditions. The search is typically formulated as a constrained nonlinear optimization problem.
Contact models used in analytical grasp planning include:
| Contact Model | Description | Friction Considered |
|---|---|---|
| Frictionless point contact | Contact force is normal to the surface only | No |
| Frictional point contact (Coulomb) | Contact force lies within a friction cone defined by the coefficient of friction | Yes |
| Soft finger contact | Adds a torsional friction component around the contact normal | Yes |
Analytical methods offer interpretability and can provide formal guarantees on grasp stability. However, they require accurate 3D object models and material properties, which are often unavailable in unstructured environments. They also tend to be computationally expensive for complex geometries.
Data-driven approaches learn to predict successful grasps from sensor data, typically RGB images, depth images, or point clouds. These methods have shown remarkable adaptability in diverse scenarios by learning from large datasets, without requiring explicit 3D models of the objects.
Deep learning methods for grasp synthesis broadly follow four algorithmic strategies: grasp pose sampling (generating candidate grasps and scoring them), direct grasp pose regression (predicting grasp parameters end-to-end), reinforcement learning (learning grasping policies through trial and error), and exemplar-based methods (retrieving and adapting grasps from a database of known objects).
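The sample-and-score strategy can be illustrated with a toy antipodal sampler: draw random contact pairs from a 2-D point cloud with surface normals and keep those where the grasp axis lies inside both friction cones. This is a sketch; real systems score candidates with a learned network rather than the hand-written alignment heuristic used here.

```python
import numpy as np

def sample_antipodal_grasps(points, normals, mu=0.5, num_samples=200, rng=None):
    """Toy grasp-pose sampler: draw random surface point pairs and keep
    those satisfying the antipodal condition, i.e. the line between the
    contacts lies inside both friction cones."""
    rng = np.random.default_rng(rng)
    half_angle = np.arctan(mu)                 # friction cone half-angle
    grasps = []
    for _ in range(num_samples):
        i, j = rng.choice(len(points), size=2, replace=False)
        axis = points[j] - points[i]
        dist = np.linalg.norm(axis)
        if dist < 1e-9:
            continue
        axis = axis / dist
        # Angle between the grasp axis and each inward surface normal.
        ang_i = np.arccos(np.clip(np.dot(axis, normals[i]), -1.0, 1.0))
        ang_j = np.arccos(np.clip(np.dot(-axis, normals[j]), -1.0, 1.0))
        if ang_i < half_angle and ang_j < half_angle:
            # Heuristic score: prefer contacts well inside the cones.
            score = 1.0 - max(ang_i, ang_j) / half_angle
            grasps.append((int(i), int(j), score))
    return sorted(grasps, key=lambda g: -g[2])
```

On a circular object with inward normals, the accepted pairs are the nearly antipodal ones, mirroring how learned samplers concentrate probability mass on promising grasp regions.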
Dexterous grasping extends the problem to multi-fingered hands, where the robot must coordinate the motion of multiple fingers to form stable grasps on objects of diverse shapes. DexGraspNet, introduced by Wang et al. (2023), provides a large-scale dataset of 1.32 million simulated grasps for 5,355 objects across more than 133 categories, with over 200 diverse grasps per object. The follow-up DexGraspNet 2.0 (2024) expanded the benchmark to 1,319 objects, 8,270 cluttered scenes, and 427 million grasps, demonstrating zero-shot sim-to-real transfer with a 90.7% real-world dexterous grasping success rate in cluttered scenes using a diffusion model conditioned on local geometry.
In-hand manipulation refers to the ability to reposition, reorient, or reconfigure an object within the grasp of a robotic hand, without setting it down. This is one of the most challenging manipulation skills because it requires coordinated finger motions, precise contact control, and real-time adaptation to prevent the object from slipping or falling.
Humans perform in-hand manipulation effortlessly when rotating a pen between fingers, unscrewing a bottle cap, or adjusting the grip on a tool. For robots, replicating these behaviors requires high-dimensional control of multi-fingered hands (typically 12 to 24 degrees of freedom) under complex contact dynamics.
One of the most notable demonstrations of in-hand manipulation was OpenAI's system that solved a Rubik's Cube using a Shadow Dexterous Hand. The policy was trained entirely in simulation using reinforcement learning with a novel technique called Automatic Domain Randomization (ADR). ADR automatically generates training environments of ever-increasing difficulty by randomizing physical parameters such as friction, object mass, and actuator noise. The training consumed approximately 13,000 years of simulated experience. The resulting policy solved the cube 60% of the time (20% for maximally scrambled configurations) and exhibited robustness to real-world perturbations it was never trained on, such as wearing a rubber glove or having fingers tied together.
Recent work on in-hand manipulation has focused on combining reinforcement learning with tactile feedback and learning from human demonstrations. Tactile sensors such as DIGIT and GelSight provide rich contact information that enables finer control during manipulation. The ManiSkill-ViTac challenge (2025) specifically benchmarks manipulation skill learning that combines vision and tactile sensing, driving research on multimodal perception for dexterous tasks.
Robot manipulation research employs a spectrum of methods ranging from classical model-based techniques to modern learning-based approaches, as well as hybrid systems that combine the strengths of both.
Model-based manipulation relies on explicit mathematical models of the robot, the objects, and the physics of contact. These approaches plan manipulation actions by simulating the effects of candidate motions and selecting those that achieve the desired outcome.
Motion planning is a core component of model-based manipulation. The robot must find a collision-free path from its current configuration to a goal configuration that places the end effector in the desired grasp or placement pose. Sampling-based planners such as Rapidly-exploring Random Trees (RRT) and Probabilistic Roadmaps (PRM) are widely used because they can handle high-dimensional configuration spaces without requiring an explicit representation of the obstacle boundaries. For manipulation specifically, RRT-Connect performs well in single-arm and dual-arm tasks due to its bidirectional search strategy.
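A minimal RRT can be written in a few dozen lines. The sketch below, for a low-DOF configuration space under an assumed `collision_free(q)` predicate, shows the core loop shared by sampling-based planners: sample a configuration (with goal biasing), extend the nearest tree node toward it, and keep the new node if the connecting edge is collision-free.

```python
import numpy as np

def rrt(start, goal, collision_free, bounds, step=0.3, iters=2000,
        goal_bias=0.1, rng=None):
    """Minimal RRT sketch. collision_free(q) -> bool tests one
    configuration; edges are validated by dense interpolation."""
    rng = np.random.default_rng(rng)
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    nodes, parents = [start], [-1]

    def edge_free(a, b):
        return all(collision_free(a + t * (b - a))
                   for t in np.linspace(0.0, 1.0, 10))

    for _ in range(iters):
        # Goal biasing: occasionally grow the tree toward the goal.
        q_rand = goal if rng.random() < goal_bias else rng.uniform(lo, hi)
        i_near = min(range(len(nodes)),
                     key=lambda i: np.linalg.norm(nodes[i] - q_rand))
        d = q_rand - nodes[i_near]
        q_new = nodes[i_near] + d * min(1.0, step / (np.linalg.norm(d) + 1e-12))
        if collision_free(q_new) and edge_free(nodes[i_near], q_new):
            nodes.append(q_new)
            parents.append(i_near)
            if np.linalg.norm(q_new - goal) < step:
                path, i = [], len(nodes) - 1
                while i != -1:          # walk parent links back to start
                    path.append(nodes[i])
                    i = parents[i]
                return path[::-1]
    return None
```

The returned path ends within one step of the goal; RRT-Connect improves on this single-tree version by growing a second tree from the goal and attempting to join the two.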
Trajectory optimization computes smooth, dynamically feasible trajectories by solving a constrained optimization problem that minimizes a cost function (such as time, energy, or jerk) subject to joint limits, collision constraints, and task-specific requirements. In contact-rich manipulation, trajectory optimization must also account for the switching dynamics of making and breaking contact.
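A stripped-down instance of this idea is minimizing summed squared accelerations over a discrete trajectory with some timesteps pinned (start, goal, via points). Because the objective is quadratic and the constraints are equality pins, the problem reduces to a linear least-squares solve; this sketch handles one joint, whereas real trajectory optimizers add collision and dynamics constraints.

```python
import numpy as np

def min_accel_trajectory(q_fixed, T):
    """Toy trajectory optimization: find q_0..q_T minimizing summed
    squared accelerations, with timesteps in q_fixed pinned to given
    values (e.g. {0: start, T: goal})."""
    n = T + 1
    # Second-difference operator: (D q)_t = q_{t+1} - 2 q_t + q_{t-1}
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]
    free = [t for t in range(n) if t not in q_fixed]
    fixed = sorted(q_fixed)
    q = np.zeros(n)
    q[fixed] = [q_fixed[t] for t in fixed]
    # Least squares over the free variables: minimize ||D q||^2 given
    # the pinned entries, i.e. solve D_free x = -D_fixed q_fixed.
    sol, *_ = np.linalg.lstsq(D[:, free], -D[:, fixed] @ q[fixed], rcond=None)
    q[free] = sol
    return q
```

With only the endpoints pinned, the minimum-acceleration trajectory is a straight line in configuration space; adding a via point bends it into a smooth spline-like curve.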
Task and motion planning (TAMP) integrates symbolic task planning (deciding what actions to take and in what order) with geometric motion planning (deciding how to execute each action). TAMP is essential for long-horizon manipulation tasks that involve multiple objects and sequential dependencies, such as setting a dinner table or assembling furniture.
Learning-based methods acquire manipulation skills from data, either through supervised learning from demonstrations, reinforcement learning from trial and error, or self-supervised learning from autonomous exploration.
Imitation learning trains a robot policy to reproduce expert behavior from demonstrations. Human operators provide demonstrations through kinesthetic teaching (physically guiding the robot arm), teleoperation, or even video recordings of humans performing the task. The policy learns a mapping from observations (images, joint positions, force readings) to actions.
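At its simplest, this mapping can be fit by supervised regression. The sketch below implements behavior cloning with a linear policy via ridge regression; real systems replace the linear map with a deep network over images, but the supervised objective is the same.

```python
import numpy as np

def behavior_cloning_linear(obs, actions, reg=1e-6):
    """Minimal imitation learning sketch: fit a linear policy
    a = W @ [obs, 1] to demonstration (observation, action) pairs by
    ridge-regularized least squares."""
    X = np.hstack([obs, np.ones((len(obs), 1))])   # append bias feature
    # Closed-form ridge solution: W = (X'X + reg I)^-1 X' A
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ actions)
    return lambda o: np.hstack([o, 1.0]) @ W       # the learned policy
```

Given demonstrations generated by an expert acting as a = 2o + 0.5, the cloned policy recovers that mapping and generalizes to unseen observations within the training distribution; the well-known failure mode, compounding errors once the robot drifts off the demonstrated states, motivates the more elaborate frameworks below.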
Among imitation learning frameworks, Diffusion Policy has been especially influential. Introduced by Chi et al. (2023), it represents robot visuomotor policies as conditional denoising diffusion processes. Rather than predicting a single action, the policy generates action trajectories by iteratively denoising a sample from Gaussian noise, conditioned on the current observation. This formulation handles multimodal action distributions gracefully (since demonstrations of the same task can involve different strategies), scales to high-dimensional action spaces, and trains stably. Diffusion Policy consistently outperformed prior state-of-the-art robot learning methods, with an average improvement of 46.9% across benchmark tasks.
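The reverse-diffusion loop at the heart of this approach can be sketched in NumPy. The `oracle` below is a stand-in for the trained noise-prediction network: it is the analytically optimal denoiser when all demonstrations place the action at one fixed point, so the sampler should recover exactly that action. Everything here is a didactic toy, not the Diffusion Policy implementation.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=None):
    """DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise into an action sample. eps_model(x, k) predicts
    the noise present at diffusion step k."""
    rng = np.random.default_rng(rng)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for k in reversed(range(len(betas))):
        eps = eps_model(x, k)
        # Posterior mean of x_{k-1} given x_k and the predicted noise.
        x = (x - betas[k] / np.sqrt(1.0 - abar[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            x = x + np.sqrt(betas[k]) * rng.standard_normal(shape)
    return x

# Oracle denoiser for a point-mass "demonstration distribution" at a
# fixed target action; a trained network replaces this in practice.
betas = np.linspace(1e-4, 0.02, 50)
abar = np.cumprod(1.0 - betas)
target = np.array([0.7, -0.2])
oracle = lambda x, k: (x - np.sqrt(abar[k]) * target) / np.sqrt(1.0 - abar[k])
action = ddpm_sample(oracle, target.shape, betas, rng=0)
```

With a multimodal demonstration distribution, the same loop (with a learned denoiser) samples from all modes rather than averaging them, which is the property that makes diffusion attractive for policies.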
Reinforcement learning allows robots to discover manipulation strategies through interaction with the environment, guided by a reward signal. The robot receives a reward for successful task completion (or partial rewards for progress) and learns a policy that maximizes expected cumulative reward.
RL has been successfully applied to contact-rich manipulation tasks that are difficult to demonstrate or model. Key challenges include sample efficiency (physical robot interactions are slow and expensive), reward design (specifying a reward that captures the desired behavior without unintended shortcuts), and sim-to-real transfer (policies trained in simulation often fail on real hardware due to the "reality gap").
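The underlying trial-and-error loop is easiest to see in a tabular toy. The sketch below applies Q-learning to a hypothetical 1-D pushing task: an object sits in one of `n` cells, actions push it left or right, and the episode ends with reward 1 when it reaches the goal cell. Real manipulation RL replaces the table with a neural network and the toy dynamics with a physics simulator.

```python
import numpy as np

def q_learning_push(n=10, goal=9, episodes=1000, alpha=0.5, gamma=0.95,
                    eps=0.2, rng=0):
    """Tabular Q-learning on a toy 1-D pushing task."""
    rng = np.random.default_rng(rng)
    Q = np.zeros((n, 2))                  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = int(rng.integers(0, n))       # random initial object cell
        for _ in range(4 * n):            # step budget per episode
            # Epsilon-greedy exploration.
            a = int(rng.integers(0, 2)) if rng.random() < eps else int(Q[s].argmax())
            s2 = max(0, min(n - 1, s + (1 if a == 1 else -1)))
            r = 1.0 if s2 == goal else 0.0
            # TD target; the goal state is terminal (no bootstrap).
            target = r if s2 == goal else gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if s == goal:
                break
    return Q
```

After training, the greedy policy pushes right from every cell left of the goal, and the learned values decay geometrically with distance to the goal, reflecting the discount factor.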
The most significant trend in robot manipulation research from 2023 to 2025 has been the emergence of foundation models and vision-language-action models (VLAs). These models combine the semantic understanding of large language models and vision-language models with robotic action generation, enabling robots to follow natural language instructions, generalize to novel objects, and perform multi-step reasoning.
| Model | Organization | Year | Parameters | Key Features |
|---|---|---|---|---|
| RT-1 | Google | 2022 | 35M | Transformer-based; trained on 130k real robot episodes |
| RT-2 | Google DeepMind | 2023 | 12B/55B | VLA model; represents actions as text tokens; 3x generalization improvement over RT-1 |
| Octo | UC Berkeley | 2024 | 27M/93M | Lightweight open-source generalist robot policy |
| OpenVLA | Stanford | 2024 | 7B | Open-source VLA trained on Open X-Embodiment dataset from 22 embodiments |
| pi0 | Physical Intelligence | 2024 | 3B | Flow-matching VLA; trained on 10,000+ hours of robot data across 7 platforms and 68 tasks |
| Helix | Figure AI | 2025 | Not disclosed | VLA for humanoid robots; trained on 500 hours of teleoperation data |
| GR00T N1 | NVIDIA | 2025 | Not disclosed | Dual-system VLA for humanoid robots; combines fast diffusion policy with LLM-based planner |
| Gemini Robotics | Google DeepMind | 2025 | Not disclosed | Extension of Gemini 2.0 to physical manipulation; demonstrated origami folding and card manipulation |
| SmolVLA | Hugging Face | 2025 | 450M | Compact open-source VLA with performance comparable to much larger models |
RT-2, introduced by Google DeepMind in July 2023, was a landmark VLA that represented robot actions as text tokens, enabling the model to transfer semantic knowledge from web-scale language and image data to robotic control. It demonstrated capabilities such as following the instruction "pick up the bag about to fall off the table" or "move banana to the sum of two plus one," which required both visual understanding and mathematical reasoning that the model acquired from its web pretraining.
pi0, released by Physical Intelligence in October 2024 and open-sourced in February 2025, demonstrated a new level of dexterous generalist manipulation. Trained on over 10,000 hours of robot data from seven different platforms, pi0 could perform tasks including laundry folding, table clearing, dish loading, egg carton stacking, box assembly, and grocery bagging. The model uses flow matching (a variant of diffusion) to generate high-frequency continuous actions at up to 50 Hz.
While vision provides information about object shape, pose, and appearance, tactile sensing provides direct information about contact: forces, pressures, slip, texture, and local geometry at the point of contact. Tactile feedback is essential for tasks that require precise force control, such as handling fragile objects, inserting tight-fitting parts, or detecting incipient slip during grasping.
A major advance in tactile sensing has been the development of optical (vision-based) tactile sensors that use cameras embedded behind a deformable elastomer to capture high-resolution images of contact geometry. These sensors offer much higher spatial resolution and richer data than traditional force/torque sensors.
| Sensor | Developer | Approximate Cost | Key Features |
|---|---|---|---|
| GelSight | MIT / GelSight Inc. | $500 (GelSight Mini) | High-resolution 3D surface reconstruction; micron-level detail |
| DIGIT | Meta AI / GelSight | $350 | Compact fingertip form factor; widely used in research |
| Digit 360 | Meta AI / GelSight (2024) | Not disclosed | 18+ sensing modalities; detects forces as small as 1 millinewton; human-level tactile precision |
| BioTac | SynTouch (discontinued) | $5,000-$10,000 | Multimodal (force, vibration, temperature); bio-inspired design |
| TACTO | Meta AI | Open source (sim) | Optical tactile sensor simulator for training in simulation |
The availability of low-cost, high-resolution tactile sensors such as DIGIT has been described as approaching an "ImageNet moment" for touch, enabling large-scale data collection and the development of tactile foundation models. The ManiSkill-ViTac 2025 challenge specifically benchmarks manipulation tasks that require the integration of vision and tactile sensing.
Sim-to-real transfer is the process of training robot manipulation policies in simulation and deploying them on physical hardware. Simulation offers virtually unlimited data, safe exploration, and the ability to parallelize training across thousands of environments. However, the "reality gap" between simulation and the real world (caused by differences in physics, rendering, sensor noise, and actuation dynamics) means that policies trained purely in simulation often fail when deployed on real robots.
Domain randomization addresses the reality gap by randomizing simulation parameters (friction coefficients, object masses, lighting conditions, camera positions, actuator noise) during training, so the policy learns to be robust to a wide range of conditions. The real world then becomes just one more sample from the distribution of training environments. OpenAI's Automatic Domain Randomization (ADR), used for the Rubik's Cube demonstration, automatically increases the randomization range as the policy improves.
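A simplified version of this curriculum can be sketched in a few lines. In the code below, each physics parameter is sampled from an interval that widens when the policy performs well and shrinks back toward its nominal value when it struggles; the parameter names and thresholds are illustrative, and OpenAI's actual ADR adjusts one interval boundary at a time based on performance measured at that boundary.

```python
import numpy as np

def adr_step(ranges, perf, lo_thresh=0.5, hi_thresh=0.9, delta=0.05):
    """One curriculum update in a simplified ADR scheme. `ranges` maps
    parameter name -> (lo, hi, nominal); `perf` is the policy's success
    rate on the current randomization distribution."""
    new = {}
    for name, (lo, hi, nominal) in ranges.items():
        if perf > hi_thresh:        # mastered this range: widen it
            lo, hi = lo - delta, hi + delta
        elif perf < lo_thresh:      # too hard: shrink toward nominal
            lo = min(nominal, lo + delta)
            hi = max(nominal, hi - delta)
        new[name] = (lo, hi, nominal)
    return new

def sample_env(ranges, rng=None):
    """Draw one randomized simulation configuration."""
    rng = np.random.default_rng(rng)
    return {k: rng.uniform(lo, hi) for k, (lo, hi, _) in ranges.items()}
```

Training alternates between sampling environments from the current ranges, estimating policy performance, and calling the update, so the difficulty of the simulated worlds tracks the policy's competence.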
Domain adaptation methods explicitly learn to align the distributions of simulated and real observations. This can be achieved through image-to-image translation (transforming simulated images to look realistic), feature-level alignment, or learning domain-invariant representations.
The real-to-sim-to-real paradigm first constructs a digital twin of the real environment by scanning it with sensors, then trains policies in this faithful simulation, and finally deploys them back to the real world. MIT CSAIL's RialTo system (2024) demonstrated this approach, enabling users to capture digital twins on the fly and achieving a 67% improvement over imitation learning with the same number of demonstrations.
NVIDIA's AutoMate system (RSS 2024) trained robotic assembly skills using reinforcement learning and imitation learning on a dataset of 100 assembly tasks, achieving an 84.5% mean success rate in real-world deployment across 20 assemblies. DexGraspNet 2.0 demonstrated zero-shot sim-to-real transfer for dexterous grasping with a 90.7% success rate in cluttered scenes.
Simulation platforms are critical infrastructure for manipulation research, enabling training, evaluation, and reproducible comparison of methods.
| Platform | Physics Engine | GPU Parallelization | Key Features |
|---|---|---|---|
| MuJoCo | Custom | Limited | Fast, accurate contact simulation; widely used in RL research; open-sourced by DeepMind in 2022 |
| Isaac Lab (NVIDIA) | PhysX 5 | Yes | High-fidelity rendering; tight integration with NVIDIA GPU ecosystem |
| SAPIEN / ManiSkill | PhysX | Yes | Focus on manipulation; ManiSkill3 achieves 30,000+ FPS with GPU parallelization |
| PyBullet | Bullet | No | Open source; accessible; common in academic research |
| RoboCasa | MuJoCo | CPU only | Kitchen and household manipulation scenarios |
| Genesis | Custom | Yes | Recent platform with differentiable physics |
| RoboVerse (2025) | Multi-engine | Yes | Unified interface to 8+ physics engines (Isaac, MuJoCo, SAPIEN, PyBullet, and others) |
ManiSkill3, powered by SAPIEN, is notable for achieving state-of-the-art GPU-parallelized simulation and rendering performance. Its simulation plus rendering speed reaches over 30,000 frames per second with 2 to 4 times better GPU memory efficiency than comparable platforms, making it practical to run large-scale reinforcement learning experiments for manipulation.
RoboVerse, accepted to RSS 2025, provides a standardized interface across eight or more physics engines, enabling researchers to train and evaluate manipulation policies across different simulators without rewriting their code.
| Dataset / Benchmark | Focus | Scale |
|---|---|---|
| GraspNet-1Billion | 6-DOF parallel jaw grasping | 190 scenes; ~1.1 billion grasp annotations |
| DexGraspNet | Dexterous multi-finger grasping | 1.32 million grasps; 5,355 objects |
| DexGraspNet 2.0 | Dexterous grasping in clutter | 427 million grasps; 8,270 scenes |
| Open X-Embodiment | Cross-embodiment manipulation | 1 million+ episodes from 22 robot embodiments |
| ARMBench (Amazon) | Industrial pick and place | 190,000+ objects in industrial settings |
| RLBench | Multi-task manipulation | 100 manipulation tasks with language descriptions |
| CALVIN | Language-conditioned manipulation | Long-horizon tasks in simulated tabletop |
Most manipulation research focuses on rigid objects, where geometry and physics are relatively straightforward to model. Deformable object manipulation (DOM) addresses objects that change shape when forces are applied, such as cloth, rope, food, biological tissue, and flexible packaging. DOM is considered one of the primary bottlenecks for real-world deployment of autonomous robots, particularly in domains such as manufacturing, food processing, surgical robotics, and domestic assistance.
| Category | Examples | Key Challenges |
|---|---|---|
| 1D (linear) | Rope, cable, wire, thread | High-dimensional configuration space; nonlinear dynamics; self-occlusion |
| 2D (planar) | Cloth, fabric, paper, sheet metal | Very high-dimensional state; self-collision; difficult to perceive folds |
| 3D (volumetric) | Dough, sponge, soft tissue, fruit | Complex material properties; irreversible deformation; difficulty in sensing internal state |
| Fragile | Glass, eggs, soft fruit, tissues | Safety-critical force limits; real-time adaptive control required |
Accurate models of deformable objects are often unavailable due to their strong nonlinearity and diversity of material properties. Recent research combines physics-based simulation (finite element methods, position-based dynamics) with learned models to handle the complexity. For fragile objects, force/torque sensing provides vital feedback, enabling robots to limit exerted forces below damage thresholds.
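Position-based dynamics, one of the simplified models mentioned above, can be sketched for a 1-D deformable object in a few lines: a rope is modeled as particles joined by distance constraints, with the first particle pinned. The parameters below (particle count, damping, iteration counts) are illustrative rather than tuned.

```python
import numpy as np

def simulate_rope_pbd(n=20, rest_len=0.05, steps=400, iters=20,
                      dt=1.0 / 60.0, gravity=-9.8, damping=0.95):
    """Position-based dynamics sketch of a pinned rope. Starting
    horizontal, the rope swings under gravity and settles hanging."""
    x = np.zeros((n, 2))
    x[:, 0] = np.arange(n) * rest_len      # initial horizontal layout
    v = np.zeros_like(x)
    for _ in range(steps):
        v[:, 1] += gravity * dt            # external forces
        p = x + v * dt                     # predicted positions
        p[0] = x[0]                        # pin the first particle
        for _ in range(iters):             # Gauss-Seidel constraint solve
            for i in range(n - 1):
                d = p[i + 1] - p[i]
                dist = np.linalg.norm(d)
                corr = (dist - rest_len) * d / (dist + 1e-12)
                if i == 0:                 # pinned end takes no correction
                    p[i + 1] -= corr
                else:
                    p[i] += 0.5 * corr
                    p[i + 1] -= 0.5 * corr
        v = (p - x) / dt * damping         # velocity update with damping
        x = p.copy()
    return x
```

The appeal of PBD for manipulation research is that constraints (inextensibility, attachment to a gripper) are enforced directly on positions, which keeps the simulation stable at interactive rates even for stiff objects.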
Contact-rich manipulation encompasses tasks where sustained, purposeful contact between the robot and objects is essential for success. Unlike pick-and-place operations (where contact is limited to grasping and releasing), contact-rich tasks involve sliding, pivoting, pushing, screwing, insertion, and other interactions that require reasoning about contact forces and friction throughout execution.
Examples include peg-in-hole insertion, furniture assembly, opening doors and drawers, wiping surfaces, and tool use. Classical motion planners such as RRT are designed to find collision-free paths, which makes them ill-suited for tasks where contact is the goal rather than an obstacle. Specialized planners and controllers that reason about contact modes, friction cones, and force constraints are required.
Recent work on contact-rich whole-body manipulation uses example-guided reinforcement learning to generate robust skills for manipulating large and unwieldy objects (such as moving furniture), where the robot must use its arms, torso, and sometimes legs to maintain contact and control.
High-quality demonstration data is a bottleneck for learning-based manipulation. Teleoperation systems enable human operators to control robots remotely, generating the demonstration datasets needed to train imitation learning policies.
The Universal Manipulation Interface (UMI), presented at RSS 2024, enables data collection using hand-held grippers that can be used in the wild (outside of lab settings), providing a portable, low-cost, and information-rich approach for capturing bimanual and dynamic manipulation demonstrations that can be directly transferred to robot policies.
UniBiDex (2025) provides a unified teleoperation framework supporting both VR and leader-follower inputs through a shared kinematic and safety-aware control module, enabling precise real-time bimanual dexterous manipulation across diverse devices and tasks.
The development of humanoid robots has brought manipulation research into a new context, where robots must coordinate whole-body motion with dexterous hand control. Several companies are developing humanoid platforms with manipulation capabilities.
Tesla's Optimus (Tesla Bot) has been in active development, with the Generation 2 hands featuring 11 degrees of freedom in 2023 and the Generation 3 hands expanding to 22 degrees of freedom in 2024. Tesla commenced mass production of Optimus Gen 3 at its Fremont factory in January 2026.
Figure AI's Figure 01, standing 168 cm tall and weighing 60 kg with 19+ degrees of freedom, integrates OpenAI-powered conversational AI and has been deployed in BMW factory pilot programs. The company's Helix VLA model (2025) was trained on approximately 500 hours of teleoperation data specifically for humanoid manipulation tasks.
NVIDIA's GR00T N1 VLA (2025) uses a dual-system architecture combining a fast diffusion policy (10 ms latency for real-time reactive control) with an LLM-based planner for strategic reasoning, targeting humanoid manipulation applications.
Robot manipulation has the most mature deployment in industrial settings, where structured environments and repetitive tasks reduce the difficulty of the perception and planning problems.
The Amazon Picking Challenge (later Amazon Robotics Challenge), launched in 2015, catalyzed research on robotic picking in warehouse settings. The challenge required teams to build systems that could pick diverse items from shelves and stow them into containers. Amazon has since scaled its internal robotics deployment, reaching one million robots across more than 300 facilities worldwide by June 2025. The company introduced DeepFleet, a generative AI foundation model for coordinating robot fleet movement, improving travel time by 10%.
Approximately 50% of labor costs in manual e-commerce warehouses are spent on picking. The global robotic picking market is predicted to see revenues grow tenfold between 2023 and 2030. Vision-guided robotic systems using 2D and 3D cameras to detect object location, orientation, and size have become the standard for modern pick-and-place operations.
Robots perform assembly tasks including screw driving, snap fitting, insertion, welding, and adhesive application. Contact-rich assembly tasks such as peg-in-hole insertion and gear meshing require force-sensitive control and often benefit from compliance (the ability of the robot to yield slightly under contact forces) to avoid jamming or damage.
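Compliance is commonly realized in software as an admittance law: the commanded motion behaves like a mass-spring-damper anchored at the desired pose, so the robot yields under contact force instead of fighting it. The 1-D sketch below uses illustrative gains; a real insertion controller runs a multi-axis version of the same law on measured wrench data.

```python
def admittance_step(x, v, f_ext, x_d, k=200.0, d=30.0, m=1.0, dt=0.001):
    """One step of a 1-D admittance (active compliance) law: virtual
    mass m, damping d, and stiffness k around the desired pose x_d,
    driven by the measured external force f_ext."""
    a = (f_ext - d * v - k * (x - x_d)) / m
    v = v + a * dt        # semi-implicit Euler integration
    x = x + v * dt
    return x, v

# Pressing against a rigid obstacle: under a constant 10 N contact
# force, the commanded position settles at an offset of f/k from the
# desired pose, yielding rather than jamming.
x, v = 0.0, 0.0
for _ in range(5000):
    x, v = admittance_step(x, v, f_ext=10.0, x_d=0.0)
```

Lowering the stiffness k makes the robot softer (larger deflection per newton), which is the tuning knob used to trade positioning accuracy against insertion robustness.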
Soft grippers and vision systems enable robots to handle delicate produce such as berries, tomatoes, and lettuce. Deformable object manipulation is particularly relevant for food processing tasks such as dough shaping, meat cutting, and packaging.
Despite remarkable progress, several fundamental challenges remain unsolved in robot manipulation.
Current manipulation systems typically excel at tasks they were specifically trained on but struggle to generalize to new objects, environments, and tasks. A robot trained to pick up mugs may fail when presented with a mug of unusual shape or material. Foundation models and VLAs are a promising direction, but achieving truly general-purpose manipulation comparable to human dexterity remains far off. The CVPR 2025 Workshop on Generalization in Robotics Manipulation specifically addresses this gap.
Training manipulation policies with learning-based methods requires large amounts of high-quality demonstration or interaction data. Unlike language and vision, where internet-scale datasets are readily available, robot data must be collected on physical hardware (or in simulation with a reality gap). Scaling data collection through teleoperation, simulation, and learning from human videos are active areas of research.
Many practical manipulation tasks involve long sequences of actions with complex dependencies. Making a sandwich, for example, requires retrieving ingredients, opening containers, spreading condiments, assembling layers, cutting, and plating. Current systems can handle short manipulation primitives but struggle with the planning, error recovery, and state estimation required for long-horizon tasks.
Manipulation in human environments requires safety guarantees. A robot handing a knife to a human, preparing food, or assisting with personal care must operate within strict force limits and have reliable failure detection. Formal safety verification, uncertainty quantification, and robust control under model uncertainty remain important open problems.
Many modern manipulation methods, particularly those based on diffusion models and large foundation models, require significant computational resources. Deploying these models on real robots with real-time latency constraints (often requiring control at 100 Hz or more) is an ongoing engineering challenge. On-device inference optimization and model distillation are active areas of work.
Effective manipulation requires integrating information from multiple sensory modalities: vision, depth, tactile sensing, proprioception, and force/torque sensing. While each modality has seen significant progress individually, developing principled frameworks for fusing these signals in real time remains challenging.