See also: Robotics, Robot learning, Embodied AI, Computer vision
Robot manipulation refers to the ability of a robotic system to physically interact with objects in its environment through grasping, pushing, pulling, inserting, placing, and other contact-based actions. It is one of the central problems in robotics and a prerequisite for deploying autonomous robots in unstructured real-world settings such as homes, hospitals, warehouses, and construction sites. While industrial robots have performed repetitive pick-and-place operations in structured factory environments since the 1960s, enabling robots to manipulate novel objects in cluttered, unpredictable environments remains an open research challenge.
The difficulty of robot manipulation stems from the interplay of perception, planning, and control under uncertainty. A robot must perceive the geometry and physical properties of objects using sensors such as cameras, depth sensors, and tactile arrays. It must plan a sequence of motions that achieve a desired goal while respecting kinematic constraints, collision avoidance, and contact dynamics. Finally, it must execute those motions with closed-loop control that adapts to disturbances, modeling errors, and object slip. Advances in deep learning, reinforcement learning, computer vision, and simulation have dramatically expanded the capabilities of manipulation systems in recent years, particularly with the emergence of foundation models and vision-language-action models in 2024 and 2025.
The history of robot manipulation can be traced through several distinct eras, each defined by the dominant technology and control paradigm.
In 1961, Unimate became the first industrial robot, developed by George Devol and Joseph Engelberger. Installed at the General Motors Ternstedt plant in Trenton, New Jersey, Unimate performed tasks such as transporting die-castings and welding parts onto car bodies. The robot could move along the X and Y axes, possessed a rotatable pincer-like gripper, and could follow a program of up to 200 stored movements. These early systems operated in highly structured environments with precisely known object positions, relying entirely on pre-programmed trajectories with no sensory feedback.
The development of force/torque sensors and early computer vision systems enabled closed-loop manipulation, where robots could adapt their behavior based on sensory input. The Stanford/JPL Hand, developed in the early 1980s, is considered the first dexterous robotic hand built to investigate fine manipulation. It featured three independently controlled fingers equipped with force feedback. The Utah/MIT Dextrous Hand, developed around the same period, pushed the boundaries of human-like articulation with 16 degrees of freedom driven by pneumatic actuators. These systems demonstrated that multi-fingered manipulation was mechanically feasible, though the computational tools to control them intelligently were still limited.
The convergence of deep learning, large-scale simulation, and affordable hardware sparked a revolution in manipulation research. Convolutional neural networks enabled robots to detect grasp poses directly from raw sensor data. Reinforcement learning allowed robots to discover manipulation strategies through trial and error. Large-scale simulation environments made it possible to train policies on millions of episodes before transferring them to physical hardware. In 2019, OpenAI demonstrated that a five-fingered Shadow Dexterous Hand could solve a Rubik's Cube using reinforcement learning policies trained entirely in simulation, marking a milestone in dexterous sim-to-real transfer.
The end effector is the device attached to the end of a robotic arm that physically interacts with objects. The choice of end effector fundamentally shapes what manipulation tasks a robot can perform. End effectors range from simple two-finger grippers to anthropomorphic hands with dozens of degrees of freedom.
| Gripper Type | Mechanism | Strengths | Limitations | Typical Applications |
|---|---|---|---|---|
| Parallel jaw | Two opposing flat jaws move in parallel to clamp objects | Simple, robust, high force | Limited to objects that fit between jaws; poor with irregular shapes | Assembly, machine tending, bin picking |
| Suction/vacuum | Vacuum pump creates suction through cups or channels | Handles flat, smooth surfaces well; fast cycle times | Requires smooth, non-porous surfaces; struggles with heavy or irregularly shaped objects | Packaging, palletizing, sheet handling |
| Soft gripper | Flexible elastomeric fingers conform to object shape | Gentle on fragile objects; adapts to irregular geometry | Lower gripping force; limited precision | Food handling, agricultural picking, medical devices |
| Multi-finger dexterous | Three or more actuated fingers with multiple joints | Versatile; can perform in-hand manipulation and tool use | Complex control; high cost; mechanical fragility | Research, dexterous tasks, humanoid robots |
| Electromagnetic | Magnetic field holds ferrous objects | Fast attach/detach; no surface damage | Only works with ferromagnetic materials | Metal sheet handling, automotive manufacturing |
| Needle/pin | Needles penetrate porous material to secure grip | Works on soft, porous materials | Damages the object surface; narrow material range | Textile and fabric handling |
Dexterous robotic hands aim to replicate the versatility of the human hand, which has approximately 27 degrees of freedom. Notable examples include the Stanford/JPL Hand, the Utah/MIT Dextrous Hand, and the Shadow Dexterous Hand.
Grasping is the most fundamental manipulation skill: the ability to securely hold an object so it can be lifted, transported, or repositioned. Research on grasping spans analytical methods rooted in classical mechanics and data-driven methods that leverage machine learning.
Grasp analysis draws on the mechanics of contact to evaluate whether a given set of contact points can securely restrain an object. Two foundational concepts are force closure, in which the contacts can resist any external wrench by exploiting friction, and form closure, in which the contact geometry alone immobilizes the object regardless of friction.
The quality of a grasp can be measured by metrics such as the largest minimum resisting wrench, the volume of the grasp wrench space, or task-specific criteria. Computing optimal grasps under these metrics is generally NP-hard, so practical algorithms rely on heuristics, sampling, or optimization.
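The largest-minimum-resisting-wrench metric (the Ferrari-Canny metric) can be sketched concretely for a planar object. The following toy code, a sketch rather than a production grasp analyzer, discretizes each friction cone into primitive unit wrenches and measures the radius of the largest origin-centered ball inside their convex hull, using SciPy's convex hull routine; a value of zero means the grasp is not force closure.

```python
import numpy as np
from scipy.spatial import ConvexHull

def friction_cone_wrenches(p, n, mu, num_edges=8):
    # Discretize the planar friction cone at contact point p (inward
    # unit normal n, friction coefficient mu) into unit forces, and map
    # each force to a wrench [fx, fy, tau] about the object origin.
    t = np.array([-n[1], n[0]])                    # surface tangent
    wrenches = []
    for a in np.linspace(-np.arctan(mu), np.arctan(mu), num_edges):
        f = np.cos(a) * n + np.sin(a) * t
        tau = p[0] * f[1] - p[1] * f[0]            # planar cross product
        wrenches.append([f[0], f[1], tau])
    return wrenches

def epsilon_quality(contacts, mu=0.5):
    # Ferrari-Canny style metric: radius of the largest origin-centered
    # ball contained in the convex hull of the primitive contact
    # wrenches. Zero means the grasp is not force closure.
    W = []
    for p, n in contacts:
        W.extend(friction_cone_wrenches(np.asarray(p, float),
                                        np.asarray(n, float), mu))
    hull = ConvexHull(np.array(W))
    # Facet equations satisfy a @ x + b <= 0 for interior points, with
    # unit normals a, so -b is the origin-to-facet distance when the
    # origin lies inside the hull.
    return max(0.0, (-hull.equations[:, -1]).min())
```

For two antipodal contacts on a unit disk the metric is positive (force closure); for two contacts on adjacent sides it drops to zero, matching the classical planar analysis.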
Analytical methods compute grasps by reasoning about object geometry, contact models, and physical constraints. Given a known 3D model of the object and the gripper kinematics, these methods search for contact point configurations that satisfy force or form closure conditions. The search is typically formulated as a constrained nonlinear optimization problem.
Contact models used in analytical grasp planning include:
| Contact Model | Description | Friction Considered |
|---|---|---|
| Frictionless point contact | Contact force is normal to the surface only | No |
| Frictional point contact (Coulomb) | Contact force lies within a friction cone defined by the coefficient of friction | Yes |
| Soft finger contact | Adds a torsional friction component around the contact normal | Yes |
Analytical methods offer interpretability and can provide formal guarantees on grasp stability. However, they require accurate 3D object models and material properties, which are often unavailable in unstructured environments. They also tend to be computationally expensive for complex geometries.
Data-driven approaches learn to predict successful grasps from sensor data, typically RGB images, depth images, or point clouds. These methods have shown remarkable adaptability in diverse scenarios by learning from large datasets, without requiring explicit 3D models of the objects.
Deep learning methods for grasp synthesis broadly follow four algorithmic strategies: grasp pose sampling (generating candidate grasps and scoring them), direct grasp pose regression (predicting grasp parameters end-to-end), reinforcement learning (learning grasping policies through trial and error), and exemplar-based methods (retrieving and adapting grasps from a database of known objects).
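The sample-and-score strategy can be illustrated with a toy antipodal sampler: draw random contact pairs from a 2-D point cloud with surface normals and keep those where the grasp axis lies inside both friction cones. This is a sketch; real systems score candidates with a learned network rather than the hand-written alignment heuristic used here.

```python
import numpy as np

def sample_antipodal_grasps(points, normals, mu=0.5, num_samples=200, rng=None):
    """Toy grasp-pose sampler: draw random surface point pairs and keep
    those satisfying the antipodal condition, i.e. the line between the
    contacts lies inside both friction cones."""
    rng = np.random.default_rng(rng)
    half_angle = np.arctan(mu)                 # friction cone half-angle
    grasps = []
    for _ in range(num_samples):
        i, j = rng.choice(len(points), size=2, replace=False)
        axis = points[j] - points[i]
        dist = np.linalg.norm(axis)
        if dist < 1e-9:
            continue
        axis = axis / dist
        # Angle between the grasp axis and each inward surface normal.
        ang_i = np.arccos(np.clip(np.dot(axis, normals[i]), -1.0, 1.0))
        ang_j = np.arccos(np.clip(np.dot(-axis, normals[j]), -1.0, 1.0))
        if ang_i < half_angle and ang_j < half_angle:
            # Heuristic score: prefer contacts well inside the cones.
            score = 1.0 - max(ang_i, ang_j) / half_angle
            grasps.append((int(i), int(j), score))
    return sorted(grasps, key=lambda g: -g[2])
```

On a circular object with inward normals, the accepted pairs are the nearly antipodal ones, mirroring how learned samplers concentrate probability mass on promising grasp regions.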
Dexterous grasping extends the problem to multi-fingered hands, where the robot must coordinate the motion of multiple fingers to form stable grasps on objects of diverse shapes. DexGraspNet, introduced by Wang et al. (2023), provides a large-scale dataset of 1.32 million simulated grasps for 5,355 objects across more than 133 categories, with over 200 diverse grasps per object. The follow-up DexGraspNet 2.0 (2024) expanded the benchmark to 1,319 objects, 8,270 cluttered scenes, and 427 million grasps, demonstrating zero-shot sim-to-real transfer with a 90.7% real-world dexterous grasping success rate in cluttered scenes using a diffusion model conditioned on local geometry.
In-hand manipulation refers to the ability to reposition, reorient, or reconfigure an object within the grasp of a robotic hand, without setting it down. This is one of the most challenging manipulation skills because it requires coordinated finger motions, precise contact control, and real-time adaptation to prevent the object from slipping or falling.
Humans perform in-hand manipulation effortlessly when rotating a pen between fingers, unscrewing a bottle cap, or adjusting the grip on a tool. For robots, replicating these behaviors requires high-dimensional control of multi-fingered hands (typically 12 to 24 degrees of freedom) under complex contact dynamics.
One of the most notable demonstrations of in-hand manipulation was OpenAI's system that solved a Rubik's Cube using a Shadow Dexterous Hand. The policy was trained entirely in simulation using reinforcement learning with a novel technique called Automatic Domain Randomization (ADR). ADR automatically generates training environments of ever-increasing difficulty by randomizing physical parameters such as friction, object mass, and actuator noise. The training consumed approximately 13,000 years of simulated experience. The resulting policy solved the cube 60% of the time (20% for maximally scrambled configurations) and exhibited robustness to real-world perturbations it was never trained on, such as wearing a rubber glove or having fingers tied together.
Recent work on in-hand manipulation has focused on combining reinforcement learning with tactile feedback and learning from human demonstrations. Tactile sensors such as DIGIT and GelSight provide rich contact information that enables finer control during manipulation. The ManiSkill-ViTac challenge (2025) specifically benchmarks manipulation skill learning that combines vision and tactile sensing, driving research on multimodal perception for dexterous tasks.
Robot manipulation research employs a spectrum of methods ranging from classical model-based techniques to modern learning-based approaches, as well as hybrid systems that combine the strengths of both.
Model-based manipulation relies on explicit mathematical models of the robot, the objects, and the physics of contact. These approaches plan manipulation actions by simulating the effects of candidate motions and selecting those that achieve the desired outcome.
Motion planning is a core component of model-based manipulation. The robot must find a collision-free path from its current configuration to a goal configuration that places the end effector in the desired grasp or placement pose. Sampling-based planners such as Rapidly-exploring Random Trees (RRT) and Probabilistic Roadmaps (PRM) are widely used because they can handle high-dimensional configuration spaces without requiring an explicit representation of the obstacle boundaries. For manipulation specifically, RRT-Connect performs well in single-arm and dual-arm tasks due to its bidirectional search strategy.
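A minimal RRT can be written in a few dozen lines. The sketch below, for a low-DOF configuration space under an assumed `collision_free(q)` predicate, shows the core loop shared by sampling-based planners: sample a configuration (with goal biasing), extend the nearest tree node toward it, and keep the new node if the connecting edge is collision-free.

```python
import numpy as np

def rrt(start, goal, collision_free, bounds, step=0.3, iters=2000,
        goal_bias=0.1, rng=None):
    """Minimal RRT sketch. collision_free(q) -> bool tests one
    configuration; edges are validated by dense interpolation."""
    rng = np.random.default_rng(rng)
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    nodes, parents = [start], [-1]

    def edge_free(a, b):
        return all(collision_free(a + t * (b - a))
                   for t in np.linspace(0.0, 1.0, 10))

    for _ in range(iters):
        # Goal biasing: occasionally grow the tree toward the goal.
        q_rand = goal if rng.random() < goal_bias else rng.uniform(lo, hi)
        i_near = min(range(len(nodes)),
                     key=lambda i: np.linalg.norm(nodes[i] - q_rand))
        d = q_rand - nodes[i_near]
        q_new = nodes[i_near] + d * min(1.0, step / (np.linalg.norm(d) + 1e-12))
        if collision_free(q_new) and edge_free(nodes[i_near], q_new):
            nodes.append(q_new)
            parents.append(i_near)
            if np.linalg.norm(q_new - goal) < step:
                path, i = [], len(nodes) - 1
                while i != -1:          # walk parent links back to start
                    path.append(nodes[i])
                    i = parents[i]
                return path[::-1]
    return None
```

The returned path ends within one step of the goal; RRT-Connect improves on this single-tree version by growing a second tree from the goal and attempting to join the two.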
Trajectory optimization computes smooth, dynamically feasible trajectories by solving a constrained optimization problem that minimizes a cost function (such as time, energy, or jerk) subject to joint limits, collision constraints, and task-specific requirements. In contact-rich manipulation, trajectory optimization must also account for the switching dynamics of making and breaking contact.
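A stripped-down instance of this idea is minimizing summed squared accelerations over a discrete trajectory with some timesteps pinned (start, goal, via points). Because the objective is quadratic and the constraints are equality pins, the problem reduces to a linear least-squares solve; this sketch handles one joint, whereas real trajectory optimizers add collision and dynamics constraints.

```python
import numpy as np

def min_accel_trajectory(q_fixed, T):
    """Toy trajectory optimization: find q_0..q_T minimizing summed
    squared accelerations, with timesteps in q_fixed pinned to given
    values (e.g. {0: start, T: goal})."""
    n = T + 1
    # Second-difference operator: (D q)_t = q_{t+1} - 2 q_t + q_{t-1}
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]
    free = [t for t in range(n) if t not in q_fixed]
    fixed = sorted(q_fixed)
    q = np.zeros(n)
    q[fixed] = [q_fixed[t] for t in fixed]
    # Least squares over the free variables: minimize ||D q||^2 given
    # the pinned entries, i.e. solve D_free x = -D_fixed q_fixed.
    sol, *_ = np.linalg.lstsq(D[:, free], -D[:, fixed] @ q[fixed], rcond=None)
    q[free] = sol
    return q
```

With only the endpoints pinned, the minimum-acceleration trajectory is a straight line in configuration space; adding a via point bends it into a smooth spline-like curve.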
Task and motion planning (TAMP) integrates symbolic task planning (deciding what actions to take and in what order) with geometric motion planning (deciding how to execute each action). TAMP is essential for long-horizon manipulation tasks that involve multiple objects and sequential dependencies, such as setting a dinner table or assembling furniture.
Learning-based methods acquire manipulation skills from data, either through supervised learning from demonstrations, reinforcement learning from trial and error, or self-supervised learning from autonomous exploration.
Imitation learning trains a robot policy to reproduce expert behavior from demonstrations. Human operators provide demonstrations through kinesthetic teaching (physically guiding the robot arm), teleoperation, or even video recordings of humans performing the task. The policy learns a mapping from observations (images, joint positions, force readings) to actions.
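At its simplest, this mapping can be fit by supervised regression. The sketch below implements behavior cloning with a linear policy via ridge regression; real systems replace the linear map with a deep network over images, but the supervised objective is the same.

```python
import numpy as np

def behavior_cloning_linear(obs, actions, reg=1e-6):
    """Minimal imitation learning sketch: fit a linear policy
    a = W @ [obs, 1] to demonstration (observation, action) pairs by
    ridge-regularized least squares."""
    X = np.hstack([obs, np.ones((len(obs), 1))])   # append bias feature
    # Closed-form ridge solution: W = (X'X + reg I)^-1 X' A
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ actions)
    return lambda o: np.hstack([o, 1.0]) @ W       # the learned policy
```

Given demonstrations generated by an expert acting as a = 2o + 0.5, the cloned policy recovers that mapping and generalizes to unseen observations within the training distribution; the well-known failure mode, compounding errors once the robot drifts off the demonstrated states, motivates the more elaborate frameworks below.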
Among imitation learning frameworks, Diffusion Policy has been especially influential. Introduced by Chi et al. (2023), it represents robot visuomotor policies as conditional denoising diffusion processes. Rather than predicting a single action, the policy generates action trajectories by iteratively denoising a sample from Gaussian noise, conditioned on the current observation. This formulation handles multimodal action distributions gracefully (since demonstrations of the same task can involve different strategies), scales to high-dimensional action spaces, and trains stably. Diffusion Policy consistently outperformed prior state-of-the-art robot learning methods, with an average improvement of 46.9% across benchmark tasks.
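The reverse-diffusion loop at the heart of this approach can be sketched in NumPy. The `oracle` below is a stand-in for the trained noise-prediction network: it is the analytically optimal denoiser when all demonstrations place the action at one fixed point, so the sampler should recover exactly that action. Everything here is a didactic toy, not the Diffusion Policy implementation.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=None):
    """DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise into an action sample. eps_model(x, k) predicts
    the noise present at diffusion step k."""
    rng = np.random.default_rng(rng)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for k in reversed(range(len(betas))):
        eps = eps_model(x, k)
        # Posterior mean of x_{k-1} given x_k and the predicted noise.
        x = (x - betas[k] / np.sqrt(1.0 - abar[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            x = x + np.sqrt(betas[k]) * rng.standard_normal(shape)
    return x

# Oracle denoiser for a point-mass "demonstration distribution" at a
# fixed target action; a trained network replaces this in practice.
betas = np.linspace(1e-4, 0.02, 50)
abar = np.cumprod(1.0 - betas)
target = np.array([0.7, -0.2])
oracle = lambda x, k: (x - np.sqrt(abar[k]) * target) / np.sqrt(1.0 - abar[k])
action = ddpm_sample(oracle, target.shape, betas, rng=0)
```

With a multimodal demonstration distribution, the same loop (with a learned denoiser) samples from all modes rather than averaging them, which is the property that makes diffusion attractive for policies.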
Reinforcement learning allows robots to discover manipulation strategies through interaction with the environment, guided by a reward signal. The robot receives a reward for successful task completion (or partial rewards for progress) and learns a policy that maximizes expected cumulative reward.
RL has been successfully applied to contact-rich manipulation tasks that are difficult to demonstrate or model. Key challenges include sample efficiency (physical robot interactions are slow and expensive), reward design (specifying a reward that captures the desired behavior without unintended shortcuts), and sim-to-real transfer (policies trained in simulation often fail on real hardware due to the "reality gap").
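The underlying trial-and-error loop is easiest to see in a tabular toy. The sketch below applies Q-learning to a hypothetical 1-D pushing task: an object sits in one of `n` cells, actions push it left or right, and the episode ends with reward 1 when it reaches the goal cell. Real manipulation RL replaces the table with a neural network and the toy dynamics with a physics simulator.

```python
import numpy as np

def q_learning_push(n=10, goal=9, episodes=1000, alpha=0.5, gamma=0.95,
                    eps=0.2, rng=0):
    """Tabular Q-learning on a toy 1-D pushing task."""
    rng = np.random.default_rng(rng)
    Q = np.zeros((n, 2))                  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = int(rng.integers(0, n))       # random initial object cell
        for _ in range(4 * n):            # step budget per episode
            # Epsilon-greedy exploration.
            a = int(rng.integers(0, 2)) if rng.random() < eps else int(Q[s].argmax())
            s2 = max(0, min(n - 1, s + (1 if a == 1 else -1)))
            r = 1.0 if s2 == goal else 0.0
            # TD target; the goal state is terminal (no bootstrap).
            target = r if s2 == goal else gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if s == goal:
                break
    return Q
```

After training, the greedy policy pushes right from every cell left of the goal, and the learned values decay geometrically with distance to the goal, reflecting the discount factor.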
The most significant trend in robot manipulation research from 2023 to 2025 has been the emergence of foundation models and vision-language-action models (VLAs). These models combine the semantic understanding of large language models and vision-language models with robotic action generation, enabling robots to follow natural language instructions, generalize to novel objects, and perform multi-step reasoning.
| Model | Organization | Year | Parameters | Key Features |
|---|---|---|---|---|
| RT-1 | Google | 2022 | 35M | Transformer-based; trained on 130k real robot episodes |
| RT-2 | Google DeepMind | 2023 | 12B/55B | VLA model; represents actions as text tokens; 3x generalization improvement over RT-1 |
| Octo | UC Berkeley | 2024 | 27M/93M | Lightweight open-source generalist robot policy |
| OpenVLA | Stanford | 2024 | 7B | Open-source VLA trained on Open X-Embodiment dataset from 22 embodiments |
| pi0 | Physical Intelligence | 2024 | 3B | Flow-matching VLA; trained on 10,000+ hours of robot data across 7 platforms and 68 tasks |
| Helix | Figure AI | 2025 | Not disclosed | VLA for humanoid robots; trained on 500 hours of teleoperation data |
| GR00T N1 | NVIDIA | 2025 | Not disclosed | Dual-system VLA for humanoid robots; combines fast diffusion policy with LLM-based planner |
| Gemini Robotics | Google DeepMind | 2025 | Not disclosed | Extension of Gemini 2.0 to physical manipulation; demonstrated origami folding and card manipulation |
| SmolVLA | Hugging Face | 2025 | 450M | Compact open-source VLA with performance comparable to much larger models |
RT-2, introduced by Google DeepMind in July 2023, was a landmark VLA that represented robot actions as text tokens, enabling the model to transfer semantic knowledge from web-scale language and image data to robotic control. It demonstrated capabilities such as following the instruction "pick up the bag about to fall off the table" or "move banana to the sum of two plus one," which required both visual understanding and mathematical reasoning that the model acquired from its web pretraining.
pi0, released by Physical Intelligence in October 2024 and open-sourced in February 2025, demonstrated a new level of dexterous generalist manipulation. Trained on over 10,000 hours of robot data from seven different platforms, pi0 could perform tasks including laundry folding, table clearing, dish loading, egg carton stacking, box assembly, and grocery bagging. The model uses flow matching (a variant of diffusion) to generate high-frequency continuous actions at up to 50 Hz.
While vision provides information about object shape, pose, and appearance, tactile sensing provides direct information about contact: forces, pressures, slip, texture, and local geometry at the point of contact. Tactile feedback is essential for tasks that require precise force control, such as handling fragile objects, inserting tight-fitting parts, or detecting incipient slip during grasping.
A major advance in tactile sensing has been the development of optical (vision-based) tactile sensors that use cameras embedded behind a deformable elastomer to capture high-resolution images of contact geometry. These sensors offer much higher spatial resolution and richer data than traditional force/torque sensors.
| Sensor | Developer | Approximate Cost | Key Features |
|---|---|---|---|
| GelSight | MIT / GelSight Inc. | $500 (GelSight Mini) | High-resolution 3D surface reconstruction; micron-level detail |
| DIGIT | Meta AI / GelSight | $350 | Compact fingertip form factor; widely used in research |
| Digit 360 | Meta AI / GelSight (2024) | Not disclosed | 18+ sensing modalities; detects forces as small as 1 millinewton; human-level tactile precision |
| BioTac | SynTouch (discontinued) | $5,000-$10,000 | Multimodal (force, vibration, temperature); bio-inspired design |
| TACTO | Meta AI | Open source (sim) | Optical tactile sensor simulator for training in simulation |
The availability of low-cost, high-resolution tactile sensors such as DIGIT has been described as approaching an "ImageNet moment" for touch, enabling large-scale data collection and the development of tactile foundation models. The ManiSkill-ViTac 2025 challenge specifically benchmarks manipulation tasks that require the integration of vision and tactile sensing.
Sim-to-real transfer is the process of training robot manipulation policies in simulation and deploying them on physical hardware. Simulation offers virtually unlimited data, safe exploration, and the ability to parallelize training across thousands of environments. However, the "reality gap" between simulation and the real world (caused by differences in physics, rendering, sensor noise, and actuation dynamics) means that policies trained purely in simulation often fail when deployed on real robots.
Domain randomization addresses the reality gap by randomizing simulation parameters (friction coefficients, object masses, lighting conditions, camera positions, actuator noise) during training, so the policy learns to be robust to a wide range of conditions. The real world then becomes just one more sample from the distribution of training environments. OpenAI's Automatic Domain Randomization (ADR), used for the Rubik's Cube demonstration, automatically increases the randomization range as the policy improves.
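A simplified version of this curriculum can be sketched in a few lines. In the code below, each physics parameter is sampled from an interval that widens when the policy performs well and shrinks back toward its nominal value when it struggles; the parameter names and thresholds are illustrative, and OpenAI's actual ADR adjusts one interval boundary at a time based on performance measured at that boundary.

```python
import numpy as np

def adr_step(ranges, perf, lo_thresh=0.5, hi_thresh=0.9, delta=0.05):
    """One curriculum update in a simplified ADR scheme. `ranges` maps
    parameter name -> (lo, hi, nominal); `perf` is the policy's success
    rate on the current randomization distribution."""
    new = {}
    for name, (lo, hi, nominal) in ranges.items():
        if perf > hi_thresh:        # mastered this range: widen it
            lo, hi = lo - delta, hi + delta
        elif perf < lo_thresh:      # too hard: shrink toward nominal
            lo = min(nominal, lo + delta)
            hi = max(nominal, hi - delta)
        new[name] = (lo, hi, nominal)
    return new

def sample_env(ranges, rng=None):
    """Draw one randomized simulation configuration."""
    rng = np.random.default_rng(rng)
    return {k: rng.uniform(lo, hi) for k, (lo, hi, _) in ranges.items()}
```

Training alternates between sampling environments from the current ranges, estimating policy performance, and calling the update, so the difficulty of the simulated worlds tracks the policy's competence.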
Domain adaptation methods explicitly learn to align the distributions of simulated and real observations. This can be achieved through image-to-image translation (transforming simulated images to look realistic), feature-level alignment, or learning domain-invariant representations.
The real-to-sim-to-real paradigm first constructs a digital twin of the real environment by scanning it with sensors, then trains policies in this faithful simulation, and finally deploys them back to the real world. MIT CSAIL's RialTo system (2024) demonstrated this approach, enabling users to capture digital twins on the fly and achieving a 67% improvement over imitation learning with the same number of demonstrations.
NVIDIA's AutoMate system (RSS 2024) trained robotic assembly skills using reinforcement learning and imitation learning on a dataset of 100 assembly tasks, achieving an 84.5% mean success rate in real-world deployment across 20 assemblies. DexGraspNet 2.0 demonstrated zero-shot sim-to-real transfer for dexterous grasping with a 90.7% success rate in cluttered scenes.
Simulation platforms are critical infrastructure for manipulation research, enabling training, evaluation, and reproducible comparison of methods.
| Platform | Physics Engine | GPU Parallelization | Key Features |
|---|---|---|---|
| MuJoCo | Custom | Limited | Fast, accurate contact simulation; widely used in RL research; open-sourced by DeepMind in 2022 |
| Isaac Lab (NVIDIA) | PhysX 5 | Yes | High-fidelity rendering; tight integration with NVIDIA GPU ecosystem |
| SAPIEN / ManiSkill | PhysX | Yes | Focus on manipulation; ManiSkill3 achieves 30,000+ FPS with GPU parallelization |
| PyBullet | Bullet | No | Open source; accessible; common in academic research |
| RoboCasa | MuJoCo | CPU only | Kitchen and household manipulation scenarios |
| Genesis | Custom | Yes | Recent platform with differentiable physics |
| RoboVerse (2025) | Multi-engine | Yes | Unified interface to 8+ physics engines (Isaac, MuJoCo, SAPIEN, PyBullet, and others) |
ManiSkill3, powered by SAPIEN, is notable for achieving state-of-the-art GPU-parallelized simulation and rendering performance. Its simulation plus rendering speed reaches over 30,000 frames per second with 2 to 4 times better GPU memory efficiency than comparable platforms, making it practical to run large-scale reinforcement learning experiments for manipulation.
RoboVerse, accepted to RSS 2025, provides a standardized interface across eight or more physics engines, enabling researchers to train and evaluate manipulation policies across different simulators without rewriting their code.
| Dataset / Benchmark | Focus | Scale |
|---|---|---|
| GraspNet-1Billion | 6-DOF parallel jaw grasping | 190 scenes; ~1.1 billion grasp annotations |
| DexGraspNet | Dexterous multi-finger grasping | 1.32 million grasps; 5,355 objects |
| DexGraspNet 2.0 | Dexterous grasping in clutter | 427 million grasps; 8,270 scenes |
| Open X-Embodiment | Cross-embodiment manipulation | 1 million+ episodes from 22 robot embodiments |
| ARMBench (Amazon) | Industrial pick and place | 190,000+ objects in industrial settings |
| RLBench | Multi-task manipulation | 100 manipulation tasks with language descriptions |
| CALVIN | Language-conditioned manipulation | Long-horizon tasks in simulated tabletop |
Most manipulation research focuses on rigid objects, where geometry and physics are relatively straightforward to model. Deformable object manipulation (DOM) addresses objects that change shape when forces are applied, such as cloth, rope, food, biological tissue, and flexible packaging. DOM is considered one of the primary bottlenecks for real-world deployment of autonomous robots, particularly in domains such as manufacturing, food processing, surgical robotics, and domestic assistance.
| Category | Examples | Key Challenges |
|---|---|---|
| 1D (linear) | Rope, cable, wire, thread | High-dimensional configuration space; nonlinear dynamics; self-occlusion |
| 2D (planar) | Cloth, fabric, paper, sheet metal | Very high-dimensional state; self-collision; difficult to perceive folds |
| 3D (volumetric) | Dough, sponge, soft tissue, fruit | Complex material properties; irreversible deformation; difficulty in sensing internal state |
| Fragile | Glass, eggs, soft fruit, tissues | Safety-critical force limits; real-time adaptive control required |
Accurate models of deformable objects are often unavailable due to their strong nonlinearity and diversity of material properties. Recent research combines physics-based simulation (finite element methods, position-based dynamics) with learned models to handle the complexity. For fragile objects, force/torque sensing provides vital feedback, enabling robots to limit exerted forces below damage thresholds.
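Position-based dynamics, one of the simplified models mentioned above, can be sketched for a 1-D deformable object in a few lines: a rope is modeled as particles joined by distance constraints, with the first particle pinned. The parameters below (particle count, damping, iteration counts) are illustrative rather than tuned.

```python
import numpy as np

def simulate_rope_pbd(n=20, rest_len=0.05, steps=400, iters=20,
                      dt=1.0 / 60.0, gravity=-9.8, damping=0.95):
    """Position-based dynamics sketch of a pinned rope. Starting
    horizontal, the rope swings under gravity and settles hanging."""
    x = np.zeros((n, 2))
    x[:, 0] = np.arange(n) * rest_len      # initial horizontal layout
    v = np.zeros_like(x)
    for _ in range(steps):
        v[:, 1] += gravity * dt            # external forces
        p = x + v * dt                     # predicted positions
        p[0] = x[0]                        # pin the first particle
        for _ in range(iters):             # Gauss-Seidel constraint solve
            for i in range(n - 1):
                d = p[i + 1] - p[i]
                dist = np.linalg.norm(d)
                corr = (dist - rest_len) * d / (dist + 1e-12)
                if i == 0:                 # pinned end takes no correction
                    p[i + 1] -= corr
                else:
                    p[i] += 0.5 * corr
                    p[i + 1] -= 0.5 * corr
        v = (p - x) / dt * damping         # velocity update with damping
        x = p.copy()
    return x
```

The appeal of PBD for manipulation research is that constraints (inextensibility, attachment to a gripper) are enforced directly on positions, which keeps the simulation stable at interactive rates even for stiff objects.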
Contact-rich manipulation encompasses tasks where sustained, purposeful contact between the robot and objects is essential for success. Unlike pick-and-place operations (where contact is limited to grasping and releasing), contact-rich tasks involve sliding, pivoting, pushing, screwing, insertion, and other interactions that require reasoning about contact forces and friction throughout execution.
Examples include peg-in-hole insertion, furniture assembly, opening doors and drawers, wiping surfaces, and tool use. Classical motion planners such as RRT are designed to find collision-free paths, which makes them ill-suited for tasks where contact is the goal rather than an obstacle. Specialized planners and controllers that reason about contact modes, friction cones, and force constraints are required.
Recent work on contact-rich whole-body manipulation uses example-guided reinforcement learning to generate robust skills for manipulating large and unwieldy objects (such as moving furniture), where the robot must use its arms, torso, and sometimes legs to maintain contact and control.
High-quality demonstration data is a bottleneck for learning-based manipulation. Teleoperation systems enable human operators to control robots remotely, generating the demonstration datasets needed to train imitation learning policies.
The Universal Manipulation Interface (UMI), presented at RSS 2024, enables data collection using hand-held grippers that can be used in the wild (outside of lab settings), providing a portable, low-cost, and information-rich approach for capturing bimanual and dynamic manipulation demonstrations that can be directly transferred to robot policies.
UniBiDex (2025) provides a unified teleoperation framework supporting both VR and leader-follower inputs through a shared kinematic and safety-aware control module, enabling precise real-time bimanual dexterous manipulation across diverse devices and tasks.
The development of humanoid robots has brought manipulation research into a new context, where robots must coordinate whole-body motion with dexterous hand control. Several companies are developing humanoid platforms with manipulation capabilities.
Tesla's Optimus (Tesla Bot) has been in active development, with the Generation 2 hands featuring 11 degrees of freedom in 2023 and the Generation 3 hands expanding to 22 degrees of freedom in 2024. Tesla commenced mass production of Optimus Gen 3 at its Fremont factory in January 2026.
Figure AI's Figure 01, standing 168 cm tall and weighing 60 kg with 19+ degrees of freedom, integrates OpenAI-powered conversational AI and has been deployed in BMW factory pilot programs. The company's Helix VLA model (2025) was trained on approximately 500 hours of teleoperation data specifically for humanoid manipulation tasks.
NVIDIA's GR00T N1 VLA (2025) uses a dual-system architecture combining a fast diffusion policy (10 ms latency for real-time reactive control) with an LLM-based planner for strategic reasoning, targeting humanoid manipulation applications.
Robot manipulation has the most mature deployment in industrial settings, where structured environments and repetitive tasks reduce the difficulty of the perception and planning problems.
The Amazon Picking Challenge (later Amazon Robotics Challenge), launched in 2015, catalyzed research on robotic picking in warehouse settings. The challenge required teams to build systems that could pick diverse items from shelves and stow them into containers. Amazon has since scaled its internal robotics deployment, reaching one million robots across more than 300 facilities worldwide by June 2025. The company introduced DeepFleet, a generative AI foundation model for coordinating robot fleet movement, improving travel time by 10%.
Approximately 50% of labor costs in manual e-commerce warehouses are spent on picking. The global robotic picking market is predicted to see revenues grow tenfold between 2023 and 2030. Vision-guided robotic systems using 2D and 3D cameras to detect object location, orientation, and size have become the standard for modern pick-and-place operations.
Robots perform assembly tasks including screw driving, snap fitting, insertion, welding, and adhesive application. Contact-rich assembly tasks such as peg-in-hole insertion and gear meshing require force-sensitive control and often benefit from compliance (the ability of the robot to yield slightly under contact forces) to avoid jamming or damage.
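Compliance is commonly realized in software as an admittance law: the commanded motion behaves like a mass-spring-damper anchored at the desired pose, so the robot yields under contact force instead of fighting it. The 1-D sketch below uses illustrative gains; a real insertion controller runs a multi-axis version of the same law on measured wrench data.

```python
def admittance_step(x, v, f_ext, x_d, k=200.0, d=30.0, m=1.0, dt=0.001):
    """One step of a 1-D admittance (active compliance) law: virtual
    mass m, damping d, and stiffness k around the desired pose x_d,
    driven by the measured external force f_ext."""
    a = (f_ext - d * v - k * (x - x_d)) / m
    v = v + a * dt        # semi-implicit Euler integration
    x = x + v * dt
    return x, v

# Pressing against a rigid obstacle: under a constant 10 N contact
# force, the commanded position settles at an offset of f/k from the
# desired pose, yielding rather than jamming.
x, v = 0.0, 0.0
for _ in range(5000):
    x, v = admittance_step(x, v, f_ext=10.0, x_d=0.0)
```

Lowering the stiffness k makes the robot softer (larger deflection per newton), which is the tuning knob used to trade positioning accuracy against insertion robustness.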
Soft grippers and vision systems enable robots to handle delicate produce such as berries, tomatoes, and lettuce. Deformable object manipulation is particularly relevant for food processing tasks such as dough shaping, meat cutting, and packaging.
Despite remarkable progress, several fundamental challenges remain unsolved in robot manipulation.
Current manipulation systems typically excel at tasks they were specifically trained on but struggle to generalize to new objects, environments, and tasks. A robot trained to pick up mugs may fail when presented with a mug of unusual shape or material. Foundation models and VLAs are a promising direction, but achieving truly general-purpose manipulation comparable to human dexterity remains far off. The CVPR 2025 Workshop on Generalization in Robotics Manipulation specifically addresses this gap.
Training manipulation policies with learning-based methods requires large amounts of high-quality demonstration or interaction data. Unlike language and vision, where internet-scale datasets are readily available, robot data must be collected on physical hardware (or in simulation with a reality gap). Scaling data collection through teleoperation, simulation, and learning from human videos are active areas of research.
Many practical manipulation tasks involve long sequences of actions with complex dependencies. Making a sandwich, for example, requires retrieving ingredients, opening containers, spreading condiments, assembling layers, cutting, and plating. Current systems can handle short manipulation primitives but struggle with the planning, error recovery, and state estimation required for long-horizon tasks.
Manipulation in human environments requires safety guarantees. A robot handing a knife to a human, preparing food, or assisting with personal care must operate within strict force limits and have reliable failure detection. Formal safety verification, uncertainty quantification, and robust control under model uncertainty remain important open problems.
Many modern manipulation methods, particularly those based on diffusion models and large foundation models, require significant computational resources. Deploying these models on real robots with real-time latency constraints (often requiring control at 100 Hz or more) is an ongoing engineering challenge. On-device inference optimization and model distillation are active areas of work.
Effective manipulation requires integrating information from multiple sensory modalities: vision, depth, tactile sensing, proprioception, and force/torque sensing. While each modality has seen significant progress individually, developing principled frameworks for fusing these signals in real time remains challenging.