Gemini Robotics is a family of robotics foundation models developed by Google DeepMind that extends the Gemini multimodal model line into the physical world. The first two models in the family, Gemini Robotics and Gemini Robotics-ER (short for Embodied Reasoning), were announced on March 12, 2025, and were both built on the Gemini 2.0 base model.[1][2] Subsequent releases added an on-device variant in June 2025, the Gemini Robotics 1.5 generation in September 2025, and a refreshed reasoning model, Gemini Robotics-ER 1.6, in April 2026.[3][4][5][6]
The line is positioned as a general-purpose vision-language-action (VLA) and embodied reasoning stack that lets robots perceive, plan, and act across different tasks and embodiments. Google DeepMind describes three properties as central to the design: generality across novel situations, interactivity through ordinary language, and dexterity for fine manipulation.[1][2] The models are intended to be used together, with the embodied reasoning model acting as a high-level planner that calls the action model and other tools, including Google Search and user-defined functions.[4][7]
Apptronik is the lead humanoid hardware partner for the program, and Google DeepMind has run a trusted tester program that included Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools at launch and grew to more than sixty organizations by late 2025.[2][8][9] MIT Technology Review described the launch as one of the first major applications of generative AI to advanced robotics, while IEEE Spectrum called the announcement a step toward foundation models for embodied agents.[10][11]
Robotics research has long struggled to produce policies that generalize beyond the narrow conditions they were trained on. Earlier efforts at Google, notably the RT-1 (2022) and RT-2 (2023) projects, trained transformer policies on large sets of real demonstrations and, in RT-2's case, layered action prediction on top of vision-language models, showing that internet-scale pretraining could improve a robot's response to novel objects and instructions. Outside Google, the OpenVLA model from Stanford and its partners and the Pi-0 model from Physical Intelligence extended the same recipe with different choices about action representation and embodiment coverage.[12]
Gemini Robotics inherits this approach but starts from a much larger and more recent base. Carolina Parada, who leads robotics at Google DeepMind, described the team's strategy as broad task learning instead of single-task specialization, with the bet that generalization would emerge once the model had enough exposure to varied tasks.[13] The models are built on top of Gemini 2.0, which already encodes wide visual and linguistic context, and they add physical actions as an additional output modality alongside text.[1][14]
The project was designed in collaboration with hardware partners from the start. Apptronik, maker of the Apollo humanoid robot, is the lead humanoid partner, and Google DeepMind opened a trusted tester program for Gemini Robotics-ER on the same day as the public announcement.[2][15]
The table below summarizes the major Gemini Robotics releases through April 2026.
| Date | Release | Notes |
|---|---|---|
| March 12, 2025 | Gemini Robotics and Gemini Robotics-ER announced | Built on Gemini 2.0; both models gated to trusted testers at launch[1][2] |
| March 25, 2025 | Technical report posted to arXiv | Paper number 2503.20020, titled "Gemini Robotics: Bringing AI into the Physical World"[16] |
| June 24, 2025 | Gemini Robotics On-Device released | First on-robot VLA in the family; first VLA from Google DeepMind opened to fine-tuning[3][17] |
| September 25, 2025 | Gemini Robotics 1.5 and Gemini Robotics-ER 1.5 announced | Two-model agentic stack; ER 1.5 made available in preview through the Gemini API and Google AI Studio[4][18] |
| April 14, 2026 | Gemini Robotics-ER 1.6 released | Improved spatial reasoning and instrument reading; deployed on Boston Dynamics Spot for industrial inspection[5][6][19] |
The Gemini Robotics product page groups the models into three roles: a vision-language-action model that turns visual input and instructions into motor commands, a reasoning model that plans, and an on-device variant that runs locally on the robot.[20] The same role structure has been preserved across model generations, with the version numbers (1.5, 1.6) tracking improvements within each role rather than denoting separate product lines.
| Model | Type | First released | Role |
|---|---|---|---|
| Gemini Robotics | Vision-language-action | March 2025 | Cloud-served VLA that issues motor commands |
| Gemini Robotics-ER | Vision-language model | March 2025 | Embodied reasoning, perception, and planning layer |
| Gemini Robotics On-Device | Vision-language-action | June 2025 | Local VLA optimized for on-robot inference |
| Gemini Robotics 1.5 | Vision-language-action | September 2025 | Reasoning-augmented VLA that explains its actions before executing |
| Gemini Robotics-ER 1.5 | Vision-language model | September 2025 | High-level planner with tool calling and adjustable thinking budget |
| Gemini Robotics-ER 1.6 | Vision-language model | April 2026 | Refresh focused on spatial reasoning, multi-view understanding, and instrument reading |
The embodied reasoning models are vision-language models that emit structured outputs such as points, object detections, success judgments, and code rather than continuous joint commands. The vision-language-action models translate pixels and instructions directly into motor commands. In Google's deployment guidance, the two models are used together: the reasoning model produces a plan and calls the action model (or other tools) to execute each step.[4][7]
Gemini Robotics models inherit the transformer architecture and the multimodal pretraining data of Gemini 2.0. They are then fine-tuned on robot-specific data, including teleoperated demonstrations on real robots and synthetic trajectories generated in simulation. The technical report describes the resulting system as a generalist VLA that can perform object detection, pointing, trajectory and grasp prediction, multi-view correspondence, and 3D bounding box prediction without task-specific heads.[16]
From the original announcement onward, Google DeepMind has framed the family as a two-model system. Gemini Robotics-ER handles perception and decision-making: it recognizes elements in the scene, estimates their size and location, predicts grasp points and trajectories, and emits code to execute the action. Gemini Robotics handles execution: it converts the visual context plus an instruction into the joint-level commands that drive the robot.[2][13] The 1.5 generation made this split explicit by giving Gemini Robotics-ER 1.5 native tool-calling abilities so it can call Gemini Robotics 1.5 (or any other VLA) the same way a language agent calls a function.[4][7]
According to the technical report, the Gemini Robotics VLA is split between a vision-language backbone hosted in the cloud and a small action decoder running on the robot's onboard computer. The team reduced the backbone's query latency from seconds to under 160 ms and reported an end-to-end latency of about 250 ms from raw camera observation to a chunk of low-level joint commands, supporting an effective control frequency of 50 Hz.[16] The split lets the system use the full Gemini 2.0 weights for reasoning while still issuing high-frequency motor commands locally, a design that contrasts both with fully on-device VLAs such as Gemini Robotics On-Device and with cloud-only approaches that lack a dedicated on-robot decoder.
The model emits actions in chunks rather than one step at a time. Action chunking lets the policy plan several timesteps of motion in advance, which Google DeepMind argues helps it produce smoother trajectories and tolerate the network round trip to the cloud backbone.[16]
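These numbers imply a simple latency budget: at 50 Hz each action step spans 20 ms, so a roughly 250 ms round trip to the cloud backbone covers about 13 steps of motion, and each chunk must be long enough to keep the robot busy while the next chunk is requested. The sketch below illustrates that overlap under stated assumptions; the `request_chunk` and command interfaces are hypothetical placeholders, not part of any published Gemini Robotics SDK.

```python
import asyncio

CONTROL_HZ = 50                # effective control frequency reported in the technical report
STEP_S = 1.0 / CONTROL_HZ      # 20 ms per low-level action step
BACKBONE_LATENCY_S = 0.25      # ~250 ms observation-to-chunk latency

async def request_chunk(observation):
    """Hypothetical stand-in for the cloud backbone plus local action decoder."""
    await asyncio.sleep(BACKBONE_LATENCY_S)
    # A chunk must span at least ceil(0.25 / 0.02) = 13 steps to hide the round trip.
    return [f"joint_cmd_{i}" for i in range(16)]

async def control_loop(get_observation, send_command, num_chunks=5):
    """Execute the current action chunk at 50 Hz while prefetching the next one."""
    pending = asyncio.create_task(request_chunk(get_observation()))
    for _ in range(num_chunks):
        chunk = await pending
        # Request the next chunk before the current one is consumed.
        pending = asyncio.create_task(request_chunk(get_observation()))
        for command in chunk:
            send_command(command)
            await asyncio.sleep(STEP_S)
    pending.cancel()

if __name__ == "__main__":
    asyncio.run(control_loop(lambda: "camera_frame", print, num_chunks=2))
```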
The Gemini Robotics paper describes the training mix as "a large and diverse dataset consisting of action-labeled robot data as well as other multimodal data." The robot-data portion includes thousands of hours of expert teleoperated demonstrations collected over twelve months on ALOHA 2 robots. The multimodal portion includes web documents, code, images, audio, video, and embodied reasoning and visual question answering data inherited from Gemini 2.0 pretraining.[16] Ablation studies in the same paper found that training a Gemini Robotics specialist model from scratch, rather than fine-tuning the generalist checkpoint, dropped success rates on evaluation tasks to 0%, which the authors interpreted as evidence that the multimodal pretraining is doing most of the heavy lifting for generalization.[16]
In the 1.5 generation, the VLA was given the ability to think in natural language before producing actions. Google DeepMind described this as helping the robot "assess and complete complex tasks more transparently," and noted that the model could explain its plan in natural language while moving.[4] On the planning side, Gemini Robotics-ER 1.5 added an adjustable thinking budget that lets developers trade response speed for reasoning depth, and explicit checks against payload limits and workspace constraints to filter physically infeasible plans.[7]
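Because Gemini Robotics-ER 1.5 is served through the Gemini API, the thinking budget is configured the same way as for other Gemini thinking models. The minimal sketch below uses the google-genai Python client; the model identifier and budget value are assumptions based on Google's preview naming and may differ in practice.

```python
# Minimal sketch: setting a thinking budget for an embodied-reasoning query.
# The model id below is an assumed preview name, not a guaranteed identifier.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",   # assumed preview model id
    contents="Plan the steps to clear the mugs from the table into the sink.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,  # tokens of internal reasoning; smaller budgets respond faster
        ),
    ),
)
print(response.text)
```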
Google DeepMind has emphasized that a single Gemini Robotics model can drive multiple robot embodiments. Internal evaluations showed that tasks demonstrated only on the bi-arm ALOHA 2 platform during training also worked on Apptronik's Apollo humanoid and on the bi-arm Franka FR3, with no per-robot specialization.[1][4] In the on-device release, Google reported that the model could be adapted to the Franka FR3 and Apollo from ALOHA training data with as few as 50 to 100 demonstrations per new task.[3][17]
Gemini Robotics On-Device is a smaller VLA designed to run locally on the robot's onboard compute, without requiring a cloud connection. It is engineered for low-latency inference on bi-arm platforms, with general-purpose dexterity comparable to the cloud-served model on out-of-distribution and multi-step tasks.[3] InfoQ reported that the on-device model completed evaluation tasks successfully more than 60% of the time on average, against roughly 80% for the cloud variant, and that it outperformed other published on-device VLA baselines.[17] It was the first VLA Google DeepMind released for fine-tuning by external developers, distributed through the Gemini Robotics SDK with a MuJoCo physics simulator integration.[3][17]
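The Gemini Robotics SDK itself is gated behind a waitlist, but the simulation side of the fine-tuning workflow can be sketched with the open-source mujoco Python package: load a scene, roll out a candidate policy, and check task success. In the sketch below only the mujoco calls reflect a real public API; the scene file, policy function, and success check are hypothetical stand-ins for what a fine-tuned checkpoint and task definition would provide.

```python
# Sketch of a MuJoCo rollout for evaluating a fine-tuned policy in simulation.
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("bi_arm_scene.xml")  # placeholder scene description
data = mujoco.MjData(model)

def policy(observation: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a fine-tuned VLA; here it outputs zero controls."""
    return np.zeros(model.nu)

def task_succeeded(data: mujoco.MjData) -> bool:
    """Hypothetical success detector, e.g. checking an object's final pose."""
    return False

for step in range(2000):
    observation = np.concatenate([data.qpos, data.qvel])
    data.ctrl[:] = policy(observation)
    mujoco.mj_step(model, data)
    if task_succeeded(data):
        print(f"task succeeded at step {step}")
        break
```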
Google DeepMind's public materials describe three recurring properties for Gemini Robotics: generality, interactivity, and dexterity.[1][2]
| Capability | Description |
|---|---|
| Generality | Adapts to new objects, instructions, and environments. Google reported that Gemini Robotics more than doubles the score of other state-of-the-art VLA models on a comprehensive generalization benchmark.[1] |
| Interactivity | Responds to commands phrased in everyday conversational language across multiple languages, monitors the scene continuously, and can adjust mid-task when objects move.[1][2] |
| Dexterity | Performs multi-step manipulation including origami folding, packing snacks into a Ziploc bag, folding clothes, unzipping bags, and pouring salad dressing.[1][3] |
| Tool use | The 1.5 generation can call digital tools such as Google Search and other VLA models to retrieve information or execute sub-steps.[4][7] |
| Embodiment transfer | Adapts across ALOHA 2, the bi-arm Franka FR3, and humanoid platforms such as Apptronik's Apollo, often without per-robot training.[1][4] |
Gemini Robotics-ER, the reasoning model, contributes a different set of capabilities focused on perception and planning: 2D pointing and object detection grounded in size, weight, and affordance information, multi-view correspondence, success and failure detection from camera streams, and code generation that calls other tools to execute physical actions.[7][16]
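Google's developer examples return these perception outputs as structured JSON rather than free text. The sketch below requests 2D points for named objects and parses the result; the model identifier, the prompt wording, and the convention of [y, x] points normalized to 0-1000 follow Google's published preview examples but should be treated as assumptions that may change between releases.

```python
# Sketch of a pointing query against the embodied reasoning model via the Gemini API.
import json
from google import genai
from google.genai import types

client = genai.Client()

with open("workcell.jpg", "rb") as f:
    image = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

prompt = (
    "Point to the banana and the clear container. Answer as a JSON list of "
    '{"point": [y, x], "label": <name>} entries with coordinates normalized to 0-1000.'
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model id
    contents=[image, prompt],
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)

for item in json.loads(response.text):
    y, x = item["point"]
    print(f'{item["label"]}: normalized (y={y}, x={x})')
```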
The Gemini Robotics technical report decomposes generalization into three axes that are evaluated separately, a structure that reflects how earlier RT-2 work measured progress on robot foundation models.[16]
| Axis | Definition | Example |
|---|---|---|
| Visual generalization | Invariance to visual changes that do not affect the actions required to solve the task | New backgrounds, lighting changes, or distractor objects added to a familiar scene |
| Instruction generalization | Robustness to paraphrased or differently structured instructions | "Put the banana into the container" versus "Place the yellow fruit inside the clear box" |
| Action generalization | Ability to adapt or synthesize new motions for tasks the robot has not been trained on | Slam-dunking a new toy basketball, or grasping objects of an unseen shape |
Google reported that on a comprehensive generalization benchmark spanning these axes, Gemini Robotics more than doubled the average score of prior state-of-the-art VLA baselines, with the largest gaps appearing on instruction and action generalization rather than visual generalization alone.[1][16]
The demonstrations released with the original Gemini Robotics announcement included a robotic arm placing a banana into a clear container while the container was repositioned, folding glasses into a case, performing origami, and slam-dunking a small basketball into a net despite never having seen those specific objects in training.[1][10] Apollo, Apptronik's humanoid robot, was shown sorting laundry, placing colored blocks into trays, and loading bread into Ziploc bags.[15][21]
For the 1.5 generation, Google DeepMind highlighted a multi-step waste sorting task in which the robot first looked up local recycling rules using Google Search, then identified each object visually, then placed it in the correct bin: a sequence that required tool use, planning, and physical manipulation in a single mission.[4]
Google has published several headline benchmark numbers for the Gemini Robotics family. The numbers below are taken directly from Google DeepMind's announcements and the underlying technical reports.
| Model | Benchmark | Result | Source |
|---|---|---|---|
| Gemini Robotics (March 2025) | Comprehensive generalization benchmark | More than 2x average score over prior state-of-the-art VLAs | [1] |
| Gemini Robotics-ER (March 2025) | End-to-end robotic control | 2x to 3x success rate over Gemini 2.0 baseline | [1] |
| Gemini Robotics-ER 1.5 (September 2025) | 15 academic embodied reasoning benchmarks (ERQA, Point-Bench, RefSpatial, RoboSpatial-Pointing, Where2Place, BLINK, CV-Bench, EmbSpatial, MindCube, RoboSpatial-VQA, SAT, Cosmos-Reason1, Minimal Video Pairs, OpenEQA, VSI-Bench) | Highest aggregated score among models tested by Google | [4] |
| Gemini Robotics On-Device (June 2025) | Average task success across seven evaluation tasks | About 60% on-device, about 80% for cloud variant | [17] |
| Gemini Robotics-ER 1.6 (April 2026) | Instrument reading without agentic vision | 86% | [5] |
| Gemini Robotics-ER 1.6 (April 2026) | Instrument reading with agentic vision | 93% | [5] |
| Gemini Robotics models | ASIMOV semantic safety benchmark | Over 80% accuracy on hazardous-scenario questions, including bleach-and-vinegar mixing | [11][22] |
The IEEE Spectrum coverage of the April 2026 release reported that ER 1.6 lifted instrument reading accuracy from a Gemini Robotics-ER 1.5 baseline of 23% to 98% when equipped with agentic vision in Boston Dynamics' deployment, illustrating how much the agentic vision pipeline contributes on top of the base model.[19]
Gemini Robotics is intentionally embodiment-flexible, but most public demonstrations have used a small set of hardware partners.
| Partner | Robot | Role |
|---|---|---|
| Google DeepMind in-house | ALOHA 2 (bi-arm research platform) | Primary training and evaluation platform[1][16] |
| Franka Robotics | Franka FR3 (bi-arm) | Cross-embodiment evaluation; on-device adaptation target[3][17] |
| Apptronik | Apollo (humanoid) | Lead humanoid partner; demonstrations of laundry sorting, color-block placement, and packing[15][21] |
| Boston Dynamics | Spot (quadruped) | Industrial inspection, gauge reading, and autonomous navigation with Gemini Robotics-ER 1.6[6][19] |
| Agile Robots | Industrial bi-arm platforms | Trusted tester for Gemini Robotics-ER[2][8] |
| Agility Robotics | Digit (humanoid) | Trusted tester for Gemini Robotics-ER[2][8] |
| Enchanted Tools | Mirokai (mobile humanoid) | Trusted tester for Gemini Robotics-ER[2][8] |
Google DeepMind reported that the trusted tester program had grown to over sixty participants by the September 2025 update, with Apptronik named as a continuing partner during that release.[18]
Boston Dynamics integrated Gemini Robotics into the Spot SDK by exposing a small set of "tools" (navigation between locations, image capture, object identification, grasping, and placement) that Gemini Robotics could call. The integration deliberately limits the model to the existing API surface, so it cannot invent actions beyond what Spot is already permitted to do.[23] On top of this, the AIVI-Learning visual inspection product on Spot and the Orbit fleet manager incorporated Gemini Robotics-ER 1.6 to read analog gauges, thermometers, sight glasses, and digital displays during autonomous patrol.[6][19]
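The pattern described here, exposing a fixed menu of robot skills as callable tools, maps directly onto the Gemini API's function-calling support, in which plain Python functions can be passed as tools and the model is restricted to calling them. In the sketch below the wrapper functions are hypothetical stand-ins for calls into the Spot SDK; only the google-genai tool-passing pattern reflects a documented API, and the model identifier is assumed.

```python
# Sketch: constraining the model to a fixed tool surface, in the spirit of the
# Spot integration. The three wrapper functions are hypothetical placeholders.
from google import genai
from google.genai import types

def navigate_to(waypoint: str) -> str:
    """Drive the robot to a named waypoint on its recorded map."""
    return f"arrived at {waypoint}"

def capture_image(camera: str) -> str:
    """Capture an image from a named onboard camera and return a handle."""
    return f"image handle from {camera}"

def identify_object(image_handle: str) -> str:
    """Run object identification on a previously captured image."""
    return "pressure gauge"

client = genai.Client()
response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model id
    contents="Go to the pump room and report what the gauge next to valve 3 shows.",
    config=types.GenerateContentConfig(
        # Python callables passed as tools enable automatic function calling;
        # the model can only invoke the functions listed here.
        tools=[navigate_to, capture_image, identify_object],
    ),
)
print(response.text)
```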
Google DeepMind has framed Gemini Robotics as a robotics program with an explicit safety layer rather than a research demo. The company says its robotics models are reviewed through its Responsibility and Safety Council and evaluated against the ASIMOV benchmark suite for semantic and physical safety constraints.[1][22] In MIT Technology Review's launch coverage, the team described a constitutional approach inspired by Isaac Asimov's laws of robotics that produces a data-driven robot constitution to align behavior with human values.[10]
The Gemini Robotics 1.5 release upgraded the safety stack in three ways: it added a high-level semantic reasoning step that lets the planner think about safety before acting, aligned the planner's output with the existing Gemini Safety Policies, and triggered low-level on-board safety subsystems for collision avoidance.[4] An updated benchmark, ASIMOV v2, added broader tail coverage, new safety question types, and video modalities, and Google reported state-of-the-art results on it for Gemini Robotics-ER 1.5.[4]
For the April 2026 ER 1.6 release, Google reported better adherence to physical constraints (such as weight limits and liquid handling), a roughly 10-percentage-point improvement on video-based safety hazard identification compared to Gemini 3.0 Flash, and stronger compliance with Gemini safety policies on adversarial prompts.[5]
Gemini Robotics sits in a small but growing category of robotics foundation models. The table below summarizes how it compares to the most widely discussed peers as of April 2026.
| Model | Developer | Released | Approach | Action representation | Notable embodiments |
|---|---|---|---|---|---|
| RT-1 | Google Research | 2022 | Transformer policy on real demonstrations | Discrete tokens | Everyday Robots mobile manipulator |
| RT-2 | Google DeepMind | 2023 | VLM (PaLI-X / PaLM-E) fine-tuned on robot data | Discrete tokens | Bi-arm research robots |
| OpenVLA | Stanford and partners | June 2024 | Open-source 7B VLA on Open X-Embodiment data | Discrete tokens | 22 embodiments via Open X-Embodiment |
| Pi-0 | Physical Intelligence | 2024 | Diffusion-based VLA with continuous joint outputs at 50 Hz | Continuous, diffusion-generated trajectories | Multiple bi-arm and humanoid platforms |
| GR00T N1 | NVIDIA | 2025 | Foundation VLA for humanoids | Continuous joint actions | Humanoid robots |
| Gemini Robotics 1.5 | Google DeepMind | September 2025 | VLA on top of Gemini 2.0 with thinking before action | Continuous joint actions | ALOHA 2, Franka FR3, Apptronik Apollo |
In architecture terms, Pi-0 emphasizes diffusion-based continuous control and a hardware-agnostic, real-data philosophy, while OpenVLA and the RT line use discrete action tokens. Gemini Robotics shares Pi-0's continuous control philosophy but inherits the much larger Gemini 2.0 base, which gives it stronger world knowledge and tool-use behavior at the expense of fully open weights.[12] OpenVLA is open source and has been shown to outperform RT-2 on a suite of manipulation tasks despite a smaller parameter count, while Gemini Robotics is closed source and accessed through partner programs and the Gemini API.[12]
Google DeepMind has staggered developer access across the family. The vision-language-action models have generally been gated to partner programs, while the embodied reasoning models have been the first to reach broader developer audiences.
| Model | Access route as of April 2026 | Notes |
|---|---|---|
| Gemini Robotics (March 2025) | Trusted tester program only | Required signup form; available to a small set of partners[2][8] |
| Gemini Robotics-ER (March 2025) | Trusted tester program only | Same partner program as the VLA[2][8] |
| Gemini Robotics On-Device (June 2025) | Waitlist, then Gemini Robotics SDK | First VLA in the family released for fine-tuning, distributed with MuJoCo integration[3][17] |
| Gemini Robotics 1.5 (September 2025) | Select partners only | Continued partner-only distribution for the action model[4][18] |
| Gemini Robotics-ER 1.5 (September 2025) | Public preview through Gemini API and Google AI Studio | First Gemini Robotics model on the public Gemini API[4][7] |
| Gemini Robotics-ER 1.6 (April 2026) | Gemini API and Google AI Studio | Sample Colab notebooks and reference integrations published with the release[5] |
Google's developer materials for Gemini 2.5 highlight related primitives that complement Gemini Robotics, including the Live API for real-time voice interaction with robots, function-calling for defining robot APIs as tools, and code-generation patterns for pick-and-place planning. These primitives are exposed through Google AI Studio, the Gemini API, and Vertex AI for any application built on Gemini 2.5, not just the dedicated robotics models.[25]
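One of those primitives, declaring a robot API as a tool schema, can be illustrated with an explicit function declaration. The pick_and_place operation below is a hypothetical robot API used only for illustration; the declaration and tool-passing calls follow the google-genai client, and a general Gemini 2.5 model name is used because these primitives are documented for any Gemini 2.5 application.

```python
# Sketch: declaring a hypothetical robot API as an explicit tool schema.
from google import genai
from google.genai import types

pick_and_place = types.FunctionDeclaration(
    name="pick_and_place",
    description="Pick up a named object and place it at a named location.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "object_name": types.Schema(type=types.Type.STRING),
            "target_location": types.Schema(type=types.Type.STRING),
        },
        required=["object_name", "target_location"],
    ),
)

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Put the red block in the blue tray.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[pick_and_place])],
    ),
)

# The model answers with a structured function call rather than free text.
for call in response.function_calls or []:
    print(call.name, dict(call.args))
```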
The public deployments of Gemini Robotics fall into several broad use cases.
| Use case | Examples | Models involved |
|---|---|---|
| Industrial inspection | Reading analog gauges, thermometers, sight glasses, and digital displays during autonomous patrol with Boston Dynamics Spot | Gemini Robotics-ER 1.6[6][19] |
| Logistics and manufacturing | Apptronik Apollo trial deployments at Mercedes-Benz, GXO Logistics, and Jabil | Gemini Robotics, Gemini Robotics 1.5[15][21] |
| Household tasks | Sorting laundry, packing snacks into bags, folding garments, organizing shoes by handwritten instructions | Gemini Robotics, Gemini Robotics 1.5, Gemini Robotics On-Device[1][3][23] |
| Multi-step agentic missions | Sorting trash and recycling using local rules looked up via Google Search, then physically placing each object in the correct bin | Gemini Robotics 1.5 + Gemini Robotics-ER 1.5[4] |
| Research demonstrations | Origami folding, slam-dunking small basketballs, drawing cards, pouring salad dressing | Gemini Robotics, Gemini Robotics On-Device[1][3][10] |
The combination of Gemini Robotics-ER and a vision-language-action model has been pitched as a generic agent loop for embodied tasks: the planner inspects the scene, decomposes the goal into sub-tasks, calls a VLA or external API for each step, and uses success detection to decide whether to retry or move on.[4][7]
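That loop can be written down schematically. Everything in the sketch below is a hypothetical stand-in: plan() for the embodied reasoning planner, execute() for a VLA or external API, and succeeded() for camera-based success detection; none of it corresponds to a published SDK interface.

```python
# Schematic of the planner-executor loop described above.
from typing import Callable, List

def run_mission(
    goal: str,
    plan: Callable[[str, str], List[str]],      # ER-style planner: goal + scene -> sub-tasks
    execute: Callable[[str], None],             # VLA or external API executing one sub-task
    succeeded: Callable[[str], bool],           # success detection from camera streams
    max_retries: int = 2,
) -> bool:
    scene = "initial camera observation"        # placeholder for real perception input
    for step in plan(goal, scene):              # decompose the goal into sub-tasks
        for attempt in range(max_retries + 1):
            execute(step)                       # carry out the sub-task
            if succeeded(step):
                break                           # move on to the next sub-task
            if attempt == max_retries:
                return False                    # give up after repeated failures
    return True
```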
Reception of the launch combined enthusiasm about scope with skepticism about real-world readiness. Stanford bioengineer Jan Liphardt told MIT Technology Review that an intermediate layer of physical intelligence was the missing piece between cognition, large language models, and decision-making, and that Gemini Robotics was a credible attempt to fill that gap.[10] IEEE Spectrum noted that the dexterity demonstrations relied on task-specific, high-quality training data rather than fully general skills, and that the embodied reasoning model's reliance on human-centric training data could produce suboptimal grasps for some robotic end effectors.[11]
MIT Technology Review's hands-on commentary observed that the demonstrations remained "quite slow and a little janky," while crediting the underlying generalization with a clear step up from prior systems.[10] Axios characterized the launch as the moment the humanoid robot industry started to converge with frontier AI labs, with Google DeepMind, Apptronik, Boston Dynamics, and Agility Robotics all named in the same announcement.[24]
Later coverage of Gemini Robotics-ER 1.6 was more pointed about practical impact. IEEE Spectrum and The Robot Report both highlighted the deployment on Boston Dynamics' Spot for industrial inspection as the moment when Gemini Robotics moved from research lab to revenue-generating field robot.[19]
Google DeepMind has acknowledged several limitations of the Gemini Robotics family in its own materials and in interviews. The cloud-served VLA depends on a network connection, which constrains deployments where bandwidth or latency is unreliable; the on-device model addresses this but trades some capability for size, with average task success closer to 60% than to the cloud variant's roughly 80% in Google's own evaluations.[3][17] Generalization gains depend heavily on the training data mix, and dexterous tasks such as origami folding require careful per-task data curation rather than emerging fully zero-shot.[11]
The reasoning model can hallucinate spatial properties or affordances, and the September 2025 release added explicit physical-feasibility checks to filter out plans the planner generated but the robot could not safely execute.[7] Demonstrations remain comparatively slow: motions are deliberate rather than human-speed, and several reviewers have noted that the dexterity reels published by Google DeepMind are edited highlights from longer takes.[10][11]
Finally, Gemini Robotics models are largely closed: weights are not published, and access is gated through partner programs, the Gemini Robotics SDK trusted-tester waitlist, and the Gemini API for the embodied reasoning models. This contrasts with peers such as OpenVLA, whose weights and training data are public, and limits independent reproduction of Google's benchmark numbers.[12]