Aditya Ramesh
Last reviewed
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,467 words
Aditya Ramesh is an artificial intelligence researcher at OpenAI, where he serves as a vice president of research and is best known as the creator of DALL-E, the text-to-image model that helped launch the modern wave of generative image synthesis. He was the lead author of the original DALL-E in 2021 and of DALL-E 2 in 2022, and he led the team that built DALL-E 3 in 2023. Ramesh later assembled and led the group behind Sora, OpenAI's text-to-video model, and as of 2026 he heads an internal research program called Worldsim and the company's robotics division. [1][8]
Ramesh is of Indian origin. [11] He studied at New York University, where he earned a bachelor's degree; he does not hold a graduate degree. During his final undergraduate years he worked on research projects in the laboratory of Yann LeCun, the Turing Award winner and deep learning pioneer who had built NYU's machine learning research group. [2]
According to LeCun, Ramesh had intended to pursue a PhD after graduating. Instead he took a summer internship at OpenAI, and the company decided to keep him on. [2] That choice placed Ramesh inside OpenAI during the period when the organization was scaling up the transformer-based generative models that would come to define its research agenda. His path, from undergraduate intern to vice president of research without a doctorate, later became a frequently cited example of how the field rewarded hands-on model building over formal credentials. [2]
Ramesh is best known as the inventor of DALL-E, which OpenAI introduced in January 2021. The system generated images from natural language descriptions and could combine unrelated ideas, such as "an armchair in the shape of an avocado," in coherent and often unexpected ways. Its name is a portmanteau of the Surrealist painter Salvador Dalí and the animated Pixar robot WALL-E, chosen to evoke a merger of art and technology. [11][1]
The first DALL-E was a 12-billion-parameter autoregressive transformer that treated text and image tokens as a single stream of data, extending to vision the generative pretraining approach OpenAI had already applied to language. The model leaned on a separate OpenAI system, CLIP, to rank its candidate images by how well they matched the prompt. Ramesh was the lead author of the accompanying paper, "Zero-Shot Text-to-Image Generation," presented at the International Conference on Machine Learning in 2021. His co-authors included Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and OpenAI co-founder Ilya Sutskever. [3] The work built on OpenAI's earlier generative pretraining research, including Image GPT, which had shown that transformers trained to predict pixels could learn strong image representations.
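The two ideas in the paragraph above, a single token stream and ranking candidates against the prompt, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `build_stream`, `cosine_similarity`, and `rerank` are hypothetical, the vocabulary size and token values are toy numbers, and the hand-written vectors stand in for embeddings that a real system would obtain from a trained model such as CLIP.

```python
import math


def build_stream(text_tokens, image_tokens, text_vocab_size):
    """Join text and image tokens into one sequence (illustrative only).

    Image token ids are shifted past the text vocabulary so the two
    kinds of token never collide inside the shared stream.
    """
    return text_tokens + [t + text_vocab_size for t in image_tokens]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, candidates):
    """Order (name, embedding) pairs by similarity to the prompt, best first."""
    scored = [(cosine_similarity(text_embedding, emb), name)
              for name, emb in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored]


# Toy data: hypothetical ids and embeddings, not values from any real model.
stream = build_stream([5, 2], [0, 3], text_vocab_size=10)

text_embedding = [1.0, 0.0, 1.0]
candidates = [
    ("sample_a", [0.0, 1.0, 0.0]),
    ("sample_b", [0.9, 0.1, 1.1]),
    ("sample_c", [1.0, 1.0, 0.0]),
]
ranking = rerank(text_embedding, candidates)
```

In this toy setup the candidate whose vector points in nearly the same direction as the prompt vector is ranked first, which is the sense in which a similarity model can pick the best of several generated samples.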
In an interview marking the model's second anniversary, Ramesh said the team had been drawn to text-to-image generation because language can describe almost any situation: "We felt like text-to-image generation was interesting because as humans, we're able to construct a sentence to describe any situation." He added that he had expected the technology to matter but was surprised by how quickly it reached a wide audience. [1]
Ramesh led the follow-up, DALL-E 2, unveiled in April 2022, and was the lead author of its paper, "Hierarchical Text-Conditional Image Generation with CLIP Latents," written with Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. [4] Rather than a single autoregressive model, DALL-E 2 used a two-stage design: a prior that produces a CLIP image embedding from a text caption, followed by a diffusion model decoder that turns that embedding into an image. The move to diffusion brought higher resolution and greater photorealism, and it introduced editing features such as inpainting, outpainting, and image variations that became standard in later tools. OpenAI released DALL-E 2 cautiously, behind a waitlist and content filters, before opening it more broadly later in 2022. Ramesh has described DALL-E as a "creative co-pilot" for artists, comparable to how OpenAI's Codex assists programmers. [1]
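The data flow of the two-stage design described above can be sketched in a few lines of Python. This is a structural caricature under stated assumptions, not DALL-E 2's implementation: `toy_prior`, `toy_decoder`, and `generate` are hypothetical names, the prior is a fixed scaling rather than a learned model, and the decoder's refinement loop only mimics the start-from-noise, refine-step-by-step shape of diffusion sampling without any learned denoiser.

```python
import random


def toy_prior(text_embedding):
    """Stage 1 stand-in: map a text embedding to an 'image embedding'.

    A real prior is a trained model; this one is a fixed scaling chosen
    purely so the example is runnable.
    """
    return [0.5 * v for v in text_embedding]


def toy_decoder(image_embedding, steps=50, seed=0):
    """Stage 2 stand-in: start from noise and refine it step by step.

    Each step closes 20% of the remaining gap to a target determined by
    the conditioning embedding. A real diffusion decoder would instead
    apply a learned denoising network at every step.
    """
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in image_embedding]
    for _ in range(steps):
        sample = [s + 0.2 * (target - s)
                  for s, target in zip(sample, image_embedding)]
    return sample


def generate(text_embedding):
    """Chain the two stages: caption embedding -> image embedding -> output."""
    return toy_decoder(toy_prior(text_embedding))


image = generate([1.0, -2.0, 4.0])
```

The point of the split is that the two stages meet only at the intermediate embedding, so in this sketch either stand-in could be swapped out without touching the other.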
In September 2023, OpenAI announced DALL-E 3, which Ramesh led. Built natively into ChatGPT, it followed prompts far more faithfully than its predecessors and let users refine images through conversation. It began rolling out to ChatGPT Plus and Enterprise subscribers in October 2023, and OpenAI allowed artists to opt their work out of future training. [5]
| Model | Public debut | Paper | Architecture | Notable advances |
|---|---|---|---|---|
| DALL-E | January 2021 | Zero-Shot Text-to-Image Generation | 12B-parameter autoregressive transformer over text and image tokens | First broad demonstration of free-form text-to-image generation |
| DALL-E 2 | April 2022 | Hierarchical Text-Conditional Image Generation with CLIP Latents | CLIP prior plus diffusion decoder | Higher resolution and photorealism; inpainting, outpainting, variations |
| DALL-E 3 | September 2023 | DALL-E 3 system card | Diffusion model integrated with ChatGPT | Much stronger prompt following; conversational image creation |
After the DALL-E series, Ramesh turned from still images to video. He built the team that created Sora, OpenAI's text-to-video model, and the product experience at sora.com. [8] Within OpenAI the effort sat inside a World Simulation group that Ramesh headed as vice president of research, and he was one of Sora's public leaders alongside researchers Tim Brooks and Bill Peebles. [7]
OpenAI unveiled Sora as a research preview on February 15, 2024, releasing sample clips and a technical report titled "Video generation models as world simulators." The model could generate high-definition video up to about one minute long from a text prompt, and OpenAI presented it not merely as a video generator but as an early step toward systems that learn an implicit simulation of the physical world. [6] OpenAI later opened a public version, Sora Turbo, through sora.com in December 2024. [8]
By 2025 and 2026 Ramesh's focus had shifted from generating pixels to acting in the physical world. He leads an internal research program called Worldsim, which uses powerful world models in the lineage of Sora to simulate reality in enough detail that simulated experience can substitute for scarce real-world data. OpenAI has argued that this approach helps overcome the data bottleneck that has long slowed robot learning. [8][10]
On May 31, 2026, OpenAI announced a dedicated robotics division that grew out of the Worldsim program, with Ramesh leading it. [9][10] Chief executive Sam Altman framed the move as OpenAI's world-simulation research evolving into OpenAI Robotics. The stated near-term goal is to build robots that support skilled workers constructing infrastructure, with a longer-term ambition of general-purpose personal robots in widespread use. The relaunch marked OpenAI's return to robotics after it had disbanded an earlier robotics team around 2020 to 2021, and it followed the end of OpenAI's collaboration with the humanoid robot startup Figure. The division began hiring across areas such as actuator design, simulation realism, and large-scale data collection. [9][10]
A consistent thread runs through Ramesh's work: building generative models that internalize a usable model of the world. DALL-E learned to render scenes described in language; Sora extended that idea to motion and time, which OpenAI explicitly described as world simulation; and Worldsim aims to push the same modeling power toward agents that perceive and act. [6][8] His career also illustrates a broader pattern at OpenAI, in which capabilities first demonstrated in one modality, text, were carried into images, then video, and ultimately toward embodied systems.