Justin Johnson
Last reviewed
Jun 5, 2026
Sources
24 citations
Review status
Source-backed
Revision
v2 · 2,226 words
Justin Johnson is an American computer scientist known for his work in computer vision and machine learning, particularly on perceptual losses for real-time neural style transfer, the CLEVR visual-reasoning benchmark, and the DenseCap dense-captioning model.[1][2][3] He completed his PhD at Stanford University under Fei-Fei Li, was an assistant professor at the University of Michigan from 2019 to 2024, and is a co-founder of World Labs, the spatial-intelligence company started in 2024 by Li and three other researchers.[4][5][6]
Johnson's research spans visual reasoning, vision and language, image generation, and 3D scene understanding using deep learning.[4][7] During his doctoral work he produced several widely used contributions: a feed-forward formulation of neural style transfer trained with perceptual loss functions, the DenseCap model for localizing and describing many regions of an image, and CLEVR, a synthetic dataset designed to probe compositional visual reasoning.[1][3][2] As a faculty member he became known beyond the research community for a freely available graduate course on deep learning for computer vision, whose recorded lectures circulated widely online.[8] His Google Scholar profile lists tens of thousands of citations and an h-index in the forties, with the style-transfer paper alone accounting for more than ten thousand citations.[7]
Johnson earned a Bachelor of Science in mathematics and computer science from the California Institute of Technology in 2012.[9] He then entered the doctoral program in computer science at Stanford, where he worked in the Stanford Vision Lab under the supervision of Fei-Fei Li and completed his PhD in 2018.[1][6][10] His graduate research concentrated on connecting language and vision, structured visual reasoning, and image synthesis, themes that ran through most of his early publications.[6][9] He stated his interests broadly as computer vision and machine learning, with an emphasis on visual reasoning, vision and language, image generation, and 3D reasoning using deep neural networks.[4]
While a student he held research internships in industry, including stints at Google and at Yahoo, before joining Facebook AI Research toward the end of his doctorate.[9][6] He has also worked as a visiting faculty researcher at Google AI.[4]
In 2019 Johnson joined the University of Michigan as an assistant professor in the department of electrical engineering and computer science, where he led a group working on computer vision and machine learning.[6][4] He advised several doctoral students there, including Mohamed El Banani, Karan Desai, Ang Cao, Chris Rockwell, Nilesh Kulkarni, and Tiange Luo, some of them co-advised with David Fouhey or Honglak Lee, and his Michigan-era work extended into 3D reconstruction and scene generation alongside his earlier interests in vision and language.[6][4] He remained on the Michigan faculty until 2024.[6]
As a teacher, Johnson is best known for EECS 498-007 / 598-005, "Deep Learning for Computer Vision," which he developed and delivered at Michigan in the fall 2019, fall 2020, and winter 2022 terms; he also taught the undergraduate computer-vision course EECS 442.[8][4] The full lecture series for the deep-learning course was posted publicly and became a popular self-study resource, covering neural network architectures, training methods, and contemporary research in visual recognition.[8] The course traces its lineage to Stanford's CS231n, "Convolutional Neural Networks for Visual Recognition," which Johnson co-taught as a graduate student.[4][11] He was an instructor for CS231n across several offerings, including the 2016 edition with Andrej Karpathy and Fei-Fei Li and the 2017 edition whose lecture videos were released online, and he continued as an instructor in 2018 and 2019 alongside Serena Yeung and Fei-Fei Li.[4][11]
Several of Johnson's papers are heavily cited and have shaped subfields of computer vision. As of 2026 his Google Scholar profile reports more than 47,000 total citations, an h-index of 43, and an i10-index of 58.[7]
His most-cited work is "Perceptual Losses for Real-Time Style Transfer and Super-Resolution," presented at the 2016 European Conference on Computer Vision (ECCV) with Alexandre Alahi and Fei-Fei Li.[1] The paper trains a feed-forward convolutional neural network for image transformation tasks using a perceptual loss computed from the activations of a pretrained network, rather than a per-pixel loss.[1] For artistic style transfer it produces results comparable to those of the slower optimization-based method of Gatys and colleagues while running roughly three orders of magnitude faster, which made interactive style transfer practical.[1] Johnson released an accompanying open-source implementation that was widely reused.[12] On Google Scholar the paper has accumulated well over ten thousand citations, making it his single most cited publication.[7] He later studied the temporal artifacts of these feed-forward models in "Characterizing and Improving Stability in Neural Style Transfer," presented at the 2017 International Conference on Computer Vision (ICCV) with Agrim Gupta, Alexandre Alahi, and Fei-Fei Li.[4]
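The contrast between a per-pixel loss and a perceptual loss can be illustrated with a small sketch. It is not the paper's implementation: `toy_features` is a hypothetical stand-in (simple edge responses) for the activations of a pretrained network, and the image values are made up for illustration. The point is only that a loss measured in feature space can ignore differences that a pixel-space loss penalizes.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def pixel_loss(a: Image, b: Image) -> float:
    """Per-pixel loss: mean squared difference of raw intensities."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def toy_features(img: Image) -> List[float]:
    """Hypothetical stand-in for a pretrained layer: local edge responses.

    A real perceptual loss would use activations of a pretrained CNN instead.
    """
    feats = []
    for i in range(len(img) - 1):
        for j in range(len(img[0]) - 1):
            feats.append(img[i][j + 1] - img[i][j])  # horizontal difference
            feats.append(img[i + 1][j] - img[i][j])  # vertical difference
    return feats


def perceptual_loss(a: Image, b: Image) -> float:
    """Mean squared difference measured in feature space, not pixel space."""
    fa, fb = toy_features(a), toy_features(b)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)


target = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
brighter = [[p + 0.5 for p in row] for row in target]  # same structure, shifted intensity

print(pixel_loss(target, brighter))       # 0.25
print(perceptual_loss(target, brighter))  # 0.0
```

Here the brightened image keeps every edge of the target, so the feature-space loss is zero even though every pixel differs. In a training setup of the kind the paper describes, a loss of this type would be computed on the output of the feed-forward network and backpropagated into it.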
In "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) with Andrej Karpathy and Fei-Fei Li, he introduced the task of dense captioning, in which a system both localizes salient regions of an image and describes each in natural language.[3] The model combined a convolutional network, a differentiable localization layer, and a recurrent neural network language model in a single architecture trained end to end, and was evaluated on the Visual Genome dataset.[3]
CLEVR, short for "Compositional Language and Elementary Visual Reasoning," is a diagnostic dataset introduced in a CVPR 2017 paper that he led, with collaborators including Bharath Hariharan, Laurens van der Maaten, Fei-Fei Li, and C. Lawrence Zitnick.[2] Built from rendered scenes of simple 3D shapes paired with programmatically generated questions, CLEVR was designed to isolate and measure specific reasoning abilities of vision-and-language systems while minimizing the dataset biases that let models exploit shortcuts.[2] It became a standard benchmark for compositional visual reasoning. Johnson followed it with "Inferring and Executing Programs for Visual Reasoning," an ICCV 2017 oral with Hariharan, van der Maaten, Judy Hoffman, Fei-Fei Li, Zitnick, and Ross Girshick, which paired a program generator that translates a question into an explicit sequence of reasoning steps with an execution engine that runs that program over the image.[17] Trained with program supervision, the neuro-symbolic model reached about 97 percent accuracy on CLEVR, well above strong end-to-end baselines.[17]
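The idea of answering a question by running an explicit sequence of reasoning steps can be sketched in a few lines. This is a simplified illustration, not the model from the paper. The scene is a hand-written symbolic list rather than an image, the programs are written by hand rather than produced by a learned program generator, and the step names (`filter_color`, `count`, `exists`) are assumptions chosen for readability.

```python
from typing import Any, Dict, List, Tuple

Obj = Dict[str, str]
Program = List[Tuple[str, Any]]

# Hand-written symbolic scene (an assumption; CLEVR itself provides rendered images).
SCENE: List[Obj] = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]


def execute(program: Program, scene: List[Obj]) -> Any:
    """Run reasoning steps in order; each step consumes the previous result."""
    state: Any = list(scene)
    for op, arg in program:
        if op.startswith("filter_"):
            key = op[len("filter_"):]  # e.g. "filter_color" -> "color"
            state = [o for o in state if o[key] == arg]
        elif op == "count":
            state = len(state)
        elif op == "exists":
            state = len(state) > 0
        else:
            raise ValueError(f"unknown step: {op}")
    return state


# "How many blue cubes are there?" written out by hand as a program.
how_many_blue_cubes: Program = [
    ("filter_color", "blue"),
    ("filter_shape", "cube"),
    ("count", None),
]
print(execute(how_many_blue_cubes, SCENE))  # 1

# "Is there a large sphere?"
large_sphere_exists: Program = [
    ("filter_size", "large"),
    ("filter_shape", "sphere"),
    ("exists", None),
]
print(execute(large_sphere_exists, SCENE))  # False
```

Because every intermediate result is explicit, a wrong answer can be traced to the step that produced it, which is the kind of diagnosis a compositional benchmark is meant to support.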
Earlier, his CVPR 2015 paper "Image Retrieval using Scene Graphs" introduced a framework for semantic image retrieval in which queries are expressed as scene graphs encoding objects, their attributes, and the relationships among them.[13] He was also among the authors of Visual Genome, a large dataset of densely annotated images connecting language and vision that was published in the International Journal of Computer Vision in 2017 and that ranks among his most cited works.[7] He returned to scene graphs as a generative tool in "Image Generation from Scene Graphs," a CVPR 2018 paper with Agrim Gupta and Fei-Fei Li that synthesizes images from graph-structured descriptions using graph convolution and a predicted scene layout.[18] Among his other widely cited Stanford-era papers are "Visualizing and Understanding Recurrent Networks," "Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks," and "HiDDeN: Hiding Data With Deep Networks."[7][4]
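A scene graph of the kind described above can be pictured as a small data structure holding objects, attributes attached to objects, and relationships between pairs of objects. The sketch below is a hypothetical representation for illustration only; the class name, field layout, and example query are assumptions and do not reproduce the format used in the papers or in Visual Genome.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph: nodes are objects, edges are relationships."""

    objects: List[str]  # object class for each node
    attributes: List[Tuple[int, str]] = field(default_factory=list)  # (object index, attribute)
    relationships: List[Tuple[int, str, int]] = field(default_factory=list)  # (subject, predicate, object)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Relationships as readable (subject, predicate, object) name triples."""
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.relationships]

    def described_objects(self) -> List[str]:
        """Attribute-object pairs such as 'white boat'."""
        return [f"{attr} {self.objects[i]}" for i, attr in self.attributes]


# Illustrative query: a man standing on a white boat.
query = SceneGraph(
    objects=["man", "boat"],
    attributes=[(1, "white")],
    relationships=[(0, "standing on", 1)],
)

print(query.triples())            # [('man', 'standing on', 'boat')]
print(query.described_objects())  # ['white boat']
```

The same structure can serve as a retrieval query or as the conditioning input for image generation, since both uses only require that objects, attributes, and relationships be stated explicitly.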
At Facebook AI Research and Michigan, Johnson moved toward 3D reasoning and view synthesis. He co-authored "Mesh R-CNN" at ICCV 2019 with Georgia Gkioxari and Jitendra Malik, extending object detection to predict a 3D triangle mesh for each detected object.[19] In 2020 he was part of the team behind PyTorch3D, a FAIR library of reusable components for deep learning with 3D data described in "Accelerating 3D Deep Learning with PyTorch3D," which provided differentiable rendering, mesh operators, and other primitives and became a widely adopted research tool.[20] His view-synthesis work includes "SynSin: End-to-End View Synthesis from a Single Image," a CVPR 2020 oral with Olivia Wiles, Georgia Gkioxari, and Richard Szeliski that predicts and renders a 3D point cloud to generate new views, and "PixelSynth," an ICCV 2021 paper on producing a 3D-consistent experience from a single image.[4] He also contributed to representation-learning and registration work such as VirTex, which learns visual features from image captions, and UnsupervisedR&R for point-cloud registration via differentiable rendering, and to the PHYRE physical-reasoning benchmark.[4]
The table below summarizes selected works.
| Year | Work | Venue | Topic |
|---|---|---|---|
| 2015 | Image Retrieval using Scene Graphs | CVPR | Scene-graph image retrieval[13] |
| 2016 | DenseCap: Fully Convolutional Localization Networks for Dense Captioning | CVPR | Dense captioning[3] |
| 2016 | Perceptual Losses for Real-Time Style Transfer and Super-Resolution | ECCV | Feed-forward style transfer[1] |
| 2017 | CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | CVPR | Visual-reasoning benchmark[2] |
| 2017 | Inferring and Executing Programs for Visual Reasoning | ICCV | Neuro-symbolic visual reasoning[17] |
| 2017 | Visual Genome | IJCV | Language-and-vision dataset[7] |
| 2018 | Image Generation from Scene Graphs | CVPR | Scene-graph image synthesis[18] |
| 2019 | Mesh R-CNN | ICCV | 3D mesh prediction from images[19] |
| 2020 | Accelerating 3D Deep Learning with PyTorch3D | arXiv | 3D deep-learning library[20] |
| 2020 | SynSin: End-to-End View Synthesis from a Single Image | CVPR | Single-image view synthesis[4] |
Johnson joined Facebook AI Research (FAIR), the research division later organized under Meta AI, as a research scientist in 2018.[6] He continued his affiliation with the lab in research and visiting-scientist roles, concurrent with his Michigan professorship, through 2023.[6] During this period he contributed to work in computer vision, image generation, and 3D reasoning, including Mesh R-CNN, PyTorch3D, and SynSin.[4][6][19][20] He has also held a visiting faculty researcher position at Google AI.[4]
Johnson is a co-founder of World Labs, a startup that aims to build large "world models" capable of perceiving, generating, reasoning about, and interacting with the 3D world.[5][6][14] According to multiple accounts, the company was founded in January 2024 by Fei-Fei Li as chief executive together with Johnson, Christoph Lassner, and Ben Mildenhall, and it emerged from stealth later in 2024.[21][5] The company describes its four founders as "world-renowned" technologists in machine learning, generative AI, computer vision, and graphics.[5] The venture firm Andreessen Horowitz, an early investor, described Johnson as "a top researcher in AI and computer vision who did his PhD in Fei-Fei's lab and went on to be a professor at University of Michigan," crediting him with "some of the original style transfer work, as well as advances in many other areas of AI and computer vision."[15]
The company is built around the idea of spatial intelligence, the ability of an AI system to understand and act in three-dimensional space rather than reasoning only over flat images.[15][14] In 2024 it raised financing reported at $230 million at a valuation of about $1 billion, with backers including Andreessen Horowitz, the venture arm of NVIDIA, and Radical Ventures.[16][21] In November 2025 the company released Marble, its first commercial product, a multimodal world model that generates persistent, editable 3D worlds from inputs such as a text prompt, one or more images, video, or a coarse 3D layout, and that can export results as Gaussian splats, triangle meshes, or video.[23][5] In February 2026 World Labs raised an additional $1 billion in a round led by a $200 million investment from Autodesk, with participants including Andreessen Horowitz, NVIDIA, AMD, Emerson Collective, and Fidelity, bringing its total funding to roughly $1.23 billion.[22]
As of 2026 Johnson works on World Labs' world-model research, and his Google Scholar profile lists his affiliation as co-founder of the company.[7][5] In late 2024 he announced the team's first research result, a generative model of 3D worlds, describing the goal of lifting generative AI into three dimensions to change how media such as movies and games are made.[24]