Shap-E

Updated Jun 3, 2026

Shap-E is a conditional generative model for 3D assets developed by OpenAI, introduced in the paper "Shap-E: Generating Conditional 3D Implicit Functions" by Heewoo Jun and Alex Nichol, submitted to arXiv on May 3, 2023. ^[1] Unlike earlier 3D generative systems that produce a single fixed output representation, Shap-E directly generates the parameters of implicit functions, neural networks whose weights encode a 3D object. Because the generated function defines the object continuously in space, the same output can be rendered both as a textured mesh and as a neural radiance field (NeRF). ^[1] OpenAI released the inference code and model weights publicly on GitHub under the MIT License, positioning Shap-E as the successor to its earlier point-cloud system Point-E. ^[2]

Overview

Most prior 3D generative models, including Point-E, commit to one representation (for example, point clouds, voxels, or a single implicit field). Shap-E instead targets a higher-dimensional, multi-representation output space: rather than emitting geometry directly, it emits the weights of a small multilayer perceptron (MLP) that, when queried at any 3D coordinate, returns the properties needed to reconstruct the asset. ^[1] This design lets a single sampled output be rendered through two different paths, one based on volumetric rendering (NeRF) and one based on extracting an explicit surface mesh.
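As a rough illustration of what it means for the weights to be the asset, the sketch below queries a tiny MLP at one 3D coordinate. The layer sizes, the random weights, and the output layout (one density value plus RGB) are assumptions chosen for brevity and are not Shap-E's actual architecture.

```python
import math
import random

def make_mlp(sizes, seed=0):
    """Build random weights for a small MLP. In Shap-E's setting the
    weights would be produced by the generative model, not drawn at random."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def query(weights, xyz):
    """Evaluate the implicit function at one 3D point.
    Returns (density, (r, g, b)); this output layout is hypothetical."""
    h = list(xyz)
    for i, layer in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in layer]
        if i < len(weights) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers only
    density = math.log1p(math.exp(h[0]))  # softplus keeps density non-negative
    rgb = tuple(1.0 / (1.0 + math.exp(-v)) for v in h[1:4])  # sigmoid per channel
    return density, rgb

weights = make_mlp([3, 16, 16, 4])  # hypothetical sizes: xyz in, 4 values out
density, rgb = query(weights, (0.1, -0.2, 0.3))
```

Because the object lives entirely in `weights`, the same set of numbers can be queried at any coordinate and resolution, which is what makes more than one rendering path possible.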

The authors report that, when trained on a large dataset of paired 3D assets and text, Shap-E can generate complex and diverse 3D objects in a matter of seconds from a text prompt or a single image. ^[1]^[3] The repository's stated purpose is to "generate 3D objects conditioned on text or images." ^[2]

The name follows the pattern of OpenAI's earlier generators: it echoes Point-E and continues the "-E" lineage that began with the text-to-image model DALL-E. ^[2]

How it works

Shap-E is trained in two stages: first an encoder that turns 3D assets into implicit-function parameters, then a conditional diffusion model trained over the encoder's outputs.

Stage 1: The encoder

The first stage trains a transformer-based encoder that deterministically maps an existing 3D asset into the parameters of an implicit function. For each asset the encoder ingests two complementary views of the geometry: a point cloud of 16,384 RGB points and 20 multiview renderings at 256x256 resolution. ^[1] These inputs pass through point convolutions and cross-attention into a transformer backbone, which produces a latent sequence of 1,024 tokens. The latents are passed through a tanh activation (bounding them to the range [-1, 1]) and then projected into the weights of the implicit MLP, described as four weight matrices of size 256x256, so the 1,024 tokens correspond to the 4 x 256 = 1,024 matrix rows. ^[1]
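The token count in the paragraph above can be checked with a few lines of arithmetic. The sketch below also applies the tanh bounding step and groups the rows into four blocks. The token width, the placeholder latent values, and the assumption that rows are laid out matrix by matrix are hypothetical, and the learned projection from latents to MLP weights is omitted.

```python
import math

NUM_MATRICES = 4        # from the text: four weight matrices
ROWS_PER_MATRIX = 256   # each matrix is 256x256
NUM_TOKENS = NUM_MATRICES * ROWS_PER_MATRIX  # 1,024 latent tokens

def bound_latent(latent):
    """Apply tanh elementwise so every latent value lies in [-1, 1]."""
    return [[math.tanh(v) for v in token] for token in latent]

def split_into_matrices(rows):
    """Group the row vectors into consecutive blocks of 256 rows
    (assumed ordering: matrix 0 first, then matrix 1, and so on)."""
    return [rows[i * ROWS_PER_MATRIX:(i + 1) * ROWS_PER_MATRIX]
            for i in range(NUM_MATRICES)]

TOKEN_WIDTH = 8  # hypothetical, kept tiny so the example runs instantly
latent = [[(t - 512) * 0.01 + c for c in range(TOKEN_WIDTH)]
          for t in range(NUM_TOKENS)]  # placeholder values, not real encoder output
bounded = bound_latent(latent)
matrices = split_into_matrices(bounded)
```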

The implicit MLP itself can be queried in two ways, and the encoder is trained so that both rendering modes agree with the original asset:

NeRF rendering. The function maps a 3D coordinate and a viewing direction to a volume density and an RGB color, and images are produced by integrating along camera rays using coarse-to-fine sampling. This branch is trained with an L1 reconstruction loss on rendered colors and transmittance. ^[1]
STF (signed distance function and texture field) rendering. The function maps a coordinate to a color, a signed distance, and a vertex offset. A surface mesh is extracted from a 128x128x128 grid using differentiable marching cubes and then rasterized, which yields an explicit textured mesh. This branch is added after a distillation step and trained with an L2 loss. ^[1]

Supporting both decoders from a single set of parameters is what allows Shap-E outputs to be exported either as a NeRF or as a polygonal mesh.
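The NeRF path described above integrates density and color along each camera ray. A minimal sketch of the standard discrete compositing rule follows, together with an L1 loss on the rendered color. The sample values are made up, and the coarse-to-fine resampling step is left out for brevity.

```python
import math

def composite_ray(densities, colors, deltas):
    """Composite samples ordered near to far along one ray.
    Returns (rgb, transmittance remaining after the last sample)."""
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha          # contribution of this sample
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha            # light that passes through
    return rgb, transmittance

def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Empty space followed by an effectively opaque red surface (illustrative values).
rgb, remaining = composite_ray(
    densities=[0.0, 1000.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    deltas=[0.5, 0.5],
)
loss = l1_loss(rgb, [1.0, 0.0, 0.0])
```

The first sample has zero density and contributes nothing, so the ray takes on the color of the opaque sample behind it and the remaining transmittance drops to zero.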

Stage 2: Conditional diffusion over implicit-function parameters

The second stage trains a conditional diffusion model on the latent parameters produced by the frozen encoder. Shap-E reuses the transformer-based diffusion architecture from Point-E, but instead of denoising a sequence of points it denoises the 1,024-token latent sequence, treating each row of the MLP weight matrices as a token. ^[1] The model is trained with x0-prediction (predicting the clean sample directly) and sampled at inference time with a Heun sampler. ^[1]
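To make the sampling step concrete, here is a minimal sketch of a Heun (second-order) sampler driven by an x0-predicting denoiser. The ODE form dx/dsigma = (x - denoise(x, sigma)) / sigma, the noise schedule, and the toy denoiser are assumptions for illustration; the real model is a transformer over 1,024 tokens and its exact sampler settings are not given here.

```python
def heun_sample(denoise, x, sigmas):
    """Integrate dx/dsigma = (x - denoise(x, sigma)) / sigma from
    sigmas[0] down to sigmas[-1] using Heun's method."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_cur = denoise(x, s_cur)
        d_cur = [(xi - pi) / s_cur for xi, pi in zip(x, x0_cur)]
        x_euler = [xi + (s_next - s_cur) * d for xi, d in zip(x, d_cur)]
        if s_next == 0.0:
            x = x_euler  # slope is undefined at sigma = 0, so finish with Euler
            continue
        x0_next = denoise(x_euler, s_next)
        d_next = [(xi - pi) / s_next for xi, pi in zip(x_euler, x0_next)]
        # Heun correction: average the slopes at both ends of the step.
        x = [xi + (s_next - s_cur) * 0.5 * (a + b)
             for xi, a, b in zip(x, d_cur, d_next)]
    return x

target = [0.5, -0.25, 1.0]  # stand-in for a clean latent

def toy_denoiser(x, sigma):
    """Hypothetical x0-predictor that always returns the clean sample."""
    return list(target)

sigmas = [80.0, 20.0, 5.0, 1.0, 0.0]  # hypothetical decreasing schedule
x_init = [80.0 * v for v in (0.3, -1.2, 0.7)]  # pure-noise starting point
sample = heun_sample(toy_denoiser, x_init, sigmas)
```

With this toy denoiser the trajectory is linear in sigma, so the sampler lands on `target` up to floating-point error. A learned denoiser would give a curved trajectory, which is where the second-order correction pays off.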

Conditioning is supplied through CLIP embeddings, and the model uses classifier-free guidance: the conditioning signal is dropped on 10% of training examples so that, at inference, conditional and unconditional predictions can be combined to strengthen prompt adherence. ^[1] For the text-conditional model, a single token carrying the CLIP text embedding is prepended to the sequence; for the image-conditional model, a 256-token CLIP image-embedding sequence is prepended. ^[1] The text-conditional models were trained on a corpus of several million 3D assets, augmented with roughly one million additional assets and about 120,000 captions from human labelers for higher-quality subsets. ^[1]
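The two halves of classifier-free guidance, dropping the condition during training and combining predictions at inference, can be sketched as follows. The function names, the use of `None` as the "no condition" marker, and the guidance scale are hypothetical; only the 10% drop rate comes from the text.

```python
import random

def maybe_drop_condition(cond, rng, drop_prob=0.1):
    """Training time: replace the conditioning with None on a fraction of examples."""
    return None if rng.random() < drop_prob else cond

def guided_prediction(pred_cond, pred_uncond, scale):
    """Inference time: move from the unconditional prediction toward
    (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]

rng = random.Random(0)
dropped = sum(maybe_drop_condition("a chair", rng) is None for _ in range(10_000))

guided = guided_prediction([1.0, 2.0], [0.0, 1.0], scale=3.0)  # hypothetical scale
```

A scale of 1 reproduces the conditional prediction and a scale of 0 reproduces the unconditional one; larger scales trade sample diversity for closer adherence to the prompt.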

Comparison with Point-E

Shap-E is the direct successor to Point-E. The two share much of their diffusion infrastructure, but they differ in what they generate: Point-E is an explicit generative model that outputs 3D point clouds (and then fits a mesh as a separate step), whereas Shap-E generates implicit-function parameters that natively support both mesh and NeRF rendering. ^[1]^[2]

A central result of the paper is that, despite modeling a higher-dimensional, multi-representation output space, Shap-E converges faster during training and reaches comparable or better sample quality than the similarly sized Point-E. ^[1] At the 300M-parameter scale, the text-conditional Shap-E reported higher CLIP R-Precision than Point-E while also sampling more quickly. For image-conditional generation, the two models reached roughly the same final evaluation performance, with Shap-E holding a slight advantage on CLIP R-Precision and a slight disadvantage on CLIP score. ^[1]

Property	Point-E	Shap-E
Output representation	Explicit 3D point cloud	Parameters of an implicit function
Rendering modes	Point cloud, then separate mesh-fitting step	Textured mesh and NeRF from one output
Diffusion target	Sequence of points	1,024-token latent (MLP weights)
Text CLIP R-Precision (300M, ViT-B/32)	33.6%	37.8%
Reported sampling latency (text, 300M)	~25 V100-seconds	~13 V100-seconds

Latency and CLIP R-Precision figures are as reported in the Shap-E paper. ^[1]
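For a sense of scale, the two numeric rows of the table can be turned into a difference and a ratio. The values are copied from the table; the latency figures are approximate, so the speedup should be read as a rough estimate.

```python
point_e = {"r_precision": 33.6, "latency_s": 25.0}
shap_e = {"r_precision": 37.8, "latency_s": 13.0}

# Difference in CLIP R-Precision, in percentage points.
gain_points = round(shap_e["r_precision"] - point_e["r_precision"], 1)

# Ratio of reported sampling latencies (V100-seconds).
speedup = point_e["latency_s"] / shap_e["latency_s"]

print(f"R-Precision gain: {gain_points} points; speedup: {speedup:.2f}x")
```

This works out to a gain of 4.2 percentage points and a speedup of a little under 2x.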

Release

Shap-E was published on arXiv on May 3, 2023, and OpenAI released the official implementation in the openai/shap-e GitHub repository, including inference code and pre-trained text-conditional and image-conditional model weights under the MIT License. ^[1]^[2] Press coverage of the open-source release followed within days, including a Tom's Hardware article on May 8, 2023. ^[3]

The repository ships three example Jupyter notebooks: ^[2]

Notebook	Purpose
`sample_text_to_3d.ipynb`	Sample a 3D model conditioned on a text prompt
`sample_image_to_3d.ipynb`	Sample a 3D model conditioned on a synthetic-view image (background removal recommended)
`encode_model.ipynb`	Load a 3D model, build multiview renders and a point cloud, encode to a latent, and render it back

The encoding notebook depends on Blender 3.3.1 or higher for generating the multiview renders. ^[2]

Significance

Shap-E demonstrated that diffusion models could be trained to generate the weights of neural implicit representations rather than explicit geometry, and that doing so could be both faster to train and competitive in quality with an explicit point-cloud baseline. ^[1] Because each output can be rendered either as a textured mesh or as a NeRF, it offered a more flexible result format than Point-E for downstream use in graphics and content pipelines. ^[3] As a freely available, locally runnable model with open weights, Shap-E became a widely used reference point for subsequent text-to-3D and image-to-3D research and a common baseline against which later generative 3D systems were compared. ^[2]^[3]

References

1. Jun, Heewoo and Nichol, Alex. "Shap-E: Generating Conditional 3D Implicit Functions." arXiv:2305.02463, May 3, 2023. https://arxiv.org/abs/2305.02463
2. OpenAI. "openai/shap-e: Generate 3D objects conditioned on text or images." GitHub repository. https://github.com/openai/shap-e
3. Freeman, Andrew. "OpenAI's Shap-E Model Makes 3D Objects From Text or Images." Tom's Hardware, May 8, 2023. https://www.tomshardware.com/news/openai-shap-e-creates-3d-models
