Shap-E
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,296 words
Shap-E is a conditional generative model for 3D assets developed by OpenAI, introduced in the paper "Shap-E: Generating Conditional 3D Implicit Functions" by Heewoo Jun and Alex Nichol, submitted to arXiv on May 3, 2023. [1] Unlike earlier 3D generative systems that produce a single fixed output representation, Shap-E directly generates the parameters of implicit functions, neural networks whose weights encode a 3D object. Because the generated function defines the object continuously in space, the same output can be rendered both as a textured mesh and as a neural radiance field (NeRF). [1] OpenAI released the inference code and model weights publicly on GitHub under the MIT License, positioning Shap-E as the successor to its earlier point-cloud system Point-E. [2]
Most prior 3D generative models, including Point-E, commit to one representation (for example, point clouds, voxels, or a single implicit field). Shap-E instead targets a higher-dimensional, multi-representation output space: rather than emitting geometry directly, it emits the weights of a small multilayer perceptron (MLP) that, when queried at any 3D coordinate, returns the properties needed to reconstruct the asset. [1] This design lets a single sampled output be rendered through two different paths, one based on volumetric rendering (NeRF) and one based on extracting an explicit surface mesh.
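The idea of an object stored as network weights can be illustrated with a toy coordinate MLP. The sketch below is not Shap-E's architecture: the layer sizes, the random weights, and the choice of a density-plus-colour output head are all assumptions made for illustration. It only shows that a fixed set of weights can be queried at any 3D coordinate.

```python
import math
import random


def make_mlp(sizes, seed=0):
    """Build random weights for a toy MLP; sizes such as [3, 16, 16, 4] (hypothetical)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def query(layers, point):
    """Evaluate the MLP at one (x, y, z) coordinate and return illustrative properties."""
    h = list(point)
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers only
    density = math.log1p(math.exp(h[0]))  # softplus keeps density non-negative
    rgb = [1.0 / (1.0 + math.exp(-v)) for v in h[1:4]]  # sigmoid maps colour into (0, 1)
    return {"density": density, "rgb": rgb}


mlp = make_mlp([3, 16, 16, 4])
sample = query(mlp, (0.1, -0.2, 0.3))
```

In this picture, generating a new object means generating a new set of `layers`; the query code stays the same.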
The authors report that, when trained on a large dataset of paired 3D assets and text, Shap-E can generate complex and diverse 3D objects in a matter of seconds from a text prompt or a single image. [1][3] The repository's stated purpose is to "generate 3D objects conditioned on text or images." [2]
The name follows OpenAI's earlier 2D and 3D generators: it echoes GLIDE and Point-E, with the "E" lineage continuing the family that began with text-to-image work such as DALL-E 2. [2]
Shap-E is trained in two stages: first an encoder that turns 3D assets into implicit-function parameters, then a conditional diffusion model trained over the encoder's outputs.
The first stage trains a transformer-based encoder that deterministically maps an existing 3D asset into the parameters of an implicit function. For each asset the encoder ingests two complementary views of the geometry: a point cloud of 16,384 RGB points and 20 multiview renderings at 256×256 resolution. [1] These inputs pass through point convolutions and cross-attention into a transformer backbone, which produces a latent sequence of 1,024 tokens. The latents are passed through a tanh activation (bounding them to the range [-1, 1]) and then projected into the weights of the implicit MLP, described as four weight matrices of size 256×256. [1]
The implicit MLP itself can be queried in two ways, as a NeRF rendered volumetrically and as an explicit surface mesh, and the encoder is trained so that both rendering modes agree with the original asset.
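The bookkeeping behind the last step can be sketched as follows: 1,024 latent tokens, each mapped to one weight row, fill four matrices of 256 rows each. The projection used here (one shared random linear map, `proj_rows`) and the toy sizes are assumptions for illustration; the source does not specify the projection in this detail.

```python
import math
import random


def bound_latent(token):
    """Apply tanh so that every entry lies in [-1, 1]."""
    return [math.tanh(v) for v in token]


def latents_to_weight_matrices(latents, proj_rows, num_matrices=4):
    """Bound each token, project it to one weight row, then group rows into matrices."""
    rows = []
    for token in latents:
        bounded = bound_latent(token)
        # Hypothetical linear projection: one output value per row of proj_rows.
        rows.append([sum(p * b for p, b in zip(proj, bounded)) for proj in proj_rows])
    rows_per_matrix = len(rows) // num_matrices
    return [rows[i * rows_per_matrix:(i + 1) * rows_per_matrix] for i in range(num_matrices)]


# Toy scale: 16 tokens become four 4x4 matrices.
# At the scale described above, 1,024 tokens would become four 256x256 matrices.
rng = random.Random(0)
tokens, latent_width, row_width = 16, 8, 4
latents = [[rng.gauss(0.0, 3.0) for _ in range(latent_width)] for _ in range(tokens)]
proj_rows = [[rng.gauss(0.0, 1.0) for _ in range(latent_width)] for _ in range(row_width)]
matrices = latents_to_weight_matrices(latents, proj_rows)
```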
Supporting both decoders from a single set of parameters is what allows Shap-E outputs to be exported either as a NeRF or as a polygonal mesh.
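The NeRF path can be illustrated with the standard volume-rendering quadrature, in which densities and colours sampled along a camera ray are alpha-composited into one pixel. This is a generic sketch of that technique, not Shap-E's renderer; the function name and the sample values are hypothetical, and the mesh-extraction path is not shown.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered from near to far.

    densities: non-negative density at each sample
    colors:    (r, g, b) at each sample
    deltas:    distance covered by each sample along the ray
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, weights


red, green = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
pixel, weights = composite_ray([0.5, 2.0], [red, green], [0.1, 0.1])
```

A ray through empty space (all densities zero) returns a black pixel, and a ray that hits a very dense first sample returns that sample's colour.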
The second stage trains a conditional diffusion model on the latent parameters produced by the frozen encoder. Shap-E reuses the transformer-based diffusion architecture from Point-E, but instead of denoising a sequence of points it denoises the 1,024-token latent sequence, treating each row of the MLP weight matrices as a token. [1] The model is trained with x0-prediction (predicting the clean sample directly) and sampled at inference time with a Heun sampler. [1]
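How x0-prediction and a Heun sampler fit together can be shown on a one-dimensional toy problem. The sketch below assumes the common formulation in which the sampler integrates dx/dσ = (x − D(x, σ)) / σ, where D is the model's prediction of the clean sample; the closed-form Gaussian denoiser, the geometric noise schedule, and all numeric values are assumptions standing in for the trained transformer and its actual settings.

```python
import math


def ideal_denoiser(x, sigma, mean=2.0, std=1.0):
    """x0-prediction for 1D data drawn from N(mean, std^2): the posterior mean of the clean sample."""
    return mean + (std ** 2) / (std ** 2 + sigma ** 2) * (x - mean)


def heun_sample(x, sigmas, denoiser):
    """Integrate dx/dsigma = (x - denoiser(x, sigma)) / sigma with Heun's second-order method."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, sigma)) / sigma
        x_euler = x + (sigma_next - sigma) * d
        if sigma_next == 0.0:
            x = x_euler  # the slope is undefined at sigma = 0, so finish with an Euler step
        else:
            d_next = (x_euler - denoiser(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)  # average the two slopes
    return x


def geometric_sigmas(sigma_max, sigma_min, steps):
    """Hypothetical noise schedule: geometric from sigma_max to sigma_min, then 0."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (steps - 1))
    return [sigma_max * ratio ** i for i in range(steps)] + [0.0]


sigmas = geometric_sigmas(sigma_max=80.0, sigma_min=0.01, steps=64)
x_start = 2.0 + 80.0  # hypothetical noisy starting value at sigma_max
x_final = heun_sample(x_start, sigmas, ideal_denoiser)
```

In Shap-E the role of `x` is played by the full latent sequence and `denoiser` by the conditional transformer, but the update rule has the same shape.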
Conditioning is supplied through CLIP embeddings, and the model uses classifier-free guidance: the conditioning signal is dropped on 10% of training examples so that, at inference, conditional and unconditional predictions can be combined to strengthen prompt adherence. [1] For the text-conditional model, a single token carrying the CLIP text embedding is prepended to the sequence; for the image-conditional model, a 256-token CLIP image-embedding sequence is prepended. [1] The text-conditional models were trained on a corpus of several million 3D assets, augmented with roughly one million additional assets and about 120,000 captions from human labelers for higher-quality subsets. [1]
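The two halves of classifier-free guidance, dropping the condition during training and combining two predictions at inference, can be sketched as below. The combination rule is the standard one (extrapolating from the unconditional prediction toward the conditional one); the source does not state the formula or the guidance scale, so both the rule and the values used here are assumptions, as are the function names.

```python
import random


def drop_condition(condition, rng, drop_prob=0.1):
    """Training time: discard the conditioning signal with probability drop_prob."""
    return None if rng.random() < drop_prob else condition


def guided_prediction(cond_pred, uncond_pred, guidance_scale):
    """Inference time: move from the unconditional prediction toward the conditional one.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the conditional
    prediction, and values above 1 push further in the direction of the condition.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


rng = random.Random(0)
trials = 10_000
dropped = sum(drop_condition("a chair", rng) is None for _ in range(trials))
drop_rate = dropped / trials  # close to 0.1 by construction

guided = guided_prediction([1.0, 2.0], [0.0, 1.0], guidance_scale=3.0)
```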
Shap-E is the direct successor to Point-E. The two share much of their diffusion infrastructure, but they differ in what they generate: Point-E is an explicit generative model that outputs 3D point clouds (and then fits a mesh as a separate step), whereas Shap-E generates implicit-function parameters that natively support both mesh and NeRF rendering. [1][2]
A central result of the paper is that, despite modeling a higher-dimensional, multi-representation output space, Shap-E converges faster during training and reaches comparable or better sample quality than the similarly sized Point-E. [1] At the 300M-parameter scale, the text-conditional Shap-E reported higher CLIP R-Precision than Point-E while also sampling more quickly. For image-conditional generation, the two models reached roughly the same final evaluation performance, with Shap-E holding a slight advantage on CLIP R-Precision and a slight disadvantage on CLIP score. [1]
| Property | Point-E | Shap-E |
|---|---|---|
| Output representation | Explicit 3D point cloud | Parameters of an implicit function |
| Rendering modes | Point cloud, then separate mesh-fitting step | Textured mesh and NeRF from one output |
| Diffusion target | Sequence of points | 1,024-token latent (MLP weights) |
| Text CLIP R-Precision (300M, ViT-B/32) | 33.6% | 37.8% |
| Reported sampling latency (text, 300M) | ~25 V100-seconds | ~13 V100-seconds |
Latency and CLIP R-Precision figures are as reported in the Shap-E paper. [1]
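For readers unfamiliar with the metric, CLIP R-Precision is commonly computed as a retrieval accuracy: a rendering counts as a hit when its own prompt scores highest among a set of candidate prompts. The sketch below uses hand-written three-dimensional vectors in place of real CLIP embeddings and assumes the top-1 variant; it is not the paper's evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def r_precision(image_embeddings, text_embeddings):
    """Fraction of images whose own caption (same index) is the top-ranked caption."""
    hits = 0
    for i, image in enumerate(image_embeddings):
        scores = [cosine(image, text) for text in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == i:
            hits += 1
    return hits / len(image_embeddings)


# Toy embeddings (hypothetical): the third rendering is closest to the first caption.
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
images = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.1, 0.3]]
score = r_precision(images, texts)
```

Here two of the three renderings retrieve their own caption, so the toy score is two thirds.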
Shap-E was published on arXiv on May 3, 2023, and OpenAI released the official implementation in the openai/shap-e GitHub repository, including inference code and pre-trained text-conditional and image-conditional model weights under the MIT License. [1][2] Press coverage of the open-source release appeared in early-to-mid May 2023. [3]
The repository ships three example Jupyter notebooks: [2]
| Notebook | Purpose |
|---|---|
| sample_text_to_3d.ipynb | Sample a 3D model conditioned on a text prompt |
| sample_image_to_3d.ipynb | Sample a 3D model conditioned on a synthetic-view image (background removal recommended) |
| encode_model.ipynb | Load a 3D model, build multiview renders and a point cloud, encode to a latent, and render it back |
The encoding notebook depends on Blender 3.3.1 or higher for generating the multiview renders. [2]
Shap-E demonstrated that diffusion models could be trained to generate the weights of neural implicit representations rather than explicit geometry, and that doing so could be both faster to train and competitive in quality with an explicit point-cloud baseline. [1] Because each output can be rendered as either a mesh or a NeRF, it offered a more flexible result format than Point-E for downstream use in graphics and content pipelines. [3] As a freely available, locally runnable model with open weights, Shap-E became a widely used reference point for subsequent text-to-3D and image-to-3D research and a common baseline against which later generative 3D systems were compared. [2][3]