Neural Radiance Fields (NeRF) is a method for synthesizing novel views of complex 3D scenes by representing them as continuous volumetric functions learned by a neural network. Introduced by Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng in their 2020 paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," the technique received the Best Paper Honorable Mention at the European Conference on Computer Vision (ECCV) 2020. NeRF has since become one of the most influential papers in computer vision, amassing thousands of citations and inspiring a large family of follow-up methods for 3D reconstruction, novel view synthesis, and related tasks.
At its core, NeRF takes a set of 2D images of a scene captured from known camera positions and learns a compact, continuous representation of the scene's geometry and appearance. From this representation, photorealistic images can be rendered from arbitrary new viewpoints that were never directly observed. The approach represented a significant leap in quality over prior methods for view synthesis, producing images with fine detail, realistic lighting, and accurate reflections.
The central insight of NeRF is to represent a static 3D scene as a continuous function that maps a 5D input coordinate to a color and volume density value. The five dimensions consist of a 3D spatial location (x, y, z) and a 2D viewing direction (theta, phi). The spatial coordinates determine how dense (opaque) the scene is at a given point, while the viewing direction allows the model to capture view-dependent effects such as specular highlights and reflections.
Formally, the scene is modeled as a function:
F: (x, y, z, theta, phi) -> (r, g, b, sigma)
where (r, g, b) represents the emitted color at that location when viewed from the given direction, and sigma represents the volume density (a measure of how much light is absorbed or scattered at that point). This function is approximated by a multilayer perceptron (MLP) trained using only 2D images and their corresponding camera poses.
The volume density sigma is predicted as a function of the spatial location alone, which ensures multiview consistency (the geometry of the scene does not change depending on the viewing angle). The color, however, depends on both position and viewing direction, allowing the network to model non-Lambertian surface effects.
The neural network architecture used in the original NeRF paper consists of a relatively simple MLP. The network processes the encoded 3D spatial coordinates through 8 fully connected layers, each with 256 hidden units and ReLU activation functions. A skip connection concatenates the encoded input to the fifth layer's activation; because the input is concatenated rather than added, this pattern follows coordinate-based networks such as DeepSDF rather than the additive residuals of ResNet.
After these 8 layers, the network outputs the volume density sigma and a 256-dimensional feature vector. This feature vector is concatenated with the encoded viewing direction and passed through one additional fully connected layer with 128 hidden units; a final output layer with a sigmoid activation then produces the view-dependent RGB color.
This design enforces an important inductive bias: volume density is a function of position only, while color depends on both position and viewing direction. This separation ensures that the geometry remains consistent across different viewpoints, while allowing appearance to change based on the observer's angle (capturing effects like specularity and reflections).
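As an illustration, the architecture's data flow can be sketched in NumPy with randomly initialized, untrained weights. The layer sizes follow the paper's description; the 63- and 27-dimensional inputs correspond to the positionally encoded coordinates and directions, and every numeric value here is illustrative rather than taken from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

D_POS, D_DIR, W = 63, 27, 256  # encoded position/direction dims, layer width

def init(d_in, d_out):
    # Random, untrained weights -- purely to illustrate the data flow.
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

layers = [init(D_POS, W)]
for i in range(1, 8):
    d_in = W + D_POS if i == 4 else W   # skip connection feeds the fifth layer
    layers.append(init(d_in, W))
w_sigma, b_sigma = init(W, 1)           # density head: position-only branch
w_feat, b_feat = init(W, W)             # 256-d feature vector
w_dir, b_dir = init(W + D_DIR, 128)     # direction-conditioned branch
w_rgb, b_rgb = init(128, 3)             # final RGB output

def nerf_mlp(x_enc, d_enc):
    h = x_enc
    for i, (w, b) in enumerate(layers):
        if i == 4:                      # concatenate the encoded input (skip)
            h = np.concatenate([h, x_enc], axis=-1)
        h = relu(h @ w + b)
    sigma = relu(h @ w_sigma + b_sigma)                 # sigma >= 0
    feat = h @ w_feat + b_feat
    h2 = relu(np.concatenate([feat, d_enc], axis=-1) @ w_dir + b_dir)
    rgb = 1.0 / (1.0 + np.exp(-(h2 @ w_rgb + b_rgb)))  # sigmoid -> [0, 1]
    return rgb, sigma
```

Because sigma is computed before the viewing direction ever enters the network, the predicted geometry is identical for all viewing angles, while the RGB output is free to vary with direction.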
One of the key technical contributions of the original NeRF paper is the use of positional encoding to map low-dimensional input coordinates into a higher-dimensional space before passing them into the MLP. Without this encoding, the network struggles to represent high-frequency variations in color and geometry, producing overly smooth results.
The positional encoding function maps each scalar input value p to a vector using sinusoidal functions at exponentially increasing frequencies:
gamma(p) = (sin(2^0 * pi * p), cos(2^0 * pi * p), sin(2^1 * pi * p), cos(2^1 * pi * p), ..., sin(2^(L-1) * pi * p), cos(2^(L-1) * pi * p))
For the 3D spatial coordinates, L = 10 frequency bands are used, resulting in a 60-dimensional encoding (plus the 3 original coordinates). For the 2D viewing direction, L = 4 frequency bands are used, producing a 24-dimensional encoding. This approach is related to the concept of Fourier features and was theoretically justified by Tancik et al. in their concurrent work "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains" (NeurIPS 2020), which showed that such encodings allow MLPs to overcome the spectral bias toward low-frequency functions.
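A minimal NumPy implementation of gamma makes the dimension counts concrete. This sketch encodes only the sinusoidal terms (the identity pass-through of the raw coordinates is omitted, matching the formula above):

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """gamma(p): sin/cos features at frequencies 2^0 * pi ... 2^(L-1) * pi.

    p: array of shape (..., d); returns shape (..., 2 * d * num_freqs).
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # 2^k * pi for k < L
    angles = p[..., None] * freqs                   # shape (..., d, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

xyz = np.array([[0.1, -0.3, 0.7]])
gamma_x = positional_encoding(xyz, num_freqs=10)    # 3 * 2 * 10 = 60 dims
dirs = np.array([[0.0, 0.0, 1.0]])                  # direction as a unit vector
gamma_d = positional_encoding(dirs, num_freqs=4)    # 3 * 2 * 4  = 24 dims
```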
To render an image from the learned NeRF representation, the method employs classical volume rendering. For each pixel in the desired output image, a camera ray r(t) = o + td is cast from the camera origin o through the pixel into the scene, where d is the ray direction and t parameterizes points along the ray.
The expected color C(r) of the ray is computed by integrating the colors and densities along the ray between a near bound t_n and a far bound t_f:
C(r) = integral from t_n to t_f of T(t) * sigma(r(t)) * c(r(t), d) dt
where T(t) is the accumulated transmittance along the ray from t_n to t, representing the probability that the ray travels from t_n to t without hitting any other particle:
T(t) = exp(-integral from t_n to t of sigma(r(s)) ds)
In practice, this continuous integral is approximated numerically using quadrature. The ray is partitioned into N evenly spaced bins, and a single point is sampled uniformly at random from within each bin (stratified sampling). The discrete approximation becomes:
C_hat(r) = sum over i of T_i * (1 - exp(-sigma_i * delta_i)) * c_i
where delta_i = t_(i+1) - t_i is the distance between adjacent samples, T_i = exp(-sum over j < i of sigma_j * delta_j) is the discrete transmittance accumulated before the i-th sample, and alpha_i = 1 - exp(-sigma_i * delta_i) represents the opacity of the i-th sample. This formula reduces to traditional alpha compositing and, importantly, is fully differentiable with respect to the network parameters. This differentiability allows the entire rendering pipeline to be trained end-to-end using gradient descent.
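The quadrature rule can be sketched in a few lines of NumPy. The example ray (a single dense red sample surrounded by empty space) is purely illustrative:

```python
import numpy as np

def composite(sigmas, colors, deltas, background=None):
    """Discrete volume rendering: C_hat = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    if background is not None:
        # Composite any remaining transmittance onto a background color.
        color = color + (1.0 - weights.sum()) * background
    return color, weights

# A ray with one dense red "surface" in the middle of empty space:
sigmas = np.array([0.0, 0.0, 50.0, 0.0])
colors = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
deltas = np.full(4, 0.25)
c, w = composite(sigmas, colors, deltas)   # c is close to pure red
```

The per-sample weights T_i * alpha_i always sum to at most 1, which is why an explicit background term can absorb the remainder for rays that exit the scene.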
Rendering high-quality images by densely sampling every point along each ray is computationally expensive. Many sampled points fall in empty space or in occluded regions that do not contribute to the final rendered color. To address this inefficiency, NeRF employs a hierarchical sampling strategy with two networks: a "coarse" network and a "fine" network.
The coarse network is first evaluated at a set of N_c = 64 stratified sample points along each ray. The output density values from the coarse network are used to construct a piecewise-constant probability density function (PDF) along the ray. Regions with higher predicted density receive higher probability, indicating that they are more likely to contain visible surfaces.
A second set of N_f = 128 samples is then drawn from this distribution using inverse transform sampling. These additional samples, combined with the original 64 coarse samples (for a total of 192 points per ray), are passed through the fine network to produce the final high-quality color estimate. This approach concentrates computational resources on the parts of the scene that matter most for the final rendered output.
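The inverse-transform step can be sketched as follows; the helper name sample_pdf, the Gaussian-shaped weights peaked near t = 4, and all sizes are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, rng):
    """Draw samples from a piecewise-constant PDF via inverse transform sampling.

    bin_edges: (N+1,) bin boundaries along the ray; weights: (N,) coarse weights.
    """
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    # Linearly invert the CDF inside each selected bin.
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    frac = (u - cdf[idx]) / denom
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 65)                 # 64 coarse bins in [t_n, t_f]
w = np.exp(-0.5 * ((edges[:-1] - 4.0) / 0.2) ** 2)  # coarse weights peak at t = 4
fine_t = sample_pdf(edges, w, 128, rng)           # fine samples cluster near 4
```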
NeRF is trained in a per-scene optimization framework. Given a dataset of posed 2D images (photographs with known camera intrinsics and extrinsics), the network is optimized to minimize the difference between its rendered outputs and the ground-truth pixel colors.
The camera poses are typically obtained through Structure-from-Motion (SfM) preprocessing using tools such as COLMAP. COLMAP performs feature extraction (usually using SIFT features), feature matching across image pairs, and bundle adjustment to estimate camera parameters and produce a sparse 3D point cloud.
The training loss is the total squared error (L2 loss) between the rendered and true pixel colors, summed over both the coarse and fine networks:
L = sum over rays r of ( ||C_hat_coarse(r) - C(r)||^2 + ||C_hat_fine(r) - C(r)||^2 )
Training is performed using the Adam optimizer. In the original paper, training a single scene required approximately 100,000 to 300,000 iterations, taking roughly 1 to 2 days on a single NVIDIA V100 GPU. Rendering a single 800 x 800 image from the trained model required about 30 seconds, as each pixel requires multiple forward passes through the MLP.
NeRF's quality is typically evaluated using three standard image quality metrics:
- PSNR (Peak Signal-to-Noise Ratio), a pixel-wise fidelity measure (higher is better)
- SSIM (Structural Similarity Index), which compares local image structure (higher is better)
- LPIPS (Learned Perceptual Image Patch Similarity), which measures perceptual distance using deep network features (lower is better)
On the NeRF-Synthetic (Blender) dataset, the original NeRF achieved an average PSNR of approximately 31 dB across the 8 test scenes, significantly outperforming prior methods such as Neural Volumes, Scene Representation Networks (SRN), and Local Light Field Fusion (LLFF) at the time of publication.
Despite its impressive results, the original NeRF method has several notable limitations:
- Slow training: each scene requires roughly 1 to 2 days of optimization on a high-end GPU.
- Slow rendering: about 30 seconds per 800 x 800 frame, far from real time.
- Per-scene optimization: the model must be retrained from scratch for every new scene and does not generalize.
- Static scenes only: moving objects and changing lighting violate the model's assumptions.
- Bounded scenes: the original formulation struggles with unbounded, 360-degree environments.
- Dependence on accurate camera poses, typically obtained from an SfM pipeline such as COLMAP.
These limitations motivated a wave of follow-up research that collectively addressed nearly every shortcoming, resulting in methods that are orders of magnitude faster, handle dynamic and unbounded scenes, and in some cases generalize across scenes.
The NeRF paper spawned an extraordinarily active research area. Below are some of the most important variants.
Mip-NeRF, proposed by Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan, addresses the aliasing problem in the original NeRF. Instead of casting infinitesimal rays through each pixel, Mip-NeRF casts 3D conical frustums. For each sample along the cone, a multivariate Gaussian is fitted to the conical frustum, and an Integrated Positional Encoding (IPE) featurizes the entire region rather than a single point.
This approach allows the network to reason about the scale at which the scene is being observed, naturally handling multiscale content. Mip-NeRF reduced error rates by 17% on the standard Blender dataset and by 60% on a challenging multiscale variant, while being 7% faster than the original NeRF and using half the model size (by eliminating the need for separate coarse and fine networks).
Mip-NeRF 360, by Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman, extends Mip-NeRF to handle unbounded 360-degree scenes. It introduces three key innovations: a non-linear scene parameterization that maps unbounded coordinates into a bounded domain; an online distillation procedure that replaces the coarse-to-fine sampling with a more efficient proposal network; and a novel distortion-based regularizer that encourages the model to produce compact, well-defined geometry.
Mip-NeRF 360 reduced mean-squared error by 57% compared to Mip-NeRF on challenging outdoor scenes and became the standard benchmark method for unbounded scene reconstruction. The accompanying mip-NeRF 360 dataset is now one of the most widely used benchmarks for evaluating novel view synthesis methods.
Instant Neural Graphics Primitives, proposed by Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller at NVIDIA, delivered the most significant speedup in NeRF training and rendering. Published in ACM Transactions on Graphics (SIGGRAPH 2022), the paper replaces the slow positional encoding and large MLP of the original NeRF with a multiresolution hash encoding backed by a much smaller neural network.
The multiresolution hash encoding works by arranging trainable feature vectors into L resolution levels, each containing a hash table with up to T entries. For a given 3D point, the method finds the surrounding voxels at each resolution level, hashes their vertices to look up trainable feature vectors, and interpolates between them. These features from all resolution levels are concatenated and fed into a compact MLP (typically just 2 layers with 64 hidden units).
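A simplified CPU sketch of the lookup-and-interpolate step is below. The table size, level count, and feature dimension are illustrative; the three hashing primes are the ones given in the Instant-NGP paper:

```python
import numpy as np

# Spatial-hash primes from the Instant-NGP paper.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_encode(x, tables, resolutions):
    """Trilinearly interpolate hashed feature vectors at each resolution level.

    x: (3,) point in [0, 1]^3; tables: list of (T, F) trainable feature arrays;
    resolutions: grid resolution per level. Returns concatenated (L * F,) features.
    """
    feats = []
    for table, res in zip(tables, resolutions):
        T = table.shape[0]
        pos = x * res
        lo = np.floor(pos).astype(np.uint64)
        frac = pos - lo
        out = np.zeros(table.shape[1])
        for corner in range(8):                      # 8 corners of the voxel
            offs = np.array([(corner >> k) & 1 for k in range(3)],
                            dtype=np.uint64)
            v = lo + offs
            h = int(np.bitwise_xor.reduce(v * PRIMES) % np.uint64(T))
            wgt = np.prod(np.where(offs == 1, frac, 1.0 - frac))
            out += wgt * table[h]                    # trilinear interpolation
        feats.append(out)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
L_LEVELS, F, T = 4, 2, 2 ** 14                       # illustrative sizes
tables = [rng.normal(0.0, 1e-4, (T, F)) for _ in range(L_LEVELS)]
resolutions = [16, 32, 64, 128]
features = hash_grid_encode(np.array([0.3, 0.6, 0.9]), tables, resolutions)
```

During training, gradients flow back through the interpolation weights into the table entries themselves, so the features are learned jointly with the small MLP that consumes them.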
The hash encoding is trivially parallelizable on modern GPUs because each table lookup is independent. This enables Instant-NGP to train NeRF scenes in as little as 5 seconds (compared to 1 to 2 days for the original) and render in real time. On the Blender synthetic dataset, Instant-NGP achieves quality comparable to or better than the original NeRF, with a roughly 1000x speedup in training time.
"NeRF in the Wild" (NeRF-W), by Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth, extends NeRF to work with uncontrolled photo collections, such as tourist photographs of landmarks scraped from the internet. The original NeRF assumes controlled capture conditions with consistent lighting and no transient objects (such as people walking through the scene).
NeRF-W addresses this by introducing per-image appearance embeddings that capture variations in lighting, exposure, and color processing across different photographs. It also adds a separate transient prediction head that models objects that appear in some images but not others (pedestrians, vehicles, etc.). These extensions allowed NeRF-W to reconstruct landmarks like the Trevi Fountain and the Brandenburg Gate from Flickr photo collections, producing temporally consistent, photorealistic novel views.
D-NeRF, by Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer, extends NeRF to dynamic scenes containing moving objects. The method uses two networks: a canonical network that represents the scene in a fixed reference configuration, and a deformation network that predicts how each 3D point moves from the canonical space to its position at a given time step t.
The deformation network takes a 3D position and a time value as input and outputs a displacement vector. This approach effectively factors the dynamic scene into a static canonical representation plus a learned deformation field, allowing the method to reconstruct non-rigidly deforming objects from a single monocular camera.
TensoRF, by Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su, models the radiance field as a 4D tensor (a 3D voxel grid with per-voxel multi-channel features) and applies tensor factorization to achieve compact and efficient representations. The paper introduces two decomposition approaches: classical CP decomposition, which factors the tensor into rank-one components; and a novel Vector-Matrix (VM) decomposition, which provides better expressiveness.
TensoRF with CP decomposition achieves better rendering quality than the original NeRF with model sizes under 4 MB. The VM decomposition variant further improves quality while maintaining fast reconstruction times (under 30 minutes for a single scene).
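The storage saving from CP factorization is easy to see in a toy NumPy example (all sizes illustrative): a rank-R grid is stored as three small factor matrices instead of a dense N^3 voxel grid:

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 64, 8   # grid resolution and CP rank (illustrative sizes)

# A rank-R density grid stored as three factor matrices:
# sigma[i, j, k] = sum_r vx[r, i] * vy[r, j] * vz[r, k]
vx, vy, vz = (rng.normal(0.0, 0.1, (R, N)) for _ in range(3))

def density(i, j, k):
    """Evaluate the factored grid at voxel (i, j, k) without materializing it."""
    return float(np.sum(vx[:, i] * vy[:, j] * vz[:, k]))

full_params = N ** 3        # 262,144 values for a dense grid
cp_params = 3 * R * N       # 1,536 values for the CP factors
```

The Vector-Matrix decomposition replaces some of these rank-one vector products with vector-matrix products, trading a little extra storage for substantially more expressive per-component features.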
Block-NeRF, by Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar, scales NeRF to city-level scenes. The method decomposes a large environment into individually trained NeRF blocks that can be independently updated and seamlessly combined at render time using learned appearance alignment.
The authors demonstrated the approach on 2.8 million images of San Francisco, constructing the largest neural scene representation at the time of publication. Block-NeRF was developed in collaboration with Waymo and showcased the potential of NeRF for autonomous driving simulation.
Zip-NeRF, by Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman, combines the anti-aliasing capabilities of Mip-NeRF 360 with the speed of grid-based methods like Instant-NGP. The key challenge is that Mip-NeRF 360 reasons about conical frustums along rays, while grid-based methods operate on point samples, making the two approaches fundamentally incompatible.
Zip-NeRF addresses this through techniques from signal processing and rendering, producing a method that achieves error rates 8% to 77% lower than either Mip-NeRF 360 or Instant-NGP alone, while training 24x faster than Mip-NeRF 360. The paper received a Best Paper Finalist award at ICCV 2023.
Nerfacto is the default real-data method included in the Nerfstudio framework. Rather than a single published paper, it is a practical combination of the best-performing techniques from multiple NeRF variants: hash encoding from Instant-NGP, scene contraction from Mip-NeRF 360, appearance embeddings from NeRF-W, proposal-based sampling, and camera pose refinement. Nerfacto achieves quality comparable to Mip-NeRF 360 with approximately an order of magnitude speedup, making it one of the most practical methods for real-world NeRF captures.
| Method | Authors | Venue | Year | Key Innovation | Training Speed |
|---|---|---|---|---|---|
| NeRF | Mildenhall et al. | ECCV | 2020 | Original continuous 5D radiance field with volume rendering | ~1-2 days |
| NeRF in the Wild | Martin-Brualla et al. | CVPR | 2021 | Appearance embeddings and transient modeling for uncontrolled photos | ~1-2 days |
| Mip-NeRF | Barron et al. | ICCV | 2021 | Conical frustums and integrated positional encoding for anti-aliasing | ~1 day |
| D-NeRF | Pumarola et al. | CVPR | 2021 | Canonical space plus deformation network for dynamic scenes | ~1-2 days |
| Mip-NeRF 360 | Barron et al. | CVPR | 2022 | Non-linear scene contraction and distillation for unbounded scenes | ~1 day |
| Instant-NGP | Müller et al. | SIGGRAPH | 2022 | Multiresolution hash encoding for 1000x training speedup | ~5-15 seconds |
| TensoRF | Chen et al. | ECCV | 2022 | Tensor factorization (CP and VM decomposition) of radiance fields | ~30 minutes |
| Block-NeRF | Tancik et al. | CVPR | 2022 | City-scale scene decomposition into independently trained blocks | Per-block |
| Zip-NeRF | Barron et al. | ICCV | 2023 | Combines Mip-NeRF 360 anti-aliasing with grid-based speed | ~1 hour |
| Nerfacto | Tancik et al. | SIGGRAPH | 2023 | Best-of-breed combination for practical real-world captures | ~15-30 minutes |
3D Gaussian Splatting (3DGS), introduced by Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis at SIGGRAPH 2023, represents an alternative paradigm for novel view synthesis that has gained rapid adoption. While NeRF uses an implicit, continuous volumetric representation learned by a neural network, 3DGS uses an explicit collection of 3D Gaussian primitives that are directly optimized and rasterized.
| Feature | NeRF (and variants) | 3D Gaussian Splatting |
|---|---|---|
| Scene Representation | Implicit (MLP or grid + MLP) | Explicit (3D Gaussian primitives) |
| Rendering Approach | Volume rendering via ray marching | Rasterization of Gaussian splats |
| Rendering Speed | Seconds per frame (original); real-time with Instant-NGP | Real-time (100+ FPS at 1080p) |
| Training Speed | Hours to days (original); seconds with Instant-NGP | Minutes (typically 15-30 min) |
| Memory Usage | Compact (MLP weights, or hash tables) | Higher (millions of Gaussians with attributes) |
| Visual Quality | High (especially Zip-NeRF, Mip-NeRF 360) | Comparable or better on many benchmarks |
| Dynamic Scenes | Requires specialized extensions (D-NeRF) | More naturally suited to dynamic updates |
| Editability | Difficult (implicit representation) | Easier (explicit primitives can be manipulated) |
| Mesh Export | Requires marching cubes or similar post-processing | Also non-trivial; surfaces must be fitted to the optimized Gaussians |
| Maturity | Large body of research (2020 onward) | Rapidly growing (2023 onward) |
3DGS achieves real-time rendering at over 100 frames per second on consumer hardware by projecting 3D Gaussians onto the image plane using efficient GPU rasterization, avoiding the expensive per-pixel ray marching required by NeRF. However, the explicit representation of 3DGS typically requires more memory than NeRF's compact neural network weights. The two approaches are complementary in many respects, and hybrid methods that combine ideas from both paradigms continue to emerge.
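The projection at the heart of splatting can be sketched in NumPy: a 3D Gaussian's covariance is mapped to a 2D image-plane covariance via the Jacobian of the perspective projection (the EWA-splatting approximation used by 3DGS; the covariance, camera-space position, and focal length below are illustrative):

```python
import numpy as np

def project_covariance(cov3d, mean_cam, focal):
    """Project a 3D Gaussian's covariance to a 2D image-plane covariance.

    Linearizes the perspective projection around the Gaussian's center:
    Sigma_2D = J * Sigma_3D * J^T, where J is the projection Jacobian.
    """
    x, y, z = mean_cam                  # Gaussian center in camera space
    J = np.array([[focal / z, 0.0, -focal * x / z ** 2],
                  [0.0, focal / z, -focal * y / z ** 2]])
    return J @ cov3d @ J.T

cov3d = np.diag([0.04, 0.01, 0.09])     # axis-aligned 3D Gaussian (illustrative)
cov2d = project_covariance(cov3d, mean_cam=np.array([0.5, -0.2, 4.0]),
                           focal=800.0)
```

Because each Gaussian projects independently, this step maps cleanly onto GPU rasterization hardware, which is what makes the 100+ FPS rendering figures achievable.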
Several standard datasets are used to evaluate NeRF methods. These benchmarks span synthetic and real-world scenes with varying complexity.
The NeRF-Synthetic dataset, introduced alongside the original NeRF paper, consists of 8 synthetic scenes rendered using Blender's Cycles path tracer: Chair, Drums, Ficus, Hotdog, Lego, Materials, Mic, and Ship. Each scene provides 100 training views, 100 validation views, and 200 test views at 800 x 800 resolution, with cameras placed on a hemisphere around each object on a white background. This dataset remains the most widely used benchmark for evaluating NeRF methods on synthetic data.
The LLFF dataset, introduced by Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar, contains 8 forward-facing real-world scenes captured with a cellphone. The camera motion is roughly planar (forward-facing), with relatively small variation in viewpoint. This dataset tests a method's ability to interpolate between nearby views of real scenes.
The Tanks and Temples benchmark, introduced by Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun, provides large-scale real-world scenes captured in realistic conditions with ground-truth 3D geometry from an industrial laser scanner. The dataset includes indoor and outdoor scenes of varying complexity and is divided into training, intermediate, and advanced subsets. It is commonly used to evaluate NeRF methods on challenging real-world geometry.
The mip-NeRF 360 dataset, introduced alongside the Mip-NeRF 360 paper, contains 9 unbounded real-world scenes (5 outdoor and 4 indoor) captured with inward-facing cameras that orbit around a central point of interest. The scenes include complex geometry, reflective surfaces, and fine detail, making this dataset one of the most challenging benchmarks for novel view synthesis. It has become a standard evaluation dataset for state-of-the-art methods.
Nerfstudio is a modular, open-source PyTorch-based framework for NeRF development, created by Matthew Tancik and collaborators at UC Berkeley's BAIR lab. The framework was presented at SIGGRAPH 2023 and published in ACM Transactions on Graphics.
Nerfstudio provides:
- Modular implementations of multiple NeRF variants (including Nerfacto) with interchangeable components
- A real-time web-based viewer for inspecting scenes during and after training
- Command-line tools for processing captured images and video, including COLMAP-based pose estimation
- Export utilities for point clouds, meshes, and rendered camera-path videos
Nerfstudio has become the standard open-source platform for NeRF research and practical applications, with an active community and frequent updates incorporating new methods.
NeRF and its variants have found applications across many domains.
The original and most direct application of NeRF is generating photorealistic images of a scene from viewpoints not present in the training data. This has applications in photography, filmmaking, and virtual tours. Products from companies like Luma AI allow users to capture NeRF scenes with a smartphone and share interactive 3D representations on the web.
While NeRF was originally designed for view synthesis rather than explicit 3D geometry extraction, the learned density field implicitly encodes scene geometry. Surfaces can be extracted using marching cubes or similar algorithms applied to the density field. Methods like NeuS and VolSDF specifically optimize NeRF-like representations to produce high-quality surface reconstructions with signed distance functions.
Virtual reality and augmented reality applications benefit from NeRF's ability to generate photorealistic views of real-world environments from arbitrary camera positions. Real-time NeRF variants like Instant-NGP and 3DGS have made interactive VR/AR experiences based on neural scene representations feasible. Users can walk through reconstructed environments with six degrees of freedom.
NeRF has significant applications in autonomous driving simulation and testing. Block-NeRF demonstrated city-scale reconstruction for driving simulation using Waymo data. Several specialized methods have been developed for this domain, including DriveEnv-NeRF for creating simulation environments under varying lighting conditions, and Lightning NeRF for efficient outdoor scene reconstruction. These tools allow autonomous driving systems to be tested in photorealistic simulated environments derived from real-world data.
In robotics, NeRF provides dense scene representations that can be used for navigation, manipulation planning, and collision avoidance. Robots can build NeRF models of their environment from onboard camera observations and use the resulting 3D representations for path planning and object interaction. The differentiable nature of NeRF also allows integration into end-to-end learned robotic systems.
NeRF-based techniques have been adapted for medical imaging applications, including synthesizing novel views from limited CT or MRI scan data. This can potentially reduce radiation exposure for patients by requiring fewer scans while still providing clinicians with the views they need for diagnosis.
NeRF enables photorealistic capture and relighting of real-world objects and scenes for use in games, films, and advertising. Methods like Ref-NeRF improve the handling of reflective surfaces, making captured objects more amenable to relighting in new environments. The ability to capture real objects and place them in virtual scenes bridges the gap between real and synthetic content.
NeRF has been applied to create detailed 3D models of urban environments, historical buildings, and cultural heritage sites. The photorealistic quality of NeRF reconstructions, combined with the relatively lightweight capture requirements (just photographs from multiple angles), makes it an attractive tool for digital preservation of physical spaces.
Successful NeRF reconstruction depends on several practical factors:
- Capture coverage: images should view the scene from many well-distributed angles with substantial overlap between neighboring views.
- Accurate camera poses: errors from the SfM step propagate directly into the reconstruction.
- A static scene: moving objects, changing lighting, and varying exposure violate the model's assumptions unless a variant such as NeRF-W is used.
- Sharp, well-exposed images: motion blur and sensor noise degrade both pose estimation and reconstruction quality.
- Surface properties: textureless, transparent, and highly reflective surfaces remain challenging.
| Year | Development |
|---|---|
| 2020 | NeRF (Mildenhall et al.) introduced at ECCV 2020 |
| 2020 | Fourier Features paper (Tancik et al.) provides theoretical grounding for positional encoding |
| 2021 | Mip-NeRF introduces anti-aliased rendering with conical frustums (ICCV 2021) |
| 2021 | NeRF in the Wild enables reconstruction from internet photo collections (CVPR 2021) |
| 2021 | D-NeRF extends to dynamic scenes (CVPR 2021) |
| 2021 | PlenOctrees and KiloNeRF achieve real-time NeRF rendering through caching and distillation |
| 2022 | Instant-NGP achieves 1000x training speedup with hash encoding (SIGGRAPH 2022) |
| 2022 | Mip-NeRF 360 handles unbounded 360-degree scenes (CVPR 2022) |
| 2022 | Block-NeRF demonstrates city-scale reconstruction (CVPR 2022) |
| 2022 | TensoRF introduces tensor factorization for efficient radiance fields (ECCV 2022) |
| 2023 | 3D Gaussian Splatting emerges as a competing paradigm (SIGGRAPH 2023) |
| 2023 | Zip-NeRF combines anti-aliasing and grid-based speed (ICCV 2023) |
| 2023 | Nerfstudio provides a standardized open-source framework (SIGGRAPH 2023) |
NeRF belongs to the broader family of neural scene representations, sometimes called neural fields or coordinate-based neural networks. Related approaches include: