# NeRF

> Source: https://aiwiki.ai/wiki/nerf
> Updated: 2026-06-21
> Categories: Computer Vision, Deep Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Neural Radiance Fields (NeRF)** is a method for synthesizing photorealistic novel views of a 3D scene by encoding the scene as a continuous 5D function (3D position plus 2D viewing direction) inside a single [multilayer perceptron](/wiki/perceptron) and rendering it with classic volume rendering. Introduced by Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng in the 2020 paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," the technique won a Best Paper Honorable Mention at the [European Conference on Computer Vision](/wiki/eccv) (ECCV) 2020 and became one of the most cited papers in modern [computer vision](/wiki/computer_vision), with the original arXiv preprint accumulating more than 9,000 citations.[1] The paper's opening claim is direct: it presents "a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views."[1]

NeRF takes a set of 2D images of a scene captured from known camera positions and learns a compact, continuous representation of the scene's geometry and appearance. From this representation, photorealistic images can be rendered from arbitrary new viewpoints that were never directly observed. The approach represented a significant leap in quality over prior methods for view synthesis, producing images with fine detail, realistic lighting, and accurate reflections, and it inspired a large family of follow-up methods for [3D reconstruction](/wiki/3d_reconstruction), [novel view synthesis](/wiki/novel_view_synthesis), and related tasks.[1]

## What is the core idea behind NeRF?

The central insight of NeRF is to represent a static 3D scene as a continuous function that maps a 5D input coordinate to a color and volume density value.[1] The five dimensions consist of a 3D spatial location (x, y, z) and a 2D viewing direction (theta, phi). The spatial coordinates determine how dense (opaque) the scene is at a given point, while the viewing direction allows the model to capture view-dependent effects such as specular highlights and reflections.

Formally, the scene is modeled as a function:

```
F: (x, y, z, theta, phi) -> (r, g, b, sigma)
```

where (r, g, b) represents the emitted color at that location when viewed from the given direction, and sigma represents the volume density (a measure of how much light is absorbed or scattered at that point). This function is approximated by a [multilayer perceptron](/wiki/perceptron) (MLP) trained using only 2D images and their corresponding camera poses.[1]

The volume density sigma is predicted as a function of the spatial location alone, which ensures multiview consistency (the geometry of the scene does not change depending on the viewing angle). The color, however, depends on both position and viewing direction, allowing the network to model non-Lambertian surface effects.[1]

## What does the NeRF MLP architecture look like?

The neural network architecture used in the original NeRF paper is a relatively simple [MLP](/wiki/perceptron). The network processes the positionally encoded 3D spatial coordinates through 8 fully connected layers, each with 256 hidden units and [ReLU](/wiki/relu) activation functions. Following the DeepSDF architecture, a skip connection concatenates the encoded input to the fifth layer's activation.[1]

After these 8 layers, an additional layer outputs the volume density sigma (rectified with a ReLU so that it is non-negative) and a 256-dimensional feature vector. This feature vector is concatenated with the encoded viewing direction and passed through one additional fully connected ReLU layer with 128 hidden units. A final layer with a sigmoid activation then outputs the view-dependent RGB color.[1]

This design enforces an important inductive bias: volume density is a function of position only, while color depends on both position and viewing direction. This separation ensures that the geometry remains consistent across different viewpoints, while allowing appearance to change based on the observer's angle (capturing effects like specularity and reflections).
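To make the layer layout concrete, here is a minimal NumPy sketch of one forward pass. It assumes 63-dimensional position inputs and 27-dimensional direction inputs (the sizes produced when the raw coordinates are appended to the positional encoding, as the public NeRF code does), uses random untrained weights, and the helper names `make_params` and `nerf_mlp_forward` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layer(rng, n_in, n_out):
    # Random (He-style) weights; a trained model would load learned values instead.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def make_params(rng, pos_dim=63, dir_dim=27, width=256):
    trunk, n_in = [], pos_dim
    for i in range(8):
        trunk.append(init_layer(rng, n_in, width))
        # The layer after the fifth one also receives the encoded input (skip connection).
        n_in = width + pos_dim if i == 4 else width
    return {
        "trunk": trunk,
        "sigma": init_layer(rng, width, 1),             # density head
        "feature": init_layer(rng, width, width),       # 256-d feature vector
        "view": init_layer(rng, width + dir_dim, 128),  # view-dependent 128-unit layer
        "rgb": init_layer(rng, 128, 3),                 # final color layer
    }

def nerf_mlp_forward(params, x_enc, d_enc):
    h = x_enc
    for i, (W, b) in enumerate(params["trunk"]):
        h = relu(h @ W + b)
        if i == 4:
            h = np.concatenate([x_enc, h], axis=-1)  # input joins the fifth layer's activation
    W, b = params["sigma"]
    sigma = relu(h @ W + b)[..., 0]                  # depends on position only
    W, b = params["feature"]
    feat = h @ W + b
    W, b = params["view"]
    h = relu(np.concatenate([feat, d_enc], axis=-1) @ W + b)
    W, b = params["rgb"]
    rgb = sigmoid(h @ W + b)                         # depends on position and direction
    return rgb, sigma

rng = np.random.default_rng(0)
params = make_params(rng)
rgb, sigma = nerf_mlp_forward(params, rng.normal(size=(5, 63)), rng.normal(size=(5, 27)))
print(rgb.shape, sigma.shape)  # (5, 3) (5,)
```

Because `d_enc` only enters after `sigma` has been computed, changing the viewing direction can alter the color but never the density, which is exactly the inductive bias described above.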

## Why does NeRF use positional encoding?

One of the key technical contributions of the original NeRF paper is the use of positional encoding to map low-dimensional input coordinates into a higher-dimensional space before passing them into the MLP. Without this encoding, the network struggles to represent high-frequency variations in color and geometry, producing overly smooth results.[1]

The [positional encoding](/wiki/positional_encoding) function maps each scalar input value p to a vector using sinusoidal functions at exponentially increasing frequencies:

```
gamma(p) = (sin(2^0 * pi * p), cos(2^0 * pi * p), sin(2^1 * pi * p), cos(2^1 * pi * p), ..., sin(2^(L-1) * pi * p), cos(2^(L-1) * pi * p))
```

For the 3D spatial coordinates, L = 10 frequency bands are used, giving a 60-dimensional encoding (3 coordinates x 2 functions x 10 frequencies). The viewing direction is expressed in practice as a 3D Cartesian unit vector, so L = 4 frequency bands produce a 24-dimensional encoding (3 x 2 x 4).[1] The official implementation also appends the raw input to each encoding, giving 63 and 27 input dimensions respectively. This approach is related to the concept of Fourier features and was theoretically justified by Tancik et al. in their concurrent work "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains" ([NeurIPS](/wiki/neurips) 2020), which showed that such encodings allow MLPs to overcome the spectral bias toward low-frequency functions.[12]
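The encoding is a few lines of array code. This NumPy sketch applies gamma to each coordinate independently and groups the output by frequency; the function name and the optional `include_input` flag (mimicking the raw-input passthrough of the public NeRF code) are assumptions for illustration.

```python
import numpy as np

def positional_encoding(p, num_freqs, include_input=False):
    """Apply gamma(p) to every coordinate on the last axis of p.

    Output is grouped by frequency: [sin(2^0 pi p), cos(2^0 pi p), sin(2^1 pi p), ...],
    where each entry covers all D coordinates.
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi              # 2^k * pi for k = 0..L-1
    angles = p[..., None, :] * freqs[:, None]                  # shape (..., L, D)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-2)  # shape (..., L, 2, D)
    enc = enc.reshape(*p.shape[:-1], -1)                       # shape (..., 2 * L * D)
    if include_input:
        enc = np.concatenate([p, enc], axis=-1)                # optional raw-input passthrough
    return enc

x = np.array([[0.1, -0.4, 0.7]])   # one 3D position
d = np.array([[0.0, 0.0, 1.0]])    # one unit viewing direction
print(positional_encoding(x, 10).shape)                      # (1, 60)
print(positional_encoding(d, 4).shape)                       # (1, 24)
print(positional_encoding(x, 10, include_input=True).shape)  # (1, 63)
```

The highest band for positions, 2^9 * pi, lets the MLP resolve variations on the order of 1/1000 of the unit interval, which is what restores fine texture and sharp edges.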

## How does NeRF render an image with volume rendering?

To render an image from the learned NeRF representation, the method employs classical volume rendering. A defining property the authors highlight is that "volume rendering is naturally differentiable," which is what allows the scene function to be optimized from images alone.[1] For each pixel in the desired output image, a camera ray r(t) = o + td is cast from the camera origin o through the pixel into the scene, where d is the ray direction and t parameterizes points along the ray.[1]
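Casting one ray per pixel only needs the camera intrinsics and pose. This NumPy sketch assumes a pinhole camera with its principal point at the image center and a camera that looks down its local -z axis with y pointing up (the convention used by the NeRF-Synthetic poses and, to our understanding, the public NeRF code); `c2w` is a hypothetical camera-to-world matrix.

```python
import numpy as np

def get_rays(height, width, focal, c2w):
    """Return per-pixel ray origins o and directions d for r(t) = o + t d."""
    c2w = np.asarray(c2w, dtype=np.float64)
    # i: pixel column (x), j: pixel row (y); both have shape (H, W).
    i, j = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64), indexing="xy")
    dirs = np.stack([(i - 0.5 * width) / focal,
                     -(j - 0.5 * height) / focal,
                     -np.ones_like(i)], axis=-1)        # camera-space directions, (H, W, 3)
    rays_d = dirs @ c2w[:3, :3].T                       # rotate into world space
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)  # every ray starts at the camera center
    return rays_o, rays_d

o, d = get_rays(4, 6, focal=5.0, c2w=np.eye(4))
points = o + 2.0 * d   # the point r(t) at t = 2 on every ray, shape (4, 6, 3)
```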

The expected color C(r) of the ray is computed by integrating the colors and densities along the ray between a near bound t_n and a far bound t_f:

```
C(r) = integral from t_n to t_f of T(t) * sigma(r(t)) * c(r(t), d) dt
```

where T(t) is the accumulated transmittance along the ray from t_n to t, representing the probability that the ray travels from t_n to t without hitting any other particle:

```
T(t) = exp(-integral from t_n to t of sigma(r(s)) ds)
```

In practice, this continuous integral is approximated numerically using quadrature. The ray is partitioned into N evenly spaced bins, and a single point is sampled uniformly at random from within each bin (stratified sampling).[1] The discrete approximation becomes:

```
C_hat(r) = sum over i of T_i * (1 - exp(-sigma_i * delta_i)) * c_i
```

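where delta_i = t_(i+1) - t_i is the distance between adjacent samples, alpha_i = 1 - exp(-sigma_i * delta_i) is the opacity of the i-th sample, and T_i = exp(-sum over j < i of sigma_j * delta_j) is the discrete transmittance, the fraction of light that reaches sample i without being absorbed. Because T_i equals the product of (1 - alpha_j) over all j < i, this formula reduces to traditional [alpha compositing](/wiki/alpha_compositing) and, importantly, is fully differentiable with respect to the network parameters. This differentiability allows the entire rendering pipeline to be trained end-to-end using gradient descent.[1]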
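Stratified sampling and the quadrature rule fit in a short NumPy sketch. Here T_i = exp(-sum over j < i of sigma_j * delta_j), the weights w_i = T_i * alpha_i are returned alongside the color because hierarchical sampling reuses them, and the large final `far_delta` (treating the last sample as extending to infinity) is an assumption borrowed from common implementations rather than something the formula prescribes.

```python
import numpy as np

def stratified_samples(t_near, t_far, n_samples, rng):
    """One uniform sample inside each of N evenly spaced bins of [t_near, t_far]."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.uniform(size=n_samples)

def composite(t, sigma, rgb, far_delta=1e10):
    """Discrete volume rendering along one ray.

    t: (N,) sorted sample distances, sigma: (N,), rgb: (N, 3).
    Returns the pixel color and the per-sample weights w_i = T_i * alpha_i.
    """
    delta = np.append(np.diff(t), far_delta)   # delta_i = t_(i+1) - t_i
    alpha = 1.0 - np.exp(-sigma * delta)       # opacity of each segment
    # T_i = exp(-sum_{j<i} sigma_j delta_j), which equals prod_{j<i} (1 - alpha_j)
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

rng = np.random.default_rng(0)
t = stratified_samples(2.0, 6.0, 64, rng)
sigma = np.where(np.abs(t - 4.0) < 0.2, 50.0, 0.0)  # hypothetical thin opaque slab at t = 4
rgb = np.tile([1.0, 0.0, 0.0], (64, 1))             # every sample is red
color, w = composite(t, sigma, rgb)                  # color is close to pure red
```

Re-drawing the jittered positions at every training step means the network is queried at continuously varying depths, so it learns a continuous field rather than values at a fixed set of points.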

## Hierarchical Sampling

Rendering high-quality images by densely sampling every point along each ray is computationally expensive. Many sampled points fall in empty space or in occluded regions that do not contribute to the final rendered color. To address this inefficiency, NeRF employs a hierarchical sampling strategy with two networks: a "coarse" network and a "fine" network.[1]

The coarse network is first evaluated at a set of N_c = 64 stratified sample points along each ray. The compositing weights w_i = T_i * alpha_i from this coarse render, normalized to sum to 1, form a piecewise-constant probability density function (PDF) along the ray. Samples that contribute more to the rendered color receive higher probability, so visible surfaces attract new samples while empty space and regions occluded behind surfaces receive few.

A second set of N_f = 128 samples is then drawn from this distribution using inverse transform sampling. These additional samples, combined with the original 64 coarse samples (for a total of 192 points per ray), are passed through the fine network to produce the final high-quality color estimate.[1] This approach concentrates computational resources on the parts of the scene that matter most for the final rendered output.
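Inverse transform sampling from a piecewise-constant PDF can be sketched in NumPy as follows. The inputs are bin edges along the ray and the coarse compositing weights (T_i * alpha_i) for each bin; the small `eps` that keeps the PDF from being all zero, the use of bin midpoints as stand-in coarse samples, and the function name are assumptions of this sketch.

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, rng, eps=1e-5):
    """Draw n_samples from the piecewise-constant PDF defined by coarse weights.

    bin_edges: (K+1,) sorted edges along the ray; weights: (K,) non-negative.
    """
    w = np.asarray(weights, dtype=np.float64) + eps   # avoid an all-zero PDF
    pdf = w / w.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # (K+1,), runs from 0 to 1
    u = rng.uniform(size=n_samples)
    # Locate the bin whose CDF interval contains u, then invert linearly inside it.
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(pdf) - 1)
    frac = (u - cdf[idx]) / pdf[idx]
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 65)               # 64 coarse bins between t_n = 2 and t_f = 6
weights = np.zeros(64)
weights[30:34] = 1.0                            # hypothetical: coarse pass saw a surface near t = 4
t_fine = sample_pdf(edges, weights, 128, rng)   # N_f = 128 samples, mostly in [3.875, 4.125]
t_all = np.sort(np.concatenate([0.5 * (edges[:-1] + edges[1:]), t_fine]))  # 64 + 128 = 192
```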

## How is a NeRF trained, and how long does it take?

NeRF is trained in a per-scene optimization framework. Given a dataset of posed 2D images (photographs with known camera intrinsics and extrinsics), the network is optimized to minimize the difference between its rendered outputs and the ground-truth pixel colors.[1]

The camera poses are typically obtained through Structure-from-Motion (SfM) preprocessing using tools such as [COLMAP](/wiki/colmap). COLMAP performs feature extraction (usually using [SIFT](/wiki/sift) features), feature matching across image pairs, and bundle adjustment to estimate camera parameters and produce a sparse 3D point cloud.

The training loss is the total squared error (L2 loss) between the rendered and true pixel colors, summed over both the coarse and fine networks:

```
L = sum over rays r of ( ||C_hat_coarse(r) - C(r)||^2 + ||C_hat_fine(r) - C(r)||^2 )
```
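The loss above translates directly into code. This NumPy sketch follows the formula's plain sum over a batch of rays; many implementations average instead, which only rescales the gradient. The coarse term matters even though only the fine render is kept as output, because it trains the coarse network whose weights decide where the fine samples go. The function name is hypothetical.

```python
import numpy as np

def nerf_loss(c_coarse, c_fine, c_true):
    """Total squared error over a batch of rays, summed for coarse and fine renders.

    Each argument has shape (num_rays, 3).
    """
    return np.sum((c_coarse - c_true) ** 2) + np.sum((c_fine - c_true) ** 2)

target = np.zeros((2, 3))
guess = np.ones((2, 3))
loss = nerf_loss(guess, target, target)   # coarse render is wrong, fine render is perfect
```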

Training is performed using the [Adam optimizer](/wiki/adam_optimizer). In the original paper, training a single scene required approximately 100,000 to 300,000 iterations, taking roughly 1 to 2 days on a single NVIDIA V100 [GPU](/wiki/gpu_computing). Rendering a single 800 x 800 image from the trained model required about 30 seconds, as each pixel requires multiple forward passes through the MLP.[1] This combination of multi-day training and seconds-per-frame rendering is precisely the bottleneck that later methods such as Instant-NGP and 3D Gaussian Splatting were designed to remove.

## How is NeRF quality measured, and what scores did the original achieve?

NeRF's quality is typically evaluated using three standard image quality metrics:

- **PSNR (Peak Signal-to-Noise Ratio):** Measures pixel-level reconstruction accuracy in decibels. Higher is better.
- **SSIM (Structural Similarity Index):** Measures perceived structural similarity between images. Values typically fall between 0 and 1, with 1 indicating identical images.
- **LPIPS (Learned Perceptual Image Patch Similarity):** A learned perceptual metric based on deep features. Lower is better.
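PSNR is the only one of the three that reduces to a closed-form expression, PSNR = 10 * log10(MAX^2 / MSE). This NumPy sketch assumes images scaled to [0, 1] (so MAX = 1); under that scaling every 10x drop in MSE adds 10 dB, and 31 dB corresponds to a mean squared error of roughly 8e-4.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((np.asarray(pred, float) - np.asarray(target, float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), np.sqrt(1e-3))   # constant error giving an MSE of 1e-3
print(round(psnr(pred, target), 2))        # 30.0
```

SSIM and LPIPS compare local structure and deep-network features respectively, so they are normally computed with existing library implementations rather than re-derived.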

On the NeRF-Synthetic (Blender) dataset, the original NeRF achieved an average PSNR of approximately 31 dB across the 8 test scenes, significantly outperforming prior methods such as Neural Volumes, Scene Representation Networks (SRN), and Local Light Field Fusion (LLFF) at the time of publication.[1]

## What are the limitations of the original NeRF?

Despite its impressive results, the original NeRF method has several notable limitations:

- **Slow training:** Optimizing a single scene takes 1 to 2 days on a high-end GPU.
- **Slow rendering:** Generating images requires querying the MLP hundreds of times per pixel, resulting in rendering times of tens of seconds per frame.
- **Static scenes only:** The original formulation assumes the scene is completely static. Any moving objects or changing lighting conditions between training images cause artifacts.
- **Requires accurate camera poses:** NeRF depends on precise camera pose estimates, typically from COLMAP. Inaccurate poses lead to blurry or inconsistent reconstructions.
- **Bounded scenes only:** The original NeRF struggles with unbounded 360-degree scenes, because its positional encoding and sampling strategy are designed for forward-facing or object-centric captures.
- **Aliasing:** NeRF samples individual points along rays, which can produce aliasing artifacts when images are captured at varying distances from the scene.
- **Per-scene optimization:** Each scene requires training a separate network from scratch, with no generalization to new scenes.

These limitations motivated a wave of follow-up research that collectively addressed nearly every shortcoming, resulting in methods that are orders of magnitude faster, handle dynamic and unbounded scenes, and in some cases generalize across scenes.

## What are the most important NeRF variants and extensions?

The NeRF paper spawned an extraordinarily active research area. Below are some of the most important variants.

### Mip-NeRF (ICCV 2021)

Mip-NeRF, proposed by Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan, addresses the aliasing problem in the original NeRF. Instead of casting infinitesimal rays through each pixel, Mip-NeRF casts 3D conical frustums. For each sample along the cone, a multivariate Gaussian is fitted to the conical frustum, and an Integrated Positional Encoding (IPE) featurizes the entire region rather than a single point.[2]

This approach allows the network to reason about the scale at which the scene is being observed, naturally handling multiscale content. Mip-NeRF reduced error rates by 17% on the standard Blender dataset and by 60% on a challenging multiscale variant, while being 7% faster than the original NeRF and using half the model size (by eliminating the need for separate coarse and fine networks).[2]
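The core of IPE is the identity E[sin(aX)] = sin(a * mu) * exp(-a^2 * var / 2) for a Gaussian X with mean mu and variance var, and likewise for cosine. This NumPy sketch assumes a diagonal covariance and reuses the 2^k * pi frequencies of the gamma formula above; the frequency scaling, the function name, and the example inputs are assumptions, and computing the frustum's mean and variance is left out.

```python
import numpy as np

def integrated_pos_enc(mean, var, num_freqs):
    """Expected value of gamma(x) when x ~ N(mean, diag(var)).

    mean, var: (..., D). Returns (..., 2 * num_freqs * D), grouped by frequency.
    """
    mean = np.asarray(mean, dtype=np.float64)
    var = np.asarray(var, dtype=np.float64)
    scales = (2.0 ** np.arange(num_freqs)) * np.pi   # 2^k * pi
    m = mean[..., None, :] * scales[:, None]         # (..., L, D)
    v = var[..., None, :] * scales[:, None] ** 2     # variance grows with the squared frequency
    damp = np.exp(-0.5 * v)                          # wide Gaussians suppress high frequencies
    enc = np.stack([np.sin(m) * damp, np.cos(m) * damp], axis=-2)
    return enc.reshape(*mean.shape[:-1], -1)

mu = np.array([[0.3, -0.2, 0.5]])
sharp = integrated_pos_enc(mu, np.zeros_like(mu), 4)         # zero variance: plain encoding
blurry = integrated_pos_enc(mu, np.full_like(mu, 0.01), 4)   # larger footprint: damped encoding
```

The damping factor is what lets the network "see" scale: a distant or low-resolution pixel covers a wide frustum, so its high-frequency features shrink toward zero instead of aliasing.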

### Mip-NeRF 360 (CVPR 2022)

Mip-NeRF 360, by Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman, extends Mip-NeRF to handle unbounded 360-degree scenes. It introduces three key innovations: a non-linear scene parameterization that maps unbounded coordinates into a bounded domain; an online distillation procedure that replaces the coarse-to-fine sampling with a more efficient proposal network; and a novel distortion-based regularizer that encourages the model to produce compact, well-defined geometry.[3]

Mip-NeRF 360 reduced mean-squared error by 57% compared to Mip-NeRF on challenging outdoor scenes and became the standard benchmark method for unbounded scene reconstruction.[3] The accompanying mip-NeRF 360 dataset is now one of the most widely used benchmarks for evaluating [novel view synthesis](/wiki/novel_view_synthesis) methods.
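The non-linear parameterization can be illustrated with a contraction that leaves points inside the unit ball unchanged and squeezes everything else into a shell of radius between 1 and 2: contract(x) = x if ||x|| <= 1, otherwise (2 - 1/||x||) * (x / ||x||). This NumPy sketch applies it to points only (the paper also applies it to the Gaussians describing each frustum, which is omitted here); the function name is hypothetical.

```python
import numpy as np

def contract(x):
    """Map unbounded 3D points into a ball of radius 2."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    safe = np.maximum(norm, 1e-12)   # avoid division by zero at the origin
    return np.where(norm <= 1.0, x, (2.0 - 1.0 / safe) * (x / safe))

pts = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1e6, 0.0]])
radii = np.linalg.norm(contract(pts), axis=-1)   # 0.5, 1.5 and just under 2
```

Since 1/||x|| shrinks like disparity, distant content gets resolution in proportion to how much it changes on screen, which suits a camera orbiting a central object with far-away background.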

### Instant-NGP (SIGGRAPH 2022)

Instant Neural Graphics Primitives, proposed by Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller at [NVIDIA](/wiki/nvidia), represents the most significant speedup in NeRF training and rendering. Published in ACM Transactions on Graphics (SIGGRAPH 2022), the paper replaces the slow positional encoding and large MLP of the original NeRF with a multiresolution hash encoding backed by a much smaller neural network.[4]

The multiresolution hash encoding works by arranging trainable feature vectors into L resolution levels, each containing a hash table with up to T entries. For a given 3D point, the method finds the surrounding voxels at each resolution level, hashes their vertices to look up trainable feature vectors, and interpolates between them. These features from all resolution levels are concatenated and fed into a compact MLP (typically just 2 layers with 64 hidden units).[4]
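A CPU-side NumPy sketch of the lookup follows. It uses the spatial hash h(x) = (x_1 * 1 XOR x_2 * 2654435761 XOR x_3 * 805459861) mod T with 32-bit wraparound, a geometric progression of grid resolutions between `n_min` and `n_max`, and trilinear interpolation of the 8 voxel corners. For brevity it hashes at every level, whereas the paper indexes coarse levels directly when their grid fits in the table; the small configuration and all helper names are hypothetical.

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint32)

def level_resolutions(n_levels, n_min, n_max):
    """Grid resolution per level, growing geometrically from n_min to n_max."""
    b = np.exp((np.log(n_max) - np.log(n_min)) / (n_levels - 1))
    return np.floor(n_min * b ** np.arange(n_levels)).astype(np.int64)

def spatial_hash(corner, table_size):
    """XOR of integer coordinates times primes, modulo T (uint32 multiply wraps silently)."""
    c = np.asarray(corner).astype(np.uint32) * PRIMES
    return int((c[0] ^ c[1] ^ c[2]) % np.uint32(table_size))

def hash_encode(x, tables, resolutions):
    """Concatenate trilinearly interpolated features from every level.

    x: (3,) point in [0, 1]^3; tables: list of (T, F) arrays, one per level.
    """
    feats = []
    for table, res in zip(tables, resolutions):
        pos = x * res
        base = np.floor(pos).astype(np.int64)   # lower corner of the enclosing voxel
        frac = pos - base
        f = np.zeros(table.shape[1])
        for corner in np.ndindex(2, 2, 2):      # the 8 voxel vertices
            offset = np.array(corner)
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac))   # trilinear weight
            f += w * table[spatial_hash(base + offset, len(table))]
        feats.append(f)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
res = level_resolutions(n_levels=4, n_min=16, n_max=128)
tables = [rng.normal(scale=1e-4, size=(2**10, 2)) for _ in res]   # T = 1024, F = 2 per level
features = hash_encode(np.array([0.3, 0.6, 0.9]), tables, res)    # 4 levels x 2 = 8 values
```

Hash collisions are not resolved explicitly: gradients from the points that matter most dominate each shared entry, and the small MLP that consumes the concatenated features can disambiguate using the other levels.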

The hash encoding is trivially parallelizable on modern GPUs because each table lookup is independent. The authors report that this "combined speedup of several orders of magnitude" enables "training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds."[4] In practice this lets Instant-NGP train NeRF scenes in as little as a few seconds (compared to 1 to 2 days for the original), a roughly 1000x reduction in training time, and render in real time, with quality comparable to or better than the original NeRF on the Blender synthetic dataset.[4]

### NeRF in the Wild (CVPR 2021)

"NeRF in the Wild" (NeRF-W), by Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth, extends NeRF to work with uncontrolled photo collections, such as tourist photographs of landmarks scraped from the internet. The original NeRF assumes controlled capture conditions with consistent lighting and no transient objects (such as people walking through the scene).[5]

NeRF-W addresses this by introducing per-image appearance embeddings that capture variations in lighting, exposure, and color processing across different photographs. It also adds a separate transient prediction head that models objects that appear in some images but not others (pedestrians, vehicles, etc.). These extensions allowed NeRF-W to reconstruct landmarks like the Trevi Fountain and the Brandenburg Gate from Flickr photo collections, producing temporally consistent, photorealistic novel views.[5]

### D-NeRF (CVPR 2021)

D-NeRF, by Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer, extends NeRF to dynamic scenes containing moving objects. The method uses two networks: a canonical network that represents the scene in a fixed reference configuration, and a deformation network that predicts, for each 3D point observed at time t, the displacement that maps it back into the canonical space.[6]

The deformation network takes a 3D position and a time value as input and outputs a displacement vector. This approach effectively factors the dynamic scene into a static canonical representation plus a learned deformation field, allowing the method to reconstruct non-rigidly deforming objects from a single monocular camera.[6]

### TensoRF (ECCV 2022)

TensoRF, by Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su, models the radiance field as a 4D tensor (a 3D voxel grid with per-voxel multi-channel features) and applies tensor factorization to achieve compact and efficient representations. The paper introduces two decomposition approaches: classical CP decomposition, which factors the tensor into rank-one components; and a novel Vector-Matrix (VM) decomposition, which provides better expressiveness.[7]

TensoRF with CP decomposition achieves better rendering quality than the original NeRF with model sizes under 4 MB and reconstruction times under 30 minutes. The VM decomposition variant further improves quality and reconstructs even faster (under 10 minutes for a single scene), while keeping the model compact (under 75 MB).[7]
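The memory savings of CP decomposition come from storing R triples of 1D vectors instead of a dense 3D grid, with each voxel value recovered as Grid[i, j, k] = sum over r of v_r^X[i] * v_r^Y[j] * v_r^Z[k]. This NumPy sketch factorizes a single scalar density grid with integer voxel indices; real TensoRF also factorizes appearance features and interpolates along each vector, and the rank and sizes here are hypothetical.

```python
import numpy as np

def cp_density_grid(vx, vy, vz):
    """Rebuild a dense X x Y x Z grid from R rank-one components.

    vx: (R, X), vy: (R, Y), vz: (R, Z).
    """
    return np.einsum("ri,rj,rk->ijk", vx, vy, vz)

def cp_density_at(vx, vy, vz, i, j, k):
    """Query one voxel without ever materializing the dense grid."""
    return float(np.sum(vx[:, i] * vy[:, j] * vz[:, k]))

rng = np.random.default_rng(0)
vx, vy, vz = (rng.normal(size=(4, n)) for n in (5, 6, 7))   # rank R = 4
grid = cp_density_grid(vx, vy, vz)                           # dense (5, 6, 7) grid
value = cp_density_at(vx, vy, vz, 1, 2, 3)                   # equals grid[1, 2, 3]
```

At a hypothetical resolution of 300^3 with rank 16, the dense grid holds 27,000,000 values while the factors hold 16 x 900 = 14,400, which is why CP models fit in a few megabytes; VM decomposition trades some of that compactness for matrix factors that capture more structure per component.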

### Block-NeRF (CVPR 2022)

[Block-NeRF](/wiki/block_nerf), by Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar, scales NeRF to city-level scenes. The method decomposes a large environment into individually trained NeRF blocks that can be independently updated and seamlessly combined at render time using learned appearance alignment.[8]

The authors demonstrated the approach on 2.8 million images of San Francisco, constructing the largest neural scene representation at the time of publication. Block-NeRF was developed in collaboration with [Waymo](/wiki/waymo) and showcased the potential of NeRF for autonomous driving simulation.[8]

### Zip-NeRF (ICCV 2023)

Zip-NeRF, by Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman, combines the anti-aliasing capabilities of Mip-NeRF 360 with the speed of grid-based methods like Instant-NGP. The key challenge is that Mip-NeRF 360 reasons about conical frustums along rays, while grid-based methods operate on point samples, making the two approaches fundamentally incompatible.[9]

Zip-NeRF addresses this through techniques from signal processing and rendering, producing a method that achieves error rates 8% to 77% lower than either Mip-NeRF 360 or Instant-NGP alone, while training 24x faster than Mip-NeRF 360. The paper received a Best Paper Finalist award at ICCV 2023.[9]

### Nerfacto

Nerfacto is the default real-data method included in the Nerfstudio framework. Rather than a single published paper, it is a practical combination of the best-performing techniques from multiple NeRF variants: hash encoding from Instant-NGP, scene contraction from Mip-NeRF 360, appearance embeddings from NeRF-W, proposal-based sampling, and camera pose refinement. Nerfacto achieves quality comparable to Mip-NeRF 360 with approximately an order of magnitude speedup, making it one of the most practical methods for real-world NeRF captures.[11]

## Table of Major NeRF Variants

| Method | Authors | Venue | Year | Key Innovation | Training Speed |
|---|---|---|---|---|---|
| NeRF | Mildenhall et al. | ECCV | 2020 | Original continuous 5D radiance field with volume rendering | ~1-2 days |
| NeRF in the Wild | Martin-Brualla et al. | CVPR | 2021 | Appearance embeddings and transient modeling for uncontrolled photos | ~1-2 days |
| Mip-NeRF | Barron et al. | ICCV | 2021 | Conical frustums and integrated positional encoding for anti-aliasing | ~1 day |
| D-NeRF | Pumarola et al. | CVPR | 2021 | Canonical space plus deformation network for dynamic scenes | ~1-2 days |
| Mip-NeRF 360 | Barron et al. | CVPR | 2022 | Non-linear scene contraction and distillation for unbounded scenes | ~1 day |
| Instant-NGP | Müller et al. | SIGGRAPH | 2022 | Multiresolution hash encoding for 1000x training speedup | ~5-15 seconds |
| TensoRF | Chen et al. | ECCV | 2022 | Tensor factorization (CP and VM decomposition) of radiance fields | <10 minutes (VM) to <30 minutes (CP) |
| Block-NeRF | Tancik et al. | CVPR | 2022 | City-scale scene decomposition into independently trained blocks | Per-block |
| Zip-NeRF | Barron et al. | ICCV | 2023 | Combines Mip-NeRF 360 anti-aliasing with grid-based speed | ~1 hour |
| Nerfacto | Tancik et al. | SIGGRAPH | 2023 | Best-of-breed combination for practical real-world captures | ~15-30 minutes |

## How does NeRF differ from 3D Gaussian Splatting?

[3D Gaussian Splatting](/wiki/gaussian_splatting) (3DGS), introduced by Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis at SIGGRAPH 2023, represents an alternative paradigm for novel view synthesis that has gained rapid adoption. While NeRF uses an implicit, continuous volumetric representation learned by a neural network, 3DGS uses an explicit collection of 3D Gaussian primitives that are directly optimized and rasterized.[10] The 3DGS paper states that its goal is "high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution" while matching the visual quality of the best prior NeRF methods such as Mip-NeRF 360.[10]

| Feature | NeRF (and variants) | 3D Gaussian Splatting |
|---|---|---|
| Scene Representation | Implicit (MLP or grid + MLP) | Explicit (3D Gaussian primitives) |
| Rendering Approach | Volume rendering via ray marching | Rasterization of Gaussian splats |
| Rendering Speed | Seconds per frame (original); real-time with Instant-NGP | Real-time (>= 30 fps at 1080p, 100+ fps on benchmark scenes) |
| Training Speed | Hours to days (original); seconds with Instant-NGP | Minutes (typically 15-30 min) |
| Memory Usage | Compact (MLP weights, or hash tables) | Higher (millions of Gaussians with attributes) |
| Visual Quality | High (especially Zip-NeRF, Mip-NeRF 360) | Comparable or better on many benchmarks |
| Dynamic Scenes | Requires specialized extensions (D-NeRF) | More naturally suited to dynamic updates |
| Editability | Difficult (implicit representation) | Easier (explicit primitives can be manipulated) |
| Mesh Export | Requires marching cubes or similar post-processing | Can extract surfaces from Gaussian distributions |
| Maturity | Large body of research (2020 onward) | Rapidly growing (2023 onward) |

3DGS achieves real-time rendering by projecting 3D Gaussians onto the image plane using efficient tile-based GPU rasterization, avoiding the expensive per-pixel ray marching required by NeRF.[10] The original paper targets at least 30 fps at 1080p and reports well over 100 fps on standard benchmark scenes, a regime that the original NeRF, at roughly 30 seconds per frame, could not approach.[10] However, the explicit representation of 3DGS typically requires more memory than NeRF's compact neural network weights. The two approaches are complementary in many respects, and hybrid methods that combine ideas from both paradigms continue to emerge.
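Once the Gaussians are projected, each pixel is shaded by blending the overlapping 2D splats front to back with the same compositing rule NeRF uses, C = sum over i of c_i * alpha_i * prod over j < i of (1 - alpha_j), where alpha_i is the splat's opacity scaled by its 2D Gaussian falloff at the pixel. This NumPy sketch covers only that per-pixel step; projection of the 3D covariances, tiling, and GPU sorting are omitted, and the splat tuple layout and function names are hypothetical.

```python
import numpy as np

def gaussian_alpha(pixel, mean2d, cov2d, opacity):
    """Opacity contributed by one projected 2D Gaussian at a pixel."""
    d = np.asarray(pixel, float) - np.asarray(mean2d, float)
    return opacity * np.exp(-0.5 * d @ np.linalg.solve(cov2d, d))

def blend_pixel(pixel, splats):
    """Front-to-back alpha blending of splats, nearest first.

    splats: list of (depth, mean2d, cov2d, opacity, rgb) tuples.
    """
    color = np.zeros(3)
    transmittance = 1.0
    for _, mean2d, cov2d, opacity, rgb in sorted(splats, key=lambda s: s[0]):
        a = gaussian_alpha(pixel, mean2d, cov2d, opacity)
        color += transmittance * a * np.asarray(rgb, float)
        transmittance *= 1.0 - a   # light left for splats further back
    return color

I = np.eye(2)
red = (1.0, [0, 0], I, 1.0, [1, 0, 0])    # opaque red splat at depth 1
blue = (2.0, [0, 0], I, 1.0, [0, 0, 1])   # opaque blue splat at depth 2
pixel_color = blend_pixel([0, 0], [blue, red])   # red hides blue regardless of list order
```

The speed difference comes from what is evaluated per pixel: a short, pre-sorted list of analytic Gaussians instead of hundreds of MLP queries along a ray.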

## Datasets and Benchmarks

Several standard datasets are used to evaluate NeRF methods. These benchmarks span synthetic and real-world scenes with varying complexity.

### NeRF-Synthetic (Blender)

The NeRF-Synthetic dataset, introduced alongside the original NeRF paper, consists of 8 synthetic scenes rendered using Blender's Cycles path tracer: Chair, Drums, Ficus, Hotdog, Lego, Materials, Mic, and Ship. Each scene provides 100 training views, 100 validation views, and 200 test views at 800 x 800 resolution, with cameras placed on a hemisphere around each object on a white background.[1] This dataset remains the most widely used benchmark for evaluating NeRF methods on synthetic data.

### LLFF (Local Light Field Fusion)

The LLFF dataset, introduced by Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar, consists of forward-facing real-world scenes captured with a handheld cellphone.[14] The forward-facing benchmark used in the NeRF paper contains 8 such scenes, 5 taken from the LLFF paper and 3 captured by the NeRF authors.[1] The camera motion is roughly planar (forward-facing), with relatively small variation in viewpoint. This dataset tests a method's ability to interpolate between nearby views of real scenes.

### Tanks and Temples

The Tanks and Temples benchmark, introduced by Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun, provides large-scale real-world scenes captured in realistic conditions with ground-truth 3D geometry from an industrial laser scanner.[13] The dataset includes indoor and outdoor scenes of varying complexity and is divided into training, intermediate, and advanced subsets. It is commonly used to evaluate NeRF methods on challenging real-world geometry.

### Mip-NeRF 360 Dataset

Introduced alongside the Mip-NeRF 360 paper, this dataset contains 9 unbounded real-world scenes (5 outdoor and 4 indoor) captured with inward-facing cameras that orbit around a central point of interest.[3] The scenes include complex geometry, reflective surfaces, and fine detail, making this dataset one of the most challenging benchmarks for novel view synthesis. It has become a standard evaluation dataset for state-of-the-art methods.

## Nerfstudio Framework

[Nerfstudio](/wiki/nerfstudio) is a modular, open-source [PyTorch](/wiki/pytorch)-based framework for NeRF development, created by Matthew Tancik and collaborators at UC Berkeley's BAIR lab. The framework was presented at [SIGGRAPH](/wiki/siggraph) 2023 and published in the SIGGRAPH 2023 Conference Proceedings.[11]

Nerfstudio provides:

- A **modular API** that separates data loading, model architecture, sampling strategy, and rendering into interchangeable components.
- **Real-time visualization** through an integrated web-based viewer that displays training progress and allows interactive camera control.
- **Data processing pipelines** that handle video or image input, run COLMAP for pose estimation, and prepare data for training.
- **Multiple built-in methods** including Nerfacto, Instant-NGP, Mip-NeRF, and support for 3D [Gaussian Splatting](/wiki/gaussian_splatting) via the Splatfacto model.
- **Export functionality** for producing videos, point clouds, and mesh representations from trained models.
- **Logging and profiling** integration with TensorBoard and [Weights & Biases](/wiki/wandb).
- **Benchmarking scripts** for standardized evaluation on the Blender and mip-NeRF 360 datasets.

Nerfstudio has become the standard open-source platform for NeRF research and practical applications, with an active community and frequent updates incorporating new methods.[11]

## What is NeRF used for?

NeRF and its variants have found applications across many domains.

### Novel View Synthesis

The original and most direct application of NeRF is generating photorealistic images of a scene from viewpoints not present in the training data. This has applications in photography, filmmaking, and virtual tours. Products from companies like Luma AI allow users to capture NeRF scenes with a smartphone and share interactive 3D representations on the web.

### 3D Reconstruction

While NeRF was originally designed for view synthesis rather than explicit 3D geometry extraction, the learned density field implicitly encodes scene geometry. Surfaces can be extracted using marching cubes or similar algorithms applied to the density field. Methods like NeuS and VolSDF specifically optimize NeRF-like representations to produce high-quality surface reconstructions with signed distance functions.

### Virtual Reality and Augmented Reality

[Virtual reality](/wiki/virtual_reality) and [augmented reality](/wiki/augmented_reality) applications benefit from NeRF's ability to generate photorealistic views of real-world environments from arbitrary camera positions. Real-time NeRF variants like Instant-NGP and 3DGS have made interactive VR/AR experiences based on neural scene representations feasible. Users can walk through reconstructed environments with six degrees of freedom.

### Autonomous Driving

NeRF has significant applications in [autonomous driving](/wiki/autonomous_driving) simulation and testing. Block-NeRF demonstrated city-scale reconstruction for driving simulation using Waymo data.[8] Several specialized methods have been developed for this domain, including DriveEnv-NeRF for creating simulation environments under varying lighting conditions, and Lightning NeRF for efficient outdoor scene reconstruction. These tools allow autonomous driving systems to be tested in photorealistic simulated environments derived from real-world data.

### Robotics

In [robotics](/wiki/robotics), NeRF provides dense scene representations that can be used for navigation, manipulation planning, and collision avoidance. Robots can build NeRF models of their environment from onboard camera observations and use the resulting 3D representations for path planning and object interaction. The differentiable nature of NeRF also allows integration into end-to-end learned robotic systems.

### Medical Imaging

NeRF-based techniques have been adapted for medical imaging applications, including synthesizing novel views from limited CT or MRI scan data. For CT, which uses ionizing radiation, this could reduce patient radiation exposure by requiring fewer scans while still providing clinicians with the views they need for diagnosis.

### Digital Content Creation

NeRF enables photorealistic capture and relighting of real-world objects and scenes for use in games, films, and advertising. Methods like Ref-NeRF improve the handling of reflective surfaces, making captured objects more amenable to relighting in new environments. The ability to capture real objects and place them in virtual scenes bridges the gap between real and synthetic content.

### Urban Mapping and Cultural Heritage

NeRF has been applied to create detailed 3D models of urban environments, historical buildings, and cultural heritage sites. The photorealistic quality of NeRF reconstructions, combined with the relatively lightweight capture requirements (just photographs from multiple angles), makes it an attractive tool for digital preservation of physical spaces.

## Training and Capture Considerations

Successful NeRF reconstruction depends on several practical factors:

- **Image coverage:** The scene should be photographed from many overlapping viewpoints. Insufficient angular coverage leads to artifacts in unobserved regions.
- **Camera pose accuracy:** Precise camera poses are needed. The structure-from-motion package COLMAP is the most widely used tool for estimating poses from images, but it can fail on textureless surfaces, repetitive patterns, or scenes with few distinctive features.
- **Consistent lighting:** The original NeRF assumes static illumination, so lighting that changes between captures causes inconsistencies. NeRF-W addresses this by learning a per-image appearance embedding.
- **Static scene:** Moving objects during capture cause ghosting artifacts unless dynamic-aware methods are used.
- **Image quality:** Motion blur, out-of-focus regions, and compression artifacts in the training images directly affect reconstruction quality.
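Using COLMAP poses for NeRF training involves a coordinate conversion that often causes trouble in practice. COLMAP stores each image's pose as a world-to-camera rotation quaternion `(QW, QX, QY, QZ)` and translation `(TX, TY, TZ)`, using camera axes with x right, y down, and z forward. Many NeRF codebases, including those reading Blender-style `transforms.json` files, instead expect camera-to-world matrices in an OpenGL-style convention, where y points up and the camera looks down -z. The minimal sketch below performs just that conversion. The function names are hypothetical. Real pipelines typically also recenter, rescale, and reorient the world frame, and this sketch omits those steps.

```python
import math

Matrix = list[list[float]]


def quat_to_rotmat(qw: float, qx: float, qy: float, qz: float) -> Matrix:
    """Rotation matrix from a Hamilton quaternion stored w-first, as COLMAP does."""
    n = math.sqrt(qw * qw + qx * qx + qy * qy + qz * qz)
    w, x, y, z = qw / n, qx / n, qy / n, qz / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def colmap_to_nerf_c2w(qvec, tvec) -> Matrix:
    """4x4 camera-to-world matrix (OpenGL-style axes) from a COLMAP world-to-camera pose."""
    R = quat_to_rotmat(*qvec)
    R_inv = [[R[j][i] for j in range(3)] for i in range(3)]  # inverse of a rotation = transpose
    # Camera center in world coordinates: C = -R^T t.
    center = [-sum(R_inv[i][k] * tvec[k] for k in range(3)) for i in range(3)]
    c2w = [R_inv[i] + [center[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
    # OpenCV/COLMAP camera axes (x right, y down, z forward) ->
    # OpenGL axes (x right, y up, camera looking down -z): negate the y and z columns.
    for row in c2w[:3]:
        row[1] = -row[1]
        row[2] = -row[2]
    return c2w


# Identity rotation with t = (0, 0, -5): the camera sits at (0, 0, 5) in world space.
for row in colmap_to_nerf_c2w((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, -5.0)):
    print(row)
# Expected (numerically): [1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 5], [0, 0, 0, 1]
```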

## When was NeRF released, and how has the field evolved?

| Year | Development |
|---|---|
| 2020 | NeRF (Mildenhall et al.) introduced at ECCV 2020 |
| 2020 | Fourier Features paper (Tancik et al.) provides theoretical grounding for positional encoding |
| 2021 | Mip-NeRF introduces anti-aliased rendering with conical frustums (ICCV 2021) |
| 2021 | NeRF in the Wild enables reconstruction from internet photo collections (CVPR 2021) |
| 2021 | D-NeRF extends to dynamic scenes (CVPR 2021) |
| 2021 | PlenOctrees and KiloNeRF achieve real-time NeRF rendering through caching and distillation |
| 2022 | Instant-NGP cuts NeRF training time from hours or days to seconds or minutes with multiresolution hash encoding (SIGGRAPH 2022) |
| 2022 | Mip-NeRF 360 handles unbounded 360-degree scenes (CVPR 2022) |
| 2022 | Block-NeRF demonstrates city-scale reconstruction (CVPR 2022) |
| 2022 | TensoRF introduces tensor factorization for efficient radiance fields (ECCV 2022) |
| 2023 | 3D Gaussian Splatting emerges as a competing paradigm (SIGGRAPH 2023) |
| 2023 | Zip-NeRF combines anti-aliasing and grid-based speed (ICCV 2023) |
| 2023 | Nerfstudio provides a standardized open-source framework (SIGGRAPH 2023) |

## Relationship to Other Neural Scene Representations

NeRF belongs to the broader family of neural scene representations, sometimes called neural fields or coordinate-based neural networks. Related approaches include:

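- **Neural Signed Distance Functions (Neural SDFs):** Methods like DeepSDF and NeuS represent geometry as a learned signed distance function whose zero-level set is the surface. Because that level set defines the surface explicitly, these methods typically produce cleaner surface reconstructions than extracting a surface from NeRF's density field. NeuS trains such an SDF from images by adapting volume rendering to it.
- **Light Field Networks:** These map a ray directly to its color with a single network evaluation, without sampling points along the ray for volume rendering. This trades some multi-view 3D consistency for faster rendering.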
- **[Generative Adversarial Networks](/wiki/generative_adversarial_network) for 3D:** Methods like EG3D and GRAF use NeRF-like representations within GAN frameworks to generate novel 3D objects from learned distributions.
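- **[Diffusion Models](/wiki/diffusion_models) for 3D:** Recent methods use pretrained 2D text-to-image diffusion models as priors to guide NeRF optimization, enabling text-to-3D generation and single-image 3D reconstruction. DreamFusion introduced score distillation sampling with Google's Imagen as the prior. Magic3D extended the approach with a coarse-to-fine pipeline, and many open-source follow-ups use [Stable Diffusion](/wiki/stable_diffusion) as the prior.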
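The Light Field Networks entry above depends on a ray parameterization that identifies the ray itself, regardless of which point on it is chosen as the origin. Light Field Networks (Sitzmann et al., 2021) use 6D Plücker coordinates `(d, o x d)` for this purpose. A network then maps those six numbers straight to a color in one evaluation per ray. NeRF, by contrast, evaluates its MLP at many samples per ray: 64 coarse plus 128 fine in the original paper. The minimal sketch below computes only the parameterization, and its function names are illustrative.

```python
import math

Vec3 = tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker(origin: Vec3, direction: Vec3) -> tuple[float, ...]:
    """6D Plücker coordinates (d, o x d) of the ray o + t*d, with d normalized."""
    length = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / length for c in direction)
    moment = cross(origin, d)  # unchanged if o slides along the ray, since d x d = 0
    return d + moment


# Two different origins on the same line give identical coordinates.
print(plucker((1.0, 2.0, 3.0), (0.0, 0.0, 2.0)))   # (0.0, 0.0, 1.0, 2.0, -1.0, 0.0)
print(plucker((1.0, 2.0, 10.0), (0.0, 0.0, 2.0)))  # (0.0, 0.0, 1.0, 2.0, -1.0, 0.0)
```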

## See Also

- [Computer Vision](/wiki/computer_vision)
- [Deep Learning](/wiki/deep_learning)
- [3D Gaussian Splatting](/wiki/gaussian_splatting)
- [Convolutional Neural Network](/wiki/convolutional_neural_network)
- [Diffusion Models](/wiki/diffusion_models)
- [Volume Rendering](/wiki/volume_rendering)
- [Positional Encoding](/wiki/positional_encoding)

## References

1. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., & Ng, R. (2020). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." *Proceedings of the European Conference on Computer Vision (ECCV)*. [arXiv:2003.08934](https://arxiv.org/abs/2003.08934)

2. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., & Srinivasan, P.P. (2021). "Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*. [arXiv:2103.13415](https://arxiv.org/abs/2103.13415)

3. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., & Hedman, P. (2022). "Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:2111.12077](https://arxiv.org/abs/2111.12077)

4. Müller, T., Evans, A., Schied, C., & Keller, A. (2022). "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding." *ACM Transactions on Graphics (SIGGRAPH)*, 41(4). [arXiv:2201.05989](https://arxiv.org/abs/2201.05989)

5. Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy, A., & Duckworth, D. (2021). "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:2008.02268](https://arxiv.org/abs/2008.02268)

6. Pumarola, A., Corona, E., Pons-Moll, G., & Moreno-Noguer, F. (2021). "D-NeRF: Neural Radiance Fields for Dynamic Scenes." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:2011.13961](https://arxiv.org/abs/2011.13961)

7. Chen, A., Xu, Z., Geiger, A., Yu, J., & Su, H. (2022). "TensoRF: Tensorial Radiance Fields." *Proceedings of the European Conference on Computer Vision (ECCV)*. [Project page](https://apchenstu.github.io/TensoRF/)

8. Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., & Kretzschmar, H. (2022). "Block-NeRF: Scalable Large Scene Neural View Synthesis." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:2202.05263](https://arxiv.org/abs/2202.05263)

9. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., & Hedman, P. (2023). "Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*. [arXiv:2304.06706](https://arxiv.org/abs/2304.06706)

10. Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering." *ACM Transactions on Graphics (SIGGRAPH)*, 42(4). [arXiv:2308.04079](https://arxiv.org/abs/2308.04079)

11. Tancik, M., Weber, E., Ng, E., Li, R., Yi, B., Kerr, J., Wang, T., Kristoffersen, A., Austin, J., Salahi, K., Ahuja, A., McAllister, D., & Kanazawa, A. (2023). "Nerfstudio: A Modular Framework for Neural Radiance Field Development." *ACM SIGGRAPH 2023 Conference Proceedings*. [arXiv:2302.04264](https://arxiv.org/abs/2302.04264)

12. Tancik, M., Srinivasan, P.P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J.T., & Ng, R. (2020). "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains." *Advances in Neural Information Processing Systems (NeurIPS)*. [arXiv:2006.10739](https://arxiv.org/abs/2006.10739)

13. Knapitsch, A., Park, J., Zhou, Q.-Y., & Koltun, V. (2017). "Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction." *ACM Transactions on Graphics*, 36(4).

14. Mildenhall, B., Srinivasan, P.P., Ortiz-Cayon, R., Kalantari, N.K., Ramamoorthi, R., Ng, R., & Kar, A. (2019). "Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines." *ACM Transactions on Graphics (SIGGRAPH)*, 38(4).