Neural Radiance Fields (NeRF) is a method for synthesizing novel views of complex 3D scenes by representing them as continuous volumetric functions learned by a neural network. Introduced by Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng in their 2020 paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," the technique received the Best Paper Honorable Mention at the European Conference on Computer Vision (ECCV) 2020. NeRF has since become one of the most influential papers in computer vision, amassing thousands of citations and inspiring a large family of follow-up methods for 3D reconstruction, novel view synthesis, and related tasks.
At its core, NeRF takes a set of 2D images of a scene captured from known camera positions and learns a compact, continuous representation of the scene's geometry and appearance. From this representation, photorealistic images can be rendered from arbitrary new viewpoints that were never directly observed. The approach represented a significant leap in quality over prior methods for view synthesis, producing images with fine detail, realistic lighting, and accurate reflections.
The central insight of NeRF is to represent a static 3D scene as a continuous function that maps a 5D input coordinate to a color and volume density value. The five dimensions consist of a 3D spatial location (x, y, z) and a 2D viewing direction (theta, phi). The spatial coordinates determine how dense (opaque) the scene is at a given point, while the viewing direction allows the model to capture view-dependent effects such as specular highlights and reflections.
Formally, the scene is modeled as a function:
F: (x, y, z, theta, phi) -> (r, g, b, sigma)
where (r, g, b) represents the emitted color at that location when viewed from the given direction, and sigma represents the volume density (a measure of how much light is absorbed or scattered at that point). This function is approximated by a multilayer perceptron (MLP) trained using only 2D images and their corresponding camera poses.
The volume density sigma is predicted as a function of the spatial location alone, which ensures multiview consistency (the geometry of the scene does not change depending on the viewing angle). The color, however, depends on both position and viewing direction, allowing the network to model non-Lambertian surface effects.
The neural network architecture used in the original NeRF paper consists of a relatively simple MLP. The network processes the encoded 3D spatial coordinates through 8 fully connected layers, each with 256 hidden units and ReLU activation functions. A skip connection concatenates the encoded input to the fifth layer's activation; because the input is concatenated rather than added, this pattern follows coordinate-based networks such as DeepSDF rather than the additive residuals of ResNet.
After these 8 layers, the network outputs the volume density sigma and a 256-dimensional feature vector. This feature vector is concatenated with the encoded viewing direction and passed through one additional fully connected layer with 128 hidden units; a final output layer with a sigmoid activation then produces the view-dependent RGB color.
This design enforces an important inductive bias: volume density is a function of position only, while color depends on both position and viewing direction. This separation ensures that the geometry remains consistent across different viewpoints, while allowing appearance to change based on the observer's angle (capturing effects like specularity and reflections).
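As an illustration, the architecture's data flow can be sketched in NumPy with randomly initialized, untrained weights. The layer sizes follow the paper's description; the 63- and 27-dimensional inputs correspond to the positionally encoded coordinates and directions, and every numeric value here is illustrative rather than taken from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

D_POS, D_DIR, W = 63, 27, 256  # encoded position/direction dims, layer width

def init(d_in, d_out):
    # Random, untrained weights -- purely to illustrate the data flow.
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

layers = [init(D_POS, W)]
for i in range(1, 8):
    d_in = W + D_POS if i == 4 else W   # skip connection feeds the fifth layer
    layers.append(init(d_in, W))
w_sigma, b_sigma = init(W, 1)           # density head: position-only branch
w_feat, b_feat = init(W, W)             # 256-d feature vector
w_dir, b_dir = init(W + D_DIR, 128)     # direction-conditioned branch
w_rgb, b_rgb = init(128, 3)             # final RGB output

def nerf_mlp(x_enc, d_enc):
    h = x_enc
    for i, (w, b) in enumerate(layers):
        if i == 4:                      # concatenate the encoded input (skip)
            h = np.concatenate([h, x_enc], axis=-1)
        h = relu(h @ w + b)
    sigma = relu(h @ w_sigma + b_sigma)                 # sigma >= 0
    feat = h @ w_feat + b_feat
    h2 = relu(np.concatenate([feat, d_enc], axis=-1) @ w_dir + b_dir)
    rgb = 1.0 / (1.0 + np.exp(-(h2 @ w_rgb + b_rgb)))  # sigmoid -> [0, 1]
    return rgb, sigma
```

Because sigma is computed before the viewing direction ever enters the network, the predicted geometry is identical for all viewing angles, while the RGB output is free to vary with direction.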
One of the key technical contributions of the original NeRF paper is the use of positional encoding to map low-dimensional input coordinates into a higher-dimensional space before passing them into the MLP. Without this encoding, the network struggles to represent high-frequency variations in color and geometry, producing overly smooth results.
The positional encoding function maps each scalar input value p to a vector using sinusoidal functions at exponentially increasing frequencies:
gamma(p) = (sin(2^0 * pi * p), cos(2^0 * pi * p), sin(2^1 * pi * p), cos(2^1 * pi * p), ..., sin(2^(L-1) * pi * p), cos(2^(L-1) * pi * p))
For the 3D spatial coordinates, L = 10 frequency bands are used, resulting in a 60-dimensional encoding (plus the 3 original coordinates). For the 2D viewing direction, L = 4 frequency bands are used, producing a 24-dimensional encoding. This approach is related to the concept of Fourier features and was theoretically justified by Tancik et al. in their concurrent work "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains" (NeurIPS 2020), which showed that such encodings allow MLPs to overcome the spectral bias toward low-frequency functions.
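A minimal NumPy implementation of gamma makes the dimension counts concrete. This sketch encodes only the sinusoidal terms (the identity pass-through of the raw coordinates is omitted, matching the formula above):

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """gamma(p): sin/cos features at frequencies 2^0 * pi ... 2^(L-1) * pi.

    p: array of shape (..., d); returns shape (..., 2 * d * num_freqs).
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # 2^k * pi for k < L
    angles = p[..., None] * freqs                   # shape (..., d, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

xyz = np.array([[0.1, -0.3, 0.7]])
gamma_x = positional_encoding(xyz, num_freqs=10)    # 3 * 2 * 10 = 60 dims
dirs = np.array([[0.0, 0.0, 1.0]])                  # direction as a unit vector
gamma_d = positional_encoding(dirs, num_freqs=4)    # 3 * 2 * 4  = 24 dims
```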
To render an image from the learned NeRF representation, the method employs classical volume rendering. For each pixel in the desired output image, a camera ray r(t) = o + td is cast from the camera origin o through the pixel into the scene, where d is the ray direction and t parameterizes points along the ray.
The expected color C(r) of the ray is computed by integrating the colors and densities along the ray between a near bound t_n and a far bound t_f:
C(r) = integral from t_n to t_f of T(t) * sigma(r(t)) * c(r(t), d) dt
where T(t) is the accumulated transmittance along the ray from t_n to t, representing the probability that the ray travels from t_n to t without hitting any other particle:
T(t) = exp(-integral from t_n to t of sigma(r(s)) ds)
In practice, this continuous integral is approximated numerically using quadrature. The ray is partitioned into N evenly spaced bins, and a single point is sampled uniformly at random from within each bin (stratified sampling). The discrete approximation becomes:
C_hat(r) = sum over i of T_i * (1 - exp(-sigma_i * delta_i)) * c_i
where delta_i = t_(i+1) - t_i is the distance between adjacent samples, T_i = exp(-sum over j < i of sigma_j * delta_j) is the discrete transmittance accumulated before the i-th sample, and alpha_i = 1 - exp(-sigma_i * delta_i) represents the opacity of the i-th sample. This formula reduces to traditional alpha compositing and, importantly, is fully differentiable with respect to the network parameters. This differentiability allows the entire rendering pipeline to be trained end-to-end using gradient descent.
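The quadrature rule can be sketched in a few lines of NumPy. The example ray (a single dense red sample surrounded by empty space) is purely illustrative:

```python
import numpy as np

def composite(sigmas, colors, deltas, background=None):
    """Discrete volume rendering: C_hat = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    if background is not None:
        # Composite any remaining transmittance onto a background color.
        color = color + (1.0 - weights.sum()) * background
    return color, weights

# A ray with one dense red "surface" in the middle of empty space:
sigmas = np.array([0.0, 0.0, 50.0, 0.0])
colors = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
deltas = np.full(4, 0.25)
c, w = composite(sigmas, colors, deltas)   # c is close to pure red
```

The per-sample weights T_i * alpha_i always sum to at most 1, which is why an explicit background term can absorb the remainder for rays that exit the scene.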
Rendering high-quality images by densely sampling every point along each ray is computationally expensive. Many sampled points fall in empty space or in occluded regions that do not contribute to the final rendered color. To address this inefficiency, NeRF employs a hierarchical sampling strategy with two networks: a "coarse" network and a "fine" network.
The coarse network is first evaluated at a set of N_c = 64 stratified sample points along each ray. The output density values from the coarse network are used to construct a piecewise-constant probability density function (PDF) along the ray. Regions with higher predicted density receive higher probability, indicating that they are more likely to contain visible surfaces.
A second set of N_f = 128 samples is then drawn from this distribution using inverse transform sampling. These additional samples, combined with the original 64 coarse samples (for a total of 192 points per ray), are passed through the fine network to produce the final high-quality color estimate. This approach concentrates computational resources on the parts of the scene that matter most for the final rendered output.
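The inverse-transform step can be sketched as follows; the helper name sample_pdf, the Gaussian-shaped weights peaked near t = 4, and all sizes are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, rng):
    """Draw samples from a piecewise-constant PDF via inverse transform sampling.

    bin_edges: (N+1,) bin boundaries along the ray; weights: (N,) coarse weights.
    """
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    # Linearly invert the CDF inside each selected bin.
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    frac = (u - cdf[idx]) / denom
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 65)                 # 64 coarse bins in [t_n, t_f]
w = np.exp(-0.5 * ((edges[:-1] - 4.0) / 0.2) ** 2)  # coarse weights peak at t = 4
fine_t = sample_pdf(edges, w, 128, rng)           # fine samples cluster near 4
```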
NeRF is trained in a per-scene optimization framework. Given a dataset of posed 2D images (photographs with known camera intrinsics and extrinsics), the network is optimized to minimize the difference between its rendered outputs and the ground-truth pixel colors.
The camera poses are typically obtained through Structure-from-Motion (SfM) preprocessing using tools such as COLMAP. COLMAP performs feature extraction (usually using SIFT features), feature matching across image pairs, and bundle adjustment to estimate camera parameters and produce a sparse 3D point cloud.
The training loss is the total squared error (L2 loss) between the rendered and true pixel colors, summed over both the coarse and fine networks:
L = sum over rays r of ( ||C_hat_coarse(r) - C(r)||^2 + ||C_hat_fine(r) - C(r)||^2 )
Training is performed using the Adam optimizer. In the original paper, training a single scene required approximately 100,000 to 300,000 iterations, taking roughly 1 to 2 days on a single NVIDIA V100 GPU. Rendering a single 800 x 800 image from the trained model required about 30 seconds, as each pixel requires multiple forward passes through the MLP.
NeRF's quality is typically evaluated using three standard image quality metrics:
- PSNR (Peak Signal-to-Noise Ratio), a pixel-wise fidelity measure (higher is better)
- SSIM (Structural Similarity Index), which compares local image structure (higher is better)
- LPIPS (Learned Perceptual Image Patch Similarity), which measures perceptual distance using deep network features (lower is better)
On the NeRF-Synthetic (Blender) dataset, the original NeRF achieved an average PSNR of approximately 31 dB across the 8 test scenes, significantly outperforming prior methods such as Neural Volumes, Scene Representation Networks (SRN), and Local Light Field Fusion (LLFF) at the time of publication.
Despite its impressive results, the original NeRF method has several notable limitations:
- Slow training: each scene requires roughly 1 to 2 days of optimization on a high-end GPU.
- Slow rendering: about 30 seconds per 800 x 800 frame, far from real time.
- Per-scene optimization: the model must be retrained from scratch for every new scene and does not generalize.
- Static scenes only: moving objects and changing lighting violate the model's assumptions.
- Bounded scenes: the original formulation struggles with unbounded, 360-degree environments.
- Dependence on accurate camera poses, typically obtained from an SfM pipeline such as COLMAP.
These limitations motivated a wave of follow-up research that collectively addressed nearly every shortcoming, resulting in methods that are orders of magnitude faster, handle dynamic and unbounded scenes, and in some cases generalize across scenes.
The NeRF paper spawned an extraordinarily active research area. Below are some of the most important variants.
Mip-NeRF, proposed by Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan, addresses the aliasing problem in the original NeRF. Instead of casting infinitesimal rays through each pixel, Mip-NeRF casts 3D conical frustums. For each sample along the cone, a multivariate Gaussian is fitted to the conical frustum, and an Integrated Positional Encoding (IPE) featurizes the entire region rather than a single point.
This approach allows the network to reason about the scale at which the scene is being observed, naturally handling multiscale content. Mip-NeRF reduced error rates by 17% on the standard Blender dataset and by 60% on a challenging multiscale variant, while being 7% faster than the original NeRF and using half the model size (by eliminating the need for separate coarse and fine networks).
Mip-NeRF 360, by Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman, extends Mip-NeRF to handle unbounded 360-degree scenes. It introduces three key innovations: a non-linear scene parameterization that maps unbounded coordinates into a bounded domain; an online distillation procedure that replaces the coarse-to-fine sampling with a more efficient proposal network; and a novel distortion-based regularizer that encourages the model to produce compact, well-defined geometry.
Mip-NeRF 360 reduced mean-squared error by 57% compared to Mip-NeRF on challenging outdoor scenes and became the standard benchmark method for unbounded scene reconstruction. The accompanying mip-NeRF 360 dataset is now one of the most widely used benchmarks for evaluating novel view synthesis methods.
Instant Neural Graphics Primitives, proposed by Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller at NVIDIA, delivered the most significant speedup in NeRF training and rendering. Published in ACM Transactions on Graphics (SIGGRAPH 2022), the paper replaces the slow positional encoding and large MLP of the original NeRF with a multiresolution hash encoding backed by a much smaller neural network.
The multiresolution hash encoding works by arranging trainable feature vectors into L resolution levels, each containing a hash table with up to T entries. For a given 3D point, the method finds the surrounding voxels at each resolution level, hashes their vertices to look up trainable feature vectors, and interpolates between them. These features from all resolution levels are concatenated and fed into a compact MLP (typically just 2 layers with 64 hidden units).
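A simplified CPU sketch of the lookup-and-interpolate step is below. The table size, level count, and feature dimension are illustrative; the three hashing primes are the ones given in the Instant-NGP paper:

```python
import numpy as np

# Spatial-hash primes from the Instant-NGP paper.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_encode(x, tables, resolutions):
    """Trilinearly interpolate hashed feature vectors at each resolution level.

    x: (3,) point in [0, 1]^3; tables: list of (T, F) trainable feature arrays;
    resolutions: grid resolution per level. Returns concatenated (L * F,) features.
    """
    feats = []
    for table, res in zip(tables, resolutions):
        T = table.shape[0]
        pos = x * res
        lo = np.floor(pos).astype(np.uint64)
        frac = pos - lo
        out = np.zeros(table.shape[1])
        for corner in range(8):                      # 8 corners of the voxel
            offs = np.array([(corner >> k) & 1 for k in range(3)],
                            dtype=np.uint64)
            v = lo + offs
            h = int(np.bitwise_xor.reduce(v * PRIMES) % np.uint64(T))
            wgt = np.prod(np.where(offs == 1, frac, 1.0 - frac))
            out += wgt * table[h]                    # trilinear interpolation
        feats.append(out)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
L_LEVELS, F, T = 4, 2, 2 ** 14                       # illustrative sizes
tables = [rng.normal(0.0, 1e-4, (T, F)) for _ in range(L_LEVELS)]
resolutions = [16, 32, 64, 128]
features = hash_grid_encode(np.array([0.3, 0.6, 0.9]), tables, resolutions)
```

During training, gradients flow back through the interpolation weights into the table entries themselves, so the features are learned jointly with the small MLP that consumes them.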
The hash encoding is trivially parallelizable on modern GPUs because each table lookup is independent. This enables Instant-NGP to train NeRF scenes in as little as 5 seconds (compared to 1 to 2 days for the original) and render in real time. On the Blender synthetic dataset, Instant-NGP achieves quality comparable to or better than the original NeRF, with a roughly 1000x speedup in training time.
"NeRF in the Wild" (NeRF-W), by Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth, extends NeRF to work with uncontrolled photo collections, such as tourist photographs of landmarks scraped from the internet. The original NeRF assumes controlled capture conditions with consistent lighting and no transient objects (such as people walking through the scene).
NeRF-W addresses this by introducing per-image appearance embeddings that capture variations in lighting, exposure, and color processing across different photographs. It also adds a separate transient prediction head that models objects that appear in some images but not others (pedestrians, vehicles, etc.). These extensions allowed NeRF-W to reconstruct landmarks like the Trevi Fountain and the Brandenburg Gate from Flickr photo collections, producing temporally consistent, photorealistic novel views.
D-NeRF, by Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer, extends NeRF to dynamic scenes containing moving objects. The method uses two networks: a canonical network that represents the scene in a fixed reference configuration, and a deformation network that predicts how each 3D point moves from the canonical space to its position at a given time step t.
The deformation network takes a 3D position and a time value as input and outputs a displacement vector. This approach effectively factors the dynamic scene into a static canonical representation plus a learned deformation field, allowing the method to reconstruct non-rigidly deforming objects from a single monocular camera.
TensoRF, by Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su, models the radiance field as a 4D tensor (a 3D voxel grid with per-voxel multi-channel features) and applies tensor factorization to achieve compact and efficient representations. The paper introduces two decomposition approaches: classical CP decomposition, which factors the tensor into rank-one components; and a novel Vector-Matrix (VM) decomposition, which provides better expressiveness.
TensoRF with CP decomposition achieves better rendering quality than the original NeRF with model sizes under 4 MB. The VM decomposition variant further improves quality while maintaining fast reconstruction times (under 30 minutes for a single scene).
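The storage saving from CP factorization is easy to see in a toy NumPy example (all sizes illustrative): a rank-R grid is stored as three small factor matrices instead of a dense N^3 voxel grid:

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 64, 8   # grid resolution and CP rank (illustrative sizes)

# A rank-R density grid stored as three factor matrices:
# sigma[i, j, k] = sum_r vx[r, i] * vy[r, j] * vz[r, k]
vx, vy, vz = (rng.normal(0.0, 0.1, (R, N)) for _ in range(3))

def density(i, j, k):
    """Evaluate the factored grid at voxel (i, j, k) without materializing it."""
    return float(np.sum(vx[:, i] * vy[:, j] * vz[:, k]))

full_params = N ** 3        # 262,144 values for a dense grid
cp_params = 3 * R * N       # 1,536 values for the CP factors
```

The Vector-Matrix decomposition replaces some of these rank-one vector products with vector-matrix products, trading a little extra storage for substantially more expressive per-component features.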
Block-NeRF, by Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar, scales NeRF to city-level scenes. The method decomposes a large environment into individually trained NeRF blocks that can be independently updated and seamlessly combined at render time using learned appearance alignment.
The authors demonstrated the approach on 2.8 million images of San Francisco, constructing the largest neural scene representation at the time of publication. Block-NeRF was developed in collaboration with Waymo and showcased the potential of NeRF for autonomous driving simulation.
Zip-NeRF, by Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman, combines the anti-aliasing capabilities of Mip-NeRF 360 with the speed of grid-based methods like Instant-NGP. The key challenge is that Mip-NeRF 360 reasons about conical frustums along rays, while grid-based methods operate on point samples, making the two approaches fundamentally incompatible.
Zip-NeRF addresses this through techniques from signal processing and rendering, producing a method that achieves error rates 8% to 77% lower than either Mip-NeRF 360 or Instant-NGP alone, while training 24x faster than Mip-NeRF 360. The paper received a Best Paper Finalist award at ICCV 2023.
Nerfacto is the default real-data method included in the Nerfstudio framework. Rather than a single published paper, it is a practical combination of the best-performing techniques from multiple NeRF variants: hash encoding from Instant-NGP, scene contraction from Mip-NeRF 360, appearance embeddings from NeRF-W, proposal-based sampling, and camera pose refinement. Nerfacto achieves quality comparable to Mip-NeRF 360 with approximately an order of magnitude speedup, making it one of the most practical methods for real-world NeRF captures.
| Method | Authors | Venue | Year | Key Innovation | Training Speed |
|---|---|---|---|---|---|
| NeRF | Mildenhall et al. | ECCV | 2020 | Original continuous 5D radiance field with volume rendering | ~1-2 days |
| NeRF in the Wild | Martin-Brualla et al. | CVPR | 2021 | Appearance embeddings and transient modeling for uncontrolled photos | ~1-2 days |
| Mip-NeRF | Barron et al. | ICCV | 2021 | Conical frustums and integrated positional encoding for anti-aliasing | ~1 day |
| D-NeRF | Pumarola et al. | CVPR | 2021 | Canonical space plus deformation network for dynamic scenes | ~1-2 days |
| Mip-NeRF 360 | Barron et al. | CVPR | 2022 | Non-linear scene contraction and distillation for unbounded scenes | ~1 day |
| Instant-NGP | Müller et al. | SIGGRAPH | 2022 | Multiresolution hash encoding for 1000x training speedup | ~5-15 seconds |
| TensoRF | Chen et al. | ECCV | 2022 | Tensor factorization (CP and VM decomposition) of radiance fields | ~30 minutes |
| Block-NeRF | Tancik et al. | CVPR | 2022 | City-scale scene decomposition into independently trained blocks | Per-block |
| Zip-NeRF | Barron et al. | ICCV | 2023 | Combines Mip-NeRF 360 anti-aliasing with grid-based speed | ~1 hour |
| Nerfacto | Tancik et al. | SIGGRAPH | 2023 | Best-of-breed combination for practical real-world captures | ~15-30 minutes |
3D Gaussian Splatting (3DGS), introduced by Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis at SIGGRAPH 2023, represents an alternative paradigm for novel view synthesis that has gained rapid adoption. While NeRF uses an implicit, continuous volumetric representation learned by a neural network, 3DGS uses an explicit collection of 3D Gaussian primitives that are directly optimized and rasterized.
| Feature | NeRF (and variants) | 3D Gaussian Splatting |
|---|---|---|
| Scene Representation | Implicit (MLP or grid + MLP) | Explicit (3D Gaussian primitives) |
| Rendering Approach | Volume rendering via ray marching | Rasterization of Gaussian splats |
| Rendering Speed | Seconds per frame (original); real-time with Instant-NGP | Real-time (100+ FPS at 1080p) |
| Training Speed | Hours to days (original); seconds with Instant-NGP | Minutes (typically 15-30 min) |
| Memory Usage | Compact (MLP weights, or hash tables) | Higher (millions of Gaussians with attributes) |
| Visual Quality | High (especially Zip-NeRF, Mip-NeRF 360) | Comparable or better on many benchmarks |
| Dynamic Scenes | Requires specialized extensions (D-NeRF) | More naturally suited to dynamic updates |
| Editability | Difficult (implicit representation) | Easier (explicit primitives can be manipulated) |
| Mesh Export | Requires marching cubes or similar post-processing | Also non-trivial; surfaces must be fitted to the optimized Gaussians |
| Maturity | Large body of research (2020 onward) | Rapidly growing (2023 onward) |
3DGS achieves real-time rendering at over 100 frames per second on consumer hardware by projecting 3D Gaussians onto the image plane using efficient GPU rasterization, avoiding the expensive per-pixel ray marching required by NeRF. However, the explicit representation of 3DGS typically requires more memory than NeRF's compact neural network weights. The two approaches are complementary in many respects, and hybrid methods that combine ideas from both paradigms continue to emerge.
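The projection at the heart of splatting can be sketched in NumPy: a 3D Gaussian's covariance is mapped to a 2D image-plane covariance via the Jacobian of the perspective projection (the EWA-splatting approximation used by 3DGS; the covariance, camera-space position, and focal length below are illustrative):

```python
import numpy as np

def project_covariance(cov3d, mean_cam, focal):
    """Project a 3D Gaussian's covariance to a 2D image-plane covariance.

    Linearizes the perspective projection around the Gaussian's center:
    Sigma_2D = J * Sigma_3D * J^T, where J is the projection Jacobian.
    """
    x, y, z = mean_cam                  # Gaussian center in camera space
    J = np.array([[focal / z, 0.0, -focal * x / z ** 2],
                  [0.0, focal / z, -focal * y / z ** 2]])
    return J @ cov3d @ J.T

cov3d = np.diag([0.04, 0.01, 0.09])     # axis-aligned 3D Gaussian (illustrative)
cov2d = project_covariance(cov3d, mean_cam=np.array([0.5, -0.2, 4.0]),
                           focal=800.0)
```

Because each Gaussian projects independently, this step maps cleanly onto GPU rasterization hardware, which is what makes the 100+ FPS rendering figures achievable.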
Several standard datasets are used to evaluate NeRF methods. These benchmarks span synthetic and real-world scenes with varying complexity.
The NeRF-Synthetic dataset, introduced alongside the original NeRF paper, consists of 8 synthetic scenes rendered using Blender's Cycles path tracer: Chair, Drums, Ficus, Hotdog, Lego, Materials, Mic, and Ship. Each scene provides 100 training views, 100 validation views, and 200 test views at 800 x 800 resolution, with cameras placed on a hemisphere around each object on a white background. This dataset remains the most widely used benchmark for evaluating NeRF methods on synthetic data.
The LLFF dataset, introduced by Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar, contains 8 forward-facing real-world scenes captured with a cellphone. The camera motion is roughly planar (forward-facing), with relatively small variation in viewpoint. This dataset tests a method's ability to interpolate between nearby views of real scenes.
The Tanks and Temples benchmark, introduced by Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun, provides large-scale real-world scenes captured in realistic conditions with ground-truth 3D geometry from an industrial laser scanner. The dataset includes indoor and outdoor scenes of varying complexity and is divided into training, intermediate, and advanced subsets. It is commonly used to evaluate NeRF methods on challenging real-world geometry.
The mip-NeRF 360 dataset, introduced alongside the Mip-NeRF 360 paper, contains 9 unbounded real-world scenes (5 outdoor and 4 indoor) captured with inward-facing cameras that orbit around a central point of interest. The scenes include complex geometry, reflective surfaces, and fine detail, making this dataset one of the most challenging benchmarks for novel view synthesis. It has become a standard evaluation dataset for state-of-the-art methods.
Nerfstudio is a modular, open-source PyTorch-based framework for NeRF development, created by Matthew Tancik and collaborators at UC Berkeley's BAIR lab. The framework was presented at SIGGRAPH 2023 and published in ACM Transactions on Graphics.
Nerfstudio provides:
- Modular implementations of multiple NeRF variants (including Nerfacto) with interchangeable components
- A real-time web-based viewer for inspecting scenes during and after training
- Command-line tools for processing captured images and video, including COLMAP-based pose estimation
- Export utilities for point clouds, meshes, and rendered camera-path videos
Nerfstudio has become the standard open-source platform for NeRF research and practical applications, with an active community and frequent updates incorporating new methods.
NeRF and its variants have found applications across many domains.
The original and most direct application of NeRF is generating photorealistic images of a scene from viewpoints not present in the training data. This has applications in photography, filmmaking, and virtual tours. Products from companies like Luma AI allow users to capture NeRF scenes with a smartphone and share interactive 3D representations on the web.
While NeRF was originally designed for view synthesis rather than explicit 3D geometry extraction, the learned density field implicitly encodes scene geometry. Surfaces can be extracted using marching cubes or similar algorithms applied to the density field. Methods like NeuS and VolSDF specifically optimize NeRF-like representations to produce high-quality surface reconstructions with signed distance functions.
Virtual reality and augmented reality applications benefit from NeRF's ability to generate photorealistic views of real-world environments from arbitrary camera positions. Real-time NeRF variants like Instant-NGP and 3DGS have made interactive VR/AR experiences based on neural scene representations feasible. Users can walk through reconstructed environments with six degrees of freedom.
NeRF has significant applications in autonomous driving simulation and testing. Block-NeRF demonstrated city-scale reconstruction for driving simulation using Waymo data. Several specialized methods have been developed for this domain, including DriveEnv-NeRF for creating simulation environments under varying lighting conditions, and Lightning NeRF for efficient outdoor scene reconstruction. These tools allow autonomous driving systems to be tested in photorealistic simulated environments derived from real-world data.
In robotics, NeRF provides dense scene representations that can be used for navigation, manipulation planning, and collision avoidance. Robots can build NeRF models of their environment from onboard camera observations and use the resulting 3D representations for path planning and object interaction. The differentiable nature of NeRF also allows integration into end-to-end learned robotic systems.
NeRF-based techniques have been adapted for medical imaging applications, including synthesizing novel views from limited CT or MRI scan data. This can potentially reduce radiation exposure for patients by requiring fewer scans while still providing clinicians with the views they need for diagnosis.
NeRF enables photorealistic capture and relighting of real-world objects and scenes for use in games, films, and advertising. Methods like Ref-NeRF improve the handling of reflective surfaces, making captured objects more amenable to relighting in new environments. The ability to capture real objects and place them in virtual scenes bridges the gap between real and synthetic content.
NeRF has been applied to create detailed 3D models of urban environments, historical buildings, and cultural heritage sites. The photorealistic quality of NeRF reconstructions, combined with the relatively lightweight capture requirements (just photographs from multiple angles), makes it an attractive tool for digital preservation of physical spaces.
Successful NeRF reconstruction depends on several practical factors:
- Capture coverage: images should view the scene from many well-distributed angles with substantial overlap between neighboring views.
- Accurate camera poses: errors from the SfM step propagate directly into the reconstruction.
- A static scene: moving objects, changing lighting, and varying exposure violate the model's assumptions unless a variant such as NeRF-W is used.
- Sharp, well-exposed images: motion blur and sensor noise degrade both pose estimation and reconstruction quality.
- Surface properties: textureless, transparent, and highly reflective surfaces remain challenging.
| Year | Development |
|---|---|
| 2020 | NeRF (Mildenhall et al.) introduced at ECCV 2020 |
| 2020 | Fourier Features paper (Tancik et al.) provides theoretical grounding for positional encoding |
| 2021 | Mip-NeRF introduces anti-aliased rendering with conical frustums (ICCV 2021) |
| 2021 | NeRF in the Wild enables reconstruction from internet photo collections (CVPR 2021) |
| 2021 | D-NeRF extends to dynamic scenes (CVPR 2021) |
| 2021 | PlenOctrees and KiloNeRF achieve real-time NeRF rendering through caching and distillation |
| 2022 | Instant-NGP achieves 1000x training speedup with hash encoding (SIGGRAPH 2022) |
| 2022 | Mip-NeRF 360 handles unbounded 360-degree scenes (CVPR 2022) |
| 2022 | Block-NeRF demonstrates city-scale reconstruction (CVPR 2022) |
| 2022 | TensoRF introduces tensor factorization for efficient radiance fields (ECCV 2022) |
| 2023 | 3D Gaussian Splatting emerges as a competing paradigm (SIGGRAPH 2023) |
| 2023 | Zip-NeRF combines anti-aliasing and grid-based speed (ICCV 2023) |
| 2023 | Nerfstudio provides a standardized open-source framework (SIGGRAPH 2023) |
NeRF belongs to the broader family of neural scene representations, sometimes called neural fields or coordinate-based neural networks. Related approaches include: