Neural architecture search (NAS) is a technique for automating the design of neural network architectures. Rather than relying on human experts to hand-craft network topologies, NAS methods use algorithms to explore a defined space of possible architectures, evaluate candidates according to a performance metric, and return high-performing designs with minimal human intervention. The field was catalyzed by a 2017 paper from Barret Zoph and Quoc V. Le at Google Brain, which demonstrated that a reinforcement learning-based controller could discover convolutional neural network architectures competitive with those designed by human researchers. Since then, NAS has grown into a major subfield of automated machine learning (AutoML), producing architectures such as NASNet, AmoebaNet, EfficientNet, and MobileNetV3 that have set accuracy and efficiency records across a range of benchmarks.
NAS research addresses three core questions: what architectures to consider (the search space), how to explore that space (the search strategy), and how to estimate the quality of candidate architectures without fully training each one (performance estimation). Progress on all three fronts has reduced the computational cost of architecture search from tens of thousands of GPU days down to a few GPU hours, making the technique practical for a growing number of applications.
Designing neural network architectures has historically been a labor-intensive process that requires deep expertise. The progression from simple multilayer perceptrons to LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet involved years of trial and error, intuition, and careful experimentation by experienced researchers. Each architectural innovation, such as skip connections, inception modules, or batch normalization, required creative insight and extensive validation.
As deep learning expanded into more diverse tasks and hardware platforms, the manual design process became a bottleneck. Different tasks (image classification, object detection, semantic segmentation, language modeling) and different deployment targets (cloud GPUs, mobile phones, edge devices) call for different architectural trade-offs. Designing a separate architecture for each combination of task and target is impractical at scale.
Neural architecture search addresses this problem by framing architecture design as an optimization problem that can be solved algorithmically. The goal is to find an architecture that maximizes a given objective (typically validation accuracy, sometimes combined with latency or model size constraints) within a predefined search space. This framing connects architecture design to the broader AutoML agenda of reducing the human effort required to build effective machine learning systems.
The paper that brought NAS into the spotlight was "Neural Architecture Search with Reinforcement Learning" by Barret Zoph and Quoc V. Le, published at ICLR 2017. The central idea was to use a recurrent neural network (RNN) as a controller that generates descriptions of candidate neural network architectures. The controller is trained with reinforcement learning: specifically, the REINFORCE policy gradient algorithm. The reward signal is the validation accuracy of the generated architecture after it has been trained on a target dataset.
The controller RNN generates architecture descriptions one decision at a time. For convolutional networks, the controller outputs a sequence of tokens specifying, for each layer, the filter height, filter width, stride height, stride width, and number of filters. For recurrent networks, the controller generates the topology and activation functions of a recurrent cell.
At each step, the controller samples from a softmax distribution over possible choices. Once a complete architecture has been generated, it is trained from scratch on the target dataset, and its validation accuracy serves as the reward for updating the controller's parameters via the REINFORCE algorithm with a moving average baseline to reduce variance.
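The controller-plus-REINFORCE loop can be sketched in miniature. The snippet below is a toy illustration, not the paper's implementation: the controller is reduced to independent softmax logits per decision (no RNN), and the expensive "train the child network, measure validation accuracy" step is replaced by a hypothetical reward that simply prefers choice 2 at every position.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TinyController:
    """Toy stand-in for the controller RNN: one independent softmax per
    architectural decision instead of a shared recurrent state."""
    def __init__(self, choices_per_step, lr=0.1):
        self.logits = [np.zeros(c) for c in choices_per_step]
        self.lr = lr
        self.baseline = 0.0  # moving-average baseline reduces gradient variance

    def sample(self):
        return [int(rng.choice(len(l), p=softmax(l))) for l in self.logits]

    def update(self, actions, reward):
        advantage = reward - self.baseline
        for logits, a in zip(self.logits, actions):
            grad = -softmax(logits)   # d log p(a) / d logits ...
            grad[a] += 1.0            # ... equals one_hot(a) - softmax(logits)
            logits += self.lr * advantage * grad  # REINFORCE ascent step
        self.baseline = 0.9 * self.baseline + 0.1 * reward

# Hypothetical stand-in for "train the child and measure validation accuracy":
# the reward is the fraction of decisions equal to choice 2.
controller = TinyController(choices_per_step=[4, 4, 4])
for _ in range(3000):
    arch = controller.sample()
    controller.update(arch, reward=sum(a == 2 for a in arch) / 3.0)

print([int(np.argmax(l)) for l in controller.logits])
```

After a few thousand updates the controller concentrates its probability mass on the rewarded choices, which is the same mechanism, at a vastly smaller scale, by which the real controller learns to emit high-accuracy architectures.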
On CIFAR-10, the NAS-designed convolutional network achieved a test error rate of 3.65%, which was 0.09 percentage points better than the previous best result using a comparable architectural scheme. On the Penn Treebank language modeling benchmark, the method discovered a novel recurrent cell that outperformed the standard LSTM cell and other baselines.
The computational cost was enormous by any standard. The search required 800 GPUs running in parallel for 28 days, amounting to roughly 22,400 GPU days. This expense limited early NAS research to well-resourced laboratories, but it also motivated a sustained effort to reduce search costs, which has been one of the defining themes of the field.
Every NAS method can be decomposed into three components: the search space, the search strategy, and the performance estimation strategy. These components interact with each other and collectively determine the efficiency and effectiveness of the search.
The search space defines the set of architectures that the NAS algorithm can consider. A well-designed search space balances expressiveness (the ability to represent high-performing architectures) against tractability (keeping the space small enough to search efficiently).
In the earliest NAS work, the search space was defined at the level of the entire network. The algorithm decided the type of operation at each layer, along with hyperparameters such as kernel size, number of filters, and stride. This is called a macro search space because decisions are made about the global structure of the network. Macro search spaces can represent a wide variety of architectures, but they are very large and computationally expensive to explore.
To reduce the size of the search space, Zoph et al. (2018) introduced the cell-based or micro search space in the NASNet paper. Instead of searching for the entire network, the algorithm searches for a small, reusable building block called a cell. A cell is a directed acyclic graph (DAG) where each node represents a feature map and each edge represents an operation (such as a 3x3 convolution, 5x5 separable convolution, max pooling, or identity mapping).
Two types of cells are typically searched:

- A normal cell, which preserves the spatial resolution of its input.
- A reduction cell, which reduces the spatial resolution, typically by applying its initial operations with a stride of two.
The full network is then constructed by stacking copies of these cells in a predefined pattern. This approach has two major advantages. First, it reduces the search space from an exponentially large space of full networks to a much smaller space of cell structures. Second, it enables transferability: cells discovered on a small dataset like CIFAR-10 can be stacked into deeper networks for larger datasets like ImageNet, avoiding the cost of searching directly on the larger dataset.
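The cell abstraction can be sketched with toy scalar operations standing in for convolutions and pooling; the op names and arithmetic below are illustrative only, but the DAG structure mirrors the real search space: each new node combines two earlier nodes, each transformed by a chosen operation.

```python
# Toy scalar ops stand in for 3x3 conv, separable conv, pooling, identity, etc.
OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0 * x,
}

def eval_cell(cell, x_prev, x_curr):
    """cell: list of (op_a, src_a, op_b, src_b) tuples.
    Nodes 0 and 1 are the cell's two inputs; each later node is
    op_a(nodes[src_a]) + op_b(nodes[src_b]). Sources must point at
    earlier nodes, so the graph is acyclic by construction."""
    nodes = [x_prev, x_curr]
    for op_a, src_a, op_b, src_b in cell:
        assert src_a < len(nodes) and src_b < len(nodes)
        nodes.append(OPS[op_a](nodes[src_a]) + OPS[op_b](nodes[src_b]))
    return nodes[-1]

# A two-node cell: node 2 = identity(n0) + double(n1); node 3 = double(n2) + zero(n0)
cell = [("identity", 0, "double", 1), ("double", 2, "zero", 0)]
print(eval_cell(cell, 1.0, 2.0))  # node 2 = 1 + 4 = 5; node 3 = 10 + 0 = 10
```

The search algorithm's job is to choose the tuples in `cell`; the full network is then built by stacking many copies of the discovered cell.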
Some methods combine macro and micro search. In a hierarchical search space, the lower level defines the micro-structure of cells or blocks, while the upper level controls macro-level decisions such as the number of cells per stage, where to place reduction cells, and the channel width at each stage. MnasNet (Tan et al., 2019) used a factorized hierarchical search space in which the network was divided into seven blocks, each with independently searched architectures, allowing different parts of the network to use different operations and configurations.
The search strategy determines how the NAS algorithm navigates the search space. Three broad families of search strategies have emerged: reinforcement learning-based, evolutionary, and gradient-based.
The original NAS paper used reinforcement learning, with the controller RNN generating architectures and the REINFORCE algorithm updating the controller based on validation accuracy rewards. Subsequent RL-based methods refined this approach. MnasNet used Proximal Policy Optimization (PPO) instead of REINFORCE, and added a multi-objective reward that incorporated real-world latency measured on a target device.
RL-based methods are flexible and can optimize for complex, non-differentiable objectives. Their main drawback is sample inefficiency: they typically need to evaluate thousands of candidate architectures before converging, which can be very expensive.
Evolutionary algorithms maintain a population of architectures and iteratively improve them through mutation and selection. In the context of NAS, a mutation might change the operation at a given edge in a cell (for example, replacing a 3x3 convolution with a 5x5 separable convolution), add or remove a connection, or modify a hyperparameter.
The most influential evolutionary NAS method is regularized evolution, introduced by Real et al. (2019) in the paper that produced AmoebaNet. Regularized evolution modifies standard tournament selection by introducing an aging mechanism: the oldest individual in the population is removed at each step, regardless of its fitness. This prevents the population from being dominated by early high-fitness individuals and encourages continued exploration.
In controlled experiments using the same search space (the NASNet search space), Real et al. found that evolutionary search and RL-based search achieved similar final accuracy, but evolution had better anytime performance, meaning it found good architectures faster during the search process. Evolution also tended to find smaller models at equivalent accuracy levels.
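The aging mechanism is simple enough to sketch end to end. The following is a toy illustration of the regularized-evolution loop, with bit-strings standing in for architectures and the count of ones standing in for validation accuracy; the population, cycle, and tournament sizes are illustrative, not the paper's settings.

```python
import collections
import random

random.seed(0)

def regularized_evolution(fitness, random_arch, mutate,
                          pop_size=50, cycles=500, sample_size=10):
    """Aging-evolution sketch (after Real et al., 2019): tournament selection
    plus removal of the oldest individual each cycle, regardless of fitness."""
    population = collections.deque()
    history = []
    for _ in range(pop_size):
        arch = random_arch()
        population.append((arch, fitness(arch)))
    history.extend(population)
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = max(tournament, key=lambda pair: pair[1])
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()  # the aging step: oldest out, fit or not
    return max(history, key=lambda pair: pair[1])

# Toy "search space": bit-strings scored by their number of ones; a mutation
# flips each bit with small probability (standing in for swapping one op).
N = 20
best_arch, best_fit = regularized_evolution(
    fitness=sum,
    random_arch=lambda: [random.randint(0, 1) for _ in range(N)],
    mutate=lambda arch: [bit ^ (random.random() < 0.05) for bit in arch],
)
print(best_fit)
```

Because every individual eventually ages out, a lucky early architecture cannot dominate the population forever; lineages survive only by repeatedly producing fit children, which is what sustains exploration.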
Gradient-based methods reformulate the discrete architecture search problem as a continuous optimization problem that can be solved with gradient descent. The most prominent method in this category is DARTS (Differentiable Architecture Search), proposed by Liu, Simonyan, and Yang in 2019 (ICLR).
DARTS works by constructing a mixed operation at each edge of the cell DAG. Instead of selecting a single operation, DARTS places a weighted combination of all candidate operations at each edge. The weights (called architecture parameters) are continuous and learned jointly with the network weights through backpropagation. During the search, the architecture parameters and the network weights are optimized in an alternating fashion using a bi-level optimization scheme: the network weights are updated on the training set, and the architecture parameters are updated on the validation set.
Once the search is complete, the final discrete architecture is obtained by selecting the operation with the highest architecture parameter weight at each edge.
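The continuous relaxation and the final discretization can be shown in a few lines. The operations below are toy scalar functions, not real network layers, and the alpha values are arbitrary:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operations on one edge of the cell DAG (stand-ins for
# convolution, pooling, identity, and the "zero" op).
OPS = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]

def mixed_op(x, alpha):
    """DARTS continuous relaxation: every candidate op runs, and the outputs
    are blended with softmax(alpha), making the choice differentiable."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, OPS))

alpha = np.array([0.0, 3.0, -1.0])  # architecture parameters for this edge
y = mixed_op(2.0, alpha)            # dominated by OPS[1], whose alpha is largest

# After the search, discretize: keep only the op with the largest alpha.
chosen = OPS[int(np.argmax(alpha))]
print(y, chosen(2.0))
```

In the real method, `alpha` receives gradients through the blended output on the validation set while the ops' weights receive gradients on the training set, which is exactly the bi-level scheme described above.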
DARTS reduced the search cost on CIFAR-10 to approximately 1.5 GPU days (on a single GPU), compared to 2,000 GPU days for NASNet and 3,150 GPU days for AmoebaNet. It achieved a test error of 2.83% on CIFAR-10, competitive with the best RL and evolutionary results. However, DARTS has known stability issues: the bi-level optimization can sometimes collapse to trivial architectures dominated by skip connections. Follow-up methods such as P-DARTS, PC-DARTS, and FairDARTS have proposed various fixes for these stability problems.
Training each candidate architecture to full convergence is the most expensive part of the NAS pipeline. Performance estimation strategies aim to reduce this cost by approximating the true performance of an architecture without fully training it.
The simplest approach, used in the original NAS paper, trains every candidate architecture from scratch for a fixed number of epochs and uses the resulting validation accuracy as the performance estimate. This provides an accurate signal but is extremely expensive.
Low-fidelity methods reduce evaluation cost by training on a smaller dataset, training for fewer epochs, training a smaller version of the architecture (fewer layers or channels), or using lower-resolution inputs. The assumption is that the relative ranking of architectures is preserved even when using cheaper proxy evaluations. While this assumption does not always hold perfectly, low-fidelity estimates have proven effective in practice.
The most impactful cost reduction technique has been weight sharing, introduced by ENAS (Efficient Neural Architecture Search) from Pham et al. (2018). Instead of training each candidate architecture from scratch, ENAS constructs a single large supernet (also called an over-parameterized network) that contains all possible architectures as subgraphs. The supernet's weights are shared across all candidate architectures, so training the supernet simultaneously trains all candidates.
The controller then samples architectures from the supernet and evaluates them using the shared weights, without any additional training. This reduced the cost of NAS by roughly 1,000x compared to the original approach, bringing the search cost down to about 0.5 GPU days on CIFAR-10.
One-shot NAS methods extend this idea. In a one-shot approach, the supernet is trained once (the "one shot"), and then individual architectures are extracted and evaluated using the inherited weights. The training of the supernet and the search for the best architecture are decoupled into two separate phases.
The main challenge with weight sharing is weight coupling: since all architectures share the same weights, the shared weights may not accurately reflect the performance of any individual architecture when trained independently. Despite this limitation, weight sharing has become the dominant paradigm in modern NAS due to its dramatic cost savings.
Surrogate-based methods train a predictor (such as a Gaussian process, random forest, or neural network) to estimate architecture performance based on architectural features. The predictor is trained on a small set of fully evaluated architectures and then used to cheaply estimate the performance of new candidates. This approach is especially useful when combined with Bayesian optimization or evolutionary search.
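As a sketch of the surrogate idea, the snippet below fits a closed-form ridge-regression predictor on a handful of hypothetical (architecture, accuracy) pairs, encoding each architecture as a one-hot vector of its per-edge operation choices. The architectures and accuracy numbers are invented for illustration:

```python
import numpy as np

def encode(arch, n_ops=3):
    """One-hot encode an architecture given as one op index per edge."""
    v = np.zeros(len(arch) * n_ops)
    for edge, op in enumerate(arch):
        v[edge * n_ops + op] = 1.0
    return v

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical fully evaluated architectures (3 edges, 3 candidate ops each)
# and their measured validation accuracies - invented numbers.
archs = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0], [2, 2, 2], [1, 2, 1]]
accs = np.array([0.91, 0.93, 0.89, 0.90, 0.88, 0.94])

X = np.vstack([encode(a) for a in archs])
w = fit_ridge(X, accs)

# Score a new candidate in microseconds, with no training at all.
print(round(float(encode([1, 1, 1]) @ w), 3))
```

A real surrogate would be trained on hundreds of evaluated architectures and refreshed as the search discovers new ones, but the workflow is the same: pay the full training cost only for a small sample, and rank everything else with the predictor.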
More recently, researchers have explored zero-cost proxies that estimate architecture quality without any training at all. These methods compute statistics of the randomly initialized network (such as gradient norms, the number of linear regions, or the Jacobian of the network) and use them as a proxy for trained performance. While not as accurate as training-based estimates, zero-cost proxies can evaluate thousands of architectures per second and are useful for pruning the search space before applying more expensive evaluation.
Several NAS-discovered architectures have had significant impact on the field. The following table summarizes the most notable ones.
| Architecture | Year | Search Strategy | Search Space | Search Cost (GPU Days) | Key Result |
|---|---|---|---|---|---|
| NAS (Zoph & Le) | 2017 | Reinforcement learning | Macro | ~22,400 | 3.65% error on CIFAR-10 |
| NASNet | 2018 | Reinforcement learning | Cell-based | ~2,000 | 82.7% top-1 on ImageNet |
| AmoebaNet-A | 2019 | Regularized evolution | Cell-based (NASNet) | ~3,150 | 83.9% top-1 on ImageNet (scaled) |
| ENAS | 2018 | RL + weight sharing | Cell-based | ~0.5 | 2.89% error on CIFAR-10 |
| DARTS | 2019 | Gradient-based | Cell-based | ~1.5 | 2.83% error on CIFAR-10 |
| MnasNet | 2019 | RL (PPO) | Factorized hierarchical | ~3,800 | 75.2% top-1 on ImageNet, 78ms latency |
| ProxylessNAS | 2019 | Gradient-based + RL | MBConv-based (layer-wise) | ~8.3 | 75.1% top-1 on ImageNet (mobile) |
| FBNet | 2019 | Gradient-based | MBConv-based (layer-wise) | ~9 | 74.9% top-1 on ImageNet (mobile) |
| EfficientNet-B0 | 2019 | RL (MnasNet-based) | MBConv-based | ~3,800 | 77.1% top-1 on ImageNet |
| MobileNetV3 | 2019 | RL + NetAdapt | MBConv-based | N/A | 75.2% top-1, 22ms latency (Pixel phone) |
NASNet was introduced in 2018 by Zoph, Vasudevan, Shlens, and Le in the paper "Learning Transferable Architectures for Scalable Image Recognition," published at CVPR 2018. NASNet's key contribution was the cell-based search space and the demonstration of transferability. The search was conducted on CIFAR-10 using 500 GPUs over 4 days, and the discovered cells were then stacked into a deeper network for ImageNet.
NASNet-A (Large), with 88.9 million parameters, achieved 82.7% top-1 accuracy on ImageNet, surpassing the best human-designed architectures of the time. The paper also introduced ScheduledDropPath, a regularization technique that randomly drops paths within each cell with a probability that increases linearly over the course of training.
NASNet demonstrated that searching on a small proxy dataset and transferring to a larger target dataset was a viable strategy for reducing search costs. This transfer approach has since become standard practice in NAS research.
AmoebaNet was produced by the regularized evolution method of Real, Aggarwal, Huang, and Le, published at AAAI 2019. The paper used the same NASNet search space as a controlled comparison between evolutionary and RL-based search strategies.
AmoebaNet-A achieved accuracy comparable to NASNet when matched for model size, and when scaled up to a larger configuration (469 million parameters), it achieved 83.9% top-1 / 96.6% top-5 accuracy on ImageNet, setting a new record at the time. The search consumed 3,150 GPU days using 450 GPUs over 7 days.
The key finding was that regularized evolution matched RL in final accuracy but had better anytime performance and was simpler to implement. The aging mechanism in regularized evolution also helped the search explore more diverse regions of the search space by preventing stagnation.
EfficientNet, introduced by Tan and Le at ICML 2019, combined NAS with a novel compound scaling method. The baseline architecture, EfficientNet-B0, was discovered using an RL-based search procedure adapted from MnasNet, with a multi-objective reward balancing accuracy and FLOPs. The search space consisted of mobile inverted bottleneck convolution (MBConv) blocks.
The compound scaling method then scaled B0 uniformly across three dimensions (depth, width, and input resolution) using a single compound coefficient. This produced a family of models from B0 to B7 that achieved consistently better accuracy-to-efficiency trade-offs than previous architectures.
EfficientNet-B7 achieved 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster on inference than the best existing convolutional network at the time. The work demonstrated that NAS could produce not just a single architecture but a principled family of models spanning a wide range of computational budgets.
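The compound scaling rule itself is compact enough to sketch. The constants below (alpha = 1.2, beta = 1.1, gamma = 1.15) are the grid-searched values reported in the paper; note that the released B1-B7 models round their input resolutions by hand, so this formula only approximates them.

```python
# EfficientNet compound scaling: scale depth, width, and resolution together
# with a single compound coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # chosen so alpha * beta^2 * gamma^2 ~ 2

def compound_scale(phi, base_resolution=224):
    depth_mult = ALPHA ** phi                            # more layers
    width_mult = BETA ** phi                             # more channels
    resolution = round(base_resolution * GAMMA ** phi)   # larger inputs
    return depth_mult, width_mult, resolution

# FLOPs grow roughly as (alpha * beta^2 * gamma^2)^phi ~ 2^phi, so each
# increment of phi approximately doubles the compute budget.
for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}px")
```

The design choice worth noting is that a single coefficient controls all three dimensions jointly; scaling only one (say, depth alone) saturates much sooner than balanced scaling.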
MobileNetV3, published by Howard et al. at ICCV 2019, combined hardware-aware NAS with manual architectural refinements. The design process had two phases. First, a platform-aware NAS approach (based on MnasNet) searched for a global network structure optimized for latency on a Pixel phone. Second, the NetAdapt algorithm performed fine-grained, layer-wise optimization to further reduce latency while maintaining accuracy.
On top of the NAS-discovered structure, the authors made several manual modifications: redesigning the expensive initial and final layers, introducing the hard-swish activation function (a computationally cheaper approximation of swish), and adding squeeze-and-excitation modules to certain layers.
Two variants were released: MobileNetV3-Large (for higher-resource use cases) and MobileNetV3-Small (for constrained environments). MobileNetV3-Large achieved 75.2% top-1 accuracy on ImageNet with a latency of 22 milliseconds on a Pixel phone. Compared to MobileNetV2, MobileNetV3-Large was 3.2 percentage points more accurate while being 15% faster, and MobileNetV3-Small was 4.6 percentage points more accurate while being 5% faster.
MobileNetV3 is notable as an example of NAS and human design working in combination rather than as alternatives. The NAS component discovered the high-level structure, while human engineers refined the details based on domain knowledge about mobile hardware.
Traditional NAS optimizes for accuracy alone or uses FLOPs as a proxy for computational cost. However, FLOPs do not perfectly correlate with real-world inference latency because different operations have different hardware utilization characteristics. A depthwise separable convolution may have fewer FLOPs than a standard convolution but run slower on certain hardware due to lower arithmetic intensity.
Hardware-aware NAS addresses this gap by incorporating actual hardware measurements (latency, energy consumption, or memory usage) directly into the search objective. Instead of minimizing FLOPs, the search minimizes (or constrains) the actual inference time on a target device.
MnasNet (Tan et al., CVPR 2019) was one of the first NAS methods to explicitly incorporate real-world latency. The search used a multi-objective reward function:
Reward = ACC(m) × [LAT(m) / T]^w
where ACC(m) is the model's accuracy, LAT(m) is the measured latency on a Pixel phone, T is a target latency, and w is a weight factor (set to -0.07) that controls the trade-off. The latency was measured by running each candidate model on an actual Pixel phone rather than relying on a theoretical estimate.
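The reward is straightforward to compute. The sketch below uses an illustrative 75 ms target rather than any value taken from the paper:

```python
def mnas_reward(acc, latency_ms, target_ms=75.0, w=-0.07):
    """Soft-constraint form of the MnasNet objective: ACC(m) * (LAT(m)/T)^w.
    With w < 0, models slower than the target T are penalized smoothly
    rather than rejected outright. target_ms here is illustrative."""
    return acc * (latency_ms / target_ms) ** w

print(mnas_reward(0.75, 75.0))   # exactly at target: reward equals accuracy
print(mnas_reward(0.75, 150.0))  # 2x over target: scaled by 2**-0.07, ~0.953
```

Because the penalty is a smooth power law rather than a hard cutoff, the controller can still get credit for a slightly-too-slow model that is much more accurate, which lets the search trace out the accuracy-latency trade-off curve.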
MnasNet achieved 75.2% top-1 accuracy on ImageNet with 78 milliseconds of latency on a Pixel phone, which was 1.8x faster than MobileNetV2 at 0.5 percentage points higher accuracy and 2.3x faster than NASNet at 1.2 percentage points higher accuracy.
ProxylessNAS (Cai, Zhu, and Han, ICLR 2019) took hardware-aware NAS further by searching directly on the target task and hardware without proxy datasets or proxy architectures. The method used a gradient-based search on an over-parameterized supernet, with path-level binarization to reduce memory consumption. ProxylessNAS could target different hardware platforms (GPU, CPU, or mobile) by simply changing the latency measurement in the objective function.
The search cost was approximately 8.3 GPU days, and the resulting architectures achieved competitive accuracy with significantly lower latency than previous methods. ProxylessNAS cut search cost by roughly 200x compared to earlier hardware-aware methods.
Hardware-aware NAS has been adopted by major technology companies for production deployment. The core insight, that the search objective should reflect the actual deployment constraints rather than a proxy metric, has influenced how practitioners think about model design more broadly. Methods such as FBNet, Single-Path NAS, and OFA (Once-for-All) have extended hardware-aware NAS to support multiple hardware targets from a single search.
One-shot NAS methods represent the most significant cost reduction in the field. The central idea is to train a single supernet that encodes all possible architectures in the search space as subgraphs. Once the supernet is trained, individual architectures can be evaluated by extracting the corresponding subgraph and using the inherited shared weights, without any additional training.
A supernet is an over-parameterized network where each edge in the computational graph contains all candidate operations. During training, different subgraphs (corresponding to different architectures) are activated and trained. The weights are shared across all subgraphs, meaning that an operation at a given position in the network uses the same weights regardless of which architecture it appears in.
Several strategies have been proposed for training the supernet:

- Uniform single-path sampling, in which one randomly chosen subgraph is activated and trained at each step (the approach of Single-Path One-Shot).
- Path dropout, in which random subsets of operations are zeroed out during each forward pass so that no single path is always present.
- Joint optimization with architecture parameters, in which all operations run in a weighted mixture and the weights are learned alongside the network, as in DARTS.
The main challenge with one-shot NAS is the ranking consistency problem: the relative ranking of architectures under shared weights may not match their ranking when trained independently. Weight coupling, where the shared weights represent a compromise across all architectures rather than being optimal for any single one, is the root cause. If the supernet's ranking is unreliable, the search may select suboptimal architectures.
Researchers have addressed this through various techniques, including larger supernets, better training schedules, progressive shrinking, and improved sampling strategies. Despite these challenges, one-shot methods remain the dominant paradigm due to their dramatic cost advantages.
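Uniform single-path sampling, the simplest supernet training strategy, can be sketched as follows. The position count, op count, and scalar "weights" are all toy stand-ins for real layers and their parameters:

```python
import random

random.seed(0)

# Toy supernet: 3 positions ("edges"), each with 3 candidate ops. The shared
# weights are one scalar per (position, op), standing in for that op's kernel.
N_POS, N_OPS = 3, 3
shared_weights = [[0.0] * N_OPS for _ in range(N_POS)]

def sample_path():
    """Uniform single-path sampling: one active op per position per step."""
    return [random.randrange(N_OPS) for _ in range(N_POS)]

def train_step(path):
    # Only the sampled subgraph is touched; every architecture that contains
    # this (position, op) pair inherits the same updated weight later.
    for pos, op in enumerate(path):
        shared_weights[pos][op] += 1.0  # stand-in for one gradient update

for _ in range(900):
    train_step(sample_path())

# Each of the 9 shared weights received roughly 900 / 3 = 300 updates, so all
# 27 possible architectures were "trained" without ever being instantiated.
print(shared_weights)
```

The weight-coupling problem is visible even here: each shared weight is shaped by every path it appears in, so it is a compromise rather than the value any single architecture would learn on its own.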
The reduction in NAS search cost over time is one of the field's most striking achievements. The following table illustrates this progression.
| Method | Year | Search Cost (GPU Days) | Key Cost Reduction Technique |
|---|---|---|---|
| NAS (Zoph & Le) | 2017 | ~22,400 | None (train from scratch) |
| NASNet | 2018 | ~2,000 | Cell-based search space, proxy dataset |
| AmoebaNet | 2019 | ~3,150 | Cell-based search space |
| ENAS | 2018 | ~0.5 | Weight sharing |
| DARTS | 2019 | ~1.5 | Continuous relaxation, gradient-based |
| ProxylessNAS | 2019 | ~8.3 | Path-level binarization |
| P-DARTS | 2019 | ~0.3 | Progressive search depth |
| Single-Path One-Shot | 2020 | ~12 (supernet) + search | Uniform path sampling |
| FairDARTS | 2020 | ~0.4 | Sigmoid-based relaxation |
| Zero-cost proxies | 2021+ | < 0.01 | No training required |
The cost reduction has come from multiple sources:

- Smaller, cell-based search spaces that sharply constrain what must be explored.
- Low-fidelity proxies: searching on small datasets, for fewer epochs, or with shallower networks.
- Weight sharing, which amortizes training across every candidate in a single supernet.
- Continuous relaxation, which replaces thousands of discrete trainings with one gradient-based optimization.
- Zero-cost proxies, which score architectures at initialization without any training.
The following table compares the three main families of NAS search strategies across several dimensions.
| Dimension | Reinforcement Learning | Evolutionary | Gradient-Based |
|---|---|---|---|
| Optimization type | Policy gradient over discrete space | Population-based, mutation and selection | Gradient descent over continuous relaxation |
| Typical search cost | High (thousands of GPU days) | High (thousands of GPU days) | Low (1-4 GPU days) |
| Handling of non-differentiable objectives | Natural (reward shaping) | Natural (fitness function) | Requires differentiable relaxation or surrogate |
| Hardware-aware search | Straightforward (add latency to reward) | Straightforward (add latency to fitness) | Requires differentiable latency model or lookup table |
| Stability | Generally stable | Generally stable | Can collapse to degenerate architectures (e.g., skip-connection-heavy) |
| Multi-objective optimization | Supported via reward shaping | Supported via Pareto front | Requires scalarization or additional constraints |
| Notable methods | NAS, NASNet, MnasNet | AmoebaNet, Large-Scale Evolution | DARTS, ProxylessNAS, FBNet |
Each strategy family has distinct strengths. RL-based methods are flexible and handle complex objectives well. Evolutionary methods are simple to implement, embarrassingly parallel, and robust. Gradient-based methods are fast but require careful handling to avoid instability.
While NAS originated in the context of image classification, it has been applied to a broad range of tasks and domains.
NAS-FPN (Ghiasi, Lin, and Le, CVPR 2019) applied NAS to discover feature pyramid network architectures for object detection. The resulting architecture outperformed the hand-designed FPN on the COCO benchmark. Auto-DeepLab (Liu et al., CVPR 2019) extended NAS to semantic segmentation, searching over both the cell structure and the macro-level network topology.
The original Zoph and Le (2017) paper already demonstrated NAS for recurrent cell design. Subsequent work has applied NAS to discover transformer architectures. The Evolved Transformer (So, Liang, and Le, ICML 2019) used evolutionary NAS to discover a transformer variant that outperformed the standard transformer on machine translation benchmarks while using fewer parameters.
NAS has been applied to automatic speech recognition, keyword spotting, and audio classification. Hardware-aware NAS is particularly relevant for speech applications that need to run on edge devices with strict latency constraints.
GraphNAS and related methods apply architecture search to graph neural networks, searching over aggregation functions, neighborhood sampling strategies, and layer configurations for tasks like node classification and graph classification.
NAS has shown promise in medical imaging tasks such as pathology classification, tumor segmentation, and radiology analysis, where the optimal architecture may differ significantly from those designed for natural image datasets.
Standardized benchmarks have been important for fair comparison of NAS methods. The most widely used benchmarks include:
| Benchmark | Year | Description | Search Space Size |
|---|---|---|---|
| NAS-Bench-101 | 2019 | Tabular benchmark with all architectures in a cell-based space evaluated on CIFAR-10 | 423,624 architectures |
| NAS-Bench-201 | 2020 | Smaller cell-based space evaluated on CIFAR-10, CIFAR-100, and ImageNet-16-120 | 15,625 architectures |
| NAS-Bench-301 | 2021 | Surrogate benchmark for the DARTS search space | ~10^18 architectures |
| TransNAS-Bench-101 | 2021 | Benchmark for transfer NAS across multiple tasks | 7,352 architectures |
These benchmarks enable researchers to evaluate NAS algorithms without the cost of actually training architectures, because the ground-truth performance of every architecture in the search space has been precomputed. This has accelerated research and improved reproducibility.
Despite significant progress, NAS faces several ongoing challenges.
Many NAS papers report results with high variance, and small differences in training protocols (learning rate schedules, data augmentation, random seeds) can significantly affect the reported accuracy. Some studies have found that random search within the same search space can match or exceed the performance of sophisticated NAS algorithms, raising questions about how much of the reported improvement comes from the search algorithm versus the search space design.
Most NAS research has focused on convolutional architectures for image classification. The search spaces, operations, and evaluation protocols are heavily tailored to this setting. Applying NAS to new domains (such as transformer architectures for language or multimodal models) often requires designing new search spaces from scratch, which itself requires expertise.
Weight sharing introduces a gap between the performance predicted by the supernet and the true performance of a standalone architecture. Improving the ranking consistency of supernets remains an active research area.
Early NAS methods consumed enormous amounts of energy. While modern methods are far more efficient, the total computational cost of NAS research across the community is substantial. Recent work, such as CE-NAS, has begun to incorporate carbon efficiency into the search objective.
As the field shifts toward large pretrained foundation models, the role of NAS is evolving. Rather than searching for architectures from scratch, NAS may be used to design efficient adapters, pruning strategies, or fine-tuning configurations for existing large models. The intersection of NAS with large language models is an emerging area of research.
| Year | Milestone |
|---|---|
| 2017 | Zoph and Le publish the foundational NAS paper using RL (ICLR 2017) |
| 2017 | Real et al. demonstrate large-scale evolutionary NAS |
| 2018 | NASNet introduces the cell-based search space and transferable architectures (CVPR 2018) |
| 2018 | ENAS introduces weight sharing, reducing cost by 1,000x (ICML 2018) |
| 2019 | DARTS introduces gradient-based search (ICLR 2019) |
| 2019 | AmoebaNet demonstrates regularized evolution (AAAI 2019) |
| 2019 | MnasNet introduces hardware-aware NAS with real latency measurements (CVPR 2019) |
| 2019 | EfficientNet combines NAS with compound scaling (ICML 2019) |
| 2019 | MobileNetV3 combines NAS with NetAdapt and manual refinements (ICCV 2019) |
| 2019 | ProxylessNAS enables direct search on target task and hardware (ICLR 2019) |
| 2019 | NAS-Bench-101 provides the first tabular NAS benchmark |
| 2020 | Once-for-All enables deployment-time architecture specialization |
| 2021+ | Zero-cost proxies eliminate the need for training during search |
| 2024+ | LLM-guided NAS and carbon-efficient NAS emerge as new research directions |
Neural architecture search has transformed the design of neural networks from a manual art into an algorithmic optimization problem. Starting with the computationally intensive RL-based approach of Zoph and Le in 2017, the field has developed increasingly efficient methods, from evolutionary search (AmoebaNet) to gradient-based optimization (DARTS) to weight-sharing supernets (ENAS, One-Shot NAS). These advances have reduced search costs from tens of thousands of GPU days to single-digit GPU hours or even minutes.
NAS has produced architectures, including NASNet, AmoebaNet, EfficientNet, and MobileNetV3, that matched or surpassed the best human-designed networks on major benchmarks. Hardware-aware NAS has further extended the technique by optimizing for real-world deployment constraints, not just accuracy. While challenges remain around reproducibility, search space generalization, and the environmental cost of large-scale search, NAS continues to evolve, with recent work exploring its integration with large language models, zero-cost evaluation proxies, and carbon-efficient optimization.