Neural architecture search (NAS) is a technique for automating the design of neural network architectures. Rather than relying on human experts to hand-craft network topologies, NAS methods use algorithms to explore a defined space of possible architectures, evaluate candidates according to a performance metric, and return high-performing designs with minimal human intervention. The field was catalyzed by a 2017 paper from Barret Zoph and Quoc V. Le at Google Brain, which demonstrated that a reinforcement learning-based controller could discover convolutional neural network architectures competitive with those designed by human researchers. Since then, NAS has grown into a major subfield of automated machine learning (AutoML), producing architectures such as NASNet, AmoebaNet, EfficientNet, and MobileNetV3 that have set accuracy and efficiency records across a range of benchmarks.
NAS research addresses three core questions: what architectures to consider (the search space), how to explore that space (the search strategy), and how to estimate the quality of candidate architectures without fully training each one (performance estimation). Progress on all three fronts has reduced the computational cost of architecture search from tens of thousands of GPU days down to a few GPU hours, making the technique practical for a growing number of applications.
Designing neural network architectures has historically been a labor-intensive process that requires deep expertise. The progression from simple multilayer perceptrons to LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet involved years of trial and error, intuition, and careful experimentation by experienced researchers. Each architectural innovation, such as skip connections, inception modules, or batch normalization, required creative insight and extensive validation.
As deep learning expanded into more diverse tasks and hardware platforms, the manual design process became a bottleneck. Different tasks (image classification, object detection, semantic segmentation, language modeling) and different deployment targets (cloud GPUs, mobile phones, edge devices) call for different architectural trade-offs. Designing a separate architecture for each combination of task and target is impractical at scale.
Neural architecture search addresses this problem by framing architecture design as an optimization problem that can be solved algorithmically. The goal is to find an architecture that maximizes a given objective (typically validation accuracy, sometimes combined with latency or model size constraints) within a predefined search space. This framing connects architecture design to the broader AutoML agenda of reducing the human effort required to build effective machine learning systems.
The paper that brought NAS into the spotlight was "Neural Architecture Search with Reinforcement Learning" by Barret Zoph and Quoc V. Le, published at ICLR 2017. The central idea was to use a recurrent neural network (RNN) as a controller that generates descriptions of candidate neural network architectures. The controller is trained with reinforcement learning: specifically, the REINFORCE policy gradient algorithm. The reward signal is the validation accuracy of the generated architecture after it has been trained on a target dataset.
The controller RNN generates architecture descriptions one decision at a time. For convolutional networks, the controller outputs a sequence of tokens specifying, for each layer, the filter height, filter width, stride height, stride width, and number of filters. For recurrent networks, the controller generates the topology and activation functions of a recurrent cell.
At each step, the controller samples from a softmax distribution over possible choices. Once a complete architecture has been generated, it is trained from scratch on the target dataset, and its validation accuracy serves as the reward for updating the controller's parameters via the REINFORCE algorithm with a moving average baseline to reduce variance.
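The controller-plus-REINFORCE loop can be sketched in miniature. The snippet below is a toy illustration, not the paper's implementation: the controller is reduced to independent softmax logits per decision (no RNN), and the expensive "train the child network, measure validation accuracy" step is replaced by a hypothetical reward that simply prefers choice 2 at every position.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TinyController:
    """Toy stand-in for the controller RNN: one independent softmax per
    architectural decision instead of a shared recurrent state."""
    def __init__(self, choices_per_step, lr=0.1):
        self.logits = [np.zeros(c) for c in choices_per_step]
        self.lr = lr
        self.baseline = 0.0  # moving-average baseline reduces gradient variance

    def sample(self):
        return [int(rng.choice(len(l), p=softmax(l))) for l in self.logits]

    def update(self, actions, reward):
        advantage = reward - self.baseline
        for logits, a in zip(self.logits, actions):
            grad = -softmax(logits)   # d log p(a) / d logits ...
            grad[a] += 1.0            # ... equals one_hot(a) - softmax(logits)
            logits += self.lr * advantage * grad  # REINFORCE ascent step
        self.baseline = 0.9 * self.baseline + 0.1 * reward

# Hypothetical stand-in for "train the child and measure validation accuracy":
# the reward is the fraction of decisions equal to choice 2.
controller = TinyController(choices_per_step=[4, 4, 4])
for _ in range(3000):
    arch = controller.sample()
    controller.update(arch, reward=sum(a == 2 for a in arch) / 3.0)

print([int(np.argmax(l)) for l in controller.logits])
```

After a few thousand updates the controller concentrates its probability mass on the rewarded choices, which is the same mechanism, at a vastly smaller scale, by which the real controller learns to emit high-accuracy architectures.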
On CIFAR-10, the NAS-designed convolutional network achieved a test error rate of 3.65%, which was 0.09 percentage points better than the previous best result using a comparable architectural scheme. On the Penn Treebank language modeling benchmark, the method discovered a novel recurrent cell that outperformed the standard LSTM cell and other baselines.
The computational cost was enormous by any standard. The search required 800 GPUs running in parallel for 28 days, amounting to roughly 22,400 GPU days. This expense limited early NAS research to well-resourced laboratories, but it also motivated a sustained effort to reduce search costs, which has been one of the defining themes of the field.
Every NAS method can be decomposed into three components: the search space, the search strategy, and the performance estimation strategy. These components interact with each other and collectively determine the efficiency and effectiveness of the search.
The search space defines the set of architectures that the NAS algorithm can consider. A well-designed search space balances expressiveness (the ability to represent high-performing architectures) against tractability (keeping the space small enough to search efficiently).
In the earliest NAS work, the search space was defined at the level of the entire network. The algorithm decided the type of operation at each layer, along with hyperparameters such as kernel size, number of filters, and stride. This is called a macro search space because decisions are made about the global structure of the network. Macro search spaces can represent a wide variety of architectures, but they are very large and computationally expensive to explore.
To reduce the size of the search space, Zoph et al. (2018) introduced the cell-based or micro search space in the NASNet paper. Instead of searching for the entire network, the algorithm searches for a small, reusable building block called a cell. A cell is a directed acyclic graph (DAG) where each node represents a feature map and each edge represents an operation (such as a 3x3 convolution, 5x5 separable convolution, max pooling, or identity mapping).
Two types of cells are typically searched:

- A normal cell, which preserves the spatial resolution of its input.
- A reduction cell, which reduces the spatial resolution, typically by applying its initial operations with a stride of two.
The full network is then constructed by stacking copies of these cells in a predefined pattern. This approach has two major advantages. First, it reduces the search space from an exponentially large space of full networks to a much smaller space of cell structures. Second, it enables transferability: cells discovered on a small dataset like CIFAR-10 can be stacked into deeper networks for larger datasets like ImageNet, avoiding the cost of searching directly on the larger dataset.
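The cell abstraction can be sketched with toy scalar operations standing in for convolutions and pooling; the op names and arithmetic below are illustrative only, but the DAG structure mirrors the real search space: each new node combines two earlier nodes, each transformed by a chosen operation.

```python
# Toy scalar ops stand in for 3x3 conv, separable conv, pooling, identity, etc.
OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0 * x,
}

def eval_cell(cell, x_prev, x_curr):
    """cell: list of (op_a, src_a, op_b, src_b) tuples.
    Nodes 0 and 1 are the cell's two inputs; each later node is
    op_a(nodes[src_a]) + op_b(nodes[src_b]). Sources must point at
    earlier nodes, so the graph is acyclic by construction."""
    nodes = [x_prev, x_curr]
    for op_a, src_a, op_b, src_b in cell:
        assert src_a < len(nodes) and src_b < len(nodes)
        nodes.append(OPS[op_a](nodes[src_a]) + OPS[op_b](nodes[src_b]))
    return nodes[-1]

# A two-node cell: node 2 = identity(n0) + double(n1); node 3 = double(n2) + zero(n0)
cell = [("identity", 0, "double", 1), ("double", 2, "zero", 0)]
print(eval_cell(cell, 1.0, 2.0))  # node 2 = 1 + 4 = 5; node 3 = 10 + 0 = 10
```

The search algorithm's job is to choose the tuples in `cell`; the full network is then built by stacking many copies of the discovered cell.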
Some methods combine macro and micro search. In a hierarchical search space, the lower level defines the micro-structure of cells or blocks, while the upper level controls macro-level decisions such as the number of cells per stage, where to place reduction cells, and the channel width at each stage. MnasNet (Tan et al., 2019) used a factorized hierarchical search space in which the network was divided into seven blocks, each with independently searched architectures, allowing different parts of the network to use different operations and configurations.
The search strategy determines how the NAS algorithm navigates the search space. Three broad families of search strategies have emerged: reinforcement learning-based, evolutionary, and gradient-based.
The original NAS paper used reinforcement learning, with the controller RNN generating architectures and the REINFORCE algorithm updating the controller based on validation accuracy rewards. Subsequent RL-based methods refined this approach. MnasNet used Proximal Policy Optimization (PPO) instead of REINFORCE, and added a multi-objective reward that incorporated real-world latency measured on a target device.
RL-based methods are flexible and can optimize for complex, non-differentiable objectives. Their main drawback is sample inefficiency: they typically need to evaluate thousands of candidate architectures before converging, which can be very expensive.
Evolutionary algorithms maintain a population of architectures and iteratively improve them through mutation and selection. In the context of NAS, a mutation might change the operation at a given edge in a cell (for example, replacing a 3x3 convolution with a 5x5 separable convolution), add or remove a connection, or modify a hyperparameter.
The most influential evolutionary NAS method is regularized evolution, introduced by Real et al. (2019) in the paper that produced AmoebaNet. Regularized evolution modifies standard tournament selection by introducing an aging mechanism: the oldest individual in the population is removed at each step, regardless of its fitness. This prevents the population from being dominated by early high-fitness individuals and encourages continued exploration.
In controlled experiments using the same search space (the NASNet search space), Real et al. found that evolutionary search and RL-based search achieved similar final accuracy, but evolution had better anytime performance, meaning it found good architectures faster during the search process. Evolution also tended to find smaller models at equivalent accuracy levels.
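The aging mechanism is simple enough to sketch end to end. The following is a toy illustration of the regularized-evolution loop, with bit-strings standing in for architectures and the count of ones standing in for validation accuracy; the population, cycle, and tournament sizes are illustrative, not the paper's settings.

```python
import collections
import random

random.seed(0)

def regularized_evolution(fitness, random_arch, mutate,
                          pop_size=50, cycles=500, sample_size=10):
    """Aging-evolution sketch (after Real et al., 2019): tournament selection
    plus removal of the oldest individual each cycle, regardless of fitness."""
    population = collections.deque()
    history = []
    for _ in range(pop_size):
        arch = random_arch()
        population.append((arch, fitness(arch)))
    history.extend(population)
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = max(tournament, key=lambda pair: pair[1])
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()  # the aging step: oldest out, fit or not
    return max(history, key=lambda pair: pair[1])

# Toy "search space": bit-strings scored by their number of ones; a mutation
# flips each bit with small probability (standing in for swapping one op).
N = 20
best_arch, best_fit = regularized_evolution(
    fitness=sum,
    random_arch=lambda: [random.randint(0, 1) for _ in range(N)],
    mutate=lambda arch: [bit ^ (random.random() < 0.05) for bit in arch],
)
print(best_fit)
```

Because every individual eventually ages out, a lucky early architecture cannot dominate the population forever; lineages survive only by repeatedly producing fit children, which is what sustains exploration.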
Gradient-based methods reformulate the discrete architecture search problem as a continuous optimization problem that can be solved with gradient descent. The most prominent method in this category is DARTS (Differentiable Architecture Search), proposed by Liu, Simonyan, and Yang in 2019 (ICLR).
DARTS works by constructing a mixed operation at each edge of the cell DAG. Instead of selecting a single operation, DARTS places a weighted combination of all candidate operations at each edge. The weights (called architecture parameters) are continuous and learned jointly with the network weights through backpropagation. During the search, the architecture parameters and the network weights are optimized in an alternating fashion using a bi-level optimization scheme: the network weights are updated on the training set, and the architecture parameters are updated on the validation set.
Once the search is complete, the final discrete architecture is obtained by selecting the operation with the highest architecture parameter weight at each edge.
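The continuous relaxation and the final discretization can be shown in a few lines. The operations below are toy scalar functions, not real network layers, and the alpha values are arbitrary:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operations on one edge of the cell DAG (stand-ins for
# convolution, pooling, identity, and the "zero" op).
OPS = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]

def mixed_op(x, alpha):
    """DARTS continuous relaxation: every candidate op runs, and the outputs
    are blended with softmax(alpha), making the choice differentiable."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, OPS))

alpha = np.array([0.0, 3.0, -1.0])  # architecture parameters for this edge
y = mixed_op(2.0, alpha)            # dominated by OPS[1], whose alpha is largest

# After the search, discretize: keep only the op with the largest alpha.
chosen = OPS[int(np.argmax(alpha))]
print(y, chosen(2.0))
```

In the real method, `alpha` receives gradients through the blended output on the validation set while the ops' weights receive gradients on the training set, which is exactly the bi-level scheme described above.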
DARTS reduced the search cost on CIFAR-10 to approximately 1.5 GPU days (on a single GPU), compared to 2,000 GPU days for NASNet and 3,150 GPU days for AmoebaNet. It achieved a test error of 2.83% on CIFAR-10, competitive with the best RL and evolutionary results. However, DARTS has known stability issues: the bi-level optimization can sometimes collapse to trivial architectures dominated by skip connections. Follow-up methods such as P-DARTS, PC-DARTS, and FairDARTS have proposed various fixes for these stability problems.
Training each candidate architecture to full convergence is the most expensive part of the NAS pipeline. Performance estimation strategies aim to reduce this cost by approximating the true performance of an architecture without fully training it.
The simplest approach, used in the original NAS paper, trains every candidate architecture from scratch for a fixed number of epochs and uses the resulting validation accuracy as the performance estimate. This provides an accurate signal but is extremely expensive.
Low-fidelity methods reduce evaluation cost by training on a smaller dataset, training for fewer epochs, training a smaller version of the architecture (fewer layers or channels), or using lower-resolution inputs. The assumption is that the relative ranking of architectures is preserved even when using cheaper proxy evaluations. While this assumption does not always hold perfectly, low-fidelity estimates have proven effective in practice.
The most impactful cost reduction technique has been weight sharing, introduced by ENAS (Efficient Neural Architecture Search) from Pham et al. (2018). Instead of training each candidate architecture from scratch, ENAS constructs a single large supernet (also called an over-parameterized network) that contains all possible architectures as subgraphs. The supernet's weights are shared across all candidate architectures, so training the supernet simultaneously trains all candidates.
The controller then samples architectures from the supernet and evaluates them using the shared weights, without any additional training. This reduced the cost of NAS by roughly 1,000x compared to the original approach, bringing the search cost down to about 0.5 GPU days on CIFAR-10.
One-shot NAS methods extend this idea. In a one-shot approach, the supernet is trained once (the "one shot"), and then individual architectures are extracted and evaluated using the inherited weights. The training of the supernet and the search for the best architecture are decoupled into two separate phases.
The main challenge with weight sharing is weight coupling: since all architectures share the same weights, the shared weights may not accurately reflect the performance of any individual architecture when trained independently. Despite this limitation, weight sharing has become the dominant paradigm in modern NAS due to its dramatic cost savings.
Surrogate-based methods train a predictor (such as a Gaussian process, random forest, or neural network) to estimate architecture performance based on architectural features. The predictor is trained on a small set of fully evaluated architectures and then used to cheaply estimate the performance of new candidates. This approach is especially useful when combined with Bayesian optimization or evolutionary search.
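As a sketch of the surrogate idea, the snippet below fits a closed-form ridge-regression predictor on a handful of hypothetical (architecture, accuracy) pairs, encoding each architecture as a one-hot vector of its per-edge operation choices. The architectures and accuracy numbers are invented for illustration:

```python
import numpy as np

def encode(arch, n_ops=3):
    """One-hot encode an architecture given as one op index per edge."""
    v = np.zeros(len(arch) * n_ops)
    for edge, op in enumerate(arch):
        v[edge * n_ops + op] = 1.0
    return v

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical fully evaluated architectures (3 edges, 3 candidate ops each)
# and their measured validation accuracies - invented numbers.
archs = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0], [2, 2, 2], [1, 2, 1]]
accs = np.array([0.91, 0.93, 0.89, 0.90, 0.88, 0.94])

X = np.vstack([encode(a) for a in archs])
w = fit_ridge(X, accs)

# Score a new candidate in microseconds, with no training at all.
print(round(float(encode([1, 1, 1]) @ w), 3))
```

A real surrogate would be trained on hundreds of evaluated architectures and refreshed as the search discovers new ones, but the workflow is the same: pay the full training cost only for a small sample, and rank everything else with the predictor.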
More recently, researchers have explored zero-cost proxies that estimate architecture quality without any training at all. These methods compute statistics of the randomly initialized network (such as gradient norms, the number of linear regions, or the Jacobian of the network) and use them as a proxy for trained performance. While not as accurate as training-based estimates, zero-cost proxies can evaluate thousands of architectures per second and are useful for pruning the search space before applying more expensive evaluation.
Several NAS-discovered architectures have had significant impact on the field. The following table summarizes the most notable ones.
| Architecture | Year | Search Strategy | Search Space | Search Cost (GPU Days) | Key Result |
|---|---|---|---|---|---|
| NAS (Zoph & Le) | 2017 | Reinforcement learning | Macro | ~22,400 | 3.65% error on CIFAR-10 |
| NASNet | 2018 | Reinforcement learning | Cell-based | ~2,000 | 82.7% top-1 on ImageNet |
| AmoebaNet-A | 2019 | Regularized evolution | Cell-based (NASNet) | ~3,150 | 83.9% top-1 on ImageNet (scaled) |
| ENAS | 2018 | RL + weight sharing | Cell-based | ~0.5 | 2.89% error on CIFAR-10 |
| DARTS | 2019 | Gradient-based | Cell-based | ~1.5 | 2.83% error on CIFAR-10 |
| MnasNet | 2019 | RL (PPO) | Factorized hierarchical | ~3,800 | 75.2% top-1 on ImageNet, 78ms latency |
| ProxylessNAS | 2019 | Gradient-based + RL | MBConv-based (layer-wise) | ~8.3 | 75.1% top-1 on ImageNet (mobile) |
| FBNet | 2019 | Gradient-based | MBConv-based (layer-wise) | ~9 | 74.9% top-1 on ImageNet (mobile) |
| EfficientNet-B0 | 2019 | RL (MnasNet-based) | MBConv-based | ~3,800 | 77.1% top-1 on ImageNet |
| MobileNetV3 | 2019 | RL + NetAdapt | MBConv-based | N/A | 75.2% top-1, 22ms latency (Pixel phone) |
NASNet was introduced in 2018 by Zoph, Vasudevan, Shlens, and Le in the paper "Learning Transferable Architectures for Scalable Image Recognition," published at CVPR 2018. NASNet's key contribution was the cell-based search space and the demonstration of transferability. The search was conducted on CIFAR-10 using 500 GPUs over 4 days, and the discovered cells were then stacked into a deeper network for ImageNet.
NASNet-A (Large), with 88.9 million parameters, achieved 82.7% top-1 accuracy on ImageNet, surpassing the best human-designed architectures of the time. The paper also introduced ScheduledDropPath, a regularization technique that randomly drops paths within each cell with a probability that increases linearly over the course of training.
NASNet demonstrated that searching on a small proxy dataset and transferring to a larger target dataset was a viable strategy for reducing search costs. This transfer approach has since become standard practice in NAS research.
AmoebaNet was produced by the regularized evolution method of Real, Aggarwal, Huang, and Le, published at AAAI 2019. The paper used the same NASNet search space as a controlled comparison between evolutionary and RL-based search strategies.
AmoebaNet-A achieved accuracy comparable to NASNet when matched for model size, and when scaled up to a larger configuration (469 million parameters), it achieved 83.9% top-1 / 96.6% top-5 accuracy on ImageNet, setting a new record at the time. The search consumed 3,150 GPU days using 450 GPUs over 7 days.
The key finding was that regularized evolution matched RL in final accuracy but had better anytime performance and was simpler to implement. The aging mechanism in regularized evolution also helped the search explore more diverse regions of the search space by preventing stagnation.
EfficientNet, introduced by Tan and Le at ICML 2019, combined NAS with a novel compound scaling method. The baseline architecture, EfficientNet-B0, was discovered using an RL-based search procedure adapted from MnasNet, with a multi-objective reward balancing accuracy and FLOPs. The search space consisted of mobile inverted bottleneck convolution (MBConv) blocks.
The compound scaling method then scaled B0 uniformly across three dimensions (depth, width, and input resolution) using a single compound coefficient. This produced a family of models from B0 to B7 that achieved consistently better accuracy-to-efficiency trade-offs than previous architectures.
EfficientNet-B7 achieved 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster on inference than the best existing convolutional network at the time. The work demonstrated that NAS could produce not just a single architecture but a principled family of models spanning a wide range of computational budgets.
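The compound scaling rule itself is compact enough to sketch. The constants below (alpha = 1.2, beta = 1.1, gamma = 1.15) are the grid-searched values reported in the paper; note that the released B1-B7 models round their input resolutions by hand, so this formula only approximates them.

```python
# EfficientNet compound scaling: scale depth, width, and resolution together
# with a single compound coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # chosen so alpha * beta^2 * gamma^2 ~ 2

def compound_scale(phi, base_resolution=224):
    depth_mult = ALPHA ** phi                            # more layers
    width_mult = BETA ** phi                             # more channels
    resolution = round(base_resolution * GAMMA ** phi)   # larger inputs
    return depth_mult, width_mult, resolution

# FLOPs grow roughly as (alpha * beta^2 * gamma^2)^phi ~ 2^phi, so each
# increment of phi approximately doubles the compute budget.
for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}px")
```

The design choice worth noting is that a single coefficient controls all three dimensions jointly; scaling only one (say, depth alone) saturates much sooner than balanced scaling.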
MobileNetV3, published by Howard et al. at ICCV 2019, combined hardware-aware NAS with manual architectural refinements. The design process had two phases. First, a platform-aware NAS approach (based on MnasNet) searched for a global network structure optimized for latency on a Pixel phone. Second, the NetAdapt algorithm performed fine-grained, layer-wise optimization to further reduce latency while maintaining accuracy.
On top of the NAS-discovered structure, the authors made several manual modifications: redesigning the expensive initial and final layers, introducing the hard-swish activation function (a computationally cheaper approximation of swish), and adding squeeze-and-excitation modules to certain layers.
Two variants were released: MobileNetV3-Large (for higher-resource use cases) and MobileNetV3-Small (for constrained environments). MobileNetV3-Large achieved 75.2% top-1 accuracy on ImageNet with a latency of 22 milliseconds on a Pixel phone. Compared to MobileNetV2, MobileNetV3-Large was 3.2 percentage points more accurate while being 15% faster, and MobileNetV3-Small was 4.6 percentage points more accurate while being 5% faster.
MobileNetV3 is notable as an example of NAS and human design working in combination rather than as alternatives. The NAS component discovered the high-level structure, while human engineers refined the details based on domain knowledge about mobile hardware.
Traditional NAS optimizes for accuracy alone or uses FLOPs as a proxy for computational cost. However, FLOPs do not perfectly correlate with real-world inference latency because different operations have different hardware utilization characteristics. A depthwise separable convolution may have fewer FLOPs than a standard convolution but run slower on certain hardware due to lower arithmetic intensity.
Hardware-aware NAS addresses this gap by incorporating actual hardware measurements (latency, energy consumption, or memory usage) directly into the search objective. Instead of minimizing FLOPs, the search minimizes (or constrains) the actual inference time on a target device.
MnasNet (Tan et al., CVPR 2019) was one of the first NAS methods to explicitly incorporate real-world latency. The search used a multi-objective reward function:
Reward = ACC(m) × [LAT(m) / T]^w
where ACC(m) is the model's accuracy, LAT(m) is the measured latency on a Pixel phone, T is a target latency, and w is a weight factor (set to -0.07) that controls the trade-off. The latency was measured by running each candidate model on an actual Pixel phone rather than relying on a theoretical estimate.
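The reward is straightforward to compute. The sketch below uses an illustrative 75 ms target rather than any value taken from the paper:

```python
def mnas_reward(acc, latency_ms, target_ms=75.0, w=-0.07):
    """Soft-constraint form of the MnasNet objective: ACC(m) * (LAT(m)/T)^w.
    With w < 0, models slower than the target T are penalized smoothly
    rather than rejected outright. target_ms here is illustrative."""
    return acc * (latency_ms / target_ms) ** w

print(mnas_reward(0.75, 75.0))   # exactly at target: reward equals accuracy
print(mnas_reward(0.75, 150.0))  # 2x over target: scaled by 2**-0.07, ~0.953
```

Because the penalty is a smooth power law rather than a hard cutoff, the controller can still get credit for a slightly-too-slow model that is much more accurate, which lets the search trace out the accuracy-latency trade-off curve.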
MnasNet achieved 75.2% top-1 accuracy on ImageNet with 78 milliseconds of latency on a Pixel phone, which was 1.8x faster than MobileNetV2 at 0.5 percentage points higher accuracy and 2.3x faster than NASNet at 1.2 percentage points higher accuracy.
ProxylessNAS (Cai, Zhu, and Han, ICLR 2019) took hardware-aware NAS further by searching directly on the target task and hardware without proxy datasets or proxy architectures. The method used a gradient-based search on an over-parameterized supernet, with path-level binarization to reduce memory consumption. ProxylessNAS could target different hardware platforms (GPU, CPU, or mobile) by simply changing the latency measurement in the objective function.
The search cost was approximately 8.3 GPU days, and the resulting architectures achieved competitive accuracy with significantly lower latency than previous methods. ProxylessNAS cut search cost by roughly 200x compared to earlier hardware-aware methods.
Hardware-aware NAS has been adopted by major technology companies for production deployment. The core insight, that the search objective should reflect the actual deployment constraints rather than a proxy metric, has influenced how practitioners think about model design more broadly. Methods such as FBNet, Single-Path NAS, and OFA (Once-for-All) have extended hardware-aware NAS to support multiple hardware targets from a single search.
One-shot NAS methods represent the most significant cost reduction in the field. The central idea is to train a single supernet that encodes all possible architectures in the search space as subgraphs. Once the supernet is trained, individual architectures can be evaluated by extracting the corresponding subgraph and using the inherited shared weights, without any additional training.
A supernet is an over-parameterized network where each edge in the computational graph contains all candidate operations. During training, different subgraphs (corresponding to different architectures) are activated and trained. The weights are shared across all subgraphs, meaning that an operation at a given position in the network uses the same weights regardless of which architecture it appears in.
Several strategies have been proposed for training the supernet:

- Uniform single-path sampling, in which one randomly chosen subgraph is activated and trained at each step (the approach of Single-Path One-Shot).
- Path dropout, in which random subsets of operations are zeroed out during each forward pass so that no single path is always present.
- Joint optimization with architecture parameters, in which all operations run in a weighted mixture and the weights are learned alongside the network, as in DARTS.
The main challenge with one-shot NAS is the ranking consistency problem: the relative ranking of architectures under shared weights may not match their ranking when trained independently. Weight coupling, where the shared weights represent a compromise across all architectures rather than being optimal for any single one, is the root cause. If the supernet's ranking is unreliable, the search may select suboptimal architectures.
Researchers have addressed this through various techniques, including larger supernets, better training schedules, progressive shrinking, and improved sampling strategies. Despite these challenges, one-shot methods remain the dominant paradigm due to their dramatic cost advantages.
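Uniform single-path sampling, the simplest supernet training strategy, can be sketched as follows. The position count, op count, and scalar "weights" are all toy stand-ins for real layers and their parameters:

```python
import random

random.seed(0)

# Toy supernet: 3 positions ("edges"), each with 3 candidate ops. The shared
# weights are one scalar per (position, op), standing in for that op's kernel.
N_POS, N_OPS = 3, 3
shared_weights = [[0.0] * N_OPS for _ in range(N_POS)]

def sample_path():
    """Uniform single-path sampling: one active op per position per step."""
    return [random.randrange(N_OPS) for _ in range(N_POS)]

def train_step(path):
    # Only the sampled subgraph is touched; every architecture that contains
    # this (position, op) pair inherits the same updated weight later.
    for pos, op in enumerate(path):
        shared_weights[pos][op] += 1.0  # stand-in for one gradient update

for _ in range(900):
    train_step(sample_path())

# Each of the 9 shared weights received roughly 900 / 3 = 300 updates, so all
# 27 possible architectures were "trained" without ever being instantiated.
print(shared_weights)
```

The weight-coupling problem is visible even here: each shared weight is shaped by every path it appears in, so it is a compromise rather than the value any single architecture would learn on its own.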
The reduction in NAS search cost over time is one of the field's most striking achievements. The following table illustrates this progression.
| Method | Year | Search Cost (GPU Days) | Key Cost Reduction Technique |
|---|---|---|---|
| NAS (Zoph & Le) | 2017 | ~22,400 | None (train from scratch) |
| NASNet | 2018 | ~2,000 | Cell-based search space, proxy dataset |
| AmoebaNet | 2019 | ~3,150 | Cell-based search space |
| ENAS | 2018 | ~0.5 | Weight sharing |
| DARTS | 2019 | ~1.5 | Continuous relaxation, gradient-based |
| ProxylessNAS | 2019 | ~8.3 | Path-level binarization |
| P-DARTS | 2019 | ~0.3 | Progressive search depth |
| Single-Path One-Shot | 2020 | ~12 (supernet) + search | Uniform path sampling |
| FairDARTS | 2020 | ~0.4 | Sigmoid-based relaxation |
| Zero-cost proxies | 2021+ | < 0.01 | No training required |
The cost reduction has come from multiple sources:

- Smaller, cell-based search spaces that sharply constrain what must be explored.
- Low-fidelity proxies: searching on small datasets, for fewer epochs, or with shallower networks.
- Weight sharing, which amortizes training across every candidate in a single supernet.
- Continuous relaxation, which replaces thousands of discrete trainings with one gradient-based optimization.
- Zero-cost proxies, which score architectures at initialization without any training.
The following table compares the three main families of NAS search strategies across several dimensions.
| Dimension | Reinforcement Learning | Evolutionary | Gradient-Based |
|---|---|---|---|
| Optimization type | Policy gradient over discrete space | Population-based, mutation and selection | Gradient descent over continuous relaxation |
| Typical search cost | High (thousands of GPU days) | High (thousands of GPU days) | Low (1-4 GPU days) |
| Handling of non-differentiable objectives | Natural (reward shaping) | Natural (fitness function) | Requires differentiable relaxation or surrogate |
| Hardware-aware search | Straightforward (add latency to reward) | Straightforward (add latency to fitness) | Requires differentiable latency model or lookup table |
| Stability | Generally stable | Generally stable | Can collapse to degenerate architectures (e.g., skip-connection-heavy) |
| Multi-objective optimization | Supported via reward shaping | Supported via Pareto front | Requires scalarization or additional constraints |
| Notable methods | NAS, NASNet, MnasNet | AmoebaNet, Large-Scale Evolution | DARTS, ProxylessNAS, FBNet |
Each strategy family has distinct strengths. RL-based methods are flexible and handle complex objectives well. Evolutionary methods are simple to implement, embarrassingly parallel, and robust. Gradient-based methods are fast but require careful handling to avoid instability.
While NAS originated in the context of image classification, it has been applied to a broad range of tasks and domains.
NAS-FPN (Ghiasi, Lin, and Le, CVPR 2019) applied NAS to discover feature pyramid network architectures for object detection. The resulting architecture outperformed the hand-designed FPN on the COCO benchmark. Auto-DeepLab (Liu et al., CVPR 2019) extended NAS to semantic segmentation, searching over both the cell structure and the macro-level network topology.
The original Zoph and Le (2017) paper already demonstrated NAS for recurrent cell design. Subsequent work has applied NAS to discover transformer architectures. The Evolved Transformer (So, Liang, and Le, ICML 2019) used evolutionary NAS to discover a transformer variant that outperformed the standard transformer on machine translation benchmarks while using fewer parameters.
NAS has been applied to automatic speech recognition, keyword spotting, and audio classification. Hardware-aware NAS is particularly relevant for speech applications that need to run on edge devices with strict latency constraints.
GraphNAS and related methods apply architecture search to graph neural networks, searching over aggregation functions, neighborhood sampling strategies, and layer configurations for tasks like node classification and graph classification.
NAS has shown promise in medical imaging tasks such as pathology classification, tumor segmentation, and radiology analysis, where the optimal architecture may differ significantly from those designed for natural image datasets.
Standardized benchmarks have been important for fair comparison of NAS methods. The most widely used benchmarks include:
| Benchmark | Year | Description | Search Space Size |
|---|---|---|---|
| NAS-Bench-101 | 2019 | Tabular benchmark with all architectures in a cell-based space evaluated on CIFAR-10 | 423,624 architectures |
| NAS-Bench-201 | 2020 | Smaller cell-based space evaluated on CIFAR-10, CIFAR-100, and ImageNet-16-120 | 15,625 architectures |
| NAS-Bench-301 | 2021 | Surrogate benchmark for the DARTS search space | ~10^18 architectures |
| TransNAS-Bench-101 | 2021 | Benchmark for transfer NAS across multiple tasks | 7,352 architectures |
These benchmarks enable researchers to evaluate NAS algorithms without the cost of actually training architectures, because the ground-truth performance of every architecture in the search space has been precomputed. This has accelerated research and improved reproducibility.
Despite significant progress, NAS faces several ongoing challenges.
Many NAS papers report results with high variance, and small differences in training protocols (learning rate schedules, data augmentation, random seeds) can significantly affect the reported accuracy. Some studies have found that random search within the same search space can match or exceed the performance of sophisticated NAS algorithms, raising questions about how much of the reported improvement comes from the search algorithm versus the search space design.
Most NAS research has focused on convolutional architectures for image classification. The search spaces, operations, and evaluation protocols are heavily tailored to this setting. Applying NAS to new domains (such as transformer architectures for language or multimodal models) often requires designing new search spaces from scratch, which itself requires expertise.
Weight sharing introduces a gap between the performance predicted by the supernet and the true performance of a standalone architecture. Improving the ranking consistency of supernets remains an active research area.
Early NAS methods consumed enormous amounts of energy. While modern methods are far more efficient, the total computational cost of NAS research across the community is substantial. Recent work, such as CE-NAS, has begun to incorporate carbon efficiency into the search objective.
As the field shifts toward large pretrained foundation models, the role of NAS is evolving. Rather than searching for architectures from scratch, NAS may be used to design efficient adapters, pruning strategies, or fine-tuning configurations for existing large models. The intersection of NAS with large language models is an emerging area of research.
| Year | Milestone |
|---|---|
| 2017 | Zoph and Le publish the foundational NAS paper using RL (ICLR 2017) |
| 2017 | Real et al. demonstrate large-scale evolutionary NAS |
| 2018 | NASNet introduces the cell-based search space and transferable architectures (CVPR 2018) |
| 2018 | ENAS introduces weight sharing, reducing cost by 1,000x (ICML 2018) |
| 2019 | DARTS introduces gradient-based search (ICLR 2019) |
| 2019 | AmoebaNet demonstrates regularized evolution (AAAI 2019) |
| 2019 | MnasNet introduces hardware-aware NAS with real latency measurements (CVPR 2019) |
| 2019 | EfficientNet combines NAS with compound scaling (ICML 2019) |
| 2019 | MobileNetV3 combines NAS with NetAdapt and manual refinements (ICCV 2019) |
| 2019 | ProxylessNAS enables direct search on target task and hardware (ICLR 2019) |
| 2019 | NAS-Bench-101 provides the first tabular NAS benchmark |
| 2020 | Once-for-All enables deployment-time architecture specialization |
| 2021+ | Zero-cost proxies eliminate the need for training during search |
| 2024+ | LLM-guided NAS and carbon-efficient NAS emerge as new research directions |
Neural architecture search has transformed the design of neural networks from a manual art into an algorithmic optimization problem. Starting with the computationally intensive RL-based approach of Zoph and Le in 2017, the field has developed increasingly efficient methods, from evolutionary search (AmoebaNet) to gradient-based optimization (DARTS) to weight-sharing supernets (ENAS, One-Shot NAS). These advances have reduced search costs from tens of thousands of GPU days to single-digit GPU hours or even minutes.
NAS has produced architectures, including NASNet, AmoebaNet, EfficientNet, and MobileNetV3, that matched or surpassed the best human-designed networks on major benchmarks. Hardware-aware NAS has further extended the technique by optimizing for real-world deployment constraints, not just accuracy. While challenges remain around reproducibility, search space generalization, and the environmental cost of large-scale search, NAS continues to evolve, with recent work exploring its integration with large language models, zero-cost evaluation proxies, and carbon-efficient optimization.