Pruning is a family of techniques used in machine learning and artificial intelligence to remove parts of a model or search space that are estimated to be unnecessary for accuracy or optimality. In predictive models (for example decision trees and artificial neural networks), pruning reduces overfitting and improves efficiency by eliminating branches, neurons, filters, or weights that contribute little to validation performance.[1][2] In search algorithms (for example alpha–beta pruning for the minimax procedure), pruning discards branches that provably cannot change the final decision, reducing computation without affecting correctness.[3]
In deep learning, pruning removes parameters from artificial neural networks, creating sparse networks by eliminating unnecessary weights, neurons, filters, or entire layers, enabling deployment on resource-constrained devices and reducing inference costs. The technique has evolved from early theoretical work in the late 1980s to become essential for deploying modern large language models and deep neural networks on edge devices, achieving compression ratios of 50-95% with minimal accuracy loss. By 2024, over 3,000 pruning papers have been published, with successful deployments by NVIDIA, Meta, Intel, and Qualcomm achieving 2-29× speedups in production systems.[4]
Many ML models are overparameterized, containing redundancies that do not affect generalization. Pruning targets such redundancies to reduce model size and computation while preserving, and sometimes improving, generalization.
The evolution of pruning techniques from focusing on individual weights (unstructured) to entire network components (structured) reflects a maturation of the field, driven by practical engineering challenges of achieving real-world performance gains on existing hardware.
Research on neural network pruning traces back to 1988, emerging from neurobiological studies showing the human brain's resistance to damage and natural synaptic pruning during development.[7] This biological analogy to synaptic pruning in human brains, where unnecessary connections are eliminated during development, provided the conceptual foundation for artificial neural network pruning.[8]
At the end of the 1980s and beginning of the 1990s, the field expanded rapidly from seminal studies, with two major branches emerging: sensitivity calculation methods that evaluated each parameter's contribution to the error function, and penalty-term methods using regularization to encourage sparse networks.[9]
The original motivation differed from modern applications: pruning was intended to improve generalization rather than compression, based on theory and experience showing that networks with excessive parameters do not generalize well for fixed training data.[2]
Yann LeCun, with John S. Denker and Sara A. Solla, published Optimal Brain Damage at NeurIPS 1989, introducing parameter "saliency" computed using diagonal Hessian approximations.[2] This influential work, with 4,897 citations, demonstrated 60% parameter reduction on handwritten digit recognition networks with minimal accuracy impact. The method approximates the change in objective function using Taylor series expansion, showing that removing unimportant weights improves generalization and reduces required training examples.
Babak Hassibi and David G. Stork extended this work with Optimal Brain Surgeon (1992-1993), using the full Hessian matrix rather than diagonal approximations and allowing optimal weight adjustments after pruning.[10] The method achieved 90%, 76%, and 62% weight reduction on MONK's benchmark problems, significantly outperforming magnitude-based methods.
Other foundational contributions included:
The field experienced renewed interest starting in 2015 with Song Han's work on magnitude-based iterative pruning, demonstrating 9-13× compression on AlexNet and VGGNet.[13] His subsequent "Deep Compression" paper (2016) combined pruning with quantization and Huffman coding, achieving 35-49× compression without accuracy loss and becoming one of the top-5 most cited papers in ISCA's 50-year history.[4]
In 2016, Hao Li et al. introduced filter pruning for convolutional networks in "Pruning Filters for Efficient ConvNets", targeting entire convolutional filters to create cascading effects through the network.[14]
The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin in 2018, revolutionized understanding of pruning by showing that randomly-initialized dense networks contain sparse subnetworks ("winning tickets") that can match full network accuracy when trained in isolation.[15] This ICLR 2019 Best Paper Award winner demonstrated that networks with 10-20% of original parameters could achieve comparable performance, fundamentally changing perspectives on network sparsity and suggesting that large over-parameterized networks are not just learning effective weights, but acting as a search space to find well-initialized sparse structures.
From 2020-2024, the field exploded with more than 3,000 pruning papers published—representing over half of all neural network compression research.[4] Recent advances include:
Major technology companies including NVIDIA, Meta, Intel, AMD, and Qualcomm have deployed pruning in production systems for model compression and edge deployment.
A decision tree grown to purity typically overfits the training data. Decision tree pruning prevents overfitting and improves generalization by simplifying the tree structure, removing nodes and subtrees that do not improve validation accuracy or that add unnecessary complexity.[18]
Pre-pruning, also known as early stopping, limits growth during induction with constraints applied during the tree-building process. This prevents the tree from growing to its full complexity in the first place.[19] Common pre-pruning criteria include:
The primary advantage of pre-pruning is its computational efficiency. By preventing the tree from growing to its full complexity, it saves significant training time, which can be crucial for very large datasets.[21]
However, pre-pruning suffers from a significant drawback known as the horizon effect.[22] A split may seem unpromising based on a local stopping criterion, causing the algorithm to halt growth, yet this "weak" split might have led to very informative splits further down the branch. Pre-pruning's greedy nature prevents it from seeing beyond this short-term horizon, potentially leading to a suboptimal, less accurate tree.
Post-pruning, sometimes called backward pruning, is the more common and often more effective approach.[21] In this strategy, the decision tree is first allowed to grow to its maximum size, fitting the training data completely (and thus, overfitting). Afterwards, the algorithm goes back through the tree and systematically removes nodes and subtrees that do not significantly contribute to its predictive accuracy.[23]
The decision to prune a node is typically based on its performance on a separate validation dataset (also called a pruning set) or by using a metric that penalizes complexity.[22] By first growing the full tree, post-pruning avoids the horizon effect, as it has a global view of all potential splits. While this is computationally more expensive than pre-pruning, it generally leads to more accurate and robust models.[19]
Post-pruning algorithms can traverse the tree in two ways:[22]
Reduced Error Pruning is one of the simplest and most intuitive post-pruning algorithms.[22] It relies on a separate dataset, known as a pruning or validation set, which was not used to train the tree. The algorithm works as follows:[24][25]
The main advantage of REP is its simplicity and speed.[22] However, its effectiveness depends heavily on the size and representativeness of the validation set. If the validation set is too small, the pruning decisions may be unreliable and could lead to removing useful subtrees.
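As an illustration, REP can be sketched in a few lines over a toy tree represented as nested dictionaries (a hypothetical structure; real implementations operate on a trained tree and would use training-set majority labels at each node rather than validation-set majorities):

```python
from collections import Counter

def predict(node, x):
    """Route a sample through the tree to a leaf label."""
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def error(node, data):
    """Misclassification count of `node` (treated as a subtree root) on `data`."""
    return sum(predict(node, x) != y for x, y in data)

def reduced_error_prune(node, val_data):
    """Bottom-up REP: replace a subtree with its majority-label leaf
    whenever doing so does not increase validation error."""
    if "label" in node:
        return node
    # Recurse into children first (bottom-up traversal).
    left = [(x, y) for x, y in val_data if x[node["feature"]] <= node["threshold"]]
    right = [(x, y) for x, y in val_data if x[node["feature"]] > node["threshold"]]
    node["left"] = reduced_error_prune(node["left"], left)
    node["right"] = reduced_error_prune(node["right"], right)
    # Candidate leaf: majority label among samples reaching this node.
    if val_data:
        majority = Counter(y for _, y in val_data).most_common(1)[0][0]
        leaf = {"label": majority}
        if error(leaf, val_data) <= error(node, val_data):
            return leaf
    return node

# Toy tree: root splits on feature 0; its right subtree splits on feature 1.
tree = {"feature": 0, "threshold": 0.5,
        "left": {"label": 0},
        "right": {"feature": 1, "threshold": 0.5,
                  "left": {"label": 1}, "right": {"label": 0}}}
val = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
pruned = reduced_error_prune(tree, val)
print(pruned)  # the right subtree collapses to the leaf {"label": 1}
```

Note how the right subtree is collapsed because the validation samples reaching it are all of one class, while the root split survives because removing it would raise validation error.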
Cost-Complexity Pruning, also known as weakest link pruning, is a more sophisticated and widely used method introduced in the CART algorithm.[23] Instead of relying solely on a validation set's error rate, CCP introduces a regularization parameter, α (alpha), that explicitly penalizes the complexity of the tree.
The algorithm defines a cost-complexity measure for a tree T as:[26][27]

R_α(T) = R(T) + α|T̃|

where R(T) is the misclassification cost of tree T on the training data, |T̃| is the number of terminal nodes (leaves) in T, and α ≥ 0 is the complexity parameter that penalizes each additional leaf.
The CCP algorithm works as follows:[29][30]
CCP is a powerful and principled method for finding the right-sized tree. By generating a sequence of candidate trees and using cross-validation, it provides a robust way to select a model that generalizes well. Modern libraries like scikit-learn expose CCP through the `ccp_alpha` hyperparameter.[31]
| Method | When to use | Computational cost | Pros | Cons | Canonical reference |
|---|---|---|---|---|---|
| Pre-pruning | Limited data, fast training desired | Low (prevents growth) | Simple; prevents deep overfitting early; computationally efficient | May stop too soon (horizon effect); missed structure | [18] |
| Cost-complexity post-pruning | Standard CART workflow | High (full tree + pruning) | Strong CV-based model selection; nested subtrees; avoids horizon effect | Requires pruning path & CV; computationally expensive | [26][1] |
| Reduced-error post-pruning | Separate validation set available | Medium (full tree + validation) | Conceptually simple; robust; easy to implement | Needs hold-out set; can be aggressive; less data for training | [24] |
Modern libraries expose both pre- and post-pruning controls. For example, scikit-learn's `DecisionTreeClassifier` supports pre-pruning (for example `max_depth`, `min_samples_leaf`, `min_samples_split`, `min_impurity_decrease`) and post-pruning via `ccp_alpha`.[18]
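As a sketch of that workflow, the following uses scikit-learn's cost-complexity pruning path to pick `ccp_alpha` by validation score (the iris dataset stands in for real data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree, then compute the pruning path: a sequence of
# effective alphas, each corresponding to one nested subtree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha and pick the subtree with the best validation score.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(pruned.tree_.node_count, "<=", full_tree.tree_.node_count)
```

In practice cross-validation (rather than a single hold-out split) is normally used to select α, as described above.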
The choice between pre-pruning and post-pruning represents a classic algorithmic trade-off between computational efficiency and model optimality. Pre-pruning is computationally cheap, making it suitable for rapid prototyping or on massive datasets where building a full tree is infeasible.[23] In contrast, post-pruning is computationally expensive but makes more globally informed decisions, typically resulting in a more robust final model. The choice is therefore not merely technical but strategic, depending on the available resources and performance requirements of the application.
Classic works in decision tree pruning include Quinlan's C4.5 algorithm, which popularized decision-tree pruning (including a form of pessimistic error-based post-pruning).[32]
In the domain of deep learning, pruning has become an essential technique for model compression and optimization. Modern deep neural networks, such as those used for computer vision and natural language processing, are often massively over-parameterized, containing millions or even billions of parameters.[33] This over-parameterization, while beneficial for achieving high accuracy during training, results in models that are computationally expensive, slow to run, and have large memory footprints, making them difficult to deploy on resource-constrained devices like smartphones or embedded systems.[34]
Formally, neural network pruning transforms a model f(x; W) into f(x; M ⊙ W'), where M ∈ {0, 1}^|W'| is a binary mask setting certain parameters to zero, W' is a (potentially modified) collection of parameters, and ⊙ represents the elementwise product operator.[8] The goal is reducing parameter count and computational resources while maintaining accuracy on the task.
The process creates sparsity by eliminating connections (unstructured pruning) or entire structures (structured pruning). For a network with L layers and parameters θ = {W₁, W₂, ..., W_L}, pruning identifies a subset S ⊂ θ of parameters to remove, typically by computing importance scores and removing low-importance parameters according to a pruning criterion.
The pruning optimization problem can be formulated as:

min_M L(f(x; M ⊙ W)) subject to ||M||₀ ≤ k

where L is the loss function, ||M||₀ counts non-zero elements in the mask, and k is the target number of remaining parameters. This NP-hard combinatorial optimization problem requires approximation methods in practice.
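In practice the ||M||₀ constraint is usually handled greedily: keep the k largest-magnitude parameters and zero the rest. A minimal pure-Python sketch with hypothetical weights:

```python
def topk_mask(weights, k):
    """Binary mask keeping the k largest-magnitude weights (a greedy
    approximation to the L0-constrained pruning problem)."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]

def apply_mask(weights, mask):
    """Elementwise product M ⊙ W."""
    return [w * m for w, m in zip(weights, mask)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7]  # hypothetical values
mask = topk_mask(weights, k=3)
print(apply_mask(weights, mask))  # keeps 0.9, 0.4, -0.7
```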
| Type | Granularity | Hardware Speedup | Typical Sparsity | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Unstructured | Individual weights | Requires specialized hardware | 70-95% | Highest compression, better accuracy | No speedup on standard hardware |
| Structured | Filters/channels/layers | Universal speedup | 20-60% | Hardware-friendly, real acceleration | Lower compression, more accuracy loss |
| Semi-structured | Block patterns (N:M) | GPU-optimized | 50% (2:4) | Hardware support, good compression | Limited patterns, requires recent GPUs |
Unstructured pruning, also called fine-grained pruning, removes individual weights anywhere in the network without pattern constraints.[4] This approach achieves 50-90% sparsity with minimal accuracy loss by creating irregular sparse matrices. For example, VGG-16 on CIFAR-10 achieves 92.36% accuracy at 10% density with unstructured pruning versus 89.33% with structured pruning.[35]
However, unstructured pruning suffers a critical limitation: it provides no speedup on standard hardware without specialized sparse computation libraries or hardware support. GPUs and CPUs are highly optimized for dense matrix operations, and the irregular sparsity pattern created by unstructured pruning means the underlying computation (matrix multiplication) still operates on the original dense matrix dimensions, with many multiplications by zero that are not skipped.[36][37]
Structured pruning removes entire filters, channels, neurons, or layers while maintaining regular architecture.[14] This hardware-friendly approach achieves universal speedup on standard processors by reducing both memory and FLOPs. For instance, ResNet-50 on ImageNet achieves 2× acceleration with ~1.4% top-5 accuracy loss using L1-norm filter pruning.[4]
Structured pruning exploits cascading effects: pruning an output filter in layer L automatically removes the corresponding input channels in layer L+1, creating architectural modifications without specialized kernels. In VGG-16, the fully-connected layers hold roughly 90% of the weights while the convolutional layers account for nearly all of the FLOPs, making filter pruning particularly effective for CNNs.[38]
Types of structured pruning include:
Semi-structured pruning represents a middle ground, exemplified by N:M sparsity patterns where N of every M consecutive weights are non-zero.[43] NVIDIA's 2:4 structured sparsity achieves 2× speedup on Ampere A100 GPUs using sparse tensor cores while maintaining 50% sparsity, combining advantages of both approaches. This method is particularly effective because modern NVIDIA GPUs have dedicated hardware support for 2:4 sparsity patterns.
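The N:M pattern is easy to state in code. A pure-Python sketch of a 2:4 projection, keeping the two largest-magnitude weights in each group of four (hypothetical values; production kernels operate on tensors with dedicated hardware support):

```python
def two_four_sparsify(weights):
    """Enforce a 2:4 pattern: in every group of 4 consecutive weights,
    keep the 2 with largest magnitude and zero the other 2."""
    out = list(weights)
    for g in range(0, len(out) - len(out) % 4, 4):  # trailing partial group left dense
        group = range(g, g + 4)
        keep = sorted(group, key=lambda i: abs(out[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out

w = [0.1, -0.8, 0.3, 0.02, 0.5, 0.4, -0.6, 0.05]  # hypothetical weights
print(two_four_sparsify(w))
```

Every group of four ends up exactly 50% sparse, which is what lets sparse tensor cores skip the zeroed positions with a fixed-size index.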
Pruning methods can also be categorized based on when pruning is applied relative to the model training process.[44]
Post-training pruning is the traditional and most common approach. A dense network is first trained to convergence. Then, a pruning algorithm is applied to remove unimportant parameters. This is often followed by a "fine-tuning" phase, where the pruned network is retrained for a few epochs to allow the remaining weights to adjust and recover any accuracy lost during pruning.[8][44] This is often done iteratively: prune, fine-tune, prune, fine-tune, achieving gradual adaptation to sparsity and reduced catastrophic forgetting.
During-training pruning integrates the pruning process directly into the training phase. Sparsity is encouraged from the beginning or introduced gradually as training progresses. This can be achieved through regularization methods (like L1 regularization, which pushes weights towards zero) or by using dynamic pruning masks that are updated during training.[44] Methods include DeepR (2018) using stochastic updates, and dynamic pruning where masks change at runtime.
Pruning at initialization (PaI), also known as pre-training pruning, is where the network is pruned at initialization, before any training has occurred. Inspired by the Lottery Ticket Hypothesis, methods like SNIP (2019) use connection sensitivity to prune one-shot,[45] and GraSP (2020) preserves gradient flow.[46] Advantages include reduced training costs, though critical analysis by Frankle et al. (2020) showed pruning-at-initialization methods often underperform magnitude pruning after training.[47]
Local pruning applies independent pruning within each layer with uniform or preset ratios, preventing layer collapse but yielding suboptimal sparsity distributions.[4] This safer approach works well when importance varies greatly across layers and prevents catastrophic performance degradation.
Global pruning ranks importance across the entire network, automatically discovering optimal layer-wise sparsity patterns. While generally achieving better accuracy, global pruning risks layer collapse at high speedup ratios.[35] For LLMs, where outlier features exhibit 20× magnitude differences across layers, protected global pruning preserves ≥10% of parameters per group to mitigate this risk.
Magnitude-based pruning, dating to 1988 and popularized by Han et al. (2015), removes the weights with the smallest absolute values: a weight w is pruned if |w| < τ for some threshold τ.[13] The core assumption is that weights with a small absolute value (magnitude) have a smaller impact on the network's output and thus contribute less to its predictive power.
Despite its simplicity, magnitude pruning remains a strong baseline, with TensorFlow reporting 6× compression with minimal loss.[48] This can be applied at different scopes:[49]
For filter pruning, L1 and L2 norms rank importance: Score(f) = Σᵢ|wᵢ| for L1, Score(f) = √(Σᵢwᵢ²) for L2. Modern variants include Wanda, which combines weight magnitudes with activation norms: Score(w) = |w| × ||x||, outperforming pure magnitude methods on LLMs.[17] Recent work includes confident magnitude-based pruning (2024) adding uncertainty quantification.[50]
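The Wanda-style criterion can be sketched directly from the formula above (all values hypothetical):

```python
def wanda_scores(weight_rows, input_norms):
    """Wanda-style importance: Score(w_ij) = |w_ij| * ||x_j||, where
    ||x_j|| is the norm of the j-th input feature's activations."""
    return [[abs(w) * input_norms[j] for j, w in enumerate(row)]
            for row in weight_rows]

# Hypothetical 2x3 weight matrix and per-input activation norms.
W = [[0.5, -0.2, 0.1],
     [-0.3, 0.4, 0.6]]
x_norm = [1.0, 4.0, 0.5]

scores = wanda_scores(W, x_norm)
# A small weight on a high-activation input can outrank a larger weight:
# |-0.2| * 4.0 = 0.8 > |0.5| * 1.0 = 0.5
print(scores)
```

This is why Wanda departs from pure magnitude pruning: weights feeding from high-magnitude (outlier) activations are protected even when the weights themselves are small.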
Optimal Brain Damage (OBD) approximates the change in objective function using a Taylor series expansion with three key simplifications: diagonal Hessian approximation (cross terms neglected), extremal approximation (the gradient term gᵢ = 0 at convergence), and quadratic approximation (higher-order terms discarded).[2]
The final saliency formula becomes:

sₖ = hₖₖuₖ²/2

where hₖₖ is the diagonal Hessian element computed via backpropagation and uₖ is the weight value. OBD successfully reduced a 2578-parameter network by 60% (removing 1500 parameters) with minimal accuracy impact, demonstrating that removing unimportant weights improves generalization and reduces required training examples.
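The OBD saliency can be computed directly from the formula above; a minimal sketch with hypothetical weights and Hessian diagonal:

```python
def obd_saliencies(weights, hessian_diag):
    """OBD saliency s_k = h_kk * u_k^2 / 2 for each weight u_k,
    using only the diagonal of the Hessian."""
    return [h * u * u / 2 for u, h in zip(weights, hessian_diag)]

def prune_lowest(weights, hessian_diag, n_prune):
    """Zero the n_prune weights with smallest saliency."""
    s = obd_saliencies(weights, hessian_diag)
    victims = sorted(range(len(weights)), key=lambda i: s[i])[:n_prune]
    return [0.0 if i in victims else w for i, w in enumerate(weights)]

# A large weight with near-zero curvature is less salient than a
# smaller weight with high curvature -- unlike pure magnitude pruning.
w = [2.0, 0.5, -1.0]
h = [0.01, 4.0, 1.0]          # hypothetical diagonal Hessian entries
print(obd_saliencies(w, h))   # [0.02, 0.5, 0.5]
print(prune_lowest(w, h, 1))  # [0.0, 0.5, -1.0]
```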
Optimal Brain Surgeon (OBS) extends OBD by using the full inverse Hessian H⁻¹ rather than diagonal approximations, allowing weight modifications during pruning.[10] For pruning weight q, the optimal adjustment to the remaining weights is:

δw = −(w_q / [H⁻¹]_qq) · H⁻¹ e_q

with saliency:

L_q = w_q² / (2[H⁻¹]_qq)

where e_q is the unit vector selecting weight q.
OBS significantly outperforms magnitude-based methods and OBD, which "often remove the wrong weights," permitting more aggressive pruning for the same training error and yielding better test generalization. Extensions include Layer-wise Optimal Brain Surgeon (2017)[51] and The Combinatorial Brain Surgeon (2022) for simultaneous weight removal.
First-order Taylor expansion approximates the loss change when pruning parameter h:

ΔL ≈ (∂L/∂h) · h

yielding importance:

I(h) = |(∂L/∂h) · h|
This computationally efficient criterion requires only first-order gradients, demonstrating 10× reduction on 3D-convolutional filters with small accuracy drops.[52]
Modern gradient-based methods include:
L1 regularization (Lasso) adds penalty λ||θ||₁ = λΣᵢ|θᵢ| to the loss, inducing exact sparsity by driving weights to zero through non-differentiable subgradients. L2 regularization (Ridge) adds λ||θ||₂² = λΣᵢθᵢ², encouraging small weights without exact zeros.[54]
Growing Regularization gradually increases penalty λ(t) over training iterations for improved pruning schedules, addressing Hessian information exploitation. DeepHoyer introduces scale-invariant, differentiable sparsity measures: DeepHoyer-Square (DHS) = (||θ||₁/||θ||₂)², optimizable via standard SGD.[55]
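The DeepHoyer-Square measure is easy to verify numerically; a minimal sketch showing its scale invariance and its range from 1 (a single non-zero entry) to the vector length (all entries equal):

```python
import math

def l1(v):
    return sum(abs(x) for x in v)

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def deephoyer_square(v):
    """DHS = (||v||_1 / ||v||_2)^2 -- a differentiable, scale-invariant
    sparsity measure: smaller values mean sparser vectors."""
    return (l1(v) / l2(v)) ** 2

print(deephoyer_square([1.0, 0.0, 0.0, 0.0]))  # 1.0  (maximally sparse)
print(deephoyer_square([1.0, 1.0, 1.0, 1.0]))  # 4.0  (maximally dense)
print(deephoyer_square([5.0, 0.0, 0.0, 0.0]))  # 1.0  (scale-invariant)
```

Because DHS is differentiable almost everywhere, it can be added to the loss and minimized with standard SGD, unlike the ||·||₀ count it approximates.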
Network Slimming (2017) prunes channels by penalizing batch normalization scaling factors with L1 regularization, achieving 20× model size reduction and 5× computing operations reduction on VGG-16 CIFAR-10.[56]
One-shot pruning removes the target percentage in a single step after training, offering negligible pruning cost and fast execution but requiring carefully designed criteria and risking layer collapse.[4] Examples include SNIP, SynFlow (data-free), and SparseGPT for 100B+ parameter LLMs. One-shot methods are particularly valuable for very large models where iterative retraining is prohibitively expensive.
Iterative pruning alternates score-prune-update cycles: train to performance level, prune p% parameters, fine-tune several epochs, repeat until target sparsity.[13] While computationally expensive, iterative methods achieve better final accuracy through gradual adaptation to sparsity and reduced catastrophic forgetting. Studies on VGG-16 CIFAR-10 and LLaMA-7B consistently show iterative outperforming one-shot approaches.
Automated Gradual Pruning (AGP) uses polynomial sparsity schedules:[57]

s_t = s_f + (s_i − s_f)(1 − (t − t₀)/(nΔt))³ for t ∈ {t₀, t₀ + Δt, ..., t₀ + nΔt}

where s_f is final sparsity, s_i is initial sparsity, t₀ is the step at which pruning begins, and the schedule gradually increases sparsity over n pruning steps spaced Δt apart.
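The schedule can be sketched directly (hypothetical hyperparameters):

```python
def agp_sparsity(t, t0, n, dt, s_i, s_f):
    """Polynomial-decay sparsity schedule from Automated Gradual Pruning:
    s_t = s_f + (s_i - s_f) * (1 - (t - t0)/(n*dt))^3, clamped outside
    the pruning window [t0, t0 + n*dt]."""
    frac = min(max((t - t0) / (n * dt), 0.0), 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

# Sparsity ramps from 0% to 90% over 100 steps, rising fast early
# and flattening out as training ends.
for t in (0, 50, 100):
    print(t, round(agp_sparsity(t, t0=0, n=100, dt=1, s_i=0.0, s_f=0.9), 4))
```

The cubic term makes the schedule aggressive early (when many redundant weights are easy to find) and gentle late (when the remaining weights need time to adapt).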
The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin at MIT in 2018, states that randomly-initialized dense networks f(x; θ₀) contain sparse subnetworks f(x; m⊙θ₀) (where m is a binary mask) that, when trained in isolation, reach accuracy comparable to the full network.[15] This ICLR 2019 Best Paper Award winner demonstrated winning tickets with 10-20% of original parameters achieving full network performance.
The hypothesis provides a powerful theoretical framework for understanding why pruning can be so effective, suggesting that large, over-parameterized networks are not just learning effective weights; they are also acting as a search space to find well-initialized sparse structures that are inherently good at learning.
The Iterative Magnitude Pruning (IMP) algorithm identifies winning tickets:
The rewinding variant, proposed for stabilizing larger networks, resets to the weights at iteration k (not initialization): θ_pruned = m ⊙ θₖ, establishing that networks become stable to SGD noise early in training, creating linearly-connected minima.[58]
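The IMP loop can be illustrated on a toy problem, using gradient descent on a simple quadratic loss in place of real network training and rewinding to the original initialization θ₀ (all names and values hypothetical):

```python
def train(w, mask, target, steps=200, lr=0.1):
    """Toy stand-in for network training: gradient descent on the
    quadratic loss L = sum((w - target)^2), with pruned coordinates
    held at zero by the mask."""
    w = [wi * mi for wi, mi in zip(w, mask)]
    for _ in range(steps):
        grad = [2 * (wi - ti) for wi, ti in zip(w, target)]
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w

def imp(w0, target, rounds=2, prune_frac=0.5):
    """Iterative magnitude pruning with rewinding: train, prune the
    smallest-magnitude fraction of surviving weights, rewind the
    survivors to their initial values w0, and repeat."""
    mask = [1] * len(w0)
    for _ in range(rounds):
        w = train(list(w0), mask, target)          # train from rewound weights
        alive = [i for i, m in enumerate(mask) if m]
        n_prune = int(len(alive) * prune_frac)
        for i in sorted(alive, key=lambda i: abs(w[i]))[:n_prune]:
            mask[i] = 0                            # prune lowest-magnitude weights
    return mask

w0 = [0.3, -0.2, 0.7, 0.1, -0.5, 0.9, 0.4, -0.6]      # hypothetical initialization
target = [1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0, 0.0]   # the optimum is itself sparse
print(imp(w0, target))  # surviving weights align with the non-zero coordinates
```

Here the "winning ticket" emerges because iterative pruning keeps exactly the coordinates that matter at the optimum, which one-shot magnitude pruning of the untrained w0 would miss.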
The key insight of the LTH is that the structure of the sparse subnetwork and its specific initial weight values are both crucial. A stronger version of the hypothesis has also been proven, showing that a sufficiently over-parameterized network contains a subnetwork that can approximate a target function well even before any training.[59]
Malach et al. (2020) provided the first theoretical proof of a strong lottery ticket hypothesis for two-layer networks, formally validating that pruning is sufficient.[60] Extensions include applications to pre-trained BERT networks, where matching subnetworks exist at 40-90% sparsity at initialization.[61]
Critical analysis by Frankle et al. (2020) showed pruning-at-initialization methods underperform magnitude pruning after training, with shuffling weights preserving accuracy—suggesting these methods identify architecture rather than specific initializations.[47]
Outside of model training, pruning is fundamental in symbolic AI search. In two-player games, alpha–beta pruning eliminates branches that cannot affect the minimax value, enabling deeper searches with the same compute while returning the same optimal move as plain minimax under perfect play.[3]
Alpha-beta pruning is a search algorithm that seeks to decrease the number of nodes evaluated by the minimax algorithm in its search tree. It stops evaluating a move when at least one possibility has been found that proves the move to be worse than a previously examined move. Such moves need not be evaluated further. When applied to a standard minimax tree, it returns the same move as minimax would, but prunes away branches that cannot possibly influence the final decision.
The algorithm maintains two values, alpha and beta, which represent the minimum score that the maximizing player is assured of and the maximum score that the minimizing player is assured of, respectively. As the search proceeds, these values are updated, and branches are pruned when beta ≤ alpha, indicating that the current position will not be reached in optimal play.
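A minimal sketch of the algorithm over a game tree given as nested lists (a textbook toy example, not an engine-quality implementation):

```python
def alphabeta(node, depth, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning over a tree given as nested lists;
    leaves are numbers. Returns the same value as plain minimax."""
    if depth == 0 or not isinstance(node, list):
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if beta <= alpha:   # cutoff: the minimizer will avoid this branch
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:   # cutoff: the maximizer will avoid this branch
                break
        return value

# Depth-2 game tree: once the second subtree yields a value worse than an
# already-explored alternative, its remaining leaves are never evaluated.
tree = [[3, 5], [2, 9], [0, 1]]
print(alphabeta(tree, 2, float("-inf"), float("inf"), True))  # 3
```

With the maximizer to move, the minimax value is 3; the leaves 9 and 1 are pruned because their subtrees are already proven worse than the first branch.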
Pruning achieves substantial compression across vision architectures:
ResNet models:
VGG architectures demonstrate extreme compressibility:
Object detection:
BERT pruning demonstrates exceptional compression potential:
Large language model pruning has advanced dramatically:
Token pruning methods achieve significant speedups:
Edge applications demonstrate pruning's practical value:
Pruning has found applications across various industries where efficiency and interpretability are important:
| Model | Compression Ratio | Speedup | Accuracy Impact | Reference |
|---|---|---|---|---|
| BERT-base | 10× | 29× | <7.5% loss | Optimal BERT Surgeon |
| ResNet-50 | 62-76% reduction | 2-5× | ~1.4% top-5 loss | Multiple studies |
| VGG-16 | 20× | 5× | <1% | Network Slimming |
| GPT-2 | 70% | 2.5× | Maintained | Nature 2025 |
| YOLOv4 | 96.7% | — | Balanced | Research papers |
Pruning provides numerous benefits:
Accuracy-efficiency trade-off: Excessive pruning loses important information; beyond 80% sparsity, models become incapable of recovery. Different tasks and datasets exhibit varying sensitivity—ImageNet-trained models show more accuracy deterioration than CIFAR100-trained models.[76]
Hardware-software compatibility: Unstructured pruning requires specialized hardware for actual speedups. The Rasa study found 50%-sparse BERT provides almost no speed-up due to computational overhead, with tf.scatter_nd adding ~15ms. Extreme sparsity (80%+) needed on GPUs to see benefits.[67] Standard GPUs and CPUs are optimized for dense matrix operations, making irregular sparsity patterns inefficient without specialized support.
Pruning schedule complexity: Determining optimal schedules is non-trivial—pruning too eagerly (1 epoch) or too slowly both harm models. Different layers have different sensitivities requiring careful calibration.[67]
Layer collapse: Global pruning may eliminate entire groups at high speedup ratios. Protected global pruning preserving ≥10% parameters per group mitigates this.[76]
Model and task specificity: Vision Transformers are harder to compress than CNNs. Each component has characteristic maximum sparsity—BERT self-attention can sustain 100% pruning but intermediate layers cannot.[67] Recovery requirements increase with pruning ratio, making fine-tuning computationally expensive.
Implementation challenges: Requires careful tuning of hyperparameters, understanding of model architecture sensitivities, and often multiple iterations to achieve optimal results. The process can be time-consuming and requires expertise.
Pruning is one of the three main pillars of model compression, alongside quantization and knowledge distillation. While all three aim to create more efficient models, they operate on different principles.
| Technique | Mechanism | Primary Effect | Typical Accuracy Impact | Hardware Considerations |
|---|---|---|---|---|
| Pruning | Removes redundant weights, neurons, or filters from the model's architecture | Reduces parameter count and FLOPs, leading to a smaller and potentially faster model | Can maintain accuracy with fine-tuning; high pruning rates can cause degradation | Structured pruning is necessary for significant speedups on standard hardware (GPUs/CPUs) |
| Quantization | Reduces the bit-precision of weights and/or activations (for example from 32-bit floats to 8-bit integers)[77] | Reduces model size (memory footprint) and can significantly speed up inference due to faster integer arithmetic | Minor accuracy drop is common, often recoverable with Quantization-Aware Training (QAT) | Most effective with hardware that has native support for low-precision arithmetic (for example Tensor Cores, TPUs) |
| Knowledge distillation | Trains a smaller "student" model to mimic the behavior (output probabilities) of a larger "teacher" model[78] | Creates a new, compact model with a different architecture and weights, but trained to capture the "dark knowledge" of the larger model | Aims to transfer the high performance of the teacher to the smaller student; some performance drop is expected but often less than training the small model from scratch | The student model can be designed specifically to be efficient on target hardware |
Pruning vs. Quantization: Pruning changes the model's architecture by removing parts of it. Quantization keeps the architecture the same but changes the numerical representation of the parameters. Pruning reduces the number of parameters, while quantization reduces the size of each parameter.[78]
Pruning vs. Knowledge Distillation: Pruning is a process of simplifying an existing, trained model. Knowledge distillation is a training process for creating a new, smaller model. Pruning results in a subset of the original model's parameters, whereas the student model in distillation has entirely new parameters learned from scratch.[77]
These techniques are not mutually exclusive and are often most powerful when used in combination.[79] A common and highly effective pipeline for model compression involves:
By combining these methods, practitioners can achieve dramatic reductions in model size and latency, often by an order of magnitude or more, making it possible to deploy state-of-the-art AI on a wide variety of hardware.[80]
Pruning research continues to evolve, moving beyond simple heuristics applied to standard CNNs. Current research focuses on applying pruning to state-of-the-art architectures, automating the complex process of deciding what and how much to prune, and developing more dynamic and adaptive pruning strategies. This trajectory mirrors the broader evolution of machine learning itself: a progression from static, heuristic-based methods to dynamic, automated, and learned approaches.
Manually determining the optimal pruning strategy for a given network—deciding which layers to prune and by how much—is a complex and time-consuming process involving extensive trial and error. To address this, the field is moving towards AutoML for pruning, where the pruning policy itself is learned automatically.[81]
These methods frame the search for the best pruned architecture as an optimization problem:
The most advanced frontier in pruning research involves moving away from a static pruned structure. In static pruning, once a network is pruned, its sparse structure remains fixed. Dynamic pruning methods allow this structure to change:
PyTorch's `torch.nn.utils.prune` module (available since version 1.4.0) provides built-in pruning capabilities including `random_unstructured()`, `l1_unstructured()`, `ln_structured()`, and `global_unstructured()`.[86] The module uses forward hooks to apply masks during inference, supports iterative pruning with mask accumulation via `PruningContainer`, and allows custom pruning methods via `BasePruningMethod`.
Torch-Pruning, implementing the DepGraph algorithm (CVPR 2023), provides automatic dependency analysis for structural pruning across LLMs, Vision Transformers, CNNs, and detection models.[87] Supporting GroupMagnitudeImportance, GroupTaylorImportance, and custom metrics, it enables high-level pruning with global strategies and isomorphic pruning (ECCV 2024).
The TensorFlow Model Optimization Toolkit provides magnitude-based pruning via prune_low_magnitude() with polynomial decay schedules, integrated with Keras layers.[48] Features include UpdatePruningStep and PruningSummaries callbacks, strip_pruning() to remove wrappers, TensorFlow Lite support with XNNPACK acceleration, and PruneForLatencyOnXNNPack policy for mobile/edge devices. The toolkit supports structured pruning patterns including 2:4 and N:M sparsity.
NVIDIA TensorRT Model Optimizer supports depth pruning (layer removal), width pruning (neurons, attention heads, channels), magnitude-based and activation-based pruning for LLMs and transformers, with TensorRT integration for optimized inference.[88]
NeMo Framework provides script-based pruning (scripts/llm/gpt_prune.py) powered by TensorRT Model Optimizer, supporting combined depth and width pruning for Llama, Mistral, and other LLMs with importance calibration using training data.[89]
NVIDIA ASP (Automatic SParsity) enables 2:4 structured sparsity for Ampere GPUs, achieving up to 2× speedup using sparse tensor cores with TensorRT 8.0+ integration.[90]
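The 2:4 pattern referenced above keeps the two largest-magnitude weights in every contiguous group of four and zeros the rest, which is what lets sparse tensor cores skip half the multiply-accumulates. A minimal pure-Python sketch of the pattern (illustrative only, not the NVIDIA ASP implementation):

```python
def apply_2_4_sparsity(weights):
    """Zero the two smallest-magnitude weights in every group of four."""
    pruned = list(weights)
    for i in range(0, len(pruned) - len(pruned) % 4, 4):
        group = pruned[i:i + 4]
        # Indices of the two smallest-magnitude entries in this group.
        drop = sorted(range(4), key=lambda j: abs(group[j]))[:2]
        for j in drop:
            pruned[i + j] = 0.0
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.8, -0.3, 0.01]
print(apply_2_4_sparsity(w))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.8, -0.3, 0.0]
```

Because exactly two of every four values are zero, the nonzero values and their 2-bit in-group indices can be stored compactly and consumed directly by the sparse hardware path.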
Microsoft NNI (Neural Network Intelligence) provides a unified API for more than ten pruning algorithms, including L1NormPruner, FPGMPruner, SlimPruner, and TaylorFOWeightPruner, with ModelSpeedup for real acceleration, supporting PyTorch and TensorFlow.[91]
JaxPruner (Google Research 2023) offers JAX-based sparsity with magnitude, top-K, random, and gradient-based methods, integrating with Optax optimizers and Flax models, demonstrating minimal overhead with sparsity distributions and scheduling functions.[92]
ONNX Runtime provides graph optimizations (constant folding, node elimination/fusion), dynamic and static quantization (INT8/INT4), and TensorRT EP integration for cross-platform deployment.[93]
Pruning's generalization benefits have theoretical support. Generalization error bounds for pruned networks scale with the effective dimensionality deff, the number of non-zero parameters, implying that generalization improves as the pruning rate increases, up to a threshold.[94]
PAC-Bayes compression bounds provide state-of-the-art guarantees: for a stochastic classifier Q, with probability at least 1 − δ, the expected loss of Q is bounded by its empirical loss plus a complexity term involving the Kullback–Leibler (KL) divergence between Q and a prior P.[95] Arora et al. (2018) showed that compression-based bounds can be orders of magnitude tighter than parameter-counting bounds, and the first non-vacuous guarantees at ImageNet scale were achieved in 2019.
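One standard form of such a bound (a McAllester-style PAC-Bayes inequality; exact constants vary across variants) states that for any prior P fixed before seeing the data, with probability at least 1 − δ over an i.i.d. sample of size n,

```latex
\mathbb{E}_{h \sim Q}\big[L(h)\big]
  \;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{L}(h)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

where L and L̂ denote the true and empirical loss. Compression-based arguments exploit the fact that a heavily pruned network admits a posterior Q concentrated on a short description, keeping the KL term small.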
Path-norm bounds provide rescaling-invariant complexity measures based on the path-lifting Φ(θ) of the parameters, and apply to architectures including ResNets, VGGs, and U-Nets.[96]
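For a ReLU network, the ℓ1 path-norm (one common instance of such a measure) sums, over all input-to-output paths p, the product of the absolute weights along the path; it is invariant to the layer-wise rescalings that leave the network function unchanged:

```latex
\|\Phi(\theta)\|_1 \;=\; \sum_{p \in \mathcal{P}} \; \prod_{e \in p} |w_e|
```

Pruning removes paths from the sum, so higher sparsity directly shrinks this complexity measure.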
Teacher-student analyses show that, for a fixed parameter count, sparse networks generalize better than dense ones, and that the benefit of pruning increases with pruning instability (the accuracy drop immediately after pruning). This suggests that pruning regularizes in a manner similar to noise injection, producing flatter models.[97]
Song Han (MIT) pioneered magnitude-based pruning (2015) and Deep Compression (ICLR 2016, achieving 35-49× compression), developed AMC (ECCV 2018) for AutoML-based compression, and built the EIE inference engine (ISCA 2016, among the top five most-cited papers in the conference's 50-year history). His recent work includes AWQ and SmoothQuant for LLM quantization. His awards include the ICLR 2016 Best Paper Award, an NSF CAREER Award, "35 Innovators Under 35", IEEE's "AI's 10 to Watch", and a Sloan Research Fellowship.[98]
Jonathan Frankle and Michael Carbin (MIT) introduced the Lottery Ticket Hypothesis (ICLR 2019 Best Paper), stabilization methods, and a critical analysis of pruning-at-initialization, fundamentally changing the understanding of network sparsity.[15]
Gongfan Fang, Xinyin Ma, and Xinchao Wang (National University of Singapore xML Lab) developed DepGraph (CVPR 2023), Isomorphic Pruning (ECCV 2024), LLM-Pruner (NeurIPS 2023), and Structural Pruning for Diffusion Models (NeurIPS 2023), advancing structured pruning across architectures.[72]
Elias Frantar and Dan Alistarh (IST Austria) created SparseGPT (ICML 2023), enabling efficient one-shot pruning of 100B+ parameter models.[16]
Pavlo Molchanov and Huanrui Yang (NVIDIA Research) contributed Taylor expansion methods (ICLR 2017), importance estimation (CVPR 2019), and NViT (CVPR 2023) for hardware-aware pruning.[52]
Yann LeCun, John S. Denker, Sara A. Solla, Babak Hassibi, and David G. Stork established the foundational second-order methods (OBD, OBS) in the late 1980s and early 1990s that continue to influence modern approaches.[2][10]