Pruning

Pruning is a family of techniques used in machine learning and artificial intelligence to remove parts of a model or search space that are estimated to be unnecessary for accuracy or optimality. In predictive models (for example decision trees and artificial neural networks), pruning reduces overfitting and improves efficiency by eliminating branches, neurons, filters, or weights that contribute little to validation performance.[1][2] In search algorithms (for example alpha–beta pruning for the minimax procedure), pruning discards branches that provably cannot change the final decision, reducing computation without affecting correctness.[3]

In deep learning, pruning removes parameters from artificial neural networks, creating sparse networks by eliminating unnecessary weights, neurons, filters, or entire layers, enabling deployment on resource-constrained devices and reducing inference costs. The technique has evolved from early theoretical work in the late 1980s to become essential for deploying modern large language models and deep neural networks on edge devices, achieving compression ratios of 50-95% with minimal accuracy loss. By 2024, over 3,000 pruning papers have been published, with successful deployments by NVIDIA, Meta, Intel, and Qualcomm achieving 2-29× speedups in production systems.[4]

Motivation and overview

Many ML models are overparameterized, containing redundancies that do not affect generalization. Pruning targets such redundancies to achieve multiple goals:

  • Generalization and regularization: Reduce overfitting by simplifying the hypothesis. Post-pruned decision trees often outperform fully grown trees on held-out data, and early pruning work in neural networks was motivated primarily by improving generalization rather than compression.[1][2]
  • Efficiency: Decrease parameters, memory, energy, and latency for deployment on edge devices and servers. Deep model compression pipelines combining pruning with quantization can achieve 35-49× storage reduction on models like AlexNet and VGG-16 without accuracy loss.[5]
  • Search optimization: Reduce the effective branching factor in combinatorial search while preserving optimal results (for example alpha–beta pruning).[3]
  • Interpretability: Create simpler, more transparent models where the decision-making process is easier to understand and explain, particularly important in regulated industries like healthcare and finance.[6]

The evolution of pruning techniques from focusing on individual weights (unstructured) to entire network components (structured) reflects a maturation of the field, driven by practical engineering challenges of achieving real-world performance gains on existing hardware.

History

Early development (1988-1995)

Research on neural network pruning traces back to 1988, emerging from neurobiological studies of the human brain's resistance to damage and of natural synaptic pruning during development.[7] This biological analogy, in which unnecessary synaptic connections are eliminated as the brain matures, provided the conceptual foundation for artificial neural network pruning.[8]

At the end of the 1980s and beginning of the 1990s, the field expanded rapidly from seminal studies, with two major branches emerging: sensitivity calculation methods that evaluated each parameter's contribution to the error function, and penalty-term methods using regularization to encourage sparse networks.[9]

The original motivation differed from modern applications: pruning was intended to improve generalization rather than compression, based on theory and experience showing that networks with excessive parameters do not generalize well for fixed training data.[2]

Yann LeCun, with John S. Denker and Sara A. Solla, published Optimal Brain Damage at NeurIPS 1989, introducing parameter "saliency" computed using diagonal Hessian approximations.[2] This influential work, with 4,897 citations, demonstrated 60% parameter reduction on handwritten digit recognition networks with minimal accuracy impact. The method approximates the change in objective function using Taylor series expansion, showing that removing unimportant weights improves generalization and reduces required training examples.

Babak Hassibi and David G. Stork extended this work with Optimal Brain Surgeon (1992-1993), using the full Hessian matrix rather than diagonal approximations and allowing optimal weight adjustments after pruning.[10] The method achieved 90%, 76%, and 62% weight reduction on MONK's benchmark problems, significantly outperforming magnitude-based methods.

Other foundational contributions included:

  • Mozer and Smolensky's "Skeletonization" (1989): Early work on removing hidden units
  • E.D. Karnin's simple pruning procedure (1990): Proposing sensitivity-based approach during training[11]
  • Russell Reed's comprehensive survey (1993): Synthesized early approaches, summarizing pruning algorithms from the late 1980s and early 1990s and highlighting their role in reducing connections based on information theory and heuristics[12]

Modern resurgence (2015-present)

The field experienced renewed interest starting in 2015 with Song Han's work on magnitude-based iterative pruning, demonstrating 9-13× compression on AlexNet and VGGNet.[13] His subsequent "Deep Compression" paper (2016) combined pruning with quantization and Huffman coding, achieving 35-49× compression without accuracy loss and becoming one of the top-5 most cited papers in ISCA's 50-year history.[4]

In 2016, Hao Li et al. introduced filter pruning for convolutional networks in "Pruning Filters for Efficient ConvNets", targeting entire convolutional filters to create cascading effects through the network.[14]

The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin in 2018, revolutionized understanding of pruning by showing that randomly-initialized dense networks contain sparse subnetworks ("winning tickets") that can match full network accuracy when trained in isolation.[15] This ICLR 2019 Best Paper Award winner demonstrated that networks with 10-20% of original parameters could achieve comparable performance, fundamentally changing perspectives on network sparsity and suggesting that large over-parameterized networks are not just learning effective weights, but acting as a search space to find well-initialized sparse structures.

From 2020-2024, the field exploded with more than 3,000 pruning papers published—representing over half of all neural network compression research.[4] Recent advances include:

  • Efficient LLM pruning methods: SparseGPT (2023) enabling one-shot pruning of 175-billion parameter models[16], and Wanda (2023) combining weight magnitudes with activation norms[17]
  • Vision transformer token pruning: Dynamic sparsification methods
  • Hardware-aware structured pruning techniques: Methods specifically designed for GPU and edge accelerators
  • Pruning at initialization methods: SNIP (2019), GraSP (2020), SynFlow (2020)

Major technology companies including NVIDIA, Meta, Intel, AMD, and Qualcomm have deployed pruning in production systems for model compression and edge deployment.

Pruning in decision trees

A decision tree grown to purity typically overfits the training data. Decision tree pruning prevents overfitting and improves generalization by simplifying the tree structure: nodes and subtrees that do not improve validation accuracy, or that add unnecessary complexity, are removed.[18]

Pre-pruning (early stopping)

Pre-pruning, also known as early stopping, limits growth during induction with constraints applied during the tree-building process. This prevents the tree from growing to its full complexity in the first place.[19] Common pre-pruning criteria include:

  • Maximum Depth: Limiting the maximum number of levels the tree can grow
  • Minimum Samples per Leaf: Requiring a leaf node to contain a minimum number of training samples
  • Minimum Samples per Split: Requiring a node to have at least a certain number of samples before it can be split
  • Minimum Impurity Decrease: Requiring a split to reduce the node's impurity (for example Gini impurity or entropy) by at least a specified threshold
  • Statistical Tests: Using tests like the chi-squared test to determine if a proposed split is statistically significant[20]

The primary advantage of pre-pruning is its computational efficiency. By preventing the tree from growing to its full complexity, it saves significant training time, which can be crucial for very large datasets.[21]

However, pre-pruning suffers from a significant drawback known as the horizon effect.[22] A split may seem unpromising based on a local stopping criterion, causing the algorithm to halt growth, even though this "weak" split might have led to very informative splits further down the branch. Pre-pruning's greedy nature prevents it from seeing beyond this short-term horizon, potentially leading to a suboptimal, less accurate tree.

Post-pruning (backward pruning)

Post-pruning, sometimes called backward pruning, is the more common and often more effective approach.[21] In this strategy, the decision tree is first allowed to grow to its maximum size, fitting the training data completely (and thus, overfitting). Afterwards, the algorithm goes back through the tree and systematically removes nodes and subtrees that do not significantly contribute to its predictive accuracy.[23]

The decision to prune a node is typically based on its performance on a separate validation dataset (also called a pruning set) or by using a metric that penalizes complexity.[22] By first growing the full tree, post-pruning avoids the horizon effect, as it has a global view of all potential splits. While this is computationally more expensive than pre-pruning, it generally leads to more accurate and robust models.[19]

Post-pruning algorithms can traverse the tree in two ways:[22]

  • Bottom-up: The algorithm starts at the leaf nodes and works its way up towards the root. For each internal node, it evaluates whether replacing its subtree with a single leaf node would improve performance. This is the most common approach as it ensures that the relevance of an entire subtree is considered before any pruning decision is made about its parent nodes.
  • Top-down: The algorithm starts at the root and traverses downwards. This approach is less common because it risks pruning a large subtree that may contain highly valuable nodes deep within it.[21]

Post-pruning algorithms

Reduced Error Pruning (REP)

Reduced Error Pruning is one of the simplest and most intuitive post-pruning algorithms.[22] It relies on a separate dataset, known as a pruning or validation set, which was not used to train the tree. The algorithm works as follows:[24][25]

  1. Build a full tree: First, a decision tree is grown to its maximum depth on the training data until it overfits
  2. Use a validation set: The data is split into a training set and a validation set. The validation set is used exclusively for evaluating pruning decisions
  3. Iterate bottom-up: The algorithm iterates through every non-leaf (internal) node in the tree, starting from the nodes closest to the leaves and moving up towards the root
  4. Evaluate pruning: For each node, it considers the effect of "pruning" it. Pruning a node means removing the entire subtree rooted at that node and replacing it with a single leaf node. The class assigned to this new leaf node is the majority class of the training examples that fall under that node
  5. Compare accuracy: The accuracy of the original tree (with the subtree intact) is compared to the accuracy of the pruned tree (with the new leaf node) on the validation set
  6. Make the decision: If the pruned tree has an accuracy on the validation set that is equal to or better than the original tree, the subtree is permanently removed. Otherwise, the subtree is kept
  7. Repeat: This process is repeated for all internal nodes until no more nodes can be pruned without decreasing the validation accuracy

The main advantage of REP is its simplicity and speed.[22] However, its effectiveness depends heavily on the size and representativeness of the validation set. If the validation set is too small, the pruning decisions may be unreliable and could lead to removing useful subtrees.

Cost-Complexity Pruning (CCP)

Cost-Complexity Pruning, also known as weakest link pruning, is a more sophisticated and widely used method introduced in the CART algorithm.[23] Instead of relying solely on a validation set's error rate, CCP introduces a regularization parameter, α (alpha), that explicitly penalizes the complexity of the tree.

The algorithm defines a cost-complexity measure for a tree T as:[26][27]

Rα(T) = R(T) + α·|Tleaves|

Where:

  • R(T) is the total misclassification error of the tree T on the training data
  • |Tleaves| is the number of terminal (leaf) nodes in the tree T, serving as a measure of its complexity
  • α ≥ 0 is the complexity parameter. It controls the trade-off between the tree's fit to the training data and its complexity. A value of α=0 means no penalty for complexity, resulting in the largest tree. As α increases, the penalty for having more leaves grows, leading to more aggressive pruning and smaller trees[28]

The CCP algorithm works as follows:[29][30]

  1. Build a full tree: A maximal tree, Tmax, is grown on the entire training dataset
  2. Find the weakest link: For each internal node t in the tree, the algorithm calculates an "effective alpha" (αeff). This is the value of α at which pruning the subtree at t becomes beneficial. The node with the smallest non-negative αeff is considered the "weakest link"
  3. Generate a sequence of trees: The algorithm starts with T0=Tmax. It then finds the weakest link in T0, prunes it to create a new tree T1, finds the weakest link in T1, prunes it to create T2, and so on. This process continues until only the root node is left. This generates a finite sequence of optimally pruned subtrees for a range of α values
  4. Select the best tree: The final step is to choose the best tree from this sequence. This is typically done using k-fold cross-validation. For each tree in the sequence, its performance is evaluated on unseen data (the validation folds). The tree that achieves the best performance (for example highest accuracy) is selected as the final model[28]

CCP is a powerful and principled method for finding the right-sized tree. By generating a sequence of candidate trees and using cross-validation, it provides a robust way to select a model that generalizes well. Modern libraries like scikit-learn expose CCP through the `ccp_alpha` hyperparameter.[31]
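
As a concrete illustration, the following sketch uses scikit-learn's cost-complexity pruning path to generate the α sequence and cross-validation to select `ccp_alpha`; the dataset and the 5-fold split are illustrative choices, not part of the CART definition:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-3: grow a full tree and compute the pruning path (sequence of effective alphas)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Step 4: evaluate each candidate alpha with cross-validation and keep the best one
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)
print(pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))
```

Each candidate α in the path corresponds to one tree in the nested sequence of optimally pruned subtrees described above.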

Comparison of decision tree pruning methods

Comparison of pre-pruning and post-pruning approaches
Method | When to use | Computational cost | Pros | Cons | Canonical reference
Pre-pruning | Limited data, fast training desired | Low (prevents growth) | Simple; prevents deep overfitting early; computationally efficient | May stop too soon (horizon effect); missed structure | [18]
Cost-complexity post-pruning | Standard CART workflow | High (full tree + pruning) | Strong CV-based model selection; nested subtrees; avoids horizon effect | Requires pruning path & CV; computationally expensive | [26][1]
Reduced-error post-pruning | Separate validation set available | Medium (full tree + validation) | Conceptually simple; robust; easy to implement | Needs hold-out set; can be aggressive; less data for training | [24]

Practical considerations

Modern libraries expose both pre- and post-pruning controls. For example, scikit-learn's `DecisionTreeClassifier` supports pre-pruning (for example `max_depth`, `min_samples_leaf`, `min_samples_split`, `min_impurity_decrease`) and post-pruning via `ccp_alpha`.[18]
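
A minimal sketch of how these controls appear in a `DecisionTreeClassifier` constructor (the specific values are illustrative rather than recommended defaults):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: constrain growth during tree induction
pre_pruned = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=10,
    min_samples_split=20,
    min_impurity_decrease=1e-3,
)

# Post-pruning: grow the full tree, then apply cost-complexity pruning with a chosen alpha
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```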

The choice between pre-pruning and post-pruning represents a classic algorithmic trade-off between computational efficiency and model optimality. Pre-pruning is computationally cheap, making it suitable for rapid prototyping or on massive datasets where building a full tree is infeasible.[23] In contrast, post-pruning is computationally expensive but makes more globally informed decisions, typically resulting in a more robust final model. The choice is therefore not merely technical but strategic, depending on the available resources and performance requirements of the application.

Classic works in decision tree pruning include Quinlan's C4.5 algorithm, which popularized decision-tree pruning (including a form of pessimistic error-based post-pruning).[32]

Pruning in artificial neural networks

In the domain of deep learning, pruning has become an essential technique for model compression and optimization. Modern deep neural networks, such as those used for computer vision and natural language processing, are often massively over-parameterized, containing millions or even billions of parameters.[33] This over-parameterization, while beneficial for achieving high accuracy during training, results in models that are computationally expensive, slow to run, and have large memory footprints, making them difficult to deploy on resource-constrained devices like smartphones or embedded systems.[34]

Definition and mathematical formulation

Formally, neural network pruning transforms a model f(x; W) into f(x; M ⊙ W'), where M ∈ {0, 1}^|W'| is a binary mask setting certain parameters to zero, W' is a (potentially modified) collection of parameters, and ⊙ represents the elementwise product operator.[8] The goal is reducing parameter count and computational resources while maintaining accuracy on the task.

The process creates sparsity by eliminating connections (unstructured pruning) or entire structures (structured pruning). For a network with L layers and parameters θ = {W₁, W₂, ..., WL}, pruning identifies a subset S ⊂ θ of parameters to remove, typically by computing importance scores and removing low-importance parameters according to a pruning criterion.

The pruning optimization problem can be formulated as:

minimize: L(f(x; M ⊙ W), y)
subject to: ||M||₀ ≤ k

where L is the loss function, ||M||₀ counts non-zero elements in the mask, and k is the target number of remaining parameters. This NP-hard combinatorial optimization problem requires approximation methods in practice.
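
The masked formulation is straightforward to express in code. The following NumPy sketch builds a binary mask M that keeps the k largest-magnitude entries of a weight matrix; shapes and the sparsity level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))            # dense weight matrix W'
k = int(0.1 * W.size)                      # keep the k largest-magnitude entries

# Binary mask M with roughly ||M||_0 = k, built by keeping the top-k magnitudes
threshold = np.sort(np.abs(W), axis=None)[-k]
M = (np.abs(W) >= threshold).astype(W.dtype)

W_pruned = M * W                           # elementwise product M ⊙ W
print(int(M.sum()), "non-zero parameters remain")
```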

Types of pruning by granularity

Comparison of pruning granularities
Type | Granularity | Hardware Speedup | Typical Sparsity | Advantages | Disadvantages
Unstructured | Individual weights | Requires specialized hardware | 70-95% | Highest compression, better accuracy | No speedup on standard hardware
Structured | Filters/channels/layers | Universal speedup | 20-60% | Hardware-friendly, real acceleration | Lower compression, more accuracy loss
Semi-structured | Block patterns (N:M) | GPU-optimized | 50% (2:4) | Hardware support, good compression | Limited patterns, requires recent GPUs

Unstructured pruning

Unstructured pruning, also called fine-grained pruning, removes individual weights anywhere in the network without pattern constraints.[4] This approach achieves 50-90% sparsity with minimal accuracy loss by creating irregular sparse matrices. For example, VGG-16 on CIFAR-10 achieves 92.36% accuracy at 10% density with unstructured pruning versus 89.33% with structured pruning.[35]

However, unstructured pruning suffers a critical limitation: it provides no speedup on standard hardware without specialized sparse computation libraries or hardware support. GPUs and CPUs are highly optimized for dense matrix operations, and the irregular sparsity pattern created by unstructured pruning means the underlying computation (matrix multiplication) still operates on the original dense matrix dimensions, with many multiplications by zero that are not skipped.[36][37]

Structured pruning

Structured pruning removes entire filters, channels, neurons, or layers while maintaining regular architecture.[14] This hardware-friendly approach achieves universal speedup on standard processors by reducing both memory and FLOPs. For instance, ResNet-50 on ImageNet achieves 2× acceleration with ~1.4% top-5 accuracy loss using L1-norm filter pruning.[4]

Structured pruning exploits cascading effects: pruning an output filter in layer L automatically removes the corresponding input channels in layer L+1, creating architectural modifications without specialized kernels. Research shows that VGG-16 holds roughly 90% of its weights in the fully-connected layers, yet these layers account for only about 1% of its FLOPs, so pruning convolutional filters, which dominate the computation, is particularly effective for CNNs.[38]

Types of structured pruning include:

  • Neuron pruning: Removes entire nodes using metrics like Average Percentage of Zeros (APoZ), which identifies neurons outputting mostly zeros on calibration data[39]
  • Filter/channel pruning: Targets entire convolutional filters (output channels). Pruning a filter in layer L removes one output channel from layer L and the corresponding input channel for layer L+1[40][41]
  • Layer pruning: Removes entire layers to reduce network depth. For large language models, the Block Influence (BI) method measures the extent a layer alters hidden states for intelligent layer removal[42]

Semi-structured pruning

Semi-structured pruning represents a middle ground, exemplified by N:M sparsity patterns where N of every M consecutive weights are non-zero.[43] NVIDIA's 2:4 structured sparsity achieves 2× speedup on Ampere A100 GPUs using sparse tensor cores while maintaining 50% sparsity, combining advantages of both approaches. This method is particularly effective because modern NVIDIA GPUs have dedicated hardware support for 2:4 sparsity patterns.
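
A small sketch of how a 2:4 mask can be constructed, keeping the two largest-magnitude weights in each consecutive group of four (this is an illustrative reimplementation, not NVIDIA's ASP code):

```python
import numpy as np

def two_four_mask(w):
    """Binary mask keeping the 2 largest-|w| entries in each consecutive group of 4."""
    groups = np.abs(w).reshape(-1, 4)            # assumes w.size is divisible by 4
    keep = np.argsort(groups, axis=1)[:, 2:]     # indices of the two largest per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(w.shape)

w = np.random.randn(8, 16)
print(two_four_mask(w).mean())                   # 0.5, i.e. 50% of weights kept
```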

Types of pruning by timing

Pruning methods can also be categorized based on when pruning is applied relative to the model training process.[44]

Post-training pruning

Post-training pruning is the traditional and most common approach. A dense network is first trained to convergence. Then, a pruning algorithm is applied to remove unimportant parameters. This is often followed by a "fine-tuning" phase, where the pruned network is retrained for a few epochs to allow the remaining weights to adjust and recover any accuracy lost during pruning.[8][44] This is often done iteratively: prune, fine-tune, prune, fine-tune, achieving gradual adaptation to sparsity and reduced catastrophic forgetting.

During-training pruning

During-training pruning integrates the pruning process directly into the training phase. Sparsity is encouraged from the beginning or introduced gradually as training progresses. This can be achieved through regularization methods (like L1 regularization, which pushes weights towards zero) or by using dynamic pruning masks that are updated during training.[44] Methods include DeepR (2018) using stochastic updates, and dynamic pruning where masks change at runtime.

Pruning at initialization

Pruning at initialization (PaI), also known as pre-training pruning, is where the network is pruned at initialization, before any training has occurred. Inspired by the Lottery Ticket Hypothesis, methods like SNIP (2019) use connection sensitivity to prune one-shot,[45] and GraSP (2020) preserves gradient flow.[46] Advantages include reduced training costs, though critical analysis by Frankle et al. (2020) showed pruning-at-initialization methods often underperform magnitude pruning after training.[47]

Global versus local pruning

Local pruning applies independent pruning within each layer with uniform or preset ratios, preventing layer collapse but yielding suboptimal sparsity distributions.[4] This safer approach works well when importance varies greatly across layers and prevents catastrophic performance degradation.

Global pruning ranks importance across the entire network, automatically discovering optimal layer-wise sparsity patterns. While generally achieving better accuracy, global pruning risks layer collapse at high speedup ratios.[35] For LLMs, where outlier features exhibit 20× magnitude differences across layers, protected global pruning preserves ≥10% of parameters per group to mitigate this risk.

Pruning methods and algorithms

Magnitude-based pruning

Magnitude-based pruning, dating to 1988 and popularized by Han et al. (2015), prunes weights with smallest absolute values: prune if |w| < threshold τ.[13] The core assumption is that weights with a small absolute value (magnitude) have a smaller impact on the network's output and thus contribute less to its predictive power.

Despite its simplicity, magnitude pruning remains a strong baseline, with TensorFlow reporting 6× compression with minimal loss.[48] This can be applied at different scopes:[49]

  • Layer-wise pruning: A separate pruning threshold (or percentage) is determined for each layer. Weights within each layer are ranked by magnitude, and the lowest-ranking ones are removed.
  • Global pruning: All weights across the entire network (or all prunable layers) are collected into a single group. They are ranked globally by magnitude, and a single threshold is used to prune the lowest-ranking weights, regardless of which layer they belong to. Global pruning is often more effective as it allows the algorithm to automatically discover which layers are more sensitive to pruning.[8]

For filter pruning, L1 and L2 norms rank importance: Score(f) = Σ|wi| for L1, Score(f) = √(Σwi²) for L2. Modern variants include Wanda, which combines weight magnitudes with activation norms: Score(w) = |w| × ||x||, outperforming pure magnitude methods on LLMs.[17] Recent work includes confident magnitude-based pruning (2024) adding uncertainty quantification.[50]
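
The following sketch illustrates layer-wise versus global magnitude thresholding and a simplified Wanda-style score; weight and activation shapes are illustrative, and the actual Wanda method additionally compares scores within per-output groups computed from calibration data:

```python
import numpy as np

def magnitude_masks(weights, sparsity, scope="global"):
    """Binary masks keeping the largest-|w| weights, either per layer or globally."""
    if scope == "layer":
        return {name: (np.abs(w) > np.quantile(np.abs(w), sparsity)).astype(w.dtype)
                for name, w in weights.items()}
    all_magnitudes = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(all_magnitudes, sparsity)      # one threshold for every layer
    return {name: (np.abs(w) > threshold).astype(w.dtype) for name, w in weights.items()}

def wanda_score(w, x_norm):
    """Simplified Wanda-style importance: |w| scaled by the per-input activation norm."""
    return np.abs(w) * x_norm[np.newaxis, :]               # w: (out, in), x_norm: (in,)

rng = np.random.default_rng(0)
weights = {"fc1": rng.normal(size=(128, 64)), "fc2": rng.normal(size=(10, 128))}
masks = magnitude_masks(weights, sparsity=0.8, scope="global")
```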

Gradient-based and second-order methods

Optimal Brain Damage

Optimal Brain Damage (OBD) approximates the change in objective function using Taylor series expansion with three key simplifications: diagonal Hessian approximation (cross terms neglected), extremal approximation (gradient term gi = 0 at convergence), and quadratic approximation (higher-order terms discarded).[2]

The final saliency formula becomes:

sk = (1/2)·hkk·uk²

where hkk is the diagonal Hessian element computed via backpropagation and uk is the weight value. OBD successfully reduced a 2578-parameter network by 60% (removing 1500 parameters) with minimal accuracy impact, demonstrating that removing unimportant weights improves generalization and reduces required training examples.

Optimal Brain Surgeon

Optimal Brain Surgeon (OBS) extends OBD by using the full inverse Hessian H⁻¹ rather than diagonal approximations, allowing weight modifications during pruning.[10] For pruning weight q (with unit selector vector eq), the optimal adjustment to the remaining weights is:

δw = −(wq / [H⁻¹]qq) · H⁻¹ eq

with saliency:

Lq = wq² / (2[H⁻¹]qq)

OBS significantly outperforms magnitude-based methods and OBD, which "often remove the wrong weights," permitting more aggressive pruning for the same training error and yielding better test generalization. Extensions include Layer-wise Optimal Brain Surgeon (2017)[51] and The Combinatorial Brain Surgeon (2022) for simultaneous weight removal.

Taylor expansion methods

First-order Taylor expansion approximates the change in loss caused by pruning a parameter (or activation) h:

ΔL(h) ≈ (∂L/∂h)·h

yielding importance:

I(h) = |(∂L/∂h)·h|

This computationally efficient criterion requires only first-order gradients, demonstrating 10× reduction on 3D-convolutional filters with small accuracy drops.[52]

Modern gradient-based methods include:

  • SNIP (Single-shot Network Pruning): Uses connection sensitivity ∂L/∂mc normalized across parameters[45]
  • GraSP: Removes weights with least effect on gradient flow preservation[46]
  • Mean Gradient Method: Novel criterion for CNNs achieving 5.64× FLOPs reduction on VGG-16 CIFAR-10 with <1% accuracy loss[53]

Regularization-based pruning

L1 regularization (Lasso) adds penalty λ||θ||₁ = λΣi|θi| to the loss, inducing exact sparsity by driving weights to zero through non-differentiable subgradients. L2 regularization (Ridge) adds λ||θ||₂² = λΣiθi², encouraging small weights without exact zeros.[54]

Growing Regularization gradually increases the penalty λ(t) over training iterations, improving pruning schedules by implicitly exploiting Hessian information. DeepHoyer introduces scale-invariant, differentiable sparsity measures: DeepHoyer-Square (DHS) = (||θ||₁/||θ||₂)², optimizable via standard SGD.[55]

Network Slimming (2017) prunes channels by penalizing batch normalization scaling factors with L1 regularization, achieving 20× model size reduction and 5× computing operations reduction on VGG-16 CIFAR-10.[56]

Pruning schedules

One-shot pruning removes the target percentage in a single step after training, offering negligible pruning cost and fast execution but requiring carefully designed criteria and risking layer collapse.[4] Examples include SNIP, SynFlow (data-free), and SparseGPT for 100B+ parameter LLMs. One-shot methods are particularly valuable for very large models where iterative retraining is prohibitively expensive.

Iterative pruning alternates score-prune-update cycles: train to performance level, prune p% parameters, fine-tune several epochs, repeat until target sparsity.[13] While computationally expensive, iterative methods achieve better final accuracy through gradual adaptation to sparsity and reduced catastrophic forgetting. Studies on VGG-16 CIFAR-10 and LLaMA-7B consistently show iterative outperforming one-shot approaches.

Automated Gradual Pruning (AGP) uses polynomial sparsity schedules:[57]

st = sf + (si − sf)·(1 − (t − t0)/(n·Δt))³

where sf is final sparsity, si is initial sparsity, t0 is the step at which pruning begins, and the schedule gradually increases sparsity over n pruning steps spaced Δt apart.
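
A small sketch of this schedule as a function of the training step (parameter names and values are illustrative):

```python
def agp_sparsity(step, s_i=0.0, s_f=0.9, t_start=0, t_end=10_000):
    """Cubic polynomial sparsity schedule used by Automated Gradual Pruning."""
    if step <= t_start:
        return s_i
    if step >= t_end:
        return s_f
    progress = (step - t_start) / (t_end - t_start)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

print([round(agp_sparsity(t), 3) for t in (0, 2_500, 5_000, 10_000)])
```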

The Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin at MIT in 2018, states that randomly-initialized dense networks f(x; θ₀) contain sparse subnetworks f(x; m⊙θ₀) (where m is a binary mask) that, when trained in isolation, reach accuracy comparable to the full network.[15] This ICLR 2019 Best Paper Award winner demonstrated winning tickets with 10-20% of original parameters achieving full network performance.

The hypothesis provides a powerful theoretical framework for understanding why pruning can be so effective, suggesting that large, over-parameterized networks are not just learning effective weights; they are also acting as a search space to find well-initialized sparse structures that are inherently good at learning.

Iterative magnitude pruning

The Iterative Magnitude Pruning (IMP) algorithm identifies winning tickets:

  1. Randomly initialize network: θ₀
  2. Train to convergence: θ
  3. Prune p% of weights by magnitude: create mask m
  4. Reset remaining weights: θpruned = m ⊙ θ₀
  5. Repeat steps 2-4

The rewinding variant, proposed for stabilizing larger networks, resets to weights at iteration k (not initialization): θpruned = m ⊙ θk, establishing that networks become stable to SGD noise early in training, creating linearly-connected minima.[58]
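
A condensed PyTorch-style sketch of IMP with optional rewinding is shown below; `train_fn` stands in for the user's training loop, and a complete implementation would also keep pruned weights at zero during training, for example by re-applying the masks after every optimizer step:

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, prune_fraction=0.2, rounds=5,
                                rewind_state=None):
    """Sketch of IMP: train, prune the smallest surviving weights, rewind, repeat.

    train_fn(model) runs one full training pass in place; rewind_state is a
    state_dict captured at initialization (theta_0) or at an early iteration k (theta_k).
    """
    if rewind_state is None:
        rewind_state = copy.deepcopy(model.state_dict())          # theta_0
    # prune weight matrices only (skip biases and 1-D parameters)
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model)                                           # step 2: train
        for name, p in model.named_parameters():                  # step 3: prune by magnitude
            if name not in masks:
                continue
            surviving = p.data[masks[name].bool()].abs()
            threshold = torch.quantile(surviving, prune_fraction)
            masks[name] *= (p.data.abs() > threshold).float()
        model.load_state_dict(rewind_state)                       # step 4: rewind weights
        for name, p in model.named_parameters():
            if name in masks:
                p.data *= masks[name]                             # re-apply accumulated mask
    return model, masks
```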

The key insight of the LTH is that the structure of the sparse subnetwork and its specific initial weight values are both crucial. A stronger version of the hypothesis has also been proven, showing that a sufficiently over-parameterized network contains a subnetwork that can approximate a target function well even before any training.[59]

Theoretical validation

Malach et al. (2020) provided the first theoretical proof of a strong lottery ticket hypothesis for two-layer networks, formally validating that pruning is sufficient.[60] Extensions include applications to pre-trained BERT networks, where matching subnetworks exist at 40-90% sparsity at initialization.[61]

Critical analysis by Frankle et al. (2020) showed pruning-at-initialization methods underperform magnitude pruning after training, with shuffling weights preserving accuracy—suggesting these methods identify architecture rather than specific initializations.[47]

Pruning in search and planning

Outside of model training, pruning is fundamental in symbolic AI search. In two-player games, alpha–beta pruning eliminates branches that cannot affect the minimax value, enabling deeper searches with the same compute while returning the same optimal move as plain minimax under perfect play.[3]

Alpha-beta pruning is a search algorithm that seeks to decrease the number of nodes evaluated by the minimax algorithm in its search tree. It stops evaluating a move when at least one possibility has been found that proves the move to be worse than a previously examined move. Such moves need not be evaluated further. When applied to a standard minimax tree, it returns the same move as minimax would, but prunes away branches that cannot possibly influence the final decision.

The algorithm maintains two values, alpha and beta, which represent the minimum score that the maximizing player is assured of and the maximum score that the minimizing player is assured of, respectively. As the search proceeds, these values are updated, and branches are pruned when beta ≤ alpha, indicating that the current position will not be reached in optimal play.
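
A minimal sketch of the procedure in Python; the game-specific `children` and `value` functions are assumed to be supplied by the caller:

```python
import math

def alphabeta(node, depth, alpha, beta, maximizing, children, value):
    """Minimax search with alpha-beta pruning.

    children(node) returns successor positions; value(node) evaluates a position.
    A branch is cut as soon as beta <= alpha, because the opponent already has a
    better option elsewhere and optimal play will never reach it.
    """
    succ = children(node)
    if depth == 0 or not succ:
        return value(node)
    if maximizing:
        best = -math.inf
        for child in succ:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False, children, value))
            alpha = max(alpha, best)
            if beta <= alpha:
                break          # beta cutoff: the minimizer will avoid this branch
        return best
    best = math.inf
    for child in succ:
        best = min(best, alphabeta(child, depth - 1, alpha, beta, True, children, value))
        beta = min(beta, best)
        if beta <= alpha:
            break              # alpha cutoff: the maximizer will avoid this branch
    return best
```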

Applications and performance

Computer vision

Pruning achieves substantial compression across vision architectures:

ResNet models:

  • ResNet-50 ImageNet: 2× acceleration with ~1.4% top-5 accuracy loss (L1-norm pruning); 76% top-1 accuracy with only 25% FLOPs (GWCS method)[62]
  • ResNet-56 CIFAR-10: 25% sparsity shows optimal performance; 50-75% sparsity suboptimal due to low channel counts[63]
  • ResNet-110 CIFAR-100: 62% FLOPs reduction with accuracy maintained (HDBOFP method)

VGG architectures demonstrate extreme compressibility:

  • VGG-16 CIFAR-10: 20× model size reduction, 5× computing operations reduction (Network Slimming); 60% channel sparsity with <1% performance drop[35]
  • VGG-16 ImageNet: 5× acceleration with 0.59% top-5 accuracy loss (FBS method)[64]

Object detection:

  • YOLOv5l: 63.8% parameter reduction and 37.4% FLOPs reduction with favorable accuracy balance[65]
  • YOLOv3: Pruning with quantization enables deployment on Jetson TX2 edge devices with significant energy reduction

Natural language processing

BERT pruning demonstrates exceptional compression potential:

  • Optimal BERT Surgeon (oBERT): 10× model size compression with <1% accuracy drop; 10× CPU-inference speedup with <2% accuracy drop; 29× speedup with <7.5% accuracy drop[66]
  • Rasa BERT (real-world deployment): 60% neuron sparsity achieving F1 ~0.895 (2.8% relative decrease), 28% inference acceleration, model size 406MB → 197MB (51% reduction); remarkably, 100% WQ and WK pruning (removing self-attention entirely) achieved F1 = 0.897[67]
  • Neural Magic BERT: 12-layer 90% sparse achieves 4.05× speedup beating 3-layer dense BERT in accuracy; 3-layer 70% sparse achieves 9.20× speedup matching dense accuracy[68]

Large language model pruning has advanced dramatically:

  • SparseGPT: First method efficiently pruning 10-100B+ parameter models; OPT-175B and BLOOM-176B pruned to 60% unstructured sparsity in under 4.5 hours with minimized perplexity[16]
  • Wanda: Simple weight × activation approach achieving competitive 50% sparsity, 300× faster than SparseGPT, no retraining needed[17]
  • LLaMA pruning: Tailored-LLaMA achieves 95.68% accuracy recovery at 20% compression, 86.54% at 50% compression in <1 hour fine-tuning; AMD's LLaMA 3.1 405B conservative pruning removes 26 layers achieving >97% RougeL[69]

Vision transformers

Token pruning methods achieve significant speedups:

  • DynamicViT: Dynamic token sparsification with learnable prediction modules (NeurIPS 2021)
  • SPViT: Computation-aware soft pruning reducing DeiT-T latency to 26ms on mobile[70]
  • NViT (NVIDIA Research): Hardware-friendly global structural pruning achieving 1.9× speedup with minimal accuracy loss[71]
  • Isomorphic Pruning: Groups sub-structures by topology; improved DeiT-Tiny from 74.52% to 77.50% accuracy by pruning DeiT-Base, and ConvNext-Tiny from 82.06% to 82.18%[72]

Edge computing and mobile deployment

Edge applications demonstrate pruning's practical value:

  • MP-YOLO (autonomous vehicles): Model size 6MB → 2.2MB (63% reduction), +4.7% AP50, +4.2% AP on DAIR-V2X dataset using LAMP pruning[73]
  • Industrial IoT: VGG16 and ResNet18 pruning achieves energy savings without accuracy compromise on BloodMNIST, VisA, MVTec datasets[74]
  • Low-Rank LLaMA2-7B: ~50% faster training vs. 8-bit quantization, ~1.25× inference speed-up, ~50% weights removed without fine-tuning[75]

Industry use cases

Pruning has found applications across various industries where efficiency and interpretability are important:

  • Healthcare: In medical diagnostics, pruned decision trees can create simple, interpretable rules for predicting patient risk factors or treatment outcomes. Pruned neural networks can accelerate the analysis of medical images on portable devices.[6]
  • Finance: Financial institutions use pruned decision trees for credit scoring and risk assessment. The resulting models are not only faster but also more transparent, making it easier to explain lending decisions to regulators and customers.[6]
  • Marketing: Pruned models are used for customer segmentation and targeted advertising. Their simplicity allows marketing teams to understand the key drivers of customer behavior and tailor their strategies accordingly.[6]

Advantages and limitations

Benefits

Documented compression and speedup metrics
Model | Compression Ratio | Speedup | Accuracy Impact | Reference
BERT-base | 10× | 29× | <7.5% loss | Optimal BERT Surgeon
ResNet-50 | 62-76% reduction | 2-5× | ~1.4% top-5 loss | Multiple studies
VGG-16 | 20× | – | <1% | Network Slimming
GPT-2 | 70% | 2.5× | Maintained | Nature 2025
YOLOv4 | 96.7% | – | Balanced | Research papers

Pruning provides numerous benefits:

  • Model size reduction: Reduces storage requirements and deployment costs
  • Inference acceleration: Lowers latency and enables real-time applications
  • Power consumption reduction: Critical for battery-powered edge devices
  • Memory footprint reduction: Allows larger batch sizes and more efficient inference
  • Better generalization: Originally motivating OBD, pruning can act as regularization[2]
  • Bandwidth reduction: Lower model transfer costs for cloud and edge deployment
  • Cost savings: Cloud inference cost reductions up to 70%
  • Interpretability: Simpler models are easier to understand and explain

Challenges and limitations

Accuracy-efficiency trade-off: Excessive pruning loses important information; beyond 80% sparsity, models become incapable of recovery. Different tasks and datasets exhibit varying sensitivity—ImageNet-trained models show more accuracy deterioration than CIFAR100-trained models.[76]

Hardware-software compatibility: Unstructured pruning requires specialized hardware for actual speedups. The Rasa study found that 50%-sparse BERT provides almost no speed-up due to computational overhead, with tf.scatter_nd adding ~15ms. Extreme sparsity (80%+) is needed on GPUs to see benefits.[67] Standard GPUs and CPUs are optimized for dense matrix operations, making irregular sparsity patterns inefficient without specialized support.

Pruning schedule complexity: Determining optimal schedules is non-trivial—pruning too eagerly (1 epoch) or too slowly both harm models. Different layers have different sensitivities requiring careful calibration.[67]

Layer collapse: Global pruning may eliminate entire groups at high speedup ratios. Protected global pruning preserving ≥10% parameters per group mitigates this.[76]

Model and task specificity: Vision Transformers are harder to compress than CNNs. Each component has characteristic maximum sparsity—BERT self-attention can sustain 100% pruning but intermediate layers cannot.[67] Recovery requirements increase with pruning ratio, making fine-tuning computationally expensive.

Implementation challenges: Requires careful tuning of hyperparameters, understanding of model architecture sensitivities, and often multiple iterations to achieve optimal results. The process can be time-consuming and requires expertise.

Comparison with other model compression techniques

Pruning is one of the three main pillars of model compression, alongside quantization and knowledge distillation. While all three aim to create more efficient models, they operate on different principles.

Comparison of pruning, quantization, and knowledge distillation
Technique | Mechanism | Primary Effect | Typical Accuracy Impact | Hardware Considerations
Pruning | Removes redundant weights, neurons, or filters from the model's architecture | Reduces parameter count and FLOPs, leading to a smaller and potentially faster model | Can maintain accuracy with fine-tuning; high pruning rates can cause degradation | Structured pruning is necessary for significant speedups on standard hardware (GPUs/CPUs)
Quantization | Reduces the bit-precision of weights and/or activations (for example from 32-bit floats to 8-bit integers)[77] | Reduces model size (memory footprint) and can significantly speed up inference due to faster integer arithmetic | Minor accuracy drop is common, often recoverable with Quantization-Aware Training (QAT) | Most effective with hardware that has native support for low-precision arithmetic (for example Tensor Cores, TPUs)
Knowledge distillation | Trains a smaller "student" model to mimic the behavior (output probabilities) of a larger "teacher" model[78] | Creates a new, compact model with a different architecture and weights, but trained to capture the "dark knowledge" of the larger model | Aims to transfer the high performance of the teacher to the smaller student; some performance drop is expected but often less than training the small model from scratch | The student model can be designed specifically to be efficient on target hardware

Pruning vs. Quantization: Pruning changes the model's architecture by removing parts of it. Quantization keeps the architecture the same but changes the numerical representation of the parameters. Pruning reduces the number of parameters, while quantization reduces the size of each parameter.[78]

Pruning vs. Knowledge Distillation: Pruning is a process of simplifying an existing, trained model. Knowledge distillation is a training process for creating a new, smaller model. Pruning results in a subset of the original model's parameters, whereas the student model in distillation has entirely new parameters learned from scratch.[77]

Synergistic use

These techniques are not mutually exclusive and are often most powerful when used in combination.[79] A common and highly effective pipeline for model compression involves:

  1. Pruning: First, prune a large, trained model to remove structural redundancy and reduce its FLOPs
  2. Quantization: Next, quantize the remaining weights of the pruned model to reduce its memory footprint and leverage fast integer arithmetic
  3. Knowledge Distillation: Alternatively, a large model can be used as a teacher to train a smaller, structurally efficient student model, which can then itself be pruned and/or quantized

By combining these methods, practitioners can achieve dramatic reductions in model size and latency, often by an order of magnitude or more, making it possible to deploy state-of-the-art AI on a wide variety of hardware.[80]

Advanced topics and modern frontiers

Pruning research continues to evolve, moving beyond simple heuristics applied to standard CNNs. Current research focuses on applying pruning to state-of-the-art architectures, automating the complex process of deciding what and how much to prune, and developing more dynamic and adaptive pruning strategies. This trajectory mirrors the broader evolution of machine learning itself: a progression from static, heuristic-based methods to dynamic, automated, and learned approaches.

Automated pruning and AutoML

Manually determining the optimal pruning strategy for a given network—deciding which layers to prune and by how much—is a complex and time-consuming process involving extensive trial and error. To address this, the field is moving towards AutoML for pruning, where the pruning policy itself is learned automatically.[81]

These methods frame the search for the best pruned architecture as an optimization problem:

  • Meta-Learning Approaches: Methods like MetaPruning train a separate "meta-network" (called a PruningNet) that learns to generate the optimal weights for any given pruned architecture. By sampling different pruned structures during training, the meta-network learns a general mapping from architecture to weights. This allows for a fast search over many candidate pruned networks without having to train each one from scratch.[82]
  • Reinforcement Learning (RL) and Bayesian Methods: Other approaches use RL agents or Bayesian optimization to explore the space of possible pruning configurations. The agent proposes a pruning action (for example a set of per-layer pruning rates), receives a reward based on the resulting model's accuracy and size, and updates its policy to find configurations that maximize the reward.[81][83]
  • Gradient-Based Automatic Pruning: Techniques like AutoPrune introduce a set of trainable auxiliary parameters that control the pruning mask. These parameters are optimized via gradient descent alongside the model weights, allowing the network to learn its own sparse structure automatically and robustly, without sensitive hyperparameters like pruning thresholds.[84]

Dynamic and adaptive pruning

The most advanced frontier in pruning research involves moving away from a static pruned structure. In static pruning, once a network is pruned, its sparse structure remains fixed. Dynamic pruning methods allow this structure to change:

  • Dynamic Pruning During Training: Some methods allow the pruning mask to be updated during the training process. For example, a technique called RigL (Rigged Lottery) prunes weights with the smallest magnitudes and then reactivates (regrows) connections with the largest gradient magnitudes, allowing the sparse topology to evolve and adapt throughout training.
  • Spatio-Temporal Pruning: For models that process sequential data, like spiking neural networks (SNNs) used with Dynamic Vision Sensors, pruning can be adapted to the temporal dimension. Spatio-temporal pruning algorithms dynamically adjust the network's structure to reduce not only spatial redundancy (within a single frame) but also temporal redundancy that exists across consecutive frames of data.[85] This represents a highly adaptive form of pruning tailored to the specific characteristics of the data stream.
  • Input-Dependent Dynamic Pruning: Masks change per input at inference, optimizing for each sample. This allows the model to use different sparse structures for different inputs, allocating computational resources where they are most needed.

Tools and frameworks

PyTorch

PyTorch's torch.nn.utils.prune module (available since 1.4.0) provides built-in pruning capabilities including random_unstructured(), l1_unstructured(), ln_structured(), and global_unstructured().[86] The module uses forward hooks applying masks during inference, supports iterative pruning with mask accumulation via PruningContainer, and allows custom pruning methods via BasePruningMethod.
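
A short example of these documented entry points, using an illustrative two-layer model and pruning amounts:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Layer-wise unstructured pruning: zero out 30% of the first layer's weights by L1 magnitude
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Global unstructured pruning: rank the listed weights together and prune 20% overall
parameters_to_prune = [(model[0], "weight"), (model[2], "weight")]
prune.global_unstructured(parameters_to_prune,
                          pruning_method=prune.L1Unstructured,
                          amount=0.2)

# Fold the masks into the weight tensors, making the pruning permanent
for module, name in parameters_to_prune:
    prune.remove(module, name)

print(float((model[0].weight == 0).float().mean()))   # fraction of zeroed weights
```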

Torch-Pruning, implementing the DepGraph algorithm (CVPR 2023), provides automatic dependency analysis for structural pruning across LLMs, Vision Transformers, CNNs, and detection models.[87] Supporting GroupMagnitudeImportance, GroupTaylorImportance, and custom metrics, it enables high-level pruning with global strategies and isomorphic pruning (ECCV 2024).

TensorFlow

The TensorFlow Model Optimization Toolkit provides magnitude-based pruning via prune_low_magnitude() with polynomial decay schedules, integrated with Keras layers.[48] Features include UpdatePruningStep and PruningSummaries callbacks, strip_pruning() to remove wrappers, TensorFlow Lite support with XNNPACK acceleration, and PruneForLatencyOnXNNPack policy for mobile/edge devices. The toolkit supports structured pruning patterns including 2:4 and N:M sparsity.
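
A brief sketch of the documented workflow; the model architecture, the schedule endpoints, and the commented-out training call are illustrative:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model with magnitude pruning on a polynomial schedule (0% -> 80% sparsity)
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)

pruned.compile(optimizer="adam",
               loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# pruned.fit(x_train, y_train, epochs=2,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```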

NVIDIA tools

NVIDIA TensorRT Model Optimizer supports depth pruning (layer removal), width pruning (neurons, attention heads, channels), magnitude-based and activation-based pruning for LLMs and transformers, with TensorRT integration for optimized inference.[88]

NeMo Framework provides script-based pruning (scripts/llm/gpt_prune.py) powered by TensorRT Model Optimizer, supporting combined depth and width pruning for Llama, Mistral, and other LLMs with importance calibration using training data.[89]

NVIDIA ASP (Automatic SParsity) enables 2:4 structured sparsity for Ampere GPUs, achieving up to 2× speedup using sparse tensor cores with TensorRT 8.0+ integration.[90]

Additional frameworks

Microsoft NNI (Neural Network Intelligence) provides unified API for 10+ pruning algorithms including L1NormPruner, FPGMPruner, SlimPruner, TaylorFOWeightPruner, with ModelSpeedup for real acceleration, supporting PyTorch and TensorFlow.[91]

JaxPruner (Google Research 2023) offers JAX-based sparsity with magnitude, top-K, random, and gradient-based methods, integrating with Optax optimizers and Flax models, demonstrating minimal overhead with sparsity distributions and scheduling functions.[92]

ONNX Runtime provides graph optimizations (constant folding, node elimination/fusion), dynamic and static quantization (INT8/INT4), and TensorRT EP integration for cross-platform deployment.[93]

Theoretical foundations

Generalization bounds

Pruning's generalization benefits have theoretical support. For pruned networks, generalization error bounds scale with the effective dimensionality deff (the number of non-zero parameters) rather than the total parameter count, so the bound tightens as parameters are removed.[94] This shows generalization improves with higher pruning rates up to a threshold.

PAC-Bayes compression bounds provide state-of-the-art guarantees: for a stochastic classifier Q with prior P, with probability 1−δ the expected error of Q is bounded by its empirical error plus a term that grows with the Kullback–Leibler divergence KL(Q‖P) and shrinks with the number of training samples.[95] Arora et al. (2018) showed compression-based bounds orders of magnitude better than parameter counting, with the first non-vacuous ImageNet-scale guarantees achieved in 2019.

Path-norm bounds provide rescaling-invariant generalization guarantees expressed in terms of the path-lifting Φ(θ) of the parameters, and are applicable to ResNets, VGGs, and U-Nets.[96]

Statistical mechanics analysis

Teacher-student frameworks show sparse networks generalize better than dense networks for fixed parameter counts, with pruning benefit increasing with pruning instability (accuracy drop immediately after pruning)—suggesting pruning regularizes similarly to noise injection, producing flatter models.[97]

Key researchers

Song Han (MIT Associate Professor, 80,683+ citations) pioneered magnitude-based pruning (2015), Deep Compression (ICLR 2016 - 35-49× compression), AMC (ECCV 2018 - AutoML compression), and EIE inference engine (ISCA 2016 - top-5 most cited in 50 years). Recent work includes AWQ and SmoothQuant for LLM quantization. Awards include ICLR'16 Best Paper, NSF CAREER, "35 Innovators Under 35", IEEE "AI's 10 to Watch", Sloan Research Fellowship.[98]

Jonathan Frankle and Michael Carbin (MIT) introduced the Lottery Ticket Hypothesis (ICLR 2019 Best Paper), stabilization methods, and critical analysis of pruning-at-initialization, fundamentally changing understanding of network sparsity.[15]

Gongfan Fang, Xinyin Ma, and Xinchao Wang (National University of Singapore xML Lab) developed DepGraph (CVPR 2023), Isomorphic Pruning (ECCV 2024), LLM-Pruner (NeurIPS 2023), and Structural Pruning for Diffusion Models (NeurIPS 2023), advancing structured pruning across architectures.[72]

Elias Frantar and Dan Alistarh (IST Austria) created SparseGPT (ICML 2023), enabling efficient one-shot pruning of 100B+ parameter models.[16]

Pavlo Molchanov and Huanrui Yang (NVIDIA Research) contributed Taylor expansion methods (ICLR 2017), importance estimation (CVPR 2019), and NViT (CVPR 2023) for hardware-aware pruning.[52]

Yann LeCun, John S. Denker, Sara A. Solla, Babak Hassibi, and David G. Stork established foundational second-order methods (OBD, OBS) in the late 1980s-early 1990s that continue influencing modern approaches.[2][10]

See also

References

  1. 1.0 1.1 1.2 Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. (1984). Classification and Regression Trees. Chapman & Hall/CRC. https://www.taylorfrancis.com/books/mono/10.1201/9781315139470/classification-regression-trees-leo-breiman-jerome-friedman-olshen-charles-stone
  2. 2.0 2.1 2.2 2.3 2.4 2.5 2.6 LeCun, Y.; Denker, J.; Solla, S. (1989). "Optimal Brain Damage". NeurIPS. https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf
  3. 3.0 3.1 3.2 Russell, S.; Norvig, P. (2010). Artificial Intelligence: A Modern Approach (3rd ed.). Pearson. https://lib.ysu.am/open_books/416544.pdf
  4. 4.0 4.1 4.2 4.3 4.4 4.5 4.6 Cheng, H.; et al. (2024). "A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations". IEEE TPAMI. https://arxiv.org/pdf/2308.06767
  5. Han, S.; Mao, H.; Dally, W. J. (2015). "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding". arXiv:1510.00149. https://arxiv.org/abs/1510.00149
  6. 6.0 6.1 6.2 6.3 Jang, D. (2024). "Mastering Decision Trees: A Dive into Pruning Techniques in Supervised Learning". Medium. https://medium.com/@jangdaehan1/mastering-decision-trees-a-dive-into-pruning-techniques-in-supervised-learning-47003890159d
  7. Pruning (artificial neural network). Wikipedia. https://en.wikipedia.org/wiki/Pruning_(artificial_neural_network)
  8. 8.0 8.1 8.2 8.3 Polukhin, A. (2022). "Pruning: The History And Overview". https://polukhin.tech/2022/10/27/pruning-the-history-and-overview
  9. Rethinking Weight Decay for Efficient Neural Network Pruning. NIH/PMC 2022. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8950981/
  10. 10.0 10.1 10.2 Hassibi, B.; Stork, D. (1993). "Optimal Brain Surgeon and general network pruning". IEEE. https://ieeexplore.ieee.org/document/298572/
  11. Karnin, E.D. (1990). "A Simple Procedure for Pruning Back-Propagation Trained Neural Networks". IEEE. https://ieeexplore.ieee.org/document/80236
  12. Reed, R. (1993). "Pruning Algorithms - A Survey". IEEE. https://ieeexplore.ieee.org/document/248452
  13. 13.0 13.1 13.2 Han, S.; et al. (2015). "Learning both Weights and Connections for Efficient Neural Networks". NeurIPS. https://arxiv.org/abs/1506.02626
  14. 14.0 14.1 Li, H.; et al. (2016). "Pruning Filters for Efficient ConvNets". ICLR 2017. https://arxiv.org/abs/1608.08710
  15. 15.0 15.1 15.2 Frankle, J.; Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". ICLR 2019. https://arxiv.org/abs/1803.03635
  16. 16.0 16.1 16.2 Frantar, E.; Alistarh, D. (2023). "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot". ICML 2023. https://arxiv.org/abs/2301.00774
  17. 17.0 17.1 17.2 Sun, M.; et al. (2024). "A Simple and Effective Pruning Approach for Large Language Models". ICLR 2024. https://arxiv.org/abs/2306.11695
  18. 18.0 18.1 18.2 scikit-learn User Guide (2025). "Decision Trees". https://scikit-learn.org/stable/modules/tree.html
  19. 19.0 19.1 TIBCO Software. "Pruning or Pre-Pruning". https://docs.tibco.com/pub/sfire-dsc/7.0.1/doc/html/user-guide/pruning-or-pre-pruning.htm
  20. Kaur, P.; Singh, M. (2012). "A Survey of Decision Tree Pruning Methods". International Journal of Computer Applications. https://research.ijcaonline.org/volume60/number12/pxc3884304.pdf
  21. 21.0 21.1 21.2 Lamarr Institute. "Decision Trees Pruning". https://lamarr-institute.org/blog/decision-trees-pruning/
  22. 22.0 22.1 22.2 22.3 22.4 Decision tree pruning. Wikipedia. https://en.wikipedia.org/wiki/Decision_tree_pruning
  23. 23.0 23.1 23.2 GeeksforGeeks. "Pruning decision trees". https://www.geeksforgeeks.org/machine-learning/pruning-decision-trees/
  24. 24.0 24.1 Elomaa, T.; Kääriäinen, M. (2001). "An Analysis of Reduced Error Pruning". Journal of Artificial Intelligence Research. https://jair.org/index.php/jair/article/download/10284/24526
  25. Banu, S.; Gomathy, C. (2020). "A comprehensive study on pre-pruning and post-pruning methods of decision tree classification algorithm". ResearchGate. https://www.researchgate.net/publication/350950093
  26. 26.0 26.1 IBM SPSS (2005). "Cost-Complexity Pruning Process". https://public.dhe.ibm.com/software/analytics/spss/support/Stats/Docs/Statistics/Algorithms/14.0/TREE-pruning.pdf
  27. PennState Statistics. "STAT 857: Cost-Complexity Pruning". https://online.stat.psu.edu/stat857/node/60/
  28. 28.0 28.1 GeeksforGeeks. "How to choose α in cost-complexity pruning?". https://www.geeksforgeeks.org/machine-learning/how-to-choose-a-in-cost-complexity-pruning/
  29. scikit-learn. "Post pruning decision trees with cost complexity pruning". https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
  30. Nandi, S. (2023). "A Comprehensive Guide to Pre-Pruning and Post-Pruning in Decision Trees". Medium. https://medium.com/@sushmita2310/a-comprehensive-guide-to-pre-pruning-and-post-pruning-in-decision-trees-c556c48aafdf
  31. scikit-learn API. "DecisionTreeClassifier". https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
  32. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann. https://dl.acm.org/doi/10.5555/583200
  33. OpenDataScience.com (2021). "What is Pruning in Machine Learning?". https://opendatascience.com/what-is-pruning-in-machine-learning/
  34. GeeksforGeeks. "Neural Network Pruning in Deep Learning". https://www.geeksforgeeks.org/deep-learning/neural-network-pruning-in-deep-learning/
  35. Neural Network Pruning Methods Research. Multiple sources compiled. 2024.
  36. Vadera, S.; Ameen, S. (2022). "A Survey on Unstructured and Structured Pruning of Deep Neural Networks". arXiv. https://arxiv.org/abs/2502.07189
  37. Kaggle. "Pruning a Neural Network". https://www.kaggle.com/code/nitinsss/pruning-a-neural-network
  38. Gildenblat, J. "Pruning deep neural networks". https://jacobgil.github.io/deeplearning/pruning-deep-learning
  39. Hu, H.; et al. (2016). "Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures". https://arxiv.org/abs/1607.03250
  40. Zia, M. W.; et al. (2022). "A Novel Filter Pruning Method for Deep Convolutional Neural Network Compression". MDPI Applied Sciences. https://www.mdpi.com/2076-3417/12/21/11184
  41. Intel AI Lab Distiller. "Tutorial: Pruning Filters and Channels". https://intellabs.github.io/distiller/tutorial-struct_pruning.html
  42. Shortened LLaMA: Depth Pruning for Large Language Models (2024). https://arxiv.org/html/2402.02834v2
  43. Zhou, A.; et al. (2021). "Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch". ICLR 2021. https://arxiv.org/abs/2102.04010
  44. Datature. "A Comprehensive Guide to Neural Network Model Pruning". https://datature.io/blog/a-comprehensive-guide-to-neural-network-model-pruning
  45. Lee, N.; et al. (2019). "SNIP: Single-shot Network Pruning based on Connection Sensitivity". ICLR 2019. https://arxiv.org/abs/1810.02340
  46. Wang, C.; et al. (2020). "Picking Winning Tickets Before Training by Preserving Gradient Flow". ICLR 2020. https://openreview.net/pdf?id=SkgsACVKPH
  47. Frankle, J.; et al. (2020). "Pruning Neural Networks at Initialization: Why are We Missing the Mark?". https://arxiv.org/abs/2009.08576
  48. TensorFlow Model Optimization - Pruning. https://www.tensorflow.org/model_optimization/guide/pruning
  49. He, Y.; et al. (2018). "Optimizing Layer-wise Magnitude-based Pruning with Reinforcement Learning". IJCAI Proceedings. https://www.ijcai.org/proceedings/2018/0330.pdf
  50. Confident magnitude-based pruning (2024). https://arxiv.org/abs/2408.04759
  51. Dong, X.; et al. (2017). "Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon". NeurIPS 2017. https://arxiv.org/abs/1705.07565
  52. Molchanov, P.; et al. (2017). "Pruning Convolutional Neural Networks for Resource Efficient Inference". ICLR 2017. https://arxiv.org/abs/1611.06440
  53. Channel pruning based on mean gradient (2019). https://www.sciencedirect.com/science/article/abs/pii/S0165168418303517
  54. Wang, H.; et al. (2021). "Neural Pruning via Growing Regularization". ICLR 2021. https://arxiv.org/abs/2012.09243
  55. Yang, H.; et al. (2020). "DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures". ICLR 2020. https://arxiv.org/abs/1908.09979
  56. Liu, Z.; et al. (2017). "Learning Efficient Convolutional Networks through Network Slimming". ICCV 2017. https://arxiv.org/abs/1708.06519
  57. Zhu, M.; Gupta, S. (2018). "To prune, or not to prune: exploring the efficacy of pruning for model compression". ICLR Workshop 2018. https://arxiv.org/abs/1710.01878
  58. Frankle, J.; et al. (2020). "Linear Mode Connectivity and the Lottery Ticket Hypothesis". ICML 2020. https://proceedings.mlr.press/v119/frankle20a.html
  59. Lottery ticket hypothesis. Wikipedia. https://en.wikipedia.org/wiki/Lottery_ticket_hypothesis
  60. Malach, E.; et al. (2020). "Proving the Lottery Ticket Hypothesis: Pruning is All You Need". ICML 2020. https://proceedings.mlr.press/v119/malach20a/malach20a.pdf
  61. Chen, T.; et al. (2020). "The Lottery Ticket Hypothesis for Pre-trained BERT Networks". NeurIPS 2020. https://proceedings.neurips.cc/paper/2020/hash/b6af2c9703f203a2794be03d443af2e3-Abstract.html
  62. Global Wavelet Channel Search method (2021). https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2021.760554/full
  63. Pruning Applications and Performance Research. Multiple sources compiled. 2024.
  64. Filter-Based Sampling method (2021). https://arxiv.org/pdf/2101.09671
  65. Pruned-YOLO: Learning Efficient Object Detector Using Model Pruning (2021). https://link.springer.com/chapter/10.1007/978-3-030-86380-7_4
  66. Kurtic, E.; et al. (2023). "The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models". https://arxiv.org/abs/2203.07259
  67. Rasa. "Pruning BERT to accelerate inference". https://rasa.com/blog/pruning-bert-to-accelerate-inference/
  68. Neural Magic. "Pruning Hugging Face BERT with Compound Sparsification". https://neuralmagic.com/blog/pruning-hugging-face-bert-compound-sparsification/
  69. AMD. "LLaMA 3.1 405B MLPerf pruning". https://rocm.blogs.amd.com/artificial-intelligence/mlperf-llama-pruning/README.html
  70. Kong, Z.; et al. (2021). "SPViT: Soft Token Pruning for Vision Transformers". https://arxiv.org/abs/2112.13890
  71. Yang, H.; et al. (2023). "NViT: Global Vision Transformer Pruning with Hessian-Aware Saliency". CVPR 2023. https://github.com/NVlabs/NViT
  72. Fang, G.; et al. (2024). "Isomorphic Pruning for Vision Models". ECCV 2024. https://arxiv.org/abs/2407.04616
  73. MP-YOLO for storage-limited edge devices (2025). https://www.sciencedirect.com/science/article/abs/pii/S1047320325001749
  74. Sparse Evolutionary Training for Industrial IoT (2025). https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2025.1563942/full
  75. Mobius ML. "Low-Rank LLaMA2 pruning". https://mobiusml.github.io/low-rank-llama2/
  76. Dataset sensitivity in pruning (2024). https://arxiv.org/html/2406.12315v1
  77. Neptune.ai. "Deep learning model optimization methods". https://neptune.ai/blog/deep-learning-model-optimization-methods
  78. Neubig, G. (2024). "Distillation". Carnegie Mellon University, Advanced NLP course slides. https://phontron.com/class/anlp2024/assets/slides/anlp-11-distillation.pdf
  79. AI Stack Exchange (2023). "When to use pruning, quantization, distillation, and others when optimizing speed of a DL model?". https://ai.stackexchange.com/questions/43054/when-to-use-pruning-quantization-distillation-and-others-when-optimizing-spee
  80. Reddit (2024). "Prune, distill, quantize - what's the best order?". https://www.reddit.com/r/computervision/comments/1i84qw7/prune_distill_quantize_whats_the_best_order/
  81. Li, Y.; et al. (2024). "AutoSculpt: A Pattern-based Automated Pruning Framework for DNNs". arXiv. https://arxiv.org/html/2412.18091v1
  82. Liu, Z.; et al. (2019). "MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning". ICCV 2019. https://openaccess.thecvf.com/content_ICCV_2019/papers/Liu_MetaPruning_Meta_Learning_for_Automatic_Neural_Network_Channel_Pruning_ICCV_2019_paper.pdf
  83. Reddit (2024). "A new method for structured pruning of neural networks". https://www.reddit.com/r/MachineLearning/comments/1d85eqd/r_a_new_method_for_structured_pruning_of_neural/
  84. Liu, Z.; et al. (2020). "AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters". National Science Foundation. https://par.nsf.gov/servlets/purl/10181832
  85. Gao, Z.; et al. (2025). "Spatio-Temporal Pruning for Spiking Neural Networks". Frontiers in Neuroscience. https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2025.1545583/full
  86. PyTorch Pruning Tutorial. https://docs.pytorch.org/tutorials/intermediate/pruning_tutorial.html
  87. Fang, G. "Torch-Pruning: Towards Any Structural Pruning". https://github.com/VainF/Torch-Pruning
  88. NVIDIA. "Pruning and Distilling LLMs using NVIDIA TensorRT Model Optimizer". https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer
  89. NVIDIA. "LLM Model Pruning with NVIDIA NeMo Framework". https://developer.nvidia.com/blog/llm-model-pruning-and-knowledge-distillation-with-nvidia-nemo-framework
  90. NVIDIA. "Accelerating Inference with Sparsity Using Ampere and TensorRT". https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt
  91. NNI Pruning Documentation. https://nni.readthedocs.io/en/stable/compression/pruning.html
  92. JaxPruner: A Concise Library for Sparsity Research (2023). https://arxiv.org/abs/2304.14082
  93. ONNX Runtime Model Optimizations. https://onnxruntime.ai/docs/performance/model-optimizations/
  94. Theoretical Characterization of How Neural Network Pruning Affects its Generalization. NeurIPS 2022. https://arxiv.org/abs/2301.00335
  95. Lotfi, S.; et al. (2022). "PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization". NeurIPS 2022. https://arxiv.org/abs/2211.13609
  96. Path-metrics, pruning, and generalization (2024). https://arxiv.org/html/2405.15006v1
  97. Bartoldson, B. R.; et al. (2020). "The Generalization-Stability Tradeoff In Neural Network Pruning". https://arxiv.org/abs/1906.03728
  98. Song Han MIT Profile. https://hanlab.mit.edu/songhan