Training & Optimization

Showing 1-60 of 136 articles

AdaGrad

See also: Machine learning terms. AdaGrad (short for Adaptive Gradient Algorithm) is an optimizer for gradient descent-based machine learning that gives every...

Adafactor

Adafactor is a memory-efficient adaptive learning-rate optimizer for training deep neural networks, introduced by Noam Shazeer and Mitchell Stern in the 2018...

Deep Learning

Adam optimizer

The Adam optimizer (short for Adaptive Moment Estimation) is an algorithm for first-order gradient descent-based optimization of stochastic objective...

AdamW

AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update rule, applying the decay directly to the weights instead of...

Machine Learning

AutoML (Automated Machine Learning)

AutoML (Automated Machine Learning) is the automation of the end-to-end pipeline of applying machine learning to real-world data, replacing manual trial and...

Developer Tools, MLOps

Axolotl

Axolotl is a free and open-source framework for fine-tuning and post-training large language models, written in Python and driven entirely by a single YAML...

Developer Tools, Open Source AI

Bayesian Optimization

See also: Machine learning terms, Hyperparameter tuning, AutoML. Bayesian optimization is a sequential, model-based strategy for finding the global optimum of...

Machine Learning

Candidate Sampling

Candidate sampling is a family of training-time optimization techniques used in machine learning to reduce the computational cost of models that must choose...

Machine Learning, Natural Language Processing

Clipping

Clipping is a family of techniques in machine learning that constrain numerical values to lie within a specified range or below a specified magnitude. The most...

Deep Learning, Machine Learning

Context Parallelism

Context Parallelism (CP) is a distributed training strategy that partitions the input sequence dimension of a transformer across multiple accelerators and uses...

AI Infrastructure

Convergence

Convergence in machine learning is the point at which an iterative optimization algorithm reaches a stable solution, meaning the loss function stops decreasing...

Machine Learning, Mathematics

Convex Function

A convex function is a real-valued function whose graph curves upward into a bowl or cup shape, so that the line segment (chord) connecting any two points on...

Machine Learning, Mathematics

Convex Optimization

Convex optimization is the branch of mathematical optimization that minimizes a convex function over a convex set, a problem class with one defining advantage:...

Machine Learning, Mathematics

Convex Set

A convex set is a set of points in which the line segment connecting any two points of the set lies entirely within the set [1][3]. Formally, a set in a real...

Machine Learning, Mathematics

Cosine learning rate schedule

The cosine learning rate schedule, also called cosine annealing, is a learning rate decay strategy that lowers the optimizer step size from a peak value to a...

Deep Learning

Cost

See also: Machine learning terms. In machine learning, cost is the scalar number that summarizes how badly a model is performing on a chunk of data. The...

Curriculum learning

Curriculum learning is a training strategy for machine learning models in which training examples are presented in a meaningful, easy-to-hard order rather than...

Deep Learning, Machine Learning

DPO

DPO (Direct Preference Optimization) is an alignment technique for large language models that directly optimizes a language model policy from human preference...

AI Alignment

DeepSeek-R1-Distill

DeepSeek-R1-Distill is a family of six open-weight reasoning language models released by DeepSeek on January 20, 2025, alongside the flagship DeepSeek-R1...

AI Models, Chinese AI

DeepSpeed

DeepSpeed is an open-source deep learning optimization library developed by Microsoft that makes distributed training and inference of large models efficient,...

AI Infrastructure, Deep Learning

DiLoCo

DiLoCo (Distributed Low-Communication training) is a distributed optimization algorithm for neural networks introduced by Google DeepMind in November 2023 to...

Google DeepMind

Distributed training

Distributed training is the practice of training a single machine learning model using many compute devices in parallel, splitting the data, the model, or both...

MLOps

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA (Weight-Decomposed Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method for large neural networks introduced in February 2024 by...

Machine Learning

Domain adaptation

Domain adaptation is the subfield of transfer learning that adapts a model trained on a labelled source domain so it performs well on a related but different...

Machine Learning

Dropout

Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during training, forcing the network to...

Deep Learning

Dropout Regularization

Dropout regularization is a regularization technique for neural networks that prevents overfitting by randomly setting a fraction of neuron activations to zero...

Deep Learning, Machine Learning

Early Stopping

Early stopping is a regularization technique that halts the training of an iterative machine learning model as soon as its performance on a held-out validation...

Deep Learning, Machine Learning

Elastic Net

See also: Regularization, Linear regression. Elastic Net is a regularization and variable selection method for linear regression and other generalized linear...

Machine Learning

Empirical Risk Minimization

Empirical risk minimization (ERM) is the foundational principle of statistical learning theory: because the true risk (the expected loss over the unknown data...

Machine Learning

Expert Parallelism

Expert Parallelism (EP) is a model-parallelism strategy specific to Mixture of Experts (MoE) neural networks in which the individual expert sub-networks...

AI Infrastructure, Mixture of Experts

FP4 (4-bit floating point)

FP4 (4-bit floating point) is a numerical format that stores a real number in just 4 bits, the smallest floating-point type in mainstream use for deep...

AI Hardware, AI Inference

Fine Tuning

Fine-tuning is a machine learning technique that takes a pre-trained model and further trains it on a smaller, task-specific dataset, adjusting the model's...

Deep Learning, Machine Learning

Focal loss

Focal loss is a loss function that reshapes standard cross-entropy loss by adding a (1 - p_t)^gamma modulating factor, which down-weights well-classified (easy)...

Computer Vision, Deep Learning

Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel (FSDP) is a distributed training technique implemented in PyTorch that shards a model's parameters, gradients, and optimizer states...

AI Infrastructure, Deep Learning

GRPO

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning large language models that eliminates the separate critic...

AI Inference, Chinese AI

GaLore (Gradient Low-Rank Projection)

GaLore (Gradient Low-Rank Projection) is a memory-efficient training strategy for large neural networks that projects each weight matrix's gradient into a...

Machine Learning

Gradient

In machine learning, the gradient is the vector of partial derivatives of a loss function with respect to every model parameter, and it points in the direction...

Machine Learning, Mathematics

Gradient Accumulation

Gradient accumulation is a deep learning training technique that simulates a large batch size on limited GPU memory by summing the gradients from several small...

Deep Learning, Machine Learning

Gradient Descent

Gradient descent is a first-order iterative optimization algorithm that minimizes a differentiable loss function by repeatedly stepping in the direction of the...

Deep Learning, Machine Learning

Gradient checkpointing

Gradient checkpointing, also called activation checkpointing, activation recomputation, or rematerialization, is a memory-saving technique for training deep...

Deep Learning

Gradient clipping

Gradient clipping is a training technique that caps the magnitude of gradient values before they update model weights, preventing the excessively large...

Hinge Loss

Hinge loss is the margin-based loss function defined as max(0, 1 - y f(x)), used to train support vector machines (SVMs) and other maximum-margin classifiers,...

Machine Learning

HuggingFace PEFT

PEFT (Parameter-Efficient Fine-Tuning) is an open-source Python library from Hugging Face that adapts large pretrained models to new tasks by training only a...

Developer Tools, Open Source AI

HuggingFace TRL

TRL (Transformer Reinforcement Learning, now stylized as Transformers Reinforcement Learning) is an open-source Python library maintained by Hugging Face for...

Open Source AI, Reinforcement Learning

Hyperparameter

See also: Machine learning terms. A hyperparameter is a configuration setting in a machine learning algorithm that is fixed by the practitioner before training...

Deep Learning, Machine Learning

InstructGPT

InstructGPT is a family of language models released by OpenAI in January 2022 that take the base GPT-3 and fine-tune it to follow user instructions more...

AI Alignment, Large Language Models

KTO

KTO (Kahneman-Tversky Optimization) is a method for aligning large language models with human feedback using only a binary signal of whether a model output is...

AI Alignment, AI Inference

L0 Regularization

L0 regularization is a regularization technique in machine learning and statistics that penalizes the number of nonzero parameters in a model, a quantity...

Machine Learning

L1 Loss

L1 loss is a regression loss function equal to the average of the absolute differences between predicted values and target values, written as $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. It is also...

Machine Learning, Statistics

L1 Regularization

L1 regularization is a regularization technique in machine learning and statistics that prevents overfitting by adding the sum of the absolute values of a...

Machine Learning

L2 Loss

L2 loss is the squared-error loss function: for a true value $y$ and a predicted value $\hat{y}$, it is the squared difference $(y - \hat{y})^2$, and averaging it across a dataset gives...

Machine Learning, Statistics

L2 Regularization

See also: machine learning terms, regularization, L1 regularization, elastic net, overfitting. L2 regularization is a technique in machine learning and...

Machine Learning

LIMA (Less Is More for Alignment)

LIMA, short for "Less Is More for Alignment," is a 2023 research paper by Chunting Zhou and colleagues at Meta AI, Carnegie Mellon University, the University...

AI Research, Meta AI

LLaMA-Factory

LLaMA-Factory is an open-source unified framework for the efficient fine-tuning of large language models (LLMs) and vision-language models (VLMs). It...

Developer Tools, Open Source AI

Lasso Regression

Lasso regression (an acronym for Least Absolute Shrinkage and Selection Operator) is a linear regression method, introduced by Robert Tibshirani in 1996, that...

Machine Learning

Learning Rate

The learning rate is a hyperparameter in machine learning that controls how much a model's parameters change in response to the estimated error each time the...

Deep Learning, Machine Learning

Lion (optimizer)

Lion (EvoLved Sign Momentum) is a stochastic optimizer for training deep neural networks, introduced by researchers at Google in the February 2023 paper...

Algorithms, Google

LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes a pre-trained model's weights and trains small, injected low-rank...

Deep Learning, Machine Learning

LoftQ

LoftQ (short for LoRA-Fine-Tuning-aware Quantization) is a quantization and initialization framework for large language models that jointly quantizes a...

Large Language Models

Log Loss

Log loss is the negative log-likelihood of the predicted probabilities and the standard loss function for probabilistic classification: for binary labels it is...

Machine Learning, Mathematics