Model Architecture

61 articles

Showing 1-60 of 61 articles

ALiBi (Attention with Linear Biases)

ALiBi (Attention with Linear Biases) is a positional encoding method for transformer language models that, instead of adding positional embeddings to word...

Transformer Models

Albert Gu

Albert Gu is an American computer scientist, Assistant Professor of Machine Learning at Carnegie Mellon University, and co-founder and Chief Scientist of...

People

AutoML (Automated Machine Learning)

AutoML (Automated Machine Learning) is the automation of the end-to-end pipeline of applying machine learning to real-world data, replacing manual trial and...

Developer Tools, MLOps

Bahdanau attention

Bahdanau attention is the first attention mechanism for neural networks, introduced in 2014 to let a sequence-to-sequence decoder soft-align to every encoder...

Deep Learning, Natural Language Processing

Bidirectional

Bidirectional describes a sequence model in which the representation at every position depends on the entire input sequence,...

Neural Networks

BitNet

BitNet is a family of large language model architectures developed by Microsoft Research Asia that constrain the weights of a transformer to extremely low...

Large Language Models, Microsoft

BitNet b1.58

BitNet b1.58 is a ternary-weight large language model architecture from Microsoft Research in which every weight is constrained to one of three values, -1, 0,...

Large Language Models, Microsoft

Byte Latent Transformer

The Byte Latent Transformer (BLT) is a tokenizer-free large language model architecture introduced by researchers at Meta AI's Fundamental AI Research (FAIR)...

Large Language Models, Meta AI

Cross-attention

Cross-attention is a variant of the attention mechanism in which the queries are derived from one sequence or representation while the keys and values are...

Transformer Models

Depthwise Separable CNN

A depthwise separable convolution is a factorized form of convolution that decomposes a standard convolutional operation into two sequential steps: a depthwise...

Computer Vision, Machine Learning

Depthwise separable convolutional neural network (sepCNN)

A depthwise separable convolutional neural network (often abbreviated sepCNN) is a convolutional neural network that replaces...

Computer Vision, Neural Networks

Differential Transformer

The Differential Transformer (often shortened to Diff Transformer or DIFF Transformer) is a decoder-only neural sequence architecture introduced by researchers...

Microsoft, Transformer Models

Encoder

An encoder in machine learning is a neural network component that transforms input data (text, an image, audio, or code) into a compressed, structured...

Deep Learning

Feature Pyramid Network (FPN)

A Feature Pyramid Network (FPN) is a generic feature-extraction architecture for object detection and other dense-prediction tasks that builds a multi-scale...

Computer Vision

Graph Machine Learning Models

Graph machine learning models are neural networks designed to operate on data structured as graphs, where the input is a set of nodes connected by edges rather...

AI Models, Machine Learning

Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical model of sequential data in which an unobserved (hidden) sequence of discrete states follows a Markov process, and...

Machine Learning

Hyena

Hyena is a sub-quadratic, attention-free neural sequence operator that replaces the self-attention operator of the Transformer with a recurrence of long,...

Deep Learning

Infini-Attention

Infini-attention is an attention mechanism introduced by Google researchers Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal in the April 2024 paper...

Google, Transformer Models

Jamba

Jamba is a family of open-weight large language models from AI21 Labs, first released on March 28, 2024, and is the world's first production-grade language...

AI Companies, Large Language Models

Jamba2

Jamba2 is the second generation of hybrid State Space Model and Transformer language models released by AI21 Labs on January 8, 2026.[1] The family extends the...

AI Models, Large Language Models

Joint Embedding Predictive Architecture

Joint Embedding Predictive Architecture (JEPA) is a family of self-supervised, non-generative neural network architectures proposed by Yann LeCun in his June...

Machine Learning, Meta AI

LSTM

Long Short-Term Memory (LSTM) is a recurrent neural network architecture that learns long-range dependencies in sequential data by maintaining a separate,...

Deep Learning, Neural Networks

Large Concept Model

A Large Concept Model (LCM) is a research approach to language modeling, introduced by Meta AI's Fundamental AI Research (FAIR) group in December 2024, that...

Meta AI, Natural Language Processing

Layer normalization

Layer normalization is a technique for normalizing the activations of a neural network across the feature dimension of each individual sample, rather than...

Deep Learning

Linear Attention

Linear attention is a family of sub-quadratic attention mechanisms that replaces the softmax dot-product operation of standard Transformer self-attention with...

Transformer Models

Liquid AI

Liquid AI is an artificial intelligence company founded in 2023 by researchers from MIT CSAIL and headquartered at 314 Main Street in Cambridge,...

AI Companies, AI Models

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a specialized type of recurrent neural network (RNN) architecture designed to learn long-range dependencies in sequential...

Deep Learning, Machine Learning

Long-context language models

Long-context language models are large language models engineered to accept and reason over inputs far larger than the few-thousand-token windows used by early...

Large Language Models

LongNet

LongNet is a transformer variant introduced by Microsoft Research in July 2023 that is designed to scale attention to sequences exceeding one billion tokens...

Microsoft, Transformer Models

LongRoPE

LongRoPE is a context-window extension technique for large language models (LLMs) that use rotary position embeddings (RoPE). Introduced by researchers at...

Large Language Models, Microsoft

MEGABYTE

MEGABYTE is a transformer architecture for autoregressive modeling of very long sequences directly at the byte level, introduced by researchers at Meta AI...

Meta AI, Transformer Models

MMDiT (Multimodal Diffusion Transformer)

MMDiT (Multimodal Diffusion Transformer, sometimes written MM-DiT) is a transformer architecture for text-conditioned image generation that gives image tokens...

Diffusion Models, Image Generation

Machine learning terms/Sequence Models

Sequence models are a class of machine learning systems designed to process inputs or produce outputs that have a meaningful...

Machine Learning

Mamba

Mamba is a neural network architecture...

Large Language Models

Mamba 2

Mamba 2 is a state space model architecture introduced in the paper "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured...

AI Models, Deep Learning

Mamba-3

Mamba-3 is a sequence-modeling architecture in the state space model (SSM) family, introduced in March 2026 by researchers at Carnegie Mellon University and...

Deep Learning

Mixture of Depths

Mixture of Depths (MoD) is a technique for dynamically allocating computation to individual tokens within transformer-based language models. Introduced by...

Deep Learning, Machine Learning

Multi-Head Self-Attention

Multi-head self-attention is the core sequence-mixing mechanism of the Transformer architecture: it runs several scaled dot-product attention operations...

Deep Learning, Machine Learning

Multi-Query Attention (MQA)

Multi-Query Attention (MQA) is a variant of the multi-head attention mechanism used in transformer neural networks in which all query heads share a single key...

Transformer Models

Multi-head Latent Attention

Multi-head Latent Attention (MLA) is an attention mechanism for transformer models that achieves a 93.3% reduction in key-value cache size while maintaining or...

Deep Learning, Machine Learning

Node (neural network)

A node in a neural network is the basic computational element, an artificial neuron, that receives one or more inputs,...

Neural Networks

PagedAttention

PagedAttention is a KV-cache memory management algorithm for serving large language models that applies the virtual-memory paging technique used by operating...

AI Inference, AI Infrastructure

Perceiver

Perceiver is a family of general-purpose neural network architectures from DeepMind built around attention and a small latent bottleneck. The first model,...

Google DeepMind, Neural Networks

RMSNorm

RMSNorm (Root Mean Square Layer Normalization) is a feature normalization technique introduced by Biao Zhang and Rico Sennrich in 2019 that scales each...

Artificial Intelligence, Transformer Models

RWKV

RWKV (pronounced "RwaKuv") is an open-source neural network architecture that combines the parallelizable training of Transformers with the constant-time,...

Deep Learning, Machine Learning

RadixAttention

RadixAttention is a KV cache management technique introduced in SGLang that uses a radix tree data structure to automatically share and reuse cached key-value...

AI Inference, AI Infrastructure

Recurrent Neural Network

A recurrent neural network (RNN) is a class of artificial neural network designed to process sequential data by maintaining an...

Deep Learning, Machine Learning

Rotary Position Embedding

Rotary Position Embedding (RoPE) is a positional encoding method for transformer models that encodes a token's absolute position by rotating its query and key...

Deep Learning, Large Language Models

Self-attention

Self-attention is a mechanism that lets a neural network weigh how much every element of a single input sequence should influence every other element,...

Deep Learning, Machine Learning

Sliding window attention

Sliding window attention (SWA) is a sparse attention pattern in which each query token attends only to a fixed-size window of nearby tokens instead of to every...

Transformer Models

Sparse attention

Sparse attention is a family of techniques that cut the computational and memory cost of the attention mechanism in transformer models by letting each token...

Deep Learning, Machine Learning

SubQ

SubQ is a large language model released on May 5, 2026 by Subquadratic, a Miami-based startup that emerged from stealth claiming to have built the first...

AI Companies, Large Language Models

SwiGLU

SwiGLU (Swish-Gated Linear Unit) is the activation function used inside the feed-forward sublayer of most modern transformer large language models, including...

Deep Learning, Neural Networks

Titans (neural architecture)

Titans is a family of neural sequence-modeling architectures from Google Research that combines an attention-based "short-term memory" with a deep neural...

Google, Neural Networks

Tower

See also: two-tower model, dual encoder, cross-encoder, contrastive learning, embedding, recommendation system, information retrieval Not to be confused with...

Information Retrieval, Machine Learning

Transformers

Note: This article is about the neural network architecture introduced in 2017. For the open-source Python library by Hugging Face, see Hugging Face...

Deep Learning, Neural Networks

Unidirectional

Unidirectional is a property of a sequence model in which the representation or output at each position depends only on inputs...

Vision Transformer

The Vision Transformer (ViT) is a deep learning architecture that applies the transformer model, originally designed for natural language processing, to image...

Computer Vision

YOCO (You Only Cache Once)

YOCO ("You Only Cache Once") is a decoder-decoder neural network architecture for large language models introduced by researchers at Microsoft Research and...

Large Language Models, Microsoft

YaRN

YaRN (Yet another RoPE extensioN) is a compute-efficient method for extending the context window of large language models that use Rotary Position Embeddings...

AI Inference, Deep Learning