AI Inference

71 articles

Showing 1-60 of 71 articles

AWQ (Activation-aware Weight Quantization)

Activation-aware Weight Quantization (AWQ) is a post-training quantization method for large language models that compresses weights to 4-bit (and optionally...

Deep Learning, Large Language Models

AWS Inferentia

AWS Inferentia is a family of custom application-specific integrated circuits (ASICs) designed by Amazon Web Services for machine learning inference in the...

AI Hardware

Adaptive thinking

Adaptive thinking is an inference-time reasoning mode in the Anthropic Messages API in which a Claude model decides, on a per-request basis, whether to use...

Anthropic, Reasoning Models

Beam search

Beam search is a heuristic search algorithm for sequence generation that, at each decoding step, keeps only the top-K highest-scoring partial sequences (where...

Constitutional Classifiers

Constitutional Classifiers are a machine learning-based safety technique developed by Anthropic to defend large language models against universal jailbreak...

AI Alignment, AI Safety

Context caching

Context caching is a large-language-model API feature that stores parts of a request's input (system prompts, instructions, attached documents, or earlier...

Developer Tools, Large Language Models

Continuous Batching

Continuous batching is a scheduling technique for large language model (LLM) inference servers that inserts new requests into a running batch at the...

AI Infrastructure

DeepInfra

DeepInfra is a serverless AI inference cloud that hosts open-source and open-weight AI models and serves them to developers through a single pay-per-token API....

AI Companies, AI Infrastructure

Disaggregated serving

Disaggregated serving is an LLM inference architecture that physically separates the prefill phase and the decode phase of text generation onto different sets...

AI Infrastructure, Artificial Intelligence

Dynamic inference

Dynamic inference, also called input-adaptive inference, conditional computation, or adaptive computation, is the family of techniques that adapts a neural...

Mixture of Experts

EAGLE (speculative decoding)

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a lossless speculative decoding method that speeds up large language model (LLM)...

AI Infrastructure

EAGLE-2

EAGLE-2 ("Faster Inference of Language Models with Dynamic Draft Trees") is the second generation of the EAGLE family of speculative decoding methods for...

Large Language Models

Etched Sohu

Sohu is a transformer-specialized application-specific integrated circuit (ASIC) built by Etched, a Silicon Valley AI hardware startup founded in 2022 by...

AI Hardware

ExLlamaV2 (EXL2)

EXL2 (ExLlamaV2 format) is an open-source, mixed-bit weight-quantization format for compressing large language models so they run fast on a single...

Developer Tools, Open Source AI

FP4 (4-bit floating point)

FP4 (4-bit floating point) is a numerical format that stores a real number in just 4 bits, the smallest floating-point type in mainstream use for deep...

AI Hardware, Training & Optimization

Fireworks AI

Fireworks AI is an artificial intelligence infrastructure company that runs a high-performance inference platform for deploying and serving open large language...

AI Companies, Developer Tools

Flash-Decoding

Flash-Decoding is an inference-time variant of the FlashAttention algorithm that targets the decoding (autoregressive generation) phase of large language model...

Algorithms

GPTQ

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot post-training quantization method that compresses the weights of large language models to...

Deep Learning

GRPO

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning large language models that eliminates the separate critic...

Chinese AI, Reasoning Models

Google TPU 8i

Google TPU 8i is an eighth-generation Tensor Processing Unit from Google, built specifically for AI inference rather than model training. It was previewed at...

AI Hardware, Google

GraphRAG

GraphRAG is a graph-based approach to retrieval-augmented generation developed by Microsoft Research, first described publicly on February 13, 2024 and...

Information Retrieval, Microsoft

Groq LPU

The Groq LPU (Language Processing Unit) is a deterministic, SRAM-based AI inference chip designed by Groq, an American AI chip company founded in 2016, to run...

AI Hardware

H2O (Heavy-Hitter Oracle for KV Cache)

H2O (Heavy-Hitter Oracle) is a training-free, runtime KV cache eviction policy for autoregressive large language model inference. It identifies a small subset...

Large Language Models

Inference-time scaling

Inference-time scaling (also called test-time compute scaling) is the practice of improving an AI model's output quality by allocating more computational...

AI Research, Artificial Intelligence

Intel Crescent Island

Intel Crescent Island is a data-center GPU from Intel built for artificial-intelligence inference workloads. Intel announced it on October 14, 2025 at the Open...

AI Hardware

KTO

KTO (Kahneman-Tversky Optimization) is a method for aligning large language models with human feedback using only a binary signal of whether a model output is...

AI Alignment, Reinforcement Learning

KV Cache

A KV cache (key-value cache) is a memory optimization technique used during transformer inference that stores previously computed key and value tensors from...

Deep Learning, Machine Learning

Knowledge Distillation

Knowledge distillation (also known as model distillation) is a model compression technique in machine learning in which a smaller model, called the student, is...

Deep Learning, Machine Learning

LLM inference engine

An LLM inference engine (also called an LLM serving engine or LLM inference server) is the systems software stack that loads trained large language model...

AI Infrastructure, Large Language Models

LLM.int8()

LLM.int8() is an 8-bit matrix multiplication scheme for large language model inference that preserves accuracy across models up to 175 billion parameters by...

Large Language Models

Llama API

The Llama API is Meta's first-party hosted cloud service for running Llama models. Announced on April 29, 2025 at Meta's inaugural LlamaCon developer...

Developer Tools, Meta AI

Lookahead Decoding

Lookahead Decoding is a parallel decoding algorithm for accelerating inference in large language models, introduced in November 2023 by Yichao Fu, Peter...

Algorithms, Large Language Models

Medusa

Medusa is a large language model inference acceleration framework that speeds up text generation by adding multiple lightweight decoding heads on top of an...

AI Infrastructure

Model Compression

Model compression is a family of techniques that reduce the size, memory footprint, and computational cost of machine learning models while preserving as much...

NVIDIA Dynamo

NVIDIA Dynamo is an open-source, low-latency distributed inference serving framework designed to deploy and scale generative AI and reasoning models across...

AI Infrastructure, Developer Tools

NVIDIA Groq LPX Rack

NVIDIA Groq 3 LPX is a rack-scale inference accelerator that NVIDIA introduced at GTC 2026, built around 256 Groq Language Processing Units and designed to sit...

AI Hardware, NVIDIA

NVIDIA NIM

NVIDIA NIM (NVIDIA Inference Microservices) is a set of containerized, prebuilt-and-optimized model-serving microservices from NVIDIA that package an AI model,...

AI Infrastructure, Developer Tools

NVIDIA Picasso

NVIDIA Picasso is a...

AI Hardware, AI Infrastructure

NVIDIA Rubin CPX

NVIDIA Rubin CPX is a class of GPU announced by NVIDIA on September 9, 2025, purpose-built to accelerate the compute-heavy "context" phase of large-model...

AI Hardware, NVIDIA

NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM is an open-source library developed by NVIDIA for high-performance inference of large language models on NVIDIA GPUs. It provides a Python...

NVIDIA, Open Source AI

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is open-source model deployment software that lets teams run trained models from any machine learning or deep learning framework...

Deep Learning, Developer Tools

NormalFloat 4-bit (NF4)

NormalFloat 4-bit (NF4) is a 4-bit numerical data type for storing the weights of deep neural networks, introduced in the 2023 QLoRA paper by Tim Dettmers,...

Training & Optimization

OctoAI

OctoAI (originally OctoML) was an American artificial intelligence infrastructure company that operated a generative-AI inference platform and, before its...

AI Companies, AI Infrastructure

Offline inference

Offline inference (also called batch inference, static inference, or bulk scoring) is the practice of running a trained machine learning model over a known set...

MLOps

Online inference

Online inference (also called dynamic inference, real-time inference, or on-demand prediction) is the practice of running a trained machine learning model...

MLOps

OpenVINO

OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit developed by Intel for optimizing and deploying deep learning...

Developer Tools, Open Source AI

Optimum-Quanto

Optimum Quanto, commonly referred to as Quanto, is a PyTorch-based quantization toolkit developed and maintained by Hugging Face that provides linear weight...

Developer Tools, Open Source AI

PagedAttention

PagedAttention is a KV-cache memory management algorithm for serving large language models that applies the virtual-memory paging technique used by operating...

AI Infrastructure, Model Architecture

Positron AI

Positron AI is an American semiconductor startup headquartered in Reno, Nevada, that designs and manufactures purpose-built hardware for transformer inference....

AI Companies, AI Hardware

Post-processing

In machine learning, post-processing is any operation applied to a model's raw outputs after the prediction step but before the results reach a downstream...

MLOps

Product quantization

Product quantization (PQ) is a vector-compression technique for approximate nearest-neighbor (ANN) search that splits each high-dimensional vector into M equal...

AI Infrastructure, Information Retrieval

Pruning

Pruning is a family of techniques used in machine learning and artificial intelligence to remove parts of a model or search space that are estimated to be...

Machine Learning, Training & Optimization

QLoRA

QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning method that fine-tunes a 65-billion-parameter large language model on a single 48 GB...

Deep Learning, Large Language Models

Qualcomm AI200

Qualcomm AI200 is a rack-scale data-center accelerator for artificial intelligence inference, announced by Qualcomm on 27 October 2025 and slated for...

AI Hardware, Data Centers

Qualcomm AI250

Qualcomm AI250 is a planned data-center artificial intelligence inference accelerator and rack-scale system announced by Qualcomm in late October 2025. It is...

AI Hardware, Data Centers

Quantization

Quantization in machine learning and artificial intelligence is the process of reducing the numerical precision of a neural network's parameters (weights,...

Deep Learning, Machine Learning

RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) is a post-training paradigm for large language models in which the reward signal comes from a...

Reasoning Models, Reinforcement Learning

RadixAttention

RadixAttention is a KV cache management technique introduced in SGLang that uses a radix tree data structure to automatically share and reuse cached key-value...

AI Infrastructure, Model Architecture

Rebellions REBEL-Quad

REBEL-Quad is a chiplet-based AI inference accelerator developed by Rebellions, a South Korean AI-chip company. It was presented at the Hot Chips 2025...

AI Hardware

Skeleton-of-Thought

Skeleton-of-Thought (SoT) is a prompting technique for large language models that reduces end-to-end generation latency by first eliciting a short outline of...

Prompt Engineering