Inference



In the field of artificial intelligence (AI), inference refers to the process of using a trained neural network model to make a prediction or draw a conclusion from new, previously unseen data.[1][2] It is the operational or "doing" phase of the AI lifecycle, where the model applies the knowledge and patterns it learned during AI training to produce real-world results.[3] If training is analogous to teaching an AI a new skill, inference is the AI actually using that skill to perform a task.[1] This process is fundamental to nearly all practical applications of AI, from identifying objects in photos and translating languages to powering generative AI systems.[4]

Inference is distinct from the training phase, which is a computationally intensive process focused on building an accurate model. While training represents a largely one-time computational investment, inference runs continuously, accounting for an estimated 80-90% of AI's operational costs and delivering most of its lifetime value.[5] Individual inference operations are typically much faster than training, but they must often be executed at massive scale and with very low latency to be useful in real-time applications.[3] The efficiency and performance of the inference process are therefore critical to the successful deployment of AI systems and have driven a vast field of optimization work in both software and hardware.

History

The concept of inference in AI traces its roots to early efforts in formal reasoning and symbolic manipulation. Philosophers like Aristotle developed syllogistic logic, while later thinkers such as Ramon Llull (1232–1315) and Gottfried Leibniz envisioned mechanical systems for logical deduction.[6] In the 20th century, breakthroughs in mathematical logic by figures like Alan Turing, Kurt Gödel, and Alonzo Church laid the groundwork for mechanized reasoning, culminating in the Church-Turing thesis, which suggested that any mathematical deduction could be performed by a machine.[6]

Modern AI inference began in the 1950s with symbolic AI. The Logic Theorist, developed by Allen Newell and Herbert A. Simon in 1955, was one of the first programs to perform automated theorem proving, using heuristic search to infer proofs from axioms.[6][7] Presented at the 1956 Dartmouth Workshop—the birthplace of AI—this program demonstrated inference through step-by-step deduction, proving 38 theorems from Russell and Whitehead's Principia Mathematica.[6]

In the 1960s, inference expanded with neural networks. Walter Pitts and Warren McCulloch's 1943 model of artificial neurons influenced early work, but Frank Rosenblatt's Perceptron (1958) introduced pattern-based inference for classification tasks.[6] Systems like ADALINE (1960) and MADALINE (1962) by Bernard Widrow advanced adaptive inference.[6] However, Marvin Minsky and Seymour Papert's 1969 book Perceptrons highlighted limitations, leading to a decline in neural network funding during the first AI winter.[6]

The 1970s saw the rise of expert systems, where an inference engine applied rules to a knowledge base for domain-specific reasoning.[8] DENDRAL (1965–1983), developed at Stanford, was the first expert system, using inference to analyze mass spectrometry data for organic chemistry.[7][9] MYCIN (1972), another Stanford project, inferred bacterial infection diagnoses using backward chaining.[10][11] Edward Feigenbaum championed expert systems, emphasizing knowledge engineering.[12]

The 1980s brought commercial expert systems, with inference engines like EMYCIN (from MYCIN) enabling reusable frameworks.[8] Systems such as XCON (for Digital Equipment Corporation) used forward chaining for configuration tasks.[13] However, maintenance challenges and the "knowledge acquisition bottleneck" led to their decline by the late 1980s.[14]

Neural networks revived in the 1980s with backpropagation, popularized by Geoffrey Hinton and David Rumelhart in 1986, enabling multi-layer networks for more complex inference.[6] Yann LeCun's convolutional neural networks (1990) applied inference to handwriting recognition.[6]

The 2010s marked the era of deep learning inference, fueled by big data and GPUs. AlexNet (2012) demonstrated superior image inference.[6] The Transformer architecture (2017) revolutionized natural language processing inference.[6] Models like GPT-3 (2020) and ChatGPT (2022) showcased generative inference, while OpenAI's o1 (2024) advanced reasoning inference.[6]

Definition and Role in the AI Lifecycle

Inference is the execution phase in the lifecycle of an AI model, where it moves from a state of learning to a state of practical application. It is the point at which a model, having been trained on a large dataset to recognize patterns and relationships, is deployed to draw conclusions from new information it has not previously encountered.[4][15] This capability to generalize from training data to new inputs is the core function of inference and is where AI delivers its primary business value.[3]

The entire process of bringing an AI model into production involves several key stages, with inference being the final operational step. A typical workflow includes preparing data, selecting and training a model, monitoring its outputs for accuracy and bias, and finally deploying it for inference.[4] This deployment is often managed through a process called AI serving, which involves packaging the model and exposing it via an API to handle live requests.[3]

Distinction from Training, Fine-Tuning, and Serving

Understanding the role of inference requires distinguishing it from other critical stages in the AI model lifecycle: AI training, fine-tuning, and AI serving. Each stage has a different objective, process, data requirement, and business focus.[3]

  • AI Training is the foundational learning phase. It is a highly resource-intensive process where a model is built from scratch by iteratively analyzing a massive, historical dataset to learn patterns. The primary goal of training is to create a model that is accurate and capable. This phase can take anywhere from hours to weeks and requires powerful hardware accelerators like GPUs.[3]
  • AI Fine-Tuning is an optimization of the training process. Instead of building a model from scratch, it takes a powerful, pre-trained model and adapts it for a more specific task. This is achieved by continuing the training process on a smaller, specialized dataset. Fine-tuning saves significant time, computational resources, and cost compared to full training.[3]
  • AI Inference is the execution phase. It uses the fully trained and fine-tuned model to make fast predictions on new, "unseen" data. Each individual prediction is far less computationally demanding than a training iteration, but delivering millions of predictions in real-time requires a highly optimized and scalable infrastructure. The business focus shifts from model accuracy to operational metrics like speed (latency), scale, and cost-efficiency.[3]
  • AI Serving is the operational infrastructure that makes inference possible at scale. It involves deploying and managing the model, typically by packaging it, setting up an API endpoint, and managing the underlying infrastructure to handle incoming requests reliably and efficiently.[3]

The fundamental dichotomy between the computational profiles of training and inference is a primary driver for nearly all specialized fields related to AI deployment. Training is a large-scale, offline, parallel process optimized for throughput over long periods, whereas inference is often an online, real-time process optimized for the lowest possible latency on a single input. This difference in objectives necessitates entirely different approaches to hardware and software. The need for low-latency inference, for example, has directly led to the development of specialized hardware and ASICs like TPUs, which are designed to accelerate the specific mathematical operations used in a forward pass. Similarly, the need to deploy models on resource-constrained edge devices has spurred the creation of model compression techniques like quantization and pruning, which are applied post-training to create a smaller, faster model suitable for an efficient inference environment. This distinction is not merely definitional; it is the root cause that gives rise to the entire ecosystem of technologies and practices surrounding AI deployment, including the field of MLOps.

Comparison of AI Lifecycle Stages[3]
Stage | Objective | Process | Data | Business Focus
Training | Build a new model from scratch. | Iteratively learns from a large dataset. | Large, historical, labeled datasets. | Model accuracy and capability.
Fine-Tuning | Adapt a pre-trained model for a specific task. | Refines an existing model with a smaller dataset. | Smaller, task-specific datasets. | Efficiency and customization.
Inference | Use a trained model to make predictions. | A single, fast "forward pass" of new data. | Live, real-world, unlabeled data. | Speed (latency), scale, and cost-efficiency.
Serving | Deploy and manage the model to handle inference requests. | Package the model and expose it as an API. | N/A | Reliability, scalability, and manageability.

The Mechanics of Inference

At a technical level, inference in deep learning models is executed through a process known as forward propagation or a forward pass.[16] This is the mechanism by which a neural network takes an input and processes it through its layers to produce an output.[17] During inference, the model's learned parameters—its weights and biases—are frozen. The forward pass is therefore a "read-only" operation where the model applies its fixed knowledge without any learning or parameter updates occurring.[3] This is in direct contrast to the training process, which involves both a forward pass to generate a prediction and a backward pass (backpropagation) to calculate the error and update the model's weights.[17]

The sequential and computationally deterministic nature of the forward pass is what makes inference a prime target for optimization. Because the sequence of mathematical operations is fixed once a model is trained, it becomes a predictable computational graph. This predictability allows for the creation of highly specialized compilers and runtimes, such as NVIDIA TensorRT, which can analyze this graph and apply optimizations like fusing multiple layers into a single operation, selecting the most efficient mathematical kernels for the target hardware, and converting model weights to lower-precision formats.[18] Furthermore, while the layers are processed sequentially, the computations within each layer, such as matrix multiplications, are massively parallel. This inherent parallelism is why hardware like GPUs, with their thousands of cores, are exceptionally well-suited for accelerating inference workloads.[19] The efficiency of modern AI inference is therefore a result of co-designing software and hardware to perfectly execute this fixed, parallelizable sequence of operations defined by the forward pass.
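
As a minimal illustration of this "read-only" forward pass, the sketch below (assuming PyTorch and a purely hypothetical two-layer network) disables gradient tracking and switches the model to evaluation mode, so the pass applies fixed weights without any parameter updates:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for an already-trained network.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 3),
)
model.eval()  # switch layers such as dropout/batch-norm to inference behavior

x = torch.randn(1, 4)            # one new, "unseen" input
with torch.no_grad():            # no gradients are tracked: weights stay frozen
    logits = model(x)            # the forward pass
    probs = torch.softmax(logits, dim=-1)
print(probs)                     # a probability distribution over 3 classes
```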

Step-by-Step Breakdown

The inference process can be broken down into three main steps:[3]

  1. Input Data Preparation: Before a model can process new data, that data must be converted into a format it understands. This preprocessing step ensures the input matches the format the model was trained on. For an image classification model, this might involve resizing an image to specific dimensions (for example 224x224 pixels) and normalizing its pixel values.[3] For a Large Language Model (LLM), this involves tokenization, where a text prompt is broken down into a sequence of numerical tokens that the model can interpret.[2]
  2. Model Execution (Forward Pass): The preprocessed input data is fed into the first layer of the neural network. The data then flows sequentially through each subsequent layer.
    • At each neuron in a layer, a linear operation is performed: a weighted sum of the inputs from the previous layer is calculated, and a bias term is added. This result is often called the "pre-activation" or "logit".[17]
    • This linear result is then passed through a non-linear activation function (such as ReLU, Sigmoid, or tanh). This non-linearity is crucial, as it allows the network to learn and represent complex, non-linear patterns in the data.[17][20]
    • The output of the activation function in one layer becomes the input for the next layer, and this process continues until the data reaches the final layer of the network.[21]
  3. Output Generation: The final layer of the network produces the model's output. The form of this output depends on the task. For a classification task, a Softmax activation function is often used in the final layer to convert the logits into a probability distribution across all possible classes.[22] For example, an image classifier might output a probability score, such as a 95% chance that an image contains a "dog".[3] For a generative model, the output might be the next token in a sentence or a newly generated image. This final result is then sent to the end-user application.[2]
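
To make the three steps concrete, here is a minimal NumPy sketch (the weights, shapes, and preprocessing choices are hypothetical, chosen only for illustration) that normalizes an input, applies two layers of weighted sums, biases, and ReLU activations, and converts the final logits into probabilities with a softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: input preparation -- scale a raw feature vector to the range the
# (hypothetical) model was trained on.
raw = np.array([12.0, 3.0, 7.5, 0.2])
x = (raw - raw.mean()) / (raw.std() + 1e-8)

# Frozen, already-trained parameters (random here, purely illustrative).
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

# Step 2: forward pass -- weighted sum plus bias, then a non-linear activation.
h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
logits = h @ W2 + b2                  # pre-activations of the output layer

# Step 3: output generation -- softmax turns logits into class probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(probs)
```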

There are three primary modes for serving inference requests:

  • Real-time (Online) Inference: Processes individual requests as they arrive, providing an immediate response, often within milliseconds. This is essential for interactive applications like chatbots, recommendation engines, and fraud detection systems.[3]
  • Batch (Offline) Inference: Processes a large volume of data all at once when immediate responses are not required. This method is more cost-effective and is used for tasks like periodic data analysis, report generation, or pre-calculating recommendations.[3]
  • Streaming Inference: Processes continuous streams of data in real-time, such as from sensors, IoT devices, or live video feeds. This mode is used for ongoing anomaly detection or live analytics.[23]

Theoretical Foundations: Paradigms of Reasoning in AI

At a fundamental level, AI inference is the process of deriving new conclusions from existing information, a task that emulates core aspects of human reasoning.[16] This process can be understood through several paradigms of reasoning, which form the theoretical basis for how AI systems operate. These paradigms can be broadly categorized into logical reasoning, which follows formal rules, and statistical reasoning, which deals with uncertainty and probability.[16]

Modern deep learning represents a significant shift toward inductive reasoning during the training phase, where models generalize patterns from vast amounts of specific data. However, the application of these trained models during inference functionally blends paradigms that were central to classical, symbolic AI. When a trained model receives a new input, it applies its complex set of learned weights—which act as a vast system of general rules—to produce a specific output. This mirrors the general-to-specific flow of deductive reasoning. Furthermore, when a generative model like an LLM produces a sequence of text, it is not deriving a logically certain outcome but is instead predicting the most plausible continuation. This task of finding the "best explanation" for the preceding context is the essence of abductive reasoning. This reveals that modern connectionist AI has not abandoned the principles of logical reasoning but has instead developed a high-dimensional, statistical analogue to them. The recent emergence of fields like Causal AI represents a more explicit step in this direction, aiming to move beyond the correlational patterns of induction to understand the underlying cause-and-effect relationships that govern data.[24]

Primary Forms of Reasoning

Three primary forms of reasoning are central to both human cognition and artificial intelligence: deductive, inductive, and abductive reasoning.[25]

  • Deductive reasoning: This is a top-down approach that starts with general premises or widely accepted facts and moves to a specific, logically certain conclusion. If the initial premises are true, the conclusion is guaranteed to be true.[26] The classic example is a syllogism: "All men are mortal. Socrates is a man. Therefore, Socrates is mortal." In AI, deductive reasoning is the foundation of early expert systems and rule-based systems, where a set of predefined rules is applied to specific data to reach a conclusion.[27]
  • Inductive reasoning: This is a bottom-up approach that involves forming a generalized conclusion from specific observations or instances.[26] It is the cornerstone of modern machine learning.[2] An AI model is trained on a large dataset of specific examples (for example thousands of images labeled "cat") and learns to induce a general pattern or set of rules for what constitutes a cat. When presented with a new image, it uses this generalized knowledge to infer whether the new image is also a cat. Unlike deduction, the conclusions of inductive reasoning are probabilistic, not guaranteed to be true.[27]
  • Abductive reasoning: This form of reasoning seeks to find the most plausible explanation for an incomplete set of observations. It is often described as "inference to the best explanation."[26] For example, if you find a half-eaten sandwich on the counter, you might abduce that your son was late for work and left in a hurry.[26] In AI, abductive reasoning is crucial for tasks like medical diagnosis, where a system must infer the most likely disease given a set of symptoms, or in reinforcement learning, where an agent must choose the best action based on incomplete information about its environment.[25]

Other Important Reasoning Paradigms

Beyond the primary three, several other reasoning frameworks are important in AI:

  • Probabilistic Reasoning: This paradigm explicitly handles uncertainty by using the principles of probability theory. The most prominent example is Bayesian inference, which uses Bayes' theorem to update the probability of a hypothesis as more evidence becomes available.[16] Instead of providing a definitive answer, it provides a degree of belief, which is essential for applications in dynamic environments like risk assessment or recommendation systems.[27]
  • Analogical Reasoning: This involves solving a new problem by identifying and applying the solution from a similar, previously solved problem. An AI system might use this to adapt a route-planning algorithm designed for autonomous cars to navigate delivery drones.[27]
  • Causal Inference: A more advanced form of reasoning that aims to understand cause-and-effect relationships, distinguishing them from mere correlations. While traditional machine learning models are excellent at finding correlations (for example ice cream sales and drowning incidents are correlated), they do not understand that both are caused by a third factor (hot weather). Causal AI attempts to model these underlying causal structures to create more robust, fair, and explainable models.[24]
  • Monotonic vs. Non-monotonic Reasoning: This distinction relates to how an AI system handles new information. In monotonic reasoning, once a conclusion is made, it is never retracted, even if new information is introduced. In non-monotonic reasoning, conclusions are provisional and can be revised in light of new evidence that contradicts them. Non-monotonic reasoning is essential for AI systems operating in the real world, where information is often incomplete and subject to change.[27]
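
As a concrete illustration of the probabilistic-reasoning paradigm above, the short sketch below applies Bayes' theorem to a made-up diagnostic example; the prior, sensitivity, and false-positive rate are hypothetical numbers chosen only to show how evidence updates a degree of belief:

```python
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
prior = 0.01           # P(disease): assumed base rate
sensitivity = 0.95     # P(positive test | disease)
false_positive = 0.05  # P(positive test | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)  # P(positive test)
posterior = sensitivity * prior / evidence                     # P(disease | positive test)
print(f"P(disease | positive test) = {posterior:.3f}")         # ~0.161
```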

Inference Across Key AI Architectures

While the core concept of inference as a forward pass is universal in deep learning, its specific implementation and computational characteristics vary significantly across different model architectures. The unique structure of each architecture creates distinct performance bottlenecks, which in turn drives the development of specialized optimization techniques.

For instance, the inference process in CNNs is dominated by a series of highly parallelizable convolution and matrix multiplication operations, making it a compute-bound task. Optimization efforts for CNNs therefore focus on accelerating these specific computations through specialized hardware like GPUs and TPUs, which feature thousands of cores or systolic arrays, and through the use of efficient mathematical kernels.[20][28]

In contrast, the generative inference process in LLMs is autoregressive and sequential. While the initial processing of the prompt (the "prefill" stage) is compute-bound, the generation of each subsequent token (the "decode" stage) is severely memory-bandwidth-bound. This is because the entire model's large weight matrices must be loaded from memory to perform a relatively small amount of computation for each new token.[29] This specific bottleneck has led to the development of Transformer-specific optimizations like the KV cache, PagedAttention, and Multi-Query Attention, all of which are designed to reduce the memory footprint and bandwidth requirements of the attention mechanism.[30][31]

Probabilistic models like Bayesian networks present yet another type of challenge. Here, the computational complexity is not determined by floating-point operations but by the combinatorial problem of summing over variables in a graph. The performance bottleneck is directly related to the graph's treewidth.[32] Consequently, optimization strategies focus on either transforming the graph's structure to make it more tractable (as in the Junction Tree algorithm) or abandoning exact computation entirely in favor of approximation algorithms.[32][33] This demonstrates that "inference optimization" is not a monolithic field; the optimal strategy is fundamentally dictated by the model's underlying mathematical structure and the specific computational bottleneck it creates.

Convolutional Neural Networks (CNNs)

Inference in a CNN involves passing an input, typically an image, through a sequence of specialized layers designed to extract features of increasing complexity.[16][34] This forward pass transforms the raw pixel data into a final classification. The key layers involved are:

  • Convolutional Layer: This is the core building block of a CNN. It applies a set of learnable filters (or kernels) across the input image. Each filter is a small matrix of weights that is specialized to detect a specific feature, such as an edge, a corner, or a color patch. The filter slides (convolves) over the input, computing the dot product at each location to produce a 2D feature map. This map indicates where in the input the specific feature was detected.[20][35]
  • Activation Function: After each convolution operation, the feature map is passed through a non-linear activation function, most commonly the Rectified Linear Unit (ReLU). This introduces non-linearity, allowing the network to learn more complex patterns than simple linear combinations of features.[20][36]
  • Pooling Layer: The pooling layer is used to reduce the spatial dimensions (width and height) of the feature maps, a process known as downsampling or subsampling.[20] The most common form is max pooling, which takes a small window of the feature map and outputs only the maximum value. This makes the network more computationally efficient and provides a degree of translation invariance, meaning the network can recognize a feature regardless of its exact position in the image.[20]
  • Fully Connected Layer: After several convolutional and pooling layers have extracted high-level features, the final feature maps are flattened into a one-dimensional vector. This vector is then fed into one or more fully connected layers, which are identical to the layers in a standard neural network. These layers learn to combine the high-level features to make a final classification.[37][36]
  • Output Layer: The final fully connected layer outputs the raw scores (logits) for each class. These are typically passed through a Softmax activation function, which converts the scores into a probability distribution, indicating the model's confidence for each class.[36]
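
The layer sequence above can be sketched as a small PyTorch module; the architecture and sizes below are hypothetical, chosen only to show how convolution, activation, pooling, flattening, and a fully connected classifier compose during a single forward pass:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                    # non-linear activation
            nn.MaxPool2d(2),                              # pooling (downsampling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)        # flatten feature maps into a vector
        return self.classifier(x)      # raw class scores (logits)

model = TinyCNN().eval()
with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)           # preprocessed 224x224 RGB image
    probs = torch.softmax(model(image), dim=-1)   # probability over the classes
```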

Transformer Models

Inference with Transformer models, particularly for generative tasks like machine translation or text generation, is an autoregressive process. This means the model generates its output one token (word or sub-word) at a time, with each new token being conditioned on the previously generated ones.[38]

The process differs significantly from training, where the model has access to the entire output sequence at once. During inference, the decoder must build the sequence step-by-step:[38]

  1. Encoder Processing: For models with an encoder-decoder architecture (like those used for translation), the encoder first processes the entire input sequence (for example an English sentence). Using its self-attention mechanism, it builds a set of rich, contextual representations (context vectors) of the input. This step is identical in both training and inference.[38]
  2. Decoder Initialization: The decoder begins with a special "start-of-sequence" (SOS) token as its initial input.[38]
  3. Decoder Loop (Token-by-Token Generation):
    • Masked Self-Attention: The decoder applies a masked self-attention mechanism to its current input sequence (which at the first step is just the SOS token). The mask ensures that when predicting a token, the model can only attend to the tokens that came before it, preventing it from "seeing into the future."[38]
    • Cross-Attention: The decoder then uses a cross-attention mechanism. Here, the queries come from the decoder's state, while the keys and values come from the encoder's context vectors. This allows the decoder to focus on the most relevant parts of the original input sentence when generating the next output token.[38]
    • Feed-Forward and Prediction: The output of the cross-attention layer is passed through a feed-forward network. The final output vector is then fed into a linear layer followed by a Softmax function, which produces a probability distribution over the entire vocabulary.[38]
    • Sampling: A token is selected from this distribution. While simply picking the token with the highest probability (greedy sampling) is an option, more sophisticated sampling methods are often used to generate more diverse and natural-sounding text.[39]
    • Append and Repeat: The newly generated token is appended to the decoder's input sequence, and the entire loop is repeated to generate the next token. This continues until the model generates a special "end-of-sequence" (EOS) token.[38]

A crucial optimization for this process is the KV cache. During each decoding step, the key (K) and value (V) vectors computed in the attention layers for all previous tokens are stored in memory. In the next step, the model only needs to compute the K and V vectors for the newest token and can reuse the cached values for all prior tokens. This prevents a massive amount of redundant computation and is essential for achieving practical inference speeds.[39][30]
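
A minimal sketch of this decode loop is shown below. `forward_step` is a hypothetical stand-in for the model's forward pass: it receives only the newest token plus the cached key/value state and returns next-token logits along with the updated cache, mirroring the KV-cache optimization described above. Greedy sampling is used for simplicity.

```python
import numpy as np

VOCAB_SIZE, SOS, EOS, MAX_TOKENS = 1000, 1, 2, 20

def forward_step(token_id, kv_cache):
    """Hypothetical single-step decoder forward pass.

    A real implementation would embed the token, run masked self-attention
    against the cached keys/values, apply cross-attention and feed-forward
    layers, and return vocabulary logits plus the extended cache.
    """
    new_cache = kv_cache + [token_id]   # placeholder for the cached K/V tensors
    logits = np.random.default_rng(len(new_cache)).normal(size=VOCAB_SIZE)
    return logits, new_cache

tokens, kv_cache = [SOS], []
for _ in range(MAX_TOKENS):
    logits, kv_cache = forward_step(tokens[-1], kv_cache)  # only the newest token is processed
    next_token = int(np.argmax(logits))                    # greedy sampling; other strategies exist
    tokens.append(next_token)
    if next_token == EOS:
        break
print(tokens)
```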

Probabilistic Graphical Models (Bayesian Networks)

Inference in a Bayesian Network is fundamentally a task of probabilistic querying. It involves calculating the posterior probability distribution for a set of "query" variables, given that another set of "evidence" variables has been observed.[32][40] This allows the network to logically update its beliefs in response to new information. Because exact inference in general graphs is NP-hard, a variety of algorithms have been developed, which fall into two main categories: exact and approximate inference.[41]

Exact Inference Algorithms

These algorithms compute the precise posterior probabilities.

  • Variable Elimination: This is an intuitive algorithm that answers a specific query, such as $P(X|E=e)$, by systematically eliminating all other "hidden" variables from the joint probability distribution one by one.[42] Instead of calculating the full joint distribution (which is computationally prohibitive), it leverages the network's structure to "push" summation operations inward, performing them on smaller products of factors (the conditional probability tables).[43] While efficient for single queries, the entire process must be re-run for each new query.[44]
  • Junction Tree Algorithm (or Clique Tree Propagation): This is a more general and often more efficient method for exact inference, especially when multiple queries are needed. The algorithm compiles the original graph into a data structure called a junction tree, where the nodes are "cliques" (subsets of fully connected variables) from a triangulated version of the original graph.[33][44] Once this tree is constructed, a two-phase message-passing protocol (known as belief propagation) is executed. In the first phase, messages are passed from the leaves of the tree to an arbitrary root, and in the second phase, they are passed back out from the root to the leaves. After this process, the marginal probability for every variable in the network is available, allowing many different queries to be answered without re-computation.[40]
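
For intuition, the sketch below answers a single query on a toy three-variable network (Rain → Sprinkler, with Rain and Sprinkler both parents of WetGrass; all probability values are made up) by summing out the hidden variable, in the spirit of variable elimination:

```python
from itertools import product

# Hypothetical conditional probability tables for the toy network.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},           # P(Sprinkler | Rain)
               False: {True: 0.40, False: 0.60}}
P_wet_true = {(True, True): 0.99, (True, False): 0.90,    # P(WetGrass=True | Sprinkler, Rain)
              (False, True): 0.80, (False, False): 0.00}

def joint_wet(rain, sprinkler):
    """P(Rain=rain, Sprinkler=sprinkler, WetGrass=True)."""
    return P_rain[rain] * P_sprinkler[rain][sprinkler] * P_wet_true[(sprinkler, rain)]

# Query P(Rain=True | WetGrass=True): sum out the hidden variable Sprinkler,
# then normalize by the probability of the evidence.
numerator = sum(joint_wet(True, s) for s in (True, False))
evidence = sum(joint_wet(r, s) for r, s in product((True, False), repeat=2))
print(numerator / evidence)   # ~0.358
```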

Approximate Inference Algorithms

When the treewidth of a network is too large, exact inference becomes computationally intractable. In such cases, approximate methods are used.[32]

  • Stochastic Sampling Methods: These methods, such as Markov chain Monte Carlo (MCMC), generate a large number of random samples from the probability distribution defined by the network. The desired probabilities are then estimated based on the frequencies of events in these samples.
  • Variational Inference: This method reframes the inference problem as an optimization problem. It seeks to find a simpler, tractable probability distribution that is as close as possible (in terms of KL divergence) to the true, complex posterior distribution.
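
A correspondingly minimal sketch of the sampling approach estimates the same kind of posterior by drawing random "worlds" from the network and keeping only those consistent with the evidence. This is plain rejection sampling, far less efficient than MCMC or variational methods on realistic networks, and the probability values are again hypothetical:

```python
import random

random.seed(0)

def sample_world():
    """Draw one joint sample from the toy Rain/Sprinkler/WetGrass network."""
    rain = random.random() < 0.2
    sprinkler = random.random() < (0.01 if rain else 0.40)
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.80, (False, False): 0.00}[(sprinkler, rain)]
    wet = random.random() < p_wet
    return rain, wet

accepted, rain_and_wet = 0, 0
for _ in range(200_000):
    rain, wet = sample_world()
    if wet:                      # keep only samples consistent with the evidence
        accepted += 1
        rain_and_wet += rain
print(rain_and_wet / accepted)   # approaches the exact ~0.358 as the sample count grows
```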

Evaluating Inference: Performance and Quality Metrics

Evaluating an AI model's inference capabilities is a multifaceted process that requires assessing two distinct but interconnected dimensions: performance and quality.[45] Performance metrics quantify the operational efficiency of the inference process—how fast it runs and how many resources it consumes. Quality metrics, on the other hand, measure the accuracy and utility of the model's outputs.[46] These two dimensions are often in a trade-off; techniques that improve performance, such as aggressive model quantization, can sometimes lead to a degradation in quality.[47]

The evolution of these evaluation metrics reflects the maturation of AI from a purely academic field to a robust engineering discipline. Early machine learning research focused primarily on quality metrics like classification accuracy.[48] However, as models began to be deployed in real-world applications, operational performance became a critical concern, leading to the standardization of metrics for latency and throughput.[45][29] With the recent explosion in the scale of AI services, the economic and environmental costs of inference have become significant. This has driven the emergence of efficiency metrics like cost per million tokens, Model Bandwidth Utilization (MBU), and tokens per watt as first-class concerns for system designers.[49]

Performance Metrics: Speed, Scale, and Efficiency

Performance metrics are crucial for assessing the viability of deploying an AI model, especially for interactive and large-scale applications.

Latency

Latency measures the time delay for a single inference request and is a critical factor for user-facing applications where responsiveness is key.[50] For LLMs, latency is typically broken down into several components:

  • Time to First Token (TTFT): The time elapsed from when a user sends a prompt to when the first token of the response is generated. A low TTFT is crucial for making an application feel responsive. For example, some benchmarks show NVIDIA H100 GPUs achieving a TTFT of 46ms for an MPT-7B model.[51][52]
  • End-to-End Latency (E2EL): The total time from the start of the request to the receipt of the final token. This metric represents the total time a user waits for the complete response.[51]
  • Time Per Output Token (TPOT): The average time it takes to generate each token after the first one. This metric determines the "streaming" speed of the response. A lower TPOT results in a smoother, faster-feeling generation process.[51][29]
  • Inter-Token Latency (ITL): The precise time gap between each consecutive pair of tokens. While the average ITL across a single request is equivalent to TPOT, the calculation can differ when averaged across multiple requests.[51][45]
Key Inference Performance Metrics
Metric | Definition | Primary Use Case
Time to First Token (TTFT) | Time to process the prompt and generate the first output token.[51] | Measures the perceived responsiveness of interactive applications (for example chatbots).
End-to-End Latency (E2EL) | Total time from sending the request to receiving the final token.[51] | Measures the total user waiting time for a complete response.
Time Per Output Token (TPOT) | Average time to generate each token after the first.[51] | Measures the streaming speed of the generated output.
Inter-Token Latency (ITL) | The time gap between consecutive tokens.[51] | A more granular measure of generation speed; average ITL is often equivalent to TPOT.
Throughput (TPS/RPS) | Total number of output tokens (TPS) or requests (RPS) processed by the system per second.[45] | Measures the overall capacity and cost-efficiency of the inference server.
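
Given per-token arrival timestamps for one streamed response, these latency metrics can be computed directly, as in the sketch below (the timestamps are hypothetical, measured in seconds from the moment the request was sent):

```python
# Hypothetical arrival times (seconds after the request was sent) of each
# output token in a single streamed response.
token_times = [0.25, 0.29, 0.33, 0.38, 0.42, 0.47]

ttft = token_times[0]                               # Time to First Token
e2el = token_times[-1]                              # End-to-End Latency
tpot = (e2el - ttft) / (len(token_times) - 1)       # Time Per Output Token
itl = [b - a for a, b in zip(token_times, token_times[1:])]  # Inter-Token Latencies

print(f"TTFT={ttft:.3f}s  E2EL={e2el:.3f}s  TPOT={tpot:.3f}s  "
      f"mean ITL={sum(itl) / len(itl):.3f}s")       # mean ITL equals TPOT for one request
```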

Throughput

Throughput measures the total processing capacity of an inference system over a period of time.[53] It is typically measured in tokens per second (TPS) or requests per second (RPS).

  • System Throughput: The total number of tokens per second generated across all concurrent users. This metric reflects the raw processing power of the deployed infrastructure.[45]
  • User Throughput: The effective tokens per second experienced by a single user. As system load increases, user throughput typically decreases because resources are shared.[45]
  • Model Bandwidth Utilization (MBU): Measures what fraction of peak memory bandwidth the workload achieves. For LLMs, this is often the primary bottleneck, with MBU at batch size 1 typically achieving only 50-60% of theoretical bandwidth.[29]

There is a fundamental latency-throughput trade-off. To maximize throughput, systems often batch multiple requests together to better utilize the parallel processing capabilities of GPUs. However, this batching increases the latency for each individual request, as it may have to wait for other requests to arrive and be processed.[29][54] The optimal balance depends on the application: interactive services prioritize low latency, while offline analytics prioritize high throughput.[51]
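
Because the decode stage is memory-bandwidth-bound, a back-of-the-envelope estimate of Model Bandwidth Utilization simply compares the bytes of weights that must be streamed per generated token against the hardware's peak bandwidth. All numbers in the sketch below are hypothetical round figures, not measured results:

```python
# Rough MBU estimate for single-batch decoding (illustrative numbers only).
params = 7e9                  # model size: 7B parameters
bytes_per_param = 2           # FP16 weights
tokens_per_second = 55        # observed decode speed at batch size 1
peak_bandwidth = 2.0e12       # accelerator peak memory bandwidth: 2 TB/s

achieved_bandwidth = params * bytes_per_param * tokens_per_second  # bytes/s actually streamed
mbu = achieved_bandwidth / peak_bandwidth
print(f"MBU ≈ {mbu:.0%}")     # about 38% in this made-up example
```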

Power and Economic Metrics

With the increasing scale of AI, efficiency has become a critical concern.

  • Tokens per Watt: Measures the energy efficiency of the hardware and software stack.[49]
  • Cost per Million Tokens: A standard economic metric used by cloud providers to price inference services, reflecting the financial cost of running the model.[49]

Quality and Accuracy Metrics

Quality metrics assess whether the model's outputs are correct, relevant, and useful. The choice of metric depends heavily on the type of task.

For Classification Tasks

These metrics are used for tasks where the model must assign an input to one of a predefined set of categories.[48]

  • Confusion matrix: A table that summarizes classification performance by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).[46][48]
  • Accuracy: The proportion of correct predictions out of the total number of predictions. While intuitive, it can be misleading on datasets with imbalanced classes.[48]
  • Precision: Measures the accuracy of the positive predictions ($TP / (TP + FP)$). A high precision means that when the model predicts a positive class, it is very likely to be correct.[46][48]
  • Recall (or Sensitivity): Measures the proportion of actual positives that were correctly identified ($TP / (TP + FN)$). A high recall means the model is good at finding all the positive instances.[46][48]
  • F-score: The harmonic mean of precision and recall, providing a single score that balances both metrics.[48]
  • AUC-ROC and AUC-PR: The Area Under the Receiver Operating Characteristic curve and the Area Under the Precision-Recall curve, respectively. These metrics evaluate the model's ability to distinguish between classes across all possible classification thresholds.[46]
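
A short sketch computing these metrics from raw counts (the confusion-matrix values are hypothetical and deliberately imbalanced):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 80, 10, 20, 890

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# With heavily imbalanced classes, accuracy (0.97) looks far better than F1 (~0.84),
# which is why precision- and recall-based metrics are preferred in such cases.
```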

For Generative Models (LLMs)

Evaluating generative models is more complex, as there is often no single "correct" answer.

  • Computation-based Metrics: These metrics algorithmically compare the generated text to one or more reference texts. Examples include BLEU, commonly used for machine translation, and ROUGE, used for text summarization.[55]
  • Rubric-based Metrics (LLM-as-a-judge): A modern approach where a powerful, separate LLM is used to evaluate the output of the model being tested. The judge LLM scores the output based on a predefined rubric that might include criteria like fluency, coherence, factuality (grounding), and safety.[55][56]
  • Task-Specific Benchmarks: As AI becomes more integrated into professional workflows, new benchmarks are being developed to measure performance on realistic, economically valuable tasks. An example is GDPval, which evaluates models on tasks drawn from real-world knowledge work, using reference files and expecting complex deliverables.[57]

Industry Benchmarks

To standardize the evaluation of AI systems, several industry-wide benchmarks have been established.

  • MLPerf: Maintained by the non-profit consortium MLCommons, MLPerf is the most widely recognized industry benchmark for AI performance. The MLPerf Inference suite measures the latency and throughput of systems across a range of representative AI tasks, including image classification, object detection, and language processing.[58][59] The suite is regularly updated to include state-of-the-art models (such as Llama 3.1 and DeepSeek-R1 in v5.1).[58] It also includes an optional power measurement component to evaluate energy efficiency.[60]
  • Other Benchmarks: Specialized benchmarks are also emerging. The AI Energy Score initiative aims to standardize the evaluation of energy efficiency for AI model inference.[60] LLM-Inference-Bench is a suite designed to provide detailed hardware performance evaluations specifically for LLMs across various AI accelerators.[61]

Optimizing Inference for Efficiency

The substantial computational and memory requirements of modern AI models, particularly LLMs, present significant challenges for deployment. Inference optimization encompasses a wide range of techniques designed to make models smaller, faster, and more cost-effective to run, without a significant loss in accuracy.[62]

Model Compression Techniques

These methods aim to reduce the size of the model, which in turn lowers memory requirements and can accelerate computation.[47]

  • Quantization: This is one of the most effective optimization techniques. It involves reducing the numerical precision of the model's parameters (weights) and intermediate calculations (activations). Models are typically trained using 32-bit floating-point numbers (FP32). Quantization converts these to lower-precision formats, such as 16-bit floating-point (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4).[63][64] This reduces the model's memory footprint by up to 8x (for INT4) and can significantly speed up inference on hardware that has specialized support for low-precision arithmetic.[63]
    • Post-Training Quantization (PTQ): This method is applied to an already trained model. It is relatively easy to implement but can sometimes lead to a noticeable drop in model accuracy.[65]
    • Quantization-Aware Training (QAT): This method simulates the effects of quantization during the training process itself. The model learns to be robust to the lower precision, which typically results in higher accuracy after quantization compared to PTQ.[63]
  • Pruning: This technique involves identifying and removing redundant or unimportant parameters from a neural network.[66]
    • Unstructured Pruning: Individual weights with low magnitude are set to zero, creating a sparse weight matrix. This can significantly reduce the model's storage size, but it often does not lead to faster inference unless run on specialized hardware that can efficiently process sparse matrices.[67]
    • Structured Pruning: Entire groups of parameters, such as neurons, convolutional filters, or attention heads, are removed. This results in a smaller, dense model that can be executed faster on standard hardware like GPUs without any special handling.[68]
  • Knowledge distillation: This method involves training a smaller "student" model to mimic the behavior of a larger, more powerful "teacher" model.[69] Instead of training the student model only on the ground-truth labels, it is also trained to match the output probability distribution (the "soft targets") of the teacher model. This allows the student to learn the more nuanced relationships between classes that the teacher has captured, often achieving performance close to the teacher with a fraction of the parameters.[69][70] A well-known example is DistilBERT, which is 40% smaller than the original BERT model but retains 97% of its performance.[30][70]
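
As a minimal illustration of the post-training quantization technique described above, the sketch below maps an FP32 weight matrix to INT8 with a single symmetric scale factor and then dequantizes it. Real PTQ pipelines additionally use calibration data, per-channel scales, and activation quantization; all values here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)

# Symmetric post-training quantization to INT8: one scale for the whole tensor.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# At inference time the kernel operates on the INT8 values (4x smaller than FP32)
# and folds the scale back in when producing outputs.
dequantized = weights_int8.astype(np.float32) * scale
error = np.abs(weights_fp32 - dequantized).max()
print(f"memory: {weights_fp32.nbytes} -> {weights_int8.nbytes} bytes, max error {error:.5f}")
```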

Runtime and System-Level Optimizations

These techniques focus on improving the efficiency of the inference execution without modifying the model's parameters.

  • Batching: Grouping multiple inference requests together and processing them in a single pass through the model. This dramatically improves GPU utilization and overall throughput by amortizing the overhead of kernel launches and memory transfers.[30][71]
  • Continuous Batching: A state-of-the-art technique for LLM inference where the server dynamically batches requests at the iteration level. Instead of waiting for all sequences in a batch to finish, it can immediately start processing new requests as soon as individual sequences are complete. This can lead to throughput improvements of over 20x compared to simpler batching methods.[29][72]
  • KV Cache Optimization: For Transformer models, the KV cache can consume a large amount of GPU memory. PagedAttention, a technique inspired by virtual memory in operating systems, allows the KV cache to be stored in non-contiguous memory blocks. This significantly reduces memory fragmentation and waste, enabling larger batch sizes and support for longer contexts.[30][31]
  • Attention Variants: The standard self-attention mechanism is computationally expensive.
    • FlashAttention is a hardware-aware attention algorithm that reorders the computation to minimize slow read/write operations to the GPU's high-bandwidth memory (HBM), resulting in significant speedups.[31]
    • Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce the size of the KV cache by having multiple query heads share the same key and value heads. This is particularly effective for improving inference speed for long sequences.[31]
  • Speculative Decoding: This technique uses a smaller, faster "draft" model to generate a candidate sequence of several tokens. The larger, more accurate "verifier" model then processes this entire candidate sequence in a single forward pass to check its correctness. If the draft is accepted, the model can generate multiple tokens in the time it would normally take to generate one, leading to 2-3x latency reductions in production.[30][31][73]
  • Model Parallelism: When a model is too large to fit on a single accelerator, it must be distributed across multiple devices. Different strategies exist to split the model, including tensor parallelism (splitting individual weight matrices), pipeline parallelism (assigning different layers to different devices), and sequence parallelism (splitting the input sequence across devices).[30][31]
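
The sketch below illustrates the accept/reject flow of speculative decoding in its simplest greedy form: a hypothetical `draft_next` function proposes several tokens cheaply, a hypothetical `verify_greedy` call stands in for the large model's forward pass over the drafted positions, and the longest prefix on which the two agree is accepted. Production systems use probabilistic acceptance rules rather than exact greedy matching.

```python
import random

random.seed(0)

def verify_greedy(context):
    """Hypothetical large model's greedy next-token choice (deterministic toy rule)."""
    return (sum(context) * 31 + 7) % 50

def draft_next(context):
    """Hypothetical cheap draft model: usually agrees with the verifier, sometimes not."""
    guess = verify_greedy(context)
    return guess if random.random() < 0.8 else (guess + 1) % 50

def speculative_step(context, k=4):
    # 1) The draft model proposes k tokens autoregressively (cheap).
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    # 2) The verifier checks all k positions in one (conceptual) forward pass,
    #    accepting the longest prefix that matches its own greedy choices.
    accepted = []
    for token in draft:
        target = verify_greedy(context + accepted)
        accepted.append(target)          # the verifier's token is always kept
        if token != target:              # the first mismatch ends the accepted run
            break
    return context + accepted

sequence = [1]
for _ in range(5):
    sequence = speculative_step(sequence)
print(sequence)   # several tokens are produced per verifier pass when drafts are accepted
```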

Inference Serving Frameworks

Specialized software is required to efficiently deploy and serve models in production. These frameworks bridge the gap between a trained model file and a scalable, low-latency web service.

  • NVIDIA TensorRT: An SDK for high-performance deep learning inference. It optimizes models by performing graph optimizations, layer fusion, and precision calibration (for example to INT8 or FP8), tuning the model for the specific target NVIDIA GPU.[74]
  • vLLM: An open-source LLM inference and serving engine that revolutionized serving speeds by implementing PagedAttention and continuous batching. It can achieve up to 24x higher throughput compared to standard HuggingFace Transformers.[75]
  • NVIDIA Triton Inference Server: A comprehensive serving solution that can deploy models from multiple frameworks (including TensorRT, TensorFlow, PyTorch, and ONNX). It supports dynamic batching, multi-model serving, and scaling for production environments.[76]
  • ONNX Runtime: A cross-platform inference and training accelerator that supports models from many frameworks. It is designed to enable high-performance inference on diverse hardware, from cloud CPUs and GPUs to edge devices.[77]
  • TensorFlow Lite: A set of tools for on-device inference on mobile, embedded, and IoT devices. It focuses on small binary size and low-latency inference, using techniques like quantization and delegation to device-specific accelerators.[78]
Inference Framework Comparison
Framework | Primary Use Case | Key Features | Hardware Support
TensorRT | NVIDIA GPU optimization | Graph fusion, quantization | NVIDIA GPUs only
vLLM | LLM serving | PagedAttention, continuous batching | NVIDIA/AMD GPUs
ONNX Runtime | Cross-platform | Multi-backend support | CPU, GPU, NPU
TensorFlow Lite | Mobile/edge | Aggressive compression | Mobile GPUs/NPUs
Triton Server | Production serving | Multi-model, auto-scaling | Various
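
For a sense of how such a runtime is used in practice, the sketch below runs a model with ONNX Runtime's Python API. The model file name and its input shape are hypothetical; the calls shown (InferenceSession, get_inputs, run) are the library's standard entry points:

```python
import numpy as np
import onnxruntime as ort

# Load an exported model; the file name here is hypothetical.
session = ort.InferenceSession("image_classifier.onnx",
                               providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name                    # name of the model's input tensor
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)    # preprocessed image batch

outputs = session.run(None, {input_name: batch})             # one forward pass
logits = outputs[0]
print(logits.shape)
```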

The Hardware Ecosystem for Inference

The performance, cost, and power consumption of AI inference are fundamentally determined by the underlying hardware. The choice of hardware is a critical decision in designing an AI system, with a wide spectrum of options ranging from general-purpose processors to highly specialized custom chips.[19]

Deployment Environments: Cloud vs. Edge

AI inference can be deployed in two primary environments, each with distinct cost structures and performance characteristics.

  • Cloud Inference: This involves running inference on powerful servers located in centralized data centers operated by cloud service providers like AWS, Google Cloud, and Microsoft Azure. It leverages virtually unlimited computational resources and is ideal for training large models and serving complex workloads that are not critically sensitive to latency.[79] The cost model is typically based on operational expenditure (OpEx), with a pay-as-you-go structure.[80][81]
  • Edge Inference: This involves performing inference directly on a local device, such as a smartphone, an industrial sensor, a smart camera, or a vehicle. This approach is essential for applications that require very low latency, the ability to function without a network connection, and enhanced data privacy, as sensitive data is processed locally.[82] The cost model is dominated by capital expenditure (CapEx), as it requires an upfront investment in hardware.[81]

Comparative Analysis of Hardware Accelerators

Different types of processors are optimized for different computational workloads.[19]

Comparative Analysis of Hardware for AI Inference
Hardware | Key Architectural Feature | Flexibility/Programmability | Performance Profile | Power Efficiency | Ideal Use Cases
CPU | Few powerful cores optimized for sequential, serial processing.[19] | Very High (General-purpose) | Low for parallel AI workloads. Suitable for small models or tasks with high latency tolerance.[83] | High (relative to GPU for low-intensity tasks).[19] | Prototyping, controlling other accelerators, running very small models.[84]
GPU | Thousands of simpler cores designed for massive parallel processing (SIMT).[85] | High (Programmable via frameworks like CUDA) | Very high for parallel tasks like matrix multiplication. The dominant hardware for AI training and inference.[84] | Low (High power consumption and heat generation).[19] | General-purpose AI training and high-throughput inference in data centers.[83]
FPGA | Reconfigurable hardware fabric; logic gates can be reprogrammed after manufacturing.[19] | Medium (Programmable at the hardware level) | High performance with very low and deterministic latency.[19] | Medium (More efficient than GPUs as hardware can be fine-tuned to the application).[19] | Real-time, ultra-low-latency applications (for example industrial robotics, aerospace) where algorithms may evolve.[83]
ASIC / NPU | Custom silicon hardwired for a specific task (for example neural network operations). TPUs use a systolic array for matrix multiplication.[28][86] | Very Low (Fixed function) | Highest possible performance for its specific task.[86] | Very High (Highest performance per watt).[84] | Large-scale, high-volume inference where efficiency is paramount (for example Google's services, autonomous vehicles, mobile phones).[86]

Specialized Accelerators

  • GPUs: NVIDIA's GPUs dominate high-performance inference. The H100 GPU can deliver up to 30x faster LLM inference than its predecessor, the A100. Other competitors like AMD's MI300X also target this market.[87]
  • TPUs: Google's Tensor Processing Unit (TPU) is a prominent ASIC. Its core innovation is the systolic array, a grid of multiply-accumulators that performs massive matrix multiplications with minimal memory access, achieving high throughput and power efficiency.[28][88] Google's Ironwood (TPU v7) is the first TPU specifically designed for inference.[89]
  • Other ASICs/NPUs: A growing market of specialized chips includes AWS's Inferentia chips, Groq's Language Processing Units (LPUs), and Cerebras's Wafer Scale Engines, all designed to offer advantages in speed, cost, or power efficiency for inference.[90] On edge devices, NPUs like the Apple Neural Engine and Qualcomm Hexagon provide high-performance, low-power inference directly on consumer devices.[91]
GPU Performance Comparison for LLM Inference (Approx. Benchmarks)
GPU Model | Architecture | Memory | TDP | Tokens/sec (Llama 70B)
NVIDIA H100 | Hopper | 80GB HBM3 | 700W | ~2,400
NVIDIA A100 | Ampere | 80GB HBM2e | 400W | ~1,200
NVIDIA L4 | Ada Lovelace | 24GB GDDR6 | 72W | ~450
AMD MI300X | CDNA 3 | 192GB HBM3 | 750W | ~2,200

Applications of AI Inference in Practice

AI inference is the engine that powers a vast and growing range of applications across nearly every industry, transforming raw data streams into actionable insights.

The AI inference market is projected to grow from $97 billion in 2024 to $1.3 trillion by 2032, reflecting its central role in AI adoption.[92]

AI Inference Market Projections
Year | Market Size (USD) | Growth Rate
2024 | $97 billion | -
2026 | $180 billion | 36.3%
2028 | $405 billion | 39.8%
2030 | $720 billion | 33.5%
2032 | $1.3 trillion | 34.4%

Healthcare

  • Medical Imaging and Diagnostics: CNNs are used to analyze medical scans such as X-rays, CT scans, and MRIs. These systems can infer the presence of anomalies, such as tumors or signs of diabetic retinopathy, assisting healthcare professionals by highlighting areas of concern.[93][94]
  • Personalized Medicine: By analyzing a patient's genetic information, lifestyle data, and clinical history, AI models can infer the most effective treatment plan for an individual.[94]
  • Predictive Health Monitoring: Wearable devices and smart sensors use inference to continuously monitor vital signs and predict the likelihood of adverse events, such as a heart attack or a fall, enabling early intervention.[95]

Financial Services

  • Fraud Detection and Risk Management: AI inference engines analyze millions of transactions in real-time, inferring patterns of fraudulent behavior from data such as transaction amount, location, and user history. Suspicious transactions can be automatically blocked in milliseconds.[3][96][97]
  • Algorithmic Trading: Predictive models analyze market data, news sentiment, and economic indicators to infer future price movements and execute trades automatically.[97]

Retail and E-commerce

  • Recommendation Engines: E-commerce and streaming platforms analyze a user's browsing history, past purchases, and real-time interactions to infer their preferences. This knowledge is used to generate hyper-personalized recommendations for products, movies, or music.[3]
  • Supply Chain and Inventory Optimization: AI models analyze historical sales data and external factors to infer future product demand, allowing retailers to optimize inventory levels and reduce waste.[3]

Autonomous Systems

  • Autonomous car: A self-driving car's perception system continuously runs inference on data from a suite of sensors (cameras, Lidar, radar). It infers the presence and classification of objects (for example other cars, pedestrians, traffic signs) and makes critical driving decisions in milliseconds.[98][99]
  • Robotics and Manufacturing: In smart factories, AI-powered cameras perform inference for automated quality control, detecting defects on an assembly line. Robots use inference for object recognition and manipulation.[3]

Other Applications

  • Cybersecurity: Inference models analyze network traffic in real-time to detect anomalies and infer patterns that signal a cyberattack or threat.[23]
  • Natural Language Processing: Powers applications like text classification, translation, and conversational AI chatbots, which rely on inference to interpret new text and generate responses.[100]

Current Challenges and Future Outlook

While AI inference has enabled transformative applications, the field faces significant challenges as models grow in scale and complexity.

Current Challenges in AI Inference

  • Model collapse: A significant emerging threat where new generations of AI models, trained on synthetic data generated by previous models, begin to lose diversity, compound errors, and degrade in quality.[101]
  • Model drift: A common problem where a model's performance degrades over time as the real-world data it encounters during inference "drifts" or changes from the data it was trained on. 91% of production ML models suffer from this drift.[102]
  • Explainability and Trust: Many deep learning models function as "black boxes," making it difficult to understand the reasoning behind their outputs. This lack of transparency is a major obstacle to adoption in high-stakes domains like healthcare and law.[24]
  • Security and Privacy: Inference surfaces novel attack vectors, including model inversion attacks to extract training data, membership inference to violate privacy, and adversarial attacks to bypass safety guardrails.[23]
  • Efficiency, Cost, and Environmental Impact: The scale of modern LLMs has led to immense computational and energy costs for inference, which can account for 70-80% of a model's total lifetime energy consumption.[103][104]

Future Directions and Emerging Trends

  • Reasoning Models: The frontier of AI is moving toward models like OpenAI's o1 and DeepSeek's R1, which shift computational effort from training to inference. These "reasoning models" apply substantial compute during query processing to perform complex, multi-step reasoning tasks.[105]
  • Causal AI: There is a growing movement to develop AI systems that can understand and reason about cause-and-effect relationships, rather than just identifying statistical correlations. Causal AI promises to create models that are more robust, less prone to bias, and inherently more explainable.[24]
  • Hardware-Software Co-design: Achieving the next level of inference efficiency will require tighter integration between model architecture design, software compilers, and hardware accelerators. This involves designing neural networks that are optimized for the strengths of specific hardware and, conversely, designing hardware that is tailored to emerging model architectures.[62]
  • Agentic Workflows: Future AI systems will increasingly function as "agents" that can decompose a complex problem into smaller steps, use external tools (like APIs), and interact with an environment to achieve a goal. This shift dramatically increases the number of inference calls required per user query, making inference efficiency even more critical.[49][58]

See Also

References

  1. 1.0 1.1 "What is AI inference? How it works and examples". Google Cloud. https://cloud.google.com/discover/what-is-ai-inference.
  2. 2.0 2.1 2.2 2.3 "Explore NVIDIA AI Inference Tools and Technologies". NVIDIA Developer. https://developer.nvidia.com/topics/ai/ai-inference.
  3. 3.00 3.01 3.02 3.03 3.04 3.05 3.06 3.07 3.08 3.09 3.10 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19 "What is AI inference?". Google Cloud. https://cloud.google.com/discover/what-is-ai-inference.
  4. 4.0 4.1 4.2 "What is AI inference?". IBM. https://www.ibm.com/think/topics/ai-inference.
  5. "LLM Inference Benchmarking: How Much Does Your LLM Inference Cost?". NVIDIA Developer. https://developer.nvidia.com/blog/llm-inference-benchmarking-how-much-does-your-llm-inference-cost/. Retrieved 2025-01-21.
  6. 6.00 6.01 6.02 6.03 6.04 6.05 6.06 6.07 6.08 6.09 6.10 6.11 "History of artificial intelligence". Wikipedia. https://en.wikipedia.org/wiki/History_of_artificial_intelligence.
  7. 7.0 7.1 "The History of Artificial Intelligence". IBM. https://www.ibm.com/think/topics/history-of-artificial-intelligence.
  8. 8.0 8.1 "Inference engine". Wikipedia. https://en.wikipedia.org/wiki/Inference_engine.
  9. "History Of AI In 33 Breakthroughs: The First Expert System". Forbes. https://www.forbes.com/sites/gilpress/2022/10/29/history-of-ai-in-33-breakthroughs-the-first-expert-system/.
  10. "Expert system". A History of Artificial Intelligence. https://ahistoryofai.com/expert-system/.
  11. "Expert system". Wikipedia. https://en.wikipedia.org/wiki/Expert_system.
  12. "A Short History of Artificial Intelligence". HPE Community. https://community.hpe.com/t5/ai-unlocked/a-short-history-of-artificial-intelligence/ba-p/7172315.
  13. "An Overview of the Rise and Fall of Expert Systems". Medium. https://medium.com/version-1/an-overview-of-the-rise-and-fall-of-expert-systems-14e26005e70e.
  14. "Why did expert systems fall?". Retrocomputing Stack Exchange. https://retrocomputing.stackexchange.com/questions/6456/why-did-expert-systems-fall.
  15. "Machine learning". IBM. https://www.ibm.com/think/topics/machine-learning.
  16. 16.0 16.1 16.2 16.3 16.4 "Introduction to Inference in Artificial Intelligence". Pass4Sure. https://www.pass4sure.com/blog/introduction-to-inference-in-artificial-intelligence/.
  17. 17.0 17.1 17.2 17.3 "What is Forward Propagation in Neural Networks?". DataCamp. https://www.datacamp.com/tutorial/forward-propagation-neural-networks.
  18. "NVIDIA TensorRT". NVIDIA. https://developer.nvidia.com/tensorrt.
  19. 19.0 19.1 19.2 19.3 19.4 19.5 19.6 19.7 19.8 "FPGA vs GPU vs CPU: Hardware Options for AI Applications". Avnet. https://my.avnet.com/silica/resources/article/fpga-vs-gpu-vs-cpu-hardware-options-for-ai-applications/.
  20. 20.0 20.1 20.2 20.3 20.4 20.5 20.6 "Convolutional Neural Network (CNN)". NVIDIA. https://developer.nvidia.com/discover/convolutional-neural-network.
  21. "Forward propagation". Telnyx. https://telnyx.com/learn-ai/forward-propogation-ai.
  22. "How to Implement the Forward Method for a CNN in PyTorch". YouTube. https://www.youtube.com/watch?v=MasG7tZj-hw.
  23. 23.0 23.1 23.2 "AI Inference: Guide and Best Practices". Mirantis. https://www.mirantis.com/blog/what-is-ai-inference-a-guide-and-best-practices/.
  24. 24.0 24.1 24.2 24.3 "An explanation of causal AI". TechTarget. https://www.techtarget.com/whatis/video/An-explanation-of-causal-AI.
  25. 25.0 25.1 "On AI and Types of Reasoning". Towards Data Science. https://towardsdatascience.com/on-ai-and-types-of-reasoning-fc6980295158.
  26. 26.0 26.1 26.2 26.3 "Deduction vs. Induction vs. Abduction". Merriam-Webster. https://www.merriam-webster.com/grammar/deduction-vs-induction-vs-abduction.
  27. 27.0 27.1 27.2 27.3 27.4 "What Are the Different Types of Reasoning in AI?". Milvus. https://milvus.io/ai-quick-reference/what-are-the-different-types-of-reasoning-in-ai.
  28. 28.0 28.1 28.2 "TPU system architecture". Google Cloud. https://cloud.google.com/tpu/docs/system-architecture-tpu-vm.
  29. 29.0 29.1 29.2 29.3 29.4 29.5 "LLM Inference Performance Engineering: Best Practices". Databricks. https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices.
  30. 30.0 30.1 30.2 30.3 30.4 30.5 30.6 "LLM Inference Optimization". Clarifai. https://www.clarifai.com/blog/llm-inference-optimization/.
  31. 31.0 31.1 31.2 31.3 31.4 31.5 "LLM Inference Optimization 101". DigitalOcean. https://www.digitalocean.com/community/tutorials/llm-inference-optimization.
  32. 32.0 32.1 32.2 32.3 "Bayesian network". Wikipedia. https://en.wikipedia.org/wiki/Bayesian_network.
  33. 33.0 33.1 "Junction tree algorithm". Wikipedia. https://en.wikipedia.org/wiki/Junction_tree_algorithm.
  34. "A Comprehensive Review of Convolutional Neural Networks". MDPI. https://www.mdpi.com/2079-3197/11/3/52.
  35. "Convolutional neural network". Wikipedia. https://en.wikipedia.org/wiki/Convolutional_neural_network.
  36. 36.0 36.1 36.2 "What is a CNN?". HPE. https://www.hpe.com/us/en/what-is/convolutional-neural-network.html.
  37. "What is a Convolution Neural Network?". HPE. https://www.hpe.com/us/en/what-is/convolutional-neural-network.html#:~:text=CNNs%20consist%20of%20layers%20that,features%20to%20the%20final%20output..
  38. 38.0 38.1 38.2 38.3 38.4 38.5 38.6 38.7 "How Inference is Done in Transformer". Medium. https://medium.com/@sachinsoni600517/how-inference-is-done-in-transformer-3a1fd1a8bfea.
  39. 39.0 39.1 "The Basics of Transformer Inference". JAX-ML. https://jax-ml.github.io/scaling-book/inference/.
  40. 40.0 40.1 40.2 "Exact Inference in Bayesian Networks". GeeksforGeeks. https://www.geeksforgeeks.org/artificial-intelligence/exact-inference-in-bayesian-networks/.
  41. "Inference in Graphical Probability Models". George Mason University. https://mason.gmu.edu/~klaskey/GraphicalModels/GraphicalModels_Unit4_JTInference.pdf.
  42. "Exact Inference in Graphical Models". pgmpy. https://pgmpy.org/detailed_notebooks/5.%20Exact%20Inference%20in%20Graphical%20Models.html.
  43. "Exact Inference: Variable Elimination". Carnegie Mellon University. https://www.cs.cmu.edu/~epxing/Class/10708-14/scribe_notes/scribe_note_lecture4.pdf.
  44. 44.0 44.1 "From Variable Elimination to the Junction Tree Algorithms". Stanford AI Lab. https://ai.stanford.edu/~paskin/gm-short-course/lec3.pdf.
  45. 45.0 45.1 45.2 45.3 45.4 45.5 "Understand LLM latency and throughput metrics". Anyscale. https://docs.anyscale.com/llm/serving/benchmarking/metrics.
  46. 46.0 46.1 46.2 46.3 46.4 46.5 "Introduction to model evaluation". Vertex AI. https://cloud.google.com/vertex-ai/docs/evaluation/introduction.
  47. 47.0 47.1 "LLM Inference Optimization Techniques: A Comprehensive Analysis". Medium. https://medium.com/@sahin.samia/llm-inference-optimization-techniques-a-comprehensive-analysis-1c434e85ba7c.
  48. 48.0 48.1 48.2 48.3 48.4 48.5 48.6 "Performance Metrics in Machine Learning: The Complete Guide". Neptune.ai. https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide.
  49. 49.0 49.1 49.2 49.3 49.4 "AI Inference Performance". NVIDIA. https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference.
  50. "What's the difference between throughput and latency?". AWS. https://aws.amazon.com/compare/the-difference-between-throughput-and-latency/.
  51. 51.0 51.1 51.2 51.3 51.4 51.5 51.6 51.7 51.8 "LLM Inference Metrics: A Comprehensive Guide". BentoML. https://bentoml.com/llm/inference-optimization/llm-inference-metrics.
  52. "Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference". AWS. https://aws.amazon.com/blogs/machine-learning/optimizing-ai-responsiveness-a-practical-guide-to-amazon-bedrock-latency-optimized-inference/.
  53. "Artificial Intelligence: Understanding Training & Inference". ViaPhoton. https://viaphoton.com/artificial-intelligence-understanding-training-inference/.
  54. "Throughput-Latency Tradeoff in LLM Inference". Medium. https://medium.com/better-ml/throughput-latency-tradeoff-in-llm-inference-5a9e0d1d2c14.
  55. 55.0 55.1 "Define your evaluation metrics". Google Cloud. https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval.
  56. "LLM Evaluation Metrics: Everything You Need for LLM Evaluation". Confident AI. https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation.
  57. "Measuring the performance of our models on real-world tasks". OpenAI. https://openai.com/index/gdpval/.
  58. 58.0 58.1 58.2 "MLPerf Inference v5.1 Results". MLCommons. https://mlcommons.org/2025/09/mlperf-inference-v5-1-results/.
  59. "MLPerf Inference: Datcenter". MLCommons. https://mlcommons.org/benchmarks/inference-datacenter/. Cite error: Invalid <ref> tag; name "[47]" defined multiple times with different content
  60. 60.0 60.1 "AI Energy Score". Hugging Face. https://huggingface.github.io/AIEnergyScore/.
  61. "LLM-Inference-Bench: A Comprehensive Benchmarking Suite for LLM Inference". arXiv. https://arxiv.org/html/2411.00136v1.
  62. 62.0 62.1 "A Survey on Efficient LLM Inference". arXiv. https://arxiv.org/pdf/2404.14294.
  63. 63.0 63.1 63.2 "Quantization in Deep Learning". GeeksforGeeks. https://www.geeksforgeeks.org/deep-learning/quantization-in-deep-learning/.
  64. "What is quantization in machine learning?". Cloudflare. https://www.cloudflare.com/learning/ai/what-is-quantization/.
  65. "An Introduction to Model Quantization with Large Language Models". DigitalOcean. https://www.digitalocean.com/community/tutorials/model-quantization-large-language-models.
  66. "A Comprehensive Guide to Neural Network Model Pruning". Datature. https://datature.io/blog/a-comprehensive-guide-to-neural-network-model-pruning.
  67. "Introduction to Pruning in Deep Learning". Medium. https://medium.com/@anhtuan_40207/introduction-to-pruning-4d60ea4e81e9.
  68. "Neural Network Pruning in Deep Learning". GeeksforGeeks. https://www.geeksforgeeks.org/deep-learning/neural-network-pruning-in-deep-learning/.
  69. 69.0 69.1 "What is Knowledge Distillation?". Lightly AI. https://www.lightly.ai/blog/knowledge-distillation.
  70. 70.0 70.1 "Knowledge Distillation with Teacher Assistant for Model Compression". The Daily Dose of Data Science. https://www.dailydoseofds.com/p/knowledge-distillation-with-teacher-assistant-for-model-compression/.
  71. "AI Inference Optimization Techniques and Solutions". Nebius. https://nebius.com/blog/posts/inference-optimization-techniques-solutions.
  72. "Achieve 23x LLM Inference Throughput & Reduce p50 Latency". Anyscale. https://www.anyscale.com/blog/continuous-batching-llm-inference. Retrieved 2025-01-21.
  73. "Looking back at speculative decoding". Google Research. https://research.google/blog/looking-back-at-speculative-decoding/. Retrieved 2025-01-21.
  74. "Optimizing and Serving Models with NVIDIA TensorRT and NVIDIA Triton". NVIDIA. https://developer.nvidia.com/blog/optimizing-and-serving-models-with-nvidia-tensorrt-and-nvidia-triton/. Retrieved 2025-01-21.
  75. "Open Source Inference at Full Throttle: Exploring TGI and vLLM". Kelk. https://kelk.ai/blog/inference-engines. Retrieved 2025-01-21.
  76. "Triton Inference Server". GitHub. https://github.com/triton-inference-server/server. Retrieved 2025-01-21.
  77. "ONNX Runtime". Microsoft. https://onnxruntime.ai/.
  78. "TensorFlow Lite". TensorFlow. https://www.tensorflow.org/lite.
  79. "Edge AI vs. cloud AI: What's the difference?". IBM. https://www.ibm.com/think/topics/edge-vs-cloud-ai.
  80. "Edge AI Cameras vs. Cloud: Balancing Latency, Cost & Reach". Medium. https://medium.com/@API4AI/edge-ai-cameras-vs-cloud-balancing-latency-cost-reach-7e660131977f.
  81. 81.0 81.1 "The AI Edge Computing Cost: Local Processing vs. Cloud Pricing". Monetizely. https://www.getmonetizely.com/articles/the-ai-edge-computing-cost-local-processing-vs-cloud-pricing.
  82. "Cloud vs. Edge: Where Should AI Training Really Happen?". Datacenters.com. https://www.datacenters.com/news/cloud-vs-edge-where-should-ai-training-really-happen.
  83. 83.0 83.1 83.2 "AI Deep Learning Inference Acceleration". RidgeRun. https://www.ridgerun.com/post/ai-deep-learning-inference-acceleration.
  84. 84.0 84.1 84.2 "GPU vs FPGA vs ASIC vs CPU: Which Chip is Best for AI?". PCBONLINE. https://www.pcbonline.com/blog/gpu-vs-fpga-vs-asic-vs-cpu.html.
  85. "Inside Google's TPU and GPU Comparisons". SkyMod. https://skymod.tech/inside-googles-tpu-and-gpu-comparisons/.
  86. 86.0 86.1 86.2 "Custom ASICs for Real-Time Inference at the Edge". Geeta University. https://blog.geetauniversity.edu.in/custom-asics-for-real-time-inference-at-the-edge/.
  87. "Comparing NVIDIA H100 vs A100 GPUs for AI Workloads". OpenMetal. https://openmetal.io/resources/blog/nvidia-h100-vs-a100-gpu-comparison/.
  88. "Tensor Processing Unit". Wikipedia. https://en.wikipedia.org/wiki/Tensor_Processing_Unit.
  89. "Ironwood: The first Google TPU for the age of inference". Google. https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/.
  90. "AI Chip - Amazon Inferentia". AWS. https://aws.amazon.com/ai/machine-learning/inferentia/.
  91. "What Is Apple's Neural Engine and How Does It Work?". MakeUseOf. https://www.makeuseof.com/what-is-a-neural-engine-how-does-it-work/.
  92. "AI Inference Market Size, Share & Growth, 2025 To 2030". MarketsandMarkets. https://www.marketsandmarkets.com/Market-Reports/ai-inference-market-189921964.html. Retrieved 2025-01-21.
  93. "What is AI inference?". Red Hat. https://www.redhat.com/en/topics/ai/what-is-ai-inference.
  94. 94.0 94.1 "12 Real Life AI in Healthcare Examples". Keragon. https://www.keragon.com/blog/ai-in-healthcare-examples.
  95. "Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices". FDA. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-software-medical-device.
  96. "What is AI inference?". Red Hat. https://www.redhat.com/en/topics/ai/what-is-ai-inference#:~:text=Finance%3A%20After%20being%20trained%20on,privacy%2C%20and%20improve%20brand%20reputation..
  97. 97.0 97.1 "AI in Finance". Google Cloud. https://cloud.google.com/discover/finance-ai.
  98. "An introduction to AI". TechTarget. https://www.techtarget.com/whatis/video/An-introduction-to-AI.
  99. "Inference in AI". GeeksforGeeks. https://www.geeksforgeeks.org/artificial-intelligence/inference-in-ai/.
  100. "What is Reasoning in AI?". IBM Think. https://www.ibm.com/think/topics/ai-reasoning.
  101. "An explanation of AI model collapse". TechTarget. https://www.techtarget.com/whatis/video/An-explanation-of-AI-model-collapse.
  102. "A Guide to LLM Inference Performance Monitoring". Symbl.ai. https://symbl.ai/developers/blog/a-guide-to-llm-inference-performance-monitoring/.
  103. "Power Consumption Benchmark for Embedded AI Inference". ResearchGate. https://www.researchgate.net/publication/385300510_Power_Consumption_Benchmark_for_Embedded_AI_Inference.
  104. "Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute". Microsoft Research. https://www.microsoft.com/en-us/research/publication/energy-use-of-ai-inference-efficiency-pathways-and-test-time-compute/. Cite error: Invalid <ref> tag; name "[80]" defined multiple times with different content
  105. "The Economics of AI Training and Inference: How DeepSeek Broke the Cost Curve". Adyog. https://blog.adyog.com/2025/02/09/the-economics-of-ai-training-and-inference-how-deepseek-broke-the-cost-curve/.