Prediction in machine learning refers to the output produced by a trained model when it is applied to new, previously unseen data. After a model has completed its training phase by learning patterns and relationships from historical data, it can generate estimates or forecasts for new inputs. This process is central to virtually every machine learning application, from classification tasks that assign category labels to regression tasks that output continuous numerical values.
Prediction is distinct from the training process itself. During training, the model adjusts its internal parameters to minimize a loss function. During prediction (also called the inference phase), those parameters are fixed, and the model simply computes an output for a given input. Understanding what predictions are, how they are generated, and how to evaluate them is essential for practitioners building real-world machine learning systems.
The terms "prediction," "inference," and "estimation" are sometimes used interchangeably, but they carry different meanings depending on context.
| Term | Primary Goal | Focus | Typical Use Case |
|---|---|---|---|
| Prediction | Forecast an outcome for new data | Output accuracy on unseen inputs | Spam detection, stock price forecasting |
| Inference | Understand relationships between variables | Coefficients, causal mechanisms, feature importance | Scientific research, policy analysis |
| Estimation | Approximate a population parameter | Standard errors, confidence intervals | Polling, clinical trials |
Prediction asks: "Given this input, what is the most likely output?" A data scientist focused on prediction cares about metrics like accuracy, AUC, or RMSE, measuring how well the model's outputs match reality on held-out test data.
Inference asks: "Which variables are associated with the outcome, and how?" A researcher focused on inference cares about model coefficients, standard errors, and statistical significance. The goal is to explain the data-generating process, not merely to forecast outcomes.
Estimation is a broader umbrella term. It can refer to estimating model parameters during training (parameter estimation) or estimating a predicted value for a new observation (point estimation). In practice, both prediction and inference involve estimation at some level.
The distinction matters for model selection. Complex models like deep neural networks or random forests often excel at prediction but are difficult to interpret, making them poor choices for inference. Simpler models like linear regression or logistic regression may sacrifice some predictive accuracy but provide clear, interpretable coefficients suitable for inference.
In a classification model, predictions take the form of discrete class labels. A binary classifier might output "spam" or "not spam," while a multi-class classifier could assign one of several categories such as "cat," "dog," or "bird."
Most classification models produce class probabilities internally before arriving at a final label. For example, a logistic regression model outputs a probability between 0 and 1 via the sigmoid function. For multi-class problems, models typically use the softmax function, which converts a vector of raw scores (logits) into a probability distribution where all values sum to 1.0. The class with the highest probability is then selected as the predicted label.
| Classification Scenario | Example Output | Output Type |
|---|---|---|
| Binary classification | P(spam) = 0.92 | Single probability |
| Multi-class classification | P(cat) = 0.7, P(dog) = 0.2, P(bird) = 0.1 | Probability distribution |
| Multi-label classification | [positive, comedy] | Set of labels |
The decision threshold (commonly 0.5 for binary tasks) can be adjusted depending on the application. In medical screening, lowering the threshold increases sensitivity so that fewer positive cases are missed, even at the cost of more false positives.
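A minimal NumPy sketch of how raw scores become probabilities and then labels; the scores and the adjusted threshold are illustrative, not from a trained model:

```python
import numpy as np

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Convert a vector of raw scores (logits) into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Binary case: score -> probability -> label at the default 0.5 threshold
p_spam = sigmoid(2.5)  # ~0.92
label = "spam" if p_spam >= 0.5 else "not spam"

# A screening application might lower the threshold to 0.3, accepting
# more false positives in exchange for missing fewer true positives
screen_positive = sigmoid(-0.3) >= 0.3

# Multi-class case: logits -> probability distribution -> argmax label
classes = ["cat", "dog", "bird"]
probs = softmax(np.array([2.0, 0.8, -0.5]))
predicted = classes[int(np.argmax(probs))]
```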
In a regression model, predictions are continuous numerical values. A house price model might predict $342,500; a weather model might predict a temperature of 28.3 degrees Celsius. These are called point predictions because they provide a single best-guess value.
Regression predictions are produced by passing input features through the model's learned function. For a simple linear regression, this is a weighted sum of features plus a bias term. For more complex models like gradient boosting or neural networks, the computation involves multiple layers of nonlinear transformations.
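For the linear case, the weighted-sum computation fits in a few lines. The feature set and parameter values below are hypothetical, standing in for what training would have produced:

```python
import numpy as np

# Hypothetical learned parameters for a house-price model
# (feature order: square feet, bedrooms, miles from city center)
weights = np.array([150.0, 12000.0, -800.0])
bias = 50000.0

def predict(features):
    """Point prediction: weighted sum of features plus a bias term."""
    return features @ weights + bias

x_new = np.array([1500.0, 3.0, 5.0])  # 1500 sq ft, 3 bedrooms, 5 miles out
price = predict(x_new)  # 150*1500 + 12000*3 - 800*5 + 50000 = 307000.0
```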
A point prediction gives a single value as the best estimate of the outcome. While straightforward, point predictions provide no information about how confident the model is or how much the actual value might deviate from the prediction.
A probabilistic prediction instead outputs a full probability distribution (or a summary of one) over possible outcomes. This approach captures the inherent uncertainty in the prediction and enables more informed decision-making.
| Aspect | Point Prediction | Probabilistic Prediction |
|---|---|---|
| Output | Single value | Distribution or interval |
| Uncertainty info | None | Quantified |
| Complexity | Simple | More complex |
| Decision-making | Limited context | Risk-aware decisions possible |
| Example | "Sales will be 500 units" | "Sales will be between 420 and 580 units with 95% probability" |
Probabilistic predictions are especially valuable in domains where the cost of errors varies. In supply chain management, knowing that demand could range from 400 to 600 units (rather than just "500") helps planners set appropriate inventory buffers.
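The sales example from the table can be sketched directly: a probabilistic model that outputs a mean and a standard deviation yields an interval, while the point prediction keeps only the mean. The numbers below are illustrative, and the interval assumes a normal distribution:

```python
# Hypothetical output of a probabilistic demand model
mean, std = 500.0, 40.8

# Point prediction: the single best-guess value
point_prediction = mean  # "Sales will be 500 units"

# Probabilistic prediction: a 95% interval under a normal assumption,
# mean +/- 1.96 standard deviations
lower = mean - 1.96 * std
upper = mean + 1.96 * std  # roughly (420, 580)
```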
Every prediction carries some degree of uncertainty. Quantifying that uncertainty is critical for responsible deployment of machine learning systems.
Calibration measures whether a model's predicted probabilities correspond to actual observed frequencies. A well-calibrated model that predicts a 70% chance of rain should be correct roughly 70% of the time across many such predictions. Neural networks are often poorly calibrated, tending to produce overconfident predictions. Techniques like Platt scaling and temperature scaling can improve calibration after training.
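Temperature scaling is simple enough to sketch: divide the logits by a scalar T > 1 before the softmax, which softens overconfident probabilities without changing the predicted label. The logits and the value of T below are illustrative; in practice T is fitted on a held-out validation set:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Overconfident logits from a hypothetical neural network
logits = np.array([4.0, 1.0, 0.5])
uncal = softmax(logits)  # top-class probability ~0.93

# Temperature scaling: divide logits by T before the softmax
T = 2.0  # illustrative; normally tuned to minimize validation NLL
calibrated = softmax(logits / T)  # top-class probability shrinks

# The argmax (and hence the predicted label) is unchanged
assert np.argmax(calibrated) == np.argmax(uncal)
```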
Aleatoric uncertainty arises from inherent noise or randomness in the data. No model can eliminate this type of uncertainty; it is a property of the problem itself. For instance, predicting the exact outcome of a coin flip is fundamentally uncertain.
Epistemic uncertainty arises from limitations in the model or insufficient training data. This type of uncertainty can, in principle, be reduced with more data or a better model. Techniques like Monte Carlo dropout (running multiple forward passes with dropout active at prediction time) and model ensembles can estimate epistemic uncertainty.
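A small sketch of the ensemble idea: refit the same model on bootstrap resamples of the data and treat the spread of the members' predictions as an epistemic-uncertainty estimate. The data and the choice of a 1-D linear model are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 3x + 2 plus noise
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)

x_new = 5.0
preds = []
for _ in range(20):
    idx = rng.integers(0, len(x), size=len(x))        # bootstrap resample
    slope, intercept = np.polyfit(x[idx], y[idx], 1)  # refit the model
    preds.append(slope * x_new + intercept)

mean_pred = np.mean(preds)     # ensemble point prediction (near 17)
epistemic_std = np.std(preds)  # disagreement across ensemble members
```

With more training data the members would agree more closely and `epistemic_std` would shrink, which is exactly the reducible character of epistemic uncertainty.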
These two interval types are often confused but serve different purposes.
A confidence interval estimates the range likely to contain the true mean of a response variable. It quantifies uncertainty about where the average outcome lies for a given set of inputs.
A prediction interval estimates the range likely to contain a single future observation. Because individual observations vary more than averages, prediction intervals are always wider than confidence intervals for the same data.
| Interval Type | What It Estimates | Width | Use Case |
|---|---|---|---|
| Confidence interval | Range for the population mean | Narrower | "The average house price for 3-bedroom homes is between $310K and $340K" |
| Prediction interval | Range for a single new observation | Wider | "This specific 3-bedroom house will sell for between $280K and $370K" |
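The difference between the two intervals is visible in the standard-error formulas for simple linear regression: the prediction interval adds a 1 under the square root to account for the variance of a single observation. A sketch with illustrative data (the t critical value is hard-coded for 28 degrees of freedom):

```python
import numpy as np
from math import sqrt

# Illustrative data from y = 2x + 5 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + 5.0 + rng.normal(0, 2.0, 30)

n = len(x)
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
s = sqrt(np.sum(resid**2) / (n - 2))   # residual standard error
sxx = np.sum((x - x.mean())**2)

x0 = 5.0
y0 = slope * x0 + intercept
t = 2.048  # approximate 95% t critical value, n - 2 = 28 df

se_mean = s * sqrt(1/n + (x0 - x.mean())**2 / sxx)      # for the mean
se_pred = s * sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)  # for one observation

ci = (y0 - t * se_mean, y0 + t * se_mean)  # confidence interval (narrower)
pi = (y0 - t * se_pred, y0 + t * se_pred)  # prediction interval (wider)
```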
A prediction error, also known as a residual, is the difference between the observed (actual) value and the predicted value:
Residual = Actual Value - Predicted Value
A positive residual means the model underestimated the true value. A negative residual means the model overestimated it. Analyzing residuals is one of the most important tools for diagnosing model performance.
Common metrics derived from prediction errors include:
| Metric | Formula Description | Interpretation |
|---|---|---|
| Mean Absolute Error (MAE) | Average of absolute residuals | Average magnitude of errors |
| Mean Squared Error (MSE) | Average of squared residuals | Penalizes large errors more heavily |
| Root Mean Squared Error (RMSE) | Square root of MSE | Same units as the target variable |
| Mean Absolute Percentage Error (MAPE) | Average of percentage errors | Scale-independent error measure |
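All four metrics are direct aggregations of the residuals and can be computed in a few lines of NumPy (the values are illustrative):

```python
import numpy as np

actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 230.0])

residuals = actual - predicted                    # [-10, 10, -10, 20]
mae = np.mean(np.abs(residuals))                  # 12.5
mse = np.mean(residuals**2)                       # 175.0
rmse = np.sqrt(mse)                               # ~13.23, same units as target
mape = np.mean(np.abs(residuals / actual)) * 100  # ~7.42%
```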
A well-behaved model produces residuals that are randomly distributed around zero with no discernible pattern. If residuals show systematic structure (for example, the model consistently underpredicts for high values), this indicates the model has not fully captured the underlying relationship and may need architectural changes or additional features.
In supervised learning, prediction is the central goal. The model learns from labeled input-output pairs during training and then applies that learned mapping to produce predictions on new inputs. The quality of predictions is directly measured against known ground truth labels in the test set.
Unsupervised learning does not produce predictions in the traditional sense. Instead, it discovers patterns such as clusters or latent dimensions within data. However, the discovered structures can support downstream prediction tasks. For example, cluster assignments from K-means can serve as features for a supervised classifier.
In reinforcement learning, prediction takes a different form. An agent learns to predict the expected cumulative reward (value function) for states or state-action pairs. These value predictions guide the agent's policy, helping it choose actions that maximize long-term reward.
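A tabular TD(0) sketch makes the value-prediction idea concrete: the agent repeatedly updates its predicted cumulative reward for each state toward a bootstrapped target. The three-state chain and reward scheme below are a toy assumption:

```python
import numpy as np

# Tiny chain: state 0 -> 1 -> 2 (terminal); reward +1 on reaching state 2
n_states = 3
values = np.zeros(n_states)  # value predictions, initialized to 0
alpha, gamma = 0.1, 0.9      # learning rate, discount factor

for _ in range(500):         # repeated episodes of the same walk
    for state, reward, next_state in [(0, 0.0, 1), (1, 1.0, 2)]:
        target = reward + gamma * values[next_state]   # TD target
        values[state] += alpha * (target - values[state])

# values[1] converges to 1.0 (immediate reward); values[0] converges to
# gamma * values[1] = 0.9 (the discounted reward one step away)
```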
Machine learning systems can serve predictions in two main patterns, each suited to different operational requirements.
| Aspect | Batch Prediction | Online Prediction |
|---|---|---|
| Timing | Predictions computed before requests arrive | Predictions computed after requests arrive |
| Latency | High (hours to days) | Low (milliseconds to seconds) |
| Throughput | Very high | Lower per-request |
| Cost | Often cheaper (uses off-peak resources) | Higher (always-on infrastructure) |
| Data freshness | Uses historical data snapshot | Can use real-time features |
| Example | Nightly product recommendations for all users | Fraud detection on each transaction |
Batch prediction (also called offline inference) generates predictions for a large set of inputs on a schedule, such as every hour or every night. Results are stored in a database and served when needed. This approach is cost-effective and simple but introduces latency between when data arrives and when predictions are available.
Online prediction (also called real-time inference) generates predictions on demand as individual requests arrive. This is necessary for applications like fraud detection, autonomous driving, and conversational AI, where decisions must be made within milliseconds. Online prediction requires always-on serving infrastructure and careful optimization to meet latency requirements.
Many production systems use a hybrid approach. An e-commerce platform might run batch jobs overnight to precompute baseline recommendations for all users, then use an online model to adjust those recommendations in real time based on the user's current browsing session.
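The hybrid pattern can be sketched in a few lines: a batch job precomputes per-user rankings, and a cheap online step re-ranks them using the live session. All data, item names, and scoring logic here are illustrative placeholders:

```python
def batch_precompute(base_scores):
    """Nightly job: rank items per user from precomputed scores and cache them."""
    return {user: sorted(scores, key=scores.get, reverse=True)
            for user, scores in base_scores.items()}

def online_rerank(cache, user, session_category):
    """Per-request step: move items matching the live session to the front,
    preserving the batch ordering within each group (stable sort)."""
    items = cache[user]
    return sorted(items, key=lambda item: session_category not in item)

# Nightly batch job
cache = batch_precompute(
    {"alice": {"book:sci-fi": 0.9, "book:cooking": 0.7, "mug:cooking": 0.6}}
)

# Online request: the user is currently browsing cooking items
ranked = online_rerank(cache, "alice", "cooking")  # cooking items first
```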
Deploying models to serve predictions in production introduces engineering challenges around latency, throughput, and reliability.
Prediction latency is the time elapsed between receiving an input and returning the prediction. Acceptable latency varies by application: a search engine autocomplete feature might require sub-10ms latency, while a batch analytics pipeline can tolerate minutes or hours.
Key techniques for reducing prediction latency include model quantization (computing with lower-precision numbers), pruning and knowledge distillation (producing smaller models), batching of incoming requests, caching of frequently requested predictions, and hardware acceleration on GPUs or dedicated inference chips.
Popular serving frameworks include TensorFlow Serving, NVIDIA Triton Inference Server, and TorchServe, all of which provide optimized infrastructure for low-latency model serving.
As machine learning models are deployed in high-stakes domains like healthcare, finance, and criminal justice, the need to explain individual predictions has grown. The field of Explainable AI (XAI) provides tools for understanding why a model made a specific prediction.
SHAP (SHapley Additive exPlanations) is rooted in cooperative game theory. It assigns each feature a contribution value (Shapley value) that represents how much that feature pushed the prediction away from the average. SHAP provides both local explanations (for individual predictions) and global explanations (for overall model behavior).
LIME (Local Interpretable Model-Agnostic Explanations) works by generating perturbed versions of the input, observing how the model's prediction changes, and fitting a simple interpretable model (typically linear) to approximate the complex model's behavior locally. LIME is model-agnostic, meaning it works with any classifier or regressor.
| Method | Scope | Theoretical Basis | Speed | Model-Agnostic |
|---|---|---|---|---|
| SHAP | Local and global | Shapley values from game theory | Slower (exact), faster (approximations) | Yes |
| LIME | Local only | Local linear approximation | Generally faster | Yes |
Both methods help build trust in model predictions and can reveal when a model is relying on spurious correlations rather than genuinely predictive features.
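The core mechanics of a LIME-style local explanation can be sketched without any library: perturb the input near the point of interest, query the black-box model, and fit a proximity-weighted linear surrogate whose coefficients serve as local feature attributions. The black-box function, kernel width, and sampling scheme below are simplified assumptions, not the actual LIME implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    """Stand-in for a complex model (hypothetical, nonlinear)."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

# Point to explain
x0 = np.array([1.0, 2.0])

# 1. Perturb the input in a local neighborhood and query the model
perturbed = x0 + rng.normal(0, 0.1, size=(200, 2))
y = black_box(perturbed)

# 2. Weight samples by proximity to x0 (closer samples count more)
dists = np.linalg.norm(perturbed - x0, axis=1)
w = np.exp(-(dists ** 2) / 0.02)

# 3. Fit a weighted linear surrogate: sqrt-weight trick for least squares
A = np.hstack([perturbed, np.ones((200, 1))])  # features + intercept
sw = np.sqrt(w)[:, None]
coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)

# Local attributions approximate the local gradients:
# ~cos(1) = 0.54 for feature 0 and ~2 * x0[1] = 4.0 for feature 1
feature_attributions = coef[:2]
```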
Prediction bias occurs when a model's predictions systematically deviate from actual outcomes, either overall or for specific subgroups of the population.
Statistical prediction bias is the difference between the average predicted value and the average actual value. A model with zero prediction bias has predictions that are correct on average, though individual predictions may still have errors.
Algorithmic or fairness-related bias occurs when predictions are systematically less accurate or less favorable for certain demographic groups. This can arise from several sources: training data that underrepresents some groups, labels that encode historical human prejudice, proxy features correlated with protected attributes, and feedback loops in which a model's own predictions shape the data it is later retrained on.
Mitigation strategies operate at three stages: pre-processing (rebalancing or reweighting training data), in-processing (adding fairness constraints to the optimization objective), and post-processing (adjusting prediction thresholds per group to equalize error rates).
Prediction in machine learning spans virtually every industry:
| Domain | Prediction Task | Example |
|---|---|---|
| Natural language processing | Text classification, sentiment analysis | Classifying customer reviews as positive or negative |
| Computer vision | Image recognition, object detection | Identifying defects in manufactured parts |
| Healthcare | Disease prognosis, readmission risk | Predicting 30-day hospital readmission |
| Finance | Credit scoring, fraud detection | Flagging suspicious transactions in real time |
| Retail | Demand forecasting, recommendation systems | Predicting next-week sales for inventory planning |
| Climate science | Weather and climate modeling | Forecasting temperature and precipitation |
| Autonomous driving | Trajectory prediction | Predicting the future path of nearby vehicles |
Imagine you have a friend who has eaten at hundreds of restaurants. Every time you ask, "Will I like this new restaurant?" your friend thinks about all the restaurants they have tried before, what you liked, and what you did not like. Then they give you their best guess: "Yes, you will probably love it!" or "No, you probably will not enjoy it."
That guess is a prediction. Your friend is the "model," all the past restaurants are the "training data," and the new restaurant is the "new data." Sometimes your friend is very confident ("You will definitely love it!"), and sometimes less sure ("It could go either way"). A good predictor is right most of the time, but nobody is perfect. The important thing is knowing how much to trust the prediction, which is why we measure things like accuracy and uncertainty.