Attribute
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,130 words
See also: Machine learning terms
In machine learning, an attribute is an individual measurable property of an object, observation, or example that describes the instance being analyzed. Attributes are the columns of a tabular dataset, and they supply the raw signal that a model uses to learn patterns or make predictions. In a table of patient records, age, blood pressure, gender, and cholesterol level are attributes; in a table of houses, the number of bedrooms, square footage, neighborhood, and year built are attributes.
The word attribute is most common in the data mining and database traditions, where a column of a relational table is called an attribute. The same idea appears in pattern recognition and machine learning under the name feature, and in classical statistics under names such as independent variable, predictor, regressor, explanatory variable, or covariate. These terms are mostly interchangeable in practice, though each community has its own house style.
The two words attribute and feature are often used as synonyms, but a careful distinction does exist. In the Stanford machine learning glossary maintained by Ronny Kohavi, an attribute is defined as a quantity describing an instance, while a feature is the specification of an attribute and its value. Under that definition, color is an attribute and "color is blue" is a feature of a particular example. The attribute is the slot; the feature is the slot together with a concrete reading.
This subtlety matters in older data mining literature and in some textbooks, but it has largely faded in everyday usage. Most modern documentation, including the Wikipedia article on features, treats feature and attribute as the same thing: an individual measurable property of a data set. The Domino Data Lab data science dictionary also lists variable and attribute as synonyms for feature. The practical result is that a data scientist who talks about "adding a feature" and a database analyst who talks about "adding an attribute" are usually doing the same job.
There is also a small terminology drift across subfields. Computer vision and signal processing tend to favor feature, especially when referring to the output of a transformation such as a feature extraction step or an edge detector. Tabular data work, business intelligence, and survey research lean toward attribute or variable. None of these labels change what the underlying object is.
In supervised learning, attributes act as the independent variables of the model. A predictive system takes a vector of attribute values as input and returns an estimate of a label or target value. The label is the dependent variable; the attributes are everything used to predict it. The Wikipedia article on dependent and independent variables lists feature and label attribute as the data mining equivalents of independent and dependent variables in statistics.
This framing also clarifies a common naming question. A target column in a training data table is technically still an attribute of the row, but it is a special one: the value the model is trying to learn. To avoid confusion, many practitioners reserve the bare word attribute for the inputs and call the output the target, label, or response.
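As a minimal sketch of this convention, the snippet below separates the target column from the input attributes. The records and the column name `has_disease` are hypothetical and chosen only for illustration.

```python
# Hypothetical patient records; the column names are illustrative only.
rows = [
    {"age": 54, "blood_pressure": 130, "cholesterol": 210, "has_disease": 1},
    {"age": 37, "blood_pressure": 118, "cholesterol": 180, "has_disease": 0},
]

TARGET = "has_disease"  # assumed name of the label column


def split_inputs_and_target(rows, target):
    """Return (X, y): the input attributes of each row and the target values."""
    X = [{name: value for name, value in row.items() if name != target}
         for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_inputs_and_target(rows, TARGET)
```

After the split, `X` holds only the attributes used as inputs and `y` holds the values the model is trying to learn.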
Attributes come in several flavors, and the type determines what kinds of mathematical operations make sense on the values. The data mining community usually divides attribute types along two axes: by data nature and by level of measurement.
By data nature, the most common split is:
| Type | Description | Examples |
|---|---|---|
| Numerical (quantitative) | Values are real numbers or integers and support arithmetic | Age in years, weight in kilograms, pixel intensity |
| Categorical (qualitative) | Values come from a finite set of named categories | Country, product type, blood type |
| Binary | A categorical attribute with exactly two possible values | True or false, spam or not spam |
| Text | Free-form strings, usually unstructured | Product reviews, support tickets |
| Date or time | Values represent a moment or duration | Transaction date, session length |
Numerical attributes are further split into discrete (integer counts such as number of children) and continuous (real-valued such as temperature). Categorical attributes are further split into nominal, where no order exists between categories, and ordinal, where a meaningful order does exist.
By level of measurement, attributes are usually grouped using the classic Stevens scheme of nominal, ordinal, interval, and ratio. Nominal attributes only support equality checks; zip code and eye color are typical examples. Ordinal attributes also support ordering; a Likert scale of {poor, fair, good, very good, excellent} is ordinal. Interval attributes support meaningful differences but lack a true zero; Celsius and Fahrenheit temperatures are the textbook cases. Ratio attributes support meaningful ratios and have a real zero point, like weight, height, or counts.
The distinction is not academic. A linear model that treats a zip code as a number assumes that the gap between 10001 and 10002 means the same thing as the gap between 90000 and 90001, which is wrong because zip codes are labels, not quantities. Knowing the measurement level of every attribute is the first step in choosing a sensible preprocessing pipeline.
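One way to make the measurement levels concrete is to pick a summary statistic that respects each level. The sketch below is an illustration under stated assumptions, not a standard API: the function name `summarize` and its `order` parameter are hypothetical, and it follows the common convention of using the mode for nominal data, the median for ordinal data, and the mean for interval and ratio data.

```python
from statistics import mean, median_low, mode


def summarize(values, level, order=None):
    """Return a central-tendency summary that is meaningful for the level."""
    if level == "nominal":
        # Only equality is defined, so the most frequent value is the summary.
        return mode(values)
    if level == "ordinal":
        if order is None:
            raise ValueError("ordinal attributes need an explicit category order")
        # Rank the categories, take the lower median rank, and map it back.
        rank = {category: i for i, category in enumerate(order)}
        middle_rank = median_low([rank[v] for v in values])
        return order[middle_rank]
    if level in ("interval", "ratio"):
        # Differences are meaningful, so the arithmetic mean is defined.
        return mean(values)
    raise ValueError(f"unknown measurement level: {level}")


likert = ["poor", "fair", "good", "very good", "excellent"]
typical_zip = summarize(["10001", "90210", "10001"], "nominal")
typical_rating = summarize(["good", "poor", "excellent", "good", "fair"],
                           "ordinal", order=likert)
typical_weight = summarize([60, 70, 80], "ratio")
```

Zip codes are kept as strings here so that nothing downstream is tempted to average them.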
Most learning algorithms expect numeric input, so raw attribute values usually need transformation before training. Common preprocessing steps include:
| Step | What it does | When to use it |
|---|---|---|
| One-hot encoding | Turns each categorical value into its own binary column | Nominal attributes with a small to medium number of categories |
| Label encoding | Maps each category to an integer | Ordinal attributes, or as input to tree-based models |
| Frequency encoding | Replaces each category with how often it appears in the data | High-cardinality nominal attributes |
| Target encoding | Replaces each category with a statistic of the target for that group | High-cardinality categoricals where target leakage is controlled |
| Standardization | Rescales values so the mean is zero and the standard deviation is one | Linear models, neural networks, distance-based methods |
| Min-max scaling | Rescales values to a fixed range such as 0 to 1 | Image pixel intensities, neural network inputs |
| Normalization | Adjusts values to have unit norm or a chosen scale | Text features, sparse vectors |
| Imputation | Fills in missing values using a constant, mean, median, or model | Any attribute with gaps |
| Discretization | Converts a continuous attribute into a finite set of bins | Models that handle categoricals better than continuous inputs |
Some implementations of tree-based methods such as random forests and gradient boosting can handle categorical attributes directly, but most other algorithms cannot. One-hot encoding is the default for nominal data: a single attribute with three values, like color in {red, green, blue}, becomes three binary columns. Standardization is the default for numeric data when the model is sensitive to scale, which includes regularized linear and logistic regression, support vector machines, k-nearest neighbors, and most neural networks.
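The two default transformations can be sketched in a few lines of standard-library Python. The helper names `one_hot` and `standardize` are hypothetical, and the standardization here uses the population standard deviation, which is one of several reasonable choices.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode each categorical value as a list of 0/1 flags, one per category."""
    categories = sorted(set(values))  # sorted for a reproducible column order
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread to rescale.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

ages = [20, 30, 40]
scaled_ages = standardize(ages)
```

In a real pipeline the category list, mean, and standard deviation would be computed on the training data only and then reused unchanged on new data.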
This whole step is often called feature engineering. The same column on disk can become several different attributes in a trained model: a date might end up as year, month, day of week, and an is_weekend flag, and a price might end up as a raw value, a log-transformed value, and a percentile rank. The transformations are part of the model, not metadata about the data.
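The date and price examples can be sketched as follows. The function names are hypothetical, and `math.log1p` (the log of one plus the value) is an assumption made here so that a price of zero stays defined; a plain logarithm is equally common when all prices are positive.

```python
import math
from datetime import date


def expand_date(d):
    """Derive several attributes from a single date column."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),      # Monday is 0, Sunday is 6
        "is_weekend": d.weekday() >= 5,  # Saturday or Sunday
    }


def percentile_rank(value, column):
    """Share of values in the column that are less than or equal to value."""
    return sum(v <= value for v in column) / len(column)


def expand_price(price, all_prices):
    """Derive raw, log-transformed, and rank-based versions of a price."""
    return {
        "price": price,
        "log_price": math.log1p(price),
        "price_percentile": percentile_rank(price, all_prices),
    }


date_features = expand_date(date(2024, 1, 6))
price_features = expand_price(20, [10, 20, 30, 40])
```

Note that the percentile rank depends on the whole column, so it has to be computed against a fixed reference set such as the training data.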
Attributes are how the world enters the model. If the attributes do not contain the signal needed to predict the target, no amount of clever training will recover it. A spam classifier without any text features is not going to work, no matter which algorithm you pick.
At the same time, more attributes are not automatically better. Adding irrelevant or redundant attributes makes models harder to interpret, slower to train, and more prone to overfitting. It can also trigger the curse of dimensionality, a term Richard Bellman coined in 1957 while working on dynamic programming. As the number of attributes grows, the volume of the input space grows exponentially, a fixed number of data points becomes increasingly sparse, and distances between points become less informative because the nearest and farthest neighbors end up almost equally far away. Many algorithms, especially distance-based ones, degrade badly in this regime.
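The loss of distance contrast can be observed in a small simulation. This is an illustrative sketch, not a formal result: it assumes points drawn uniformly from the unit hypercube, and the function name `distance_contrast`, the sample size, and the seed are arbitrary choices.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    Points are drawn uniformly from the unit hypercube. A value close to
    zero means that all points are almost equally far from the query.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

The printed contrast should shrink as the number of dimensions grows, which is why nearest-neighbor style reasoning becomes unreliable with many attributes.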
The usual response is feature selection: pick a subset of attributes that carry most of the useful signal and drop the rest. There are three main families of feature selection methods:
| Family | How it works | Trade-off |
|---|---|---|
| Filter methods | Rank attributes by a statistical score such as correlation, mutual information, or chi-square, then keep the top ones | Fast and model-agnostic, but ignores interactions between attributes |
| Wrapper methods | Train and evaluate a model on different attribute subsets, often using forward or backward selection | Often produces the strongest subset for a given model, but is computationally expensive |
| Embedded methods | Let the learning algorithm pick attributes during training, as in LASSO regression or tree-based feature importance | Balances speed and accuracy, but is tied to a specific model class |
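A filter method is the simplest of the three to sketch. The example below ranks attributes by the absolute Pearson correlation with the target and keeps the top `k`. The data values are made up for illustration, the helper names are hypothetical, and correlation is only one of the scores mentioned in the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_k(columns, target, k):
    """Score every attribute independently and keep the k highest-scoring ones."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Made-up values for three candidate attributes and a binary target.
columns = {
    "last_login_days": [1, 3, 20, 45, 60, 2],
    "age": [25, 41, 33, 52, 29, 38],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
churned = [0, 0, 1, 1, 1, 0]

selected, scores = select_top_k(columns, churned, k=1)
```

Because each attribute is scored on its own, this approach cannot detect attributes that are only useful in combination, which is the trade-off listed for filter methods.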
A related response is dimensionality reduction, where new attributes are constructed as combinations of the originals. Principal component analysis is the textbook example: it builds a small set of new attributes that capture as much variance as possible from the full attribute set. Other techniques include linear discriminant analysis, independent component analysis, and autoencoders.
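To illustrate how principal component analysis constructs a new attribute, the sketch below estimates the first principal direction with power iteration on the sample covariance matrix. This is a teaching sketch under simplifying assumptions (a fixed starting vector and a fixed iteration count), not how production libraries implement PCA; the function names are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Estimate the direction of greatest variance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the attributes (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Assumed starting vector; it must not be orthogonal to the true direction.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance along the current vector
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """Replace each row by its coordinate along the given direction."""
    return [sum((value - mu) * w for value, mu, w in zip(row, means, direction))
            for row in rows]


# Two perfectly correlated attributes: the second is twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, direction = first_principal_component(rows)
new_attribute = project(rows, means, direction)
```

Here two original attributes collapse into one constructed attribute without losing any variance, because the second attribute is an exact multiple of the first.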
A single row of a dataset, with all of its attribute values filled in, is sometimes called a feature vector. The set of all possible feature vectors for a given attribute list is the feature space. A model can then be described geometrically: a classification model carves the feature space into regions for each class, and a regression model draws a surface over it. This geometric view also explains why the curse of dimensionality bites. Adding attributes means stretching the space into more dimensions, and the data points that used to feel close together start to look uniformly far apart.
Consider a small dataset for predicting whether a customer will churn from a streaming service. The raw table might have these attributes:
| Attribute | Type | Level | Notes |
|---|---|---|---|
| customer_id | Categorical | Nominal | Identifier; should not be used as a predictor |
| age | Numerical | Ratio | Discrete, integer years |
| country | Categorical | Nominal | High cardinality, needs encoding |
| plan_type | Categorical | Ordinal | basic, standard, premium |
| monthly_spend | Numerical | Ratio | Often log-transformed |
| signup_date | Date | Interval | Often split into year and month |
| last_login_days | Numerical | Ratio | Strong churn signal in many services |
| churned | Binary | Nominal | The target, not an input |
A practitioner would drop customer_id, one-hot encode country (or use a hashing trick if there are thousands of values), label encode plan_type while preserving its order, log-transform monthly_spend, derive year, month, and tenure_months from signup_date, and keep age and last_login_days as is. The result is a clean numeric attribute matrix that most standard machine learning algorithms can consume.
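The steps in this example can be sketched as a single row-level transformation. Everything not named in the table is an assumption made for illustration: the country vocabulary `COUNTRIES`, the `as_of` reference date needed to compute tenure, the use of `math.log1p` for the log transform, and the sample row itself.

```python
import math
from datetime import date

PLAN_ORDER = {"basic": 0, "standard": 1, "premium": 2}  # keeps the ordinal order
COUNTRIES = ["DE", "JP", "US"]  # hypothetical fixed vocabulary for one-hot columns


def transform(row, as_of):
    """Turn one raw customer row into (input attributes, target)."""
    signup = row["signup_date"]
    features = {
        "age": row["age"],                                   # kept as is
        "last_login_days": row["last_login_days"],           # kept as is
        "plan_type": PLAN_ORDER[row["plan_type"]],           # ordered label encoding
        "log_monthly_spend": math.log1p(row["monthly_spend"]),
        "signup_year": signup.year,
        "signup_month": signup.month,
        # Whole calendar months between signup and the reference date.
        "tenure_months": (as_of.year - signup.year) * 12
                         + (as_of.month - signup.month),
    }
    for country in COUNTRIES:                                # one-hot encoding
        features[f"country_{country}"] = 1 if row["country"] == country else 0
    # customer_id is never copied, so it cannot leak into the inputs.
    return features, row["churned"]


# Made-up example row.
raw_row = {
    "customer_id": "c-001",
    "age": 34,
    "country": "JP",
    "plan_type": "premium",
    "monthly_spend": 15.99,
    "signup_date": date(2023, 11, 15),
    "last_login_days": 42,
    "churned": 1,
}

features, target = transform(raw_row, as_of=date(2024, 2, 1))
```

A country outside the assumed vocabulary would produce all-zero one-hot columns here, which is one simple way to handle unseen categories.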
The choice of attributes and the way they are transformed often matters more than the choice of model. Two teams using the same gradient boosting library on the same raw data can produce wildly different results because one team built better features.
Imagine you are trying to guess how much a house costs. You would not just look at a picture and shrug. You would write down some facts: how many bedrooms it has, how big it is, what neighborhood it is in, how old it is. Each fact is an attribute. The computer does the same thing. It looks at lots of houses and their attributes, learns which facts matter, and then uses those facts to guess the price of a new house it has never seen.
Some attributes are numbers, like square footage. Some are words, like neighborhood names. The computer prefers numbers, so we usually translate the words into numbers first. Once everything is in a friendly shape, the computer can do its job.