# DataFrame

> Source: https://aiwiki.ai/wiki/dataframe
> Updated: 2026-06-25
> Categories: AI Tools & Products, Data Science, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns), in which each column can hold a different data type and arithmetic operations align on both row and column labels.[1][11] It is the central data structure of modern [data science](/wiki/data_science) and [machine learning](/wiki/machine_learning) data preparation: programmers use it to load, clean, transform, and analyze structured data the way they would a spreadsheet or a SQL table, but programmatically and reproducibly. The concept was introduced in the S statistical language in 1990 and popularized in [Python](/wiki/python) by the [pandas](/wiki/pandas) library, whose `DataFrame` is now the de facto standard for tabular data in Python.[1][2][3]

## What is a DataFrame?

A DataFrame is a tabular data structure in which information is organized into rows and columns. Each row represents a single observation, record, or instance, while each column represents a variable, attribute, or [feature](/wiki/feature). Every column in a DataFrame has a name (header) and a consistent data type, such as integers, floating-point numbers, strings, dates, or categorical values. Rows are typically identified by an index, which can be a simple integer sequence or a meaningful label like a date or ID.

The official pandas documentation describes the structure precisely: a DataFrame is "Two-dimensional, size-mutable, potentially heterogeneous tabular data" whose "Data structure also contains labeled axes (rows and columns)" and where "Arithmetic operations align on both row and column labels."[11] That alignment behavior, automatically matching values by their labels rather than by raw position, is the defining feature that separates a DataFrame from a plain matrix or [NumPy](/wiki/numpy) array.

The key properties of a DataFrame include:

- **Two-dimensional structure:** Data is arranged in rows and columns, similar to a matrix but with labeled axes.
- **Heterogeneous columns:** Different columns can hold different data types (e.g., one column of integers, another of strings).
- **Labeled axes:** Both rows (index) and columns have labels for easy identification and retrieval.
- **Size-mutable:** Rows and columns can be added or removed after creation.
- **Alignment:** Operations between DataFrames automatically align on both row and column labels.
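
The alignment, heterogeneity, and size-mutability properties listed above can be seen in a minimal sketch. The frame names, labels, and values here are invented for illustration.

```python
import pandas as pd

# Two small frames whose row labels overlap only partly and appear in a different order.
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10, 20, 30]}, index=["c", "b", "d"])

# Addition matches rows by label, not by position.
# Labels present in only one frame ("a" and "d") produce NaN.
total = left + right
print(total)

# Size-mutable and heterogeneous: add a string column next to the integer one,
# then remove a row by its label.
left["y"] = ["p", "q", "r"]
left = left.drop(index="a")
print(left)
```

Note that `left.loc["c"]` is added to `right.loc["c"]` even though the two rows sit at different positions, which is the behavior a plain NumPy array would not provide.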

## Where did the DataFrame come from?

The DataFrame did not originate in R, as is sometimes assumed; it was introduced in the S language, which R later reimplemented. The `data.frame` was first presented by John Chambers, Trevor Hastie, and Daryl Pregibon at the 1990 Computational Statistics conference (Compstat), where the authors wrote: "We have introduced into S a class of objects called data.frames, which can be used if convenient to organize all of the variables relevant to a particular analysis."[12] S itself was created earlier at Bell Laboratories by John Chambers and colleagues, with the interactive version described by Becker, Chambers, and Wilks in 1988.[12] Chambers and Hastie expanded the data frame design in the 1992 book *Statistical Models in S*.

R, the open-source successor to S, was created by Ross Ihaka and Robert Gentleman at the University of Auckland around 1991 and inherited the `data.frame` as a core built-in type.[6] The term "data frame" therefore predates pandas by nearly two decades.

The DataFrame entered the Python ecosystem through [pandas](/wiki/pandas). Wes McKinney began building pandas in 2008 while working as a quantitative analyst at the hedge fund AQR Capital Management, frustrated that Python lacked an intuitive way to handle spreadsheet-like data with rows and columns.[2] He convinced AQR to let him open-source the library in 2009, releasing pandas 0.1 to the Python Package Index that year.[2] The name pandas derives from "panel data," an econometrics term for multidimensional datasets, and doubles as a play on "Python data analysis."[2] In his 2010 paper introducing the library, McKinney described the pandas DataFrame as "basically a pythonic data.frame, but with automatic data alignment."[1]

| Year | Milestone | Source |
|---|---|---|
| 1988 | The S language (interactive version) described by Becker, Chambers, and Wilks at Bell Labs | [12] |
| 1990 | `data.frame` introduced into S by Chambers, Hastie, and Pregibon at Compstat | [12] |
| 1991 | R created by Ross Ihaka and Robert Gentleman, inheriting `data.frame` | [6] |
| 2008 | Wes McKinney begins pandas at AQR Capital Management | [2] |
| 2009 | pandas open-sourced (version 0.1 on PyPI) | [2] |
| 2010 | McKinney publishes "Data Structures for Statistical Computing in Python" | [1] |
| 2026 | pandas 3.0.3 released (copy-on-write default, Arrow-backed types) | [2] |

## How do you create a DataFrame?

In [pandas](/wiki/pandas), the most widely used DataFrame library, there are several common ways to create a DataFrame:

### From a Dictionary

The most straightforward approach is passing a Python dictionary to the `pd.DataFrame()` constructor. The dictionary keys become column names and the values become the column data.[9]

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Tokyo"]
}
df = pd.DataFrame(data)
```

### From a CSV File

The `read_csv()` function is one of the most commonly used methods for loading data into a DataFrame. It reads comma-separated values from a file or URL.[2]

```python
import pandas as pd
df = pd.read_csv("sales_data.csv")
```

### From a List of Dictionaries

Each dictionary in the list represents one row, and the keys become column names.

```python
import pandas as pd

records = [
    {"Name": "Alice", "Score": 92},
    {"Name": "Bob", "Score": 85},
]
df = pd.DataFrame(records)
```

### From a NumPy Array

DataFrames can be created from [NumPy](/wiki/numpy) arrays, with column names specified separately.[9]

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(array, columns=["A", "B", "C"])
```

## What can you do with a DataFrame?

DataFrame libraries provide a rich set of operations for data manipulation. The following table summarizes the most important operations available in pandas.[2]

| Operation | Description | Example Syntax |
|---|---|---|
| Selection | Retrieve specific columns or rows by label or position | `df["col"]`, `df.loc[0]`, `df.iloc[0:5]` |
| Filtering | Subset rows based on Boolean conditions | `df[df["Age"] > 25]` |
| Sorting | Order rows by one or more columns | `df.sort_values("Age")` |
| GroupBy | Split data into groups, apply a function, and combine results | `df.groupby("City")["Sales"].sum()` |
| Merge/Join | Combine two DataFrames on shared columns or indices | `pd.merge(df1, df2, on="ID")` |
| Pivot | Reshape data from long to wide format | `df.pivot_table(values="Sales", index="Region", columns="Year")` |
| Apply | Apply a custom function to rows or columns | `df["col"].apply(lambda x: x * 2)` |
| Aggregation | Compute summary statistics (mean, sum, count, etc.) | `df.describe()`, `df["col"].mean()` |
| Missing data | Detect, fill, or drop missing values | `df.dropna()`, `df.fillna(0)` |
| Concatenation | Stack DataFrames vertically or horizontally | `pd.concat([df1, df2])` |
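
As a rough illustration of several rows in the table (missing data, sorting, pivot, and concatenation), the sketch below works on a small invented `sales` frame. The column names and values are assumptions made for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Year": [2023, 2024, 2023, 2024],
    "Sales": [100.0, None, 80.0, 120.0],
})

# Missing data: replace the missing Sales value with 0.
filled = sales.fillna({"Sales": 0})

# Sorting: highest sales first.
ranked = filled.sort_values("Sales", ascending=False)

# Pivot: one row per Region, one column per Year (long to wide).
wide = filled.pivot_table(values="Sales", index="Region", columns="Year")

# Concatenation: stack another frame with the same columns underneath.
extra = pd.DataFrame({"Region": ["North"], "Year": [2024], "Sales": [60.0]})
combined = pd.concat([filled, extra], ignore_index=True)

print(wide)
print(combined)
```

Each call returns a new DataFrame rather than modifying `sales`, so the steps can be chained or inspected one at a time.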

### Selection and Indexing

Pandas offers multiple ways to select data. Label-based indexing with `.loc[]` selects rows and columns by their names, while integer-based indexing with `.iloc[]` uses numerical positions.[2] Boolean indexing allows filtering rows based on conditions. For example, `df.loc[df["Status"] == "Active", "Revenue"]` selects the Revenue column only for rows where Status equals "Active."[9]
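
The three selection styles can be compared side by side in a short sketch. The company labels and revenue figures are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Status": ["Active", "Inactive", "Active"],
        "Revenue": [1200, 300, 950],
    },
    index=["acme", "globex", "initech"],
)

# Label-based: the row labeled "globex", column "Revenue".
by_label = df.loc["globex", "Revenue"]

# Position-based: the first two rows, all columns.
by_position = df.iloc[0:2]

# Boolean mask combined with a column label, as in the text above.
active_revenue = df.loc[df["Status"] == "Active", "Revenue"]

print(by_label)
print(by_position)
print(active_revenue)
```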

### GroupBy

The `groupby()` method is one of the most powerful features in any DataFrame library. It implements the "split-apply-combine" pattern: the data is split into groups based on one or more keys, a function is applied to each group independently, and the results are combined into a single result.[1] The pattern was formalized by Hadley Wickham in his 2011 paper "The Split-Apply-Combine Strategy for Data Analysis," which describes breaking a large problem "into manageable pieces," operating on each piece independently, and then putting the pieces back together.[13] It is essential for computing aggregated statistics like totals, averages, or counts per category.[2]
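
To make the three steps concrete, the sketch below computes per-city totals twice: once with `groupby()` and once by spelling out split, apply, and combine by hand. The data is invented, and the manual version is only an illustration of the idea, not how pandas implements it internally.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "Tokyo", "London", "Tokyo", "Tokyo"],
    "Sales": [10, 20, 30, 40, 60],
})

# Built-in path: split by City, sum Sales, combine into a Series.
totals = df.groupby("City")["Sales"].sum()

# The same three steps written out by hand.
pieces = {city: group["Sales"] for city, group in df.groupby("City")}  # split
applied = {city: sales.sum() for city, sales in pieces.items()}        # apply
manual = pd.Series(applied, name="Sales")                              # combine

# Several aggregations at once.
summary = df.groupby("City")["Sales"].agg(["sum", "mean", "count"])

print(totals)
print(summary)
```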

### Merge and Join

DataFrames support SQL-style joins (inner, left, right, and outer) to combine datasets. The `merge()` function joins two DataFrames on one or more shared columns, while `join()` aligns on index values.[2] These operations are fundamental for combining data from multiple sources, such as linking a customer table with an orders table.
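
The customer and orders scenario can be sketched as follows. The two tables and their contents are hypothetical, chosen so that each join type returns a different number of rows.

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"ID": [1, 1, 3, 4], "Amount": [50, 25, 70, 10]})

# Inner join: only IDs present in both tables (1 and 3).
inner = pd.merge(customers, orders, on="ID", how="inner")

# Left join: every customer is kept; Bob has no orders, so his Amount is NaN.
left = pd.merge(customers, orders, on="ID", how="left")

# Outer join: all IDs from both sides, including order ID 4 with no customer.
outer = pd.merge(customers, orders, on="ID", how="outer")

# join() aligns on the index instead of a column.
order_totals = orders.groupby("ID")["Amount"].sum()
by_index = customers.set_index("ID").join(order_totals)

print(left)
print(by_index)
```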

## What data types can a DataFrame hold?

Every column in a DataFrame has an associated data type (dtype) that determines how values are stored in memory and what operations can be performed. Understanding data types is critical for both correctness and performance.

| Data Type | pandas dtype | Description | Typical Memory per Value |
|---|---|---|---|
| Integer | `int64`, `int32`, `int8` | Whole numbers | 1 to 8 bytes |
| Float | `float64`, `float32` | Decimal numbers | 4 to 8 bytes |
| Boolean | `bool` | True/False values | 1 byte |
| String/Object | `object`, `string` | Text data | Variable |
| DateTime | `datetime64` | Dates and timestamps | 8 bytes |
| Categorical | `category` | Fixed set of values (e.g., "Red", "Blue", "Green") | Depends on cardinality |
| Nullable Integer | `Int64`, `Int32` | Integers that support missing values (NA) | 1 to 8 bytes + 1-byte mask |

Choosing the right data type can drastically reduce memory usage. For instance, converting a column of repeated string values (like country names) to the `category` dtype can reduce memory consumption by 80% or more, since only the unique values are stored once and each row references them via small integer codes.[8]
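
The effect of the `category` dtype can be checked directly. The sketch below uses an invented column of repeated country names; the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# 30,000 rows but only three distinct country names.
countries = pd.Series(["Germany", "Japan", "Brazil"] * 10_000, name="Country")

as_text = countries.astype("object")
as_category = countries.astype("category")

# deep=True counts the bytes of the Python string objects, not just the pointers.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {text_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")

# The categorical column stores each name once plus one small integer code per row.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)
```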

## How is a DataFrame stored in memory?

DataFrames in pandas store data in contiguous blocks of memory organized by data type. For large [data sets](/wiki/data_set_or_dataset), memory optimization becomes important. Several strategies can help reduce the memory footprint:[8]

- **Downcast numeric types:** Using `int8` or `int16` instead of the default `int64` when the value range permits. An `int64` column uses 8 bytes per value, while `int8` uses only 1 byte.
- **Use categorical dtype:** For columns with a limited number of unique values (such as gender, country, or status codes), converting to `category` type stores each unique value once and uses integer codes for references.
- **Load only needed columns:** When reading from CSV or other file formats, specify the columns to load using the `usecols` parameter to avoid loading unnecessary data.
- **Process in chunks:** For files that exceed available RAM, `read_csv()` supports a `chunksize` parameter that reads and processes the file in smaller pieces.[8]
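
The four strategies above can be combined in one short sketch. An in-memory string stands in for a large file on disk, and the column names (`id`, `score`, `country`, `notes`) are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a large CSV file (hypothetical columns).
csv_text = "id,score,country,notes\n" + "".join(
    f"{i},{i % 100},{'US' if i % 2 else 'DE'},free text {i}\n" for i in range(1_000)
)

# Load only the needed columns and read country as a categorical from the start.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "score", "country"],
    dtype={"country": "category"},
)

# Downcast: pick the smallest integer dtype that can hold each column's values.
df["score"] = pd.to_numeric(df["score"], downcast="integer")
df["id"] = pd.to_numeric(df["id"], downcast="integer")
print(df.dtypes)

# Process in chunks: aggregate without holding the whole file in memory.
total_score = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["score"], chunksize=250):
    total_score += int(chunk["score"].sum())
print(total_score)
```

Chunked processing only helps when the computation can be expressed as a running aggregate, as with the sum here; operations that need all rows at once, such as a global sort, require a different approach.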

The `df.info(memory_usage='deep')` and `df.memory_usage(deep=True)` methods allow users to inspect exactly how much memory each column consumes. As a rough guide, in-memory pandas operations typically need around 5 to 10 times as much RAM as the size of the dataset on disk, which is one reason memory-efficient alternatives such as Polars (roughly 2 to 4 times) have gained traction for larger workloads.[10]

## How does a DataFrame differ from a spreadsheet or SQL table?

DataFrames share similarities with spreadsheets and SQL tables but serve different purposes and have different strengths.

| Feature | DataFrame (pandas) | Spreadsheet (Excel) | SQL Table |
|---|---|---|---|
| Interface | Code (Python scripts) | Graphical (point and click) | Query language (SQL) |
| Data size | Millions of rows (limited by RAM) | ~1 million rows (Excel limit) | Billions of rows (disk-based) |
| Reproducibility | Fully reproducible via scripts | Manual steps are hard to reproduce | Reproducible via saved queries |
| Data types | Strict per-column types | Loose (cell-level types) | Strict per-column types |
| Integration with ML | Direct (scikit-learn, TensorFlow) | Requires export/import | Requires extraction |
| Speed | Fast for in-memory data | Slow for large data | Optimized for large disk-based queries |
| Collaboration | Via version control (Git) | Via shared files or cloud | Via database access control |

DataFrames are particularly well suited for data science and [machine learning](/wiki/machine_learning) workflows because they integrate directly with Python's scientific computing ecosystem, including [NumPy](/wiki/numpy), scikit-learn, Matplotlib, and TensorFlow.

## How are DataFrames used in machine learning?

DataFrames play a central role in nearly every stage of a [machine learning](/wiki/machine_learning) pipeline:

1. **Data loading:** Raw data from CSV files, databases, or APIs is typically loaded into a DataFrame as the first step.
2. **Exploratory data analysis (EDA):** Data scientists use DataFrame operations like `describe()`, `value_counts()`, and `corr()` to understand the distribution and relationships within the data.
3. **Data cleaning:** Handling missing values (`dropna()`, `fillna()`), removing duplicates (`drop_duplicates()`), and fixing inconsistent data types are all performed on DataFrames.
4. **[Feature engineering](/wiki/feature_engineering):** Creating new [features](/wiki/feature) from existing columns, encoding categorical variables, normalizing numeric values, and generating interaction terms.
5. **Train/test splitting:** DataFrames are split into training and test sets, often using scikit-learn's `train_test_split()` function, which accepts DataFrames directly.
6. **Model input:** While many ML models require NumPy arrays internally, libraries like scikit-learn accept DataFrames as input and preserve column names for feature importance analysis.

This tight integration between DataFrames and ML libraries makes pandas the de facto standard for data preparation in Python-based machine learning projects.[1]
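
The pipeline stages above can be sketched with pandas alone. The dataset is invented, and the split uses `DataFrame.sample()` as a stand-in so the example has no dependency beyond pandas; in practice scikit-learn's `train_test_split()` is the more common choice.

```python
import pandas as pd

raw = pd.DataFrame({
    "age": [25, 32, None, 41, 25, 58],
    "city": ["Tokyo", "London", "London", "Tokyo", "Tokyo", "London"],
    "income": [40.0, 55.0, 48.0, 72.0, 40.0, 90.0],
    "bought": [0, 1, 0, 1, 0, 1],
})

# Data cleaning: drop exact duplicate rows, fill the missing age with the median.
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Feature engineering: one-hot encode city, min-max scale income.
features = pd.get_dummies(clean.drop(columns="bought"), columns=["city"])
income = features["income"]
features["income"] = (income - income.min()) / (income.max() - income.min())
target = clean["bought"]

# Train/test split with pandas only: sample 80% of rows, keep the rest for testing.
train_X = features.sample(frac=0.8, random_state=0)
test_X = features.drop(index=train_X.index)
train_y, test_y = target.loc[train_X.index], target.loc[test_X.index]

# Model input: a plain float array for estimators that do not take DataFrames.
X_array = train_X.to_numpy(dtype=float)
print(features.head())
```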

## Which DataFrame libraries exist beyond pandas?

While pandas remains the most widely used DataFrame library, several alternatives have emerged to address specific limitations around performance, scalability, and language support.

### pandas (Python)

[Pandas](/wiki/pandas) is the original and most popular DataFrame library in the Python ecosystem. First released in 2009, it provides an intuitive API for data manipulation, cleaning, and analysis.[1][2] Pandas operates in-memory on a single machine and is best suited for datasets that fit in RAM (typically up to a few million rows). As of 2026, pandas 3.0.3 is the current release (shipped May 11, 2026), featuring copy-on-write behavior by default and improved support for Apache Arrow-backed data types.[2]

### Polars (Python, Rust)

Polars is a high-performance DataFrame library written in Rust, with Python and Node.js bindings, created by Ritchie Vink to address pandas' performance limitations.[3] Polars can execute common operations 5 to 10 times faster than pandas (and by some benchmarks 10-30 times faster) by using multi-threaded execution, a columnar memory layout, and an Apache Arrow-based backend.[10] One of its key features is lazy evaluation: operations are not executed immediately but instead build a query plan that is optimized before execution, similar to how SQL databases work. Polars also supports streaming execution for datasets that exceed available memory, and typically uses 2 to 4 times the dataset size in RAM versus pandas' 5 to 10 times.[3][10]

### Apache Spark DataFrame (Python, Scala, Java, R)

Apache Spark provides a distributed DataFrame API designed for processing datasets that span terabytes or petabytes across clusters of machines. Spark DataFrames are immutable and lazily evaluated; transformations build a logical plan that is optimized by the Catalyst query optimizer before execution.[4] PySpark, the Python API for Spark, offers a pandas-like syntax while distributing computation across a cluster. Spark is the standard choice when data volumes exceed what a single machine can handle.

### Dask DataFrame (Python)

Dask extends the pandas API to larger-than-memory and distributed computing scenarios. A Dask DataFrame is composed of many smaller pandas DataFrames, partitioned along the index. Operations are lazy, building a task graph that is executed only when `.compute()` is called.[5] Dask is well suited for users who are already familiar with pandas and need to scale their workflows without rewriting their code. It can run on a single machine (utilizing all CPU cores) or across a distributed cluster.

### cuDF (Python, GPU)

cuDF is the GPU-accelerated DataFrame library in NVIDIA's RAPIDS suite, exposing a pandas-like API that runs on CUDA GPUs.[14] Its `cudf.pandas` accelerator mode, made generally available in RAPIDS 24.02, runs unmodified pandas code on the GPU and falls back to CPU pandas for operations it does not support. NVIDIA reports up to roughly 150 times faster processing on a 5 GB dataset after enabling the accelerator and importing pandas as usual.[14] cuDF is used where large tabular workloads benefit from GPU parallelism without rewriting existing pandas scripts.

### R data.frame and tibble (R)

R's built-in `data.frame` is one of the oldest DataFrame implementations, inherited from the S language and available since R's creation around 1991. It stores tabular data with named columns and supports heterogeneous column types.[6] The `tibble`, introduced by the tidyverse ecosystem, is a modernized version that provides cleaner printing, stricter subsetting behavior (no partial column name matching), and never converts strings to factors automatically.[7] R's `dplyr` package provides a grammar of data manipulation with verbs like `filter()`, `mutate()`, `select()`, and `summarise()` that operate on data frames and tibbles.

### Comparison of DataFrame Libraries

| Library | Language | Parallelism | Best For | Lazy Evaluation |
|---|---|---|---|---|
| pandas | Python | Single-threaded | Small to medium data (up to ~10 GB) | No |
| Polars | Python, Rust | Multi-threaded | Medium to large data on a single machine | Yes |
| Spark DataFrame | Python, Scala, Java, R | Distributed cluster | Very large data (terabytes and beyond) | Yes |
| Dask | Python | Multi-threaded or distributed | Scaling pandas workflows | Yes |
| cuDF | Python (GPU) | GPU-parallel | Accelerating pandas workloads on NVIDIA GPUs | No (eager) |
| R data.frame / tibble | R | Single-threaded (base R) | Statistical analysis and visualization | No |

## Example

The following example demonstrates a common DataFrame workflow in pandas: loading data, exploring it, cleaning it, and computing a summary.

```python
import pandas as pd

# Load data from CSV
df = pd.read_csv("students.csv")

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Fill missing grades with the column mean
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())

# Compute average grade by subject
avg_by_subject = df.groupby("Subject")["Grade"].mean()
print(avg_by_subject)

# Filter students with grades above 90
top_students = df[df["Grade"] > 90]
print(top_students)
```

A sample DataFrame of student grades looks like this:

| Name | Math | Science | English | History |
|---|---|---|---|---|
| Alice | 78 | 80 | 99 | 65 |
| Bob | 95 | 91 | 75 | 90 |
| Charlie | 68 | 77 | 75 | 84 |
| Emma | 94 | 79 | 88 | 96 |

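In this table, each row represents a student and each column represents a subject; the values are the grades each student earned. This is a wide layout. The code above instead assumes a long layout, with one row per student-subject pair and separate `Subject` and `Grade` columns; `DataFrame.melt()` converts a wide table into that form.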
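
The sample table can be reshaped so that the workflow code applies to it. The sketch below rebuilds the table by hand and uses `melt()` to produce `Subject` and `Grade` columns; the variable names are chosen for illustration.

```python
import pandas as pd

# The sample table: one row per student, one column per subject.
wide = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Emma"],
    "Math": [78, 95, 68, 94],
    "Science": [80, 91, 77, 79],
    "English": [99, 75, 75, 88],
    "History": [65, 90, 84, 96],
})

# Reshape to long format: one row per (student, subject) pair.
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Grade")

# The same summaries as in the workflow code.
avg_by_subject = long.groupby("Subject")["Grade"].mean()
top_grades = long[long["Grade"] > 90]

print(avg_by_subject)
print(top_grades)
```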

## Explain Like I'm 5 (ELI5)

A DataFrame is like a big table where you keep information organized in rows and columns, just like a chart on a wall.

Imagine you have a chart listing all your friends. Each friend gets their own row. The columns across the top say things like "Name," "Favorite Color," and "Number of Pets." So you can look at any row and see everything about one friend, or look down a column and see everyone's favorite color at once.

Computers use DataFrames to organize information the same way. When a computer is learning to recognize cats in photos, it might have a DataFrame where each row is a photo and the columns list things like "has pointy ears," "has whiskers," and "has a tail." By looking at all the rows together, the computer figures out what makes a cat a cat.

The nice thing about a DataFrame is that you can ask questions like "show me only the friends who have more than 2 pets" or "what is the average number of pets?" and get answers instantly. That is why data scientists and programmers use DataFrames every day.

## References

1. McKinney, W. (2010). "Data Structures for Statistical Computing in Python." *Proceedings of the 9th Python in Science Conference*, 56-61. https://proceedings.scipy.org/articles/Majora-92bf1922-00a
2. pandas development team. "pandas: powerful Python data analysis toolkit." pandas documentation. https://pandas.pydata.org/docs/
3. Ritchie Vink & Polars contributors. "Polars: Extremely fast DataFrames library." https://pola.rs/
4. Apache Software Foundation. "Spark SQL, DataFrames and Datasets Guide." Apache Spark documentation. https://spark.apache.org/docs/latest/sql-programming-guide.html
5. Dask Development Team. "Dask: Scalable analytics in Python." Dask documentation. https://docs.dask.org/
6. R Core Team. "R: A Language and Environment for Statistical Computing." https://www.r-project.org/
7. Wickham, H. et al. "tibble: Simple Data Frames." tidyverse. https://tibble.tidyverse.org/
8. pandas documentation. "Scaling to large datasets." https://pandas.pydata.org/docs/user_guide/scale.html
9. GeeksforGeeks. "Pandas Tutorial." https://www.geeksforgeeks.org/pandas/pandas-tutorial/
10. Databricks. "Polars vs Pandas." Databricks Blog. https://www.databricks.com/blog/polars-vs-pandas
11. pandas development team. "pandas.DataFrame." pandas API reference. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
12. "pandas (software)" and "Towards Scalable Dataframe Systems" (Petersohn et al., 2020), citing Chambers, Hastie & Pregibon, "Statistical Models in S" (Compstat 1990) and Becker, Chambers & Wilks, *The New S Language* (1988). https://arxiv.org/pdf/2001.00888
13. Wickham, H. (2011). "The Split-Apply-Combine Strategy for Data Analysis." *Journal of Statistical Software*, 40(1), 1-29. https://www.jstatsoft.org/article/view/v040i01
14. NVIDIA. "RAPIDS cuDF Accelerates pandas Nearly 150x with Zero Code Changes." NVIDIA Technical Blog. https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/