DataFrame
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v5 · 3,391 words
See also: Machine learning terms
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns), in which each column can hold a different data type and arithmetic operations align on both row and column labels.[1][11] It is the central data structure of modern data science and machine learning data preparation: programmers use it to load, clean, transform, and analyze structured data the way they would a spreadsheet or a SQL table, but programmatically and reproducibly. The concept was introduced in the S statistical language in 1990 and popularized in Python by the pandas library, whose DataFrame is now the de facto standard for tabular data in Python.[1][2][3]
A DataFrame is a tabular data structure in which information is organized into rows and columns. Each row represents a single observation, record, or instance, while each column represents a variable, attribute, or feature. Every column in a DataFrame has a name (header) and a consistent data type, such as integers, floating-point numbers, strings, dates, or categorical values. Rows are typically identified by an index, which can be a simple integer sequence or a meaningful label like a date or ID.
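As a minimal sketch of these parts in pandas, the snippet below builds a small frame and inspects its index, column names, and per-column types. The sensor data and the labels `s1` and `s2` are invented for illustration.

```python
import pandas as pd

# Hypothetical readings: each row is one observation, each column one variable.
df = pd.DataFrame(
    {"temp_c": [21.5, 19.0], "city": ["Oslo", "Lima"], "sensor_ok": [True, False]},
    index=["s1", "s2"],  # meaningful row labels instead of 0, 1
)

print(df.shape)             # (2, 3)
print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column names (headers)
print(df.dtypes)            # one data type per column

# A single cell is addressed by row label and column name.
city_of_s2 = df.loc["s2", "city"]
```

Each column keeps one consistent type, while the frame as a whole mixes floats, text, and Booleans.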
The official pandas documentation describes the structure precisely: a DataFrame is "Two-dimensional, size-mutable, potentially heterogeneous tabular data" whose "Data structure also contains labeled axes (rows and columns)" and where "Arithmetic operations align on both row and column labels."[11] That alignment behavior, automatically matching values by their labels rather than by raw position, is the defining feature that separates a DataFrame from a plain matrix or NumPy array.
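Label alignment is easiest to see with two frames whose row labels overlap but appear in a different order. The following sketch uses invented values and contrasts pandas addition with positional NumPy addition.

```python
import pandas as pd

# Same column name, overlapping row labels in a different order.
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10, 20, 30]}, index=["c", "a", "d"])

# pandas matches rows by label: "a" pairs with "a", "c" with "c".
# Labels present on only one side ("b" and "d") produce missing values.
aligned = left + right
print(aligned)

# NumPy arrays have no labels, so the same addition is purely positional.
positional = left["x"].to_numpy() + right["x"].to_numpy()
print(positional)
```

The aligned result covers the union of the labels, so it has four rows even though each input has three.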
The key properties of a DataFrame include:
The DataFrame did not originate in R, as is sometimes assumed; it was introduced in the S language, which R later reimplemented. The data.frame was first presented by John Chambers, Trevor Hastie, and Daryl Pregibon at the 1990 Computational Statistics conference (Compstat), where the authors wrote: "We have introduced into S a class of objects called data.frames, which can be used if convenient to organize all of the variables relevant to a particular analysis."[12] S itself was created earlier at Bell Laboratories by John Chambers and colleagues, with the interactive version described by Becker, Chambers, and Wilks in 1988.[12] Chambers and Hastie expanded the data frame design in the 1992 book Statistical Models in S.
R, the open-source successor to S, was created by Ross Ihaka and Robert Gentleman at the University of Auckland around 1991 and inherited the data.frame as a core built-in type.[6] The term "data frame" therefore predates pandas by nearly two decades.
The DataFrame entered the Python ecosystem through pandas. Wes McKinney began building pandas in 2008 while working as a quantitative analyst at the hedge fund AQR Capital Management, frustrated that Python lacked an intuitive way to handle spreadsheet-like data with rows and columns.[2] He convinced AQR to let him open-source the library in 2009, releasing pandas 0.1 to the Python Package Index that year.[2] The name pandas derives from "panel data," an econometrics term for multidimensional datasets, and doubles as a play on "Python data analysis."[2] In his 2010 paper introducing the library, McKinney described the pandas DataFrame as "basically a pythonic data.frame, but with automatic data alignment."[1]
| Year | Milestone | Source |
|---|---|---|
| 1988 | The S language (interactive version) described by Becker, Chambers, and Wilks at Bell Labs | [12] |
| 1990 | data.frame introduced into S by Chambers, Hastie, and Pregibon at Compstat | [12] |
| 1991 | R created by Ross Ihaka and Robert Gentleman, inheriting data.frame | [6] |
| 2008 | Wes McKinney begins pandas at AQR Capital Management | [2] |
| 2009 | pandas open-sourced (version 0.1 on PyPI) | [2] |
| 2010 | McKinney publishes "Data Structures for Statistical Computing in Python" | [1] |
| 2026 | pandas 3.0.3 released (copy-on-write default, Arrow-backed types) | [2] |
In pandas, the most widely used DataFrame library, there are several common ways to create a DataFrame:
The most straightforward approach is passing a Python dictionary to the pd.DataFrame() constructor. The dictionary keys become column names and the values become the column data.[9]
```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Tokyo"],
}
df = pd.DataFrame(data)
```
The read_csv() function is one of the most commonly used methods for loading data into a DataFrame. It reads comma-separated values from a file or URL.[2]
```python
df = pd.read_csv("sales_data.csv")
```
A DataFrame can also be built from a list of dictionaries. Each dictionary in the list represents one row, and the keys become column names.
```python
records = [
    {"Name": "Alice", "Score": 92},
    {"Name": "Bob", "Score": 85},
]
df = pd.DataFrame(records)
```
DataFrames can be created from NumPy arrays, with column names specified separately.[9]
```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(array, columns=["A", "B", "C"])
```
DataFrame libraries provide a rich set of operations for data manipulation. The following table summarizes the most important operations available in pandas.[2]
| Operation | Description | Example Syntax |
|---|---|---|
| Selection | Retrieve specific columns or rows by label or position | df["col"], df.loc[0], df.iloc[0:5] |
| Filtering | Subset rows based on Boolean conditions | df[df["Age"] > 25] |
| Sorting | Order rows by one or more columns | df.sort_values("Age") |
| GroupBy | Split data into groups, apply a function, and combine results | df.groupby("City")["Sales"].sum() |
| Merge/Join | Combine two DataFrames on shared columns or indices | pd.merge(df1, df2, on="ID") |
| Pivot | Reshape data from long to wide format | df.pivot_table(values="Sales", index="Region", columns="Year") |
| Apply | Apply a custom function to rows or columns | df["col"].apply(lambda x: x * 2) |
| Aggregation | Compute summary statistics (mean, sum, count, etc.) | df.describe(), df["col"].mean() |
| Missing data | Detect, fill, or drop missing values | df.dropna(), df.fillna(0) |
| Concatenation | Stack DataFrames vertically or horizontally | pd.concat([df1, df2]) |
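Several rows of the table can be tried together on one small frame. The sketch below uses invented sales figures and shows filling a missing value, pivoting from long to wide, concatenating, and sorting.

```python
import pandas as pd

# Hypothetical sales data in long format, with one missing value.
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Year": [2023, 2024, 2023, 2024],
    "Sales": [100.0, 120.0, 90.0, None],
})

# Missing data: replace the missing Sales value with 0.
filled = sales.fillna({"Sales": 0})

# Pivot: one row per Region, one column per Year (mean is the default aggregate).
wide = filled.pivot_table(values="Sales", index="Region", columns="Year")

# Concatenation: stack another frame underneath and renumber the rows.
extra = pd.DataFrame({"Region": ["North"], "Year": [2024], "Sales": [75.0]})
combined = pd.concat([filled, extra], ignore_index=True)

# Sorting: highest sales first.
ranked = combined.sort_values("Sales", ascending=False)
print(wide)
print(ranked)
```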
Pandas offers multiple ways to select data. Label-based indexing with .loc[] selects rows and columns by their names, while integer-based indexing with .iloc[] uses numerical positions.[2] Boolean indexing allows filtering rows based on conditions. For example, df.loc[df["Status"] == "Active", "Revenue"] selects the Revenue column only for rows where Status equals "Active."[9]
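The three selection styles can be compared side by side. In this sketch the company names and revenue figures are made up, and the row labels are strings so that label-based and position-based access are clearly different.

```python
import pandas as pd

df = pd.DataFrame(
    {"Status": ["Active", "Inactive", "Active"], "Revenue": [100, 50, 75]},
    index=["acme", "globex", "initech"],  # hypothetical row labels
)

by_label = df.loc["globex", "Revenue"]  # row and column by name
by_position = df.iloc[0, 1]             # first row, second column

# Boolean indexing: Revenue only for rows where Status equals "Active".
active_revenue = df.loc[df["Status"] == "Active", "Revenue"]

# Label slices include the end label; positional slices exclude the end position.
label_slice = df.loc["acme":"globex"]   # two rows
position_slice = df.iloc[0:1]           # one row
```

The difference in slice endpoints between .loc[] and .iloc[] is a frequent source of off-by-one mistakes.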
The groupby() method is one of the most powerful features in any DataFrame library. It implements the "split-apply-combine" pattern: the data is split into groups based on one or more keys, a function is applied to each group independently, and the results are combined back into a DataFrame.[1] The pattern was formalized by Hadley Wickham in his 2011 paper "The Split-Apply-Combine Strategy for Data Analysis," which describes breaking a large problem "into manageable pieces, operate on each piece independently and then put all the pieces back together."[13] It is essential for computing aggregated statistics like totals, averages, or counts per category.[2]
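To make the three stages concrete, the sketch below computes the same per-city totals twice: once with groupby() directly, and once by spelling out split, apply, and combine by hand. The data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "Tokyo", "London", "Tokyo", "Tokyo"],
    "Sales": [10, 20, 30, 40, 60],
})

# The usual one-liner: groupby handles all three stages.
totals = df.groupby("City")["Sales"].sum()

# The same result written out step by step.
pieces = {}
for city, group in df.groupby("City"):   # split: one sub-frame per city
    pieces[city] = group["Sales"].sum()  # apply: a function on each piece
manual = pd.Series(pieces, name="Sales") # combine: gather results by key

# Several aggregates can be computed per group in one call.
stats = df.groupby("City")["Sales"].agg(["sum", "mean", "count"])
print(stats)
```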
DataFrames support SQL-style joins (inner, left, right, and outer) to combine datasets. The merge() function joins two DataFrames on one or more shared columns, while join() aligns on index values.[2] These operations are fundamental for combining data from multiple sources, such as linking a customer table with an orders table.
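The customer and orders scenario can be sketched as follows. The tables are invented: customer 2 has no orders, and order ID 4 has no matching customer, so the join types give different row counts.

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"ID": [1, 1, 3, 4], "Amount": [20.0, 35.0, 15.0, 50.0]})

# Inner join: only IDs present in both tables.
inner = pd.merge(customers, orders, on="ID", how="inner")

# Left join: every customer, with missing Amount where there is no order.
left = pd.merge(customers, orders, on="ID", how="left")

# Outer join: every ID from either table.
outer = pd.merge(customers, orders, on="ID", how="outer")

# join() aligns on the index rather than on a column.
by_index = customers.set_index("ID").join(orders.set_index("ID"), how="inner")

print(len(inner), len(left), len(outer))
```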
Every column in a DataFrame has an associated data type (dtype) that determines how values are stored in memory and what operations can be performed. Understanding data types is critical for both correctness and performance.
| Data Type | pandas dtype | Description | Typical Memory per Value |
|---|---|---|---|
| Integer | int64, int32, int8 | Whole numbers | 1 to 8 bytes |
| Float | float64, float32 | Decimal numbers | 4 to 8 bytes |
| Boolean | bool | True/False values | 1 byte |
| String/Object | object, string | Text data | Variable |
| DateTime | datetime64 | Dates and timestamps | 8 bytes |
| Categorical | category | Fixed set of values (e.g., "Red", "Blue", "Green") | Depends on cardinality |
| Nullable Integer | Int64, Int32 | Integers that support missing values (NA) | 1 to 8 bytes + bitmap |
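The last row of the table is easiest to understand by comparison. In this small sketch, the same values are stored once with the default inference and once with the nullable Int64 dtype.

```python
import pandas as pd

# With default inference, a missing value forces the column to float64.
plain = pd.Series([1, None, 3])

# The nullable Int64 dtype keeps integers and marks the gap as <NA>.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(plain.dtype)     # float64
print(nullable.dtype)  # Int64
print(nullable)
```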
Choosing the right data type can drastically reduce memory usage. For instance, converting a column of repeated string values (like country names) to the category dtype can reduce memory consumption by 80% or more, since only the unique values are stored once and each row references them via small integer codes.[8]
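The effect can be measured directly. The sketch below builds a column of repeated country names, converts it to category, and compares the reported memory. The exact savings depend on the pandas version, the string backend, and the data, so no specific percentage is assumed here.

```python
import pandas as pd

# 30,000 values drawn from only three distinct strings.
countries = pd.Series(["United States", "Germany", "Japan"] * 10_000)
as_category = countries.astype("category")

text_bytes = countries.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)

# Each row now holds a small integer code that points at one of the categories.
print(as_category.cat.codes.dtype)
print(list(as_category.cat.categories))
```

The conversion pays off when the number of distinct values is small relative to the number of rows; a column of mostly unique strings gains little.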
DataFrames in pandas store data in contiguous blocks of memory organized by data type. For large data sets, memory optimization becomes important. Several strategies can help reduce the memory footprint:[8]
- Downcast numeric columns: use int8 or int16 instead of the default int64 when the value range permits. An int64 column uses 8 bytes per value, while int8 uses only 1 byte.
- Use categorical types for repeated values: the category type stores each unique value once and uses integer codes for references.
- Load only the columns you need: use the usecols parameter to avoid loading unnecessary data.
- Process large files in chunks: read_csv() supports a chunksize parameter that reads and processes the file in smaller pieces.[8]

The df.info(memory_usage='deep') and df.memory_usage(deep=True) methods allow users to inspect exactly how much memory each column consumes. As a rough guide, in-memory pandas operations typically need around 5 to 10 times as much RAM as the size of the dataset on disk, which is one reason memory-efficient alternatives such as Polars (roughly 2 to 4 times) have gained traction for larger workloads.[10]
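These memory strategies can be combined in a single read_csv() call. The following sketch generates hypothetical CSV text in memory (the column names `id`, `age`, `country`, and `notes` are assumptions) so that it runs without an external file.

```python
import io
import pandas as pd

# Hypothetical CSV contents, kept in memory so the sketch is self-contained.
csv_text = "id,age,country,notes\n" + "".join(
    f"{i},{20 + i % 50},{'US' if i % 2 else 'DE'},free text {i}\n"
    for i in range(1000)
)

# Load only the needed columns and pick compact dtypes up front.
# Ages here range from 20 to 69, so they fit in int8.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "age", "country"],
    dtype={"age": "int8", "country": "category"},
)
print(df.memory_usage(deep=True))

# Process the file in chunks of 250 rows instead of loading it all at once.
total_age = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["age"], chunksize=250):
    total_age += int(chunk["age"].sum())
print(total_age)
```

Chunked processing suits aggregations that can be accumulated piece by piece, such as sums and counts.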
DataFrames share similarities with spreadsheets and SQL tables but serve different purposes and have different strengths.
| Feature | DataFrame (pandas) | Spreadsheet (Excel) | SQL Table |
|---|---|---|---|
| Interface | Code (Python scripts) | Graphical (point and click) | Query language (SQL) |
| Data size | Millions of rows (limited by RAM) | ~1 million rows (Excel limit) | Billions of rows (disk-based) |
| Reproducibility | Fully reproducible via scripts | Manual steps are hard to reproduce | Reproducible via saved queries |
| Data types | Strict per-column types | Loose (cell-level types) | Strict per-column types |
| Integration with ML | Direct (scikit-learn, TensorFlow) | Requires export/import | Requires extraction |
| Speed | Fast for in-memory data | Slow for large data | Optimized for large disk-based queries |
| Collaboration | Via version control (Git) | Via shared files or cloud | Via database access control |
DataFrames are particularly well suited for data science and machine learning workflows because they integrate directly with Python's scientific computing ecosystem, including NumPy, scikit-learn, Matplotlib, and TensorFlow.
DataFrames play a central role in nearly every stage of a machine learning pipeline:
- Exploratory analysis: methods such as describe(), value_counts(), and corr() help to understand the distribution and relationships within the data.
- Data cleaning: handling missing values (dropna(), fillna()), removing duplicates (drop_duplicates()), and fixing inconsistent data types are all performed on DataFrames.
- Train/test splitting: datasets are commonly divided with scikit-learn's train_test_split() function, which accepts DataFrames directly.

This tight integration between DataFrames and ML libraries makes pandas the de facto standard for data preparation in Python-based machine learning projects.[1]
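A compact sketch of these pipeline stages follows, using an invented housing table. To stay dependency-free it splits the data with DataFrame.sample() rather than scikit-learn's train_test_split(); that substitution is a simplification made for this example.

```python
import pandas as pd

# Hypothetical data with one missing value and one duplicated row.
raw = pd.DataFrame({
    "sqft": [700.0, 850.0, None, 1200.0, 850.0, 1500.0],
    "rooms": [2, 3, 3, 4, 3, 5],
    "price": [150, 200, 210, 320, 200, 400],
})

# Exploration: summary statistics and pairwise correlations.
summary = raw.describe()
correlations = raw.corr()

# Cleaning: drop the duplicated row, then fill the gap with the median.
clean = raw.drop_duplicates().copy()
clean["sqft"] = clean["sqft"].fillna(clean["sqft"].median())

# Splitting: 80% of rows for training, the remainder for testing.
# (Stand-in for train_test_split; random_state makes the split repeatable.)
train = clean.sample(frac=0.8, random_state=0)
test = clean.drop(train.index)

# Features and target stay as labeled pandas objects.
X_train, y_train = train[["sqft", "rooms"]], train["price"]
```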
While pandas remains the most widely used DataFrame library, several alternatives have emerged to address specific limitations around performance, scalability, and language support.
Pandas is the original and most popular DataFrame library in the Python ecosystem. First released in 2009, it provides an intuitive API for data manipulation, cleaning, and analysis.[1][2] Pandas operates in-memory on a single machine and is best suited for datasets that fit in RAM (typically up to a few million rows). As of 2026, pandas 3.0.3 is the current release (shipped May 11, 2026), featuring copy-on-write behavior by default and improved support for Apache Arrow-backed data types.[2]
Polars is a high-performance DataFrame library written in Rust, with Python and Node.js bindings, created by Ritchie Vink to address pandas' performance limitations.[3] Polars can execute common operations 5 to 10 times faster than pandas (and by some benchmarks 10-30 times faster) by using multi-threaded execution, a columnar memory layout, and an Apache Arrow-based backend.[10] One of its key features is lazy evaluation: operations are not executed immediately but instead build a query plan that is optimized before execution, similar to how SQL databases work. Polars also supports streaming execution for datasets that exceed available memory, and typically uses 2 to 4 times the dataset size in RAM versus pandas' 5 to 10 times.[3][10]
Apache Spark provides a distributed DataFrame API designed for processing datasets that span terabytes or petabytes across clusters of machines. Spark DataFrames are immutable and lazily evaluated; transformations build a logical plan that is optimized by the Catalyst query optimizer before execution.[4] PySpark, the Python API for Spark, offers a pandas-like syntax while distributing computation across a cluster. Spark is the standard choice when data volumes exceed what a single machine can handle.
Dask extends the pandas API to larger-than-memory and distributed computing scenarios. A Dask DataFrame is composed of many smaller pandas DataFrames, partitioned along the index. Operations are lazy, building a task graph that is executed only when .compute() is called.[5] Dask is well suited for users who are already familiar with pandas and need to scale their workflows without rewriting their code. It can run on a single machine (utilizing all CPU cores) or across a distributed cluster.
cuDF is the GPU-accelerated DataFrame library in NVIDIA's RAPIDS suite, exposing a pandas-like API that runs on CUDA GPUs.[14] Its cudf.pandas accelerator mode, made generally available in RAPIDS 24.02, runs unmodified pandas code on the GPU and falls back to CPU pandas when needed: NVIDIA reports up to roughly 150 times faster processing on a 5 GB dataset by enabling the accelerator and importing pandas as usual.[14] cuDF is used where large tabular workloads benefit from GPU parallelism without rewriting existing pandas scripts.
R's built-in data.frame is one of the oldest DataFrame implementations, inherited from the S language and available since R's creation around 1991. It stores tabular data with named columns and supports heterogeneous column types.[6] The tibble, introduced by the tidyverse ecosystem, is a modernized version that provides cleaner printing, stricter subsetting behavior (no partial column name matching), and never converts strings to factors automatically.[7] R's dplyr package provides a grammar of data manipulation with verbs like filter(), mutate(), select(), and summarise() that operate on data frames and tibbles.
| Library | Language | Parallelism | Best For | Lazy Evaluation |
|---|---|---|---|---|
| pandas | Python | Single-threaded | Small to medium data (up to ~10 GB) | No |
| Polars | Python, Rust | Multi-threaded | Medium to large data on a single machine | Yes |
| Spark DataFrame | Python, Scala, Java, R | Distributed cluster | Very large data (terabytes and beyond) | Yes |
| Dask | Python | Multi-threaded or distributed | Scaling pandas workflows | Yes |
| cuDF | Python (GPU) | GPU-parallel | Accelerating pandas workloads on NVIDIA GPUs | No (eager) |
| R data.frame / tibble | R | Single-threaded (base R) | Statistical analysis and visualization | No |
The following example demonstrates a common DataFrame workflow in pandas: loading data, exploring it, cleaning it, and computing a summary.
```python
import pandas as pd

# Load data from CSV
df = pd.read_csv("students.csv")

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Fill missing grades with the column mean
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())

# Compute average grade by subject
avg_by_subject = df.groupby("Subject")["Grade"].mean()
print(avg_by_subject)

# Filter students with grades above 90
top_students = df[df["Grade"] > 90]
print(top_students)
```
A sample DataFrame of student grades looks like this:
| Name | Math | Science | English | History |
|---|---|---|---|---|
| Alice | 78 | 80 | 99 | 65 |
| Bob | 95 | 91 | 75 | 90 |
| Charlie | 68 | 77 | 75 | 84 |
| Emma | 94 | 79 | 88 | 96 |
In this table, each row represents a student and each column represents a subject. The values are the grades that each student earned in each subject.
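The table above is in wide format, with one column per subject, while the workflow code earlier in this article groups by `Subject` and `Grade` columns, which implies a long format with one row per student and subject. As a sketch of how the two relate, melt() can reshape the sample table into that long form. The column names `Subject` and `Grade` are taken from the workflow code.

```python
import pandas as pd

# The sample table, in wide format.
wide = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Emma"],
    "Math": [78, 95, 68, 94],
    "Science": [80, 91, 77, 79],
    "English": [99, 75, 75, 88],
    "History": [65, 90, 84, 96],
})

# Reshape to long format: one row per (student, subject) pair.
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Grade")

# The same summaries as in the workflow example now apply directly.
avg_by_subject = long.groupby("Subject")["Grade"].mean()
top_grades = long[long["Grade"] > 90]

print(avg_by_subject)
print(top_grades)
```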
A DataFrame is like a big table where you keep information organized in rows and columns, just like a chart on a wall.
Imagine you have a chart listing all your friends. Each friend gets their own row. The columns across the top say things like "Name," "Favorite Color," and "Number of Pets." So you can look at any row and see everything about one friend, or look down a column and see everyone's favorite color at once.
Computers use DataFrames to organize information the same way. When a computer is learning to recognize cats in photos, it might have a DataFrame where each row is a photo and the columns list things like "has pointy ears," "has whiskers," and "has a tail." By looking at all the rows together, the computer figures out what makes a cat a cat.
The nice thing about a DataFrame is that you can ask questions like "show me only the friends who have more than 2 pets" or "what is the average number of pets?" and get answers instantly. That is why data scientists and programmers use DataFrames every day.