See also: Machine learning terms
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is one of the most fundamental concepts in modern data analysis, serving as the primary way programmers and data scientists organize, manipulate, and analyze structured information. DataFrames function much like a spreadsheet or SQL table, but they exist within a programming environment, which gives users the ability to perform sophisticated operations programmatically.
The DataFrame concept traces back to the S programming language and was carried into R, where the data.frame type has been a core data structure since the language's inception in the 1990s. The idea was later adopted and popularized in the Python ecosystem through the pandas library, which was first released in 2008 by Wes McKinney. Today, DataFrame implementations exist across multiple languages and frameworks, including Polars, Apache Spark, Dask, and Julia's DataFrames.jl.
A DataFrame is a tabular data structure in which information is organized into rows and columns. Each row represents a single observation, record, or instance, while each column represents a variable, attribute, or feature. Every column in a DataFrame has a name (header) and a consistent data type, such as integers, floating-point numbers, strings, dates, or categorical values. Rows are typically identified by an index, which can be a simple integer sequence or a meaningful label like a date or ID.
The key properties of a DataFrame include:

- Two-dimensional structure: data is organized into rows and columns.
- Labeled axes: columns have names and rows have index labels.
- Heterogeneous columns: different columns can hold different data types, while values within a single column share one type.
- Size mutability: rows and columns can be added or removed after creation.
In pandas, the most widely used DataFrame library, there are several common ways to create a DataFrame:
The most straightforward approach is passing a Python dictionary to the pd.DataFrame() constructor. The dictionary keys become column names and the values become the column data.
```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Tokyo"]
}
df = pd.DataFrame(data)
```
The read_csv() function is one of the most commonly used methods for loading data into a DataFrame. It reads comma-separated values from a file or URL.
```python
df = pd.read_csv("sales_data.csv")
```
A DataFrame can also be built from a list of dictionaries. Each dictionary in the list represents one row, and the keys become column names.
```python
records = [
    {"Name": "Alice", "Score": 92},
    {"Name": "Bob", "Score": 85},
]
df = pd.DataFrame(records)
```
DataFrames can be created from NumPy arrays, with column names specified separately.
```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(array, columns=["A", "B", "C"])
```
DataFrame libraries provide a rich set of operations for data manipulation. The following table summarizes the most important operations available in pandas.
| Operation | Description | Example Syntax |
|---|---|---|
| Selection | Retrieve specific columns or rows by label or position | df["col"], df.loc[0], df.iloc[0:5] |
| Filtering | Subset rows based on Boolean conditions | df[df["Age"] > 25] |
| Sorting | Order rows by one or more columns | df.sort_values("Age") |
| GroupBy | Split data into groups, apply a function, and combine results | df.groupby("City")["Sales"].sum() |
| Merge/Join | Combine two DataFrames on shared columns or indices | pd.merge(df1, df2, on="ID") |
| Pivot | Reshape data from long to wide format | df.pivot_table(values="Sales", index="Region", columns="Year") |
| Apply | Apply a custom function to rows or columns | df["col"].apply(lambda x: x * 2) |
| Aggregation | Compute summary statistics (mean, sum, count, etc.) | df.describe(), df["col"].mean() |
| Missing data | Detect, fill, or drop missing values | df.dropna(), df.fillna(0) |
| Concatenation | Stack DataFrames vertically or horizontally | pd.concat([df1, df2]) |
Pandas offers multiple ways to select data. Label-based indexing with .loc[] selects rows and columns by their names, while integer-based indexing with .iloc[] uses numerical positions. Boolean indexing allows filtering rows based on conditions. For example, df.loc[df["Status"] == "Active", "Revenue"] selects the Revenue column only for rows where Status equals "Active."
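The short sketch below shows the three selection styles side by side on a small example frame (the Status and Revenue columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Status": ["Active", "Inactive", "Active"],
    "Revenue": [1200, 340, 560],
})

by_label = df.loc[0, "Revenue"]      # label-based: row with index label 0, column "Revenue"
by_position = df.iloc[0:2, 1]        # position-based: first two rows, second column
active_revenue = df.loc[df["Status"] == "Active", "Revenue"]  # Boolean filter, then column
```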
The groupby() method is one of the most powerful features in any DataFrame library. It implements the "split-apply-combine" pattern: the data is split into groups based on one or more keys, a function is applied to each group independently, and the results are combined back into a DataFrame. This is essential for computing aggregated statistics like totals, averages, or counts per category.
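A minimal sketch of the split-apply-combine pattern (the City and Sales columns are assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "Tokyo", "London", "Tokyo"],
    "Sales": [100, 200, 150, 250],
})

# Split rows by City, apply sum() to each group's Sales, combine into one result
totals = df.groupby("City")["Sales"].sum()
print(totals)
# City
# London    250
# Tokyo     450
# Name: Sales, dtype: int64
```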
DataFrames support SQL-style joins (inner, left, right, and outer) to combine datasets. The merge() function joins two DataFrames on one or more shared columns, while join() aligns on index values. These operations are fundamental for combining data from multiple sources, such as linking a customer table with an orders table.
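A sketch of the customer/orders join mentioned above (the table contents are invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"ID": [1, 1, 3], "Amount": [50, 75, 20]})

# Inner join keeps only IDs present in both frames (Bob has no orders)
merged = pd.merge(customers, orders, on="ID", how="inner")

# Left join keeps every customer; Bob's Amount becomes NaN
all_customers = pd.merge(customers, orders, on="ID", how="left")
```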
Every column in a DataFrame has an associated data type (dtype) that determines how values are stored in memory and what operations can be performed. Understanding data types is critical for both correctness and performance.
| Data Type | pandas dtype | Description | Typical Memory per Value |
|---|---|---|---|
| Integer | int64, int32, int8 | Whole numbers | 1 to 8 bytes |
| Float | float64, float32 | Decimal numbers | 4 to 8 bytes |
| Boolean | bool | True/False values | 1 byte |
| String/Object | object, string | Text data | Variable |
| DateTime | datetime64 | Dates and timestamps | 8 bytes |
| Categorical | category | Fixed set of values (e.g., "Red", "Blue", "Green") | Depends on cardinality |
| Nullable Integer | Int64, Int32 | Integers that support missing values (NA) | 1 to 8 bytes + bitmap |
Choosing the right data type can drastically reduce memory usage. For instance, converting a column of repeated string values (like country names) to the category dtype can reduce memory consumption by 80% or more, since only the unique values are stored once and each row references them via small integer codes.
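A small sketch of that conversion (the column values are illustrative; actual savings depend on the data's cardinality):

```python
import pandas as pd

# A long column with only three unique values
countries = pd.Series(["USA", "Japan", "USA", "UK", "Japan"] * 200_000)

as_object = countries.memory_usage(deep=True)
as_category = countries.astype("category").memory_usage(deep=True)

print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```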
DataFrames in pandas store data in contiguous blocks of memory organized by data type. For large data sets, memory optimization becomes important. Several strategies can help reduce the memory footprint:
- Downcast numeric types: use int8 or int16 instead of the default int64 when the value range permits. An int64 column uses 8 bytes per value, while int8 uses only 1 byte.
- Convert repeated strings to categoricals: the category type stores each unique value once and uses integer codes for references.
- Load only the columns you need: read_csv() accepts a usecols parameter to avoid loading unnecessary data.
- Read large files in chunks: read_csv() supports a chunksize parameter that reads and processes the file in smaller pieces.

The df.info(memory_usage='deep') and df.memory_usage(deep=True) methods allow users to inspect exactly how much memory each column consumes. The sketch after this list combines several of these strategies.
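A minimal sketch, assuming a placeholder file name and illustrative column names:

```python
import pandas as pd

# Load only the needed columns, with compact dtypes chosen up front
df = pd.read_csv(
    "sales_data.csv",                # placeholder file name
    usecols=["Region", "Units"],
    dtype={"Region": "category", "Units": "int16"},
)

# For files too large for RAM, process in chunks instead
total = 0
for chunk in pd.read_csv("sales_data.csv", usecols=["Units"], chunksize=100_000):
    total += chunk["Units"].sum()

# Inspect per-column memory after loading
print(df.memory_usage(deep=True))
```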
DataFrames share similarities with spreadsheets and SQL tables but serve different purposes and have different strengths.
| Feature | DataFrame (pandas) | Spreadsheet (Excel) | SQL Table |
|---|---|---|---|
| Interface | Code (Python scripts) | Graphical (point and click) | Query language (SQL) |
| Data size | Millions of rows (limited by RAM) | ~1 million rows (Excel limit) | Billions of rows (disk-based) |
| Reproducibility | Fully reproducible via scripts | Manual steps are hard to reproduce | Reproducible via saved queries |
| Data types | Strict per-column types | Loose (cell-level types) | Strict per-column types |
| Integration with ML | Direct (scikit-learn, TensorFlow) | Requires export/import | Requires extraction |
| Speed | Fast for in-memory data | Slow for large data | Optimized for large disk-based queries |
| Collaboration | Via version control (Git) | Via shared files or cloud | Via database access control |
DataFrames are particularly well suited for data science and machine learning workflows because they integrate directly with Python's scientific computing ecosystem, including NumPy, scikit-learn, Matplotlib, and TensorFlow.
DataFrames play a central role in nearly every stage of a machine learning pipeline:
- Exploration: methods such as describe(), value_counts(), and corr() help users understand the distribution and relationships within the data.
- Cleaning: handling missing values (dropna(), fillna()), removing duplicates (drop_duplicates()), and fixing inconsistent data types are all performed on DataFrames.
- Splitting: scikit-learn provides the train_test_split() function, which accepts DataFrames directly.

This tight integration between DataFrames and ML libraries makes pandas the de facto standard for data preparation in Python-based machine learning projects. A minimal sketch of the handoff to scikit-learn follows.
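The example below assumes scikit-learn is installed; the column names and values are invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Income": [30_000, 45_000, 60_000, 52_000, 80_000, 75_000],
    "Churned": [0, 0, 1, 0, 1, 1],
})

X = df[["Age", "Income"]]   # feature columns remain a DataFrame
y = df["Churned"]           # target column is a Series

# train_test_split accepts DataFrames directly and preserves column labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```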
While pandas remains the most widely used DataFrame library, several alternatives have emerged to address specific limitations around performance, scalability, and language support.
Pandas is the original and most popular DataFrame library in the Python ecosystem. Released in 2008, it provides an intuitive API for data manipulation, cleaning, and analysis. Pandas operates in-memory on a single machine and is best suited for datasets that fit in RAM (typically up to a few million rows). As of 2026, pandas 3.0 is the current major release, featuring copy-on-write behavior by default and improved support for Apache Arrow-backed data types.
Polars is a high-performance DataFrame library written in Rust, with Python and Node.js bindings. It was designed from the ground up to address pandas' performance limitations. Polars can execute common operations 5 to 10 times faster than pandas by using multi-threaded execution, columnar memory layout, and an Apache Arrow-based backend. One of its key features is lazy evaluation: operations are not executed immediately but instead build a query plan that is optimized before execution, similar to how SQL databases work. Polars also supports streaming execution for datasets that exceed available memory.
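A hedged sketch of lazy evaluation in Polars (assumes the polars package; the file and column names are placeholders, and the group_by/agg names follow recent Polars releases):

```python
import polars as pl

# scan_csv builds a lazy query plan instead of reading the file immediately
plan = (
    pl.scan_csv("sales_data.csv")        # placeholder file name
    .filter(pl.col("Amount") > 100)
    .group_by("Region")
    .agg(pl.col("Amount").sum())
)

# Nothing has executed yet; collect() optimizes the plan and runs it
result = plan.collect()
```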
Apache Spark provides a distributed DataFrame API designed for processing datasets that span terabytes or petabytes across clusters of machines. Spark DataFrames are immutable and lazily evaluated; transformations build a logical plan that is optimized by the Catalyst query optimizer before execution. PySpark, the Python API for Spark, offers a pandas-like syntax while distributing computation across a cluster. Spark is the standard choice when data volumes exceed what a single machine can handle.
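A minimal PySpark sketch under the same hedges (assumes a local Spark installation; the file name is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Transformations are lazy; Spark builds a logical plan
df = spark.read.csv("sales_data.csv", header=True, inferSchema=True)
totals = df.groupBy("Region").agg(F.sum("Amount").alias("TotalSales"))

# show() is an action that triggers the optimized execution
totals.show()
```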
Dask extends the pandas API to larger-than-memory and distributed computing scenarios. A Dask DataFrame is composed of many smaller pandas DataFrames, partitioned along the index. Operations are lazy, building a task graph that is executed only when .compute() is called. Dask is well suited for users who are already familiar with pandas and need to scale their workflows without rewriting their code. It can run on a single machine (utilizing all CPU cores) or across a distributed cluster.
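A sketch of the pandas-like Dask API (assumes the dask package; the file name is a placeholder):

```python
import dask.dataframe as dd

# Each partition is a pandas DataFrame; reading is lazy
df = dd.read_csv("sales_data.csv")

# Builds a task graph; nothing executes yet
totals = df.groupby("Region")["Amount"].sum()

# compute() runs the graph across all cores or a cluster
result = totals.compute()
```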
R's built-in data.frame is one of the oldest DataFrame implementations, available since R's creation in the 1990s. It stores tabular data with named columns and supports heterogeneous column types. The tibble, introduced by the tidyverse ecosystem, is a modernized version that provides cleaner printing, stricter subsetting behavior (no partial column name matching), and never converts strings to factors automatically. R's dplyr package provides a grammar of data manipulation with verbs like filter(), mutate(), select(), and summarise() that operate on data frames and tibbles.
| Library | Language | Parallelism | Best For | Lazy Evaluation |
|---|---|---|---|---|
| pandas | Python | Single-threaded | Small to medium data (up to ~10 GB) | No |
| Polars | Python, Rust | Multi-threaded | Medium to large data on a single machine | Yes |
| Spark DataFrame | Python, Scala, Java, R | Distributed cluster | Very large data (terabytes and beyond) | Yes |
| Dask | Python | Multi-threaded or distributed | Scaling pandas workflows | Yes |
| R data.frame / tibble | R | Single-threaded (base R) | Statistical analysis and visualization | No |
The following example demonstrates a common DataFrame workflow in pandas: loading data, exploring it, cleaning it, and computing a summary.
```python
import pandas as pd

# Load data from CSV
df = pd.read_csv("students.csv")

# Inspect the first few rows
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Fill missing grades with the column mean
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())

# Compute average grade by subject
avg_by_subject = df.groupby("Subject")["Grade"].mean()
print(avg_by_subject)

# Filter students with grades above 90
top_students = df[df["Grade"] > 90]
print(top_students)
```
A sample DataFrame of student grades looks like this:
| Name | Math | Science | English | History |
|---|---|---|---|---|
| Alice | 78 | 80 | 99 | 65 |
| Bob | 95 | 91 | 75 | 90 |
| Charlie | 68 | 77 | 75 | 84 |
| Emma | 94 | 79 | 88 | 96 |
In this table, each row represents a student and each column represents a subject. The values are the grades that each student earned in each subject.
A DataFrame is like a big table where you keep information organized in rows and columns, just like a chart on a wall.
Imagine you have a chart listing all your friends. Each friend gets their own row. The columns across the top say things like "Name," "Favorite Color," and "Number of Pets." So you can look at any row and see everything about one friend, or look down a column and see everyone's favorite color at once.
Computers use DataFrames to organize information the same way. When a computer is learning to recognize cats in photos, it might have a DataFrame where each row is a photo and the columns list things like "has pointy ears," "has whiskers," and "has a tail." By looking at all the rows together, the computer figures out what makes a cat a cat.
The nice thing about a DataFrame is that you can ask questions like "show me only the friends who have more than 2 pets" or "what is the average number of pets?" and get answers instantly. That is why data scientists and programmers use DataFrames every day.