See also: Machine learning terms, Data analysis, NumPy
Pandas is an open-source data analysis and manipulation library for the Python programming language. It provides high-performance, flexible data structures designed for working with structured (tabular, multidimensional, potentially heterogeneous) and time series data. Built on top of NumPy, pandas has become one of the most widely used tools in data science, machine learning, and quantitative analysis. The library is licensed under the three-clause BSD license and is implemented in Python, Cython, and C.
The name "pandas" is derived from "panel data," an econometrics term for multidimensional structured datasets, and also serves as wordplay on "Python data analysis." As of 2026, pandas is maintained by a large open-source community and is a fiscally sponsored project of NumFOCUS, a U.S. 501(c)(3) nonprofit charity. The latest stable release is version 3.0.2 (March 2026).
Wes McKinney began developing pandas in 2008 while working at AQR Capital Management, a quantitative investment management firm. McKinney, who graduated from the Massachusetts Institute of Technology with a B.S. in Mathematics in 2007, needed a high-performance, flexible tool to perform quantitative analysis on financial data. At the time, Python's data ecosystem was fragmented. NumPy provided excellent support for numerical computation but lacked features for handling structured data with column labels, mixed data types, and proper indexing.
McKinney's design drew inspiration from R's data.frame object, adapting its concepts for Python while leveraging NumPy's speed and Python's readability. His early work focused on two core data structures: the Series (a one-dimensional labeled array) and the DataFrame (a two-dimensional labeled table). He convinced AQR's management to allow the library to be open-sourced, and by mid-2009, pandas 0.1 was released on PyPI.
Chang She joined as the second major contributor in 2012, also coming from AQR. In 2015, pandas became a fiscally sponsored project of NumFOCUS, which helped ensure long-term financial support and governance. Over the years, the project has grown from a niche financial analysis tool into a foundational component of the Python data science ecosystem, used by data scientists at Google, Meta, JP Morgan, and virtually every major organization that performs data analysis.
| Version | Release Date | Key Changes |
|---|---|---|
| 0.1 | Mid-2009 | Initial public release with Series and DataFrame |
| 0.8 | June 2012 | Major performance improvements, new groupby engine |
| 0.20 | May 2017 | .ix indexer deprecated in favor of .loc and .iloc |
| 1.0 | January 2020 | First major release, new nullable integer and string types, convert_dtypes() |
| 2.0 | April 2023 | Optional Apache Arrow backend, opt-in Copy-on-Write, opt-in nullable dtypes via dtype_backend |
| 2.2 | January 2024 | PyArrow-backed string improvements, ADBC database drivers |
| 3.0 | January 2026 | Copy-on-Write enabled by default, str dtype inferred, pd.col() expressions, PyArrow required dependency |
Pandas provides three primary data structures, each serving a different purpose in data manipulation workflows.
A Series is a one-dimensional labeled array capable of holding any data type, including integers, floats, strings, Python objects, and more. It is similar to a column in a spreadsheet or a single column in a SQL table. Each element in a Series has an associated label called an index, which allows for fast lookups and alignment during operations.
```python
import pandas as pd

prices = pd.Series([29.99, 49.99, 19.99],
                   index=['product_a', 'product_b', 'product_c'])
print(prices['product_b'])  # Output: 49.99
```
The DataFrame is the most commonly used pandas data structure. It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as a spreadsheet, a SQL table, or a dictionary of Series objects sharing the same index. DataFrames can be created from dictionaries, lists of dictionaries, NumPy arrays, CSV files, SQL queries, and many other sources.
```python
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [30, 25, 35],
    'salary': [70000, 50000, 90000]
})
```
The Index is a metadata object that stores axis labels for Series and DataFrames. It enables fast lookups, alignment between datasets, and label-based slicing. Specialized index types include DatetimeIndex for time series data, MultiIndex for hierarchical indexing, and CategoricalIndex for categorical data.
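For illustration, the following sketch constructs a DatetimeIndex and a MultiIndex (the region/year labels are made up for the example):

```python
# DatetimeIndex: daily labels for time series data
dt_idx = pd.date_range('2025-01-01', periods=3, freq='D')

# MultiIndex: hierarchical labels over two levels
multi = pd.MultiIndex.from_product([['US', 'EU'], [2024, 2025]],
                                   names=['region', 'year'])
sales = pd.Series([10, 12, 9, 11], index=multi)
sales.loc['US']  # label-based slice over the first level
```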
Pandas offers a broad set of operations for data manipulation. The following table summarizes the most commonly used functions and methods.
| Operation | Method/Function | Description |
|---|---|---|
| Reading data | pd.read_csv(), pd.read_excel(), pd.read_sql(), pd.read_json(), pd.read_parquet() | Import data from various file formats and databases |
| Writing data | df.to_csv(), df.to_excel(), df.to_parquet(), df.to_sql() | Export DataFrames to various formats |
| Selection by label | df.loc[row_label, col_label] | Select rows and columns by their labels |
| Selection by position | df.iloc[row_pos, col_pos] | Select rows and columns by integer position |
| Filtering | df[df['column'] > value] | Boolean indexing to filter rows |
| Grouping | df.groupby('column').agg(func) | Split-apply-combine pattern for aggregation |
| Merging | pd.merge(df1, df2, on='key') | SQL-style joins between DataFrames |
| Concatenation | pd.concat([df1, df2]) | Stack DataFrames vertically or horizontally |
| Pivot tables | df.pivot_table(values, index, columns, aggfunc) | Reshape data and compute aggregations |
| Apply functions | df.apply(func), df['col'].map(func) | Apply custom functions to rows, columns, or elements |
| Sorting | df.sort_values('column'), df.sort_index() | Sort by column values or index |
| Missing data | df.fillna(), df.dropna(), df.isna() | Handle missing (NaN) values |
| String methods | df['col'].str.upper(), df['col'].str.contains() | Vectorized string operations |
| Reshaping | df.melt(), df.stack(), df.unstack() | Transform between wide and long formats |
The groupby() method is one of the most powerful features in pandas. It follows the split-apply-combine pattern: the data is split into groups based on one or more keys, a function is applied to each group independently, and the results are combined back into a single data structure.
```python
# Average salary by department
df.groupby('department')['salary'].mean()

# Multiple aggregations using named aggregation
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    headcount=('employee_id', 'count'),
    max_salary=('salary', 'max')
)
```
Pandas supports SQL-style joins through the merge() function and the join() method. These operations allow combining two DataFrames based on shared columns or indices, supporting inner, left, right, and outer join types.
```python
orders = pd.DataFrame({'order_id': [1, 2, 3], 'customer_id': [101, 102, 101]})
customers = pd.DataFrame({'customer_id': [101, 102, 103],
                          'name': ['Alice', 'Bob', 'Charlie']})
result = pd.merge(orders, customers, on='customer_id', how='left')
```
Pandas provides two primary indexing methods for selecting data. The loc accessor selects data by label (row and column names), while iloc selects by integer position. These replaced the older .ix indexer, which was deprecated in version 0.20 due to ambiguity between label-based and position-based indexing.
```python
# Select by label
df.loc[df['age'] > 30, ['name', 'salary']]

# Select by position
df.iloc[0:5, [0, 2]]  # First 5 rows, columns 0 and 2
```
Pandas provides a comprehensive set of tools for data cleaning, which is often the most time-consuming step in any data analysis or machine learning workflow.
Real-world datasets frequently contain missing values. Pandas uses NaN (Not a Number) for floating-point missing values, NaT (Not a Time) for missing datetimes, and pd.NA for the nullable extension dtypes. Key methods include isna() and notna() for detection, fillna() for imputation, and dropna() for removal.
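A minimal sketch of these methods, using a small made-up DataFrame:

```python
import numpy as np

items = pd.DataFrame({'price': [29.99, np.nan, 19.99],
                      'quantity': [2, 3, np.nan]})
print(items.isna().sum())                       # missing values per column
filled = items.fillna({'price': items['price'].mean(), 'quantity': 0})
cleaned = items.dropna(subset=['price'])        # drop rows with no price
```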
Optimizing data types is critical for both memory efficiency and correctness. The astype() method converts columns between types, while pd.to_datetime(), pd.to_numeric(), and pd.Categorical() provide specialized conversions. Pandas 3.0 introduced automatic inference of the new str dtype for string columns instead of the generic object dtype.
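For example, assuming raw string columns named date, amount, and status, a typical conversion pass might look like this (errors='coerce' turns unparseable values into NaT/NaN):

```python
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['status'] = df['status'].astype('category')  # low-cardinality strings
```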
The duplicated() and drop_duplicates() methods identify and remove duplicate rows. The value_counts() method provides frequency distributions, and describe() generates summary statistics for quick data validation.
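A quick validation pass might combine these (column names are illustrative):

```python
print(df.duplicated(subset=['order_id']).sum())   # count duplicate orders
df = df.drop_duplicates(subset=['order_id'], keep='first')
print(df['region'].value_counts())                # frequency distribution
print(df.describe())                              # summary statistics
```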
Pandas offers robust support for working with time series data, making it suitable for financial analysis, IoT data processing, and temporal machine learning tasks. Key capabilities include generating date ranges with pd.date_range(), resampling time series to different frequencies with resample(), handling time zones with tz_localize() and tz_convert(), computing rolling and expanding window statistics, and performing date arithmetic with Timedelta objects. Pandas 3.0 changed the default datetime resolution from nanoseconds to microseconds (or the resolution of the input), improving compatibility with other systems.
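A minimal sketch of these time series tools, using synthetic daily data:

```python
idx = pd.date_range('2025-01-01', periods=90, freq='D')  # DatetimeIndex
ts = pd.Series(range(90), index=idx)

monthly = ts.resample('ME').mean()      # downsample to month-end means
rolling = ts.rolling(window=7).mean()   # 7-day moving average
ts_utc = ts.tz_localize('UTC')          # attach a time zone
```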
Although pandas is not a machine learning library itself, it plays a central role in nearly every ML workflow. It serves as the primary tool for data loading, exploration, cleaning, and feature engineering before data is passed to modeling libraries like scikit-learn, XGBoost, or TensorFlow.
| Step | Pandas Role | Example Operations |
|---|---|---|
| Data loading | Import raw data | pd.read_csv(), pd.read_sql(), pd.read_parquet() |
| Exploration | Understand data shape and distribution | df.describe(), df.info(), df.value_counts() |
| Cleaning | Handle missing and inconsistent data | df.fillna(), df.dropna(), df.drop_duplicates() |
| Feature engineering | Create new features from existing data | df.apply(), df.groupby().transform(), pd.get_dummies() |
| Encoding | Convert categorical variables to numeric | pd.get_dummies(), df['col'].cat.codes |
| Splitting | Divide data into train and test sets | df.sample(), index slicing |
| Post-processing | Analyze model outputs | Merge predictions with original data, compute metrics |
Since scikit-learn version 1.2, pipelines can output pandas DataFrames directly using the set_output(transform='pandas') configuration. This improvement makes it easier to chain multiple ColumnTransformer operations while maintaining the DataFrame structure, column names, and data types throughout the pipeline. The sklearn-pandas package provides additional bridge utilities for integrating pandas with scikit-learn transformers and pipelines, helping prevent data leakage and ensuring consistent preprocessing across training and prediction.
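A minimal sketch of the set_output API, assuming scikit-learn 1.2 or later and the employee DataFrame from earlier:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform='pandas')
scaled = scaler.fit_transform(df[['age', 'salary']])
print(type(scaled))  # pandas DataFrame, column labels preserved
```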
Pandas and SQL serve overlapping but distinct purposes in data manipulation. Both operate on tabular data and support similar operations, but they differ in execution context, scalability, and flexibility.
| Aspect | Pandas | SQL |
|---|---|---|
| Execution | In-memory, client-side | Server-side, database engine |
| Data size | Limited by available RAM | Can handle datasets larger than memory |
| Language | Python API (method chaining) | Declarative query language |
| Flexibility | Arbitrary Python functions, complex transformations | Structured queries, limited to SQL functions |
| Use case | Exploratory analysis, prototyping, feature engineering | Production queries, reporting, data warehousing |
| Learning curve | Requires Python knowledge | Standalone language, widely taught |
| Ecosystem | Integrates with Python ML/visualization libraries | Integrates with databases and BI tools |
The logic behind pandas and SQL operations is similar. For example, pandas groupby() corresponds to SQL GROUP BY, merge() to JOIN, and boolean indexing to WHERE clauses. Many data practitioners use SQL to extract and pre-filter data from databases, then load the results into pandas for further analysis, cleaning, and modeling.
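For example, here is a typical SQL aggregation alongside one possible pandas equivalent (the employees table/DataFrame is hypothetical):

```python
# SQL:
#   SELECT department, AVG(salary) AS avg_salary
#   FROM employees
#   WHERE age > 30
#   GROUP BY department;

result = (
    employees[employees['age'] > 30]                  # WHERE
    .groupby('department', as_index=False)['salary']  # GROUP BY
    .mean()                                           # AVG
    .rename(columns={'salary': 'avg_salary'})         # AS
)
```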
Pandas operations can be significantly accelerated by following performance best practices. The core principle is to use vectorized operations instead of Python loops whenever possible.
Vectorized operations in pandas delegate computation to optimized C code through NumPy, avoiding the overhead of Python's interpreter loop. This approach can yield speedups of 100x to 1000x compared to row-by-row iteration with iterrows() or apply() with a Python function.
```python
# Slow: Python loop over rows
for i, row in df.iterrows():
    df.at[i, 'total'] = row['price'] * row['quantity']

# Fast: vectorized operation (100x+ faster)
df['total'] = df['price'] * df['quantity']
```
| Tip | Description | Impact |
|---|---|---|
| Use vectorized operations | Operate on entire columns instead of looping row by row | 100x to 1000x speedup |
| Optimize data types | Use int32 instead of int64, category instead of object for low-cardinality strings | 50% to 90% memory reduction |
| Use built-in methods | Prefer .agg(), .query(), .isin() over custom loops | Significant speedup |
| Read only needed columns | Pass usecols parameter to read_csv() | Reduced memory and load time |
| Use chunked reading | Process large files in chunks with the chunksize parameter (see the sketch below this table) | Enables processing files larger than RAM |
| Use Parquet format | Prefer .parquet over .csv for storage and retrieval | Faster I/O, smaller file size, type preservation |
| Leverage NumPy | Use .to_numpy() for computation-heavy operations | Eliminates pandas overhead |
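As a sketch of the chunked-reading and usecols tips above (the file name is hypothetical):

```python
total = 0.0
for chunk in pd.read_csv('large_file.csv', usecols=['revenue'],
                         chunksize=100_000):
    total += chunk['revenue'].sum()  # aggregate without loading the whole file
```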
Pandas requires the entire dataset to be loaded into RAM. Creator Wes McKinney has recommended having 5 to 10 times as much RAM as the size of the dataset to accommodate intermediate operations and copies. This memory requirement is one of the primary motivations behind alternative libraries and the Apache Arrow integration.
Pandas 2.0, released in April 2023, marked a significant architectural shift by introducing optional Apache Arrow as a backend for data storage. Apache Arrow provides a columnar, language-independent memory format that offers several advantages over the traditional NumPy-based backend.
The Arrow backend provides native support for more data types, including nested lists and structs, which historically were stored as inefficient NumPy object arrays. It offers better memory efficiency for string data, since Arrow stores strings contiguously in memory rather than as Python objects. It also provides improved interoperability with other tools in the data ecosystem that support the Arrow format, such as Apache Spark, DuckDB, and Polars.
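Opting into the Arrow backend is a per-call or per-object choice; a minimal sketch (the file name is illustrative, pandas 2.0+):

```python
# Arrow-backed dtypes when reading a file
df = pd.read_csv('sales_data.csv', dtype_backend='pyarrow')

# An explicitly Arrow-backed string Series
s = pd.Series(['a', 'b', None], dtype='string[pyarrow]')
```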
Pandas 2.0 introduced an opt-in Copy-on-Write (CoW) mode, which was made the default behavior in pandas 3.0. Under CoW, any indexing operation or method that returns a new DataFrame or Series behaves as if it were a copy. Actual data copying is deferred until a write operation occurs, reducing unnecessary memory allocation. This change also eliminates the confusing SettingWithCopyWarning that previously plagued pandas users, and chained assignment (e.g., df['a']['b'] = value) no longer works.
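Under CoW, the fix for chained assignment is to write through a single indexing step:

```python
# Chained assignment modifies a temporary object, not df, under Copy-on-Write
# df[df['age'] > 30]['salary'] = 0   # no effect in pandas 3.0

# Correct: one .loc call that selects and assigns in a single step
df.loc[df['age'] > 30, 'salary'] = 0
```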
Pandas 3.0, released in January 2026, made several changes that were previewed in the 2.x series. PyArrow became a required dependency. String columns are now inferred as the new str dtype instead of object. The release also introduced pd.col(), a new syntax for building column expressions that can be used in methods like DataFrame.assign(). Support for the Arrow PyCapsule Interface was added, including importing data via DataFrame.from_arrow() and Series.from_arrow().
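A sketch of the pd.col() expression style described above; the exact call signature is an assumption based on this description:

```python
# Hypothetical illustration of column expressions (pandas >= 3.0)
df = df.assign(order_total=pd.col('price') * pd.col('quantity'))
```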
As datasets have grown in size and performance requirements have increased, several alternative libraries have emerged to address pandas' limitations in scalability and speed.
| Library | Approach | Best For | API Compatibility |
|---|---|---|---|
| Polars | Rust-based, multithreaded, lazy evaluation | Fast single-machine analytics (any dataset size) | Different API, some pandas-like methods |
| Dask | Parallel computing, distributed DataFrames | Scaling pandas workflows to clusters | High (mirrors pandas API) |
| Modin | Drop-in pandas replacement using Ray or Dask | Speeding up existing pandas code with minimal changes | Very high (aims for 100% pandas coverage) |
| Vaex | Lazy, out-of-core, memory-mapped DataFrames | Exploring and visualizing very large datasets on a single machine | Partial pandas compatibility |
| RAPIDS cuDF | GPU-accelerated DataFrames | GPU-powered analytics | Partial pandas compatibility |
| DuckDB | In-process SQL OLAP database | Analytical queries on local data | SQL interface, pandas integration |
Polars has emerged as the most prominent pandas alternative. Written in Rust, it uses multithreaded execution and a query optimizer to run common operations 5 to 10 times faster than pandas. It is also more memory-efficient, needing roughly 2 to 4 times the dataset size in RAM versus the 5 to 10 times recommended for pandas. Benchmarks such as PDS-H (a derivative of TPC-H) show Polars performing on par with DuckDB and significantly outperforming pandas, Dask, and PySpark.
Dask extends the pandas API to larger-than-memory and distributed datasets by breaking DataFrames into partitions that are processed in parallel. It mirrors the pandas API closely, making migration straightforward. Dask is strongest when data needs to be processed across a cluster of machines, though for single-machine workloads, Polars typically offers better performance.
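A minimal Dask sketch showing the pandas-style API (the file glob is illustrative):

```python
import dask.dataframe as dd

ddf = dd.read_csv('sales-*.csv')    # lazy, partitioned DataFrame
result = ddf.groupby('region')['revenue'].mean().compute()  # triggers execution
```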
Modin provides a drop-in replacement for pandas that accelerates computation by distributing work across all available CPU cores using Ray or Dask as a backend. Users can switch from pandas to Modin by changing a single import statement (import modin.pandas as pd), with the goal of achieving 4x to 6x speedup without any other code changes.
Imagine you have a giant spreadsheet full of data, like a table of every student in a school with their names, grades, and test scores. Pandas is a tool that lets you work with that spreadsheet using Python code instead of clicking around in Excel.
With pandas, you can do things like: find all students who scored above 90, calculate the average grade for each class, combine two different spreadsheets together, and fix cells that are missing data. The main thing you work with is called a DataFrame, which is just a fancy word for a table with rows and columns.
Pandas is especially popular because it makes these tasks really fast and easy to write. Instead of writing a loop to check every single row one at a time, you can tell pandas "give me all rows where the score is above 90" in a single line of code. This is called vectorization, and it makes pandas much faster than doing things the manual way.
Practically every data scientist and machine learning engineer uses pandas as the first step when working with data. You load your data into a DataFrame, clean it up, explore it, and then pass it along to a machine learning model for training.
```python
import pandas as pd

# Load a CSV file
df = pd.read_csv('sales_data.csv')

# Quick overview
print(df.shape)       # (rows, columns)
print(df.dtypes)      # Data types per column
print(df.describe())  # Summary statistics
print(df.head(10))    # First 10 rows

# Fill missing values (plain assignment rather than chained inplace fillna,
# which no longer works under Copy-on-Write)
df['revenue'] = df['revenue'].fillna(0)

# Remove duplicates
df = df.drop_duplicates(subset=['order_id'])

# Convert data types
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')

# Create new features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['revenue_per_unit'] = df['revenue'] / df['quantity']

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['category', 'region'])

# Group-level features
df['avg_revenue_by_category'] = df.groupby('category')['revenue'].transform('mean')

# Create a pivot table
pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns='quarter',
    aggfunc='sum',
    fill_value=0
)
```