# Dask

> Source: https://aiwiki.ai/wiki/dask
> Updated: 2026-04-28
> Categories: AI Infrastructure, Data Science, Machine Learning, Programming Languages
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Dask** is an open-source Python library for parallel and distributed computing that scales the familiar APIs of libraries such as [NumPy](/wiki/numpy), [pandas](/wiki/pandas), and [scikit-learn](/wiki/scikit-learn) to process larger-than-memory datasets. Created by Matthew Rocklin in December 2014, Dask enables analysts, data scientists, and engineers to work with datasets that exceed the capacity of a single machine's RAM while staying within the Python ecosystem. The library is released under the BSD 3-Clause license and is a fiscally sponsored project of [NumFOCUS](/wiki/numfocus).

Dask is used by over 3,200 organizations worldwide, including Walmart, Capital One, Blue Yonder, NASA, and the United States Air Force. Its GitHub repository has accumulated over 13,000 stars and contributions from more than 500 developers.

## History and Origins

Dask traces its roots to the Blaze project at Continuum Analytics (later renamed [Anaconda](/wiki/anaconda_python_distribution)), which was funded through [DARPA](/wiki/darpa)'s XDATA program. The XDATA initiative was a $100 million DARPA research program aimed at developing computational techniques and software tools for processing large volumes of data. DARPA awarded approximately $3 million to Continuum Analytics to advance Python's data processing and visualization capabilities for large-scale data workloads. Within that effort, Matthew Rocklin began building Dask in December 2014 as a way to parallelize NumPy operations so that a single workstation could fully utilize all of its CPU cores.

Rocklin presented Dask at the SciPy 2015 conference, and the library quickly attracted interest from the scientific Python community. The first projects to adopt Dask were [Xarray](/wiki/xarray) (widely used in geosciences and climate research) and scikit-image (used in image processing). Recognizing demand for a lightweight parallelism solution that could also scale pandas DataFrames and [machine learning](/wiki/machine_learning) tools, the project expanded its scope beyond array computing.

Dask reached version 1.0 on November 29, 2018, signaling API stability and production readiness. In 2019, Dask became a fiscally sponsored project of NumFOCUS, which allows the project to accept tax-deductible donations and manage community funding. The project's fiscal oversight team includes Matthew Rocklin, James Bourbeau, Jim Crist, Tom Augspurger, and Stephan Hoyer.

Dask has also received support from NSF and NASA research grants over the years, reflecting its importance across scientific computing disciplines. As of early 2026, the latest release is version 2026.1.2, and the library continues to see active development with regular monthly releases.

## Core Concepts

### Task Graphs and Lazy Evaluation

At its foundation, Dask is a dynamic task scheduler that represents computations as directed acyclic graphs (DAGs). Each node in the graph represents a Python function call, and edges represent data dependencies between tasks. When a user writes operations on Dask collections, those operations are not executed immediately. Instead, Dask records them in a task graph, a process known as [lazy evaluation](/wiki/lazy_evaluation).

This lazy approach provides two major advantages. First, Dask can analyze the full computation graph before execution and optimize it by fusing adjacent tasks, eliminating unnecessary intermediate results, and reordering operations. Second, it allows users to build up complex, multi-step pipelines without consuming memory or CPU time until they explicitly trigger computation by calling `.compute()` or `.persist()`. The `.compute()` method returns the final result as an in-memory Python object, while `.persist()` keeps the result distributed across the cluster's workers for further use.

The Dask scheduler processes roughly 1 millisecond per task for scheduling overhead, which makes fine-grained task graphs practical even for workloads involving millions of small operations.

### Chunked and Partitioned Data

Dask handles large datasets by breaking them into smaller pieces. Arrays are split into chunks (sub-arrays), and DataFrames are split into partitions (smaller pandas DataFrames). Each chunk or partition fits comfortably in memory, and Dask processes them independently and in parallel. Common chunk sizes range from 10 MB to 1 GB, depending on available RAM and the nature of the computation.

This blocked algorithm approach means that Dask can process datasets that are many times larger than available memory, streaming chunks through computation and writing results back to disk or aggregating them as needed.

## Schedulers

Dask provides two families of schedulers, each suited to different deployment scenarios.

### Single-Machine Schedulers

| Scheduler | Description | Best For |
|---|---|---|
| **Threaded** (default) | Uses `concurrent.futures.ThreadPoolExecutor`. Overhead of roughly 50 microseconds per task. | Computations dominated by NumPy, pandas, or other C-extension code that releases the [GIL](/wiki/global_interpreter_lock). |
| **Multiprocessing** | Ships tasks to separate OS processes via `concurrent.futures.ProcessPoolExecutor`. | Pure Python code that does not release the GIL; linear workflows with minimal data transfer between tasks. |
| **Synchronous** | Runs all tasks sequentially in a single thread. | Debugging, profiling with IPython/Jupyter magic commands (`%debug`, `%pdb`, `%prun`). |

### Distributed Scheduler

The distributed scheduler (`dask.distributed`) is the most powerful option and works on both single machines and multi-node clusters. It follows a client-scheduler-worker architecture:

- **Client**: A local Python session that submits computations to the scheduler and collects results. Users interact with the Client object to submit work, either through the high-level collections API or directly via `client.submit()` and `client.map()` calls.
- **Scheduler**: A centralized, asynchronous, event-driven process that manages task assignment, tracks worker state, and handles fault tolerance. The scheduler maintains a continuously evolving DAG and moves tasks through states such as released, waiting, queued, processing, memory, and error. It is written in pure Python using asynchronous coroutines for high concurrency.
- **Workers**: Separate processes (often on different machines) that execute tasks and store intermediate results. Workers communicate with each other directly via TCP for bulk data transfer, reducing bottlenecks at the scheduler. Each worker manages its own memory and can spill data to disk when memory pressure is high.

Even on a single machine, the distributed scheduler is often preferred because it provides access to a real-time diagnostic dashboard (built on [Bokeh](/wiki/bokeh)) that shows task progress, memory usage, CPU utilization, and worker communication patterns. Users can configure schedulers globally, as context managers, or on a per-computation basis using `dask.config.set()`.

The distributed scheduler can be deployed on a variety of infrastructure platforms, including job queue systems (SLURM, PBS, SGE), Kubernetes, [Apache Hadoop](/wiki/apache_hadoop) YARN, and cloud platforms (AWS, GCP, Azure).

## Collections

Dask offers four primary high-level collections that mirror familiar Python data structures and provide parallel, out-of-core alternatives.

### Dask Array

Dask Array implements a large subset of the NumPy ndarray interface using blocked algorithms. A Dask Array is composed of many smaller NumPy arrays arranged in a grid of chunks. Operations on the Dask Array are translated into operations on individual chunks, which can then be executed in parallel.

Dask Array supports most NumPy operations, including slicing, broadcasting, aggregations (sum, mean, std), linear algebra routines, and element-wise arithmetic. It integrates tightly with Xarray for labeled multi-dimensional data and with libraries like [Zarr](/wiki/zarr) for chunked, compressed, on-disk storage.

Typical use cases include processing satellite imagery, running numerical simulations in climate science, and performing large-scale array computations in physics and astronomy.

### Dask DataFrame

Dask DataFrame reuses the pandas API and memory model, splitting a logical DataFrame into multiple pandas DataFrame partitions along the index. Each partition is a regular pandas DataFrame that fits in memory, and Dask coordinates operations across all partitions in parallel.

Dask DataFrame supports common pandas operations such as `groupby`, `join`, `merge`, filtering, column operations, `apply`, and aggregations. Because it builds on top of pandas rather than reimplementing the DataFrame from scratch, Dask DataFrame inherits the full richness of the pandas ecosystem, including compatibility with file formats like [Parquet](/wiki/apache_parquet), CSV, JSON, ORC, and HDF5.

Dask DataFrame is best suited for ETL (extract, transform, load) workloads, data cleaning pipelines, and exploratory data analysis on datasets ranging from a few gigabytes to several terabytes.

### Dask Bag

Dask Bag provides a parallel abstraction over Python lists and iterators, designed for semi-structured or unstructured data. It implements operations like `map`, `filter`, `fold`, `frequencies`, `groupby`, and `pluck` on collections of arbitrary Python objects.

Bags are commonly used for:

- Parsing and cleaning log files
- Processing collections of JSON records
- Ingesting raw text data before converting it to a more structured format

Bag operations that require shuffling data across workers (such as `groupby`) can be expensive due to inter-worker communication. The recommended workflow is to use Dask Bag for initial data cleaning and ingestion, then convert the results into a Dask DataFrame or Dask Array for more complex analytics.

### Dask Delayed

Dask Delayed provides a low-level interface for parallelizing arbitrary Python code that does not fit neatly into the Array, DataFrame, or Bag abstractions. By wrapping a function with the `@dask.delayed` decorator (or calling `dask.delayed(func)`), the function's execution is deferred, and the return value becomes a lazy placeholder that records the computation in the task graph.

```python
import dask

@dask.delayed
def load(filename):
    # Load data from file
    ...

@dask.delayed
def process(data):
    # Transform the data
    ...

@dask.delayed
def combine(results):
    # Merge processed results
    ...

files = ['data_01.csv', 'data_02.csv', 'data_03.csv']
loaded = [load(f) for f in files]
processed = [process(d) for d in loaded]
result = combine(processed)
result.compute()  # Triggers parallel execution
```

Dask Delayed is particularly useful for parallelizing legacy codebases, orchestrating complex multi-step pipelines, and building custom workflows where the structure of the computation varies dynamically. The `dask.graph_manipulation.bind` function can also be used to express dependencies based on side effects rather than direct data flow between functions.

### Futures

The Futures interface extends Python's `concurrent.futures` API and provides real-time, eager (non-lazy) task submission. Unlike Delayed, which builds a graph before executing, Futures begin execution immediately when submitted to the scheduler as soon as input data and worker capacity are available. This makes the Futures API suitable for real-time workloads, reactive pipelines, and situations where the computation evolves dynamically based on intermediate results.

Key operations include `client.submit()` for individual tasks and `client.map()` for applying a function across a sequence of inputs. The Futures interface also supports cancellation of in-progress tasks, asynchronous result gathering, and fire-and-forget execution patterns.

## Dask-ML

Dask-ML is a companion library that provides scalable machine learning in Python by integrating Dask with popular ML frameworks. It addresses two distinct scaling challenges: computationally expensive models (such as large hyperparameter searches) and datasets too large to fit in memory.

### Features

Dask-ML provides the following capabilities:

| Category | Algorithms and Tools |
|---|---|
| **Model Selection** | GridSearchCV, RandomizedSearchCV, HyperbandSearchCV, SuccessiveHalvingSearchCV, IncrementalSearchCV |
| **Preprocessing** | StandardScaler, RobustScaler, MinMaxScaler, OrdinalEncoder, LabelEncoder, DummyEncoder, CountVectorizer, HashingVectorizer |
| **Linear Models** | LinearRegression, LogisticRegression, PoissonRegression |
| **Clustering** | KMeans, SpectralClustering |
| **Naive Bayes** | GaussianNB |
| **Dimensionality Reduction** | PCA, IncrementalPCA, TruncatedSVD |
| **Ensemble Methods** | BlockwiseVotingClassifier, BlockwiseVotingRegressor |
| **Cross-Validation** | KFold, ShuffleSplit |

### Scikit-Learn Integration

One of Dask-ML's most powerful features is its integration with scikit-learn through the [joblib](/wiki/joblib) backend. Scikit-learn uses joblib internally for parallelism (the `n_jobs` parameter), and Dask provides an alternative joblib backend that distributes work across a Dask cluster instead of using local threads or processes.

```python
from dask.distributed import Client
import joblib

client = Client()  # Connect to Dask cluster

with joblib.parallel_backend('dask'):
    # Any scikit-learn estimator that uses n_jobs
    # will now run across the Dask cluster
    model.fit(X_train, y_train)
```

This approach requires minimal code changes and is especially effective for hyperparameter tuning and ensemble methods, where many independent model fits can run in parallel. It is most useful for training large models on medium-sized datasets, for example when searching over many hyperparameter combinations or when using ensemble methods with many individual estimators.

### Integration with Other Frameworks

Dask-ML also integrates with [XGBoost](/wiki/xgboost) for distributed gradient boosting, and it provides utilities for distributing training across [TensorFlow](/wiki/tensorflow) and [PyTorch](/wiki/pytorch) workloads. The `dask_ml.wrappers.ParallelPostFit` meta-estimator allows models trained on a single machine to generate predictions in parallel across a Dask cluster, which is useful when the training dataset is small but the prediction dataset is large.

## Comparison with Other Frameworks

Several Python-ecosystem tools address large-scale data processing and distributed computing. The following table summarizes how Dask compares to the most commonly discussed alternatives.

| Feature | Dask | Apache Spark | [Ray](/wiki/ray) | Vaex | [Polars](/wiki/polars) |
|---|---|---|---|---|---|
| **Primary Language** | Python | Scala/Java (PySpark for Python) | Python | Python (C++ backend) | Rust (Python bindings) |
| **License** | BSD 3-Clause | Apache 2.0 | Apache 2.0 | MIT | MIT |
| **Distributed Computing** | Yes (multi-node clusters) | Yes (multi-node clusters) | Yes (multi-node clusters) | No (single machine) | No (single machine, Polars Cloud in development) |
| **DataFrame API** | Mirrors pandas API | Custom API with SQL support | No built-in DataFrame (uses other libraries) | Custom API, similar to pandas | Custom API with method chaining |
| **Array Computing** | Full NumPy-compatible arrays | No native multi-dimensional array support | No native array support | No | No |
| **Lazy Evaluation** | Yes | Yes | No (eager by default) | Yes | Yes |
| **Query Optimizer** | Limited | Advanced Catalyst optimizer | N/A | Yes | Yes (advanced) |
| **ML Libraries** | Dask-ML, scikit-learn integration | MLlib | Ray Train, Ray Tune, RLlib | Limited | No native ML |
| **Streaming** | Futures interface (manual) | Structured Streaming (first-class) | Ray Serve for serving | No | No |
| **Best Scale** | Medium (1 GB to ~10 TB) | Large (10 GB to petabytes) | Medium to large (ML workloads) | Medium (single-machine, up to RAM + disk) | Medium (single-machine, very fast) |
| **Ecosystem** | Extends NumPy, pandas, scikit-learn | Self-contained ecosystem | General-purpose distributed runtime | Standalone | Standalone |
| **Overhead** | Lightweight (~1 ms/task scheduling) | Heavier (JVM startup, cluster management) | Moderate | Minimal | Minimal |

### When to Choose Each Tool

**Choose Dask when:**
- You already work in the Python/NumPy/pandas ecosystem and want to scale existing code with minimal changes.
- Your data is in the range of 1 GB to several terabytes.
- You need multi-dimensional array computing at scale (Dask Array has no equivalent in Spark, Ray, or Polars).
- You want a single framework that handles both data processing and ML workflows within Python.

**Choose Apache Spark when:**
- You work with petabyte-scale data or need a mature, battle-tested engine for production ETL.
- Your team prefers SQL or works in a JVM-centric environment.
- You need an all-in-one platform with built-in support for streaming, graph processing, and ML.

**Choose Ray when:**
- Your primary workload involves distributed model training, hyperparameter tuning, [reinforcement learning](/wiki/reinforcement_learning), or model serving.
- You need fine-grained control over [GPU](/wiki/gpu_computing) scheduling and heterogeneous compute resources.
- You are building custom distributed applications that go beyond data processing.

**Choose Vaex when:**
- Your data fits on a single machine's disk (even if not in RAM) and you need fast exploratory analysis.
- You want memory-mapped, out-of-core processing without the overhead of a distributed scheduler.

**Choose Polars when:**
- Your data fits on a single machine and you want the fastest possible DataFrame performance.
- You prefer a modern, expressive API with an advanced query optimizer.
- You do not need distributed computing across multiple nodes.

It is worth noting that these frameworks are not always mutually exclusive. Dask and Spark can coexist in the same data pipeline, reading and writing common formats such as Parquet, CSV, JSON, and ORC. Dask can also run on top of Ray as a scheduler backend using the dask-on-ray integration.

## Coiled

Coiled is the commercial managed platform for Dask, founded in 2020 by Matthew Rocklin. The company raised a $5 million seed round in September 2020 and a $21 million Series A in May 2021 led by Bessemer Venture Partners, with participation from FirstMark Capital and other investors. Total funding stands at approximately $26 million.

Coiled simplifies the deployment and management of Dask clusters in the cloud. Users can spin up Dask clusters on [AWS](/wiki/amazon_web_services), [Google Cloud](/wiki/google_cloud_terms), or [Azure](/wiki/azure_openai) with a few lines of Python code:

```python
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-1"
)
client = cluster.get_client()
```

Coiled handles infrastructure provisioning, software environment synchronization (ensuring workers have the same Python packages as the local environment), GPU support, network security, cost management, and automatic cluster scaling. The platform is used by organizations including NASA, Capital One, Anthem Health, and the U.S. Air Force.

While Dask itself remains fully open source and can be deployed independently on any infrastructure, Coiled provides a turnkey solution for teams that want managed Dask clusters without the operational complexity of self-hosting.

## Community and Governance

Dask is developed as a community-driven open-source project under the fiscal sponsorship of NumFOCUS. The project's governance is led by a small steering committee, and contributions come from individuals and organizations including Anaconda, Coiled, [NVIDIA](/wiki/nvidia), and many independent developers.

The Dask ecosystem extends beyond the core library to include several related projects:

- **dask-distributed**: The distributed scheduler and worker framework.
- **dask-ml**: Scalable machine learning.
- **dask-image**: Distributed image processing.
- **dask-gateway**: A multi-tenant server for managing Dask clusters.
- **dask-labextension**: Integration with JupyterLab for monitoring Dask clusters.
- **dask-kubernetes**: Deploy Dask on Kubernetes clusters.
- **dask-yarn**: Deploy Dask on Hadoop YARN clusters.
- **dask-jobqueue**: Deploy Dask on HPC job queue systems (SLURM, PBS, SGE, LSF).

The community communicates through GitHub issues and discussions, a Discourse forum, and regular community meetings. Dask also hosts an annual Dask Distributed Summit that brings together users and contributors from around the world.

## Limitations

While Dask is a versatile tool, it has some notable limitations:

- **No high-level query optimizer**: Unlike Spark's Catalyst optimizer or Polars' query planning, Dask's DataFrame operations are translated relatively directly into task graphs without extensive logical optimization. This can result in less efficient execution plans for complex queries.
- **Scheduler overhead at extreme scale**: While Dask performs well for medium-scale workloads (up to roughly 10 TB), workloads at the petabyte scale may encounter scheduler bottlenecks because the centralized scheduler must track every task individually.
- **GIL constraints for threaded scheduler**: The default threaded scheduler only provides true parallelism for code that releases Python's [Global Interpreter Lock](/wiki/global_interpreter_lock). Pure Python computation requires the multiprocessing or distributed scheduler.
- **Shuffle-heavy operations**: Operations that require redistributing data across workers (such as large joins or groupby operations on high-cardinality columns) can be slow due to the communication overhead.
- **Not a database**: Dask does not maintain persistent state, indexes, or provide ACID transactions. It is a computation framework, not a storage or query engine.

## See Also

- Apache Spark
- [Ray](/wiki/ray)
- [Polars](/wiki/polars)
- [Pandas](/wiki/pandas)
- [NumPy](/wiki/numpy)
- [Distributed Computing](/wiki/distributed_computing)
- [Parallel Computing](/wiki/parallel_computing)

## References

1. Rocklin, M. (2015). "Dask: Parallel Computation with Blocked algorithms and Task Scheduling." *Proceedings of the 14th Python in Science Conference (SciPy 2015)*, pp. 126-132.
2. Dask Development Team. "Dask: Scale the Python tools you love." Official documentation. https://docs.dask.org/
3. Dask GitHub Repository. https://github.com/dask/dask
4. Dask-ML Documentation. https://ml.dask.org/
5. Dask Distributed Documentation. https://distributed.dask.org/
6. "Dask joins NumFOCUS Sponsored Projects." NumFOCUS Blog, 2019. https://numfocus.org/blog/dask-joins-numfocus-sponsored-projects
7. "Coiled Cloud Launches at Dask Distributed Summit after securing $21M in series A funding led by Bessemer Venture Partners." PR Newswire, May 2021. https://www.prnewswire.com/news-releases/coiled-cloud-launches-at-dask-distributed-summit-after-securing-21m-in-series-a-funding-led-by-bessemer-venture-partners-301294178.html
8. "Dask Version 1.0." Dask Blog, November 29, 2018. https://blog.dask.org/2018/11/29/version-1.0
9. Dask Comparison to Spark. Official documentation. https://docs.dask.org/en/latest/spark.html
10. "DARPA Funds Python Big Data Effort." InformationWeek. https://www.informationweek.com/data-management/darpa-funds-python-big-data-effort
