# Device

> Source: https://aiwiki.ai/wiki/device
> Updated: 2026-05-11
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Device in machine learning

The term "device" in machine learning refers to the hardware target on which [tensor](/wiki/tensor) operations are executed. In both [TensorFlow](/wiki/tensorflow) and [PyTorch](/wiki/pytorch), a device is a logical handle that points to a specific processor, whether that is a CPU, an [NVIDIA](/wiki/nvidia) GPU through [CUDA](/wiki/cuda), an Apple GPU through Metal Performance Shaders, a [Google](/wiki/google) [TPU](/wiki/tensor_processing_unit) through XLA, or an Intel GPU through XPU. Every tensor and every model parameter lives on exactly one device at a time, and the framework is responsible for moving data between devices when the user requests it. Choosing the right device, and managing how tensors travel between devices, is one of the most basic skills in [deep learning](/wiki/deep_learning) engineering.

Devices can range from basic personal computers to powerful, specialized processors designed specifically for machine learning tasks. This article covers the hardware categories that show up under the device abstraction, the device APIs in TensorFlow and PyTorch, and the practical concerns that come with multi device training.

## Hardware categories

### Central processing units (CPUs)

Central Processing Units (CPUs) are the primary processing units in most general purpose computers. They are versatile and capable of handling a wide range of tasks, including machine learning algorithms. While CPUs are not as fast or efficient as specialized hardware for large training runs, they are still widely used for small scale tasks, particularly during the development and testing phases of machine learning projects. Some advantages of using CPUs include their accessibility, compatibility with most programming languages, and relatively low cost. A CPU is also the only device available for many production inference scenarios where a GPU is impractical, such as serverless functions, mobile inference, and edge deployments.

### Graphics processing units (GPUs)

Graphics Processing Units (GPUs) were initially designed to handle graphics rendering, but they have since been repurposed for general computing tasks, including machine learning. GPUs are well suited to machine learning because of their massively parallel architecture, which lets them apply the same operation to large amounts of data simultaneously. This architecture is useful for training deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which require significant computational power. GPUs have become the default choice for practitioners and researchers because they are typically much faster than CPUs on the dense matrix arithmetic that dominates neural network training. Most modern deep learning research runs on NVIDIA GPUs through the CUDA toolkit, although AMD GPUs (via ROCm) and Intel GPUs (via XPU) are also supported in newer framework releases.

### Tensor processing units (TPUs)

Tensor Processing Units (TPUs) are specialized hardware accelerators designed specifically for machine learning tasks. Developed by Google, TPUs are optimized for the tensor operations that are common in deep learning algorithms. They can outperform both CPUs and GPUs on certain machine learning workloads, particularly in power efficiency and throughput. TPUs are commonly used in large scale machine learning applications, such as training models on massive datasets or deploying them in production environments. TensorFlow has native support for TPUs through the [TPUStrategy](/wiki/tpustrategy) API, and PyTorch reaches them through the PyTorch/XLA package, which bridges PyTorch tensors to Google's XLA compiler.

### Apple silicon and other accelerators

Apple silicon Macs expose their integrated GPU to PyTorch through the MPS (Metal Performance Shaders) backend, which Apple and Meta announced in May 2022 and which shipped in PyTorch 1.12. The MPS backend uses Metal kernels and the MPSGraph framework to run ATen operations on the Apple GPU, and it benefits from the unified memory architecture of M series chips: the CPU and GPU share a single pool of memory, so transfers between host and device are dramatically cheaper than on a discrete GPU. The MPS backend requires macOS 12.3 or later. Other accelerators that appear under the device abstraction include Intel GPUs (`xpu`), AMD GPUs (treated as `cuda` devices under ROCm), and the special `meta` device in PyTorch, which holds tensor shapes and dtypes without allocating any actual storage.
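
To make the `meta` device concrete, here is a minimal sketch that assumes only a recent PyTorch install. The tensor sizes are arbitrary illustrative values; no memory proportional to them is allocated, because a meta tensor records only shape and dtype.

```python
import torch

# A 10,000 x 10,000 float32 tensor would need about 400 MB of real storage.
# On the meta device only the metadata (shape, dtype) is recorded.
big = torch.empty(10_000, 10_000, device='meta')
print(big.shape, big.dtype, big.device)

# Operations on meta tensors propagate shapes without doing any arithmetic.
product = big @ big
print(product.shape, product.is_meta)

# Modules can be constructed on the meta device too, which is a cheap way
# to inspect parameter shapes before committing real memory.
layer = torch.nn.Linear(4096, 4096, device='meta')
num_params = sum(p.numel() for p in layer.parameters())
print(num_params)
```

Because no values exist, a meta tensor cannot be printed element by element or converted with `.item()`; it is useful for shape checking and for planning how a large model will be laid out.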

## Device strings and the torch.device class

In PyTorch, every tensor has a `.device` attribute that records where it lives. The `torch.device` class wraps a device type and an optional ordinal (the integer index of the specific accelerator). The class accepts several construction styles:

```python
import torch

torch.device('cpu')          # CPU
torch.device('cuda')         # current CUDA device
torch.device('cuda:0')       # first GPU
torch.device('cuda', 0)      # equivalent to cuda:0
torch.device('mps')          # Apple GPU
torch.device('xla')          # TPU or other XLA device
torch.device('xpu:0')        # first Intel GPU
torch.device('meta')         # symbolic, no storage
```
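
As a small illustration of the type and ordinal described above, the sketch below inspects a few `torch.device` objects. It assumes only that PyTorch is installed; constructing a device object does not touch the hardware, so the `cuda` lines should run even on a machine without a GPU.

```python
import torch

gpu1 = torch.device('cuda:1')      # only builds a handle; no GPU is touched
print(gpu1.type, gpu1.index)       # device type string and ordinal

cpu = torch.device('cpu')
print(cpu.type, cpu.index)         # a device without an ordinal has index None

# The string form and the (type, ordinal) form compare equal.
same = torch.device('cuda', 0) == torch.device('cuda:0')
print(same)

# Factory functions accept either a device object or a plain string.
x = torch.zeros(2, 3, device=cpu)
y = torch.ones(2, 3, device='cpu')
print(x.device == y.device)
```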

The device string is the standard way to specify hardware targets across the ecosystem. PyTorch also accepts the bare integer form (`torch.device(0)`), which uses the current accelerator type, although that raises a `RuntimeError` if no accelerator is detected. A common pattern is to pick a device once at the top of a script and use it everywhere:

```python
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
```

### Moving tensors with .to(device)

The `Tensor.to()` method is the standard way to move data between devices. It returns a tensor on the target device (the same tensor if it is already there, otherwise a copy) and never moves the original in place, so the result must be reassigned:

```python
x = torch.randn(3, 3)        # on CPU by default
x = x.to(device)             # now on the selected device
```

The same method works on `nn.Module` objects. Calling `model.to(device)` moves all parameters and buffers to the target device in place, which is one of the few cases where `.to()` mutates the receiver. For computation involving two or more tensors, every operand must live on the same device, otherwise PyTorch raises a `RuntimeError`. The legacy methods `tensor.cuda()` and `tensor.cpu()` still work as shortcuts, but `.to(device)` is preferred because it makes the script portable across hardware.
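
The difference between the tensor and module versions of `.to()` can be seen in a short sketch. It assumes a small `nn.Linear` model chosen purely for illustration and falls back to the CPU when no CUDA device is present, so the mismatch branch only runs on a GPU machine.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(4, 2)
returned = model.to(device)        # parameters move in place; same module object
print(returned is model)

x = torch.randn(8, 4)              # created on the CPU
x = x.to(device)                   # Tensor.to returns a tensor, so reassign it
y = model(x)                       # weights and input now share a device
print(y.device)

# On a GPU machine, feeding a CPU tensor to the GPU model fails.
if device.type == 'cuda':
    try:
        model(torch.randn(8, 4))
    except RuntimeError as err:
        print('device mismatch:', err)
```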

### Pinned memory and non-blocking transfers

Host-to-GPU transfers are faster when the source tensor sits in pinned (page locked) memory, because the CUDA driver can stream pinned memory directly to the device via DMA without an extra staging copy. PyTorch exposes this through the `pin_memory()` method on CPU tensors and the `pin_memory=True` argument on `DataLoader`. Combined with `non_blocking=True` in the `.to()` call, the CPU can prepare the next batch while the GPU is still consuming the previous one:

```python
batch = batch.pin_memory()
batch = batch.to(device, non_blocking=True)
```

Pinning memory is not free. Pinned pages cannot be paged out, so over-pinning can starve the operating system of usable RAM. PyTorch documentation warns that calling `pin_memory()` on a one-off pageable tensor usually costs more than it saves, because the synchronous pinning call is roughly as expensive as the transfer itself. Pinning is worth it for the data loader, where the cost is amortized across many batches.
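
The data loader case can be sketched as follows. The dataset here is random placeholder data, and pinning is enabled only when CUDA is available, since pinned memory has no benefit for CPU-only runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

# Placeholder data: 64 examples with 10 features and a binary label.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

# The loader pins each batch once, so the cost is spread over the whole epoch.
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

seen = 0
for features, labels in loader:
    # With a pinned source, non_blocking=True lets the copy overlap host work.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    seen += features.shape[0]

print(seen)
```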

## Device placement in TensorFlow

TensorFlow uses string identifiers that look slightly different from PyTorch's. The CPU is `/device:CPU:0` (or shortened to `/CPU:0`), the first GPU is `/GPU:0`, and fully qualified names like `/job:localhost/replica:0/task:0/device:GPU:1` show up in distributed contexts. The framework discovers devices automatically and assigns operations through a placement algorithm. If an operation has both a CPU and a GPU kernel, TensorFlow prefers the GPU; if an op only has a CPU kernel (such as `tf.cast` in earlier versions), it falls back to the CPU even when a GPU is available.
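
A quick way to see these identifiers on a given machine is to list the devices TensorFlow discovered and inspect where a tensor landed. This is a minimal sketch assuming TensorFlow 2.x in eager mode; the exact names printed depend on the hardware present.

```python
import tensorflow as tf

# Physical devices are what the runtime discovered at startup.
devices = tf.config.list_physical_devices()
for dev in devices:
    print(dev.name, dev.device_type)

# An eager tensor reports the fully qualified name of the device holding it.
x = tf.constant([1.0, 2.0, 3.0])
print(x.device)
```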

### tf.device context manager

Manual placement uses the `tf.device` context manager. Every op created inside the `with` block is pinned to that device:

```python
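import tensorflow as tf

# Pin the two constants to the CPU explicitly.
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# No device is requested here, so TensorFlow picks one: the GPU if available
# (copying the inputs over as needed), otherwise the CPU.
c = tf.matmul(a, b)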
```

For debugging, `tf.debugging.set_log_device_placement(True)` prints every op's device assignment, which is useful when a GPU is sitting idle and the user does not know why. If a script targets a device that does not exist on the current machine, TensorFlow raises a `RuntimeError`. Setting `tf.config.set_soft_device_placement(True)` tells the runtime to silently pick an existing device instead, which makes scripts portable between machines with different GPU counts.
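
The two settings above can be combined in a short sketch. It assumes TensorFlow 2.x in eager mode and deliberately requests `/GPU:0`; on a machine without a GPU, soft placement should let the ops run on the CPU instead.

```python
import tensorflow as tf

tf.config.set_soft_device_placement(True)    # fall back to an existing device
tf.debugging.set_log_device_placement(True)  # log each op's device assignment

with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    product = tf.matmul(a, a)

# The device attribute shows where the result actually ended up.
print(product.device)
print(product.numpy())
```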

### GPU memory management

By default, TensorFlow maps nearly all of the memory on every visible GPU at startup, which can be hostile to other processes on the same machine. Two settings control this. The first is memory growth, enabled per device with `tf.config.experimental.set_memory_growth(gpu, True)` or the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true`, which makes the runtime grow its allocation as needed. The second is a hard limit, set with `tf.config.set_logical_device_configuration` and a `memory_limit` value in megabytes. Hard limits also enable virtual GPUs, where a single physical GPU is split into several logical ones for testing multi GPU code on a single device.
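
Both settings can be sketched in a few lines. The `USE_HARD_LIMIT` toggle and the 1024 MB limits are illustrative assumptions, not recommended values; the two options are shown as alternatives for a given GPU, and either must be applied before the GPU is first used.

```python
import tensorflow as tf

USE_HARD_LIMIT = False   # hypothetical toggle for this sketch

gpus = tf.config.list_physical_devices('GPU')
try:
    if gpus and USE_HARD_LIMIT:
        # Split the first physical GPU into two 1 GB virtual GPUs.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
             tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    else:
        # Let each GPU allocation grow on demand instead of mapping it all.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as err:
    # Raised if the GPUs were already initialized before this point.
    print(err)

logical_gpus = tf.config.list_logical_devices('GPU')
print(len(gpus), 'physical GPU(s),', len(logical_gpus), 'logical GPU(s)')
```

On a machine without a GPU both counts are zero and the configuration calls are skipped.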

## Multi GPU and distributed training

Once a model outgrows one device, the framework needs a strategy for splitting the work. The two dominant patterns are data parallelism (every device holds a copy of the model and processes a different shard of the batch) and model parallelism (the model itself is split across devices).

In TensorFlow, `tf.distribute.MirroredStrategy` is the standard data parallel strategy for a single machine with multiple GPUs. It replicates the model across all visible GPUs and uses the NVIDIA Collective Communications Library (NCCL) for all reduce by default. A batch size of 64 with two GPUs means each device processes 32 examples per step; the value passed to the dataset is the global batch size, not the per replica size. For multi machine training, `MultiWorkerMirroredStrategy` and `TPUStrategy` extend the same pattern.
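
The relationship between global and per replica batch size can be sketched with `MirroredStrategy` directly. The per replica size of 32 and the zero-filled dataset are placeholders; on a machine without GPUs the strategy should fall back to a single CPU replica.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync

per_replica_batch = 32                              # illustrative value
global_batch = per_replica_batch * num_replicas     # what the dataset receives

# Placeholder data: 256 examples with 8 features each.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([256, 8]))
dataset = dataset.batch(global_batch)

# The strategy splits each global batch across the replicas.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

print('replicas:', num_replicas, 'global batch:', global_batch)
```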

In PyTorch, the recommended approach is `torch.nn.parallel.DistributedDataParallel` (DDP), which spawns one process per GPU and coordinates gradient synchronization through one of three backends: `nccl` for NVIDIA GPUs (recommended), `gloo` for CPU and cross platform support, or `mpi` for HPC clusters. Each DDP process owns exactly one CUDA device, identified by the local rank, and the user is expected to call `torch.cuda.set_device(local_rank)` before constructing the model. PyTorch's older `nn.DataParallel` (single process, multi threaded) still exists, but the PyTorch documentation recommends DDP instead, even on a single machine. The MPS backend does not support distributed training at present: `gloo` and `nccl` both refuse to bind to `mps`, so only single GPU training works on Apple silicon.
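
A minimal DDP training step might look like the sketch below. It is an illustration under stated assumptions, not a complete training script: the `pick_backend` and `main` helpers, the tiny `nn.Linear` model, and the default address and port are all hypothetical choices, and the environment variable defaults exist only so the script can also run as a single process.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def pick_backend() -> str:
    # nccl for NVIDIA GPUs, gloo as the CPU fallback.
    return "nccl" if torch.cuda.is_available() else "gloo"


def main() -> None:
    # torchrun sets these variables; the defaults allow a single-process run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    backend = pick_backend()
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if backend == "nccl":
        torch.cuda.set_device(local_rank)      # one CUDA device per process
        device = torch.device("cuda", local_rank)
        model = DDP(nn.Linear(10, 1).to(device), device_ids=[local_rank])
    else:
        device = torch.device("cpu")
        model = DDP(nn.Linear(10, 1))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss = model(torch.randn(16, 10, device=device)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()        # gradients are averaged across processes here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=2 script.py` (where `script.py` is a placeholder filename), each process would receive its own `RANK` and `LOCAL_RANK` and bind to a different GPU.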

## CUDA semantics and asynchronous execution

GPU operations in PyTorch are asynchronous by default. When the user calls a function that uses the GPU, the operation is enqueued onto the CUDA stream of the current device and returns immediately. Subsequent operations queue up behind it, and the host only waits when it needs a value that is not yet computed, for example when calling `tensor.item()` or copying a result back to the CPU. This is what makes Python overhead invisible in most training loops: the host can stay ahead of the GPU as long as the queue keeps filling.

The consequence is that profiling and debugging behave unexpectedly unless the user inserts explicit synchronization. Setting `CUDA_LAUNCH_BLOCKING=1` in the environment forces every kernel launch to block until completion, which makes error tracebacks point to the line that actually caused the failure rather than a later line where the asynchronous error finally surfaces. PyTorch also uses a caching allocator that holds GPU memory between operations; as a result, the memory usage that `nvidia-smi` reports for the process is usually higher than the value returned by `torch.cuda.memory_allocated()`, because cached blocks are still owned by the process even when no tensor occupies them (`torch.cuda.memory_reserved()` reports the cached total). The `torch.cuda.empty_cache()` call releases unused cached memory back to the driver but does not affect tensors that are still live.
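
The effect on timing can be sketched with explicit synchronization. The `timed_matmul` helper and the matrix size are illustrative assumptions; the synchronization and memory calls only run when a CUDA device is present, so the sketch also executes on a CPU-only machine.

```python
import time

import torch


def timed_matmul(device: torch.device, n: int = 512):
    """Time one n x n matrix multiply, waiting for the GPU if there is one."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize(device)   # finish setup work before timing
    start = time.perf_counter()
    c = a @ b                            # on CUDA this only enqueues the kernel
    if device.type == 'cuda':
        torch.cuda.synchronize(device)   # wait until the kernel has finished
    return c, time.perf_counter() - start


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
result, seconds = timed_matmul(device)
print(f'{seconds:.6f} s on {device}')

if device.type == 'cuda':
    # Bytes held by live tensors versus bytes held by the caching allocator.
    print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))
    torch.cuda.empty_cache()             # return unused cached blocks to the driver
```

Without the second `synchronize` call, the measured time on a GPU would mostly reflect how long it took to enqueue the kernel rather than how long the multiplication ran.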

## Explain Like I'm 5 (ELI5)

A device in machine learning is the piece of hardware that actually does the math when a computer is learning. There are different kinds of devices, like CPUs, GPUs, and TPUs. Each one has its own strengths. CPUs are like a Swiss Army knife: they can do many things but might not be the fastest at any one of them. GPUs are like a big team of workers who can all do the same job at the same time, which is great for the giant grids of numbers that neural networks use. TPUs are like a very specialized tool that is really good at one specific job, which can make them faster and more efficient for certain tasks. When you write a program in PyTorch or TensorFlow, you tell the program which device to use, and you can move your numbers (tensors) from one device to another, just like moving toys from one box to another box. If two tensors are in different boxes, they cannot play together until you put them in the same box.

## References

- [TensorFlow: Use a GPU](https://www.tensorflow.org/guide/gpu)
- [PyTorch: CUDA semantics](https://docs.pytorch.org/docs/2.11/notes/cuda.html)
- [PyTorch: MPS backend](https://docs.pytorch.org/docs/stable/notes/mps.html)
- [PyTorch: Tensor Attributes (torch.device)](https://docs.pytorch.org/docs/stable/tensor_attributes.html)
- [PyTorch: A guide on good usage of non_blocking and pin_memory()](https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html)
- [PyTorch blog: Introducing Accelerated PyTorch Training on Mac](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/)
- [PyTorch/XLA on GitHub](https://github.com/pytorch/xla)
- [TensorFlow: tf.distribute.MirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy)
- [PyTorch: Getting Started with DistributedDataParallel](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html)
- [Apple Developer: Accelerated PyTorch training on Mac](https://developer.apple.com/metal/pytorch/)

