Device
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,113 words
See also: Machine learning terms
The term "device" in machine learning refers to the hardware target on which tensor operations are executed. In both TensorFlow and PyTorch, a device is a logical handle that points to a specific processor, whether that is a CPU, an NVIDIA GPU through CUDA, an Apple GPU through Metal Performance Shaders, a Google TPU through XLA, or an Intel GPU through XPU. Every tensor and every model parameter lives on exactly one device at a time, and the framework is responsible for moving data between devices when the user requests it. Choosing the right device, and managing how tensors travel between devices, is one of the most basic skills in deep learning engineering.
Devices range from the general-purpose CPU in a personal computer to powerful, specialized processors designed specifically for machine learning. This article covers the hardware categories that show up under the device abstraction, the device APIs in TensorFlow and PyTorch, and the practical concerns that come with multi-device training.
Central Processing Units (CPUs) are the general-purpose processors found in almost every computer. They are versatile and can run any machine learning algorithm. CPUs are slower and less efficient than specialized hardware for large training runs, but they remain widely used for small-scale work, particularly during the development and testing phases of machine learning projects. Their advantages are accessibility, compatibility with most programming languages, and relatively low cost. A CPU is also often the only device available in production inference scenarios where a GPU is impractical, such as serverless functions and many mobile and edge deployments.
Graphics Processing Units (GPUs) were originally designed for graphics rendering but are now used for general-purpose computing, including machine learning. Their massively parallel architecture lets them apply the same operation to large amounts of data simultaneously, which suits the dense matrix arithmetic behind deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). For these workloads GPUs are often much faster than CPUs, which has made them the default choice for practitioners and researchers. Most modern deep learning research runs on NVIDIA GPUs through the CUDA toolkit, although AMD GPUs (via ROCm) and Intel GPUs (via XPU) are also supported in newer framework releases.
Tensor Processing Units (TPUs) are specialized hardware accelerators designed specifically for machine learning tasks. Developed by Google, TPUs are optimized for the execution of tensor operations that are common in deep learning algorithms. These devices offer performance improvements over both CPUs and GPUs for certain machine learning workloads, particularly when it comes to power efficiency and processing speed. TPUs are commonly used in large scale machine learning applications, such as training models on massive datasets or deploying them in production environments. TensorFlow has native support for TPUs through the TPUStrategy API, and PyTorch reaches them through the PyTorch/XLA package, which bridges PyTorch tensors to Google's XLA compiler.
Apple silicon Macs expose their integrated GPU to PyTorch through the MPS (Metal Performance Shaders) backend, which Apple and Meta announced in May 2022 with PyTorch 1.12. The MPS backend uses Metal kernels and the MPSGraph framework to run ATen operations on the Apple GPU, and it benefits from the unified memory architecture of M series chips: the CPU and GPU share a single pool of memory, so transfers between host and device are dramatically cheaper than on a discrete GPU. The MPS backend requires macOS 12.3 or later. Other accelerators that appear under the device abstraction include Intel GPUs (xpu), AMD GPUs (treated as cuda devices under ROCm), and the special meta device in PyTorch, which holds tensor shapes and dtypes without allocating any actual storage.
In PyTorch, every tensor has a .device attribute that records where it lives. The torch.device class wraps a device type and an optional ordinal (the integer index of the specific accelerator). The class accepts several construction styles:
import torch
torch.device('cpu') # CPU
torch.device('cuda') # current CUDA device
torch.device('cuda:0') # first GPU
torch.device('cuda', 0) # equivalent to cuda:0
torch.device('mps') # Apple GPU
torch.device('xla') # TPU or other XLA device
torch.device('xpu:0') # first Intel GPU
torch.device('meta') # symbolic, no storage
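As a small illustration of the construction styles listed above, the following sketch checks that the string and the (type, ordinal) forms name the same device and reads back the two fields a torch.device carries. It assumes only that PyTorch is installed; constructing a device object does not require the hardware to be present.

```python
import torch

# Two spellings of the same device: string form and (type, ordinal) form.
d_string = torch.device("cuda:0")
d_pair = torch.device("cuda", 0)
same_device = d_string == d_pair

# A device without an explicit ordinal has index None.
cpu = torch.device("cpu")

print(d_string.type, d_string.index)  # device type and ordinal
print(cpu.type, cpu.index)
print(same_device)
```

The .type and .index attributes are convenient when a script needs to branch on the kind of hardware without parsing the device string by hand.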
The device string is the standard way to specify hardware targets across the ecosystem. PyTorch also accepts the bare integer form (torch.device(0)), which uses the current accelerator type, although that raises a RuntimeError if no accelerator is detected. A common pattern is to pick a device once at the top of a script and use it everywhere:
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
The Tensor.to() method is the standard way to move data between devices. It returns a new tensor on the target device and leaves the original tensor untouched, so it must be reassigned:
x = torch.randn(3, 3) # on CPU by default
x = x.to(device) # now on the selected device
The same method works on nn.Module objects. Calling model.to(device) moves all parameters and buffers to the target device in place, which is one of the few cases where .to() mutates the receiver. For computation involving two or more tensors, every operand must live on the same device, otherwise PyTorch raises a RuntimeError. The legacy methods tensor.cuda() and tensor.cpu() still work as shortcuts, but .to(device) is preferred because it makes the script portable across hardware.
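The difference between the two .to() behaviors can be seen in a short sketch. To keep it runnable on a machine with no accelerator, it moves data to the meta device (an assumption made purely for portability; any other device string would behave the same way with respect to what is returned).

```python
import torch
from torch import nn

# Tensor.to() returns a new tensor; the original stays where it was.
x = torch.randn(3, 3)
y = x.to("meta")
print(x.device, y.device)

# Module.to() moves parameters and buffers and returns the same module object.
model = nn.Linear(3, 2)
returned = model.to("meta")
param_device = next(model.parameters()).device
print(returned is model, param_device)
```

Because Module.to() returns the module itself, both `model.to(device)` and `model = model.to(device)` leave the script in the same state, whereas dropping the reassignment for a tensor silently keeps the data on the old device.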
Host-to-GPU transfers are faster when the source tensor sits in pinned (page locked) memory, because the CUDA driver can stream pinned memory directly to the device via DMA without an extra staging copy. PyTorch exposes this through the pin_memory() method on CPU tensors and the pin_memory=True argument on DataLoader. Combined with non_blocking=True in the .to() call, the CPU can prepare the next batch while the GPU is still consuming the previous one:
batch = batch.pin_memory()
batch = batch.to(device, non_blocking=True)
Pinning memory is not free. Pinned pages cannot be paged out, so over-pinning can starve the operating system of usable RAM. PyTorch documentation warns that calling pin_memory() on a one-off pageable tensor usually costs more than it saves, because the synchronous pinning call is roughly as expensive as the transfer itself. Pinning is worth it for the data loader, where the cost is amortized across many batches.
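A minimal sketch of the data loader case, assuming a synthetic in-memory dataset (the tensor shapes and batch size are illustrative, not taken from any particular project). Pinning is enabled only when CUDA is present, so the same loop also runs on a CPU-only machine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical toy dataset: 64 samples with 8 features and a binary label.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

# pin_memory=True makes the loader return batches in page-locked host memory.
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

seen = 0
for features, labels in loader:
    # non_blocking=True only overlaps with compute when the source is pinned.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    seen += features.shape[0]

print(seen)
```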
TensorFlow uses string identifiers that look slightly different from PyTorch's. The CPU is /device:CPU:0 (or shortened to /CPU:0), the first GPU is /GPU:0, and fully qualified names like /job:localhost/replica:0/task:0/device:GPU:1 show up in distributed contexts. The framework discovers devices automatically and assigns operations through a placement algorithm. If an operation has both a CPU and a GPU kernel, TensorFlow prefers the GPU; if an op only has a CPU kernel (such as tf.cast in earlier versions), it falls back to the CPU even when a GPU is available.
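To see which devices TensorFlow has discovered on a given machine, a sketch along these lines can help. It assumes TensorFlow 2.x is installed and uses tf.config.list_physical_devices; the printed names depend entirely on the local hardware.

```python
import tensorflow as tf

# Every physical device TensorFlow can see, regardless of type.
for dev in tf.config.list_physical_devices():
    print(dev.name, dev.device_type)

# Filter by type; the GPU list is empty on a CPU-only machine.
cpus = tf.config.list_physical_devices("CPU")
gpus = tf.config.list_physical_devices("GPU")
print(len(cpus), len(gpus))
```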
Manual placement uses the tf.device context manager. Every op created inside the with block is pinned to that device:
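import tensorflow as tf

# Ops created inside the block are pinned to the CPU.
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)  # outside the block: runs on the GPU if one is available, copying inputs as needed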
For debugging, tf.debugging.set_log_device_placement(True) prints every op's device assignment, which is useful when a GPU is sitting idle and the user does not know why. If a script targets a device that does not exist on the current machine, TensorFlow raises a RuntimeError. Setting tf.config.set_soft_device_placement(True) tells the runtime to silently pick an existing device instead, which makes scripts portable between machines with different GPU counts.
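Placement can also be checked programmatically rather than through the log. The sketch below, which assumes eager execution in TensorFlow 2.x, enables soft placement, pins a small computation to the CPU, and reads the .device attribute of the resulting tensors.

```python
import tensorflow as tf

# Let the runtime fall back to an existing device if a requested one is missing.
tf.config.set_soft_device_placement(True)

with tf.device("/CPU:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    product = tf.matmul(a, a)

# Eager tensors report the fully qualified name of the device that holds them.
print(a.device)
print(product.device)
```

The printed names use the fully qualified form, so a check such as `product.device.endswith("CPU:0")` is more robust than comparing against the short string.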
By default, TensorFlow maps nearly all of the memory on every GPU visible to the process at startup, which can be hostile to other processes on the same machine. Two settings control this. The first is memory growth, enabled per device with tf.config.experimental.set_memory_growth(gpu, True) or the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true, which makes the runtime start with a small allocation and grow it as needed. The second is a hard limit, set with tf.config.set_logical_device_configuration and a memory_limit value in megabytes. Hard limits also enable virtual GPUs, where a single physical GPU is split into several logical ones for testing multi-GPU code on a single device.
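The two memory settings might be wrapped as follows. This is a sketch under stated assumptions: the helper name configure_gpu_memory and its memory_limit_mb parameter are hypothetical, and the call has to happen before TensorFlow initializes the GPUs, otherwise the runtime raises an error. On a machine with no GPU the loop simply does nothing.

```python
import tensorflow as tf


def configure_gpu_memory(memory_limit_mb=None):
    """Apply memory growth, or a hard limit in MB, to every visible GPU."""
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        if memory_limit_mb is None:
            # Start small and grow the allocation on demand.
            tf.config.experimental.set_memory_growth(gpu, True)
        else:
            # Expose one logical GPU capped at memory_limit_mb megabytes.
            tf.config.set_logical_device_configuration(
                gpu,
                [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)],
            )
    return len(gpus)


num_gpus = configure_gpu_memory()
print(num_gpus)
```

Passing a list with several LogicalDeviceConfiguration entries instead of one is how a single physical GPU would be split into virtual GPUs.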
Once a model outgrows one device, the framework needs a strategy for splitting the work. The two dominant patterns are data parallelism (every device holds a copy of the model and processes a different shard of the batch) and model parallelism (the model itself is split across devices).
In TensorFlow, tf.distribute.MirroredStrategy is the standard data parallel strategy for a single machine with multiple GPUs. It replicates the model across all visible GPUs and uses the NVIDIA Collective Communications Library (NCCL) for all reduce by default. A batch size of 64 with two GPUs means each device processes 32 examples per step; the value passed to the dataset is the global batch size, not the per replica size. For multi machine training, MultiWorkerMirroredStrategy and TPUStrategy extend the same pattern.
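The global versus per-replica batch size distinction can be sketched as below. The model architecture and the random data are placeholders chosen only to make the example self-contained; on a machine without GPUs, MirroredStrategy falls back to a single replica.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# The dataset is batched with the global size; each replica sees a share of it.
GLOBAL_BATCH_SIZE = 64
replicas = strategy.num_replicas_in_sync
per_replica = GLOBAL_BATCH_SIZE // replicas
print(replicas, per_replica)

# Variables created inside the scope are mirrored across all replicas.
with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)]
    )

# Placeholder data: 256 random samples, batched by the global batch size.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((256, 8)), tf.random.normal((256, 1)))
).batch(GLOBAL_BATCH_SIZE)

first_features, first_labels = next(iter(dataset))
print(first_features.shape)
```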
In PyTorch, the recommended approach is torch.nn.parallel.DistributedDataParallel (DDP), which spawns one process per GPU and coordinates gradient synchronization through one of three backends: nccl for NVIDIA GPUs (recommended), gloo for CPU and cross-platform support, or mpi for HPC clusters. Each DDP process owns exactly one CUDA device, identified by the local rank, and the user is expected to call torch.cuda.set_device(local_rank) before constructing the model. PyTorch's older nn.DataParallel (single process, multi-threaded) still exists, but its use is discouraged in favor of DDP, even on a single machine. The MPS backend does not support distributed training at present: gloo and nccl both refuse to bind to mps, so only single-GPU training works on Apple silicon.
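The shape of a DDP setup can be sketched in a single process. This is not how a real job is launched: it assumes a world size of 1, the gloo backend on CPU, and a temporary file for rendezvous, all chosen so the example runs without GPUs or a launcher. A real multi-GPU run would typically start one process per GPU, select nccl, and call torch.cuda.set_device(local_rank) before building the model. The function name run_single_process_demo is hypothetical.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_single_process_demo():
    """Wrap a tiny model in DDP with world_size=1 and run one backward pass."""
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "ddp_init")
        dist.init_process_group(
            backend="gloo",
            init_method="file://" + init_file,
            rank=0,
            world_size=1,
        )
        try:
            model = DDP(nn.Linear(4, 2))
            loss = model(torch.randn(8, 4)).sum()
            # Gradients are all-reduced across processes during backward().
            loss.backward()
            return tuple(model.module.weight.grad.shape)
        finally:
            dist.destroy_process_group()


grad_shape = run_single_process_demo()
print(grad_shape)
```

The wrapped network stays reachable as model.module, which is where checkpoints are usually saved from so that the file loads without the DDP wrapper.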
GPU operations in PyTorch are asynchronous by default. When the user calls a function that uses the GPU, the operation is enqueued onto the CUDA stream of the current device and returns immediately. Subsequent operations queue up behind it, and the host only waits when it needs a value that is not yet computed, for example when calling tensor.item() or copying a result back to the CPU. This is what makes Python overhead invisible in most training loops: the host can stay ahead of the GPU as long as the queue keeps filling.
The consequence is that profiling and debugging behave unexpectedly unless the user inserts explicit synchronization, for example with torch.cuda.synchronize(). Setting CUDA_LAUNCH_BLOCKING=1 in the environment forces every kernel launch to block until completion, which makes error tracebacks point to the line that actually caused the failure rather than a later line where the asynchronous error finally surfaces. PyTorch also exposes a caching allocator that holds GPU memory between operations; the memory usage reported by nvidia-smi will usually be larger than the value returned by torch.cuda.memory_allocated(), because cached blocks are still owned by the process even when no tensor occupies them. The torch.cuda.empty_cache() call releases unused cached memory back to the driver but does not affect tensors that are still live.
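A sketch of timing with explicit synchronization, and of the gap between allocated and cached memory, might look like this. The helper name timed_matmul and the matrix size are illustrative assumptions; on a machine without CUDA the function simply times the CPU and the memory report is skipped.

```python
import time

import torch


def timed_matmul(device, size=512):
    """Time one matrix multiply, synchronizing around it on CUDA devices."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # finish pending work before starting the clock
    start = time.perf_counter()
    c = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for the kernel, not just the launch
    return time.perf_counter() - start, c


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
elapsed, result = timed_matmul(device)
print(f"{elapsed:.6f} s on {device}")

if device.type == "cuda":
    # Bytes held by live tensors versus bytes held by the caching allocator.
    print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))
```

Without the second synchronize call, the measured time on a GPU would mostly reflect how long it took to enqueue the kernel rather than how long it took to run.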
A device in machine learning is the piece of hardware that actually does the math when a computer is learning. There are different kinds of devices, like CPUs, GPUs, and TPUs. Each one has its own strengths. CPUs are like a Swiss Army knife: they can do many things but might not be the fastest at any one of them. GPUs are like a big team of workers who can all do the same job at the same time, which is great for the giant grids of numbers that neural networks use. TPUs are like a very specialized tool that is really good at one specific job, which can make them faster and more efficient for certain tasks. When you write a program in PyTorch or TensorFlow, you tell the program which device to use, and you can move your numbers (tensors) from one device to another, just like moving toys from one box to another box. If two tensors are in different boxes, they cannot play together until you put them in the same box.