Device


Updated Jun 28, 2026


See also: Machine learning terms

What is a device in machine learning?

In machine learning, a device is the hardware target on which tensor operations are executed: a CPU, an NVIDIA GPU through CUDA, an Apple GPU through Metal Performance Shaders, a Google TPU through XLA, or an Intel GPU through XPU. In both TensorFlow and PyTorch, a device is a logical handle that points to one specific processor. Every tensor and model parameter lives on exactly one device at a time, and the framework moves data between devices on request ^[1]^[4]. Choosing the right device, and managing how tensors travel between devices, is one of the most basic skills in deep learning engineering.

Devices range from the general-purpose processor in a personal computer to powerful, specialized chips designed specifically for machine learning. This article covers the hardware categories that show up under the device abstraction, the device APIs in TensorFlow and PyTorch, and the practical concerns that come with multi-device training.

What kinds of hardware count as a device?

Central processing units (CPUs)

Central Processing Units (CPUs) are the primary processing units in most general-purpose computers. They are versatile and can run a wide range of workloads, including machine learning algorithms. CPUs are slower and less efficient than specialized hardware for large training runs, but they are still widely used for small-scale tasks, particularly during the development and testing phases of machine learning projects. Their advantages include accessibility, compatibility with most programming languages, and relatively low cost. A CPU is also the only device available for many production inference scenarios where a GPU is impractical, such as serverless functions, mobile inference, and edge deployments.

Graphics processing units (GPUs)

Graphics Processing Units (GPUs) were initially designed for graphics rendering, but they have since been repurposed for general computing tasks, including machine learning. GPUs are particularly well suited to machine learning because of their massively parallel architecture, which lets them apply the same operation to large amounts of data simultaneously. This architecture is useful for training deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which require significant computational power. GPUs have become a popular choice for machine learning practitioners and researchers because they are typically much faster than CPUs on the dense matrix arithmetic that dominates these models. Most modern deep learning research runs on NVIDIA GPUs through the CUDA toolkit, although AMD GPUs (via ROCm) and Intel GPUs (via XPU) are also supported in newer framework releases.

Tensor processing units (TPUs)

Tensor Processing Units (TPUs) are specialized hardware accelerators designed specifically for machine learning tasks. Developed by Google, TPUs are optimized for the execution of tensor operations that are common in deep learning algorithms. The first generation TPU was deployed in Google's data centers in early 2015, built on a 28 nm process running at 700 MHz, drawing about 40 W, and built around a 256x256 systolic array of 8-bit multiply-accumulators that delivered roughly 92 TOPS of INT8 compute for inference ^[9]. These devices offer performance improvements over both CPUs and GPUs for certain machine learning workloads, particularly when it comes to power efficiency and processing speed. TPUs are commonly used in large scale machine learning applications, such as training models on massive datasets or deploying them in production environments. TensorFlow has native support for TPUs through the TPUStrategy API, and PyTorch reaches them through the PyTorch/XLA package, which bridges PyTorch tensors to Google's XLA compiler ^[7].
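The 92 TOPS figure can be sanity-checked from the array size and clock rate quoted above. The sketch below assumes the common convention that one multiply-accumulate counts as two operations (a multiply and an add); that convention is an assumption of this example, not something stated in the text.

```python
# Peak throughput of a systolic array, counting each MAC as 2 ops.
# The 2-ops-per-MAC convention is an assumption; the array size and
# clock rate are the figures quoted in the paragraph above.
ARRAY_ROWS = 256
ARRAY_COLS = 256
CLOCK_HZ = 700e6
OPS_PER_MAC = 2

macs_per_cycle = ARRAY_ROWS * ARRAY_COLS
peak_ops_per_second = macs_per_cycle * OPS_PER_MAC * CLOCK_HZ
peak_tops = peak_ops_per_second / 1e12

# prints: 65536 MACs per cycle, about 91.75 TOPS peak
print(f"{macs_per_cycle} MACs per cycle, about {peak_tops:.2f} TOPS peak")
```

This is a peak number: it assumes every multiply-accumulator is busy on every cycle, which real workloads rarely achieve.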

Apple silicon and other accelerators

Apple silicon Macs expose their integrated GPU to PyTorch through the MPS (Metal Performance Shaders) backend, which Apple and Meta announced in May 2022 for the PyTorch 1.12 release ^[6]. According to PyTorch's announcement, "In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac" ^[6]. The MPS backend maps machine learning computational graphs and primitives onto the MPSGraph framework and tuned Metal kernels, and it benefits from the unified memory architecture of M series chips: PyTorch notes that "Every Apple silicon Mac has a unified memory architecture, providing the GPU with direct access to the full memory store," so transfers between host and device are dramatically cheaper than on a discrete GPU ^[6]. The release shipped with prototype status and Apple reported speedups of up to 20x over CPU-based training on some workloads ^[6]. The MPS backend requires macOS 12.3 or later ^[3]. Other accelerators that appear under the device abstraction include Intel GPUs (xpu), AMD GPUs (treated as cuda devices under ROCm), and the special meta device in PyTorch, which holds tensor shapes and dtypes without allocating any actual storage.
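The meta device is easiest to understand by example. The following is a minimal sketch, assuming a recent PyTorch install; the tensor size is arbitrary and chosen only to be far too large to allocate for real.

```python
import torch

# A meta tensor records shape and dtype but owns no storage, so this
# nominal 40 GB float32 tensor costs essentially nothing to create.
big = torch.empty(100_000, 100_000, device="meta")

print(big.shape, big.dtype, big.device)
print(big.is_meta)

# Shape inference still works: the result is another meta tensor.
product = big @ big
print(product.shape, product.is_meta)
```

Because no data exists, a meta tensor cannot be printed element by element or copied to a real device; it is useful for inspecting shapes and for building a model skeleton before deciding where the weights should live.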

How do you set a device in PyTorch?

In PyTorch, every tensor has a .device attribute that records where it lives. The torch.device class wraps a device type and an optional ordinal (the integer index of the specific accelerator) ^[4]. The class accepts several construction styles:

import torch

torch.device('cpu')          # CPU
torch.device('cuda')         # current CUDA device
torch.device('cuda:0')       # first GPU
torch.device('cuda', 0)      # equivalent to cuda:0
torch.device('mps')          # Apple GPU
torch.device('xla')          # TPU or other XLA device
torch.device('xpu:0')        # first Intel GPU
torch.device('meta')         # symbolic, no storage
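A torch.device object is only a descriptor, so the construction styles above can be compared on any machine, even one without the accelerator they name. The sketch below illustrates this; the variable names are arbitrary.

```python
import torch

# The string form and the (type, index) form describe the same device.
a = torch.device("cuda:0")
b = torch.device("cuda", 0)
print(a == b)             # True
print(a.type, a.index)    # cuda 0

# Without an ordinal the index is None, meaning "the current device".
c = torch.device("cuda")
print(c.index)            # None
print(a == c)             # False

# Factory functions accept a torch.device or its string form.
cpu = torch.device("cpu")
x = torch.zeros(2, 2, device=cpu)
y = torch.zeros(2, 2, device="cpu")
print(x.device == y.device)   # True
```

Note that "cuda" and "cuda:0" do not compare equal even when device 0 is the current device, so code that compares devices is usually safer comparing the .type attribute.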

The device string is the standard way to specify hardware targets across the ecosystem. PyTorch also accepts the bare integer form (torch.device(0)), which uses the current accelerator type, although that raises a RuntimeError if no accelerator is detected ^[4]. A common pattern is to pick a device once at the top of a script and use it everywhere, which is what makes code device-agnostic and portable across machines:

if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

How do you move a tensor to a device with .to(device)?

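The Tensor.to() method is the standard way to move data between devices. It returns a tensor on the target device and leaves the original tensor untouched, so the result must be reassigned. If the tensor is already on the target device with the requested dtype, no copy is made and the same tensor is returned: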

x = torch.randn(3, 3)        # on CPU by default
x = x.to(device)             # now on the selected device

The same method works on nn.Module objects. Calling model.to(device) moves all parameters and buffers to the target device in place, which is one of the few cases where .to() mutates the receiver. For computation involving two or more tensors, every operand must live on the same device, otherwise PyTorch raises a RuntimeError. The legacy methods tensor.cuda() and tensor.cpu() still work as shortcuts, but .to(device) is preferred because it makes the script portable across hardware.
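The difference between the tensor and module versions of .to() can be seen in a short sketch. It uses the meta device as a stand-in for a second device so that it runs on a machine without a GPU; that substitution is an assumption made for illustration.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 3)                 # created on the CPU
moved = x.to("meta")                  # different device: a new tensor
print(x.device, moved.device)         # cpu meta  (x itself is unchanged)

same = x.to("cpu")                    # already there: no copy is made
print(same is x)                      # True

model = nn.Linear(3, 2)
returned = model.to(device)           # parameters are moved in place
print(returned is model)              # True

out = model(x.to(device))             # inputs must share the model's device
print(out.device.type == device.type)
```

Because model.to(device) returns the module itself, both `model.to(device)` and `model = model.to(device)` leave the model on the target device, whereas a tensor must always be reassigned.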

Why are pinned memory and non-blocking transfers faster?

Host-to-GPU transfers are faster when the source tensor sits in pinned (page locked) memory, because the CUDA driver can stream pinned memory directly to the device via DMA without an extra staging copy ^[5]. PyTorch exposes this through the pin_memory() method on CPU tensors and the pin_memory=True argument on DataLoader. Combined with non_blocking=True in the .to() call, the CPU can prepare the next batch while the GPU is still consuming the previous one ^[5]:

batch = batch.pin_memory()
batch = batch.to(device, non_blocking=True)

Pinning memory is not free. Pinned pages cannot be paged out, so over-pinning can starve the operating system of usable RAM. PyTorch's tutorial warns that "calling pin_memory() on a pageable tensor before casting it to GPU should not bring any significant speed-up, on the contrary this call is usually slower than just executing the transfer" ^[5], because the synchronous pinning call is roughly as expensive as the transfer itself. Pinning is worth it for the data loader, where a separate thread amortizes the cost across many batches ^[5].
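A minimal sketch of the data loader pattern described above follows. The dataset is synthetic and its shapes are hypothetical; pinning is requested only when CUDA is present so that the same script also runs on a CPU-only machine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic data standing in for a real dataset (hypothetical shapes).
features = torch.randn(256, 16)
labels = torch.randint(0, 2, (256,))
dataset = TensorDataset(features, labels)

# Let the loader pin each batch; skip pinning when there is no GPU.
loader = DataLoader(dataset, batch_size=64, pin_memory=use_cuda)

seen = 0
for batch_x, batch_y in loader:
    # non_blocking=True only overlaps with compute if the source is pinned.
    batch_x = batch_x.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    seen += batch_x.shape[0]

print(f"moved {seen} examples to {device}")
```

Leaving the pinning to the loader, rather than calling pin_memory() on each batch in the training loop, keeps the synchronous pinning cost off the main thread.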

How do you set a device in TensorFlow?

TensorFlow uses string identifiers that look slightly different from PyTorch's. The CPU is /device:CPU:0 (or shortened to /CPU:0), the first GPU is /GPU:0, and fully qualified names like /job:localhost/replica:0/task:0/device:GPU:1 show up in distributed contexts ^[1]. The framework discovers devices automatically and assigns operations through a placement algorithm. If an operation has both a CPU and a GPU kernel, TensorFlow prefers the GPU; if an op only has a CPU kernel (such as tf.cast in earlier versions), it falls back to the CPU even when a GPU is available ^[1].
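The structure of these names is easier to see when they are split into fields. The helper below is a hypothetical illustration written in plain Python; it is not part of TensorFlow and handles only the three forms mentioned above.

```python
def parse_device_name(name: str) -> dict:
    """Split a TensorFlow-style device string into named fields.

    Hypothetical helper for illustration only, not a TensorFlow API.
    """
    fields = {}
    for part in name.strip("/").split("/"):
        pieces = part.split(":")
        if len(pieces) == 3 and pieces[0] == "device":
            # Long form, e.g. "device:GPU:1"
            fields["device_type"] = pieces[1]
            fields["device_index"] = int(pieces[2])
        elif len(pieces) == 2 and pieces[0] in ("job", "replica", "task"):
            key, value = pieces
            fields[key] = int(value) if value.isdigit() else value
        elif len(pieces) == 2:
            # Short form, e.g. "GPU:0"
            fields["device_type"] = pieces[0]
            fields["device_index"] = int(pieces[1])
        else:
            raise ValueError(f"unrecognized component: {part!r}")
    return fields


print(parse_device_name("/CPU:0"))
print(parse_device_name("/device:CPU:0"))
print(parse_device_name("/job:localhost/replica:0/task:0/device:GPU:1"))
```

The short and long CPU forms parse to the same fields, which reflects the fact that they name the same device.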

tf.device context manager

Manual placement uses the tf.device context manager. Every op created inside the with block is pinned to that device ^[1]:

import tensorflow as tf

# Tensors created inside the block are pinned to the CPU.
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

c = tf.matmul(a, b)   # runs on GPU by default, copies inputs as needed

For debugging, tf.debugging.set_log_device_placement(True) prints every op's device assignment, which is useful when a GPU is sitting idle and the user does not know why ^[1]. If a script targets a device that does not exist on the current machine, TensorFlow raises a RuntimeError. Setting tf.config.set_soft_device_placement(True) tells the runtime to silently pick an existing device instead, which makes scripts portable between machines with different GPU counts ^[1].
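The two settings can be combined in a short sketch. The device name "/GPU:7" is chosen here only because it is unlikely to exist; on a machine that does have eight GPUs the op would simply run there.

```python
import tensorflow as tf

# Log every op's device assignment (verbose; intended for debugging).
tf.debugging.set_log_device_placement(True)

# Fall back to an available device instead of failing on a missing one.
tf.config.set_soft_device_placement(True)

# "/GPU:7" is a deliberately unlikely device, used only for illustration.
with tf.device("/GPU:7"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

print(b.device)   # the device that actually ran the op
print(b.numpy())
```

With soft placement enabled, the printed device shows where TensorFlow actually ran the multiplication, which is a quick way to confirm whether a fallback happened.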

How does TensorFlow manage GPU memory?

By default TensorFlow maps nearly all of the GPU memory of every visible GPU at startup, which can be hostile to other processes on the same machine ^[1]. Two settings control this. The first is memory growth, enabled per device with tf.config.experimental.set_memory_growth(gpu, True) or the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true, which makes the runtime start with very little memory and grow its allocation as needed ^[1]^[8]. The second is a hard limit, set with tf.config.set_logical_device_configuration and a memory_limit value in megabytes. Hard limits also enable virtual GPUs, where a single physical GPU is split into several logical ones for testing multi GPU code on a single device ^[1].
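Both settings have to be applied before the GPUs are first used. The sketch below treats them as alternatives for a given GPU; the USE_HARD_LIMIT toggle and the 1024 MB figure are hypothetical values chosen for illustration.

```python
import tensorflow as tf

USE_HARD_LIMIT = False   # hypothetical toggle between the two settings
MEMORY_LIMIT_MB = 1024   # hypothetical per-logical-device cap

gpus = tf.config.list_physical_devices("GPU")

for gpu in gpus:
    if USE_HARD_LIMIT:
        # Two capped logical devices on one physical GPU (virtual GPUs).
        tf.config.set_logical_device_configuration(
            gpu,
            [
                tf.config.LogicalDeviceConfiguration(memory_limit=MEMORY_LIMIT_MB),
                tf.config.LogicalDeviceConfiguration(memory_limit=MEMORY_LIMIT_MB),
            ],
        )
    else:
        # Start with a small allocation and grow it on demand.
        tf.config.experimental.set_memory_growth(gpu, True)

# Listing logical devices initializes the runtime, so it comes last.
logical_gpus = tf.config.list_logical_devices("GPU")
print(f"{len(gpus)} physical GPU(s), {len(logical_gpus)} logical GPU(s)")
```

On a machine with no GPU the loop body never runs and the script reports zero devices, so the same configuration code can ship unchanged to CPU-only environments.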

How do you train across multiple devices?

Once a model outgrows one device, the framework needs a strategy for splitting the work. The two dominant patterns are data parallelism (every device holds a copy of the model and processes a different shard of the batch) and model parallelism (the model itself is split across devices).

In TensorFlow, tf.distribute.MirroredStrategy is the standard data-parallel strategy for a single machine with multiple GPUs. It creates one replica per GPU device and uses the NVIDIA Collective Communications Library (NCCL) for all-reduce by default ^[8]. The batch size applied to the dataset is the global batch size, which TensorFlow divides by strategy.num_replicas_in_sync to get the per-replica size: a global batch of 64 with two GPUs means each device processes 32 examples per step ^[8]. For multi-machine training, MultiWorkerMirroredStrategy and TPUStrategy extend the same pattern.
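The global versus per-replica batch arithmetic can be sketched with a toy Keras model. The model, data shapes, and optimizer below are hypothetical placeholders; on a machine without GPUs the strategy falls back to a single replica.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # all visible GPUs, else CPU
replicas = strategy.num_replicas_in_sync

GLOBAL_BATCH_SIZE = 64
per_replica_batch = GLOBAL_BATCH_SIZE // replicas
print(f"{replicas} replica(s), {per_replica_batch} examples each per step")

# Variables created inside the scope are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# Batch with the GLOBAL size; the strategy splits each batch itself.
x = tf.random.normal((256, 8))
y = tf.random.normal((256, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH_SIZE)
model.fit(dataset, epochs=1, verbose=0)
```

Keeping the global batch size fixed while the replica count changes keeps the optimization behavior comparable across machines, at the cost of a smaller per-device batch as GPUs are added.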

In PyTorch, the recommended approach is torch.nn.parallel.DistributedDataParallel (DDP), which runs one process per GPU and coordinates gradient synchronization through one of three backends: nccl for NVIDIA GPUs (recommended), gloo for CPU and cross-platform support, or mpi for HPC clusters (available only if PyTorch is built from source) ^[10]. Each DDP process owns exactly one CUDA device, identified by the local rank, and the user is expected to call torch.cuda.set_device(local_rank) before constructing the model. PyTorch's older nn.DataParallel (single-process, multi-threaded) still exists but is no longer recommended; DDP is preferred even on a single machine. The MPS backend does not support distributed training at present: gloo and nccl both refuse to bind to mps, so only single GPU training works on Apple silicon.
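The DDP setup steps can be sketched in a single process. This example is a simplification: it hard-codes a world size of 1 and uses a temporary file for rendezvous so that it runs standalone, whereas a real job is normally started by a launcher that supplies the rank and local rank to each process. The model and data are hypothetical placeholders.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

use_cuda = torch.cuda.is_available()
backend = "nccl" if use_cuda else "gloo"

# Hard-coded for a standalone run; a launcher normally provides these.
rank, local_rank, world_size = 0, 0, 1

# File-based rendezvous avoids needing a free network port.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend, init_method=f"file://{init_file}", rank=rank, world_size=world_size
)

if use_cuda:
    torch.cuda.set_device(local_rank)      # one CUDA device per process
    device = torch.device("cuda", local_rank)
else:
    device = torch.device("cpu")

model = nn.Linear(10, 1).to(device)
ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(32, 10, device=device)
targets = torch.randn(32, 1, device=device)

loss = nn.functional.mse_loss(ddp_model(inputs), targets)
optimizer.zero_grad()
loss.backward()        # gradients are all-reduced across processes here
optimizer.step()
print(f"rank {rank}: loss {loss.item():.4f}")

dist.destroy_process_group()
```

With more than one process, each rank would run this same script on its own shard of the data, and the backward pass would average gradients across ranks before the optimizer step.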

Why is CUDA execution asynchronous?

GPU operations in PyTorch are asynchronous by default. When the user calls a function that uses the GPU, the operation is enqueued onto the CUDA stream of the current device and returns immediately. PyTorch's documentation explains that "the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs" ^[2]. Subsequent operations queue up behind the first, and the host only waits when it needs a value that is not yet computed, for example when calling tensor.item() or copying a result back to the CPU. This is what makes Python overhead invisible in most training loops: the host can stay ahead of the GPU as long as the queue keeps filling.

The consequence is that profiling and debugging behave unexpectedly unless the user inserts explicit synchronization. The documentation notes that "you can force synchronous computation by setting environment variable CUDA_LAUNCH_BLOCKING=1," which makes error tracebacks point to the line that actually caused the failure rather than a later line where the asynchronous error finally surfaces ^[2]. PyTorch also exposes a caching allocator that holds GPU memory between operations, so "the unused memory managed by the allocator will still show as if used in nvidia-smi"; the value returned by torch.cuda.memory_allocated() tracks only memory occupied by live tensors, while memory_reserved() tracks the total managed by the caching allocator ^[2]. The torch.cuda.empty_cache() call releases unused cached memory back to the driver but does not affect tensors that are still live.
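Explicit synchronization matters most when timing GPU code. The sketch below times a matrix multiplication and, when CUDA is present, prints the two memory counters; the matrix size is arbitrary, and the sync helper is a hypothetical convenience that does nothing on a CPU-only machine.

```python
import time

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def sync() -> None:
    """Wait for queued GPU work to finish; a no-op on the CPU."""
    if device.type == "cuda":
        torch.cuda.synchronize()


a = torch.randn(512, 512, device=device)
b = torch.randn(512, 512, device=device)

sync()                             # drain earlier work before starting
start = time.perf_counter()
c = a @ b                          # on CUDA this only enqueues the kernel
sync()                             # wait until the result actually exists
elapsed = time.perf_counter() - start
print(f"matmul on {device}: {elapsed * 1e3:.3f} ms")

if device.type == "cuda":
    # Live tensors versus everything held by the caching allocator.
    print("allocated bytes:", torch.cuda.memory_allocated())
    print("reserved bytes: ", torch.cuda.memory_reserved())
```

Without the second sync() call, the timer on a GPU would mostly measure how long it takes to enqueue the kernel rather than how long the multiplication takes to run.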

Explain Like I'm 5 (ELI5)

A device in machine learning is the piece of hardware that actually does the math when a computer is learning. There are different kinds of devices, like CPUs, GPUs, and TPUs. Each one has its own strengths. CPUs are like a Swiss Army knife: they can do many things but might not be the fastest at any one of them. GPUs are like a big team of workers who can all do the same job at the same time, which is great for the giant grids of numbers that neural networks use. TPUs are like a very specialized tool that is really good at one specific job, which can make them faster and more efficient for certain tasks. When you write a program in PyTorch or TensorFlow, you tell the program which device to use, and you can move your numbers (tensors) from one device to another, just like moving toys from one box to another box. If two tensors are in different boxes, they cannot play together until you put them in the same box.

References
