tinygrad
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,083 words
tinygrad is an open-source deep learning framework written primarily in Python that aims to occupy the space between Andrej Karpathy's pedagogical micrograd and full-scale production stacks like PyTorch. It was started by George Hotz (geohot) on 17 October 2020 as a deliberately minimal alternative whose readable source code makes the entire compiler and intermediate representation visible to the user.[^1][^2] The framework is maintained by the tiny corp, a company Hotz founded that announced a $5.1 million seed round on 24 May 2023 and now sells the tinybox, a multi-GPU workstation built around consumer Radeon and GeForce cards.[^3][^4] tinygrad's stated mission is to "commoditize the petaflop," and the project deliberately targets accelerators beyond NVIDIA, with a particular focus on AMD GPUs through a near-complete software stack of its own.[^3][^5]
| Field | Value |
|---|---|
| First commit | 17 October 2020[^2] |
| Original author | George Hotz |
| Maintainer | the tiny corp[^1] |
| License | MIT[^1] |
| Language | Python (with C, CUDA, Metal kernels)[^1] |
| Latest documented version | 0.12.0 (12 January 2026)[^1] |
| Approximate size (excl. tests) | ~18,935 lines[^2] |
| GitHub stars (Jan 2026) | 32.7k[^1] |
| Company funding | $5.1M seed, May 2023[^3] |
| Primary product | tinybox (red v2, green v2)[^6] |
The first commit to the tinygrad repository was made on 17 October 2020 by Hotz, the security researcher widely known for jailbreaking the iPhone and the PlayStation 3 and for founding the autonomous-driving startup comma.ai.[^2][^7] The project began as an experiment to see how small a working deep learning framework could be while still supporting the operations needed to train neural networks. Hotz framed it as something "between PyTorch and micrograd," with the readability and hackability of Karpathy's 200-line educational engine but the practical surface area of a real tensor library.[^1]
Through 2021 and 2022 tinygrad accumulated backends and operators while staying nominally under a self-imposed ceiling of a few thousand lines. The project's README has long emphasized that adding a new accelerator requires implementing only roughly 25 low-level operations, which kept the porting effort within reach of small teams.[^1] Hotz's company comma.ai adopted tinygrad as a model-runtime alternative to its earlier thneed and Qualcomm SNPE pipelines, eventually moving the openpilot driving model entirely onto a tinygrad QCOM backend on the comma 3X devkit.[^8]
During this early period, much of tinygrad's design was driven by Hotz's livestream-based development style, where he would often refactor large portions of the codebase on camera. The repository's commit history reflects an emphasis on deletions rather than additions: an unusually high ratio of red-line to green-line changes, motivated by the goal of keeping the entire framework small enough that a new contributor could read every file in a single weekend.[^1][^9]
Hotz incorporated "the tiny corp" in late 2022 and announced a $5.1 million seed round on 24 May 2023 in a blog post titled "the tiny corp raised $5.1M." The post described the company's plan to "commoditize the petaflop" by building a consumer-priced AI workstation and porting machine-learning kernels to non-NVIDIA hardware.[^3] The same post outlined a hiring model entirely unlike a traditional startup: the only way to be hired was to submit "high quality pull requests" to the tinygrad repository, and contributors could earn cash bounties posted on GitHub for tasks aligned with the project's roadmap.[^3]
Hotz repeatedly stated that the company's initial commercial focus would be a small-form-factor box capable of training large models and running inference on them, and that the long arc would be to challenge NVIDIA's CUDA software moat by making AMD's RDNA3 hardware practical for ML.[^3][^5] In the same announcement post he described tinygrad's design as deliberately constrained to twelve operations supporting only addition and multiplication, and explicitly avoiding Turing-complete kernels so that the framework could perform static analysis over memory access patterns. He argued that this constraint was the structural reason a small team could optimise modern ML workloads without relying on opaque vendor compilers.[^3]
Through 2024 the tiny corp publicly worked on getting AMD's Radeon RX 7900 XTX into the MLPerf benchmark suite. The company posted detailed engineering notes on stability problems with AMD's firmware and driver, eventually documenting the 7900 XTX in its own tinygrad/7900xtx repository.[^5] By early 2026, Hotz had announced a "completely sovereign" compute stack for AMD GPUs, in which tinygrad implements its own kernel-level driver in Python, bypasses AMD's Micro Engine Scheduler (MES) firmware, and submits PM4 packets directly rather than going through the higher-level ROCm/HSA stack.[^5][^9]
In a fifth-anniversary blog post dated 29 December 2025, Hotz wrote that tinygrad's code (excluding tests) had grown to about 18,935 lines, that the team had grown to six people, that the computer-sales business generated roughly $2 million a year, and that the project earned additional revenue from AMD contracts related to MLPerf benchmarking.[^2][^9] He also noted that during the year the project had eliminated its remaining LLVM dependency, so a fresh checkout of tinygrad now requires nothing more than a Python installation for AMD code generation, with the LLVM path retained only as one optional fallback.[^9][^5]
tinygrad is organized around three load-bearing ideas.
Every operation on a Tensor in tinygrad returns a new lazy node rather than executing immediately. Computation is only triggered when the user calls .realize() or when an operation otherwise needs a concrete numeric result, such as conversion to NumPy or printing.[^10] This delayed execution lets the scheduler see a larger graph of pending operations before committing to a code-generation strategy. By contrast, PyTorch's eager mode executes most operators immediately and relies on torch.compile for similar fusion opportunities.[^10][^1]
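The deferred-execution idea can be illustrated with a small standalone sketch. This is not tinygrad's code: `LazyNode`, `const`, and the two supported operations are hypothetical names chosen for illustration, and the only assumption carried over from the text is that arithmetic builds a graph and a `realize()` call triggers the actual computation.

```python
class LazyNode:
    """A toy lazy value: records an operation instead of running it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op            # "const", "add", or "mul"
        self.inputs = inputs    # parent LazyNode objects
        self.value = value      # stays None until realize() computes it

    def __add__(self, other):
        return LazyNode("add", (self, other))

    def __mul__(self, other):
        return LazyNode("mul", (self, other))

    def realize(self):
        """Walk the recorded graph and compute a concrete number."""
        if self.value is None:
            a, b = (node.realize() for node in self.inputs)
            self.value = a + b if self.op == "add" else a * b
        return self.value


def const(x):
    """Wrap a plain number as an already-realized leaf node."""
    return LazyNode("const", value=x)


x, y = const(3.0), const(4.0)
z = x * y + x           # builds a two-node graph; nothing is computed yet
pending = z.value       # still None at this point
result = z.realize()    # the multiplication and addition happen here
```

Because `z` holds the whole pending expression before anything runs, a scheduler in a real framework has the chance to inspect and rearrange that graph first, which is the opportunity the paragraph above describes.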
Because tensors are lazy, tinygrad can examine many pending operations and decide which can be fused into a single GPU kernel. Movement operations (reshape, permute, expand, pad, shrink, stride) are represented symbolically through a data structure called the ShapeTracker, which composes views of an underlying buffer without copying data. The scheduling phase determines which operations can be merged into one fused kernel and which need to be "realized" first, while the lowering phase emits target-specific code for that kernel.[^11][^10]
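How a movement operation can avoid copying data is easier to see with a toy strided view. The sketch below is a simplified illustration, not tinygrad's ShapeTracker: the `View` class and its methods are hypothetical, and it only assumes the general idea from the text that a view is described by shape metadata over a shared buffer.

```python
class View:
    """A toy strided view over a flat buffer (not tinygrad's ShapeTracker)."""

    def __init__(self, buffer, shape, strides):
        self.buffer = buffer    # shared flat list; never copied
        self.shape = shape
        self.strides = strides  # elements to skip per step along each axis

    def permute(self, order):
        # Reordering axes only reorders the shape and stride tuples.
        return View(self.buffer,
                    tuple(self.shape[i] for i in order),
                    tuple(self.strides[i] for i in order))

    def expand(self, axis, size):
        # Broadcasting a size-1 axis uses stride 0: every index reads the same element.
        assert self.shape[axis] == 1, "only size-1 axes can be broadcast"
        shape, strides = list(self.shape), list(self.strides)
        shape[axis], strides[axis] = size, 0
        return View(self.buffer, tuple(shape), tuple(strides))

    def __getitem__(self, index):
        offset = sum(i * s for i, s in zip(index, self.strides))
        return self.buffer[offset]


data = [0, 1, 2, 3, 4, 5]                       # a 2x3 row-major matrix
m = View(data, shape=(2, 3), strides=(3, 1))
t = m.permute((1, 0))                           # 3x2 transpose, same buffer
rows = [[t[i, j] for j in range(t.shape[1])] for i in range(t.shape[0])]

row = View([7, 8, 9], shape=(1, 3), strides=(3, 1))
tall = row.expand(0, 4)                         # behaves like a 4x3 matrix
```

Both `t` and `tall` reuse the original list, so composing several such views costs only a little metadata until a kernel finally reads the buffer.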
The framework distills neural-network workloads down to a small set of fundamental UOps (unary, binary, reduce, movement) rather than implementing specialized convolutions or matrix multiplies as monolithic operators. This means autodifferentiation and accelerator support fall out automatically for any operation expressible in those primitives.[^3][^1]
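The claim that differentiation falls out of a small primitive set can be sketched with a scalar toy in the spirit of micrograd. Everything here is illustrative: `Value`, `square`, and `affine` are hypothetical names, and only two primitives (add and multiply) carry derivative rules.

```python
class Value:
    """Scalar node with reverse-mode gradients for two primitives only."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # input Values
        self.grad_fns = grad_fns    # one local-derivative function per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Topologically order the graph, then push gradients toward the inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, fn in zip(node.parents, node.grad_fns):
                parent.grad += fn(node.grad)


def square(v):
    return v * v            # composite op: no derivative rule written for it


def affine(x, w, b):
    return x * w + b        # composite op built from mul and add


x, w, b = Value(2.0), Value(3.0), Value(1.0)
loss = square(affine(x, w, b))      # (2*3 + 1)^2
loss.backward()
```

Neither `square` nor `affine` defines a backward rule, yet both receive correct gradients because they are expressed entirely through the two primitives. The same reasoning is what lets a backend implement only the primitive set.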
A core ergonomic claim of tinygrad is that the entire compiler and IR are user-visible. The DEBUG=2 environment variable, for instance, prints every kernel that is compiled and dispatched, including timing, FLOPS, and bandwidth estimates. Hotz frames this as the project's "show me the kernel" thesis: a deep-learning framework should let the user trace the path from a high-level nn.Linear call all the way down to the bytes that run on the GPU.[^1][^11]
tinygrad does not maintain a PyTorch-style dynamic Python autograd tape. Backward passes are computed symbolically against the lazy computation graph, which makes whole-program optimization, JIT replay via the @TinyJit decorator, and ahead-of-time compilation possible without an additional graph capture step.[^10]
The project's working slogan, repeated in Hotz's blog and on tinygrad.org, is "the best part is no part."[^9] The team explicitly resists adding compatibility layers, vendored kernel libraries, or auto-tuning systems if a smaller implementation suffices. As of late 2025 tinygrad operates without any required external dependency beyond Python itself, having removed its prior reliance on LLVM for AMD code generation as part of the sovereign-stack push.[^9][^5]
The execution pipeline can be summarised in three layers:
The Tensor class exposes a PyTorch-like API: element-wise math, reductions, conv2d, matmul, softmax, and so on. Calls construct a lazy graph of high-level operations.[^10]
The frontend deliberately omits the nn.Module base class familiar from PyTorch. Neural-network modules in tinygrad are plain Python classes whose forward pass is written as __call__ rather than forward, and stateless operations are exposed as plain methods on Tensor rather than as wrapped classes. Tensor sharding across multiple GPUs is built in via Tensor.shard, which annotates a tensor with a list of devices and a shard axis, so the same model code can be moved to multiple GPUs by changing a single argument rather than by introducing distributed-training wrappers.[^10]
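The two conventions described above, modules as plain classes with `__call__` and sharding as a device list plus an axis, can be mimicked with ordinary Python lists. This is a conceptual sketch only: `Scale`, the `shard` helper, and the device names are hypothetical and do not reproduce the behaviour or signature of tinygrad's `Tensor.shard`.

```python
class Scale:
    """A plain class used as a 'module': no base class, forward pass in __call__."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, rows):
        return [[v * self.factor for v in row] for row in rows]


def shard(rows, devices, axis=0):
    """Split a 2-D list evenly across devices along the given axis (toy model)."""
    n = len(rows) if axis == 0 else len(rows[0])
    assert n % len(devices) == 0, "axis length must divide evenly in this sketch"
    step = n // len(devices)
    shards = {}
    for k, dev in enumerate(devices):
        lo, hi = k * step, (k + 1) * step
        if axis == 0:
            shards[dev] = rows[lo:hi]                     # slice whole rows
        else:
            shards[dev] = [row[lo:hi] for row in rows]    # slice columns
    return shards


batch = [[1, 2], [3, 4], [5, 6], [7, 8]]
devices = ["GPU:0", "GPU:1"]          # placeholder device names
model = Scale(10)

# The same model code runs on every shard; only the device list changes.
outputs = {dev: model(part) for dev, part in shard(batch, devices, axis=0).items()}
```

Changing `devices` to a longer list (with a batch size that divides evenly) redistributes the work without touching `Scale`, which is the ergonomic point the paragraph makes.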
The @TinyJit decorator runs a function normally on its first call, captures the kernels launched on the second call, and replays them on subsequent calls, giving JIT-style performance without requiring a separate compilation phase or a static graph language. Because the captured kernels are exactly the ones the user can already inspect via DEBUG=2, the JIT does not introduce a new black box.[^10][^1]
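The capture-and-replay pattern can be sketched in a few lines of plain Python. This toy is not TinyJit: `ToyJit`, `Recorder`, and the two "kernels" are hypothetical, kernels are modelled as simple functions chained one after another, and for brevity the toy captures on its very first call, which need not match the real decorator's warm-up behaviour.

```python
class Recorder:
    """Collects every kernel that the traced function launches."""

    def __init__(self):
        self.kernels = []

    def run(self, kernel, x):
        self.kernels.append(kernel)
        return kernel(x)


class ToyJit:
    """Capture the kernel sequence once, then replay it on new inputs."""

    def __init__(self, fn):
        self.fn = fn
        self.captured = None      # list of kernel callables after capture
        self.python_calls = 0     # how many times the Python body actually ran

    def __call__(self, x):
        if self.captured is None:
            recorder = Recorder()
            out = self.fn(recorder, x)     # run the Python body, recording kernels
            self.captured = recorder.kernels
            self.python_calls += 1
            return out
        for kernel in self.captured:       # replay: the Python body is skipped
            x = kernel(x)
        return x


def double(x):
    return x * 2


def add_one(x):
    return x + 1


@ToyJit
def step(rec, x):
    x = rec.run(double, x)
    return rec.run(add_one, x)


first = step(3)      # captures [double, add_one]
second = step(10)    # replays the captured kernels on a new input
kernel_names = [k.__name__ for k in step.captured]
```

The replayed list is the same sequence of kernels that was observed during capture, which mirrors the point that the JIT adds no hidden stage beyond what is already visible.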
tinygrad's documentation lists the following first-party runtimes, each selectable through the DEV (or legacy CUDA=1, METAL=1, etc.) environment variables.
| Runtime | Hardware | Compiler / interface | Notes |
|---|---|---|---|
| NV | NVIDIA Ampere, Ada, Blackwell | nvrtc or PTX | Native NVIDIA path[^12] |
| CUDA | NVIDIA | nvrtc or PTX | Uses NVIDIA's CUDA driver[^12] |
| AMD | AMD RDNA3, RDNA4, CDNA3, CDNA4 | LLVM or HIP/COMGR | Includes "AM" sovereign driver bypassing ROCm[^12][^5] |
| METAL | Apple M1 and later | Metal Shading Language | Production-ready on macOS/iOS[^12] |
| QCOM | Qualcomm Adreno 6xx | OpenCL kernels | Used by comma.ai openpilot[^12][^8] |
| CL | OpenCL-capable GPUs | OpenCL | Generic fallback[^12] |
| WEBGPU | Browsers, Dawn | WGSL via Google Dawn | Runs inference inside Chrome[^12] |
| CPU | x86, ARM, RISC-V | Clang or LLVM | Reference path[^12] |
The framework also provides zero-copy interoperation with PyTorch CUDA/Metal tensors and with OpenCL on Qualcomm through the Tensor.from_blob API.[^12]
The tinybox is a self-contained workstation aimed at researchers and small teams who want local training and inference of large models. It is built by the tiny corp and ships in several SKUs colour-coded by GPU vendor.[^6][^4]
The tinybox is a 12U rack-mountable case, measuring 19" wide, 21" tall, and 16.25" deep, and weighs roughly 70 lb in the v2 form. Original boxes used dual 1600 W power supplies and required either a 120 V 30 A circuit or a 220 V 20 A circuit, with an option to power-limit GPUs to about 150 W each for single-outlet operation. Every variant ships with Ubuntu 22.04, tinygrad pre-installed, and PyTorch available as a fallback runtime.[^6][^13]
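The power figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in this article, plus two stated assumptions: that "single-outlet operation" refers to a 120 V, 15 A outlet, and that the box holds six GPUs as in the original configurations. It reports nominal ratings and ignores any continuous-load derating.

```python
def circuit_watts(volts, amps):
    """Nominal power available from a circuit: P = V * I."""
    return volts * amps


psu_total = 2 * 1600                       # dual 1600 W supplies
circuit_120v = circuit_watts(120, 30)      # 120 V, 30 A circuit
circuit_220v = circuit_watts(220, 20)      # 220 V, 20 A circuit
outlet_120v_15a = circuit_watts(120, 15)   # assumed single outlet rating

gpu_limited = 6 * 150                      # six GPUs capped at about 150 W each
headroom = outlet_120v_15a - gpu_limited   # left for CPU, drives, and fans

print(f"PSUs: {psu_total} W, 120 V circuit: {circuit_120v} W, "
      f"220 V circuit: {circuit_220v} W")
print(f"Power-limited GPUs: {gpu_limited} W, headroom on one outlet: {headroom} W")
```

Under these assumptions both recommended circuits exceed the combined supply rating, while the power-limited configuration leaves roughly half of a single outlet's nominal capacity for the rest of the system.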
The original red tinybox uses six AMD Radeon RX 7900 XTX cards, paired with a 32-core AMD EPYC Genoa CPU, 128 GB of system RAM, four Western Digital SN850X 1 TB NVMe SSDs in RAID plus a separate 1 TB boot drive, and an empty 16x OCP 3.0 slot for networking. Tom's Hardware reported the headline performance figure as 738 FP16 TFLOPS with 144 GB of aggregate GDDR6 memory (six cards at 24 GB each) and 5.76 TB/s of aggregate memory bandwidth, at a retail price of $15,000.[^4][^14]
A central design consideration was that the consumer-grade Radeon RX 7900 XTX supports peer-to-peer transfers between cards, which the GeForce RTX 4090 does not expose, allowing six cards to share data efficiently over PCIe for distributed AI workloads.[^4]
The green variant substitutes six NVIDIA GeForce RTX 4090 GPUs into the same chassis, reaching roughly 991 FP16 TFLOPS and 144 GB of GDDR6X, at a retail price of $25,000.[^13][^14] Both the red and green originals went on sale in 2024, after a long pre-order period during which the company refined the cooling and PCIe topology.[^14]
In 2024 the tiny corp also opened pre-orders for a "tinybox pro" configured with eight RTX 4090 GPUs and two AMD EPYC Genoa processors, listed at $40,000 and aimed at users who wanted denser NVIDIA compute in one box.[^15]
The current red v2, sold through the tiny corp Shopify store, drops to four AMD Radeon RX 9070 XT (RDNA4) GPUs and is rated at 778 FP16 TFLOPS with 64 GB of VRAM, in a single 15 A plug enclosure. It pairs the four GPUs with a 32-core AMD EPYC CPU, 128 GB of system RAM, a 2 TB NVMe drive, and a 1600 W PSU, at a list price of $12,000.[^16]
The green v2 uses four NVIDIA RTX 5090 (Blackwell) GPUs and is sold made-to-order at $65,000.[^17] The tiny corp has also publicly described an "exabox" research target, priced at roughly $10 million and intended for delivery in 2027, that aims for on the order of one exaflop in a single rack.[^17][^6]
| SKU | GPUs | FP16 TFLOPS | GPU memory | Price (USD) |
|---|---|---|---|---|
| tinybox red (original) | 6x Radeon RX 7900 XTX | 738[^4] | 144 GB GDDR6[^4] | $15,000[^4] |
| tinybox green (original) | 6x GeForce RTX 4090 | 991[^13] | 144 GB GDDR6X[^13] | $25,000[^14] |
| tinybox pro | 8x GeForce RTX 4090 | not published | not published | $40,000[^15] |
| tinybox red v2 | 4x Radeon RX 9070 XT | 778[^16] | 64 GB[^16] | $12,000[^16] |
| tinybox green v2 | 4x GeForce RTX 5090 | not published | not published | $65,000[^17] |
Community coverage has informally associated the "blue" name with a planned AMD Instinct MI300X data-center configuration. The tiny corp has not published a formal blue tinybox product page, so that SKU should not be treated as confirmed for retail sale.[^4]
Rather than running a traditional engineering interview process, the tiny corp posts cash bounties on its GitHub repository for tasks aligned with the project's published roadmap. The amounts are tiered roughly as: $100 for trivial fixes, $200 for a few hours of standalone work, $500 for several days of work with some prerequisites, $1,000 for changes that require refactoring core tinygrad, and larger amounts for multi-week efforts.[^18] Bounties are paid out at Hotz's discretion when a pull request is merged, with payment via USDC on Ethereum or PayPal. The same blog post that announced the seed round stated bluntly that job interviews are "obsolete" and that the only way to be hired at the tiny corp is to submit high-quality pull requests.[^3][^18]
Many of the public bounties target AMD GPU support specifically, including matrix-instruction "MMAPEAK" benchmarks on the 7900 XTX and 9070 XT, an ACO shader compiler backend, and improvements to the matching engine that schedules fused kernels.[^18] Other bounties touch concerns familiar from any compiler project: faster pattern-matching in the rewrite engine, missing fusion opportunities, and edge cases in shape arithmetic. Hotz has been explicit in public discussions that bounties are not paid for pull requests that introduce serious hacks or unmaintainable code, even if they superficially solve the requested problem, and that the bar for what counts as "clean" is set by his own review.[^18][^22]
Since 2023 the tiny corp has framed itself, in Hotz's own posts and livestreams, as the most credible third party trying to make AMD GPUs competitive with NVIDIA for ML.[^3][^5] The work has two parts.
First, tinygrad's lowering pipeline targets AMD shader ISAs directly and, in the most recent code, no longer requires LLVM as a dependency to do so.[^9] Second, tinygrad implements its own user-space driver, sometimes called "AM," that submits work to RDNA3 GPUs via PM4 packets while largely bypassing the Micro Engine Scheduler firmware. According to Hotz's posts and Phoronix coverage, this gives tinygrad reproducible kernel launches and avoids stability problems that the team encountered with AMD's stock ROCm/HSA stack.[^5][^9]
The "show me the kernel" rhetorical move is the user-facing counterpart of this work: because the entire compiler is in Python and visible, a developer can set DEBUG=4 and see the code that tinygrad generated for each kernel, then file targeted bounties or patches against the lowering stage. Hotz argues that this transparency is the structural advantage that lets a small team make an AMD stack work where larger projects have struggled.[^9][^11]
A typical ML training loop on a 7900 XTX, according to Hotz's published notes on the tinygrad/7900xtx work, resubmits the same roughly 100-millisecond run-queue containing the fused kernels for forward, backward, and optimizer steps, pointing at different input buffers across iterations. Hotz has stated that this submission pattern, executed by tinygrad's own Python-resident driver, outperforms the conventional ROCm-mediated path on the same hardware while needing no privileged firmware updates.[^5] The work is framed as a deliberate "sovereign" stance: the entire stack from Tensor API down to the GPU command processor is owned by tinygrad, so a developer with a hardware question can read every line involved without consulting an opaque vendor binary.[^9][^5]
The tiny corp has emphasized two public benchmarking efforts:
Independent reviews by Tom's Hardware noted that the original tinybox red delivered roughly 37% of an H100's compute performance but with more aggregate memory (144 GB versus 80 GB) and higher aggregate memory bandwidth (5.76 TB/s versus 3.35 TB/s) for the price.[^4] The framework's claimed "1,000x smaller" code footprint compared with PyTorch plus CUDA plus LLVM is repeatedly cited by Hotz and discussed at length in Hacker News commentary, although critics note that the comparison excludes the vendor-supplied compilers that consume the source code tinygrad emits.[^9]
The MLPerf reference implementations maintained inside tinygrad cover image classification, object detection, language-model training, and text-to-image generation; their presence in the repository is one way the tiny corp demonstrates that the framework can execute industry-standard workloads end-to-end on its hardware. Documentation summaries describe these reference implementations as live tests for the scheduler and lowering pipeline rather than only as benchmark submissions, so changes to the IR or to a backend are exercised against the same workloads used to assess external claims.[^19] Tom's Hardware coverage of the production launch reported that the tiny corp had taken 583 pre-orders ahead of the first 100-unit production run, and that the marginal cost of building a tinybox red was around $10,000 against its $15,000 retail price.[^4]
tinygrad's most prominent production user is comma.ai's openpilot, an open-source advanced driver-assistance system that supports more than 300 car models. As of the openpilot 0.9.8 and 0.9.9 release cycles, the driving model and the driver-monitoring model run end-to-end on tinygrad using its QCOM Adreno backend on the comma 3X device.[^8] comma.ai's release notes describe the migration from the older thneed runtime and Qualcomm SNPE pipeline to tinygrad as both shrinking the code base and reducing dependencies, while leaving headroom to ship larger driving models in the future. The same notes mention that tinygrad will support plugging an external GPU into the comma 3X's auxiliary USB-C port so that the device can run models that exceed the on-board Adreno's capacity.[^8]
Beyond comma.ai, the project is widely used as an educational and prototyping framework because of its small size. Developer write-ups commonly recommend tinygrad for learning the internals of automatic differentiation, for running smaller Llama and Stable Diffusion models on consumer hardware, and for experimenting with new accelerators since each backend can be written in well under a thousand lines of code.[^21][^1] The project's own examples directory ships reference implementations of LLaMA, Stable Diffusion, Whisper, and YOLO families of models, which contributors and reviewers use as starting points when bringing up new hardware.[^1]
tinygrad's minimalism comes with trade-offs that contributors and outside reviewers regularly discuss:
| Framework | Primary language | Eager vs. lazy | Distinguishing trait |
|---|---|---|---|
| PyTorch | Python + C++ | Eager (with torch.compile for graph mode) | De facto industry standard, huge operator surface |
| JAX | Python + XLA | Traced/lazy via jit | Functional API, XLA backend, TPUs |
| GGML / llama.cpp | C / C++ | Eager | Minimal CPU/Metal inference for LLMs |
| tinygrad | Python | Lazy by default | Visible compiler, sovereign AMD stack, tiny code base |