Etched Sohu
Last reviewed
May 17, 2026
Sources
24 citations
Review status
Source-backed
Revision
v1 · 3,502 words
| Etched Sohu | |
|---|---|
| General information | |
| Manufacturer | Etched |
| Country of origin | United States |
| Announced | June 25, 2024 |
| Status | Pre-production (not yet shipping as of May 2026) |
| Architecture | Transformer-only ASIC |
| Process node | TSMC 4nm (N4 family) |
| Memory | 144 GB HBM3E |
| Memory bandwidth | Approximately 4,800 GB/s (claimed) |
| Server configuration | 8x Sohu chips per node |
| Claimed throughput | 500,000+ tokens/sec on Llama-3 70B (8-chip server) |
| Performance claim vs H100 | ~20x faster on transformer inference (Etched estimate) |
| Software stack | Open-source compiler, drivers, kernels, serving stack |
| Website | etched.com |
Sohu is a transformer-specialized application-specific integrated circuit (ASIC) developed by Etched, a San Jose-based AI hardware startup founded in 2022. Announced on June 25, 2024, Sohu is the first commercially marketed chip designed to run only one neural network architecture: the transformer. By hardcoding the transformer architecture into silicon rather than offering general-purpose programmable compute, Etched claims Sohu can deliver roughly 20 times the inference throughput of an NVIDIA H100 GPU on large language models such as Llama-3 70B, while using significantly less energy.
The chip is fabricated on TSMC's 4-nanometer process technology and pairs the compute die with 144 GB of HBM3E memory. According to Etched's published claims, one server fitted with eight Sohu chips can sustain more than 500,000 Llama-3 70B tokens per second, compared with roughly 23,000 tokens per second for an eight-GPU H100 server and approximately 45,000 tokens per second for an eight-GPU Blackwell B200 server. Etched argues that this performance comes from achieving over 90 percent floating-point unit utilization, against approximately 30 percent for general-purpose GPUs running attention-heavy workloads.
Despite the headline numbers, as of May 2026 Sohu has not shipped in volume to external customers, and no independent third-party benchmarks have been published. The company has, however, attracted significant attention and capital. Etched raised a $120 million Series A in June 2024 led by Primary Venture Partners and Positive Sum Ventures with participation from Peter Thiel, GitHub CEO Thomas Dohmke, and former Coinbase CTO Balaji Srinivasan. In January 2026 it closed an approximately $500 million growth round led by Stripes with participation from Peter Thiel, Positive Sum, and Ribbit Capital, valuing the company at roughly $5 billion and bringing publicly disclosed funding to approximately $620 million.
Etched was founded in 2022 by Gavin Uberti, Chris Zhu, and Robert Wachen, three undergraduates who left Harvard University to build a dedicated transformer accelerator. Uberti, the company's chief executive officer, had previously worked at OctoML and Xnor.ai on inference optimization. Chris Zhu holds degrees in mathematics and computer science from Harvard. Robert Wachen previously co-founded Prod, a startup accelerator. The founders are alumni of the Thiel Fellowship and built the first prototypes of Sohu from a dorm room before raising institutional capital.
The core thesis behind Sohu is that the transformer architecture, introduced by Google researchers in the 2017 paper "Attention Is All You Need," has become so dominant for large language models, diffusion models, and increasingly for vision and video generation, that there is enough scale to justify a chip that runs nothing else. Etched's argument is that general-purpose accelerators waste most of their transistor budget on flexibility the market no longer needs, since modern frontier AI workloads are almost entirely transformer-based.
Etched is headquartered in San Jose, California, with a satellite presence in Cupertino. The company partners with TSMC's Emerging Businesses Group for fabrication and with Rambus on HBM controller IP. Engineering hires have come from Cypress Semiconductor, Broadcom, Apple, and NVIDIA. Sohu was unveiled on June 25, 2024, alongside Etched's $120 million Series A, with the marketing line "Meet Sohu, the fastest AI chip of all time." The announcement drew coverage from TechCrunch, CNBC, Tom's Hardware, and The Wall Street Journal.
A conventional GPU such as the H100 or B200 is a programmable parallel processor with thousands of generic compute units (CUDA cores, tensor cores) plus a fixed-function memory hierarchy. The compiler maps any neural network onto those units at runtime. By contrast, Sohu's compute fabric is laid out as a pipeline of dedicated transformer blocks. Each block contains hardware tailored to one stage of the transformer forward pass:
Because these stages are fixed in hardware, Etched does not need to emit instructions to schedule them or pay for the area of programmable issue logic. The control path is dramatically simpler than that of a GPU, and the same transistor budget can host far more useful arithmetic units. Etched claims that this is what allows Sohu to reach more than 90 percent floating-point unit utilization on transformer workloads, against roughly 30 percent for GPUs where unit utilization is gated by memory bandwidth, kernel launch overheads, and attention sparsity.
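One rough way to read these utilization figures is to split the claimed ~20x speedup into a utilization factor and a residual factor. The sketch below assumes throughput scales as peak arithmetic capacity times utilization, which is a simplification introduced here for illustration and not Etched's stated model; the function names are hypothetical.

```python
# Figures quoted in the text; the decomposition is an illustrative assumption.
SOHU_UTILIZATION = 0.90  # claimed FPU utilization on transformer workloads
GPU_UTILIZATION = 0.30   # approximate utilization quoted for GPUs
CLAIMED_SPEEDUP = 20.0   # Etched's ~20x claim versus an H100


def utilization_gain(sohu_util: float, gpu_util: float) -> float:
    """Speedup attributable to higher utilization alone."""
    return sohu_util / gpu_util


def implied_peak_ratio(speedup: float, sohu_util: float, gpu_util: float) -> float:
    """Residual factor that must come from extra arithmetic units
    (or other effects), ASSUMING throughput = peak capacity * utilization."""
    return speedup / utilization_gain(sohu_util, gpu_util)


gain = utilization_gain(SOHU_UTILIZATION, GPU_UTILIZATION)
residual = implied_peak_ratio(CLAIMED_SPEEDUP, SOHU_UTILIZATION, GPU_UTILIZATION)
print(f"utilization gain: {gain:.2f}x, residual factor: {residual:.2f}x")
```

Under this simplified model, utilization accounts for about 3x of the claimed speedup, so a further factor of roughly 6.7x would have to come from the additional arithmetic units that the simpler control path frees up, or from effects the model ignores.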
Sohu's compute path is optimized for FP8. The headline 500,000 tokens per second figure is reported for Llama-3 70B running in FP8 with 2,048 input tokens and 128 output tokens. The chip also supports INT8 for quantized models. Etched has not published a peak TFLOPS number, instead reporting tokens-per-second on specific reference models, which has drawn criticism from analysts who prefer architecture-neutral metrics.
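Because the benchmark is defined by request shape, the headline figure can be translated into requests per second. The source does not say whether the 500,000 tokens per second counts input plus output tokens or output tokens only, so the sketch below computes both readings; both conventions and all names here are assumptions for illustration.

```python
# Benchmark shape quoted in the text.
INPUT_TOKENS = 2_048
OUTPUT_TOKENS = 128
HEADLINE_TOKENS_PER_SEC = 500_000


def requests_per_second(tokens_per_sec: float, tokens_counted_per_request: int) -> float:
    """Requests served per second for a given token-counting convention."""
    return tokens_per_sec / tokens_counted_per_request


tokens_per_request = INPUT_TOKENS + OUTPUT_TOKENS

# Reading A (assumption): the headline counts input and output tokens.
rate_all_tokens = requests_per_second(HEADLINE_TOKENS_PER_SEC, tokens_per_request)

# Reading B (assumption): the headline counts generated tokens only.
rate_output_only = requests_per_second(HEADLINE_TOKENS_PER_SEC, OUTPUT_TOKENS)

# Share of each request that is prompt (prefill) tokens.
prefill_share = INPUT_TOKENS / tokens_per_request

print(f"reading A: {rate_all_tokens:.1f} requests/s")
print(f"reading B: {rate_output_only:.1f} requests/s")
print(f"prefill share of tokens: {prefill_share:.1%}")
```

The two readings differ by a factor of 17, and about 94 percent of the tokens in each request are prompt tokens, which is why the choice of input and output lengths matters when comparing tokens-per-second figures across vendors.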
Sohu is described as a reticle-limit die, roughly 800 square millimeters, fabricated on a 4 nanometer process node from TSMC. The die is connected via interposer to six HBM3E stacks totaling 144 GB of capacity. Memory bandwidth is reported as approximately 4,800 GB/s, roughly 1.4 times the H100's 3,350 GB/s.
| Specification | Value |
|---|---|
| Process node | TSMC 4nm (N4 family) |
| Die size | Reticle limit (~800 mm²) |
| On-package memory | 144 GB HBM3E |
| Memory bandwidth | ~4,800 GB/s |
| Primary numeric format | FP8 (also INT8) |
| Claimed FLOPS utilization | >90% on transformer inference |
| Server configuration | 8x Sohu per node |
| Headline benchmark | 500,000+ tokens/sec on Llama-3 70B (8x Sohu) |
The 144 GB memory budget is enough to fit a 70-billion-parameter model at FP8 weights (about 70 GB) with substantial room for the KV cache and activations, which is essential for serving long context windows. An 8-chip server can hold a 400-billion to 600-billion parameter model with tensor parallelism, making Sohu well suited to the mixture-of-experts variants now dominant in frontier deployments.
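The memory arithmetic above can be sketched as follows. The Llama-3 70B shape used here (80 layers, 8 key-value heads, head dimension 128) comes from the public model configuration and is not stated in Etched's material; storing the KV cache at one byte per value is also an assumption, and activations and runtime overhead are ignored, so the token count is an upper bound.

```python
GB = 10**9  # decimal gigabytes, matching the figures in the text

HBM_PER_CHIP_GB = 144
CHIPS_PER_SERVER = 8

# ASSUMED Llama-3 70B shape (public model config, not an Etched figure).
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128


def weight_bytes(params: int, bytes_per_param: int = 1) -> int:
    """Weight footprint; FP8 stores one byte per parameter."""
    return params * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 1) -> int:
    """KV-cache bytes per token; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


weights = weight_bytes(70 * 10**9)
free_on_one_chip = HBM_PER_CHIP_GB * GB - weights
per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
max_cached_tokens = free_on_one_chip // per_token  # upper bound, no activations

server_capacity_gb = HBM_PER_CHIP_GB * CHIPS_PER_SERVER
left_after_600b_gb = server_capacity_gb - weight_bytes(600 * 10**9) // GB

print(f"free after weights on one chip: {free_on_one_chip // GB} GB")
print(f"KV cache per token: {per_token} bytes")
print(f"upper bound on cached tokens: {max_cached_tokens}")
print(f"server HBM: {server_capacity_gb} GB, left after 600B FP8: {left_after_600b_gb} GB")
```

With these assumptions, one chip has 74 GB left after the weights, and an 8-chip server has 1,152 GB in total, which leaves headroom even for a 600-billion-parameter model at FP8.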
Because Sohu only runs transformers, Etched's software is much narrower than that of a general accelerator. There is no equivalent of CUDA and no general kernel programming model. Instead, the stack consists of:
Etched has emphasized that the stack is open source, which the company hopes will accelerate adoption by allowing frontier labs to audit and modify the compiler. The stack natively handles modern transformer variants, including grouped-query attention, multi-query attention, sliding window attention, rotary position embeddings, parallel attention and feed-forward layers, and mixture-of-experts routing. Etched has stated that future architectures that remain within the transformer family will be supported via firmware and compiler updates, but that a new chip will be required for any architecturally distinct successor to the transformer, such as a pure state space model or recurrent-style network.
Etched's headline performance figure is 500,000 tokens per second for Llama-3 70B running on a single 8x Sohu server in FP8 precision, with 2,048 input tokens and 128 output tokens per request. The same benchmark on equivalent eight-GPU servers yields, by Etched's measurements, roughly:
| Platform | Llama-3 70B (FP8) tokens/sec | Ratio to Sohu |
|---|---|---|
| 8x Sohu (server) | ~500,000 | 1.0x |
| 8x NVIDIA B200 (Blackwell) | ~45,000 | ~0.09x |
| 8x NVIDIA H100 (Hopper) | ~23,000 | ~0.046x |
| 8x NVIDIA A100 | ~9,000 | ~0.018x |
Etched concludes that a single 8x Sohu server is equivalent to about 160 H100 GPUs on transformer inference. Critics have raised several concerns: the H100 numbers are for unoptimized stock paths while Sohu uses its own optimized stack; the benchmark uses input and output lengths favorable to high prefill throughput; no third party has measured Sohu in physical form; and Etched has not released power numbers. Independent commentators have noted that even if Sohu delivers half the claimed throughput, it would still represent a meaningful architectural advance, since the 90 percent FLOPS utilization figure is consistent with what a fixed transformer pipeline could in principle achieve.
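The ratios in the table and the H100-equivalence figure can be checked with a few lines of arithmetic. The sketch below uses only the throughput numbers Etched reports; the dictionary keys and function names are illustrative.

```python
# Tokens per second on Llama-3 70B (FP8), as reported by Etched.
THROUGHPUT = {
    "8x Sohu": 500_000,
    "8x B200": 45_000,
    "8x H100": 23_000,
    "8x A100": 9_000,
}
GPUS_PER_SERVER = 8


def ratio_to_sohu(platform: str) -> float:
    """Throughput of a platform as a fraction of the 8x Sohu server."""
    return THROUGHPUT[platform] / THROUGHPUT["8x Sohu"]


def h100_gpus_equivalent() -> float:
    """Number of individual H100 GPUs matching one 8x Sohu server."""
    per_gpu = THROUGHPUT["8x H100"] / GPUS_PER_SERVER
    return THROUGHPUT["8x Sohu"] / per_gpu


for name in THROUGHPUT:
    print(f"{name}: {ratio_to_sohu(name):.3f}x of Sohu")

server_speedup = THROUGHPUT["8x Sohu"] / THROUGHPUT["8x H100"]
print(f"server-level speedup vs H100: {server_speedup:.1f}x")
print(f"H100 GPUs per Sohu server: {h100_gpus_equivalent():.0f}")
print(f"same, using the rounded 20x claim: {GPUS_PER_SERVER * 20}")
```

The table's own figures give a server-level speedup of about 21.7x and roughly 174 H100 GPUs per Sohu server; the 160-GPU figure corresponds to the rounded 20x claim multiplied by eight chips.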
Sohu enters a crowded field of AI accelerator startups, each of which has made a different bet about the right architectural specialization point.
| Chip / system | Vendor | Strategy | Specialization |
|---|---|---|---|
| H100, B200, GB200 | NVIDIA | General-purpose GPU plus tensor cores | Wide: training and inference, all model classes |
| TPU v5p / v6 | Google | Systolic-array matrix engine | Training and inference, internal use plus cloud |
| Trainium 2 / Inferentia 2 | AWS | Custom AI ASIC | Training and inference inside AWS |
| MTIA | Meta | Custom inference ASIC | Internal recommendation and language model inference |
| Groq LPU | Groq | Deterministic streaming compute with on-chip SRAM | Ultra-low-latency LLM inference |
| Cerebras WSE-3 | Cerebras | Wafer-scale, in-memory compute | Training and inference for very large models |
| SambaNova SN40L | SambaNova | Reconfigurable dataflow architecture | Training and inference for foundation models |
| Taalas hardcoded models | Taalas | Each chip is a single trained model in silicon | One model per chip, extreme specialization |
| Sohu | Etched | Hardcoded transformer architecture (not model) | Transformer inference only |
Several distinctions are worth highlighting:
Etched has raised approximately $620 million across two publicly disclosed rounds.
| Round | Date | Amount | Lead investor | Notable participants |
|---|---|---|---|---|
| Series A | June 25, 2024 | $120 million | Primary Venture Partners, Positive Sum Ventures | Peter Thiel, Thomas Dohmke (GitHub), Balaji Srinivasan, Amjad Masad (Replit), Kyle Vogt (Cruise), Charlie Cheever (Quora) |
| Growth round | January 2026 | ~$500 million | Stripes | Peter Thiel, Positive Sum, Ribbit Capital |
The January 2026 round, reported by The Information and corroborated by Reuters, valued Etched at roughly $5 billion. At the time of the round, the company had been operating for about three and a half years and had not yet shipped Sohu to external customers. Investors cited as their primary rationale Etched's ability to lock in TSMC 4 nanometer capacity, the strength of its founder team, and the strategic value of an alternative to NVIDIA in transformer inference. The round brought Etched's total raised to approximately $620 million. Peter Thiel participated personally in both rounds. Primary Venture Partners has positioned Etched as the centerpiece of its AI hardware portfolio.
On October 31, 2024, Etched and Decart, an Israeli AI startup, jointly released Oasis, an interactive, playable world model that generates a Minecraft-style 3D environment frame by frame from keyboard and mouse inputs. Oasis is a diffusion transformer that performs next-frame prediction, treating each frame of the game as a token to be predicted given a short history of previous frames and the player's most recent inputs. There is no game engine and no procedural world; the world exists only in the model's predictions.
The model was trained on millions of hours of Minecraft footage and runs at 20 FPS at 360p in its public demo. The architecture combines a vision transformer (ViT)-based autoencoder with a DiT (Diffusion Transformer) backbone. The public demo ran on NVIDIA H100 GPUs, but Etched described Oasis as a demonstration of the kind of workload Sohu is meant to accelerate, with future versions on Sohu targeting models exceeding 100 billion parameters at 4K resolution. The release also gave Etched a concrete reference workload (high-throughput, real-time, multimodal transformer inference) that few competing inference engines can serve well.
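The next-frame loop described above can be sketched in a few lines. This is a toy illustration only: `predict_next_frame` is a hypothetical stand-in for the diffusion transformer, frames are represented as plain integers, and the context length of four frames is an assumption, not a published Oasis parameter.

```python
from collections import deque

FPS = 20                      # frame rate of the public demo
FRAME_BUDGET_MS = 1000 / FPS  # time available to generate each frame


def predict_next_frame(history, action):
    """HYPOTHETICAL stand-in for the model: the real system would run a
    diffusion transformer over recent frames and the latest player input."""
    previous = history[-1]
    return previous + 1 if action == "forward" else previous


def play(actions, context_frames=4, first_frame=0):
    """Generate one frame per input; the only world state is the frame history."""
    history = deque([first_frame], maxlen=context_frames)
    frames = []
    for action in actions:
        frame = predict_next_frame(history, action)
        history.append(frame)  # oldest frame drops out of the context window
        frames.append(frame)
    return frames


print(f"per-frame budget at {FPS} FPS: {FRAME_BUDGET_MS:.0f} ms")
print(play(["forward", "forward", "stay"]))
```

The point of the sketch is structural: there is no separate game state, so every frame must be produced by a full model inference within the 50 ms budget that 20 FPS allows.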
As of May 2026, Etched has not publicly confirmed that Sohu has shipped in volume to external customers. The company has discussed reservation commitments with major AI labs and cloud providers, and has stated that early reference units are in the hands of select partners, but it has not named those partners or published deployment dates. Public commentary, including from Etched's own investors and the Manifold prediction market, suggests that volume shipment slipped from the original "late 2024 to early 2025" target into 2026.
Key gaps in the public record include:
The absence of these data points has fueled both skepticism (Etched is selling a bet on silicon that may not ship) and excitement (whoever is using early Sohu units is presumably under non-disclosure with Etched and is one of the largest hyperscalers or AI labs). Etched has publicly stated that it is prioritizing volume readiness over early demonstrations, and has cited the example of Groq, which scaled deployment slowly to ensure reliability, as a precedent.
| Risk | Description |
|---|---|
| Architectural | Sohu only runs transformer-family models. If the dominant architecture shifts to a state space model such as Mamba, or to a hybrid successor, Sohu's value collapses. Etched argues the transformer has been dominant for nearly a decade and that even rumored hybrids still spend most of their compute inside transformer layers. |
| Execution | Shipping a reticle-limit chip on TSMC 4nm with HBM3E typically requires hundreds of engineers and $200-300 million in NRE. Etched has fewer than 100 employees as of mid-2026. Slippage has historically killed AI chip startups (Wave Computing, others). The January 2026 round was widely interpreted as runway insurance. |
| Benchmark verification | Until independent third parties measure Sohu, 500,000 tokens per second is a marketing claim. Comparisons against unoptimized GPU baselines rather than current TensorRT-LLM or vLLM configurations are a recurring objection. |
| Software porting | Customers must port inference workloads from CUDA to Etched's compiler and serving stack. Adoption of any non-NVIDIA accelerator has been slow at hyperscale. |
| Supply chain | Sohu depends on TSMC 4nm capacity, HBM3E supply from SK Hynix or Micron, and advanced CoWoS packaging. All three are in tight supply and subject to U.S. export controls. |
Reception of Sohu has been polarized. Bullish coverage from venture capital and AI infrastructure outlets has framed Etched as the most credible attempt to date to break the NVIDIA monopoly on AI inference. Skeptical coverage has focused on three themes: that Etched is selling a chip that has not shipped against benchmarks the company controls; that the transformer architecture, while dominant, is not necessarily permanent; and that the founders' age and lack of prior chip industry track record raise execution risk. Several analysts have argued Etched will likely ship a working chip, but that hitting the headline 20x performance claim in real deployments is unlikely. The AI safety community on LessWrong and similar venues has noted that an order-of-magnitude reduction in inference cost would accelerate deployment of autonomous agents, reasoning systems, and multimodal applications, raising downstream questions about cheap-inference safety.
Etched has publicly stated that Sohu is the first chip in a multi-generation roadmap. A second-generation chip on a more advanced node is reportedly in early design, targeting both inference and prefill-heavy training workloads. The company has also discussed a smaller, lower-power inference chip for edge and on-device transformer inference. Etched executives have suggested that the addressable market for transformer-specific silicon will exceed $100 billion annually by the end of the decade if inference volumes grow as expected.