Song Han
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,557 words
Song Han is a computer scientist and an associate professor with tenure in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT), where he leads the MIT HAN Lab [1][2]. He is one of the most influential researchers in efficient deep learning, a field concerned with making neural networks smaller, faster, and cheaper to run on resource-constrained hardware. Han is best known for "Deep Compression," an early technique that shrinks neural networks through pruning and quantization, and for the Efficient Inference Engine (EIE), among the first hardware accelerators to exploit weight sparsity. He co-founded two startups that were acquired by major chipmakers, DeePhi Technology (acquired by Xilinx, now part of AMD) and OmniML (acquired by NVIDIA), and he serves as a distinguished scientist at NVIDIA [1][3].
Han earned his bachelor's degree from Tsinghua University in Beijing, China [1]. He then moved to the United States for graduate study at Stanford University, where he completed his PhD in 2017 under the supervision of Bill Dally, Stanford professor and chief scientist at NVIDIA [1][2]. Dally's group focused on the intersection of computer architecture and machine learning, and Han's doctoral research established the agenda he has pursued ever since: co-designing efficient algorithms and the hardware that runs them.
His dissertation work produced two landmark results. "Deep Compression," presented at the International Conference on Learning Representations (ICLR) in 2016, combined network pruning, trained quantization, and Huffman coding to reduce the storage required by deep neural networks by 35x to 49x without loss of accuracy, and it received a best paper award [4]. The companion hardware project, EIE (Efficient Inference Engine), presented at the International Symposium on Computer Architecture (ISCA) in 2016, was among the first accelerators to take advantage of the weight sparsity that pruning creates [5]. EIE became one of the most cited papers in the 50-year history of ISCA, and its ideas influenced later commercial hardware, including the Sparse Tensor Core in NVIDIA's Ampere GPU architecture [1].
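The first two stages of Deep Compression described above, pruning and weight sharing, can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the paper's implementation: it works on a flat list of weights, uses plain one-dimensional k-means with linearly spaced initial centroids, and leaves out both the retraining that fine-tunes the surviving weights and centroids and the final Huffman coding stage. The function names `magnitude_prune`, `share_weights`, and `decode` are hypothetical and chosen for this example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def share_weights(weights, n_clusters, n_iters=10):
    """Cluster the nonzero weights with 1-D k-means (assumes n_clusters >= 2).

    Returns the centroids (the shared codebook) and one code per weight:
    a centroid index for surviving weights, None for pruned weights.
    """
    nonzero = [w for w in weights if w != 0.0]
    lo, hi = min(nonzero), max(nonzero)
    # Linearly spaced initial centroids between the smallest and largest weight.
    centroids = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]

    def nearest(w):
        return min(range(n_clusters), key=lambda c: abs(w - centroids[c]))

    for _ in range(n_iters):
        buckets = [[] for _ in range(n_clusters)]
        for w in nonzero:
            buckets[nearest(w)].append(w)
        # Empty clusters keep their previous centroid.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]

    codes = [None if w == 0.0 else nearest(w) for w in weights]
    return centroids, codes


def decode(centroids, codes):
    """Rebuild approximate weights from the codebook and the per-weight codes."""
    return [0.0 if c is None else centroids[c] for c in codes]


weights = [0.9, -0.05, 0.02, -0.8, 0.85, 0.01, -0.75, 0.03]
pruned = magnitude_prune(weights, sparsity=0.5)
centroids, codes = share_weights(pruned, n_clusters=2)
print(pruned)
print(centroids, codes)
print(decode(centroids, codes))
```

After these two steps each surviving weight is stored as a short index into a small codebook rather than as a full-precision number, which is where most of the storage saving comes from in this toy setting.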
Han's research program spans algorithms, systems, and hardware for efficient AI. His group has contributed widely used methods across model compression, neural architecture search, on-device "TinyML," and, more recently, the acceleration of large language models and generative models.
A recurring theme is hardware-aware neural architecture search. ProxylessNAS and the Once-for-All (OFA) network let a single trained "supernetwork" be specialized into many efficient subnetworks tailored to different hardware targets without retraining each one. This line of work fed directly into TinyML, the effort to run deep learning on microcontrollers. The MCUNet system demonstrated neural network inference on devices with on the order of 1000x less memory than a typical mobile phone, bringing deep learning to low-power Internet of Things sensors at the edge.
As the field shifted toward transformers and generative AI, Han's lab produced a series of widely adopted methods for LLM efficiency. SmoothQuant introduced a training-free approach to 8-bit post-training quantization of weights and activations by migrating quantization difficulty from activations to weights [7]. AWQ (Activation-aware Weight Quantization) extended low-bit quantization to 4 bits for on-device LLM deployment by protecting the small fraction of weights that matter most; it won a best paper award at MLSys 2024 [6]. StreamingLLM showed that a phenomenon the authors called "attention sinks" lets a model generate text of effectively unbounded length using a fixed-size KV cache, enabling streaming deployment without ever-growing memory [8]. The serving system QServe pushed LLM inference to W4A8KV4 precision, and LongLoRA made it cheaper to extend the context length of pretrained models.
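The idea of migrating quantization difficulty from activations to weights can be sketched in a few lines. The example below is a minimal illustration, not the SmoothQuant codebase: it assumes a single linear layer Y = XW, toy matrices stored as nested Python lists, nonzero per-channel maxima, and a migration strength `alpha` of 0.5 chosen for illustration. The function names are hypothetical. Each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s_j and the matching weight row is multiplied by it, so the layer output is mathematically unchanged while activation outliers shrink.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

    Assumes every channel has a nonzero activation and weight maximum.
    """
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation columns by s_j and multiply the matching weight rows by s_j."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    """Plain nested-list matrix product, used only to check equivalence."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy layer: the second activation channel has large outliers.
X = [[0.5, 40.0], [-1.0, -60.0]]   # tokens x input channels
W = [[0.2, -0.4], [0.1, 0.3]]      # input channels x output channels

scales = smooth_scales(X, W, alpha=0.5)
X_s, W_s = apply_smoothing(X, W, scales)

print("scales:", scales)
print("original output:", matmul(X, W))
print("smoothed output:", matmul(X_s, W_s))
```

With `alpha` at 0.5 the largest activation and the largest weight in each channel end up with the same magnitude, so neither side carries all of the dynamic range when both are later rounded to 8 bits. The rounding step itself is omitted here.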
Several of these techniques have been adopted in industry software and hardware, including NVIDIA's TensorRT-LLM and Sparse Tensor Core, Intel's OpenVINO and Neural Compressor, AMD-Xilinx Vitis AI, Qualcomm's AIMET, and Apple's Neural Engine tooling [1]. More recent work applies the same efficiency philosophy to generative vision: SANA uses a deep compression autoencoder and a linear diffusion transformer for high-resolution image synthesis, while HART and SANA-Video target fast autoregressive image and video generation on commodity hardware.
Han's pruning research predated and helped motivate the broader sparsity literature, including the later "lottery ticket hypothesis" of Jonathan Frankle and Michael Carbin. That result is not Han's work, though it built on the pruning tradition he helped establish.
The table below summarizes several of his most influential projects.
| Work | Year | Venue | Contribution |
|---|---|---|---|
| Deep Compression | 2016 | ICLR (best paper) | Pruning + quantization + Huffman coding, 35x to 49x smaller models |
| EIE | 2016 | ISCA | Early accelerator exploiting weight sparsity |
| ProxylessNAS / Once-for-All | 2019 to 2020 | ICLR | Hardware-aware neural architecture search |
| MCUNet | 2020 | NeurIPS | Deep learning inference on microcontrollers (TinyML) |
| SmoothQuant | 2023 | ICML | Training-free 8-bit LLM quantization |
| StreamingLLM | 2024 | ICLR | Infinite-length generation via attention sinks |
| AWQ | 2024 | MLSys (best paper) | 4-bit activation-aware weight quantization |
Han joined the MIT EECS faculty in 2018 and directs the MIT HAN Lab, whose name doubles as an acronym for "Hardware, AI, and Neural-nets" [1]. The group's research is organized around three pillars: efficient generative AI (quantization, parallelization, KV-cache optimization, and long-context modeling), model compression and TinyML, and accelerating AI with sparsity [1]. The lab maintains numerous open-source projects with large followings, including StreamingLLM, AWQ, LongLoRA, EfficientViT, Once-for-All, and ProxylessNAS [1].
Han is also active as an educator. He created and teaches the MIT course 6.5940, "TinyML and Efficient Deep Learning Computing," and released the material publicly through the EfficientML.ai lecture series, which has become a widely used reference for students and practitioners entering the field [1]. He was promoted to associate professor with tenure, and the EECS profile lists his areas as artificial intelligence and machine learning, computer architecture, and integrated circuits and systems [2].
Han's research has translated directly into commercial ventures. In 2016 he co-founded DeePhi Technology, a Beijing-based startup that built FPGA-based inference accelerators around deep compression, pruning, and system-level optimization for neural networks [9]. Xilinx, which had been an investor and technology partner, acquired DeePhi in July 2018 for a reported figure of roughly 252 million US dollars [9]. Xilinx itself was later acquired by AMD in 2022, so DeePhi's technology is now part of AMD.
In 2021 Han co-founded OmniML alongside Di Wu, who served as chief executive, and Huizi Mao, who served as chief technology officer [10]. The company built a platform, branded Omnimizer, for compressing and optimizing machine-learning models so they could run efficiently on edge devices such as those used in autonomous vehicles, drones, and industrial robots [10]. NVIDIA acquired OmniML in 2023; the deal was not formally announced by NVIDIA but was widely reported, and it strengthened NVIDIA's edge-AI and model-optimization capabilities [10]. Following the acquisition, Han took on the role of distinguished scientist at NVIDIA, a position he holds alongside his MIT professorship [1][3]. NVIDIA also employs his former doctoral advisor, Bill Dally, as chief scientist.
Han has received broad recognition for his contributions to efficient AI. MIT Technology Review named him to its "35 Innovators Under 35" list in 2019, citing the Deep Compression technique for letting powerful AI run on low-power mobile devices [11]. In 2020 he received the National Science Foundation CAREER Award for efficient algorithms and hardware for accelerated machine learning, and he was named to the IEEE "AI's 10 to Watch" list [1]. He was awarded a Sloan Research Fellowship in 2023 [1].
His papers have repeatedly earned best paper awards, including at ICLR 2016 (Deep Compression), FPGA 2017, and MLSys 2024 (AWQ) [1][4][6]. He has also received a Samsung Global Research Outreach Award (2021), a Red Dot design award (2022), and multiple faculty awards from companies including NVIDIA, Sony, Meta (Facebook), and Amazon [1]. His group's neural architecture search and TinyML work has won several low-power computer vision contests at flagship AI conferences and has been covered by outlets such as MIT News, Wired, VentureBeat, and IEEE Spectrum [1].
As of 2026, Han remains an associate professor at MIT, continues to direct the MIT HAN Lab, and serves as a distinguished scientist at NVIDIA, with ongoing research on efficient generative AI and recent publications at venues including ICLR [1][2].