ChipNeMo
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,745 words
ChipNeMo is a research project and a family of domain-adapted large language models developed by NVIDIA to assist with industrial semiconductor and chip-design tasks. Rather than deploying general-purpose models off the shelf, ChipNeMo continues training Meta's Llama 2 models on a corpus of proprietary hardware-design data and documentation, then aligns and augments them for three concrete chip-design applications: an engineering assistant chatbot, electronic design automation (EDA) tool script generation, and bug summarization and analysis. The work was introduced in the paper "ChipNeMo: Domain-Adapted LLMs for Chip Design," first posted to arXiv on 31 October 2023 (arXiv:2311.00176) and revised through April 2024. [1][2]
The project is notable as an early, detailed industrial case study showing that domain adaptation can let a comparatively small model match or exceed much larger general-purpose models on specialized tasks, an argument with broad relevance to the economics of deploying LLMs in narrow professional domains. The paper has 42 NVIDIA authors, with equal-contribution lead authors including Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, and Rongjian Liang; the effort was led by Haoxing (Mark) Ren, NVIDIA's director of design-automation research. [1][3]
Electronic design automation tools and algorithms have driven decades of gains in chip-design productivity, enabling system-on-chip designs with billions of transistors. Despite this, many time-consuming tasks that involve natural language or programming languages, such as writing tool scripts, answering engineering questions, generating reports, and triaging bugs, had not been automated. NVIDIA's researchers argued that modern LLMs offered an opportunity to address these language-related tasks. [1]
Two considerations motivated a domain-adapted approach rather than direct use of commercial models. First, prior domain-specific efforts such as BloombergGPT in finance and BioMedLLM in biomedicine had shown that specialized models can outperform general models on in-domain tasks. Second, sending proprietary chip-design data to third-party LLM APIs poses security and confidentiality risks, while training a domain-specific model entirely from scratch is prohibitively expensive, often requiring millions of GPU hours. ChipNeMo's premise was that continued pretraining of an existing foundation model could capture most of the benefit at a small fraction of that cost. [1]
The name reflects the project's two roots: the chip-design domain and NVIDIA's NeMo framework, which was used for all model training. [1]
ChipNeMo combines four domain-adaptation techniques applied on top of a pretrained base model, as illustrated in the paper's training flow: domain-adaptive tokenization, domain-adaptive pretraining, model alignment with domain-specific instructions, and retrieval-augmented generation with a fine-tuned retrieval model. The paper's headline finding is that domain-adaptive pretraining (DAPT) was the main driver of improved performance on domain tasks. [1]
DAPT was applied to the Llama 2 family at three sizes: 7B, 13B, and 70B parameters. Each ChipNeMo foundation model is initialized from the weights of the corresponding pretrained Llama 2 model and then further pretrained on chip-design data, producing ChipNeMo-7B, ChipNeMo-13B, and ChipNeMo-70B. [1]
LLM tokenizers convert text into token sequences. Retraining a tokenizer from scratch would invalidate the pretrained foundation model, so ChipNeMo instead augments the existing Llama 2 tokenizer. The team trains a new tokenizer on domain data, identifies tokens that are absent from the general-purpose tokenizer and rare in general text (for example keywords common in register-transfer-level code), and adds only those tokens, initializing their embeddings using the general-purpose tokenizer as a guide. This augmentation reduced the domain-data token count by up to 3.3 percent without hurting downstream effectiveness. [1]
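The augmentation step described above can be sketched as follows. This is a minimal illustration rather than NVIDIA's implementation: the toy vocabularies, the greedy longest-match `tokenize` helper, and the choice to initialize each new embedding as the mean of the embeddings of the general-tokenizer pieces it splits into are all assumptions made for the example. The sketch also omits the check that a candidate token is rare in general text.

```python
from statistics import fmean

# Hypothetical general-purpose vocabulary with toy 2-dimensional embeddings.
general_embeddings = {
    "always": [1.0, 0.0],
    "_": [0.0, 1.0],
    "ff": [2.0, 2.0],
    "en": [0.5, 0.5],
    "d": [0.0, 0.0],
    "module": [3.0, 1.0],
}

# Hypothetical vocabulary learned by a tokenizer trained on domain data.
domain_vocab = ["always_ff", "endmodule", "module"]


def tokenize(text, vocab):
    """Greedy longest-match split of text into known vocabulary pieces."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return pieces


def augment(general, domain_tokens):
    """Return a copy of the general vocabulary with missing domain tokens added."""
    augmented = dict(general)
    for token in domain_tokens:
        if token in general:
            continue  # already covered by the general-purpose tokenizer
        pieces = tokenize(token, general)
        # Assumed initialization: average the embeddings of the pieces, per dimension.
        augmented[token] = [fmean(dim) for dim in zip(*(general[p] for p in pieces))]
    return augmented


augmented = augment(general_embeddings, domain_vocab)
print(tokenize("always_ff", general_embeddings))  # three pieces before augmentation
print(tokenize("always_ff", augmented))           # one piece after augmentation
```

In this toy setup the RTL keyword `always_ff` shrinks from three tokens to one, which is the kind of saving that adds up to the reported reduction in domain-data token count. Existing tokens keep their embeddings, so the pretrained model remains valid.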
The domain pretraining corpus was assembled from proprietary hardware-related code (such as software, RTL, and verification testbenches) and natural-language data (such as hardware specifications and documentation), supplemented with a sample of publicly available natural-language and code data drawn from sources correlated with Llama 2's pretraining mix in order to preserve general capabilities. After cleaning, deduplication, and a data blend that upsampled design documentation and human-written EDA scripts, the training corpus totaled roughly 24 billion tokens (about 23.1 billion in the domain set). Training ran for one epoch using the standard autoregressive objective in the NeMo framework with tensor parallelism and flash attention. Critically, DAPT required only about 1.5 percent of the compute used for the original pretraining, making it far cheaper than training a domain model from scratch. [1]
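A rough sanity check on the compute figure is possible with simple arithmetic. The sketch below assumes that, for a fixed model size, training cost scales with the number of tokens processed, and it takes Llama 2's publicly reported pretraining budget of about 2 trillion tokens as the baseline. That baseline is not stated in this article, and the paper's own figure may be measured differently (for example in GPU hours), so the result should only be expected to land in the same range.

```python
# Assumption: Llama 2 was pretrained on roughly 2 trillion tokens.
LLAMA2_PRETRAINING_TOKENS = 2.0e12
# From the text above: one epoch over roughly 24 billion tokens.
DAPT_TOKENS = 24.0e9


def token_ratio(dapt_tokens, pretraining_tokens):
    """Fraction of the original token budget consumed by continued pretraining."""
    return dapt_tokens / pretraining_tokens


ratio = token_ratio(DAPT_TOKENS, LLAMA2_PRETRAINING_TOKENS)
print(f"{ratio:.1%}")  # prints 1.2%
```

The token ratio of about 1.2 percent is of the same order as the roughly 1.5 percent compute figure reported for DAPT.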
Because foundation models are completion models with limited chat ability, ChipNeMo applies an alignment step to produce chat models. The team used a large body of publicly available general-purpose chat instruction data together with a small, expert-crafted domain-specific instruction set of 1,430 examples spanning design knowledge (302), EDA script generation (480), and bug summarization (648). NVIDIA found that alignment on general chat data was largely adequate to align the model to chip-design queries, and that adding the small domain instruction set improved results further. The paper reports both traditional supervised fine-tuning (SFT) and NVIDIA's SteerLM alignment method; SteerLM improved the chatbot's human-evaluation score by 0.62 points on a 7-point scale over SFT, while SFT on the additional 1.4K domain instructions improved EDA-script-generation correctness by 18 percent. [1]
For the engineering-assistant chatbot, ChipNeMo adds retrieval-augmented generation (RAG), an open-book approach that retrieves relevant in-domain passages from a data store to ground the model's response. A key finding was that fine-tuning a pretrained retrieval model on domain data improved the retriever's hit rate by about 30 percent over a state-of-the-art pretrained retriever and performed roughly twice as well as an unsupervised e5-small baseline, which in turn improved overall RAG response quality. The chatbot evaluation used a benchmark of 88 questions across specification, testbench, and build-infrastructure categories, backed by a store of about 1,800 documents segmented into roughly 67,000 passages. [1]
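The retrieval step and the hit-rate metric can be illustrated with a small sketch. Everything here is hypothetical: the two-dimensional vectors stand in for embeddings that a retrieval model would produce, the passage names are invented, and hit rate is assumed to mean the fraction of queries whose labeled relevant passage appears among the top k results.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, passages, k):
    """Return the ids of the k passages most similar to the query."""
    ranked = sorted(passages, key=lambda pid: cosine(query_vec, passages[pid]), reverse=True)
    return ranked[:k]


def hit_rate(queries, passages, k):
    """Fraction of queries whose relevant passage is retrieved in the top k."""
    hits = sum(1 for q in queries if q["relevant"] in top_k(q["vec"], passages, k))
    return hits / len(queries)


# Hypothetical passage embeddings (a real store would hold tens of thousands).
passages = {"spec": [1.0, 0.0], "testbench": [0.0, 1.0], "build": [1.0, 1.0]}

# Hypothetical query embeddings, each labeled with its relevant passage.
queries = [
    {"vec": [0.9, 0.1], "relevant": "spec"},
    {"vec": [0.2, 0.8], "relevant": "testbench"},
    {"vec": [0.9, 0.2], "relevant": "build"},
]

print(hit_rate(queries, passages, k=1))
print(hit_rate(queries, passages, k=2))
```

In this toy data the third query is embedded closer to the wrong passage, so it misses at k=1 and hits at k=2. Fine-tuning the retriever on domain data aims to move such query embeddings closer to their relevant passages, which raises the hit rate at a fixed k.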
The full training pipeline is summarized below.
| Stage | Input | Output | Notes |
|---|---|---|---|
| Foundation model | Trillions of tokens of internet data | Llama 2 (7B, 13B, 70B) | Meta base models |
| Domain-adaptive pretraining (DAPT) | ~24B tokens of chip-design docs and code | ChipNeMo foundation models | ~1.5% of pretraining compute; augmented tokenizer applied at this stage |
| Model alignment (SFT / SteerLM) | General chat instructions + 1,430 domain instructions | ChipNeMo chat models | Turns completion models into chat models |
| Inference augmentation | Fine-tuned domain retrieval model | RAG-grounded responses | Used for the chatbot |
The paper evaluates ChipNeMo on three applications selected for NVIDIA's GPU ASIC and architecture design teams. [1]
| Application | Task | Notable detail |
|---|---|---|
| Engineering assistant chatbot | Answer questions about internal hardware designs and explain complex design topics | Uses RAG over internal documents; reached a score of 6.0 on a 7-point scale in expert evaluation |
| EDA script generation | Generate scripts for VLSI timing-analysis tools from English specifications | Targets two domain-specific tools, one Python-based and one Tcl-based; over 70% correctness on simple scripts |
| Bug summarization and analysis | Summarize lengthy bug reports and recommend task assignment | Three sub-tasks: technical summary, managerial summary, and assignment recommendation; expert ratings above 5 on a 7-point scale |
The engineering assistant chatbot responds to questions about GPU architecture and design and helps engineers locate technical documents quickly. The EDA script generator targets two internal tools used for VLSI timing analysis, producing short scripts (on the order of 10 to 20 lines) in the tools' Python and Tcl interfaces; in NVIDIA's account it was an in-development assistant meant to integrate with existing design software. The bug-summarization tool condenses long comment histories from NVIDIA's internal bug-tracking system using hierarchical summarization and was described by NVIDIA as the most well-received of the three in early use. [1][2]
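Hierarchical summarization of a long comment history can be sketched as repeated rounds of chunk-level summarization. The code below is an illustrative assumption about the general technique, not a description of NVIDIA's tool: `hierarchical_summary`, `chunk`, and the chunk size are hypothetical, and `truncating_summarizer` is a stand-in for a call to a language model.

```python
def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def truncating_summarizer(texts, max_words=12):
    """Placeholder summarizer: keeps the first words of the joined texts.

    A real system would call a language model here.
    """
    words = " ".join(texts).split()
    return " ".join(words[:max_words])


def hierarchical_summary(comments, summarize, chunk_size=3):
    """Summarize groups of comments, then summarize the summaries, until one remains."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 so each pass shrinks the input")
    level = list(comments)
    while len(level) > 1:
        level = [summarize(group) for group in chunk(level, chunk_size)]
    return level[0] if level else ""


comments = [f"comment {i}: details about the failure" for i in range(7)]
print(hierarchical_summary(comments, truncating_summarizer))
```

With seven comments and a chunk size of three, the first pass produces three partial summaries and the second pass merges them into one. Each call to the summarizer therefore sees a bounded amount of text, which is the reason to use this structure when a bug history exceeds a model's context window.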
ChipNeMo's central result is that domain adaptation lets smaller models compete with much larger ones on chip-design tasks. Domain-adapted ChipNeMo models substantially outperformed the corresponding vanilla Llama 2 models on both multiple-choice domain benchmarks (the paper's "AutoEval" suite) and human evaluations, without degrading generic capabilities. NVIDIA reported that custom ChipNeMo models with as few as 13 billion parameters match or exceed the performance of general-purpose models as large as Llama 2 70B on chip-design tasks. [1][2]
At the top of the size range, ChipNeMo-70B (using SteerLM alignment) outperformed the much larger and more capable GPT-4 on two of the three use cases, the engineering assistant chatbot and EDA script generation, while remaining competitive on bug summarization, where GPT-4 still led. On the chatbot benchmark, ChipNeMo-70B-Steer beat the similarly sized Llama 2 70B-Chat by 3.31 points (model-only) and 1.81 points (with RAG) on the 7-point scale, and RAG improved all evaluated models. The best ChipNeMo model also surpassed GPT-3.5 across the design and bug benchmarks. NVIDIA summarized the practical payoff as enabling up to a 5x reduction in model size with similar or better performance on a range of design tasks, achieved with only about 1.5 percent additional pretraining compute. [1][3]
ChipNeMo is frequently cited as a concrete demonstration that organizations with valuable proprietary data can build effective, cost-efficient domain-specific LLMs by continued pretraining rather than training from scratch or relying solely on prompting of large general models. Because a smaller domain-adapted model can match a model several times its size, the approach reduces both the memory footprint and the inference cost of deploying such assistants at scale, an attractive property for embedding LLMs into internal engineering workflows. NVIDIA chief scientist Bill Dally framed the work as an important first step in applying LLMs to the complex task of designing semiconductors, while noting that highly specialized fields can use internal data to train useful generative models. [2]
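The memory argument can be made concrete with a back-of-the-envelope calculation. The sketch assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, the key-value cache, and any quantization; these assumptions are for illustration and do not come from the paper.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored in 16-bit precision


def weight_memory_gb(params_billion, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


small = weight_memory_gb(13)  # a 13B-parameter model
large = weight_memory_gb(70)  # a 70B-parameter model
print(small, large, round(large / small, 1))  # prints 26.0 140.0 5.4
```

Under these assumptions a 13B model needs about 26 GB for its weights against about 140 GB for a 70B model, a ratio of roughly 5.4 that is consistent with the reported reduction in model size of up to 5x.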
Beyond the chip-design setting, the recipe of domain-adaptive tokenization, continued pretraining, instruction alignment, and a fine-tuned retriever has informed NVIDIA's broader work on customizing foundation models, and the techniques were later packaged into tutorials and tooling around the NeMo framework and related model families such as Llama Nemotron and Minitron. The project sits alongside contemporaneous domain-specific efforts in finance and biomedicine as evidence that targeted adaptation, rather than ever-larger general models, can be the more practical path for specialized professional applications. [1][2]