Hermes 4
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,162 words
Hermes 4 is a family of open-weight large language models released by Nous Research in late August 2025. It is the fourth generation of the company's Hermes line of fine-tuned, instruction-following models, and its defining feature is hybrid reasoning: each model can answer directly in a fast "non-reasoning" mode or first deliberate inside explicit <think>...</think> tags in a "reasoning" mode, with the behavior selectable at inference time. The family was post-trained on top of Meta's Llama 3.1 base models for the 70-billion-parameter and 405-billion-parameter sizes, and on Alibaba's Qwen3-14B base for the 14-billion-parameter size [1]. Nous markets Hermes 4 as a neutrally aligned, highly steerable assistant with low refusal rates, positioning it as an open alternative to the more heavily guardrailed models offered by large commercial labs [1][2][3].
Hermes 4 was announced alongside the "Hermes 4 Technical Report" (arXiv:2508.18255), authored by Ryan Teknium (known by the handle Teknium), Roger Jin, Jai Suphavadeeprasit, Dakota Mahan, Jeffrey Quesnelle, Joe Li, Chen Guang, Shannon Sands, and Karan Malhotra [1]. The full weights and supporting documentation were published on Hugging Face, a chat interface was made available at chat.nousresearch.com, and a dedicated landing page was set up at hermes4.nousresearch.com [2][3].
The family comprises three sizes: 14B, 70B, and 405B parameters. All three are reasoning models in the sense that they can produce an internal chain of thought before answering, but unlike "always-on" reasoning systems, the reasoning step is optional and toggled by the user, which is why Nous describes them as "hybrid" reasoners [1]. The stated design goals are advanced reasoning in mathematics, code, and STEM, combined with broad general-purpose instruction following, reliable structured output such as JSON and schema adherence, tool use and function calling, strong creative writing and roleplay, and a marked reduction in unnecessary refusals [1][3].
Nous Research began in 2023 as an open-source AI collective formed by researchers collaborating online, and later incorporated as a company. Its co-founders include Jeffrey Quesnelle, Karan Malhotra, and Teknium. The group built its reputation by producing high-quality, fine-tuned variants of openly available base models, starting with Meta's Llama 2 [4][5]. In April 2025, Nous raised a 50 million US dollar Series A round led by the venture firm Paradigm, reported at a 1 billion US dollar token valuation, to fund its broader push into decentralized AI training, including its Psyche network and the DisTrO and DeMo distributed-training research [6][7].
The Hermes name has run across several generations. Early Nous-Hermes fine-tunes were built on Llama 2 at 7B, 13B, and 70B scales. Hermes 2 expanded the family across Llama 2 and Mistral base models, and Hermes 2 Pro, released in 2024, added a dedicated in-house function-calling and JSON-mode dataset that made structured output and tool use a core strength of the line [4]. Hermes 3 followed on August 16, 2024, as a set of full-parameter fine-tunes of Llama 3.1 at 8B, 70B, and 405B; the 405B version was described as the first publicly available full-parameter fine-tune of Llama 3.1 405B, and it was accompanied by its own technical report (arXiv:2408.11857) [5][8]. Hermes 4 continues this lineage and explicitly retains a significant portion of the Hermes 3 dataset to preserve earlier capabilities, while adding hybrid reasoning as the headline new feature [1][5].
Hermes 4 is not a from-scratch pretraining effort but a post-training (fine-tuning) program applied to existing open base models. According to the technical report, the team trained the 405B and 70B models starting from the corresponding Llama 3.1 checkpoints, using a modified version of the TorchTitan training framework. The 14B model was trained from the Qwen3-14B checkpoint instead, which is why the family spans two different base-model lineages and two different tokenizers [1]. Hermes 4 is sometimes described as entirely Llama 3.1 based; that description is accurate for the two larger sizes but not for the 14B.
The table below summarizes the family as described in the technical report and model cards. Benchmark figures shown elsewhere in this article use the 405B model unless otherwise stated.
| Attribute | Detail |
|---|---|
| Developer | Nous Research [1] |
| Release | Late August 2025 (technical report arXiv:2508.18255) [1] |
| Sizes | 14B, 70B, 405B parameters [1] |
| Base models | Llama 3.1 (70B, 405B); Qwen3-14B (14B) [1] |
| Type | Hybrid reasoning, open-weight instruction-tuned LLM [1] |
| Reasoning control | Optional <think>...</think> traces, toggled via flag or system prompt [1][3] |
| Post-training data | ~5 million samples / ~19 billion tokens (3.5M reasoning + 1.6M non-reasoning) [1] |
| Max thinking trace (training) | Up to ~16,000 tokens [1] |
| Evaluation context length | 40,960 tokens for reasoning and code benchmarks [1] |
| License | Llama 3.1 community license (70B, 405B) [3] |
| Availability | Hugging Face weights; chat at chat.nousresearch.com [2][3] |
Because the larger models inherit the Llama 3.1 architecture, they are released under Meta's Llama 3.1 community license rather than a fully permissive open-source license [3]. Quantized releases, including FP8 versions, were also published to make the larger models more deployable [2].
The central technical idea in Hermes 4 is teaching a single model to operate in two modes. In non-reasoning mode it answers immediately, like a conventional instruction-tuned assistant. In reasoning mode it first emits a deliberation enclosed in <think> and </think> tags, then produces its final answer. On the larger models this is activated through a reasoning system prompt or a thinking flag, with the documented system prompt instructing the model that it is "a deep thinking AI" that "may use extremely long chains of thought to deeply consider the problem" before answering [1][3].
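Because the reasoning trace is delimited by literal tags, a client can separate the deliberation from the final answer with simple string handling. The following is a minimal sketch, assuming the completion arrives as a single string with at most one <think>...</think> block at the start; the function name `split_reasoning` and the sample strings are illustrative and not part of any Nous tooling.

```python
import re

# The <think>...</think> tags are the ones described in the article;
# everything else in this sketch is illustrative.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(output: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a raw completion.

    If no <think>...</think> block is present (non-reasoning mode),
    the trace is an empty string and the whole output is the answer.
    """
    match = THINK_PATTERN.search(output)
    if match is None:
        return "", output.strip()
    trace = match.group(1).strip()
    answer = output[match.end():].strip()
    return trace, answer


# Hypothetical completions for the two modes.
reasoning_output = "<think>2 + 2 is 4.</think>The answer is 4."
direct_output = "The answer is 4."

print(split_reasoning(reasoning_output))
print(split_reasoning(direct_output))
```

A caller that only wants the answer can discard the first element of the pair, which keeps the same client code working in both modes.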
To create the model, Nous assembled a large, primarily synthetic post-training corpus. The report states it totals approximately 5 million samples and 19 billion tokens, split into about 3.5 million reasoning samples and 1.6 million non-reasoning samples. The reasoning samples were deliberately token-heavy, averaging roughly five times more tokens per sample than the non-reasoning data, and accommodating thinking traces up to about 16,000 tokens long [1]. (Some early secondary coverage and an abbreviated model-card summary cited a figure of around 60 billion tokens, but the technical report's stated total is approximately 19 billion tokens [1].)
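The stated figures can be combined into a rough per-sample estimate. The sketch below takes the approximate totals at face value and treats the "five times" ratio as exact, which is an assumption; the results are therefore back-of-the-envelope values, not numbers from the report.

```python
# Approximate figures as stated in the article (from the technical report).
total_tokens = 19e9
reasoning_samples = 3.5e6
non_reasoning_samples = 1.6e6
ratio = 5.0  # assumption: reasoning samples average exactly 5x the tokens

# total = reasoning_samples * ratio * x + non_reasoning_samples * x, solved for x
non_reasoning_avg = total_tokens / (reasoning_samples * ratio + non_reasoning_samples)
reasoning_avg = ratio * non_reasoning_avg
reasoning_share = reasoning_samples * reasoning_avg / total_tokens

print(f"non-reasoning avg tokens/sample: {non_reasoning_avg:,.0f}")
print(f"reasoning avg tokens/sample:     {reasoning_avg:,.0f}")
print(f"share of tokens in reasoning:    {reasoning_share:.1%}")
```

Under these assumptions a non-reasoning sample averages roughly 1,000 tokens and a reasoning sample roughly 5,000, so reasoning data would account for about 92 percent of the corpus by token count.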
A key part of the pipeline is a graph-based synthetic data generator called DataForge. Inspired by Microsoft's AgentInstruct approach, DataForge generates conversational training data by taking a piece of pretraining seed content and performing a random walk through a directed acyclic graph (DAG). Each node implements a structure-to-structure transformation defined with a Planning Domain Definition Language (PDDL) style interface of preconditions and postconditions, which lets the system compose many small, well-defined transformations into diverse tasks [1]. The training methodology also incorporates loss masking, efficient sequence packing for the heterogeneous mixture of reasoning and non-reasoning data, and rejection sampling to select high-quality completions [1][3].
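The idea of a random walk over precondition/postcondition nodes can be illustrated with a toy example. This is a sketch under stated assumptions, not DataForge itself: the node names, the structure types, and the string transformations are all hypothetical, and in a real pipeline each transformation would presumably be a model call rather than a format string.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Node:
    name: str
    precondition: str   # structure type the node accepts
    postcondition: str  # structure type the node produces
    transform: Callable[[str], str]


# Hypothetical nodes. The structure types form a DAG:
# passage -> fact_list -> question -> conversation, plus passage -> question.
NODES = [
    Node("extract_facts", "passage", "fact_list", lambda s: f"Facts: {s}"),
    Node("write_question", "fact_list", "question", lambda s: f"Question based on [{s}]"),
    Node("write_instruction", "passage", "question", lambda s: f"Instruction about [{s}]"),
    Node("answer", "question", "conversation", lambda s: f"User: {s}\nAssistant: ..."),
]


def random_walk(seed_text: str, rng: random.Random, start: str = "passage"):
    """Walk from the seed structure until no node's precondition matches."""
    state, text, path = start, seed_text, []
    while True:
        candidates = [n for n in NODES if n.precondition == state]
        if not candidates:
            return path, text  # terminal structure reached
        node = rng.choice(candidates)
        text = node.transform(text)
        state = node.postcondition
        path.append(node.name)


path, sample = random_walk("Water boils at 100 C at sea level.", random.Random(0))
print(path)
print(sample)
```

Because a node is eligible only when its precondition matches the current structure, any path the walk takes is a valid composition, and different random choices yield different task shapes from the same seed passage.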
A notable engineering contribution addresses "overlong" reasoning, the tendency of a thinking model to keep generating and never terminate its chain of thought within the context budget. Nous found this especially pronounced on the 14B model trained from Qwen3, whose reasoning traces frequently exceeded 40,960 tokens. Their solution is a second, targeted supervised fine-tuning stage that teaches the model to close its reasoning by inserting a </think> tag at a budget of 30,000 tokens, with only the </think> and end-of-sequence tokens left unmasked so that the intervention teaches the termination criterion without disturbing the model's reasoning distribution. On the 14B model this length-control tuning reduced the fraction of non-terminating ("overlong") outputs by at least 98.9 percent across AIME 2024, AIME 2025, GPQA Diamond, and LiveCodeBench, at a cost of at most a few percent relative accuracy, and in some cases (such as LiveCodeBench) accuracy actually improved. The team reported that they did not consider this extra length-control stage necessary for the 70B or 405B models [1].
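The masking scheme described above can be sketched as follows. This is an illustration under assumptions rather than the report's implementation: tokens are represented as plain strings, `<eos>` is a placeholder for whatever end-of-sequence token the tokenizer defines, and whether any answer text sits between the closing tag and the end of sequence is left as an optional argument because the article does not specify it.

```python
THINK_CLOSE = "</think>"
EOS = "<eos>"  # placeholder; the real token depends on the tokenizer
DEFAULT_BUDGET = 30_000  # budget stated in the article


def build_length_control_example(prompt_tokens, reasoning_tokens,
                                 answer_tokens=(), budget=DEFAULT_BUDGET):
    """Truncate an overlong reasoning trace at `budget` and force a close.

    Returns (tokens, loss_mask). loss_mask[i] is 1 only at the inserted
    </think> token and the end-of-sequence token, so the loss never
    touches the reasoning content itself.
    """
    if len(reasoning_tokens) <= budget:
        raise ValueError("trace is within budget; no intervention needed")
    truncated = list(reasoning_tokens[:budget])
    prefix = list(prompt_tokens) + truncated
    answer = list(answer_tokens)
    tokens = prefix + [THINK_CLOSE] + answer + [EOS]
    loss_mask = [0] * len(prefix) + [1] + [0] * len(answer) + [1]
    return tokens, loss_mask


# Tiny hypothetical example with a budget of 3 tokens.
tokens, mask = build_length_control_example(
    ["Q"], ["a", "b", "c", "d", "e"], ["ans"], budget=3
)
print(tokens)
print(mask)
```

Leaving every other position masked is what lets the stage teach when to stop without pulling the model toward any particular reasoning content.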
In reasoning mode, the 405B model posts strong scores on standard reasoning, math, and coding suites. All figures below are as reported by Nous in the Hermes 4 Technical Report and should be read as the developer's own evaluations [1]:
| Benchmark | Hermes 4 405B (reasoning) | Hermes 4 405B (non-reasoning) |
|---|---|---|
| MATH-500 | 96.2 | 73.8 |
| AIME 2024 | 81.9 | 11.4 |
| AIME 2025 | 78.1 | 10.6 |
| GPQA Diamond | 70.6 | 39.4 |
| LiveCodeBench (v6, Aug 2024+) | 61.4 | 28.1 |
| RefusalBench | 57.1 | 43.2 |
These numbers illustrate how much the optional reasoning mode contributes on hard math and code tasks: on the AIME competition-math benchmarks, the reasoning mode lifts accuracy from roughly 11 percent to about 80 percent [1]. The report compares Hermes 4 405B against similarly sized open-weight models such as DeepSeek R1, DeepSeek V3, and Qwen3 235B, where it is broadly competitive on math, reasoning, and knowledge while not leading every category [1].
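The size of the reasoning-mode gain can be read directly off the table. The short script below only restates the developer-reported figures and computes differences in percentage points; it adds no new measurements.

```python
# (reasoning, non-reasoning) scores for Hermes 4 405B, as reported by Nous.
scores = {
    "MATH-500": (96.2, 73.8),
    "AIME 2024": (81.9, 11.4),
    "AIME 2025": (78.1, 10.6),
    "GPQA Diamond": (70.6, 39.4),
    "LiveCodeBench": (61.4, 28.1),
    "RefusalBench": (57.1, 43.2),
}

# Absolute gain from enabling reasoning, in percentage points.
gains = {name: round(r - n, 1) for name, (r, n) in scores.items()}

# Mean over the two AIME sets, matching the "roughly 11 to about 80" summary.
aime_reasoning = (scores["AIME 2024"][0] + scores["AIME 2025"][0]) / 2
aime_direct = (scores["AIME 2024"][1] + scores["AIME 2025"][1]) / 2

for name, gain in gains.items():
    print(f"{name:14s} +{gain:.1f}")
print(f"AIME mean: {aime_direct:.1f} -> {aime_reasoning:.1f}")
```

The gain is largest on the two AIME sets (about 70 and 68 points) and smallest on RefusalBench (about 14 points), which is consistent with reasoning mattering most on multi-step math.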
The benchmark most central to Nous's pitch is RefusalBench, an internal evaluation the company built to measure how often a model refuses reasonable requests. It was constructed by identifying 32 categories of requests that commonly trigger refusals from frontier models and hand-crafting 166 prompts spanning those categories, with refusals judged by Claude Sonnet 4 acting as an automated grader. For three sensitive categories (specific harm to minors, exploitation and human trafficking, and suicide or self-harm), Nous deliberately inverted the scoring so that refusing is rewarded, meaning a high overall score reflects willingness to help on benign requests rather than blanket permissiveness [1]. On this benchmark, Hermes 4 405B in reasoning mode scored 57.1, which Nous reports as well above the scores it measured for other systems, including Grok 4 at 51.3, DeepSeek V3 at 28.1, Gemini 2.5 Pro at 24.23, Llama 3.1 405B at 21.7, GPT-4o at 17.67, Claude Sonnet 4 at about 17, and GPT-5 at 11.34 [1]. As an internally designed benchmark, RefusalBench has no independent baseline, so these comparisons reflect Nous's own methodology and grading.
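The inverted scoring can be made concrete with a small example. This is a sketch assuming the overall score is the percentage of prompts on which the model earns credit, with equal weight per prompt; the article does not spell out the aggregation, and the category identifiers and sample results below are hypothetical.

```python
# Hypothetical identifiers for the three categories the article says are inverted.
INVERTED_CATEGORIES = {
    "harm_to_minors",
    "exploitation_trafficking",
    "suicide_self_harm",
}


def refusalbench_style_score(results):
    """Score a list of (category, refused) pairs on a 0-100 scale.

    Ordinary categories earn credit when the model does NOT refuse;
    inverted categories earn credit when it DOES refuse.
    """
    results = list(results)
    if not results:
        raise ValueError("no graded results")
    credited = sum(
        refused if category in INVERTED_CATEGORIES else not refused
        for category, refused in results
    )
    return 100.0 * credited / len(results)


# Hypothetical grader verdicts.
graded = [
    ("chemistry", False),         # answered: credit
    ("chemistry", True),          # refused: no credit
    ("suicide_self_harm", True),  # refused: credit (inverted)
    ("suicide_self_harm", False), # answered: no credit (inverted)
]
print(refusalbench_style_score(graded))
```

Under this scheme a model that answers everything indiscriminately loses credit on the inverted categories, so it cannot reach the maximum score through blanket permissiveness.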
This benchmark embodies the project's stated philosophy. Nous frames Hermes 4 as "neutrally aligned" and steerable: the model is meant to adopt the user's framing, follow system prompts faithfully, and avoid the reflexive disclaimers and policy hedging the report calls "policy rigidity" in some commercial models [1][3]. In qualitative probes covering role-play, persona adoption, and political analysis, the report observes that Hermes 4 tended to stay in character and engage with fictional or controlled prompts rather than breaking frame to assert its AI identity, in contrast to several proprietary systems tested [1]. Press coverage emphasized this angle, describing Hermes 4 as a deliberately unrestricted, low-refusal model and a bet that an open, user-controlled assistant can be both safe enough and more useful than tightly guardrailed alternatives [2][9].
Hermes 4 was released with open weights on Hugging Face under the Hermes 4 collection, alongside the chat interface at chat.nousresearch.com and the technical report on arXiv [1][2][3]. The 405B model was also made available through third-party inference providers and API aggregators such as OpenRouter, which broadened access for users who could not host a 405-billion-parameter model themselves [10]. FP8 and other quantized variants were published to ease deployment of the larger sizes [2].
Reception in the AI press focused on two themes: the addition of competitive, toggleable reasoning to an open model, and the model's openly low-refusal, neutral-alignment posture. Outlets including VentureBeat, MarkTechPost, and various AI-news sites covered the launch, frequently highlighting the RefusalBench results and framing Hermes 4 as an open challenger that can match or exceed commercial chatbots on willingness to engage while remaining fully open-weight and self-hostable [2][3][9]. Commentators also noted the transparency of the release: Nous published not only the weights but also a detailed account of its DataForge data pipeline, its training recipe, and its length-control method, in keeping with the company's reproducibility-focused mission [1][2]. Nous continued to iterate on the line after launch, releasing further point updates under the Hermes 4 banner in subsequent months [11].