UltraChat
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,474 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,474 words
Add missing citations, update stale details, or suggest a clearer explanation.
UltraChat is a large-scale synthetic multi-turn instructional conversation dataset released in May 2023 by the OpenBMB group at Tsinghua University, comprising approximately 1.5 million dialogues generated by iteratively prompting two GPT-3.5 Turbo APIs against one another, with one model simulating a human user and the other producing assistant replies.[^1][^2] The dataset is organized into three sectors covering "Questions about the World," "Creation and Generation," and "Assistance on Existing Materials," and it accompanied the paper "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations" by Ning Ding and collaborators (arXiv:2305.14233).[^1] UltraChat became one of the most widely reused instruction corpora in the open-weights ecosystem after Hugging Face's H4 alignment team filtered it down to roughly 200,000 highest-quality conversations and used the resulting UltraChat 200k subset to supervise the first stage of Mistral 7B derived chat model Zephyr-7B-beta.[^3][^4] Alongside the companion preference dataset UltraFeedback, UltraChat is widely credited with anchoring the recipe by which small open chat models began approaching the MT-Bench scores of much larger proprietary systems during 2023 and 2024.[^4][^5]
By the spring of 2023 the open-source community had produced several instruction-tuned LLaMA derivatives, including Stanford Alpaca (52,002 single-turn pairs generated by Self-Instruct against text-davinci-003), Vicuna (fine-tuned on ShareGPT user dialogues), Koala, Baize, BELLE, GPT4All, and Dolly.[^1] Each of these projects demonstrated that competitive chat behavior could be elicited from open weights, but the underlying instruction corpora suffered from three recurring weaknesses that the UltraChat authors set out to address.[^1]
First, datasets such as Alpaca were almost entirely single-turn: the average number of conversational rounds was 1, which left fine-tuned models brittle on follow-up questions or multi-step dialogue. Second, corpora harvested from public ChatGPT logs (ShareGPT) or scraped from real user prompts raised privacy and licensing concerns, and they tended to over-represent the topics that early adopters happened to share. Third, the diversity statistics of these corpora lagged: average lexical diversity scores and topic-spread metrics were modest, and the dialogues frequently fell into stylistic ruts.[^1]
The UltraChat authors framed the problem in terms of "the final one mile" of chat-model quality: going from a 0-to-60 instruction-following baseline to a 60-to-100 conversational model. Their thesis, stated in the paper's introduction, was that the most direct lever for that final mile is the quality and diversity of fine-tuning data, not just additional preference learning or scaling.[^1] To reach a million-scale corpus without humans in the loop, the team needed (a) a principled taxonomy that genuinely spans human/AI interaction space, (b) a generation procedure that creates coherent multi-turn dialogues rather than disconnected single-turn pairs, and (c) prompts strong enough to prevent the simulated "user" GPT from collapsing back into assistant mode.[^1]
The paper "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations" was first posted to arXiv on 23 May 2023 (arXiv:2305.14233) with authors Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou, all affiliated with Tsinghua University.[^1] A revised version was accepted to EMNLP 2023 and presented in Singapore in December 2023, where it appears at pages 3029-3051 of the main conference proceedings.[^2]
The dataset itself was released incrementally on GitHub (thunlp/UltraChat) and on Hugging Face (stingning/ultrachat) starting April 2023, with sector-by-sector releases continuing through mid-2023.[^6][^7] On 27 June 2023 the OpenBMB group released UltraLM-13B, the LLaMA-13B fine-tune trained on the full UltraChat corpus, which briefly topped the AlpacaEval leaderboard among open-source models.[^6] A 65B-parameter UltraLM v1.0 followed on 7 August 2023, and a v2.0 release on 26 September 2023 was accompanied by the launch of the UltraFeedback preference dataset and the UltraRM-13B and UltraCM-13B critique models.[^6][^8]
A heavily filtered HuggingFaceH4 subset, UltraChat 200k, appeared in late October 2023 in conjunction with the Zephyr-7B-beta release, and quickly became the dominant version of the corpus circulated for supervised fine-tuning of open chat models.[^3][^4]
The intellectual backbone of UltraChat is its tripartite taxonomy, which the authors derived from a principle that every interaction between a human user and an AI assistant can be modeled as a form of "obtaining information." From that observation they distinguish three modes of obtaining information, each of which becomes a sector of the corpus.[^1]
The first sector targets information access, that is, queries about entities, concepts, and objects that already exist in the world. The authors built it from two parallel streams of seed prompts.[^1]
The first stream is topic-centric. The team asked ChatGPT to enumerate 30 high-level meta-concepts spanning daily life ("Technology," "Health and wellness," "Travel and adventure," "Food and drink," "Art and culture," "Science and innovation," "Fashion and style," "Relationships and dating," "Sports and fitness," "Nature and the environment," "Music and entertainment," "Politics and current events," "Education and learning," "Money and finance," "Work and career," "Philosophy and ethics," "History and nostalgia," "Social media and communication," "Creativity and inspiration," "Personal growth and development," "Spirituality and faith," "Pop culture and trends," "Beauty and self-care," "Family and parenting," "Entrepreneurship and business," "Literature and writing," "Gaming and technology," "Mindfulness and meditation," "Diversity and inclusion," "Travel and culture exchange").[^1] Each meta-concept was then expanded into 30 to 50 subtopics or related concepts, yielding roughly 1,100 or more leaf topics. For each subtopic, ten distinct seed questions were generated, and then for every seed question ChatGPT produced ten additional related questions, multiplying the prompt pool further.[^1]
The second stream is entity-centric. The authors took the 10,000 most frequently occurring entities from Wikidata, ranked by their occurrence in Wikipedia articles, and generated five meta-questions per entity, expanded by ten specific and twenty extended follow-up questions. Extended questions were instructed to preserve some thematic similarity with the original while exploring distinct objects or angles, a deliberate hedge against entity-by-entity redundancy.[^1] After filtering and sampling, around 500,000 such questions served as opening lines for Sector I dialogues.[^1]
The second sector targets conditional information creation, that is, prompts where the user asks the assistant to produce new text under a user-supplied constraint. The authors enumerated 20 categories of written material (Articles and Blog Posts, Job Application Material, Stories, Legal Documents and Contracts, Poems, Educational Content, Screenplays, Scripts for Language Learning, Technical Documents and Reports, Marketing Materials, Social Media Posts, Personal Essays, Emails, Scientific Papers and Summaries, Speeches and Presentations, Recipes and Cooking Instructions, News Articles, Song Lyrics, Product Descriptions and Reviews, Programs and Code).[^1]
For each material type, ChatGPT generated 200 instructions; roughly 80 percent of those instructions were then fed back as in-context expansion seeds to produce more detailed instructions. The expanded instructions served as opening lines for dialogue generation. Throughout each dialogue, the user-simulator prompt continually reinforced the primary goal of the exchange (generating and refining a piece of writing), which the authors found necessary to keep the conversation focused rather than drifting off into open-ended chitchat.[^1]
The third sector targets information transformation, that is, operations performed on a piece of existing text such as rewriting, translation, summarization, continuation, or inference. The seed material was drawn from the C4 (Colossal Clean Crawled Corpus) using URL-keyword matching against the 20 material types from Sector II, yielding 10,000 source documents. ChatGPT produced five distinct instructions per document, and each (text, instruction) pair was then concatenated using one of seven manually authored templates (for example, {text}\n{instruction}, or Given the text: {text}\n{instruction}).[^1] In total, 500,000 such concatenated prompts served as opening lines for Sector III dialogues.[^1]
Across all three sectors, UltraChat dialogues are produced by an iterative two-API simulation that distinguishes UltraChat from one-shot single-turn datasets such as Alpaca and from naively prompted long-form dialogues.[^1]
The procedure works as follows. Two separate ChatGPT (GPT-3.5 Turbo) API endpoints are instantiated: one is prompted to play the role of a user model, the other to play the role of an AI assistant model. The opening line, constructed by one of the three sector pipelines described above, is fed in as the first user turn. The assistant model produces a reply. The user model then sees the running dialogue history, plus an explicit role-conditioning prompt instructing it to behave as a human user, and emits the next user turn. The two models swap turns for approximately three to seven rounds before the dialogue is terminated, after which a post-processing filter prunes unrealistic interactions.[^1]
A specific failure mode the authors highlight is role exchange: if the user model is shown only the dialogue history and not an explicit role-conditioning prompt, it tends to slip into assistant mode and start answering its own previous turn. UltraChat counteracts this with explicit personality prompts in Sector I and Sector III, and in Sector II additionally with a reminder of the writing goal at every user-turn step.[^1] During post-processing the authors strip excessively polite tokens like "Thank you," "Thanks," and "You're welcome" so that downstream models do not learn to over-fill their replies with empty courtesy.[^1] A comparison in the paper's appendix (Table 20) contrasts a directly generated multi-turn ChatGPT response with the UltraChat-style iterative output for the same opening prompt, showing the latter is substantially more substantive and structured.[^1]
The motivation for this two-agent design is partly philosophical: directly prompting one LLM to write an entire multi-turn dialogue is fast, but the resulting transcripts are short, lack realistic user behavior, and cannot exploit the RLHF training that already aligns ChatGPT toward human-like, useful responses. The two-agent setup, by contrast, lets the alignment of each API surface in its own turns.[^1]
The paper reports the following headline statistics for UltraChat as released, compared against contemporary instruction corpora.[^1]
| Dataset | Dialogues | Avg. turns | Avg. dialogue length (tokens) | Avg. utterance length (tokens) | Lexical diversity (MTLD) | User simulation |
|---|---|---|---|---|---|---|
| Self-Instruct | 82,439 | 1.0 | 69.8 | 29.2 | 24.9 | No |
| Stanford Alpaca | 52,002 | 1.0 | 91.1 | 64.5 | 42.8 | No |
| SODA | 1,486,869 | 3.6 | 231.8 | 22.5 | 38.6 | No |
| GPT-4-LLM | 61,002 | 1.0 | 179.6 | 142.9 | 48.9 | No |
| BELLE | 1,436,679 | 1.0 | 102.3 | 63.3 | 35.9 | No |
| Baize | 210,311 | 3.1 | 293.9 | 52.8 | 67.1 | Yes |
| GPT4All | 711,126 | 1.0 | 597.7 | 318.9 | 62.7 | No |
| UltraChat | 1,468,352 | 3.8 | 1467.4 | 309.3 | 74.3 | Yes |
Source: Table 5 of Ding et al. (2023).[^1]
UltraChat leads on dialogue length (1,467.4 tokens on average, six times the size of contemporary multi-turn corpora such as SODA), highest average number of turns at 3.8, and highest lexical diversity (74.3 MTLD score). On topic diversity it placed second to GPT4All; the authors attribute this to the longer dialogues regularizing the per-dialogue embedding rather than to a lack of breadth.[^1] Coherence, scored by ChatGPT on a 1-10 scale over 10,000 random samples, tied at 9.06, the highest among the surveyed datasets.[^1]
In aggregate, the paper's official total of "1.5 million high-quality multi-turn dialogues" corresponds to the 1,468,352 dialogues reported in the comparison table, plus subsequent additions. The Hugging Face mirror at stingning/ultrachat exposes the corpus as one JSON object per dialogue, with each data array containing four to fourteen alternating user/assistant utterances.[^7]
To validate UltraChat, the authors fine-tuned LLaMA-13B on the corpus, producing UltraLLaMA. Training used a standard cross-entropy loss masked to assistant tokens only, with each dialogue split into sequences capped at 2,048 tokens so that long conversations contributed multiple training examples while preserving prior context. The training run used 128 NVIDIA A100 GPUs with a total batch size of 512.[^1]
The paper evaluates UltraLLaMA against Alpaca-7B, Vicuna-13B, Koala-13B, Dolly-12B, MPT-7B, OpenAssistant-12B, Baize-13B, and ChatGPT on three setups. In an independent ChatGPT-scored evaluation over 300 questions written by GPT-4 plus the Vicuna benchmark set, UltraLLaMA reaches an overall score of 9.023 +/- 0.952, edging out Vicuna's 8.961 +/- 0.718 (Table 1).[^1] Broken out by task type (Table 7), UltraLLaMA-13B scores 9.02 overall, against Vicuna-13B's 8.96 and ChatGPT's 9.12, and posts best-in-class numbers on World Knowledge-Difficult (9.33), Professional Knowledge-Physics (9.17), and the Vicuna evaluation set itself (8.70 vs. Vicuna's 8.63).[^1] On TruthfulQA's multiple-choice task it ties Vicuna-13B at 54 percent accuracy, the joint best among open models tested.[^1] In pairwise comparison (Figure 2 of the paper), UltraLLaMA wins up to 85 percent of head-to-head matchups against open-source baselines and beats Vicuna with a 13 percent higher win rate.[^1]
A 65B-parameter variant, UltraLM-65B v1.0, followed on 7 August 2023, and UltraLM-13B v2.0 alongside reward and critique models was released on 26 September 2023.[^6]
The most widely deployed re-release of UltraChat is HuggingFaceH4/ultrachat_200k, prepared by the Hugging Face H4 (alignment) team in October 2023 for the Zephyr project.[^3] Despite the "200k" name, the dataset card reports 207,865 dialogues in the train_sft split and 23,110 in the test_sft split, plus a 256,032-row train_gen split and 28,304-row test_gen split intended for rejection-sampling or PPO-style preference generation. The total file size is approximately 1.62 GB in Parquet format under an MIT license.[^3]
The filtering recipe applied to the source UltraChat (described on the dataset card as "1.4M ChatGPT-generated dialogues") involves three steps.[^3] First, only a subset is retained, motivated by the goal of "faster supervised fine-tuning." Second, truecasing is applied, since roughly five percent of the original dialogues contained lower-case starts of sentences such as "Hello. how are you?" rather than the canonical "Hello. How are you?". Third, dialogues are removed where the assistant's reply contains stock evasions such as "I do not have emotions" or "I don't have opinions," even on fact-based prompts where such hedging is unwarranted.[^3] Each row in UltraChat 200k carries a prompt, a SHA-style prompt_id, and a messages array of {content, role} objects following the by-now-standard chat-template schema.[^3]
UltraChat 200k has been downloaded at scale (the Hugging Face Hub reported 72,127 downloads in a single recent month) and has been republished in localized variants such as Vietnamese, Dutch, and multilingual derivatives.[^3]
The first major downstream model trained on UltraChat 200k was Zephyr-7B-beta, released by Hugging Face on 25 October 2023 alongside the paper "Zephyr: Direct Distillation of LM Alignment" (arXiv:2310.16944) by Lewis Tunstall and colleagues.[^4] Zephyr starts from mistralai/Mistral-7B-v0.1 and is trained in two stages: a distilled supervised fine-tuning (dSFT) stage on UltraChat 200k, followed by a distilled direct preference optimization (dDPO) stage on the binarized UltraFeedback preference set.[^4][^5] The Zephyr authors report a final MT-Bench score of 7.34 for Zephyr-7B-beta, surpassing both Llama-2-Chat-70B's 6.86 and Mistral-Instruct-v0.1's 6.84 at the time of release, with an AlpacaEval win rate of 90.60 percent.[^4][^9]
The Hugging Face model card for Zephyr-7B-beta lists the SFT-stage hyperparameters in detail: learning rate 5e-7, train batch size 2 on each of 16 GPUs (effective batch size 32), the Adam optimizer with betas 0.9 and 0.999, linear LR schedule with 0.1 warmup ratio, and 3 epochs at seed 42. The final DPO validation metrics include a loss of 0.7496, rewards/chosen of -4.5221, rewards/rejected of -8.3184, and rewards/accuracies of 0.7812.[^4]
A noteworthy methodological choice in the Zephyr paper is that the team removed in-built alignment from the SFT data: they argue that the deflection patterns surfaced during UltraChat 200k filtering ("I do not have emotions") were artifacts of teacher-model alignment rather than genuine refusals, and removing them improved both MT-Bench and AlpacaEval performance. The corollary, acknowledged in the paper and on the model card, is that Zephyr-7B-beta is more willing to produce problematic outputs when explicitly prompted to do so.[^4]
The natural companion to UltraChat in the OpenBMB ecosystem is UltraFeedback, released in October 2023 with the paper "UltraFeedback: Boosting Language Models with Scaled AI Feedback" (arXiv:2310.01377) by Ganqu Cui and collaborators, also from Tsinghua University.[^8] UltraFeedback was accepted to ICML 2024 and supplies the Direct Preference Optimization (DPO) half of the open chat-model recipe.[^8]
The dataset comprises 64,000 prompts drawn from a mixture of six instruction sources (UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN), each paired with four model completions sampled from a diverse model pool, for a total of approximately 256,000 completions and over one million GPT-4 ratings.[^8] Ratings cover four axes (instruction-following, truthfulness, honesty, helpfulness), and approximately 340,000 pairwise comparison pairs can be derived for reward-model training.[^8] The companion reward model UltraRM-13B and critique model UltraCM-13B were both trained on UltraFeedback and released alongside it.[^6][^8]
Because UltraFeedback shares both the OpenBMB authorship and an explicit overlap of prompts with UltraChat (UltraChat is one of the six prompt sources), the UltraChat + UltraFeedback combination is often referred to as the "Ultra recipe" for open chat-model alignment.[^8]
UltraChat (and especially UltraChat 200k) has been adopted as a default SFT corpus by a long sequence of open-weights chat models. The following table summarizes some of the most prominent.
| Model | Base | UltraChat usage | MT-Bench | Notes |
|---|---|---|---|---|
| UltraLLaMA-13B | LLaMA-13B | Full UltraChat (1.5M) | n/a | Original OpenBMB model; outperformed Vicuna in paper.[^1] |
| UltraLM-65B v1.0 | LLaMA-65B | Full UltraChat | n/a | August 2023; topped open-source AlpacaEval briefly.[^6] |
| Zephyr-7B-alpha | Mistral 7B | UltraChat 200k SFT | 6.88 | Hugging Face H4, October 2023.[^4] |
| Zephyr-7B-beta | Mistral 7B | UltraChat 200k SFT + UltraFeedback DPO | 7.34 | Surpassed Llama-2-Chat-70B (6.86).[^4][^9] |
These models in turn seeded countless community fine-tunes and merges; UltraChat 200k has been one of the most frequently mixed corpora in community SFT recipes since late 2023, often combined with OpenHermes-style synthetic chat or coding-specific corpora.[^3]
UltraChat sits within a broader open-source toolchain published by OpenBMB, the open-source initiative coordinated out of Tsinghua's Natural Language Processing Lab. The ecosystem includes several artifacts the UltraChat group authored or contributed to.
While MiniCPM's SFT mixture is its own proprietary curation rather than UltraChat directly, the lineage is clearly shared: many of the same authors, the same evaluation methodology, and the same emphasis on producing strong open chat models with strong open data.[^11]
UltraChat occupies a particular niche in the instruction-data landscape, and it is useful to contrast it with adjacent corpora.
| Corpus | Year | Size | Multi-turn? | Generated by | Distinctive feature |
|---|---|---|---|---|---|
| Stanford Alpaca | 2023 | 52,002 | No | text-davinci-003 via Self-Instruct | Seminal Self-Instruct dataset; single-turn only.[^1] |
| ShareGPT | 2023 | ~70k+ | Yes | Real ChatGPT users | User-shared transcripts; quality varies; licensing concerns. |
| Baize | 2023 | 210,311 | Yes (3.1 turns) | ChatGPT self-chat | Forerunner of two-agent simulation idea.[^1] |
| GPT4All | 2023 | 711,126 | No | GPT-3.5-Turbo | Long single-turn dialogues.[^1] |
| SODA | 2023 | 1,486,869 | Yes (3.6 turns) | Symbolic-social knowledge graph | Social/banter focus rather than instruction.[^1] |
| UltraChat | 2023 | 1,468,352 | Yes (3.8 turns) | GPT-3.5-Turbo (two-API simulation) | Tripartite taxonomy; long, coherent dialogues; user-simulation prompt.[^1] |
| UltraChat 200k | 2023 | 207,865 train | Yes | (filtered UltraChat) | Truecased, evasion-pruned, MIT.[^3] |
The defining contributions, relative to its peers, are (1) the explicit three-sector taxonomy that explicitly separates information access, generation, and transformation, (2) the iterative two-API user/assistant simulation rather than a single-shot dialogue prompt, and (3) the deliberate scaling to long, multi-round transcripts (1,467 tokens per dialogue on average versus 91 for Alpaca).[^1]
UltraChat's significance stems less from any single number than from its role as connective tissue in the 2023-2024 wave of open chat models. Three concrete applications stand out.[^4][^5][^8]
Distillation of multi-turn behavior into small open models. Before UltraChat, most synthetic instruction corpora were single-turn (Alpaca, GPT-4-LLM) or short multi-turn (Baize at 3.1 average turns). The 3.8-turn average and 1,467-token average dialogue length of UltraChat gave smaller open models a chance to learn extended back-and-forth and topic continuation that previously required either RLHF or access to real user logs.[^1]
Repeatable open recipe. The Zephyr team's dSFT + dDPO pipeline (UltraChat 200k for SFT, UltraFeedback binarized for DPO) is sufficiently reproducible that it has been ported to many subsequent base models. The MT-Bench result of 7.34 from a 7-billion-parameter model trained on this combination served as a public proof point that open chat-model quality could close most of the gap to GPT-3.5-Turbo, which scores 7.94 on the same benchmark.[^4]
Anchor for downstream evaluation. The companion artifacts (UltraRM, UltraCM, UltraEval) provide a self-contained loop for generating data, training models, and evaluating them, all from a single research group. This integrated stack has been used as scaffolding for MiniCPM and other efficient-deployment LLM projects.[^10][^11]
The UltraChat authors are explicit about several limitations, and downstream users have flagged additional concerns.[^1][^3]
English only. The released corpus is English-only. The paper notes that the team is "actively working on collecting and constructing data in other languages, such as Chinese," but at the time of the original release no multilingual version existed.[^1] Community-mirrored multilingual derivatives (Vietnamese, Dutch, etc.) have since appeared on Hugging Face, but they are translations rather than native generation.[^3]
ChatGPT-derived ceiling. Because both the user simulator and the assistant simulator are GPT-3.5 Turbo, UltraChat in effect distills a particular version of ChatGPT's behavior. Models trained on it inherit not only its strengths but also its weaknesses: hedging language, mode collapse on certain refusal patterns, and (where present) factual hallucinations. The Hugging Face H4 team's filtering recipe explicitly targets the most egregious examples of this (the "I do not have emotions" hedging), but the cleanup is heuristic.[^3]
Topic-diversity ceiling. Although UltraChat leads in lexical diversity, the paper acknowledges that on topic-diversity scoring (cosine distance between dialogue embeddings) it falls slightly below GPT4All, an artifact the authors attribute to longer dialogues smoothing the per-dialogue embedding. In practice, the meta-topic list is finite (30 meta-concepts plus 10,000 Wikidata entities), which puts a soft ceiling on out-of-distribution prompts.[^1]
Energy and reproducibility cost. Training UltraLLaMA-13B took 128 A100 GPUs at total batch size 512, which the authors flag as "more energy-intensive than other lightweight models."[^1] The dataset's generation pipeline itself required calling GPT-3.5-Turbo APIs millions of times, which is non-trivial both financially and in terms of OpenAI-derived data dependence.
Evaluation methodology. The paper's headline comparison relies on ChatGPT-as-judge scoring rather than human evaluation, which the authors acknowledge "could produce steady results but is still not as reliable as GPT-4."[^1] Subsequent work in the community has shifted toward MT-Bench and AlpacaEval, where Zephyr-7B-beta's 7.34 / 90.60% numbers are the more frequently cited yardstick for UltraChat-derived models.[^4][^9]
Data contamination risks. Because UltraChat ultimately reflects ChatGPT outputs, any benchmark that ChatGPT memorized (in particular, TruthfulQA-style benchmarks built before May 2023) is at risk of being indirectly leaked through fine-tuning on UltraChat. The paper's own TruthfulQA result of 54 percent for UltraLLaMA matches Vicuna exactly, which is consistent with both models inheriting the same teacher distribution.[^1]
Several adjacent topics in the open-source instruction-data and alignment landscape are directly relevant.