UltraChat

Chinese AI Data & Datasets Large Language Models

22 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

11 citations

Revision

v3 · 4,472 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

UltraChat is a large-scale synthetic multi-turn instructional conversation dataset released in May 2023 by the OpenBMB group at Tsinghua University, comprising approximately 1.5 million dialogues generated by iteratively prompting two GPT-3.5 Turbo APIs against one another, with one model simulating a human user and the other producing assistant replies.^[1]^[2] The dataset is organized into three sectors covering "Questions about the World," "Creation and Generation," and "Assistance on Existing Materials," and it accompanied the paper "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations" by Ning Ding and collaborators (arXiv:2305.14233).^[1] UltraChat became one of the most widely reused instruction corpora in the open-weights ecosystem after Hugging Face's H4 alignment team filtered it down to roughly 200,000 highest-quality conversations and used the resulting UltraChat 200k subset to supervise the first stage of Mistral 7B derived chat model Zephyr-7B-beta.^[3]^[4] Alongside the companion preference dataset UltraFeedback, UltraChat is widely credited with anchoring the recipe by which small open chat models began approaching the MT-Bench scores of much larger proprietary systems during 2023 and 2024.^[4]^[5]

Background and motivation

By the spring of 2023 the open-source community had produced several instruction-tuned LLaMA derivatives, including Stanford Alpaca (52,002 single-turn pairs generated by Self-Instruct against text-davinci-003), Vicuna (fine-tuned on ShareGPT user dialogues), Koala, Baize, BELLE, GPT4All, and Dolly.^[1] Each of these projects demonstrated that competitive chat behavior could be elicited from open weights, but the underlying instruction corpora suffered from three recurring weaknesses that the UltraChat authors set out to address.^[1]

First, datasets such as Alpaca were almost entirely single-turn: the average number of conversational rounds was 1, which left fine-tuned models brittle on follow-up questions or multi-step dialogue. Second, corpora harvested from public ChatGPT logs (ShareGPT) or scraped from real user prompts raised privacy and licensing concerns, and they tended to over-represent the topics that early adopters happened to share. Third, the diversity statistics of these corpora lagged: average lexical diversity scores and topic-spread metrics were modest, and the dialogues frequently fell into stylistic ruts.^[1]

The UltraChat authors framed the problem in terms of "the final one mile" of chat-model quality: going from a 0-to-60 instruction-following baseline to a 60-to-100 conversational model. Their thesis, stated in the paper's introduction, was that the most direct lever for that final mile is the quality and diversity of fine-tuning data, not just additional preference learning or scaling.^[1] To reach a million-scale corpus without humans in the loop, the team needed (a) a principled taxonomy that genuinely spans human/AI interaction space, (b) a generation procedure that creates coherent multi-turn dialogues rather than disconnected single-turn pairs, and (c) prompts strong enough to prevent the simulated "user" GPT from collapsing back into assistant mode.^[1]

Paper and release timeline

The paper "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations" was first posted to arXiv on 23 May 2023 (arXiv:2305.14233) with authors Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou, all affiliated with Tsinghua University.^[1] A revised version was accepted to EMNLP 2023 and presented in Singapore in December 2023, where it appears at pages 3029-3051 of the main conference proceedings.^[2]

The dataset itself was released incrementally on GitHub (thunlp/UltraChat) and on Hugging Face (stingning/ultrachat) starting April 2023, with sector-by-sector releases continuing through mid-2023.^[6]^[7] On 27 June 2023 the OpenBMB group released UltraLM-13B, the LLaMA-13B fine-tune trained on the full UltraChat corpus, which briefly topped the AlpacaEval leaderboard among open-source models.^[6] A 65B-parameter UltraLM v1.0 followed on 7 August 2023, and a v2.0 release on 26 September 2023 was accompanied by the launch of the UltraFeedback preference dataset and the UltraRM-13B and UltraCM-13B critique models.^[6]^[8]

A heavily filtered HuggingFaceH4 subset, UltraChat 200k, appeared in late October 2023 in conjunction with the Zephyr-7B-beta release, and quickly became the dominant version of the corpus circulated for supervised fine-tuning of open chat models.^[3]^[4]

Three-sector taxonomy

The intellectual backbone of UltraChat is its tripartite taxonomy, which the authors derived from a principle that every interaction between a human user and an AI assistant can be modeled as a form of "obtaining information." From that observation they distinguish three modes of obtaining information, each of which becomes a sector of the corpus.^[1]

Sector I: Questions about the World

The first sector targets information access, that is, queries about entities, concepts, and objects that already exist in the world. The authors built it from two parallel streams of seed prompts.^[1]

The first stream is topic-centric. The team asked ChatGPT to enumerate 30 high-level meta-concepts spanning daily life ("Technology," "Health and wellness," "Travel and adventure," "Food and drink," "Art and culture," "Science and innovation," "Fashion and style," "Relationships and dating," "Sports and fitness," "Nature and the environment," "Music and entertainment," "Politics and current events," "Education and learning," "Money and finance," "Work and career," "Philosophy and ethics," "History and nostalgia," "Social media and communication," "Creativity and inspiration," "Personal growth and development," "Spirituality and faith," "Pop culture and trends," "Beauty and self-care," "Family and parenting," "Entrepreneurship and business," "Literature and writing," "Gaming and technology," "Mindfulness and meditation," "Diversity and inclusion," "Travel and culture exchange").^[1] Each meta-concept was then expanded into 30 to 50 subtopics or related concepts, yielding roughly 1,100 or more leaf topics. For each subtopic, ten distinct seed questions were generated, and then for every seed question ChatGPT produced ten additional related questions, multiplying the prompt pool further.^[1]

The second stream is entity-centric. The authors took the 10,000 most frequently occurring entities from Wikidata, ranked by their occurrence in Wikipedia articles, and generated five meta-questions per entity, expanded by ten specific and twenty extended follow-up questions. Extended questions were instructed to preserve some thematic similarity with the original while exploring distinct objects or angles, a deliberate hedge against entity-by-entity redundancy.^[1] After filtering and sampling, around 500,000 such questions served as opening lines for Sector I dialogues.^[1]

Sector II: Creation and Generation

The second sector targets conditional information creation, that is, prompts where the user asks the assistant to produce new text under a user-supplied constraint. The authors enumerated 20 categories of written material (Articles and Blog Posts, Job Application Material, Stories, Legal Documents and Contracts, Poems, Educational Content, Screenplays, Scripts for Language Learning, Technical Documents and Reports, Marketing Materials, Social Media Posts, Personal Essays, Emails, Scientific Papers and Summaries, Speeches and Presentations, Recipes and Cooking Instructions, News Articles, Song Lyrics, Product Descriptions and Reviews, Programs and Code).^[1]

For each material type, ChatGPT generated 200 instructions; roughly 80 percent of those instructions were then fed back as in-context expansion seeds to produce more detailed instructions. The expanded instructions served as opening lines for dialogue generation. Throughout each dialogue, the user-simulator prompt continually reinforced the primary goal of the exchange (generating and refining a piece of writing), which the authors found necessary to keep the conversation focused rather than drifting off into open-ended chitchat.^[1]

Sector III: Assistance on Existing Materials

The third sector targets information transformation, that is, operations performed on a piece of existing text such as rewriting, translation, summarization, continuation, or inference. The seed material was drawn from the C4 (Colossal Clean Crawled Corpus) using URL-keyword matching against the 20 material types from Sector II, yielding 10,000 source documents. ChatGPT produced five distinct instructions per document, and each (text, instruction) pair was then concatenated using one of seven manually authored templates (for example, {text}\n{instruction}, or Given the text: {text}\n{instruction}).^[1] In total, 500,000 such concatenated prompts served as opening lines for Sector III dialogues.^[1]

How a dialogue is generated

Across all three sectors, UltraChat dialogues are produced by an iterative two-API simulation that distinguishes UltraChat from one-shot single-turn datasets such as Alpaca and from naively prompted long-form dialogues.^[1]

The procedure works as follows. Two separate ChatGPT (GPT-3.5 Turbo) API endpoints are instantiated: one is prompted to play the role of a user model, the other to play the role of an AI assistant model. The opening line, constructed by one of the three sector pipelines described above, is fed in as the first user turn. The assistant model produces a reply. The user model then sees the running dialogue history, plus an explicit role-conditioning prompt instructing it to behave as a human user, and emits the next user turn. The two models swap turns for approximately three to seven rounds before the dialogue is terminated, after which a post-processing filter prunes unrealistic interactions.^[1]

A specific failure mode the authors highlight is role exchange: if the user model is shown only the dialogue history and not an explicit role-conditioning prompt, it tends to slip into assistant mode and start answering its own previous turn. UltraChat counteracts this with explicit personality prompts in Sector I and Sector III, and in Sector II additionally with a reminder of the writing goal at every user-turn step.^[1] During post-processing the authors strip excessively polite tokens like "Thank you," "Thanks," and "You're welcome" so that downstream models do not learn to over-fill their replies with empty courtesy.^[1] A comparison in the paper's appendix (Table 20) contrasts a directly generated multi-turn ChatGPT response with the UltraChat-style iterative output for the same opening prompt, showing the latter is substantially more substantive and structured.^[1]

The motivation for this two-agent design is partly philosophical: directly prompting one LLM to write an entire multi-turn dialogue is fast, but the resulting transcripts are short, lack realistic user behavior, and cannot exploit the RLHF training that already aligns ChatGPT toward human-like, useful responses. The two-agent setup, by contrast, lets the alignment of each API surface in its own turns.^[1]

Statistics of the released corpus

The paper reports the following headline statistics for UltraChat as released, compared against contemporary instruction corpora.^[1]

Dataset	Dialogues	Avg. turns	Avg. dialogue length (tokens)	Avg. utterance length (tokens)	Lexical diversity (MTLD)	User simulation
Self-Instruct	82,439	1.0	69.8	29.2	24.9	No
Stanford Alpaca	52,002	1.0	91.1	64.5	42.8	No
SODA	1,486,869	3.6	231.8	22.5	38.6	No
GPT-4-LLM	61,002	1.0	179.6	142.9	48.9	No
BELLE	1,436,679	1.0	102.3	63.3	35.9	No
Baize	210,311	3.1	293.9	52.8	67.1	Yes
GPT4All	711,126	1.0	597.7	318.9	62.7	No
UltraChat	1,468,352	3.8	1467.4	309.3	74.3	Yes

Source: Table 5 of Ding et al. (2023).^[1]

UltraChat leads on dialogue length (1,467.4 tokens on average, six times the size of contemporary multi-turn corpora such as SODA), highest average number of turns at 3.8, and highest lexical diversity (74.3 MTLD score). On topic diversity it placed second to GPT4All; the authors attribute this to the longer dialogues regularizing the per-dialogue embedding rather than to a lack of breadth.^[1] Coherence, scored by ChatGPT on a 1-10 scale over 10,000 random samples, tied at 9.06, the highest among the surveyed datasets.^[1]

In aggregate, the paper's official total of "1.5 million high-quality multi-turn dialogues" corresponds to the 1,468,352 dialogues reported in the comparison table, plus subsequent additions. The Hugging Face mirror at stingning/ultrachat exposes the corpus as one JSON object per dialogue, with each data array containing four to fourteen alternating user/assistant utterances.^[7]

UltraLLaMA: the original downstream model

To validate UltraChat, the authors fine-tuned LLaMA-13B on the corpus, producing UltraLLaMA. Training used a standard cross-entropy loss masked to assistant tokens only, with each dialogue split into sequences capped at 2,048 tokens so that long conversations contributed multiple training examples while preserving prior context. The training run used 128 NVIDIA A100 GPUs with a total batch size of 512.^[1]

The paper evaluates UltraLLaMA against Alpaca-7B, Vicuna-13B, Koala-13B, Dolly-12B, MPT-7B, OpenAssistant-12B, Baize-13B, and ChatGPT on three setups. In an independent ChatGPT-scored evaluation over 300 questions written by GPT-4 plus the Vicuna benchmark set, UltraLLaMA reaches an overall score of 9.023 +/- 0.952, edging out Vicuna's 8.961 +/- 0.718 (Table 1).^[1] Broken out by task type (Table 7), UltraLLaMA-13B scores 9.02 overall, against Vicuna-13B's 8.96 and ChatGPT's 9.12, and posts best-in-class numbers on World Knowledge-Difficult (9.33), Professional Knowledge-Physics (9.17), and the Vicuna evaluation set itself (8.70 vs. Vicuna's 8.63).^[1] On TruthfulQA's multiple-choice task it ties Vicuna-13B at 54 percent accuracy, the joint best among open models tested.^[1] In pairwise comparison (Figure 2 of the paper), UltraLLaMA wins up to 85 percent of head-to-head matchups against open-source baselines and beats Vicuna with a 13 percent higher win rate.^[1]

A 65B-parameter variant, UltraLM-65B v1.0, followed on 7 August 2023, and UltraLM-13B v2.0 alongside reward and critique models was released on 26 September 2023.^[6]

UltraChat 200k

The most widely deployed re-release of UltraChat is HuggingFaceH4/ultrachat_200k, prepared by the Hugging Face H4 (alignment) team in October 2023 for the Zephyr project.^[3] Despite the "200k" name, the dataset card reports 207,865 dialogues in the train_sft split and 23,110 in the test_sft split, plus a 256,032-row train_gen split and 28,304-row test_gen split intended for rejection-sampling or PPO-style preference generation. The total file size is approximately 1.62 GB in Parquet format under an MIT license.^[3]

The filtering recipe applied to the source UltraChat (described on the dataset card as "1.4M ChatGPT-generated dialogues") involves three steps.^[3] First, only a subset is retained, motivated by the goal of "faster supervised fine-tuning." Second, truecasing is applied, since roughly five percent of the original dialogues contained lower-case starts of sentences such as "Hello. how are you?" rather than the canonical "Hello. How are you?". Third, dialogues are removed where the assistant's reply contains stock evasions such as "I do not have emotions" or "I don't have opinions," even on fact-based prompts where such hedging is unwarranted.^[3] Each row in UltraChat 200k carries a prompt, a SHA-style prompt_id, and a messages array of {content, role} objects following the by-now-standard chat-template schema.^[3]

UltraChat 200k has been downloaded at scale (the Hugging Face Hub reported 72,127 downloads in a single recent month) and has been republished in localized variants such as Vietnamese, Dutch, and multilingual derivatives.^[3]

Zephyr-7B-beta and the dSFT + dDPO recipe

The first major downstream model trained on UltraChat 200k was Zephyr-7B-beta, released by Hugging Face on 25 October 2023 alongside the paper "Zephyr: Direct Distillation of LM Alignment" (arXiv:2310.16944) by Lewis Tunstall and colleagues.^[4] Zephyr starts from mistralai/Mistral-7B-v0.1 and is trained in two stages: a distilled supervised fine-tuning (dSFT) stage on UltraChat 200k, followed by a distilled direct preference optimization (dDPO) stage on the binarized UltraFeedback preference set.^[4]^[5] The Zephyr authors report a final MT-Bench score of 7.34 for Zephyr-7B-beta, surpassing both Llama-2-Chat-70B's 6.86 and Mistral-Instruct-v0.1's 6.84 at the time of release, with an AlpacaEval win rate of 90.60 percent.^[4]^[9]

The Hugging Face model card for Zephyr-7B-beta lists the SFT-stage hyperparameters in detail: learning rate 5e-7, train batch size 2 on each of 16 GPUs (effective batch size 32), the Adam optimizer with betas 0.9 and 0.999, linear LR schedule with 0.1 warmup ratio, and 3 epochs at seed 42. The final DPO validation metrics include a loss of 0.7496, rewards/chosen of -4.5221, rewards/rejected of -8.3184, and rewards/accuracies of 0.7812.^[4]

A noteworthy methodological choice in the Zephyr paper is that the team removed in-built alignment from the SFT data: they argue that the deflection patterns surfaced during UltraChat 200k filtering ("I do not have emotions") were artifacts of teacher-model alignment rather than genuine refusals, and removing them improved both MT-Bench and AlpacaEval performance. The corollary, acknowledged in the paper and on the model card, is that Zephyr-7B-beta is more willing to produce problematic outputs when explicitly prompted to do so.^[4]

UltraFeedback: the preference companion

The natural companion to UltraChat in the OpenBMB ecosystem is UltraFeedback, released in October 2023 with the paper "UltraFeedback: Boosting Language Models with Scaled AI Feedback" (arXiv:2310.01377) by Ganqu Cui and collaborators, also from Tsinghua University.^[8] UltraFeedback was accepted to ICML 2024 and supplies the Direct Preference Optimization (DPO) half of the open chat-model recipe.^[8]

The dataset comprises 64,000 prompts drawn from a mixture of six instruction sources (UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN), each paired with four model completions sampled from a diverse model pool, for a total of approximately 256,000 completions and over one million GPT-4 ratings.^[8] Ratings cover four axes (instruction-following, truthfulness, honesty, helpfulness), and approximately 340,000 pairwise comparison pairs can be derived for reward-model training.^[8] The companion reward model UltraRM-13B and critique model UltraCM-13B were both trained on UltraFeedback and released alongside it.^[6]^[8]

Because UltraFeedback shares both the OpenBMB authorship and an explicit overlap of prompts with UltraChat (UltraChat is one of the six prompt sources), the UltraChat + UltraFeedback combination is often referred to as the "Ultra recipe" for open chat-model alignment.^[8]

Downstream adoption

UltraChat (and especially UltraChat 200k) has been adopted as a default SFT corpus by a long sequence of open-weights chat models. The following table summarizes some of the most prominent.

Model	Base	UltraChat usage	MT-Bench	Notes
UltraLLaMA-13B	LLaMA-13B	Full UltraChat (1.5M)	n/a	Original OpenBMB model; outperformed Vicuna in paper.^[1]
UltraLM-65B v1.0	LLaMA-65B	Full UltraChat	n/a	August 2023; topped open-source AlpacaEval briefly.^[6]
Zephyr-7B-alpha	Mistral 7B	UltraChat 200k SFT	6.88	Hugging Face H4, October 2023.^[4]
Zephyr-7B-beta	Mistral 7B	UltraChat 200k SFT + UltraFeedback DPO	7.34	Surpassed Llama-2-Chat-70B (6.86).^[4]^[9]

These models in turn seeded countless community fine-tunes and merges; UltraChat 200k has been one of the most frequently mixed corpora in community SFT recipes since late 2023, often combined with OpenHermes-style synthetic chat or coding-specific corpora.^[3]

OpenBMB ecosystem context

UltraChat sits within a broader open-source toolchain published by OpenBMB, the open-source initiative coordinated out of Tsinghua's Natural Language Processing Lab. The ecosystem includes several artifacts the UltraChat group authored or contributed to.

UltraFeedback (arXiv:2310.01377) supplies the preference-data leg of the alignment recipe.^[8]
UltraRM-13B and UltraCM-13B are the reward and critique models trained on UltraFeedback, used both for reward modeling experiments and as judges for downstream evaluation.^[6]^[8]
UltraEval is an evaluation framework released in April 2024 (presented at the ACL 2024 demo track) that standardizes lightweight assessment for foundation models, used as the evaluation backbone for the MiniCPM family of efficient end-device LLMs.^[10]^[11]
MiniCPM, first released on 1 February 2024, is OpenBMB's series of small, efficient LLMs (the most recent versions, MiniCPM4 and MiniCPM4.1, target ultra-efficient end-device deployment) and uses UltraEval for its evaluation pipeline.^[10]^[11]

While MiniCPM's SFT mixture is its own proprietary curation rather than UltraChat directly, the lineage is clearly shared: many of the same authors, the same evaluation methodology, and the same emphasis on producing strong open chat models with strong open data.^[11]

Comparison to other instruction corpora

UltraChat occupies a particular niche in the instruction-data landscape, and it is useful to contrast it with adjacent corpora.

Corpus	Year	Size	Multi-turn?	Generated by	Distinctive feature
Stanford Alpaca	2023	52,002	No	text-davinci-003 via Self-Instruct	Seminal Self-Instruct dataset; single-turn only.^[1]
ShareGPT	2023	~70k+	Yes	Real ChatGPT users	User-shared transcripts; quality varies; licensing concerns.
Baize	2023	210,311	Yes (3.1 turns)	ChatGPT self-chat	Forerunner of two-agent simulation idea.^[1]
GPT4All	2023	711,126	No	GPT-3.5-Turbo	Long single-turn dialogues.^[1]
SODA	2023	1,486,869	Yes (3.6 turns)	Symbolic-social knowledge graph	Social/banter focus rather than instruction.^[1]
UltraChat	2023	1,468,352	Yes (3.8 turns)	GPT-3.5-Turbo (two-API simulation)	Tripartite taxonomy; long, coherent dialogues; user-simulation prompt.^[1]
UltraChat 200k	2023	207,865 train	Yes	(filtered UltraChat)	Truecased, evasion-pruned, MIT.^[3]

The defining contributions, relative to its peers, are (1) the explicit three-sector taxonomy that explicitly separates information access, generation, and transformation, (2) the iterative two-API user/assistant simulation rather than a single-shot dialogue prompt, and (3) the deliberate scaling to long, multi-round transcripts (1,467 tokens per dialogue on average versus 91 for Alpaca).^[1]

Significance and applications

UltraChat's significance stems less from any single number than from its role as connective tissue in the 2023-2024 wave of open chat models. Three concrete applications stand out.^[4]^[5]^[8]

Distillation of multi-turn behavior into small open models. Before UltraChat, most synthetic instruction corpora were single-turn (Alpaca, GPT-4-LLM) or short multi-turn (Baize at 3.1 average turns). The 3.8-turn average and 1,467-token average dialogue length of UltraChat gave smaller open models a chance to learn extended back-and-forth and topic continuation that previously required either RLHF or access to real user logs.^[1]

Repeatable open recipe. The Zephyr team's dSFT + dDPO pipeline (UltraChat 200k for SFT, UltraFeedback binarized for DPO) is sufficiently reproducible that it has been ported to many subsequent base models. The MT-Bench result of 7.34 from a 7-billion-parameter model trained on this combination served as a public proof point that open chat-model quality could close most of the gap to GPT-3.5-Turbo, which scores 7.94 on the same benchmark.^[4]

Anchor for downstream evaluation. The companion artifacts (UltraRM, UltraCM, UltraEval) provide a self-contained loop for generating data, training models, and evaluating them, all from a single research group. This integrated stack has been used as scaffolding for MiniCPM and other efficient-deployment LLM projects.^[10]^[11]

Limitations and criticisms

The UltraChat authors are explicit about several limitations, and downstream users have flagged additional concerns.^[1]^[3]

English only. The released corpus is English-only. The paper notes that the team is "actively working on collecting and constructing data in other languages, such as Chinese," but at the time of the original release no multilingual version existed.^[1] Community-mirrored multilingual derivatives (Vietnamese, Dutch, etc.) have since appeared on Hugging Face, but they are translations rather than native generation.^[3]

ChatGPT-derived ceiling. Because both the user simulator and the assistant simulator are GPT-3.5 Turbo, UltraChat in effect distills a particular version of ChatGPT's behavior. Models trained on it inherit not only its strengths but also its weaknesses: hedging language, mode collapse on certain refusal patterns, and (where present) factual hallucinations. The Hugging Face H4 team's filtering recipe explicitly targets the most egregious examples of this (the "I do not have emotions" hedging), but the cleanup is heuristic.^[3]

Topic-diversity ceiling. Although UltraChat leads in lexical diversity, the paper acknowledges that on topic-diversity scoring (cosine distance between dialogue embeddings) it falls slightly below GPT4All, an artifact the authors attribute to longer dialogues smoothing the per-dialogue embedding. In practice, the meta-topic list is finite (30 meta-concepts plus 10,000 Wikidata entities), which puts a soft ceiling on out-of-distribution prompts.^[1]

Energy and reproducibility cost. Training UltraLLaMA-13B took 128 A100 GPUs at total batch size 512, which the authors flag as "more energy-intensive than other lightweight models."^[1] The dataset's generation pipeline itself required calling GPT-3.5-Turbo APIs millions of times, which is non-trivial both financially and in terms of OpenAI-derived data dependence.

Evaluation methodology. The paper's headline comparison relies on ChatGPT-as-judge scoring rather than human evaluation, which the authors acknowledge "could produce steady results but is still not as reliable as GPT-4."^[1] Subsequent work in the community has shifted toward MT-Bench and AlpacaEval, where Zephyr-7B-beta's 7.34 / 90.60% numbers are the more frequently cited yardstick for UltraChat-derived models.^[4]^[9]

Data contamination risks. Because UltraChat ultimately reflects ChatGPT outputs, any benchmark that ChatGPT memorized (in particular, TruthfulQA-style benchmarks built before May 2023) is at risk of being indirectly leaked through fine-tuning on UltraChat. The paper's own TruthfulQA result of 54 percent for UltraLLaMA matches Vicuna exactly, which is consistent with both models inheriting the same teacher distribution.^[1]

Several adjacent topics in the open-source instruction-data and alignment landscape are directly relevant.

Instruction Tuning: the broader paradigm of fine-tuning language models on instruction-formatted data, of which UltraChat is a large-scale instance.
Vicuna: the open-source ShareGPT-based chat model that UltraLLaMA was designed to surpass.
Mistral 7B: the base model that, together with UltraChat 200k and UltraFeedback, produced the Zephyr-7B family.
Direct Preference Optimization (DPO): the algorithm used in the Zephyr dDPO stage to align UltraChat-SFT models with UltraFeedback preferences.
Reinforcement Learning from Human Feedback (RLHF): the older alignment-training framework that DPO-on-UltraFeedback replaces.
RLAIF: reinforcement learning from AI feedback, the broader class to which UltraFeedback's GPT-4-rated preferences belong.
MT-Bench: the multi-turn benchmark on which UltraChat-trained models such as Zephyr-7B-beta are most frequently scored.
AlpacaEval: the win-rate benchmark on which Zephyr-7B-beta and UltraLM models are also widely reported.
C4 (Colossal Clean Crawled Corpus): the source corpus from which UltraChat Sector III sources its existing-material seeds.
TruthfulQA: the world-knowledge benchmark used in the original UltraLLaMA evaluation.
Knowledge Distillation: the broader technique family of which UltraChat-based dSFT is an instance.
Synthetic data: the general category of training data that UltraChat exemplifies for chat models.
InstructGPT: the OpenAI line of work whose RLHF-aligned descendants (GPT-3.5-Turbo, ChatGPT) generated the UltraChat dialogues.
Tülu 3: a later open-recipe alignment project whose data mixture builds on the UltraFeedback tradition.

References

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, Bowen Zhou, "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations", arXiv, 2023-05-23. https://arxiv.org/abs/2305.14233. Accessed 2026-05-20. ↩
Ning Ding et al., "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations", Proceedings of EMNLP 2023, pp. 3029-3051, ACL Anthology, 2023-12-01. https://aclanthology.org/2023.emnlp-main.183/. Accessed 2026-05-20. ↩
HuggingFaceH4, "UltraChat 200k dataset card", Hugging Face Hub, 2023-10-25. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k. Accessed 2026-05-20. ↩
Lewis Tunstall et al., "Zephyr-7B-beta model card", Hugging Face Hub, 2023-10-25. https://huggingface.co/HuggingFaceH4/zephyr-7b-beta. Accessed 2026-05-20. ↩
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf, "Zephyr: Direct Distillation of LM Alignment", arXiv, 2023-10-25. https://arxiv.org/abs/2310.16944. Accessed 2026-05-20. ↩
OpenBMB / THUNLP, "UltraChat GitHub repository (thunlp/UltraChat)", GitHub, 2023-04-20. https://github.com/thunlp/UltraChat. Accessed 2026-05-20. ↩
stingning, "ultrachat dataset card", Hugging Face Hub, 2023-05-23. https://huggingface.co/datasets/stingning/ultrachat. Accessed 2026-05-20. ↩
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun, "UltraFeedback: Boosting Language Models with Scaled AI Feedback", arXiv, 2023-10-02. https://arxiv.org/abs/2310.01377. Accessed 2026-05-20. ↩
Hugging Face H4, "Open LLM Leaderboard scores for Zephyr-7B-beta", Hugging Face Hub, 2023-10-25. https://huggingface.co/HuggingFaceH4/zephyr-7b-beta. Accessed 2026-05-20. ↩
OpenBMB, "UltraEval: An open source framework for evaluating foundation models", GitHub, 2024-04-11. https://github.com/OpenBMB/UltraEval. Accessed 2026-05-20. ↩
OpenBMB, "MiniCPM repository", GitHub, 2024-02-01. https://github.com/OpenBMB/MiniCPM. Accessed 2026-05-20. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

LIMA (Less Is More for Alignment)