OpenOrca

Data & Datasets Large Language Models Open Source AI

21 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

17 citations

Revision

v3 · 4,230 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

OpenOrca is a large open-source instruction-tuning dataset that augments the FLAN Collection with chain-of-thought responses generated by OpenAI's GPT-3.5 and GPT-4 APIs.^[1] Released on Hugging Face under the Open-Orca/OpenOrca repository in 2023, the dataset is a community replication of the unreleased training corpus described in Microsoft Research's June 2023 paper "Orca: Progressive Learning from Complex Explanation Traces of GPT-4."^[1]^[2] OpenOrca contains roughly 1 million GPT-4 completions and 3.2 million GPT-3.5 completions arranged as (id, system_prompt, question, response) tuples, distributed under the MIT license.^[1] It has been used to fine-tune a long line of community models, including OpenOrca-Preview1-13B, OpenOrcaxOpenChat-Preview2-13B, OpenOrca-Platypus2-13B, the LlongOrca long-context models, and Mistral-7B-OpenOrca, several of which briefly held leading positions on the Hugging Face Open LLM Leaderboard for their parameter class.^[1]^[3]^[4]^[5]

Infobox

Attribute	Value
Repository	`Open-Orca/OpenOrca` on Hugging Face^[1]
Released	2023^[1]
Authors (cited)	Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, "Teknium"^[1]
Other listed contributors	Eric Hartford, NanoBit, Pankaj, Winddude, Rohan, plus AlignmentLab.ai team (Autometa, Entropi, AtlasUnified, NeverendingToast)^[1]
Total examples	~4.2M (~1M GPT-4, ~3.2M GPT-3.5)^[1]
Source corpus	FLAN Collection submixes (CoT, FLAN 2021, T0, NIv2)^[1]^[6]
Generators	OpenAI GPT-3.5 and GPT-4 APIs^[1]
Format	Parquet, single train split^[1]
Schema	id, system_prompt, question, response^[1]
License	MIT^[1]
Companion subset	SlimOrca (~518k entries) and SlimOrca-Dedup (~363k)^[7]^[8]

Background and motivation

The OpenOrca project was conceived as a community response to a specific gap left open by Microsoft Research. In June 2023 a Microsoft team consisting of Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah posted "Orca: Progressive Learning from Complex Explanation Traces of GPT-4" to arXiv as preprint 2306.02707.^[2] The paper described a 13-billion-parameter model that learned to imitate the reasoning process of large foundation models from rich GPT-4 signals, including step-by-step thought processes and complex instructions, with intermediate guidance from ChatGPT used as a teaching assistant.^[2] The authors reported that their resulting Orca-13B model exceeded Vicuna-13B by more than 100% on the BIG-Bench Hard benchmark and by 42% on AGIEval, attaining parity with ChatGPT on BBH and showing competitive results on standardized tests such as SAT and LSAT.^[2]

The key contribution of the paper was a training recipe the authors named explanation tuning: instead of imitating only final answers from a teacher model, the student model was trained on full reasoning chains, including the system prompts that conditioned the teacher to produce careful, structured explanations.^[2] However, while Microsoft indicated that it would release a weight-difference for an Orca model in keeping with the LLaMA-1 weight-diff policy of the time, the actual training data, comprising millions of GPT-4 explanation traces against the FLAN Collection, was never released by Microsoft.^[2] That non-release deprived the open community of the central resource needed to reproduce the paper's central claims about explanation tuning.

OpenOrca was launched within weeks of the Orca paper as a grassroots reconstruction. The dataset card on Hugging Face frames the project explicitly as an attempt to reproduce the corpus described by Mukherjee and collaborators, populating the same FLAN Collection submixes with fresh GPT-4 and GPT-3.5 responses elicited under the system prompts indicated by the Orca paper.^[1] By July 2023 the OpenOrca team had published an early model, OpenOrca-Preview1-13B, fine-tuned on a refined 200k-row slice (roughly 6% of the dataset) and trained on eight A100-80G GPUs for fifteen hours at a commodity cost under $200, attaining roughly 60% of the BIG-Bench Hard and AGIEval gains reported in the Orca paper.^[9] That early result showed that the replication strategy was sound and motivated the team to continue scaling GPT-3.5 and GPT-4 generation.

The work was organised by Wing Lian (also known by the handle Caseus), founder of the Axolotl fine-tuning framework and a member of the OpenAccess AI Collective, in collaboration with the AlignmentLab.ai community. The author list on the formal citation reads "Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"," with broader contributor credits given to Eric Hartford, NanoBit, Pankaj, Winddude, Rohan, and several members of AlignmentLab.ai.^[1]

The FLAN Collection backbone

OpenOrca does not generate questions de novo. Instead it takes prompts directly from the FLAN Collection, a Google open-source aggregation of instruction-tuning datasets published in 2023.^[6] The FLAN Collection compiles tasks from FLAN 2021, the P3 prompt collection, Super-Natural Instructions, and several chain-of-thought corpora, then formats them as a mix of zero-shot, few-shot, and chain-of-thought prompts.^[6] The Google authors of the FLAN Collection paper, "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning," divided the data into a few sub-mixtures: flan (FLAN 2021), t0 (P3 excluding FLAN 2021), niv2 (Super-Natural Instructions), cot (chain-of-thought datasets), and dialog.^[6]

OpenOrca draws on the first four of these submixes. Every row in the OpenOrca dataset carries an id field prefixed with one of niv, t0, cot, or flan to indicate which sub-mixture the underlying question came from.^[1] The OpenOrca team has openly acknowledged that the dataset is approximately 1.5 million entries smaller than the corpus described in the original Orca paper. Their public dataset card flags two main gaps: the chain-of-thought sub-mixture contains roughly 75,000 examples rather than the 150,000 zero-shot CoT entries described by Microsoft, and the FLAN 2021 and T0 sub-mixtures are missing roughly 1.25 million and 200,000 entries respectively because the publicly hosted FLAN releases did not include those rows in pre-generated form.^[1]

Building on the FLAN backbone means OpenOrca inherits FLAN's task coverage: text classification, summarization, table question answering, reading comprehension, multilingual reasoning, math word problems via GSM8K-style prompts, and many of the chain-of-thought datasets included in the FLAN CoT sub-mixture.^[6] FLAN's existing human-annotated ground-truth answers also become important later, in the construction of the SlimOrca companion subset described below.

Data generation and schema

Each OpenOrca example is a triple of system prompt, user question, and model response, stored alongside an identifier that traces the example back to its FLAN-Collection origin. The four schema fields are documented as:^[1]

Field	Description
`id`	Unique identifier including a source prefix (`niv`, `t0`, `cot`, or `flan`)
`system_prompt`	The system prompt used for the GPT-3.5 or GPT-4 API query
`question`	The original question from the FLAN Collection
`response`	The completion returned by GPT-3.5 or GPT-4

The Orca paper described a set of system prompts that condition GPT-4 to produce carefully structured explanation traces.^[2] OpenOrca preserves the system_prompt field so that downstream consumers can either retain the Orca-style conditioning or strip it during fine-tuning. The original FLAN question goes through unchanged, while the response field captures whatever the teacher model returned to the API call. Because GPT-3.5 and GPT-4 calls differ in cost by approximately an order of magnitude, the dataset is heavily skewed toward GPT-3.5 traces, with roughly 3.2 million GPT-3.5 rows accompanying the rarer and more expensive ~1 million GPT-4 rows.^[1]

The data is stored in Parquet and exposed as a single training split of approximately 2.94 million rows once data is deduplicated and serialised, with a total file footprint of around 4.1 GB.^[1] The English-only dataset is distributed under the MIT license.^[1]

A practical detail noted by the team is that early generation passes occasionally produced answers prefixed with self-referential phrases such as "As an AI language model...". The team observed that those refusal-style prefixes degraded downstream reasoning quality, and applied light filtering to remove them on the GPT-4 subset before training their preview models.^[9]

Models trained on OpenOrca

OpenOrca was designed to enable open-source instruction tuning at a scale that previously required closed-source resources, and the team itself published a series of fine-tunes that demonstrated the dataset's effectiveness.

OpenOrca-Preview1-13B

The first public model in the series was OpenOrca-Preview1-13B, a LLaMA-13B fine-tune trained on 200,000 GPT-4 entries (about 6% of the full dataset at the time). It was trained for four epochs on eight A100-80G GPUs over fifteen hours at an estimated commodity cost under $200, with the three-epoch snapshot retained as the released weight set.^[9] Despite using a small fraction of the full corpus, the model attained roughly 60% of the BIG-Bench-Hard and AGIEval gains over the LLaMA-13B baseline reported in the Orca paper, validating the OpenOrca replication strategy.^[9] The model used the Alpaca prompt format with ### Instruction: and ### Response: markers and was trained using the Axolotl framework.^[9]

OpenOrcaxOpenChat-Preview2-13B

A subsequent collaboration with the OpenChat team produced OpenOrcaxOpenChat-Preview2-13B, a Llama-2-13B fine-tune that used the OpenChat conditional reinforcement learning packing strategy on top of OpenOrca data.^[4] At its August 10, 2023 release it placed first among all 13B-parameter models on both the Hugging Face Open LLM Leaderboard and the GPT4ALL leaderboard, with performance beyond Falcon-40B-instruct and close to the LLaMA-1-65B base model, achieving roughly 103% of the relative gains reported in the Orca paper while using less than 20% of the original data and less than one-tenth of the training budget.^[4]

OpenOrca-Platypus2-13B

OpenOrca-Platypus2-13B was created on August 11, 2023 by merging OpenOrcaxOpenChat-Preview2-13B with garage-bAInd/Platypus2-13B, a LoRA-tuned LLaMA-2-13B model trained on the STEM-and-logic-oriented Open-Platypus dataset described in the Platypus paper (arXiv 2308.07317) by Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz.^[10]^[11] OpenOrca-Platypus2-13B reached an average score of 64.56 on the Hugging Face Open LLM Leaderboard, decomposing as 59.5 on MMLU (5-shot), 62.88 on ARC-Challenge (25-shot), 83.19 on HellaSwag (10-shot), and 52.69 on TruthfulQA (0-shot), making it the first 13B model to surpass the original LLaMA-1-65B base on that leaderboard.^[3] Independent reproduction of the Orca paper's BIG-Bench-Hard and AGIEval comparisons placed the merged model at 105% and 112% of the LLaMA-2-13B baseline respectively.^[3]

LlongOrca long-context variants

The team also packaged a long-context line of models named LlongOrca, which combined the OpenOrca data with the LongLoRA-style 16,000-token-context recipe to produce LlongOrca-7B-16k and LlongOrca-13B-16k. According to the dataset card, both held leading positions on the Hugging Face Open LLM Leaderboard in the long-context categories for the 7B and 13B classes at their release dates, with the 7B variant retaining over 99% of the top non-long-context 7B model's score and the 13B variant retaining over 97%.^[1]

Mistral-7B-OpenOrca

The most widely adopted OpenOrca-derived model is Mistral-7B-OpenOrca (codenamed "MistralOrca"), released in early October 2023 about a week after Mistral AI's September 27, 2023 publication of the Mistral 7B base model.^[5]^[12] Mistral-7B-OpenOrca is a full fine-tune (not a LoRA) of Mistral-7B-v0.1, trained for four epochs on a curated GPT-4 subset of OpenOrca for 62 hours on eight A6000 GPUs at an estimated commodity cost of approximately $400, using OpenChat-style packing inside the Axolotl framework.^[5] The model uses the ChatML chat-template format.^[5]

On the Hugging Face Open LLM Leaderboard at release time it scored an average of 65.84, with 62.24 on MMLU (5-shot), 64.08 on ARC-Challenge (25-shot), 83.99 on HellaSwag (10-shot), and 53.05 on TruthfulQA (0-shot), reported as 106% of the base Mistral-7B and 98.6% of Llama-2-70B-chat on those metrics, and ranking first for all models under 30B at release time.^[5] On AGIEval the model recorded roughly 129% of the base Mistral-7B average, and on BIG-Bench-Hard 119%.^[5] Mistral-7B-OpenOrca also scored 6.86 on the MT-Bench dialogue benchmark, described by the team as on par with Llama-2-70B-chat on that test.^[5]

Mistral-7B-SlimOrca

A companion release, Mistral-7B-SlimOrca, applied the same fine-tuning recipe but on the smaller SlimOrca subset (described below). With approximately 500,000 entries it trained for four epochs in 40 hours on eight A6000 GPUs at a commodity cost near $240. The resulting Open LLM Leaderboard scores were essentially identical: 65.85 average, with 62.77 on MMLU, 62.54 on ARC, 83.86 on HellaSwag, and 54.23 on TruthfulQA.^[13] The near-equivalence with Mistral-7B-OpenOrca despite roughly one-third less training compute is the central demonstration of SlimOrca's data-efficiency claim.

SlimOrca and SlimOrca-Dedup

SlimOrca is a cleaned subset of the OpenOrca GPT-4 split, intended as a more compute-efficient training corpus. The dataset card describes the filter as an additional pass that uses GPT-4 to remove answers that appear incorrect when compared against the human-annotated ground-truth labels already present in the underlying FLAN Collection examples.^[7] In effect, FLAN provides a gold answer for each prompt, GPT-4 generated a free-form chain-of-thought response, and SlimOrca discards rows where the GPT-4 answer disagrees with the FLAN gold label.^[7]

The resulting dataset contains approximately 518,000 entries (517,982 rows by the dataset card's accounting) at roughly 986 MB in JSON form, distributed under the MIT license.^[7] The SlimOrca authors are listed as Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium."^[7] The team argues that SlimOrca permits training to similar quality as the full OpenOrca data at approximately two-thirds the compute, a claim supported by the near-identical Mistral-7B-SlimOrca and Mistral-7B-OpenOrca leaderboard scores cited above.^[5]^[13]

A further-cleaned variant, SlimOrca-Dedup, deduplicates SlimOrca using minhash and Jaccard similarity, drops RLHF instances, and yields approximately 363,000 unique conversations formatted as ShareGPT-style role-tagged messages.^[8] The Open-Orca organization also publishes a "slimorca-deduped-cleaned-corrected" variant that strips redundant system prompts and removes soft-prompted refusal patterns from a roughly half-sized cut of SlimOrca-Dedup.^[8]

Downstream impact and adoption

The community impact of OpenOrca and its derivatives is substantial. Hugging Face lists Mistral-7B-OpenOrca and Mistral-7B-SlimOrca as base models for many subsequent fine-tunes, with quantizations made widely available by TheBloke (AWQ, GPTQ, and GGUF formats) and packaging by Ollama under the mistral-openorca tag.^[5]^[13] The Jackalope-7B model from OpenAccess AI Collective is one published example explicitly built on SlimOrca data.^[7]

Beyond direct fine-tunes, the OpenOrca corpus and its SlimOrca subset have served as training data for an extensive set of community models tuned with Axolotl, the open-source fine-tuning framework also written by Wing Lian.^[5]^[13] The OpenOrca dataset card cites Hugging Face counts of more than 600 models trained on SlimOrca alone.^[7]

The success of these models, in particular the way Mistral-7B-OpenOrca briefly held the top sub-30B-parameter Hugging Face Open LLM Leaderboard position with a fully open dataset and a 62-hour eight-GPU run, became a frequently cited demonstration that the dataset gap between closed-source frontier teams and the open community could be partially closed by carefully curated GPT-augmented FLAN data.^[5]

A practical consequence of the OpenOrca rollout was that mid-2023 commodity GPU rentals became sufficient to train competitive sub-30B chat models. The OpenOrca dataset card reports that OpenOrca-Preview1-13B was produced in fifteen hours of eight-A100 time for under $200 in cloud commodity cost, OpenOrca-Platypus2-13B was produced with a single A100-80G LoRA run, and Mistral-7B-OpenOrca was produced in 62 hours of eight-A6000 time for roughly $400.^[1]^[3]^[5]^[9] These numbers, repeated across model cards and downstream coverage, helped to normalise expectations that small, independent teams could publish leaderboard-competitive chat models without large fundraising, provided they had access to a teacher-augmented dataset such as OpenOrca.^[5]

Comparison table: OpenOrca-derived models

Model	Base	OpenOrca data used	Hardware / time	Reported Open LLM Leaderboard average
OpenOrca-Preview1-13B^[9]	LLaMA-13B	~200k GPT-4 rows	8x A100-80G, 15h	~60% of Orca-paper gains on BBH+AGIEval
OpenOrcaxOpenChat-Preview2-13B^[4]	Llama 2-13B	Curated GPT-4 OpenOrca	Released 2023-08-10	First 13B above LLaMA-1-65B base
OpenOrca-Platypus2-13B^[3]	Llama 2-13B (merged)	OpenOrca + Open-Platypus	Single A100-80G, LoRA	64.56
LlongOrca-7B-16k / 13B-16k^[1]	Llama 2 + LongLoRA	OpenOrca	Long-context fine-tune	Top long-context 7B / 13B at release
Mistral-7B-OpenOrca^[5]	Mistral 7B v0.1	Curated GPT-4 OpenOrca	8x A6000, 62h, ~$400	65.84
Mistral-7B-SlimOrca^[13]	Mistral 7B v0.1	SlimOrca (~500k rows)	8x A6000, 40h, ~$240	65.85

Benchmark details

The Open LLM Leaderboard scores cited above are taken from the model cards on Hugging Face and are reported using the leaderboard's standard configuration: five-shot MMLU, 25-shot ARC-Challenge, ten-shot HellaSwag, and zero-shot TruthfulQA.^[3]^[5]^[13] OpenOrca-Platypus2-13B's reported numbers are 59.5 on MMLU, 62.88 on ARC, 83.19 on HellaSwag, and 52.69 on TruthfulQA, for an average of 64.56.^[3] Mistral-7B-OpenOrca reported 62.24, 64.08, 83.99, and 53.05 (average 65.84), and Mistral-7B-SlimOrca reported 62.77, 62.54, 83.86, and 54.23 (average 65.85).^[5]^[13]

Beyond the leaderboard, the Open-Orca team has published comparisons against the Microsoft Orca paper using two of the benchmarks Microsoft highlighted: BIG-Bench Hard and AGIEval. OpenOrca-Preview1-13B reported a BIG-Bench-Hard average of 0.3753 and an AGIEval average of 0.3638, framed as approximately 60% of the gain reported in the Orca paper.^[9] Mistral-7B-OpenOrca's AGIEval and BIG-Bench-Hard reports were 129% and 119% of the base Mistral-7B respectively, with an average AGIEval score of 0.397 and average BBH score of 0.416.^[5] The Mistral-7B-OpenOrca card additionally reports a MT-Bench score of 6.86 and a GPT4ALL benchmark average of 72.38.^[5]

The dataset card for OpenOrca itself does not include first-party AlpacaEval numbers; AlpacaEval is mentioned as a relevant benchmark for the model class but the cited results come from secondary scorecards rather than the OpenOrca card.^[5]

Reproduction of the Orca paper

OpenOrca exists primarily so that researchers can revisit and probe the central claims of the Orca paper without depending on a corpus held by Microsoft Research. The Orca paper described two design choices that distinguish it from earlier student-from-teacher recipes: a focus on explanation traces (the student is trained on full GPT-4 reasoning chains, not just final answers) and progressive learning (the student is first warmed up on ChatGPT outputs and then exposed to GPT-4 outputs).^[2] OpenOrca captures the first design choice directly through the dataset's system_prompt plus response schema, with the system prompt selected from the small bank of explanation-eliciting prompts described in the Orca paper.^[1] The second choice is implementable from the dataset because OpenOrca explicitly separates GPT-3.5 (ChatGPT) traces from GPT-4 traces, allowing downstream users to recreate the progressive learning schedule by mixing the two splits during training.^[1]

The OpenOrca dataset card frames OpenOrca-Platypus2-13B and Mistral-7B-OpenOrca as evidence that the explanation-tuning recipe is reproducible and generalises to base models that did not exist when Microsoft wrote the original paper.^[1] Mistral-7B-OpenOrca in particular was trained on a Mistral-7B base released after the Orca paper had appeared, so its reported leaderboard performance can be read as a test of whether explanation tuning still helps when the underlying pretrained model is stronger than LLaMA-1-13B.^[5]

Two methodological differences between OpenOrca and the corpus described in the Microsoft paper are worth flagging for researchers who want to repeat the experiments. First, the Microsoft team described a roughly four-to-one ratio of GPT-3.5 to GPT-4 examples and used the GPT-3.5 examples in a warm-up phase before the GPT-4 phase. OpenOrca achieves a similar ratio in aggregate (roughly 3.2:1 GPT-3.5 to GPT-4) but exposes both splits side-by-side rather than imposing the staged schedule, leaving the choice of warm-up to downstream consumers.^[1] Second, the FLAN-shortfall numbers acknowledged on the OpenOrca dataset card mean that OpenOrca-trained replications of the Orca paper begin with a smaller and slightly differently distributed corpus than the original, so any quantitative claim of "reproducing Orca" should be qualified accordingly.^[1] In practice the community has tended to focus on Hugging Face Open LLM Leaderboard scores and BIG-Bench-Hard or AGIEval comparisons that put OpenOrca-trained models alongside the Orca paper's reported deltas rather than reproducing the Microsoft numbers identically.^[3]^[5]^[9]

Limitations and criticisms

OpenOrca's documentation and downstream commentary identify several limitations.

First, OpenOrca is incomplete relative to the dataset Microsoft used. The dataset card itself flags missing rows: only roughly 75,000 CoT entries against an intended 150,000, around 1.25 million missing FLAN 2021 rows, and approximately 200,000 missing T0 rows, leaving the public dataset about 1.5 million rows short of the Microsoft target.^[1] This shortfall is attributed primarily to limitations in publicly hosted FLAN Collection releases at the time, rather than to a deliberate sampling choice.^[1]

Second, OpenOrca inherits GPT-4 and GPT-3.5 errors. The SlimOrca cleaning pass exists precisely because spot checks showed that some GPT-4 responses disagreed with FLAN ground-truth labels, and removing those rows reduced training data by roughly 88% without obviously hurting downstream evaluation scores.^[7] That ratio implies most OpenOrca rows are still affected by some form of teacher-model error, even after filtering "As an AI language model..."-style prefixes during early training runs.^[9]

Third, the dataset's English-only scope and FLAN-derived task distribution leave gaps in multilingual coverage, coding tasks, and modern conversational behaviours such as tool use and function calling.^[1] Downstream models built on OpenOrca have been augmented with additional datasets (Open-Platypus for STEM coverage in OpenOrca-Platypus2-13B, for example) precisely to address those gaps.^[3]

Fourth, and most prominently, OpenOrca raises a licensing question that the team has acknowledged but not resolved definitively. OpenAI's terms of service prohibit using the outputs of its models to develop products that compete with OpenAI.^[14] OpenOrca is distributed under the MIT license, which by design imposes no such restriction on downstream users.^[1] The team's broader community reasoning, articulated separately by contributor Eric Hartford in an essay on OpenAI dataset licensing, is that the contractual restriction binds the user who generates the outputs through their account agreement with OpenAI, not the data itself, so once outputs are produced and owned by the user, they may be relicensed and redistributed independently.^[15] That argument has not been tested in court, and downstream users who choose to fine-tune commercial models on OpenOrca should be aware that they may inherit some legal exposure even if the dataset itself is MIT-licensed.^[14]^[15]

Fifth, the rate at which the open community has moved beyond OpenOrca means that newer instruction-tuned chat models often outperform Mistral-7B-OpenOrca on the same Hugging Face Open LLM Leaderboard categories, in part because DPO and other preference-optimisation methods now use OpenOrca-style synthetic data as one ingredient among many rather than as the sole training corpus.^[16]

Successor and parallel projects

Microsoft followed its original Orca paper with Orca 2 on November 20, 2023, two seven- and 13-billion-parameter LLaMA-2 fine-tunes trained on tailored synthetic data that emphasised "prompt erasure," meaning the student saw the teacher's task and response but not the system prompt that elicited the response.^[17] Microsoft reported that Orca 2 surpassed similarly sized models, including the original Orca, on advanced reasoning tasks in zero-shot settings, and made the weights available for research use.^[17] Orca 2 did not release its training corpus either, so OpenOrca remains the open-community analogue rather than being replaced by the official Microsoft data.

OpenOrca itself has been complemented rather than displaced by later open instruction-tuning datasets. Subsequent synthetic-data corpora produced by community fine-tuners have continued to draw on OpenOrca for its strong FLAN-style task coverage, often combining OpenOrca or SlimOrca with conversational sources, code-instruction datasets, and DPO preference data.^[16]

OpenOrca is one node in a growing family of open instruction-tuning resources. Related entries on this wiki include:

Instruction tuning, the general technique that OpenOrca exemplifies on the data side.^[6]
Synthetic data, the broader practice of using large model outputs to bootstrap training corpora.
Knowledge distillation from teacher to student models, which the Orca paper formalises into explanation tuning.^[2]
Supervised fine-tuning (SFT), the optimisation target for models trained on OpenOrca.
Axolotl, the fine-tuning framework most commonly used to train OpenOrca-derived models.
Chain-of-thought reasoning, the response format the FLAN CoT submix and Orca paper system prompts try to elicit.^[6]^[2]
Mistral 7B, the base model that powered the most widely adopted OpenOrca fine-tune.
Vicuna and the broader family of GPT-distilled chat models that motivated the Orca paper's critique of style-only imitation.^[2]
Direct Preference Optimization, used in successor recipes that often build on OpenOrca SFT checkpoints.

References

Open-Orca Team (Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, "Teknium"), "Open-Orca/OpenOrca Dataset Card", Hugging Face, 2023. https://huggingface.co/datasets/Open-Orca/OpenOrca. Accessed 2026-05-20. ↩
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah, "Orca: Progressive Learning from Complex Explanation Traces of GPT-4", arXiv preprint 2306.02707, 2023-06-05. https://arxiv.org/abs/2306.02707. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/OpenOrca-Platypus2-13B Model Card", Hugging Face, 2023. https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/OpenOrcaxOpenChat-Preview2-13B Model Card", Hugging Face, 2023-08-10. https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/Mistral-7B-OpenOrca Model Card", Hugging Face, 2023. https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca. Accessed 2026-05-20. ↩
Shayne Longpre et al., "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning", Google Research, 2023. https://research.google/pubs/the-flan-collection-designing-data-and-methods-for-effective-instruction-tuning/. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/SlimOrca Dataset Card", Hugging Face, 2023. https://huggingface.co/datasets/Open-Orca/SlimOrca. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/SlimOrca-Dedup Dataset Card", Hugging Face, 2023. https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/OpenOrca-Preview1-13B Model Card", Hugging Face, 2023. https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B. Accessed 2026-05-20. ↩
Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz, "Platypus: Quick, Cheap, and Powerful Refinement of LLMs", arXiv preprint 2308.07317, 2023-08-14. https://arxiv.org/abs/2308.07317. Accessed 2026-05-20. ↩
garage-bAInd, "Platypus2-13B Model Card", Hugging Face, 2023. https://huggingface.co/garage-bAInd/Platypus2-13B. Accessed 2026-05-20. ↩
Mistral AI, "Announcing Mistral 7B", Mistral AI Blog, 2023-09-27. https://mistral.ai/news/announcing-mistral-7b. Accessed 2026-05-20. ↩
Open-Orca Team, "Open-Orca/Mistral-7B-SlimOrca Model Card", Hugging Face, 2023. https://huggingface.co/Open-Orca/Mistral-7B-SlimOrca. Accessed 2026-05-20. ↩
OpenAI, "Terms of Use", OpenAI Policies, 2024. https://openai.com/policies/row-terms-of-use/. Accessed 2026-05-20. ↩
Eric Hartford, "Demystifying OpenAI's Terms of Use with Regards to Dataset Licenses", erichartford.com, 2023. https://erichartford.com/demystifying-openais-terms-of-use-with-regards-to-dataset-licenses. Accessed 2026-05-20. ↩
Hugging Face, "HuggingFaceH4/orca_dpo_pairs Dataset Card", Hugging Face, 2023. https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs/blob/main/README.md. Accessed 2026-05-20. ↩
Microsoft Research, "Orca 2: Teaching Small Language Models How to Reason", Microsoft Research Blog, 2023-11-20. https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/. Accessed 2026-05-20. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

Dolma

Infobox

Background and motivation

The FLAN Collection backbone

Data generation and schema

Models trained on OpenOrca

OpenOrca-Preview1-13B

OpenOrcaxOpenChat-Preview2-13B

OpenOrca-Platypus2-13B

LlongOrca long-context variants

Mistral-7B-OpenOrca

Mistral-7B-SlimOrca

SlimOrca and SlimOrca-Dedup

Downstream impact and adoption

Comparison table: OpenOrca-derived models

Benchmark details

Reproduction of the Orca paper

Limitations and criticisms

Successor and parallel projects

Related Work

See also

References

Improve this article

Related Articles

Dolma

RefinedWeb

SlimPajama

Cosmopedia

TxT360

The Pile (dataset)

What links here

Related Articles

Dolma

RefinedWeb

SlimPajama

Cosmopedia

TxT360

The Pile (dataset)