LIMA (Less Is More for Alignment)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,195 words
LIMA, short for "Less Is More for Alignment," is a 2023 research paper by Chunting Zhou and colleagues at Meta AI, Carnegie Mellon University, the University of Southern California, and Tel Aviv University that introduced a 65-billion-parameter language model fine-tuned on only 1,000 carefully curated prompt-response pairs, with no reinforcement learning and no human preference data.[^1] The paper formalised the "Superficial Alignment Hypothesis," the claim that almost all knowledge in a large language model is acquired during pretraining and that alignment merely teaches the model which subdistribution of formats to use when interacting with users.[^1] LIMA's controlled comparisons showed that a small, high-quality supervised fine-tuning (SFT) set could produce responses judged equivalent to or better than those from GPT-4 in 43 percent of human pairwise judgments, and equivalent to or better than those from Bard and OpenAI's text-davinci-003 in 58 percent and 65 percent of cases respectively.[^1] Published at NeurIPS 2023, the paper became one of the most cited touchstones in the post-ChatGPT discourse on instruction tuning data, motivating a wave of follow-up work on data curation, quality filtering, and the limits of stylistic versus reasoning-oriented alignment.[^1][^2]
By spring 2023 the dominant recipe for turning a base language model into a chat assistant had two costly stages after pretraining: large-scale SFT on hundreds of thousands to millions of instruction-response pairs, followed by reinforcement learning from human feedback (RLHF) using pairwise preference data and an algorithm such as Proximal Policy Optimization.[^3] OpenAI's InstructGPT established this pipeline in early 2022, and the subsequent rollout of ChatGPT and GPT-4 suggested that scale of preference data was central to producing usable assistants.[^4] Public efforts to reproduce the recipe followed: Stanford's Alpaca distilled around 52,000 instructions from text-davinci-003 onto a 7-billion parameter LLaMA base, Vicuna used roughly 70,000 ShareGPT conversations, and other open-source projects pushed in the same direction of larger instruction sets generated by stronger teachers.[^5]
The LIMA authors questioned whether this post-training pipeline really needed to be so heavy. Their team, which included Mike Lewis, Luke Zettlemoyer, Omer Levy, and Susan Zhang at Meta AI, together with collaborators at Carnegie Mellon University, had access to Meta's 65B parameter LLaMA base model and a long internal history of work on instruction-tuned text generation.[^1] The paper they posted on arXiv on 18 May 2023, identifier 2305.11206, set out to disentangle what pretraining had already taught a model from what instruction tuning added.[^1] It was accepted to NeurIPS 2023 as a poster presentation.[^2]
The central conceptual contribution of the paper is a single sentence that the authors call the Superficial Alignment Hypothesis: "A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users."[^1] In this framing, the long secondary phase of instruction tuning and preference optimization is not adding capability so much as selecting a writing style, a layout convention, and a register that match the expectations of an end user.[^1]
Two empirical predictions follow from this hypothesis. First, if alignment is principally about format selection, then a small set of high-quality demonstrations should suffice to elicit useful behaviour from a sufficiently strong base model, because the model already knows what to say and only needs to be shown how to present it. Second, scaling up the alignment dataset without scaling up the diversity or quality of its prompts should yield rapidly diminishing returns, because additional examples mostly repeat the same stylistic lessons. The paper presents controlled experiments designed to test both predictions.[^1]
The hypothesis is deliberately strong. It is not merely the claim that pretraining contributes substantially to a model's behaviour, which would be uncontroversial; it is the claim that alignment teaches "almost" no new knowledge and is primarily a selection over already-acquired distributions.[^1] Read literally, this implies that the order of magnitude of usable instruction data should not matter once a basic stylistic template has been conveyed, and that hand-curated demonstrations should compete with much larger noisy datasets. Both claims are testable, and both are tested in the paper's ablation section. The hypothesis also implies a sharp asymmetry between domains the base model already handles (general knowledge, common reasoning patterns, everyday writing) and domains where the base model is weak (specialised reasoning, novel skills, multi-step planning). The paper does not test this asymmetry directly, but it is a key crack in the framing that follow-up work would pry open.[^1][^13]
The training set comprises exactly 1,000 prompt-response pairs totalling roughly 750,000 tokens.[^1] The authors split the data into two sources: 750 community-sourced examples and 250 manually authored examples.[^1]
| Source | Count | Notes |
|---|---|---|
| Stack Exchange (STEM) | 200 | Top-voted question and answer pairs from technical sites |
| Stack Exchange (Other) | 200 | Top-voted pairs from non-technical sites |
| wikiHow | 200 | One article per category, title as prompt, body as response |
| Pushshift r/WritingPrompts | 150 | Highest-scoring story prompts and replies |
| Paper Authors (Group A, training) | 200 | Hand-written prompts and responses |
| Super-Natural Instructions | 50 | Sampled and lightly rewritten task instances |
| Total training | 1,000 | Roughly 750,000 tokens across 1,000 sequences |
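The arithmetic behind the table can be checked with a short sketch. The counts are copied from the table; the grouping of sources into "community" and "manual" follows the 750/250 split stated in the text, and the names `LIMA_MIXTURE`, `COMMUNITY_SOURCES`, and `summarize` are illustrative, not from the paper.

```python
# Per-source example counts, copied from the table above.
LIMA_MIXTURE = {
    "Stack Exchange (STEM)": 200,
    "Stack Exchange (Other)": 200,
    "wikiHow": 200,
    "Pushshift r/WritingPrompts": 150,
    "Paper Authors (Group A, training)": 200,
    "Super-Natural Instructions": 50,
}

# Sources counted as community-sourced in the 750/250 split.
COMMUNITY_SOURCES = {
    "Stack Exchange (STEM)",
    "Stack Exchange (Other)",
    "wikiHow",
    "Pushshift r/WritingPrompts",
}

TOTAL_TOKENS = 750_000  # approximate figure quoted in the text


def summarize(mixture, total_tokens):
    """Return headline totals for a {source: count} mixture."""
    total = sum(mixture.values())
    community = sum(n for src, n in mixture.items() if src in COMMUNITY_SOURCES)
    return {
        "total_examples": total,
        "community": community,
        "manual": total - community,
        "avg_tokens_per_example": total_tokens / total,
    }


summary = summarize(LIMA_MIXTURE, TOTAL_TOKENS)
print(summary)
```

On these figures the average example is roughly 750 tokens long, which is consistent with the long-form answers the length filters select for.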
The Stack Exchange portion was selected with explicit length and quality filters. The authors discarded answers shorter than 1,200 characters or longer than 4,096 characters, removed answers written in the first person ("I," "my") or that referenced other answers, and kept only questions whose titles were self-contained.[^1] Within each subset they sampled 200 questions with the highest aggregate score, using a stratified scheme that balanced STEM communities such as Stack Overflow against humanities and lifestyle sites.[^1]
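A minimal sketch of the answer-level filters described above might look like the following. The length bounds and the first-person markers come from the text; the list of phrases used to detect references to other answers is an assumption made for illustration, as is the function name `keep_answer`.

```python
import re

MIN_CHARS = 1200  # answers shorter than this are discarded
MAX_CHARS = 4096  # answers longer than this are discarded

# First-person markers named in the text ("I", "my"); "I" is matched
# case-sensitively so that abbreviations such as "i.e." are not caught.
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]y)\b")

# ASSUMPTION: illustrative phrases that suggest a reference to another answer.
REFERENCE_PHRASES = ("as mentioned", "other answer", "stack exchange")


def keep_answer(answer: str) -> bool:
    """Return True if the answer passes the length, voice and reference filters."""
    if not (MIN_CHARS <= len(answer) <= MAX_CHARS):
        return False
    if FIRST_PERSON.search(answer):
        return False
    lowered = answer.lower()
    return not any(phrase in lowered for phrase in REFERENCE_PHRASES)


neutral = "Use a stable sort. " * 100  # 1,900 characters, no first person
print(keep_answer(neutral))
print(keep_answer("In my experience, " + neutral))
```

String heuristics like these are deliberately crude; they trade some false positives for a filter that is cheap to audit by hand.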
For wikiHow, the authors first sampled one of 19 top-level categories and then drew an article within it, using the article title as the user prompt and the article body, rewritten lightly to remove site-specific scaffolding, as the assistant response.[^1] The Pushshift r/WritingPrompts subset took the top-scoring prompt-and-reply pairs from the Reddit creative writing community.[^1]
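The category-first sampling described for wikiHow can be sketched as below. The input layout (a dict mapping category names to lists of `(title, body)` tuples) and the function name `sample_category_first` are assumptions for illustration; the light rewriting of article bodies is not modelled.

```python
import random


def sample_category_first(articles_by_category, n, seed=0):
    """Draw n distinct articles by picking a category first, then an article.

    Picking the category first keeps small categories from being swamped
    by large ones, which is the point of the two-stage scheme.
    """
    rng = random.Random(seed)
    remaining = {c: list(a) for c, a in articles_by_category.items() if a}
    if n > sum(len(a) for a in remaining.values()):
        raise ValueError("not enough articles to sample from")
    pairs = []
    while len(pairs) < n:
        category = rng.choice(sorted(remaining))
        articles = remaining[category]
        title, body = articles.pop(rng.randrange(len(articles)))
        if not articles:
            del remaining[category]  # category exhausted
        pairs.append({"prompt": title, "response": body})
    return pairs


toy = {
    "Cooking": [("How to Boil an Egg", "Step 1 ..."), ("How to Bake Bread", "Step 1 ...")],
    "Pets": [("How to Walk a Dog", "Step 1 ...")],
}
sampled = sample_category_first(toy, n=3)
print([pair["prompt"] for pair in sampled])
```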
The remaining 250 examples were manually authored or adapted. The paper's authors wrote the "Group A" set of 200, which covered a broad spectrum of tasks (advice, brainstorming, planning, role play, factual question answering, mild safety probes) and was the primary hand-written training portion; a separate "Group B" set of 230 prompts was held out for evaluation along with 70 prompts sampled from r/AskReddit.[^1] A further 50 examples were drawn and adapted from the Super-Natural Instructions benchmark of academic NLP tasks.[^1] Throughout, the authors emphasised uniform stylistic conventions across responses (helpful tone, clear structure, neutral formality) so that the model would learn a consistent voice rather than an averaged mush of community styles.[^1]
The deliberate compactness of the dataset has two consequences. The first is that every example bears a high marginal weight: a single low-quality response can measurably drag down model behaviour during training, which raised the curation bar far above what large noisy sets demand. The second is that the dataset has a recognisable authorial voice: responses are roughly the length of a thoughtful blog comment, structured with explicit headings or bullet points where appropriate, and avoid hedging boilerplate. The paper argues that this stylistic consistency is itself part of what the model learns and that mixing voices from many different annotators would dilute the signal.[^1] The breakdown is summarised in the paper's Table 1, which the authors treat as a recipe other groups can copy: roughly three-quarters scraped community content and one-quarter author-written exemplars, with explicit length and topical filters in between.[^1]
LIMA fine-tunes LLaMA 65B with standard cross-entropy on the 1,000-example set.[^1] No reward model is trained; there is no PPO phase, no preference dataset, and no Direct Preference Optimization (DPO) loss.[^1][^6] The optimizer is AdamW with beta_1 of 0.9 and beta_2 of 0.95, weight decay of 0.1, and a learning rate that starts at 1e-5 and decays linearly to 1e-6.[^1] The batch size is 32 sequences (64 for the smaller ablation models), the maximum sequence length is 2,048 tokens, and the model trains for 15 epochs, with the authors selecting checkpoints between epochs 5 and 10 by manual inspection rather than by a held-out validation loss.[^1] Residual dropout is applied with a per-layer rate scaled from 0.0 at the bottom of the stack to 0.3 at the top, a regulariser the authors found important given the tiny dataset size.[^1]
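The two schedules in this recipe are easy to state in code. The sketch below assumes that the dropout rate is interpolated linearly between the bottom and top layers, that the final partial batch of each epoch is kept, and that the 65B model has 80 transformer layers; the function names are illustrative and nothing here is taken from the authors' training code.

```python
import math


def linear_lr(step, total_steps, lr_start=1e-5, lr_end=1e-6):
    """Learning rate at a 0-indexed step, decaying linearly with no warmup."""
    if total_steps <= 1:
        return lr_start
    frac = step / (total_steps - 1)
    return lr_start + frac * (lr_end - lr_start)


def residual_dropout_rates(num_layers, p_bottom=0.0, p_top=0.3):
    """Per-layer residual dropout, rising from p_bottom to p_top.

    ASSUMPTION: linear interpolation between the two endpoints.
    """
    if num_layers == 1:
        return [p_bottom]
    span = p_top - p_bottom
    return [p_bottom + span * i / (num_layers - 1) for i in range(num_layers)]


NUM_EXAMPLES, BATCH_SIZE, EPOCHS = 1000, 32, 15
steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)  # keeps the partial batch
total_steps = steps_per_epoch * EPOCHS

rates = residual_dropout_rates(num_layers=80)  # ASSUMPTION: 80 layers
print(total_steps, linear_lr(0, total_steps), linear_lr(total_steps - 1, total_steps))
print(rates[0], rates[-1])
```

Under these assumptions the whole run is only a few hundred optimizer steps, which is why the authors leaned on dropout and manual checkpoint selection rather than validation loss.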
To delimit speaker turns the paper introduces an end-of-turn (EOT) token distinct from the standard end-of-sequence (EOS) token, so that during inference the model can stop after the assistant's reply without terminating the entire dialogue.[^1] The recipe is otherwise unremarkable: a standard SFT objective on one thousand high-signal examples, run for a few hours on a small slice of the cluster that produced the LLaMA base.[^1]
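The role of the end-of-turn marker can be illustrated at the string level. The literal `<EOT>` below is a placeholder, since the source does not give the token's surface form, and both helper functions are hypothetical.

```python
EOT = "<EOT>"  # ASSUMPTION: placeholder string for the end-of-turn token


def format_dialogue(turns):
    """Join alternating user/assistant turns, closing each one with EOT."""
    return "".join(f"{text}{EOT}" for text in turns)


def truncate_at_eot(generated):
    """Keep only the text produced before the first EOT marker."""
    return generated.split(EOT, 1)[0]


training_text = format_dialogue(["How do I boil an egg?", "Place the egg in water..."])
reply = truncate_at_eot("Place the egg in water...<EOT>What about poaching?")
print(training_text)
print(reply)
```

Stopping at EOT rather than EOS means the surrounding dialogue can continue with another user turn appended after the reply.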
The authors evaluated LIMA against five baselines using a test set of 300 prompts drawn from the held-out Group B authors' prompts and r/AskReddit samples.[^1] For each prompt LIMA's reply was compared head-to-head against a reply from one of the baselines: Alpaca 65B (Meta's LLaMA 65B fine-tuned by the LIMA authors on the 52,000-example Alpaca dataset), OpenAI's text-davinci-003, Google's Bard, Anthropic's Claude, and GPT-4.[^1] Two parallel annotator pools rated which reply was better or whether the pair tied: human crowd workers and authors on one track, GPT-4 used as an automatic judge on the other.[^1]
Inter-annotator agreement was 82 percent between crowd workers and 81 percent between crowd workers and the paper authors, with author-author agreement at 78 percent and crowd-GPT-4 agreement at 78 percent.[^1] These rates are comparable to the agreement levels reported in earlier preference-evaluation work and gave the authors confidence that the human-versus-GPT-4 judge tracks were measuring similar signals.[^1]
| Baseline | Human judgments rating LIMA equivalent or better |
|---|---|
| Alpaca 65B | 65% |
| text-davinci-003 | 65% |
| Bard | 58% |
| Claude | 46% |
| GPT-4 | 43% |
The headline number, that a 65B model fine-tuned on 1,000 examples could match or exceed GPT-4 in 43 percent of pairwise judgments, surprised many readers because GPT-4 had been fine-tuned and aligned with full-scale RLHF on orders of magnitude more data.[^1] The result against Claude, very near parity at 46 percent, was similarly attention-getting.[^1] The win rates against text-davinci-003 and Alpaca 65B were the largest, suggesting that careful curation could compensate for substantial differences in instruction data scale.[^1]
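Because annotators could declare a tie, a "match or exceed" figure is the sum of two outcomes. The sketch below shows how such a rate is tallied from pairwise labels; the ten toy labels are invented for illustration and are not data from the paper.

```python
from collections import Counter


def preference_rates(labels):
    """Return win, tie, loss and win-or-tie fractions for one baseline.

    Each label is "win", "tie" or "loss" from the point of view of the
    model under test.
    """
    counts = Counter(labels)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        raise ValueError("no judgments supplied")
    win = counts["win"] / total
    tie = counts["tie"] / total
    return {
        "win": win,
        "tie": tie,
        "loss": counts["loss"] / total,
        "win_or_tie": win + tie,
    }


# Toy labels, NOT results from the paper.
toy_labels = ["win"] * 3 + ["tie"] * 2 + ["loss"] * 5
rates = preference_rates(toy_labels)
print(rates)
```

Reporting the strict win rate and the tie rate separately makes it clear how much of a headline number comes from ties.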
The paper also tested LIMA on a small multi-turn dialogue benchmark of 10 hand-crafted conversations. Without any dialogue training, LIMA's responses were judged "excellent" 45.2 percent of the time and produced 15 outright failures across 42 turns.[^1] After adding 30 multi-turn conversation chains to the training set, 10 written by the authors and 20 drawn from Stack Exchange comment threads, the excellent-response rate rose to 76.1 percent and only one failure occurred across 46 turns.[^1] In direct comparisons the dialogue-augmented model was significantly better than the dialogue-naive model in 7 of 10 conversations and tied in 3.[^1] The finding that 30 additional examples could lift multi-turn coherence so sharply reinforced the paper's general thesis.
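The percentages and turn counts above can be cross-checked with a few lines of arithmetic. The "implied" counts of excellent turns are an inference from the quoted percentages, not numbers stated in the source.

```python
def implied_count(rate_percent, turns):
    """Nearest whole number of turns matching a reported percentage."""
    return round(rate_percent / 100 * turns)


# Before dialogue data: 42 turns, 45.2% excellent, 15 failures.
before_excellent = implied_count(45.2, 42)
before_fail_rate = 15 / 42

# After adding 30 dialogue chains: 46 turns, 76.1% excellent, 1 failure.
after_excellent = implied_count(76.1, 46)
after_fail_rate = 1 / 46

print(before_excellent, round(100 * before_fail_rate, 1))
print(after_excellent, round(100 * after_fail_rate, 1))
```

On this reading, the failure rate falls from roughly 36 percent of turns to roughly 2 percent.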
The most influential portion of the paper is its three-way ablation on the variables that drive instruction-tuning effectiveness.[^1] The authors performed each ablation by training a fresh LLaMA 7B model on different subsets and grading responses on a 6-point scale using ChatGPT (gpt-3.5-turbo) as the judge.
The quantity ablation doubled the training set repeatedly, from 2,000 up to 32,000 examples sampled from Stack Exchange.[^1] Despite the 16-fold scaling, ChatGPT helpfulness scores plateaued and did not improve meaningfully, supporting the prediction that additional examples without new diversity yield no benefit.[^1] The figure became a touchstone in subsequent literature for the claim that "data quantity dominates only up to a point."[^1]
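The doubling schedule can be sketched as follows. Drawing the smaller sets as prefixes of one shuffled pool, so that each set contains the previous one, is an assumption made here to keep the comparison controlled; the source does not say how the subsets were drawn.

```python
import random


def doubling_sizes(start=2_000, stop=32_000):
    """Training-set sizes obtained by repeated doubling from start to stop."""
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size *= 2
    return sizes


def nested_subsets(pool, sizes, seed=0):
    """Return {size: subset}, with each subset a prefix of one shuffled pool.

    ASSUMPTION: nesting the subsets is an illustrative design choice.
    """
    shuffled = list(pool)
    if max(sizes) > len(shuffled):
        raise ValueError("pool is smaller than the largest requested size")
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}


sizes = doubling_sizes()
subsets = nested_subsets(range(40_000), sizes)  # integers stand in for examples
print(sizes, sizes[-1] // sizes[0])
```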
The quality ablation compared filtered Stack Exchange data, which passes the length, first-person, and reference filters described above, against unfiltered data of the same size.[^1] The filtered set produced a roughly 0.5-point gain on the 6-point ChatGPT scale, a margin the authors described as significant given the otherwise identical training conditions.[^1]
The diversity ablation pitted two equally large (2,000-example) datasets against each other: filtered Stack Exchange, which spans a wide range of communities and topics, and a wikiHow set, where prompts are largely "how to do X" requests with similar structure.[^1] Despite both having high-quality responses, the Stack Exchange set produced markedly better scores, isolating prompt diversity as a primary driver of generalisation.[^1] The authors interpreted the trio of results as: scaling up quantity helps only when quality and diversity rise with it.[^1]
LIMA appeared less than three months after the public release of GPT-4 and two months before Meta's Llama 2 launch, a window in which the open-source community was hungry for cheaper alignment recipes. The arXiv paper accumulated thousands of citations within a year and was discussed across the alignment forum, NLP venues, and industry blogs.[^7] Practitioners read it as a practical argument that they could build credible chat assistants from a strong base model and a small, hand-tuned dataset without the operational overhead of RLHF.[^1]
The paper's influence is most visible in the data-curation choices made by later instruction-tuning projects. The UltraChat dataset, released later in 2023, applied diversity- and quality-focused filtering to a much larger synthetic conversation corpus, citing LIMA in its motivation for principled curation over indiscriminate scraping.[^8] The Zephyr 7B Beta model, released in October 2023, distilled UltraChat into a small model using an SFT-then-DPO recipe that explicitly invoked LIMA's data-quality-first framing.[^9] OpenHermes 2.5, released in early 2024, organised its million-example mix around a similar quality-and-diversity story, although it scaled the absolute count up by three orders of magnitude beyond LIMA.[^10]
The Allen Institute's Tülu 3 project, released in November 2024, took the methodological lessons of LIMA further. Its 939,000-example SFT mixture was built through iterative ablations on data sources and skill coverage, blending real-world chat logs from WildChat with synthetic persona-driven prompts targeted at math, coding, and instruction following.[^11] The Tülu 3 technical report explicitly grounds its approach in the LIMA-style observation that prompt diversity and response quality matter more than raw token count, even as the project rejects LIMA's specific claim that 1,000 examples are sufficient for frontier-quality chat behaviour.[^11]
| Dataset | Year | Approximate size | Approach |
|---|---|---|---|
| LIMA | 2023 | 1,000 examples | Hand-curated, no RL |
| Alpaca | 2023 | 52,000 examples | Self-Instruct distillation from text-davinci-003 |
| Vicuna ShareGPT | 2023 | ~70,000 conversations | Scraped real-world chats with ChatGPT |
| UltraChat | 2023 | ~1.5 million dialogues | Systematic synthetic generation |
| OpenHermes 2.5 | 2024 | ~1 million examples | Curated mix of open instruction sets |
| Tülu 3 SFT | 2024 | ~939,000 examples | Iterative skill-targeted mixture |
The pattern across the table is that LIMA's pure "less is more" stance has not won out in absolute terms: state-of-the-art open-weight chat models continue to use hundreds of thousands of examples. What did endure is LIMA's methodological emphasis: aggressive deduplication, length and quality filters, skill-coverage analysis, and explicit diversity of prompt distributions are now standard practice in instruction-tuning pipelines.[^8][^11]
The most influential challenge to LIMA's framing came from researchers exploring the limits of stylistic versus reasoning alignment. Lin et al.'s URIAL paper, posted in December 2023, pushed the Superficial Alignment Hypothesis to an extreme by showing that a base LLaMA model could be "aligned" purely through in-context learning with three stylistic exemplars and a system prompt, without any fine-tuning at all.[^12] URIAL's analysis of token-distribution shifts between base and aligned models found that the bulk of changes occurred on a small set of stylistic tokens (greetings, hedges, structural connectives) while the vast majority of token positions were nearly identical in distribution.[^12] The result corroborated LIMA's core claim that alignment is largely about format selection.[^12]
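The tuning-free setup described above amounts to assembling a prompt for the base model. The sketch below shows the general shape with a system prompt and three exemplars; the `# Query:` and `# Answer:` markers, the example texts, and the function name are all assumptions for illustration and are not URIAL's actual template.

```python
def build_in_context_prompt(system_prompt, exemplars, query):
    """Assemble a system prompt, styled exemplars and the new query.

    ASSUMPTION: the "# Query:" / "# Answer:" markers are illustrative.
    """
    parts = [system_prompt.strip(), ""]
    for user_text, assistant_text in exemplars:
        parts += [f"# Query:\n{user_text}", f"# Answer:\n{assistant_text}", ""]
    # End on an open answer slot so the base model continues from there.
    parts += [f"# Query:\n{query}", "# Answer:\n"]
    return "\n".join(parts)


exemplars = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Suggest a name for a bakery.", "Here are a few ideas: ..."),
    ("How do I pick a lock?", "I can't help with that, but a locksmith can."),
]
prompt = build_in_context_prompt(
    "Below are conversations with a helpful assistant.",
    exemplars,
    "How do I boil an egg?",
)
print(prompt)
```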
A countervailing critique appeared in Raghavendra et al.'s 2024 paper "Revisiting the Superficial Alignment Hypothesis," which argued that LIMA-style evaluations conflate two distinct things: the ability to produce a chatbot-shaped reply and the ability to actually solve a task.[^13] On objective reasoning benchmarks such as GSM8k for grade-school math and SubQA for multi-hop question answering, the authors found that performance continued to scale with fine-tuning data well beyond the 1,000-example regime that saturated preference-based win rates.[^13] They concluded that the Superficial Alignment Hypothesis is, at best, an oversimplification: it holds for stylistic chatbot benchmarks but breaks down on tasks that require reasoning the base model has not yet mastered.[^13] The same paper warned that GPT-4-as-judge preferences can mislead because the judge rewards chatbot-style presentation even when the underlying answer is mathematically wrong.[^13]
A separate methodological critique focused on the specifics of the LIMA-versus-Alpaca comparison. Zhao et al.'s 2024 "Long Is More for Alignment" paper showed that simply selecting the 1,000 Alpaca instructions with the longest responses could match or surpass LIMA on standard preference benchmarks.[^14] A 2025 follow-up published as "Call for Rigor in Reporting Quality of Instruction Tuning Data" argued that LIMA-style claims of dataset superiority depend heavily on validation hyperparameters and that minor recipe changes can flip the conclusion either way.[^15] The cumulative effect of these critiques is not to refute LIMA's headline result but to circumscribe it: 1,000 examples are sufficient for chat-style alignment of a strong base model, the underlying data-quality lesson is robust, but quantitative claims about which curated set is "best" are sensitive to evaluation setup.[^13][^15]
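The length-based baseline is simple enough to state in a few lines. The sketch below measures length in whitespace-separated tokens, which is an assumption; the record layout and the function name `select_longest` are likewise illustrative.

```python
def select_longest(examples, k=1000):
    """Return the k examples with the longest responses.

    ASSUMPTION: length is counted in whitespace-separated tokens.
    """
    return sorted(
        examples,
        key=lambda ex: len(ex["response"].split()),
        reverse=True,
    )[:k]


toy = [
    {"instruction": "a", "response": "one"},
    {"instruction": "b", "response": "one two three"},
    {"instruction": "c", "response": "one two"},
]
longest = select_longest(toy, k=2)
print([ex["instruction"] for ex in longest])
```

That a rule this crude can compete with hand curation is what makes it a useful control when comparing curated datasets.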
A complementary line of work explored whether LIMA-scale fine-tuning generalises to safety. Several papers have argued for a "Superficial Safety Alignment Hypothesis," noting that even small numbers of adversarial fine-tuning examples can rapidly undo months of preference-based safety training, consistent with the idea that the safety layer is itself superficial.[^16] These findings cut both ways for LIMA's thesis: they confirm that alignment is shallow, but they raise the concern that a fine-tuning regime as light as LIMA's may also be easy to subvert.[^16]
The LIMA paper itself is candid about its limitations.[^1] The authors note that constructing 1,000 high-quality examples required substantial human effort, that the mental load of curation is "significant and difficult to scale up," and that LIMA "is not as robust as product-grade models."[^1] Even when LIMA produces good responses on average, an unlucky decoding sample or an adversarial prompt can elicit a weak or undesirable reply, a property the paper attributes to the absence of preference-based training and to the small dataset's incomplete coverage of edge cases.[^1] The paper also notes that LIMA's safety behaviour, although decent on common probes, was not optimised through any explicit red-teaming or safety dataset, leaving it more brittle than RLHF-tuned counterparts.[^1]
The choice of LLaMA 65B as the base also conditions the result. The paper does not claim that LIMA-style alignment would work on a substantially weaker base model: the Superficial Alignment Hypothesis presumes that the model already has the relevant knowledge, and a 7B base may simply lack the latent capability that the 1,000 examples are supposed to surface.[^1] Subsequent attempts to replicate LIMA on smaller bases have shown the recipe transfers but with predictably lower absolute quality.[^11]
A final limitation, raised more sharply by follow-up critiques than by the original paper, is the narrow evaluation surface. LIMA's main metric is pairwise preference between two model replies on open-ended prompts. This is well-suited for measuring chat usefulness but says little about reasoning depth, factual accuracy, or tool-use competence, all of which the field has since identified as more demanding axes of post-training.[^13][^15] In the absence of objective benchmarks, the LIMA win rates may overstate how much capability is genuinely transferred by the small dataset, as opposed to how much chatbot polish is added on top of base capabilities.[^13]
LIMA's lasting significance lies less in its specific model than in the framing it gave to a field that was, in early 2023, drowning in instruction data of inconsistent quality. By isolating one extreme of the design space (1,000 examples, no preferences, no reinforcement learning), the paper made it possible to ask cleanly which post-training ingredients were actually doing the work.[^1] The Superficial Alignment Hypothesis gave researchers and engineers a sharp, falsifiable claim to push against, and the resulting literature on quality, diversity, and reasoning-vs-style decomposition has structured much of the subsequent work on SFT data design.[^11][^13][^15]
The paper's recipe also had immediate practical consequences for the open-source ecosystem. Lab teams without the resources to run RLHF pipelines could now point to a peer-reviewed result showing that careful SFT alone could produce credible chat assistants, lowering the perceived cost of entry to instruction-tuned model production.[^2] Several open releases in 2023 and 2024 explicitly adopted a "SFT-first, RL later, or maybe never" stance traceable to LIMA's influence.[^9][^11]
The methodology has also influenced how datasets are documented. Subsequent instruction-tuning releases routinely report per-source counts, length distributions, deduplication rates, and skill-coverage breakdowns in the LIMA mold, a level of transparency that was rare in 2022-era instruction sets.[^11] LIMA thus serves both as a model artefact and as a template for principled data accounting in the post-training era.
| Approach | Data scale | Method | Year | Comparison to LIMA |
|---|---|---|---|---|
| InstructGPT | ~13k SFT + 33k preferences | SFT + RLHF | 2022 | The recipe LIMA challenged |
| Alpaca | 52k instructions | SFT (Self-Instruct) | 2023 | LIMA outperformed Alpaca 65B in 65% of human comparisons |
| Vicuna | ~70k conversations | SFT on ShareGPT | 2023 | Larger, less curated, similar SFT-only philosophy |
| URIAL | 0 fine-tuning examples | In-context prompting | 2023 | Extreme version of LIMA's hypothesis |
| Zephyr 7B Beta | UltraChat + UltraFeedback | SFT + DPO | 2023 | Combines LIMA-style curation with preference learning |
| Tülu 3 SFT | ~939k examples | Iterative SFT mixture | 2024 | Inherits LIMA's quality emphasis but rejects 1,000-example sufficiency |
The progression in the table reflects how LIMA reshaped the design space without winning every argument. The pure 1,000-example recipe did not become the dominant production approach, but the underlying commitments to careful curation, diversity-aware sampling, and explicit ablations on quality versus quantity have become standard.[^11]