26 Principles of Good Prompts
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 2,476 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 2,476 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Prompt engineering for text generation and Guides
The 26 principles of good prompts are a set of practical rules for writing prompts to large language models, introduced in the December 2023 paper Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 by Sondos Mahmoud Bsharat, Aidar Myrzakhan, and Zhiqiang Shen of the VILA Lab at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi. The paper argues that simple, well-chosen instructions can substantially improve the quality and accuracy of responses from systems like GPT-4, GPT-3.5, and the LLaMA family without any fine tuning. The principles cover formatting tricks, role assignments, audience targeting, monetary incentives, penalties, and reasoning patterns, and they have been widely circulated in prompt engineering tutorials and training material.[1][2]
The paper appeared on arXiv as preprint 2312.16171 on December 26, 2023, with a revised version on January 18, 2024. The authors evaluated each principle on a custom benchmark called ATLAS, which contains roughly 13,000 instruction-response data points and 20 hand-curated test questions per principle. They tested seven models: LLaMA-1 and LLaMA-2 at 7B, 13B, and 70B parameter scales, plus GPT-3.5 and GPT-4. On GPT-4 the authors reported an average improvement of 57.7% in response quality (which they call boosting) and 36.4% in correctness. They also found that larger models benefited more than smaller ones, with relative accuracy gains above 20% on the bigger systems. The work has been cited dozens of times and the ATLAS repository has a substantial GitHub following, although the methodology has drawn criticism.[1][3][4]
The central claim is that you do not need to retrain a model to get better answers; you can rewrite the prompt instead. The 26 principles are presented as concrete patterns a non-specialist user can copy. Applying the principles raised quality by an average of 57.7% on GPT-4 and lifted correctness by 36.4%. Across all tested models, principled prompts produced roughly a 50% average improvement.[1][3]
The ATLAS dataset is split into a general portion and a per-principle portion, with two measurement categories. The boosting metric tracks how much a principle improves the perceived quality of a response on the same question; correctness tracks factual accuracy. For each principle the authors picked 20 questions designed to expose the prompt pattern, then ran each model with and without the principled rewrite and compared outputs. On LLaMA-2-7B and 13B, correctness improvements often sat in the 10% to 40% range. On the 70B model and on GPT-3.5 and GPT-4 the relative gains commonly exceeded 40%, the basis for the claim that scale and principled prompting compound.[1][3]
The authors group the 26 principles into five categories: prompt structure and clarity; specificity and information; user interaction and engagement; content and language style; and complex tasks and coding prompts. The structure category covers formatting markers, delimiters, output primers, and audience targeting. Specificity covers few-shot examples, explicit requirements, anti-bias instructions, and stylistic matching. Interaction covers clarifying questions and teach-and-test formats. Language style covers role assignment, directive phrasing, penalties, tipping, and the no-politeness rule. The complex-task bucket covers decomposition, multi-file code generation, and combining chain-of-thought with few-shot prompts.[1][5]
The table below lists each principle with the exact intent described in the paper. The wording is condensed where the original is long; the numbering matches the paper.[1]
| # | Principle | What it does |
|---|---|---|
| 1 | Skip politeness fillers like "please," "thank you," or "if you don't mind" and get straight to the point. | Concise and direct. |
| 2 | Tell the model who the audience is, for example "the audience is an expert in the field." | Calibrates depth and vocabulary. |
| 3 | Break complex tasks into a sequence of simpler prompts in an interactive conversation. | Reduces error compounding. |
| 4 | Use affirmative directives like "do" and avoid negative phrasing like "don't." | Easier for the model to follow. |
| 5 | Ask for simple explanations, for example "explain to me like I'm 11 years old" or "explain to me as if I'm a beginner." | Forces accessible language. |
| 6 | Add an incentive phrase such as "I'm going to tip $xxx for a better solution!" | Claimed to raise effort, disputed (see below). |
| 7 | Use few-shot prompting with worked examples. | Shows the desired output shape. |
| 8 | Use formatting markers such as ###Instruction###, ###Example###, and ###Question### with line breaks between sections. | Separates context from task. |
| 9 | Include the phrases "Your task is" and "You MUST." | Anchors the obligation. |
| 10 | Include the phrase "You will be penalized." | Counterpart to tipping. |
| 11 | Add "Answer a question given in a natural, human-like manner." | Reduces robotic phrasing. |
| 12 | Use leading phrases such as "think step by step." | Triggers zero-shot chain-of-thought reasoning. |
| 13 | Add "Ensure that your answer is unbiased and does not rely on stereotypes." | Anti-bias guardrail. |
| 14 | Tell the model to ask you questions until it has enough information, for example "from now on, I would like you to ask me questions to..." | Surfaces ambiguity. |
| 15 | Ask the model to teach you a topic and quiz you on it without revealing answers first. | Active recall study aid. |
| 16 | Assign a role to the model, for example "you are a senior tax accountant." | Conditions tone and expertise. |
| 17 | Use delimiters such as triple backticks or XML tags around inputs. | Prevents prompt confusion. |
| 18 | Repeat a specific word or phrase several times within a prompt. | Emphasizes priorities. |
| 19 | Combine chain-of-thought reasoning with few-shot examples. | Reasoning plus demonstration. |
| 20 | End the prompt with the beginning of the desired output (an output primer). | Pulls the model into the right shape. |
| 21 | For long writing requests, instruct "write a detailed essay/text/paragraph for me on [topic] by adding all the information necessary." | Counters terse default outputs. |
| 22 | When asking for edits, instruct the model to improve only grammar and vocabulary, not style or formality. | Preserves voice. |
| 23 | For multi-file coding tasks, ask for a script that creates or modifies the necessary files. | Avoids partial dumps. |
| 24 | When continuing a draft, supply the opening and ask the model to finish in the same flow. | Maintains tone. |
| 25 | State requirements clearly as keywords, rules, hints, or instructions. | Reduces drift. |
| 26 | Ask the model to match the language style of a provided sample. | Stylistic imitation. |
Many of these will look familiar to anyone who has read earlier prompting research. Principle 12 ("think step by step") is the zero-shot chain-of-thought trick popularized by Kojima and colleagues in Large Language Models are Zero-Shot Reasoners (2022). Principles 7 and 19 are recombinations of the few-shot and chain-of-thought patterns from the original GPT-3 paper and from Wei et al.'s 2022 chain-of-thought work. The contribution of Bsharat et al. is not so much inventing each technique as collecting, naming, and benchmarking them in one place.[6]
One of the more interesting empirical results is that the 26 principles do not pay off equally across model sizes. On LLaMA-2-7B, many principles produced modest correctness gains, sometimes under 20%. On LLaMA-2-70B, GPT-3.5, and GPT-4 the same prompt patterns produced larger and more consistent improvements. The authors read this as evidence that bigger models are better at following instruction-level cues; smaller models lack the headroom to act on them. A 7B local model may shrug off elaborate role assignments and penalty language, while a frontier model picks them up reliably. It also suggests principle effectiveness is a moving target: as models get better at following natural language, some of these tricks may stop adding value because the model already infers the request.[1][3]
Several of the 26 are either reframings of older techniques or specific empirical claims that have been tested independently.
Principle 1 (skip politeness) is one of the most quoted lines and also one of the least settled. A 2024 ACL workshop paper by Ziqi Yin and colleagues, Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, found that the relationship between politeness and output quality is language dependent. In English, GPT-3.5 actually performed best with highly polite prompts on the benchmarks tested. In Japanese, less polite phrasing tended to help; in Chinese, more polite phrasing helped. Overly rude prompts hurt performance in every language tested. "Get to the point" is sound; "never say please" is not.[7]
Principle 6 (monetary tipping) went viral on Twitter in late 2023 after Denis Shiryaev posted that it appeared to lengthen GPT-4 responses. Max Woolf ran a follow-up study in February 2024 using a 200-character constraint task plus quality scoring with GPT-4 as an editor across 100 combinations of incentives and threats. He concluded the effect was inconclusive: a few phrases produced small distribution shifts, most p-values were too high to claim statistical significance, and he called it "definitely a lottery." A 2025 arXiv preprint by Lennart Meincke and colleagues, Prompting Science Report 3: I'll pay you or I'll kill you, but will you care?, ran threats and bribes on harder reasoning tasks and found no significant effect, with tipping sometimes slightly reducing quality. The tipping principle has not held up under independent scrutiny.[8][9]
Principle 12 (think step by step) is the zero-shot chain-of-thought prompt from Kojima et al. (2022). On reasoning benchmarks like GSM8K and MultiArith, the original paper reported large accuracy gains on InstructGPT and PaLM by appending "Let's think step by step" before the answer. The technique is widely used and well replicated.[6]
Principle 16 (role assignment) was a staple of community guides like the Awesome ChatGPT Prompts repository before the paper appeared. Independent evaluations have produced mixed results: role assignment helps with stylistic conditioning but does not consistently improve factual accuracy. Treat it as a tone control, not a capability boost.
Principle 8 (structured delimiters) is similar to OpenAI's own recommendation to separate instructions from data using clear markers like triple quotes or XML tags. Anthropic's documentation for Claude recommends XML tags such as and for the same reason: the model parses the prompt without confusing its parts.
The paper has been criticized on several methodological grounds, most prominently in GitHub issue #3 on the ATLAS repository.[10]
The authors acknowledge in the paper's limitations section that very complex or specialized questions may not benefit from the principles, and that other model families like WizardLM or Orca were not tested.[1] None of this means the principles are wrong. Several of them (chain-of-thought, few-shot prompting, delimited structured prompts) have strong independent evidence behind them. But the headline numbers should not be read as gospel.
The 26 principles work best as a checklist rather than a recipe. Most prompts need only three or four. A reasonable approach:
Skip or use with caution: principle 6 (tipping) has not survived replication; principle 1 (no politeness) is too strong as written; principle 10 ("you will be penalized") may cause models to refuse or moralize. Principle 18 (repetition) can degrade quality on capable models that interpret repetition as confusion.
As prompt engineering research the paper is uneven. As a checklist for new users of ChatGPT or Claude it is genuinely useful, because it surfaces a lot of practical tricks at once. Many corporate prompt engineering courses, Medium posts, and LinkedIn write-ups in 2024 used the list as a starting curriculum, and the ATLAS GitHub repository continues to be referenced as a benchmark. Subsequent work has refined or contradicted individual claims, but the framing (that low-cost prompt edits produce measurable behavioral changes) has held up.[2][5][10]
The most durable contribution is the consolidation itself. Before the paper, advice like "use few-shot examples," "assign a role," "add chain-of-thought," and "use delimiters" lived in scattered blog posts, OpenAI documentation, and Twitter threads. Putting them in one numbered table with an associated benchmark gave practitioners a single artifact to reference and argue with. The arguments have produced better follow-up work on politeness, tipping, repetition, and role assignment.
[1] Bsharat, S. M., Myrzakhan, A., and Shen, Z. (2023). Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4. arXiv:2312.16171. https://arxiv.org/abs/2312.16171
[2] MarkTechPost. (2024, January 4). This Paper from MBZUAI Introduces 26 Guiding Principles Designed to Streamline the Process of Querying and Prompting Large Language Models. https://www.marktechpost.com/2024/01/04/this-paper-from-mbzuai-introduces-26-guiding-principles-designed-to-streamline-the-process-of-querying-and-prompting-large-language-models/
[3] VILA-Lab. (2024). ATLAS: A principled instruction benchmark. GitHub. https://github.com/VILA-Lab/ATLAS
[4] Hugging Face Papers. (2023). Paper page: Principled Instructions Are All You Need. https://huggingface.co/papers/2312.16171
[5] Codingscape. (2024). 26 principles for prompt engineering to increase LLM accuracy 57%. https://codingscape.com/blog/26-principles-for-prompt-engineering-to-increase-llm-accuracy
[6] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916. https://arxiv.org/abs/2205.11916
[7] Yin, Z. et al. (2024). Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance. Proceedings of SICon 2024. https://aclanthology.org/2024.sicon-1.2/
[8] Woolf, M. (2024). Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis. https://minimaxir.com/2024/02/chatgpt-tips-analysis/
[9] Meincke, L. et al. (2025). Prompting Science Report 3: I'll pay you or I'll kill you, but will you care? arXiv:2508.00614. https://arxiv.org/abs/2508.00614
[10] VILA-Lab/ATLAS Issue #3. (2024). Numerous egregious issues with this paper. https://github.com/VILA-Lab/ATLAS/issues/3