26 Principles of Good Prompts

The 26 principles of good prompts are a set of practical rules for writing prompts to large language models, introduced in the December 2023 paper Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 by Sondos Mahmoud Bsharat, Aidar Myrzakhan, and Zhiqiang Shen of the VILA Lab at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi. The paper argues that simple, well-chosen instructions can substantially improve the quality and accuracy of responses from systems like GPT-4, GPT-3.5, and the LLaMA family without any fine tuning. The principles cover formatting tricks, role assignments, audience targeting, monetary incentives, penalties, and reasoning patterns, and they have been widely circulated in prompt engineering tutorials and training material.^[1]^[2]

The paper appeared on arXiv as preprint 2312.16171 on December 26, 2023, with a revised version on January 18, 2024. The authors evaluated each principle on a custom benchmark called ATLAS, which contains roughly 13,000 instruction-response data points and 20 hand-curated test questions per principle. They tested seven models: LLaMA-1 and LLaMA-2 at 7B, 13B, and 70B parameter scales, plus GPT-3.5 and GPT-4. On GPT-4 the authors reported an average improvement of 57.7% in response quality (which they call boosting) and 36.4% in correctness. They also found that larger models benefited more than smaller ones, with relative accuracy gains above 20% on the bigger systems. The work has been cited dozens of times and the ATLAS repository has a substantial GitHub following, although the methodology has drawn criticism.^[1]^[3]^[4]

the paper and its benchmark

The central claim is that you do not need to retrain a model to get better answers; you can rewrite the prompt instead. The 26 principles are presented as concrete patterns a non-specialist user can copy. Applying the principles raised quality by an average of 57.7% on GPT-4 and lifted correctness by 36.4%. Across all tested models, principled prompts produced roughly a 50% average improvement.^[1]^[3]

The ATLAS dataset is split into a general portion and a per-principle portion, with two measurement categories. The boosting metric tracks how much a principle improves the perceived quality of a response on the same question; correctness tracks factual accuracy. For each principle the authors picked 20 questions designed to expose the prompt pattern, then ran each model with and without the principled rewrite and compared outputs. On LLaMA-2-7B and 13B, correctness improvements often sat in the 10% to 40% range. On the 70B model and on GPT-3.5 and GPT-4 the relative gains commonly exceeded 40%, the basis for the claim that scale and principled prompting compound.^[1]^[3]

The authors group the 26 principles into five categories: prompt structure and clarity; specificity and information; user interaction and engagement; content and language style; and complex tasks and coding prompts. The structure category covers formatting markers, delimiters, output primers, and audience targeting. Specificity covers few-shot examples, explicit requirements, anti-bias instructions, and stylistic matching. Interaction covers clarifying questions and teach-and-test formats. Language style covers role assignment, directive phrasing, penalties, tipping, and the no-politeness rule. The complex-task bucket covers decomposition, multi-file code generation, and combining chain-of-thought with few-shot prompts.^[1]^[5]

the 26 principles in full

The table below lists each principle with the exact intent described in the paper. The wording is condensed where the original is long; the numbering matches the paper.^[1]

#	Principle	What it does
1	Skip politeness fillers like "please," "thank you," or "if you don't mind" and get straight to the point.	Concise and direct.
2	Tell the model who the audience is, for example "the audience is an expert in the field."	Calibrates depth and vocabulary.
3	Break complex tasks into a sequence of simpler prompts in an interactive conversation.	Reduces error compounding.
4	Use affirmative directives like "do" and avoid negative phrasing like "don't."	Easier for the model to follow.
5	Ask for simple explanations, for example "explain to me like I'm 11 years old" or "explain to me as if I'm a beginner."	Forces accessible language.
6	Add an incentive phrase such as "I'm going to tip $xxx for a better solution!"	Claimed to raise effort, disputed (see below).
7	Use few-shot prompting with worked examples.	Shows the desired output shape.
8	Use formatting markers such as ###Instruction###, ###Example###, and ###Question### with line breaks between sections.	Separates context from task.
9	Include the phrases "Your task is" and "You MUST."	Anchors the obligation.
10	Include the phrase "You will be penalized."	Counterpart to tipping.
11	Add "Answer a question given in a natural, human-like manner."	Reduces robotic phrasing.
12	Use leading phrases such as "think step by step."	Triggers zero-shot chain-of-thought reasoning.
13	Add "Ensure that your answer is unbiased and does not rely on stereotypes."	Anti-bias guardrail.
14	Tell the model to ask you questions until it has enough information, for example "from now on, I would like you to ask me questions to..."	Surfaces ambiguity.
15	Ask the model to teach you a topic and quiz you on it without revealing answers first.	Active recall study aid.
16	Assign a role to the model, for example "you are a senior tax accountant."	Conditions tone and expertise.
17	Use delimiters such as triple backticks or XML tags around inputs.	Prevents prompt confusion.
18	Repeat a specific word or phrase several times within a prompt.	Emphasizes priorities.
19	Combine chain-of-thought reasoning with few-shot examples.	Reasoning plus demonstration.
20	End the prompt with the beginning of the desired output (an output primer).	Pulls the model into the right shape.
21	For long writing requests, instruct "write a detailed essay/text/paragraph for me on [topic] by adding all the information necessary."	Counters terse default outputs.
22	When asking for edits, instruct the model to improve only grammar and vocabulary, not style or formality.	Preserves voice.
23	For multi-file coding tasks, ask for a script that creates or modifies the necessary files.	Avoids partial dumps.
24	When continuing a draft, supply the opening and ask the model to finish in the same flow.	Maintains tone.
25	State requirements clearly as keywords, rules, hints, or instructions.	Reduces drift.
26	Ask the model to match the language style of a provided sample.	Stylistic imitation.

Many of these will look familiar to anyone who has read earlier prompting research. Principle 12 ("think step by step") is the zero-shot chain-of-thought trick popularized by Kojima and colleagues in Large Language Models are Zero-Shot Reasoners (2022). Principles 7 and 19 are recombinations of the few-shot and chain-of-thought patterns from the original GPT-3 paper and from Wei et al.'s 2022 chain-of-thought work. The contribution of Bsharat et al. is not so much inventing each technique as collecting, naming, and benchmarking them in one place.^[6]

why scale matters

One of the more interesting empirical results is that the 26 principles do not pay off equally across model sizes. On LLaMA-2-7B, many principles produced modest correctness gains, sometimes under 20%. On LLaMA-2-70B, GPT-3.5, and GPT-4 the same prompt patterns produced larger and more consistent improvements. The authors read this as evidence that bigger models are better at following instruction-level cues; smaller models lack the headroom to act on them. A 7B local model may shrug off elaborate role assignments and penalty language, while a frontier model picks them up reliably. It also suggests principle effectiveness is a moving target: as models get better at following natural language, some of these tricks may stop adding value because the model already infers the request.^[1]^[3]

context for the most discussed principles

Several of the 26 are either reframings of older techniques or specific empirical claims that have been tested independently.

Principle 1 (skip politeness) is one of the most quoted lines and also one of the least settled. A 2024 ACL workshop paper by Ziqi Yin and colleagues, Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, found that the relationship between politeness and output quality is language dependent. In English, GPT-3.5 actually performed best with highly polite prompts on the benchmarks tested. In Japanese, less polite phrasing tended to help; in Chinese, more polite phrasing helped. Overly rude prompts hurt performance in every language tested. "Get to the point" is sound; "never say please" is not.^[7]

Principle 6 (monetary tipping) went viral on Twitter in late 2023 after Denis Shiryaev posted that it appeared to lengthen GPT-4 responses. Max Woolf ran a follow-up study in February 2024 using a 200-character constraint task plus quality scoring with GPT-4 as an editor across 100 combinations of incentives and threats. He concluded the effect was inconclusive: a few phrases produced small distribution shifts, most p-values were too high to claim statistical significance, and he called it "definitely a lottery." A 2025 arXiv preprint by Lennart Meincke and colleagues, Prompting Science Report 3: I'll pay you or I'll kill you, but will you care?, ran threats and bribes on harder reasoning tasks and found no significant effect, with tipping sometimes slightly reducing quality. The tipping principle has not held up under independent scrutiny.^[8]^[9]

Principle 12 (think step by step) is the zero-shot chain-of-thought prompt from Kojima et al. (2022). On reasoning benchmarks like GSM8K and MultiArith, the original paper reported large accuracy gains on InstructGPT and PaLM by appending "Let's think step by step" before the answer. The technique is widely used and well replicated.^[6]

Principle 16 (role assignment) was a staple of community guides like the Awesome ChatGPT Prompts repository before the paper appeared. Independent evaluations have produced mixed results: role assignment helps with stylistic conditioning but does not consistently improve factual accuracy. Treat it as a tone control, not a capability boost.

Principle 8 (structured delimiters) is similar to OpenAI's own recommendation to separate instructions from data using clear markers like triple quotes or XML tags. Anthropic's documentation for Claude recommends XML tags such as and for the same reason: the model parses the prompt without confusing its parts.

methodological criticism

The paper has been criticized on several methodological grounds, most prominently in GitHub issue #3 on the ATLAS repository.^[10]

Unclear evaluation criteria. The paper reports boosting (quality) and correctness numbers but does not fully describe the rubric or the evaluator selection. It is unclear whether evaluation was blind or what counted as an "improved" response.
Asymmetric prompting protocol. GPT-3.5 and GPT-4 were prompted ten times per question, while the open-source models were prompted once. The selection rule used to pick a representative GPT response was not described.
Missing baselines. The principle 1 example shows polite phrasing in both conditions, which raises the question of what the baseline actually is.
Missing data. Correctness numbers were not reported for principles 14, 15, and 21 through 23.
Limited domain coverage. With only 20 questions per principle, the per-principle effect estimates have wide confidence intervals.

The authors acknowledge in the paper's limitations section that very complex or specialized questions may not benefit from the principles, and that other model families like WizardLM or Orca were not tested.^[1] None of this means the principles are wrong. Several of them (chain-of-thought, few-shot prompting, delimited structured prompts) have strong independent evidence behind them. But the headline numbers should not be read as gospel.

how to use the list

The 26 principles work best as a checklist rather than a recipe. Most prompts need only three or four. A reasonable approach:

State the task plainly (principles 4, 25).
Add audience and role context when depth matters (principles 2, 16).
Use delimiters and a structured header for anything beyond a one-line query (principles 8, 17).
For reasoning problems, append "think step by step" or supply a worked example (principles 12, 7).
For long outputs, set the shape with an output primer or explicit length request (principles 20, 21).

Skip or use with caution: principle 6 (tipping) has not survived replication; principle 1 (no politeness) is too strong as written; principle 10 ("you will be penalized") may cause models to refuse or moralize. Principle 18 (repetition) can degrade quality on capable models that interpret repetition as confusion.

legacy and use today

As prompt engineering research the paper is uneven. As a checklist for new users of ChatGPT or Claude it is genuinely useful, because it surfaces a lot of practical tricks at once. Many corporate prompt engineering courses, Medium posts, and LinkedIn write-ups in 2024 used the list as a starting curriculum, and the ATLAS GitHub repository continues to be referenced as a benchmark. Subsequent work has refined or contradicted individual claims, but the framing (that low-cost prompt edits produce measurable behavioral changes) has held up.^[2]^[5]^[10]

The most durable contribution is the consolidation itself. Before the paper, advice like "use few-shot examples," "assign a role," "add chain-of-thought," and "use delimiters" lived in scattered blog posts, OpenAI documentation, and Twitter threads. Putting them in one numbered table with an associated benchmark gave practitioners a single artifact to reference and argue with. The arguments have produced better follow-up work on politeness, tipping, repetition, and role assignment.

references

Bsharat, S. M., Myrzakhan, A., and Shen, Z. (2023). *Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4.* arXiv:2312.16171. https://arxiv.org/abs/2312.16171
MarkTechPost. (2024, January 4). *This Paper from MBZUAI Introduces 26 Guiding Principles Designed to Streamline the Process of Querying and Prompting Large Language Models.* https://www.marktechpost.com/2024/01/04/this-paper-from-mbzuai-introduces-26-guiding-principles-designed-to-streamline-the-process-of-querying-and-prompting-large-language-models/
VILA-Lab. (2024). *ATLAS: A principled instruction benchmark.* GitHub. https://github.com/VILA-Lab/ATLAS
Hugging Face Papers. (2023). *Paper page: Principled Instructions Are All You Need.* https://huggingface.co/papers/2312.16171
Codingscape. (2024). *26 principles for prompt engineering to increase LLM accuracy 57%.* https://codingscape.com/blog/26-principles-for-prompt-engineering-to-increase-llm-accuracy
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). *Large Language Models are Zero-Shot Reasoners.* arXiv:2205.11916. https://arxiv.org/abs/2205.11916
Yin, Z. et al. (2024). *Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance.* Proceedings of SICon 2024. https://aclanthology.org/2024.sicon-1.2/
Woolf, M. (2024). *Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis.* https://minimaxir.com/2024/02/chatgpt-tips-analysis/
Meincke, L. et al. (2025). *Prompting Science Report 3: I'll pay you or I'll kill you, but will you care?* arXiv:2508.00614. https://arxiv.org/abs/2508.00614
VILA-Lab/ATLAS Issue #3. (2024). *Numerous egregious issues with this paper.* https://github.com/VILA-Lab/ATLAS/issues/3

26 Principles of Good Prompts

the paper and its benchmark

the 26 principles in full

why scale matters

context for the most discussed principles

methodological criticism

how to use the list

legacy and use today

references

Improve this article

What links here

the paper and its benchmark

the 26 principles in full

why scale matters

context for the most discussed principles

methodological criticism

how to use the list

legacy and use today

references

What links here

the paper and its benchmark

the 26 principles in full

why scale matters

context for the most discussed principles

methodological criticism

how to use the list

legacy and use today

references

Improve this article

Related Articles

Prompt

Prompt engineering for image generation

Prompt engineering for text generation

Agentic Context Engineering

CustomGPT Instructions for Knowledge (Uploaded Files)

Fine-tune ChatGPT with Perplexity, Burstiness, Professionalism, Randomness and Sentimentality Guide

What links here

the paper and its benchmark

the 26 principles in full

why scale matters

context for the most discussed principles

methodological criticism

how to use the list

legacy and use today

references

Related Articles

Prompt

Prompt engineering for image generation

Prompt engineering for text generation

Agentic Context Engineering

CustomGPT Instructions for Knowledge (Uploaded Files)

Fine-tune ChatGPT with Perplexity, Burstiness, Professionalism, Randomness and Sentimentality Guide

What links here