| A Survey of Techniques for Maximizing LLM Performance (OpenAI Dev Day 2023) | |
|---|
Load video YouTube YouTube might collect personal data. Privacy Policy **ContinueDismiss | |
| Information | |
| Name | A Survey of Techniques for Maximizing LLM Performance |
| Type | Technical |
| Event | OpenAI DevDay 2023 |
| Organization | OpenAI |
| Channel | OpenAI |
| Presenter | John Allard, Colin Jarvis |
| Description | A breakout session surveying prompt engineering, retrieval augmented generation, and fine-tuning as the three primary techniques for maximizing the performance of large language models in production. |
| Date | November 6, 2023 |
| Location | San Francisco, California |
| Website | https://www.youtube.com/watch?v=ahnGLM-RC1Y |
A Survey of Techniques for Maximizing LLM Performance is a presentation delivered by John Allard and Colin Jarvis at OpenAI's first developer conference, OpenAI DevDay 2023, held on November 6, 2023 in San Francisco. The talk was a breakout session aimed at developers and machine learning engineers who needed to push an LLM past its default behavior on a specific task. It became one of the most widely watched DevDay 2023 sessions; the recording, posted by OpenAI on YouTube on November 14, 2023, has been cited in dozens of follow up blog posts, internal training decks, and the company's own production guidance on optimizing model accuracy.[1][2][3]
TLDR
Optimizing an LLM for a production task is an iterative loop, not a single technique. Allard and Jarvis argue that teams should always start with prompt engineering to set a baseline, then layer on retrieval augmented generation (RAG) when the model lacks the right knowledge, and turn to fine-tuning when the model has the knowledge but is not behaving the way the task requires. The three approaches are additive, not exclusive, and the right combination depends on whether the underlying failure is a context problem or a behavior problem.[1][2]
Speakers
John Allard was an engineering lead on OpenAI's fine-tuning product team at the time of the talk. He had joined OpenAI from Snorkel AI and worked on the Applied AI organization's Developer Platform team, which owns the public api.openai.com surface. Allard later led the GPT-3.5 Turbo fine-tuning launch in August 2023 and the GPT-4o fine-tuning rollout in 2024.[4][5]
Colin Jarvis led OpenAI's Solutions practice in Europe, the Middle East, and Africa. His team worked directly with enterprise customers to push prototypes into production. In January 2025 he moved on to lead OpenAI's Forward Deployed Engineering function, focused on getting customers to production at scale.[6]
Why the talk matters
In late 2023 most developers were still treating GPT-4 and GPT-3.5 as drop-in oracles. When the model got something wrong, the most common response was to keep editing the system prompt until it stopped failing on the test cases the developer happened to remember. Allard and Jarvis pushed back on that habit. They framed LLM optimization as a measurement problem first and a tooling problem second: you cannot pick the right intervention until you know whether the model is missing knowledge or producing the wrong shape of output. That framing is now repeated almost verbatim in OpenAI's official guide to optimizing LLM accuracy.[2][7]
The two axis optimization framework
The core conceptual contribution of the talk is a two by two grid. The horizontal axis is context optimization and asks whether the model has access to the knowledge it needs. The vertical axis is LLM optimization and asks whether the model is producing the right behavior, format, tone, or reasoning pattern.[1][2]
| Failure mode | What the model is missing | Right intervention |
|---|
| Wrong knowledge, right behavior | Domain facts, proprietary documents, fresh information | Retrieval augmented generation |
| Right knowledge, wrong behavior | A consistent output format, tone, or instruction-following pattern | Fine-tuning |
| Wrong knowledge, wrong behavior | Both of the above | Combine RAG with fine-tuning |
| Right knowledge, right behavior | Probably just a sharper prompt | More careful prompt engineering |
Allard and Jarvis stressed that this is a diagnostic grid, not a sequence. A team should figure out which quadrant their failures sit in before they reach for a tool. They also warned that the four quadrants are not mutually exclusive in production: most real applications need movement on both axes at once.[1][2]
Prompt engineering
The talk treats prompt engineering as the place every project should begin. It is the fastest way to set a baseline and the cheapest way to iterate. Allard and Jarvis recommended four tactics on top of writing a clear instruction:
- Write clear, unambiguous instructions and put the most important constraints at the top of the prompt.
- Break complex tasks into smaller subtasks the model can handle one at a time.
- Give the model time to think, for example by asking for a plan before the final answer (a form of chain of thought prompting).
- Test changes systematically against an evaluation set, rather than eyeballing single examples.
They were also honest about the ceiling. Prompt engineering is bad at introducing new information the model never saw, bad at reliably reproducing complex styles or methods, and bad at minimizing token usage at scale. Once you hit those walls, the next move is to ask which axis is failing.[1][2]
Few shot prompting and evaluation
A recurring theme throughout the talk is that evaluation has to come before optimization. Jarvis put it bluntly: if you cannot measure it, you cannot improve it. The recommended order of operations is to write a prompt, build a small evaluation set, run the prompt against that set, and only then decide whether to add few shot examples, RAG, or fine-tuning.
Few shot prompting (also called in-context learning) is the natural extension of prompt engineering. Adding two to five carefully chosen examples to the prompt can move a model significantly on a narrow task without any retraining. The downside is that few shot examples consume the context window, so they trade tokens for performance.[1][2]
Retrieval augmented generation (RAG)
RAG extends an LLM by giving it access to an external knowledge store at query time. The pipeline is straightforward: the user query is turned into an embedding, the system retrieves the most relevant documents from a vector store, and those documents are pasted into the prompt alongside the question. The model then answers using both its parametric knowledge and the retrieved context.
The talk highlights two strengths of RAG. It introduces information the model could not have seen during training, including private corporate data, and it reduces hallucinations by anchoring answers in retrieved evidence. It is the right move when the failure is a context problem.[1][2]
Allard and Jarvis also walked through the failure modes. RAG cannot teach the model a new output format or a new behavior. It cannot give the model a deep understanding of a broad domain; the model still only sees whatever chunks the retriever happens to pull. And it adds latency and complexity to the system, because every query now depends on a working retrieval layer.
Evaluating a RAG system
One of the most cited slides from the talk separates RAG evaluation into two questions. First, did the retriever fetch the right context? Second, did the model use that context faithfully? The first is measured with classic information retrieval metrics, precision and recall over the retrieved chunks. The second is measured with metrics like faithfulness, answer relevance, context relevance, and context recall, which are commonly packaged in the open source Ragas library.[1][2][8]
This is the talk that publicly put Ragas on the map for many developers. Jarvis recommended it directly from the OpenAI stage, and Ragas adoption climbed sharply in the following months.[8]
Fine-tuning
Fine-tuning is the second axis of optimization. Where RAG changes what the model knows at inference time, fine-tuning changes what the model is. A base model is trained for a few extra epochs on a curated dataset of input output examples, and the resulting weights are tuned toward a specific style, format, or task.
Allard, speaking from his role as the fine-tuning product lead, listed three places fine-tuning genuinely helps:
- Emphasizing existing knowledge. When a model already has the right information in its parameters but does not surface it consistently, fine-tuning can sharpen the response.
- Modifying output structure or tone. Fine-tuning is the best tool for teaching a model to return strict JSON, follow a brand voice, or stick to a specific reasoning pattern.
- Teaching complex instructions. When an instruction is too long or too nuanced to fit in every prompt, baking it into the weights with examples is more reliable and cheaper at scale.
The other half of the slide was a warning. Fine-tuning is bad at teaching the model new factual knowledge; in many cases attempts to do so make hallucinations worse. It is bad at quick iteration on early stage problems, because every change requires a new training run. And it requires at least hundreds of high quality examples to make a difference. Allard's slide on this point, framed as a cautionary tale, has been widely shared in the year since the talk.[1][2][9]
Combined RAG plus fine-tuning
The most powerful production setups, Allard and Jarvis argued, fine-tune the model on the structural and stylistic patterns of the task while leaving fresh factual knowledge to RAG. A customer support bot, for example, might be fine-tuned to always answer in a fixed tone and format, then served retrieved policy documents at inference time. The fine-tune handles behavior; the retriever handles context.[1][2]
Spider 1.0 case study
The most concrete demonstration in the talk is the team's own attempt to push GPT-4 to the top of the Spider 1.0 text-to-SQL benchmark. Spider 1.0, released by Yale researchers in 2018, asks a model to translate natural language questions into SQL queries against 200 databases spanning 138 domains. It was a popular benchmark for measuring how well an LLM could reason about schemas it had never seen.
The OpenAI team's journey looked like this:
- Start with a baseline GPT-4 prompt and measure accuracy.
- Add few shot examples and structured schema descriptions, which lifted accuracy noticeably.
- Layer on RAG. Naive similarity search retrieved schema chunks for each question; later iterations used hypothetical document embeddings (HyDE) and a reranking step to improve the quality of retrieved context.
- Add a classification step that routed each question to the right table subset before retrieval.
- Finally, fine-tune GPT-4 on a curated dataset built in partnership with Scale AI, pushing the system close to state of the art on the benchmark.
The story Allard and Jarvis told around the numbers was the most important part. The path was not linear. Some optimizations regressed performance on subsets of the benchmark even while improving overall accuracy. The team had to iterate, evaluate, and sometimes back out of changes. That messy reality, they argued, is what production LLM work actually looks like.[1][2]
Icelandic correction case study
A second case study, later turned into a recurring example in OpenAI's official optimization guide, focused on a Government of Iceland project to correct grammatical errors in Icelandic text. The team measured performance with BLEU score against a held out set. The progression looked roughly like this:[7]
| Approach | BLEU |
|---|
| GPT-4 zero shot | 62 |
| GPT-4 with three few shot examples | 70 |
| GPT-3.5 Turbo fine-tuned on 1,000 examples | 78 |
| GPT-4 fine-tuned on 1,000 examples | 87 |
The Icelandic example was useful for two reasons. It showed that fine-tuned GPT-3.5 Turbo could comfortably outperform few shot GPT-4 on a narrow stylistic task, which mattered for cost sensitive deployments. It also showed a counterintuitive failure mode: adding RAG to the Icelandic pipeline actually decreased BLEU score, because retrieval introduced noise on a task that did not need outside knowledge. That observation is the most quoted line from the talk: use the right optimization tool for the right job.[7]
Key takeaways
Allard and Jarvis closed with a short set of practical recommendations that has been echoed in nearly every summary of the talk:[1][2][3]
- LLM optimization is hard because signal and noise look similar. Build an evaluation harness before anything else.
- Always start with prompt engineering. Revisit it after every other change.
- Use RAG when the failure is a context problem. Use fine-tuning when the failure is a behavior problem.
- RAG and fine-tuning are additive. The best production systems usually combine both.
- Fine-tuning needs hundreds of clean examples. Without that data, do not bother.
- The optimization journey is not linear. Expect to iterate, and expect some regressions on the way to a working system.
Reception and influence
The talk was singled out by Humanloop, Klu, and the OpenAI developer community as one of the most actionable sessions of DevDay 2023.[3][10] OpenAI later turned the framework into a permanent guide in its public API documentation, the "Optimizing LLM Accuracy" page, which uses the same two axis grid and Icelandic case study.[7] Ragas adoption climbed after Jarvis recommended it from stage, and the term "context optimization" entered the working vocabulary of LLM application developers.
In the longer arc, the talk also marked a turning point in how OpenAI positioned fine-tuning. Through most of 2023 the company had been quiet about fine-tuning's role; this session, alongside the GPT-3.5 Turbo fine-tuning launch a few months earlier and the custom models program announced at DevDay, signalled that OpenAI saw model customization as a first class part of the developer platform.[4][9]
See also
References
- OpenAI. "A Survey of Techniques for Maximizing LLM Performance." YouTube, posted November 14, 2023. https://www.youtube.com/watch?v=ahnGLM-RC1Y
- Humanloop. "How to Maximize LLM Performance (Lessons from OpenAI DevDay)." https://humanloop.com/blog/optimizing-llms
- OpenAI Developer Community. "OpenAI Dev-Day 2023: Breakout Sessions!" https://community.openai.com/t/openai-dev-day-2023-breakout-sessions/505213
- OpenAI. "GPT-3.5 Turbo fine-tuning and API updates." August 22, 2023. https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/
- John Allard, LinkedIn profile. https://www.linkedin.com/in/jhallard/
- Colin Jarvis, X post on joining Forward Deployed Engineering at OpenAI, January 2025. https://x.com/colintjarvis/status/1879532522956329135
- OpenAI. "Optimizing LLM Accuracy." OpenAI API documentation. https://platform.openai.com/docs/guides/optimizing-llm-accuracy
- Ragas. "Ragas: Evaluation framework for RAG pipelines." https://www.ragas.io/
- OpenAI. "Introducing improvements to the fine-tuning API and expanding our custom models program." April 2024. https://openai.com/index/introducing-improvements-to-the-fine-tuning-api-and-expanding-our-custom-models-program/
- Klu. "Optimizing LLM Apps." https://klu.ai/blog/optimizing-llm-app-features