| The New Stack and Ops for AI (OpenAI Dev Day 2023) | |
|---|
| Information | |
| Name | The New Stack and Ops for AI |
| Type | Technical breakout session |
| Event | OpenAI DevDay 2023 |
| Organization | OpenAI |
| Channel | OpenAI (YouTube) |
| Presenters | Sherwin Wu, Shyamal Hitesh Anadkat |
| Description | A framework for navigating the unique considerations of scaling non-deterministic apps from prototype to production. |
| Conference date | November 6, 2023 |
| Video published | November 13, 2023 |
| Location | San Francisco, California |
| Video link | YouTube |
The New Stack and Ops for AI is a technical breakout session presented by Sherwin Wu and Shyamal Hitesh Anadkat at OpenAI DevDay 2023, held on November 6, 2023, in San Francisco. The talk introduced a framework for taking applications built on top of large language models from prototype to production, with a focus on what the speakers called LLM Ops: the operational discipline needed to run AI applications reliably at scale.
OpenAI uploaded the recorded session to its official YouTube channel on November 13, 2023. It quickly became one of the most cited talks from the conference, in part because it pulled together advice that had been scattered across the OpenAI Cookbook, blog posts, and the company's internal experience working with hundreds of enterprise customers.
Context and the DevDay 2023 backdrop
DevDay 2023 was OpenAI's first developer conference. The morning keynote, delivered by CEO Sam Altman with a surprise on-stage appearance by Microsoft CEO Satya Nadella, introduced a long list of product changes that reshaped how developers built on the platform. The headline launches included GPT-4 Turbo with a 128,000 token context window and knowledge through April 2023, JSON mode for guaranteed valid JSON outputs, a new seed parameter for reproducible outputs, the Assistants API with built-in code interpreter, retrieval, and function calling, an updated Whisper v3 model, a refreshed text to speech API, and the GPT Store with revenue sharing for builders of custom GPTs. Altman also said that more than 2 million developers were active on the API and that 92 percent of Fortune 500 companies were using OpenAI products.
The afternoon was given over to breakout sessions for in person attendees. "The New Stack and Ops for AI" sat alongside other technical talks including "A Survey of Techniques for Maximizing LLM Performance" by John Allard and Colin Jarvis, "New Products: A Deep Dive," "Research x Product," and "The Business of AI." The breakouts were aimed at developers who had already shipped something with the API and were trying to figure out what the next stage of maturity looked like.
Where the keynote was about new toys, this session was about what to do once the novelty wears off. Sherwin Wu, an early engineer at OpenAI who would later lead the API engineering team, and Shyamal Hitesh Anadkat, a member of the Applied AI team who worked directly with companies and startups bringing AI products to market, walked through the failure modes they kept seeing across the customer base.
The core thesis: non-determinism changes everything
The talk's opening framing was simple. Classical software is deterministic. The same input produces the same output. Test suites work because of this. Caches work because of this. Debugging works because of this. LLMs are different. The same prompt can produce different outputs across calls, and a tiny change in wording can produce dramatically different behavior. That single property cascades through the stack and forces a rethink of how the application is built, tested, monitored, and scaled.
The speakers walked through what changes when you build with non-deterministic components, and what stays the same. Their argument was that most teams could ship a working prototype in a weekend, but the gap between a flashy demo and something a paying customer can rely on for years is much larger than people expect. The framework they laid out was meant as a checklist for closing that gap.
The four part framework
The presentation organized advice into four buckets: user experience, model consistency, evaluations, and orchestration for scale. The order was deliberate. Each step depends on the ones before it, and the speakers argued that teams skip steps at their own cost.
Building a user experience that respects uncertainty
The first bucket was user experience. The speakers' argument was that because the model will sometimes be wrong, the interface has to make that fact visible and recoverable. They pointed to patterns OpenAI had observed in production deployments, including showing citations so users can verify a claim, exposing intermediate reasoning steps, using streaming so the user sees progress instead of staring at a spinner, designing easy regenerate and edit affordances, and avoiding the trap of presenting model output with the same confidence as a deterministic database query.
The broader idea was that AI features work best when they augment a human in the loop rather than replace that human. Guardrails for steerability and safety were part of the same picture: input filters, output filters, and clear boundaries around what the assistant will and will not do.
Managing model consistency
The second bucket dealt with the model itself. Because outputs vary, the speakers showed several techniques for narrowing that variance.
The new seed parameter, launched the same day at the keynote, made it possible to get mostly reproducible outputs across API calls when temperature and other parameters were also held constant. The session pitched this as useful for debugging, unit tests, and bug reports where a colleague needs to reproduce the same model behavior.
JSON mode, also launched at the keynote, guaranteed that the model would return syntactically valid JSON when invoked in that mode. Before this, developers wrote ad hoc parsing code to clean up half formed JSON responses.
Grounding was the third consistency lever. The speakers argued that the most effective way to reduce hallucinations was not better prompt engineering but better retrieval. By giving the model the right context at inference time, either through a vector database, through structured API calls, or through the new retrieval tool in the Assistants API, you reduce the model's need to guess. They positioned retrieval augmented generation as the default architecture for any application that needs to answer questions about specific data.
Function calling, refined and expanded in the same DevDay cycle, was treated as a structural fix. Instead of asking the model to compute an answer that depends on real time data, you let the model decide which function to call, and the function returns a verified fact. The model is then only responsible for the language part.
Evaluation driven development
The third bucket was the one the speakers spent the most time on, and it was probably the most important contribution of the talk. They argued that evaluations, or evals, are the closest thing LLM development has to unit tests, and that teams who skip building an eval suite end up shipping changes blindly.
The recommended workflow looked like this. Start by collecting a small set of representative inputs and the outputs you want for each. This is your golden dataset. Run your current prompt or chain against the dataset. Score the outputs, either through exact match for simple cases, through programmatic checks, or by using GPT-4 itself as a model graded judge for the harder subjective cases. Track the scores over time. When you change a prompt, change a model, or update retrieval, rerun the suite and look at the delta before deciding to ship.
The speakers pointed at OpenAI's open source Evals framework on GitHub as one starting point. They also showed how to use GPT-4 as a grader for outputs from cheaper models like GPT-3.5, an idea that had quietly become the dominant pattern across the industry. The cost of running GPT-4 as judge over a few hundred examples was small relative to the value of catching a regression before it shipped.
The deeper point was cultural. If a team has no eval suite, every model swap, every prompt tweak, and every retrieval change is a leap of faith. With an eval suite in place, those changes become measurable and the team can move faster, not slower, because they trust the numbers.
Orchestrating for scale: latency and cost
The fourth bucket dealt with what happens once an application has real traffic. Latency and cost dominate as products grow, and the speakers ran through the standard playbook.
Semantic caching was the first technique. Instead of caching by exact key, cache by the embedding similarity of the input. Two questions that mean the same thing return the same cached answer. For high traffic FAQs and lookup style queries, this can cut request volume to the model dramatically.
Model routing was the second. Not every query needs the strongest model. The speakers argued for a tiered approach: classify the difficulty of an incoming request, send the easy ones to a cheaper model like GPT-3.5, and only escalate to GPT-4 when the task warrants it. A classifier model can sit in front of the routing decision and learn from past traces.
Fine tuning was the third. A fine tuned GPT-3.5 Turbo can match GPT-4 on narrow tasks at a fraction of the inference cost, and the cost difference at scale is meaningful. The recommended workflow was to use GPT-4 to generate or curate the training data, fine tune GPT-3.5 on that dataset, and then evaluate the fine tuned model against GPT-4 using the same eval suite the team had already built. If the fine tuned model passes, the team can swap it in and reap the cost savings.
Prompt engineering and prompt compression rounded out the list. Shorter, tighter prompts cost less and are often faster to process. The speakers noted that there was no shortage of advice on prompt engineering, but emphasized that without an eval suite to measure the impact of a prompt change, it was very hard to know whether a new prompt was actually better.
What LLM Ops actually means
The phrase LLM Ops was used through the talk as a shorthand for the operational discipline that emerges once you take all four buckets together. The speakers borrowed the term from the older MLOps tradition, with the modification that the model is now usually a third party API rather than something the team trains end to end.
In their framing, LLM Ops covers the lifecycle of an AI application: collecting traces, building eval suites, monitoring for drift and regressions, controlling cost, controlling latency, managing prompt versions, handling fine tuning runs, and coordinating across product, engineering, and safety teams. The talk did not pitch a specific tool. It treated LLM Ops as a set of practices that teams need to build muscle around, regardless of whether they end up using a managed platform, open source tooling, or something they wrote themselves.
The talk landed at a moment when this category of tooling was just starting to take shape. Companies like Humanloop, LangSmith, Weights and Biases, and others were building dashboards, evaluation harnesses, and prompt management products for exactly this workflow. The session implicitly endorsed the broader direction without naming specific vendors.
The presenters
Sherwin Wu joined OpenAI early and was part of the team building the API platform. He went on to lead OpenAI's API engineering organization and later played a prominent role at DevDay 2025 alongside Christina Huang. His perspective on the talk was shaped by the volume and variety of integrations he had seen developers attempt against the API.
Shyamal Hitesh Anadkat sat on the Applied AI team at OpenAI, where his work centered on partnering with companies and startups building on top of the platform. He has been an active contributor to the openai/openai-cookbook repository, including notebooks on evaluations, embeddings, and function calling. His view into the talk came from watching what worked and what failed across many production deployments.
The pairing was deliberate. Wu brought the platform engineer view of what is actually possible at scale, and Anadkat brought the field view of what customers were actually doing with it.
Reception and lasting influence
The talk was well received both in person and after release on YouTube, and it became one of the canonical references that early stage AI engineering teams pointed each other at through 2024. The four part framework was simple enough to remember, and concrete enough to act on. The specific advice on using GPT-4 as a judge over cheaper models became close to standard practice in the year that followed, and the emphasis on building an eval suite before optimizing anything else aged well as the field matured.
In retrospect, the session also marked an inflection point in how OpenAI talked about its product. Earlier OpenAI communication had focused on what the models could do. By DevDay 2023, the company was actively coaching developers on how to operate AI in production. That shift mirrored a broader industry move from demos to durable products, and the talk captured the playbook for that transition more clearly than most.
| Session | Presenters | Focus |
|---|
| A Survey of Techniques for Maximizing LLM Performance | John Allard, Colin Jarvis | Prompt engineering, RAG, fine tuning trade offs |
| New Products: A Deep Dive | Krithika Muthukumar and others | Live demos of GPTs and the Assistants API |
| Research x Product | Various | Dialogue interfaces from GPT-3.5 to GPT-4 |
| The Business of AI | Panel | AI in real business models |
References
- OpenAI. "The New Stack and Ops for AI." YouTube. https://www.youtube.com/watch?v=XGJNo8TpuVA
- OpenAI. "New models and developer products announced at DevDay." November 6, 2023. https://openai.com/index/new-models-and-developer-products-announced-at-devday/
- OpenAI Developer Community. "OpenAI Dev-Day 2023: Breakout Sessions!" https://community.openai.com/t/openai-dev-day-2023-breakout-sessions/505213
- Sherwin Wu. Post on X about the DevDay talk. https://x.com/sherwinwu/status/1724270986101690796
- OpenAI Cookbook. "How to make your completions outputs consistent with the new seed parameter." https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter
- OpenAI Evals repository. https://github.com/openai/evals
- Yoav Grossman. "OpenAI's latest strategies for scaling AI." LinkedIn post. https://www.linkedin.com/posts/yoavgrossman_new-stack-and-ops-for-ai-activity-7128380760574623744-W1js
- TechCrunch. "Everything announced at OpenAI's first developer event." November 6, 2023. https://techcrunch.com/2023/11/06/everything-announced-at-openais-first-developer-event/
- CNBC. "Microsoft CEO Nadella makes surprise appearance at OpenAI event." November 6, 2023. https://www.cnbc.com/2023/11/06/microsoft-ceo-nadella-makes-surprise-appearance-at-openai-event.html
- TechRepublic. "OpenAI DevDay: OpenAI Announces GPT-4 Turbo and GPT Tool Builder Store." https://www.techrepublic.com/article/openai-dev-day-gpt4-turbo/