Skeleton-of-Thought

AI Inference Prompt Engineering

13 min read

Updated Jul 23, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 23, 2026

Fact-checked

In review queue

Sources

8 citations

Revision

v4 · 2,519 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Skeleton-of-Thought (SoT) is a prompting technique for large language models that reduces end-to-end generation latency by first eliciting a short outline of the answer (the "skeleton") and then expanding each outline point in parallel through batched decoding or concurrent API calls.^[1]^[2] Proposed by Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang at Tsinghua University, Microsoft Research, and KU Leuven, the paper "Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation" appeared on arXiv on 28 July 2023 and was accepted to the International Conference on Learning Representations (ICLR) 2024 as a poster.^[1]^[3] Unlike most prior chain-style prompting research, which targets reasoning accuracy, SoT explicitly targets wall-clock inference efficiency while attempting to preserve or improve answer quality.^[2]^[4] Across 12 evaluated models, the authors report up to a 2.39x speedup, with eight models exceeding 2x acceleration on the Vicuna-80 benchmark.^[1]^[2] An extension, SoT with Router (SoT-R), learns or prompts a classifier that decides per query whether SoT is appropriate, falling back to standard decoding for tasks that require sequential reasoning.^[1]^[2]

Background

Most state-of-the-art LLMs are autoregressive Transformers that decode one token at a time, conditioning each new token on the prefix of previously generated tokens.^[1]^[4] Sequential decoding produces high end-to-end latency for long outputs because each forward pass must wait for the previous step to finish, regardless of how many parallel compute resources are available.^[1] At the time the SoT paper was written, the dominant approaches for accelerating LLM decoding focused either on model-level changes (quantization, smaller architectures, distillation), system-level changes (continuous batching, kernel optimization), or specialized decoding strategies such as speculative decoding and lookahead decoding, which alter or augment the decoding loop itself.^[1]^[4]

SoT departs from those traditions by leaving the model and decoding kernel untouched. Instead, it manipulates the prompt and control flow around the model so that the latency-dominant generation step splits into many shorter, independent generations that can run concurrently.^[1]^[2] The authors describe SoT as a "data-centric optimization for inference efficiency," distinguishing it from algorithm-, system-, or hardware-level techniques.^[1]^[2]

The motivating analogy is human writing. When a person is asked an open-ended question, they often jot down a brief outline ("introduction, three reasons, conclusion") and then expand each bullet, sometimes drafting them in any order rather than strictly left to right.^[1]^[4] Sequential token decoding mirrors a strict left-to-right scribe; SoT mirrors the outline-then-expand workflow and exploits the fact that expansion of independent bullets is, in principle, embarrassingly parallel.^[1]^[2]

Method

SoT runs in two stages, each implemented as an ordinary natural-language prompt.^[1]^[2]^[5]

Stage 1: Skeleton generation

The user query q is wrapped in a "skeleton prompt" that instructs the model to act as an organizer and produce only a list of short skeleton points rather than a full answer.^[1]^[4] In the open-source variant of the released prompt, the model is told to produce 3 to 10 numbered points, each only 3 to 5 words long, written in a list format.^[4]^[5] The prompt explicitly forbids the model from writing the full content for each point at this stage.^[4]^[5] The output of Stage 1 is a list B = (B_1, B_2, ..., B_N).

Stage 2: Parallel point expansion

For each skeleton point B_i, a separate "point-expanding prompt" is constructed that supplies the original question q, the full skeleton B, the index i, and instructions to continue the writing of "one and only one point" in roughly 1 to 2 sentences.^[4]^[5] The N point-expansion requests are dispatched concurrently. For API-served models such as GPT-4, GPT-3.5, and Claude, this is implemented as N parallel HTTP requests; for local open-source models, the N prompts are formed into a single batch and decoded in parallel through batched inference on the same GPU.^[1]^[2] After all N expansions return, the orchestrator concatenates the point labels with their expansions to produce the final answer.^[1]^[2]

The skeleton step itself is still sequential, but it is engineered to be short, so it represents a small fraction of total wall-clock time.^[1]^[2]

Why latency drops

For a standard generation, latency scales roughly with the number of output tokens because decoding is sequential.^[1] Under SoT, the long answer is split into N independent shorter sub-answers, and the dominant cost becomes the latency of the longest sub-answer rather than the sum of all of them. For API models, parallelism comes from issuing concurrent requests; for local models, parallelism comes from the GPU processing the batch of N decoding streams.^[1]^[2]^[4] Because each batched sequence still pays its own decoding cost, the achievable speedup depends on (i) how evenly the points are sized, (ii) how short each expansion is, and (iii) how well the model follows the brevity instructions in the point-expanding prompt.^[1]^[4]

Experimental results

The authors evaluated SoT on Vicuna-80, a question set of 80 prompts spanning nine categories (writing, roleplay, knowledge, generic, common-sense, coding, math, counterfactual, and Fermi questions) introduced for Vicuna evaluations.^[1]^[4] Subsequent experiments also used WizardLM and LIMA-derived prompts.^[1]^[2] Twelve LLMs were tested: nine open-source (including Vicuna variants, LLaMA 2 Chat, StableVicuna-13B, OpenChat-13B, UltraLM-13B) and three API-served (Claude, GPT-3.5, and GPT-4).^[1]^[2]^[4]

Latency speedups

Speedup is measured as the wall-clock latency ratio of normal generation to SoT generation on the same model and prompt set.^[1] The headline result is that SoT obtains a greater than 2x end-to-end speedup, up to 2.39x, on eight of the twelve models.^[1]^[2]^[4] The official site and the Microsoft Research blog report a per-model range of roughly 1.13x to 2.39x.^[2]^[4] On open-source LLaMA-class models the reported speedups exceed 3x in one configuration, and on API models the reported acceleration is around 2x.^[2] The principal model where SoT failed to deliver acceleration is StableVicuna-13B, which did not consistently honour the brevity instructions and generated long point expansions as if they were normal answers, eroding the latency advantage.^[4]

Quality

Answer quality was scored automatically using two GPT-4-based judges: the FastChat pairwise protocol and the LLMZoo multi-dimension protocol (covering coherence, diversity, immersion, integrity, and relevance).^[2]^[4] On FastChat, SoT recorded win/tie/lose rates of approximately 29.5 percent / 29.3 percent / 41.2 percent against normal generation; on LLMZoo the corresponding rates were roughly 45.8 percent / 19.6 percent / 34.5 percent.^[2] The authors summarize the overall result as SoT being comparable to or better than normal generation in about 60 percent of cases on the LLMZoo metric.^[2]^[4] Quality varied by category: writing, roleplay, and generic-knowledge categories tended to benefit, while math, coding, and tasks requiring step-by-step dependencies tended to lose quality relative to the sequential baseline.^[2]^[4]

SoT with Router (SoT-R)

The authors acknowledge that SoT is unsuitable for some categories of question, particularly those whose correct answer must unfold step by step with strong dependencies between intermediate computations.^[1]^[2] They propose SoT-R, a per-query routing mechanism that decides whether to apply SoT or to fall back to standard generation.^[1]^[2]^[4]

Two router implementations are described:^[1]^[2]^[4]

A prompting router, in which a frontier model (GPT-4 in the paper) is asked, via an instruction prompt, whether the input question is likely to admit a list-style answer; the answer determines whether SoT is invoked.^[1]^[2]^[4]
A trained router, a RoBERTa classifier (approximately 120 million parameters) fine-tuned on labelled prompts derived from the LIMA dataset and evaluated on Vicuna-80 and WizardLM. The classifier adds only a small constant latency overhead.^[4]

Both routers retain end-to-end speedup above 1x for most models while restoring quality on categories where SoT alone underperformed.^[1]^[2]^[4] The trade-off is straightforward: SoT-R sacrifices peak speedup (since SoT is not fired on every query) in return for avoiding degradation on sequential-reasoning prompts.^[2]^[4]

Implementation

The reference implementation is open-sourced under an MIT licence at the GitHub repository imagination-research/sot.^[3] Its structure is summarized in the table below.

Directory	Purpose
`sot/`	Core SoT and SoT-R logic for skeleton prompting and parallel expansion
`prompts/`	Skeleton and point-expansion prompt templates for open-source models and for GPT-4
`data/`	Processed prompt sets for Vicuna-80, WizardLM, and LIMA
`scripts/`	Answer dumping and GPT-4-as-judge evaluation pipelines for Vicuna and WizardLM
`demo/`	A Gradio web demo built on FastChat that runs SoT interactively

The repository also ships a router-training pipeline that fine-tunes a RoBERTa classifier on annotated LIMA examples and exports the resulting model for use as the trained-router backend.^[3] Implementations exist for both the prompting and trained variants of SoT-R.^[3]

Relation to other prompting and decoding methods

Tree of Thoughts and Graph of Thoughts

Tree of Thoughts (ToT) generalizes Chain-of-Thought prompting by exploring a tree of intermediate thoughts with backtracking and evaluation steps, prioritising accuracy on multi-step reasoning rather than latency.^[4]^[6] SoT can be viewed as a much shallower structural prompt: rather than searching a deep tree of reasoning steps, it generates a single one-level outline and expands each branch in parallel, then concatenates.^[4]^[6] Several secondary analyses of structured prompting describe SoT in this way, presenting it as a degenerate-tree special case used for efficient parallel decoding rather than for search.^[6] Graph of Thoughts (GoT) by Besta et al., a contemporary work cited in the SoT paper, treats reasoning as a directed graph of nodes with arbitrary edges encoding dependencies and aggregation, again aimed at accuracy and aggregation flexibility rather than latency.^[4]^[6] The SoT authors note that GoT-style dependencies could in principle be added on top of SoT to model dependencies between skeleton points, at the cost of giving up some of the parallelism.^[4]

The table below compares the three structured-prompting families along the dimensions most relevant to SoT's positioning.

Method	Primary goal	Structure	Parallelism	Reference
Chain-of-Thought	Reasoning accuracy	Linear chain	None	Wei et al., 2022
Tree of Thoughts (ToT)	Reasoning via search	Tree, with backtracking	Branches may be explored separately, but result selection is sequential	Yao et al., 2023^[6]
Graph of Thoughts (GoT)	Reasoning via aggregation	Arbitrary DAG of thoughts	Limited by graph dependencies	Besta et al., 2023^[6]
Skeleton-of-Thought (SoT)	Inference latency	One-level outline, parallel expansion	All N expansions run concurrently	Ning et al., 2023^[1]

Parallel decoding

SoT is sometimes described as a "data-centric" form of parallel decoding because it allows multiple tokens (in fact, multiple subsentences) to be produced concurrently without changing the model weights or the decoding algorithm.^[1]^[2] This places it conceptually adjacent to model-level parallel decoding methods such as speculative decoding, Medusa, Lookahead Decoding, and EAGLE, all of which seek to generate multiple tokens per forward pass through model-level changes.^[1]^[4] The trade-off is different: SoT requires no draft model and no modifications to the decoding kernel, but it changes the content of the produced answer because the answer is now a concatenation of independent expansions, whereas speculative methods are designed to produce exactly the same distribution as the target model under standard decoding.^[1]^[4] SoT is therefore complementary rather than competitive with those methods, and can in principle be stacked on top of them.^[1]^[4]

Other prompting families

SoT is distinct from prompting techniques that improve reasoning quality without addressing latency, such as self-consistency, Self-Refine, Auto-CoT, Step-Back Prompting, and Self-Discover. Those methods generally increase total compute (by sampling multiple chains, refining iteratively, or generating reasoning skeletons specifically for accuracy), whereas SoT trades off content composition for reduced wall-clock latency.^[1]^[4]

Significance

SoT is widely cited in the LLM efficiency literature as an example of how prompting alone can change the time profile of generation rather than only its accuracy.^[4]^[6] By demonstrating sizeable speedups across heterogeneous closed and open-source models, including Claude, GPT-3.5, and GPT-4, the authors made the case that response latency in chat applications is partly a property of how the model is asked, not only how it is implemented.^[1]^[2] The SoT-R variant subsequently became the more practically deployed configuration, since it confines SoT to the categories of question where it actually helps and avoids degrading reasoning tasks.^[1]^[2]^[4] The paper's broader claim, that "explicitly planning the answer structure in language" is itself a useful optimisation knob for inference, has informed later work on structured generation and outline-first generation pipelines.^[1]^[4]

Limitations

The authors and follow-up commentary identify several limitations.^[1]^[2]^[4]

Inappropriateness for sequential reasoning. Math, coding, and other categories whose answer is fundamentally chained suffer from quality loss because point expansions cannot depend on the contents of other expansions.^[2]^[4]
Instruction-following dependence. The realized speedup depends heavily on the model honouring the brevity constraints in the point-expanding prompt. Models such as StableVicuna-13B that ignore these constraints retain near-baseline latency.^[4]
Concatenation artefacts. Because each expansion is generated independently, the resulting answer can suffer reduced coherence, immersion, or stylistic uniformity. The LLMZoo multi-dimension evaluation surfaces this on several categories.^[2]^[4]
Static skeleton. The skeleton produced in Stage 1 is fixed for the duration of Stage 2 and cannot be revised in light of what the expansions produce. The authors discuss this as a route to future work, including dynamic Graph-of-Thoughts style dependencies, at the cost of some parallelism.^[4]
API-cost characteristics. SoT issues N parallel API calls per query. End-to-end latency drops, but total API throughput consumed per question can grow because each expansion carries its own prompt and overhead.^[1]^[2]

Reception

The work was first released on arXiv on 28 July 2023, with subsequent revisions through 2 March 2024 (v3).^[1] A short version appeared at the NeurIPS 2023 ENLSP workshop on efficient NLP and speech processing.^[7] The full paper was accepted to ICLR 2024 as a poster.^[3]^[8] The work received public discussion on Microsoft Research's research blog in November 2023, framing SoT as an example of how data-centric prompting changes inference economics.^[2] As of the ICLR 2024 publication, the open-source repository imagination-research/sot provided the reference implementation, prompt templates, and evaluation harness used to reproduce the published speedups.^[3]

References

Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang, "Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation", arXiv, 2023-07-28 (v1) to 2024-03-02 (v3). https://arxiv.org/abs/2307.15337. Accessed 2026-05-21. ↩
Xuefei Ning and Zinan Lin, "Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output", Microsoft Research Blog, 2023-11-17. https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/. Accessed 2026-05-21. ↩
Imagination Research (Xuefei Ning et al.), "imagination-research/sot: Skeleton-of-Thought (ICLR 2024)", GitHub, 2024. https://github.com/imagination-research/sot. Accessed 2026-05-21. ↩
Ning et al., "Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation (HTML version of v3)", arXiv, 2024-03-02. https://arxiv.org/html/2307.15337v3. Accessed 2026-05-21. ↩
Project authors, "Skeleton-of-Thought project site", Google Sites, 2023-2024. https://sites.google.com/view/sot-llm. Accessed 2026-05-21. ↩
Besta et al., "Demystifying Chains, Trees, and Graphs of Thoughts", arXiv, 2024. https://arxiv.org/html/2401.14295v3. Accessed 2026-05-21. ↩
Ning et al., "Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding (ENLSP 2023 workshop version)", NeurIPS 2023 ENLSP Workshop, 2023. https://neurips2023-enlsp.github.io/papers/paper_33.pdf. Accessed 2026-05-21. ↩
OpenReview, "Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation (ICLR 2024 poster)", OpenReview, 2024. https://openreview.net/forum?id=mqVgBbNCm9. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributor · full history

Suggest edit

What links here

Chain-of-Thought