Skeleton-of-Thought
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 2,521 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 2,521 words
Add missing citations, update stale details, or suggest a clearer explanation.
Skeleton-of-Thought (SoT) is a prompting technique for large language models that reduces end-to-end generation latency by first eliciting a short outline of the answer (the "skeleton") and then expanding each outline point in parallel through batched decoding or concurrent API calls.[^1][^2] Proposed by Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang at Tsinghua University, Microsoft Research, and KU Leuven, the paper "Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation" appeared on arXiv on 28 July 2023 and was accepted to the International Conference on Learning Representations (ICLR) 2024 as a poster.[^1][^3] Unlike most prior chain-style prompting research, which targets reasoning accuracy, SoT explicitly targets wall-clock inference efficiency while attempting to preserve or improve answer quality.[^2][^4] Across 12 evaluated models, the authors report up to a 2.39x speedup, with eight models exceeding 2x acceleration on the Vicuna-80 benchmark.[^1][^2] An extension, SoT with Router (SoT-R), learns or prompts a classifier that decides per query whether SoT is appropriate, falling back to standard decoding for tasks that require sequential reasoning.[^1][^2]
Most state-of-the-art LLMs are autoregressive Transformers that decode one token at a time, conditioning each new token on the prefix of previously generated tokens.[^1][^4] Sequential decoding produces high end-to-end latency for long outputs because each forward pass must wait for the previous step to finish, regardless of how many parallel compute resources are available.[^1] At the time the SoT paper was written, the dominant approaches for accelerating LLM decoding focused either on model-level changes (quantization, smaller architectures, distillation), system-level changes (continuous batching, kernel optimization), or specialized decoding strategies such as speculative decoding and lookahead decoding, which alter or augment the decoding loop itself.[^1][^4]
SoT departs from those traditions by leaving the model and decoding kernel untouched. Instead, it manipulates the prompt and control flow around the model so that the latency-dominant generation step splits into many shorter, independent generations that can run concurrently.[^1][^2] The authors describe SoT as a "data-centric optimization for inference efficiency," distinguishing it from algorithm-, system-, or hardware-level techniques.[^1][^2]
The motivating analogy is human writing. When a person is asked an open-ended question, they often jot down a brief outline ("introduction, three reasons, conclusion") and then expand each bullet, sometimes drafting them in any order rather than strictly left to right.[^1][^4] Sequential token decoding mirrors a strict left-to-right scribe; SoT mirrors the outline-then-expand workflow and exploits the fact that expansion of independent bullets is, in principle, embarrassingly parallel.[^1][^2]
SoT runs in two stages, each implemented as an ordinary natural-language prompt.[^1][^2][^5]
The user query q is wrapped in a "skeleton prompt" that instructs the model to act as an organizer and produce only a list of short skeleton points rather than a full answer.[^1][^4] In the open-source variant of the released prompt, the model is told to produce 3 to 10 numbered points, each only 3 to 5 words long, written in a list format.[^4][^5] The prompt explicitly forbids the model from writing the full content for each point at this stage.[^4][^5] The output of Stage 1 is a list B = (B_1, B_2, ..., B_N).
For each skeleton point B_i, a separate "point-expanding prompt" is constructed that supplies the original question q, the full skeleton B, the index i, and instructions to continue the writing of "one and only one point" in roughly 1 to 2 sentences.[^4][^5] The N point-expansion requests are dispatched concurrently. For API-served models such as GPT-4, GPT-3.5, and Claude, this is implemented as N parallel HTTP requests; for local open-source models, the N prompts are formed into a single batch and decoded in parallel through batched inference on the same GPU.[^1][^2] After all N expansions return, the orchestrator concatenates the point labels with their expansions to produce the final answer.[^1][^2]
The skeleton step itself is still sequential, but it is engineered to be short, so it represents a small fraction of total wall-clock time.[^1][^2]
For a standard generation, latency scales roughly with the number of output tokens because decoding is sequential.[^1] Under SoT, the long answer is split into N independent shorter sub-answers, and the dominant cost becomes the latency of the longest sub-answer rather than the sum of all of them. For API models, parallelism comes from issuing concurrent requests; for local models, parallelism comes from the GPU processing the batch of N decoding streams.[^1][^2][^4] Because each batched sequence still pays its own decoding cost, the achievable speedup depends on (i) how evenly the points are sized, (ii) how short each expansion is, and (iii) how well the model follows the brevity instructions in the point-expanding prompt.[^1][^4]
The authors evaluated SoT on Vicuna-80, a question set of 80 prompts spanning nine categories (writing, roleplay, knowledge, generic, common-sense, coding, math, counterfactual, and Fermi questions) introduced for Vicuna evaluations.[^1][^4] Subsequent experiments also used WizardLM and LIMA-derived prompts.[^1][^2] Twelve LLMs were tested: nine open-source (including Vicuna variants, LLaMA 2 Chat, StableVicuna-13B, OpenChat-13B, UltraLM-13B) and three API-served (Claude, GPT-3.5, and GPT-4).[^1][^2][^4]
Speedup is measured as the wall-clock latency ratio of normal generation to SoT generation on the same model and prompt set.[^1] The headline result is that SoT obtains a greater than 2x end-to-end speedup, up to 2.39x, on eight of the twelve models.[^1][^2][^4] The official site and the Microsoft Research blog report a per-model range of roughly 1.13x to 2.39x.[^2][^4] On open-source LLaMA-class models the reported speedups exceed 3x in one configuration, and on API models the reported acceleration is around 2x.[^2] The principal model where SoT failed to deliver acceleration is StableVicuna-13B, which did not consistently honour the brevity instructions and generated long point expansions as if they were normal answers, eroding the latency advantage.[^4]
Answer quality was scored automatically using two GPT-4-based judges: the FastChat pairwise protocol and the LLMZoo multi-dimension protocol (covering coherence, diversity, immersion, integrity, and relevance).[^2][^4] On FastChat, SoT recorded win/tie/lose rates of approximately 29.5 percent / 29.3 percent / 41.2 percent against normal generation; on LLMZoo the corresponding rates were roughly 45.8 percent / 19.6 percent / 34.5 percent.[^2] The authors summarize the overall result as SoT being comparable to or better than normal generation in about 60 percent of cases on the LLMZoo metric.[^2][^4] Quality varied by category: writing, roleplay, and generic-knowledge categories tended to benefit, while math, coding, and tasks requiring step-by-step dependencies tended to lose quality relative to the sequential baseline.[^2][^4]
The authors acknowledge that SoT is unsuitable for some categories of question, particularly those whose correct answer must unfold step by step with strong dependencies between intermediate computations.[^1][^2] They propose SoT-R, a per-query routing mechanism that decides whether to apply SoT or to fall back to standard generation.[^1][^2][^4]
Two router implementations are described:[^1][^2][^4]
Both routers retain end-to-end speedup above 1x for most models while restoring quality on categories where SoT alone underperformed.[^1][^2][^4] The trade-off is straightforward: SoT-R sacrifices peak speedup (since SoT is not fired on every query) in return for avoiding degradation on sequential-reasoning prompts.[^2][^4]
The reference implementation is open-sourced under an MIT licence at the GitHub repository imagination-research/sot.[^3] Its structure is summarized in the table below.
| Directory | Purpose |
|---|---|
sot/ | Core SoT and SoT-R logic for skeleton prompting and parallel expansion |
prompts/ | Skeleton and point-expansion prompt templates for open-source models and for GPT-4 |
data/ | Processed prompt sets for Vicuna-80, WizardLM, and LIMA |
scripts/ | Answer dumping and GPT-4-as-judge evaluation pipelines for Vicuna and WizardLM |
demo/ | A Gradio web demo built on FastChat that runs SoT interactively |
The repository also ships a router-training pipeline that fine-tunes a RoBERTa classifier on annotated LIMA examples and exports the resulting model for use as the trained-router backend.[^3] Implementations exist for both the prompting and trained variants of SoT-R.[^3]
Tree of Thoughts (ToT) generalizes Chain-of-Thought prompting by exploring a tree of intermediate thoughts with backtracking and evaluation steps, prioritising accuracy on multi-step reasoning rather than latency.[^4][^6] SoT can be viewed as a much shallower structural prompt: rather than searching a deep tree of reasoning steps, it generates a single one-level outline and expands each branch in parallel, then concatenates.[^4][^6] Several secondary analyses of structured prompting describe SoT in this way, presenting it as a degenerate-tree special case used for efficient parallel decoding rather than for search.[^6] Graph of Thoughts (GoT) by Besta et al., a contemporary work cited in the SoT paper, treats reasoning as a directed graph of nodes with arbitrary edges encoding dependencies and aggregation, again aimed at accuracy and aggregation flexibility rather than latency.[^4][^6] The SoT authors note that GoT-style dependencies could in principle be added on top of SoT to model dependencies between skeleton points, at the cost of giving up some of the parallelism.[^4]
The table below compares the three structured-prompting families along the dimensions most relevant to SoT's positioning.
| Method | Primary goal | Structure | Parallelism | Reference |
|---|---|---|---|---|
| Chain-of-Thought | Reasoning accuracy | Linear chain | None | Wei et al., 2022 |
| Tree of Thoughts (ToT) | Reasoning via search | Tree, with backtracking | Branches may be explored separately, but result selection is sequential | Yao et al., 2023[^6] |
| Graph of Thoughts (GoT) | Reasoning via aggregation | Arbitrary DAG of thoughts | Limited by graph dependencies | Besta et al., 2023[^6] |
| Skeleton-of-Thought (SoT) | Inference latency | One-level outline, parallel expansion | All N expansions run concurrently | Ning et al., 2023[^1] |
SoT is sometimes described as a "data-centric" form of parallel decoding because it allows multiple tokens (in fact, multiple subsentences) to be produced concurrently without changing the model weights or the decoding algorithm.[^1][^2] This places it conceptually adjacent to model-level parallel decoding methods such as speculative decoding, Medusa, Lookahead Decoding, and EAGLE, all of which seek to generate multiple tokens per forward pass through model-level changes.[^1][^4] The trade-off is different: SoT requires no draft model and no modifications to the decoding kernel, but it changes the content of the produced answer because the answer is now a concatenation of independent expansions, whereas speculative methods are designed to produce exactly the same distribution as the target model under standard decoding.[^1][^4] SoT is therefore complementary rather than competitive with those methods, and can in principle be stacked on top of them.[^1][^4]
SoT is distinct from prompting techniques that improve reasoning quality without addressing latency, such as self-consistency, Self-Refine, Auto-CoT, Step-Back Prompting, and Self-Discover. Those methods generally increase total compute (by sampling multiple chains, refining iteratively, or generating reasoning skeletons specifically for accuracy), whereas SoT trades off content composition for reduced wall-clock latency.[^1][^4]
SoT is widely cited in the LLM efficiency literature as an example of how prompting alone can change the time profile of generation rather than only its accuracy.[^4][^6] By demonstrating sizeable speedups across heterogeneous closed and open-source models, including Claude, GPT-3.5, and GPT-4, the authors made the case that response latency in chat applications is partly a property of how the model is asked, not only how it is implemented.[^1][^2] The SoT-R variant subsequently became the more practically deployed configuration, since it confines SoT to the categories of question where it actually helps and avoids degrading reasoning tasks.[^1][^2][^4] The paper's broader claim, that "explicitly planning the answer structure in language" is itself a useful optimisation knob for inference, has informed later work on structured generation and outline-first generation pipelines.[^1][^4]
The authors and follow-up commentary identify several limitations.[^1][^2][^4]
The work was first released on arXiv on 28 July 2023, with subsequent revisions through 2 March 2024 (v3).[^1] A short version appeared at the NeurIPS 2023 ENLSP workshop on efficient NLP and speech processing.[^7] The full paper was accepted to ICLR 2024 as a poster.[^3][^8] The work received public discussion on Microsoft Research's research blog in November 2023, framing SoT as an example of how data-centric prompting changes inference economics.[^2] As of the ICLR 2024 publication, the open-source repository imagination-research/sot provided the reference implementation, prompt templates, and evaluation harness used to reproduce the published speedups.[^3]