SpiRit-LM
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,541 words
SpiRit-LM (also written Spirit LM) is a large language model from Meta AI's Fundamental AI Research (FAIR) group that handles spoken and written language inside a single model. Instead of treating speech and text as separate systems joined by a pipeline, SpiRit-LM trains on sequences in which text tokens and speech tokens are interleaved at the word level, so the model can read text, listen to speech, and freely move between the two within one generation. The paper, "SpiRit-LM: Interleaved Spoken and Written Language Model" (Nguyen et al.), was first posted to arXiv on 8 February 2024, and Meta released the model weights and inference code openly on 18 October 2024 [1][2][3].
The model comes in two versions. The Base version represents speech with phonetic units derived from HuBERT. The Expressive version adds pitch and style tokens on top of those phonetic units so that it can capture emotion, tone, and speaking style rather than just the words [1][3].
Conventional approaches to combining speech and language chain together separate components: a speech recognition system converts audio to text, a text language model processes that text, and a text-to-speech system turns the response back into audio. Each handoff discards information. Once speech becomes text, prosody, emotion, and speaker style are gone, and the language model never sees them. SpiRit-LM was built to keep speech and text in the same representational space throughout, training one model that understands and produces both, and (in the Expressive version) preserves the expressive qualities of the spoken input [1][2].
The approach follows a line of "textless NLP" and generative spoken language modeling work at Meta. SpiRit-LM extends that line of work by grafting speech onto a strong pretrained text model rather than training a speech model from scratch, which lets it inherit the text model's language abilities [1].
SpiRit-LM is initialized from the Llama 2 7B text model and then further trained on a mix of text, speech, and aligned speech-text data. Both released versions contain 7 billion parameters [3].
Text is encoded with Llama 2's standard subword byte-pair-encoding (BPE) tokens. Speech is converted into discrete tokens by a separate tokenizer stack, and these speech tokens are added to the model's vocabulary alongside the text tokens so that a single token stream can carry both modalities [1][3].
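The Base model uses one kind of speech token; the Expressive model uses three. The Base model's phonetic tokens are derived from HuBERT. The Expressive model adds pitch tokens, produced by a VQ-VAE over the fundamental frequency (F0) with a 64-entry codebook at roughly 12.5 tokens per second, and style tokens, obtained by k-means clustering of wav2vec2 features into 100 units.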
To turn speech tokens back into audio, SpiRit-LM uses a HiFi-GAN vocoder/decoder (roughly 14 to 15 million parameters). For the Expressive model the decoder is conditioned on phonetic, pitch, and style tokens together, which is how the spoken output recovers tone and emotion [1][3].
Because of the extra token types, the two versions have slightly different vocabulary sizes: 32,512 for Base and 32,768 for Expressive [3].
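As a rough consistency check on these figures, the sketch below subtracts the size of the original text vocabulary from each total. The 32,000-entry figure for Llama 2's tokenizer comes from the Llama 2 release rather than from this article, and how the added slots are divided among speech units, special tokens, and padding is not stated here, so the code reports totals only.

```python
# Llama 2 tokenizer size: an assumption taken from the Llama 2 release,
# not a figure stated in this article.
LLAMA2_TEXT_VOCAB = 32_000

# Vocabulary sizes reported for the two SpiRit-LM versions.
VOCAB_SIZES = {"base": 32_512, "expressive": 32_768}

# Pitch and style codebook sizes quoted in this article's comparison table.
PITCH_CODEBOOK = 64
STYLE_UNITS = 100


def added_tokens(total_vocab: int, text_vocab: int = LLAMA2_TEXT_VOCAB) -> int:
    """Number of vocabulary entries beyond the original text tokens."""
    return total_vocab - text_vocab


extra = {name: added_tokens(size) for name, size in VOCAB_SIZES.items()}
expressive_only = extra["expressive"] - extra["base"]

print(extra)            # entries added on top of the text vocabulary
print(expressive_only)  # additional entries in Expressive relative to Base
print(PITCH_CODEBOOK + STYLE_UNITS)  # pitch + style codes that must fit in that gap
```

Under these assumptions Base adds 512 entries and Expressive adds 768, a gap of 256, which is enough room for the 64 pitch codes and 100 style units. What the remaining slots hold is not specified in this article.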
The central training idea is word-level interleaving. Using speech-text aligned corpora, the training data is built so that the token stream switches between text spans and speech spans at word boundaries, with switches triggered randomly. A single example can therefore start in text, continue in speech, and return to text, all describing one continuous utterance. A simplified illustration from the paper is a sequence like "[Text] the cat [Speech] (HuBERT units for 'sat on') [Text] the mat", where part of the sentence is written and part is encoded as speech tokens [1].
By seeing many such mixed sequences, the model learns to align the two modalities at a fine granularity. This is what lets it convert between speech and text and perform tasks in either modality without task-specific architectures: at inference time a user can prompt it with text and ask for speech, prompt with speech and ask for text, or mix them [1][2].
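The word-level interleaving described above can be sketched in a few lines. This is a minimal illustration, not the authors' data pipeline: the `AlignedWord` type, the `interleave` function, the `switch_prob` parameter, the unit IDs, and the `<unit_N>` string format are all hypothetical, and the actual switching distribution used in training is not given in this article.

```python
import random
from dataclasses import dataclass


@dataclass
class AlignedWord:
    text: str                # the written form of the word
    speech_units: list[int]  # hypothetical discrete speech unit IDs for the same word


def interleave(words, switch_prob=0.3, rng=None, start_modality="text"):
    """Build one mixed stream, switching modality only at word boundaries."""
    rng = rng or random.Random()
    modality = start_modality
    stream = [f"[{modality.capitalize()}]"]  # modality marker opens each span
    for i, word in enumerate(words):
        # Possibly switch modality before every word except the first.
        if i > 0 and rng.random() < switch_prob:
            modality = "speech" if modality == "text" else "text"
            stream.append(f"[{modality.capitalize()}]")
        if modality == "text":
            stream.append(word.text)
        else:
            stream.extend(f"<unit_{u}>" for u in word.speech_units)
    return stream


# Made-up alignment data for illustration only.
words = [
    AlignedWord("the", [12, 7]),
    AlignedWord("cat", [41, 3, 88]),
    AlignedWord("sat", [5, 19]),
    AlignedWord("on", [64]),
    AlignedWord("the", [12, 7]),
    AlignedWord("mat", [30, 2, 77]),
]

print(" ".join(interleave(words, rng=random.Random(0))))
```

Because switches happen only between words, every speech span lines up with a whole number of words in the transcript, which is the property that gives the model a fine-grained alignment signal. A real implementation would emit vocabulary IDs rather than strings.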
| Feature | SpiRit-LM Base | SpiRit-LM Expressive |
|---|---|---|
| Base text model | Llama 2 7B | Llama 2 7B |
| Parameters | 7B | 7B |
| Speech token types | Phonetic (HuBERT) only | Phonetic + pitch + style |
| Vocabulary size | 32,512 | 32,768 |
| Captures expressivity | No | Yes (pitch and style) |
| Pitch tokens | N/A | VQ-VAE on F0, codebook size 64, ~12.5 tokens/sec |
| Style tokens | N/A | wav2vec2 features, k-means 100 units |
| Audio decoder | HiFi-GAN | HiFi-GAN conditioned on phonetic, pitch, style |
The practical difference is that Base treats speech mainly as a carrier of words, while Expressive also models how something is said. On Meta's sentiment-preservation evaluation, the Expressive model keeps the emotional tone of a prompt when generating, whereas the Base model and a cascaded baseline tend to lose it [1][3].
SpiRit-LM was trained between October and December 2023 on a combination of text-only, speech-only, and aligned speech-text datasets. The reported scale includes about 307 billion text-only tokens; roughly 458,000 hours of speech-only audio, corresponding to about 28.2 billion speech tokens; and roughly 111,000 hours of aligned speech-and-text data, corresponding to about 7.0 billion speech tokens plus 1.4 billion text tokens [3].
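The hour and token counts above imply an average speech token rate, which the following back-of-the-envelope sketch computes. The function name is illustrative, the inputs are the rounded figures quoted in this article, and the result says nothing about how the tokenizer arrives at that rate.

```python
SECONDS_PER_HOUR = 3_600


def tokens_per_second(tokens: float, hours: float) -> float:
    """Average number of tokens per second of audio."""
    return tokens / (hours * SECONDS_PER_HOUR)


# Rounded figures quoted in this article.
speech_only_rate = tokens_per_second(28.2e9, 458_000)  # speech-only data
aligned_rate = tokens_per_second(7.0e9, 111_000)       # speech side of aligned data

print(f"speech-only: {speech_only_rate:.1f} speech tokens/sec")
print(f"aligned:     {aligned_rate:.1f} speech tokens/sec")
```

Both subsets work out to roughly 17 speech tokens per second of audio, so the two sets of reported figures are consistent with each other.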
A key claim of the paper is that SpiRit-LM can learn new tasks in a few-shot fashion across modalities, including automatic speech recognition (ASR), text-to-speech, and speech classification, without being fine-tuned for them [1][2].
Reported few-shot results for the Base model, alongside a cascade baseline (for example, Whisper plus Llama 2 for ASR), include the following. As expected for a single open model competing against a strong specialized pipeline, the cascade still wins on all three tasks, but SpiRit-LM performs them from within one model [1]:
| Task (setting) | SpiRit-LM Base | Cascade baseline |
|---|---|---|
| ASR, LibriSpeech clean, 10-shot (WER, lower is better) | 21.9 | 3.7 |
| TTS, LibriSpeech clean, 10-shot (CER, lower is better) | 45.5 | 4.0 |
| Intent classification, 30-shot (accuracy, higher is better) | 71.9 | 89.6 |
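For readers unfamiliar with the metrics in the table, word error rate (WER) and character error rate (CER) are both edit distances normalized by the reference length. The sketch below is a generic textbook implementation, not the paper's evaluation code. It reports percentages on the assumption that the table values are percentages, and it omits text normalization and the step that transcribes generated speech, neither of which is described in this article.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER (unit='word') or CER (unit='char'), as a percentage of reference length."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


wer = error_rate("the cat sat on the mat", "the cat sat on a mat")
cer = error_rate("the cat sat on the mat", "the cat sat on a mat", unit="char")
print(f"WER: {wer:.1f}%  CER: {cer:.1f}%")
```

In the usage example, one substituted word out of six reference words gives a WER of about 16.7%. Because insertions are counted, either metric can exceed 100% when the hypothesis is much longer than the reference.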
To measure expressivity, the authors introduced the Speech-Text Sentiment Preservation (STSP) benchmark, which checks whether a model keeps the sentiment of a prompt when continuing it, both within a modality and across modalities (speech to speech, speech to text, text to speech, text to text). The Expressive model outperforms both the Base model and the cascade on preserving sentiment in every direction except text-to-text, where the three are comparable. The paper presents SpiRit-LM as the first language model able to preserve the sentiment of text and speech prompts both within and across modalities [1][3].
Representative STSP sentiment-preservation scores (higher is better) [1]:
| Direction | SpiRit-LM Base | SpiRit-LM Expressive | Cascade |
|---|---|---|---|
| Text to text | 0.65 | 0.63 | 0.65 |
| Speech to speech | 0.33 | 0.54 | 0.33 |
| Text to speech | 0.33 | 0.38 | 0.36 |
| Speech to text | 0.34 | 0.36 | 0.33 |
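The claim that the Expressive model leads in every direction except text-to-text can be checked directly against the table. The snippet below copies the scores shown above into a dictionary and lists the top-scoring system or systems for each direction. The dictionary keys and the `best_models` helper are illustrative names only.

```python
# STSP scores copied from the table above (higher is better).
STSP = {
    "text_to_text":     {"base": 0.65, "expressive": 0.63, "cascade": 0.65},
    "speech_to_speech": {"base": 0.33, "expressive": 0.54, "cascade": 0.33},
    "text_to_speech":   {"base": 0.33, "expressive": 0.38, "cascade": 0.36},
    "speech_to_text":   {"base": 0.34, "expressive": 0.36, "cascade": 0.33},
}


def best_models(scores: dict) -> list:
    """Names of the system(s) with the highest score, in alphabetical order."""
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


for direction, scores in STSP.items():
    print(f"{direction:17s} best: {', '.join(best_models(scores))}")
```

On these representative numbers, Base and the cascade tie for text-to-text, and the Expressive model is the sole leader in the three directions that involve speech, with the largest margin in speech-to-speech.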
Meta released SpiRit-LM as an open model on 18 October 2024, publishing the research paper, inference code, and weights for both the Base and Expressive 7B versions through the facebookresearch/spiritlm repository on GitHub. The release is governed by the FAIR Noncommercial Research License, which restricts use to noncommercial research [2][3][4].
The model card states that SpiRit-LM is intended for noncommercial research use in English and should not be deployed in consumer-facing applications. As with other large language models, the authors note that it can produce inaccurate, biased, or otherwise objectionable output, and that, because it is derived from Llama 2, it can generate harmful content unless paired with safety instruction-tuning similar to Llama 2-Chat [1][3].
A peer-reviewed version of the paper was subsequently published in the Transactions of the Association for Computational Linguistics (TACL) [5].