# SpiRit-LM

> Source: https://aiwiki.ai/wiki/spirit_lm
> Updated: 2026-06-03
> Categories: Large Language Models, Meta AI, Speech & Audio AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**SpiRit-LM** (also written **Spirit LM**) is a [large language model](/wiki/large_language_model) from [Meta AI](/wiki/meta_ai)'s Fundamental AI Research (FAIR) group that handles spoken and written language inside a single model. Instead of treating speech and text as separate systems joined by a pipeline, SpiRit-LM trains on sequences in which text tokens and speech tokens are interleaved at the word level, so the model can read text, listen to speech, and freely move between the two within one generation. The paper, "SpiRit-LM: Interleaved Spoken and Written Language Model" (Nguyen et al.), was first posted to arXiv on 8 February 2024, and Meta released the model weights and inference code openly on 18 October 2024 [1][2][3].

The model comes in two versions. The **Base** version represents speech with phonetic units derived from [HuBERT](/wiki/hubert). The **Expressive** version adds pitch and style tokens on top of those phonetic units so that it can capture emotion, tone, and speaking style rather than just the words [1][3].

## Background and motivation

Conventional approaches to combining speech and language chain together separate components: a [speech recognition](/wiki/speech_recognition) system converts audio to text, a text language model processes that text, and a [text-to-speech](/wiki/text_to_speech) system turns the response back into audio. Each handoff discards information. Once speech becomes text, prosody, emotion, and speaker style are gone, and the language model never sees them. SpiRit-LM was built to keep speech and text in the same representational space throughout, training one model that understands and produces both, and (in the Expressive version) preserves the expressive qualities of the spoken input [1][2].

The approach follows a line of "textless NLP" and generative spoken language modeling work at Meta. SpiRit-LM extends that line of work by grafting speech onto a strong pretrained text model rather than training a speech model from scratch, which lets it inherit the text model's language abilities [1].

## Architecture

SpiRit-LM is initialized from the [Llama 2](/wiki/llama_2) 7B text model and then further pretrained on a mix of text, speech, and aligned speech-text data. Both released versions contain 7 billion parameters [3].

Text is encoded with Llama 2's standard subword byte-pair-encoding (BPE) tokens. Speech is converted into discrete tokens by a separate tokenizer stack, and these speech tokens are added to the model's vocabulary alongside the text tokens so that a single token stream can carry both modalities [1][3].

### Speech tokenization

The Base model uses one kind of speech token; the Expressive model uses three. The components are:

- **Phonetic tokens.** A HuBERT speech encoder turns the waveform into discrete units. SpiRit-LM uses the same HuBERT model as the earlier TWIST work, producing 501 phonetic units. Consecutive repeated units are deduplicated to improve modeling quality. The HuBERT encoder has roughly 96 million parameters [1][3].
- **Pitch tokens** (Expressive only). A VQ-VAE model trained on the fundamental frequency (F0) of the input speech produces pitch tokens, using a codebook of size 64 and yielding about 12.5 pitch tokens per second. This component is small, about 0.2 million parameters [1][3].
- **Style tokens** (Expressive only). Style features are extracted with a wav2vec2-based model (about 95 million parameters) and clustered with k-means into 100 units, computed on the Expresso expressive-speech dataset [1][3].
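
The deduplication step mentioned for phonetic tokens can be illustrated with a minimal sketch. The unit IDs below are made up for illustration, and the function name `deduplicate_units` is hypothetical; this is not code from the SpiRit-LM repository.

```python
from itertools import groupby

def deduplicate_units(units):
    """Collapse each run of identical consecutive units into one unit."""
    return [unit for unit, _run in groupby(units)]

# Hypothetical frame-level unit IDs (illustrative values only).
frame_units = [17, 17, 17, 402, 402, 9, 9, 9, 9, 17]
print(deduplicate_units(frame_units))  # [17, 402, 9, 17]
```

Only adjacent repeats are merged, so a unit that recurs later in the sequence (17 in the example) is kept.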

To turn speech tokens back into audio, SpiRit-LM uses a HiFi-GAN vocoder/decoder (roughly 14 to 15 million parameters). For the Expressive model the decoder is conditioned on phonetic, pitch, and style tokens together, which is how the spoken output recovers tone and emotion [1][3].

Because of the extra token types, the two versions have slightly different vocabulary sizes: 32,512 for Base and 32,768 for Expressive [3].
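
One way to see where these sizes might come from is to add the unit counts above to Llama 2's 32,000-token text vocabulary. The sketch below does that arithmetic. The rounding multiple of 256 is an assumption chosen because it reproduces both reported sizes; the cited sources do not say how the vocabulary is padded or whether additional special tokens are included.

```python
LLAMA2_TEXT_VOCAB = 32_000  # size of the Llama 2 text tokenizer
PHONETIC_UNITS = 501        # HuBERT units (from the text above)
PITCH_CODES = 64            # VQ-VAE codebook size (Expressive only)
STYLE_UNITS = 100           # k-means style units (Expressive only)

PAD_MULTIPLE = 256          # ASSUMPTION: not stated in the cited sources

def round_up(n, multiple):
    """Round n up to the nearest multiple of `multiple`."""
    return -(-n // multiple) * multiple

base_raw = LLAMA2_TEXT_VOCAB + PHONETIC_UNITS
expressive_raw = base_raw + PITCH_CODES + STYLE_UNITS

print(base_raw, round_up(base_raw, PAD_MULTIPLE))              # 32501 32512
print(expressive_raw, round_up(expressive_raw, PAD_MULTIPLE))  # 32665 32768
```

The raw difference between the two versions is 164 tokens (64 pitch plus 100 style), while the reported sizes differ by 256, which is consistent with some form of padding.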

### Word-level interleaving

The central training idea is word-level interleaving. Using speech-text aligned corpora, the training data is built so that the token stream switches between text spans and speech spans at word boundaries, with switches triggered randomly. A single example can therefore start in text, continue in speech, and return to text, all describing one continuous utterance. A simplified illustration from the paper is a sequence like `[Text] the cat [Speech] (HuBERT units for "sat on") [Text] the mat`, where part of the sentence is written and part is encoded as speech tokens [1].

By seeing many such mixed sequences, the model learns to align the two modalities at a fine granularity. This is what lets it convert between speech and text and perform tasks in either modality without task-specific architectures: at inference time a user can prompt it with text and have it continue in speech, prompt with speech and have it continue in text, or mix the two within a single prompt [1][2].
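
The construction of an interleaved sequence can be sketched as follows. This is a simplified illustration, not the authors' data pipeline: whole words stand in for BPE tokens, the `[Hu…]` token naming, the unit IDs, the `switch_prob` parameter, and the function name `interleave` are all hypothetical, and only the `[Text]`/`[Speech]` markers come from the example above.

```python
import random

def interleave(aligned_words, switch_prob=0.3, seed=0):
    """Build one mixed stream from word-aligned (word, speech_units) pairs.

    At each word boundary the modality flips with probability `switch_prob`.
    A modality marker is emitted whenever a new span starts.
    """
    rng = random.Random(seed)
    modality = rng.choice(["text", "speech"])
    stream = []
    for i, (word, units) in enumerate(aligned_words):
        new_span = (i == 0)
        if i > 0 and rng.random() < switch_prob:
            modality = "speech" if modality == "text" else "text"
            new_span = True
        if new_span:
            stream.append("[Text]" if modality == "text" else "[Speech]")
        if modality == "text":
            stream.append(word)  # a real system would emit BPE tokens here
        else:
            stream.extend(f"[Hu{u}]" for u in units)
    return stream

# Hypothetical alignment: each word paired with made-up speech unit IDs.
aligned = [("the", [12, 88]), ("cat", [301, 45, 7]), ("sat", [9, 210]),
           ("on", [64]), ("the", [12, 88]), ("mat", [150, 33, 2])]
print(" ".join(interleave(aligned)))
```

Because switches happen only between words, every word appears in exactly one modality, and the same utterance can yield many different mixed sequences by varying the seed.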

## Base versus Expressive

| Feature | SpiRit-LM Base | SpiRit-LM Expressive |
|---|---|---|
| Base text model | Llama 2 7B | Llama 2 7B |
| Parameters | 7B | 7B |
| Speech token types | Phonetic (HuBERT) only | Phonetic + pitch + style |
| Vocabulary size | 32,512 | 32,768 |
| Captures expressivity | No | Yes (pitch and style) |
| Pitch tokens | N/A | VQ-VAE on F0, codebook 64, ~12.5/sec |
| Style tokens | N/A | wav2vec2 features, k-means 100 units |
| Audio decoder | HiFi-GAN | HiFi-GAN conditioned on phonetic, pitch, style |

The practical difference is that Base treats speech mainly as a carrier of words, while Expressive also models how something is said. On Meta's sentiment-preservation evaluation, the Expressive model keeps the emotional tone of a prompt when generating, most clearly in speech-to-speech continuation, whereas the Base model and a cascaded baseline tend to lose it [1][3].

## Training data

SpiRit-LM was trained between October and December 2023 on a combination of text-only, speech-only, and aligned speech-text datasets. The reported scale includes about 307 billion text-only tokens; roughly 458,000 hours of speech-only audio, corresponding to about 28.2 billion speech tokens; and roughly 111,000 hours of aligned speech-and-text data, corresponding to about 7.0 billion speech tokens plus 1.4 billion text tokens [3].
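
A quick back-of-the-envelope calculation puts these figures on a common scale. The sketch below uses only the numbers reported above; the derived rates and shares are not stated in the sources, and the totals describe dataset size as reported, not necessarily the number of tokens seen during training.

```python
# Figures as reported in the model card (see the paragraph above).
text_only_tokens = 307e9
speech_only_hours, speech_only_tokens = 458_000, 28.2e9
aligned_hours, aligned_speech_tokens, aligned_text_tokens = 111_000, 7.0e9, 1.4e9

def tokens_per_second(tokens, hours):
    """Average number of speech tokens per second of audio."""
    return tokens / (hours * 3600)

speech_only_rate = tokens_per_second(speech_only_tokens, speech_only_hours)
aligned_rate = tokens_per_second(aligned_speech_tokens, aligned_hours)

total_tokens = (text_only_tokens + speech_only_tokens
                + aligned_speech_tokens + aligned_text_tokens)
text_share = (text_only_tokens + aligned_text_tokens) / total_tokens

print(f"{speech_only_rate:.1f} tokens/s (speech-only)")  # 17.1
print(f"{aligned_rate:.1f} tokens/s (aligned)")          # 17.5
print(f"{total_tokens / 1e9:.1f}B tokens in total")      # 343.6
print(f"{100 * text_share:.1f}% of tokens are text")     # 89.8
```

Both speech subsets work out to roughly 17 speech tokens per second of audio, and text accounts for about nine tenths of the reported tokens.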

## Tasks and evaluation

A key claim of the paper is that SpiRit-LM can learn new tasks in a few-shot fashion across modalities, including automatic speech recognition (ASR), text-to-speech, and speech classification, without being fine-tuned for them [1][2].

Reported few-shot results for the Base model are shown below alongside a cascade baseline (for example, Whisper plus Llama 2 for ASR). As expected when a single general-purpose model competes against a strong specialized pipeline, the cascade wins on all three tasks, but SpiRit-LM performs them from within one model and without task-specific fine-tuning [1].

| Task (setting) | SpiRit-LM Base | Cascade baseline |
|---|---|---|
| ASR, LibriSpeech clean, 10-shot (WER, lower is better) | 21.9 | 3.7 |
| TTS, LibriSpeech clean, 10-shot (CER, lower is better) | 45.5 | 4.0 |
| Intent classification, 30-shot (accuracy, higher is better) | 71.9 | 89.6 |
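
For readers unfamiliar with the error metrics in the table, the following sketch shows the standard definitions of word error rate (WER) and character error rate (CER) as an edit distance divided by reference length. It assumes the table's values are percentages and does not reproduce the paper's evaluation pipeline (for instance, any text normalization or the transcription of generated speech); the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate in percent; the reference must be non-empty."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent; the reference must be non-empty."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 1))  # 16.7
print(round(cer("kitten", "sitting"), 1))                             # 50.0
```

Because insertions count as errors, both metrics can exceed 100 when the hypothesis is much longer than the reference.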

To measure expressivity, the authors introduced the Speech-Text Sentiment Preservation (STSP) benchmark, which checks whether a model keeps the sentiment of a prompt when continuing it, both within a modality and across modalities (speech to speech, speech to text, text to speech, text to text). The Expressive model scores higher than both the Base model and the cascade in every direction except text-to-text, where the three are comparable; the margin is large for speech-to-speech and small for the two cross-modal directions. The paper presents SpiRit-LM as the first language model able to preserve the sentiment of text and speech prompts both within and across modalities [1][3].

Representative STSP sentiment-preservation scores (higher is better) [1]:

| Direction | SpiRit-LM Base | SpiRit-LM Expressive | Cascade |
|---|---|---|---|
| Text to text | 0.65 | 0.63 | 0.65 |
| Speech to speech | 0.33 | 0.54 | 0.33 |
| Text to speech | 0.33 | 0.38 | 0.36 |
| Speech to text | 0.34 | 0.36 | 0.33 |
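
A score of this kind can be sketched as a simple agreement rate between the sentiment of each prompt and the sentiment detected in the model's continuation. The sketch assumes the STSP score is an accuracy over such label pairs and uses a hypothetical three-way label set (positive, negative, neutral); the sources cited here only describe the benchmark as checking whether sentiment is kept, so both points are assumptions, as is the function name.

```python
def sentiment_preservation(prompt_labels, continuation_labels):
    """Fraction of continuations whose sentiment matches the prompt's."""
    if len(prompt_labels) != len(continuation_labels):
        raise ValueError("label lists must have the same length")
    if not prompt_labels:
        raise ValueError("at least one example is required")
    matches = sum(p == c for p, c in zip(prompt_labels, continuation_labels))
    return matches / len(prompt_labels)

# Hypothetical labels: the prompt's known sentiment and the sentiment a
# classifier assigns to each generated continuation.
prompts = ["positive", "negative", "neutral", "positive"]
continuations = ["positive", "neutral", "neutral", "negative"]
print(sentiment_preservation(prompts, continuations))  # 0.5
```

Under the assumed three-way label set, a model that ignored the prompt's sentiment would score about 1/3 on balanced data, which would be consistent with the 0.33 entries in the table.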

## Release and license

Meta released SpiRit-LM as an open model on 18 October 2024, publishing the research paper, inference code, and weights for both the Base and Expressive 7B versions through the `facebookresearch/spiritlm` repository on GitHub. The release is governed by the FAIR Noncommercial Research License, which restricts use to noncommercial research [2][3][4].

The model card states that SpiRit-LM is intended for noncommercial research use in English and should not be deployed in consumer-facing applications. As with other large language models, the authors note that it can produce inaccurate, biased, or otherwise objectionable output, and that, because it is derived from Llama 2, it can generate harmful content unless paired with safety instruction-tuning similar to Llama 2-Chat [1][3].

A peer-reviewed version of the paper was subsequently published in the Transactions of the Association for Computational Linguistics (TACL) [5].

## References

1. Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussà, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux. "SpiRit-LM: Interleaved Spoken and Written Language Model." arXiv:2402.05755 (8 February 2024; revised 18 October 2024). https://arxiv.org/abs/2402.05755
2. Meta AI. "Sharing new research, models, and datasets from Meta FAIR" (FAIR news roundup announcing Meta Spirit LM, Segment Anything 2.1, Layer Skip, and others), 18 October 2024. https://ai.meta.com/blog/fair-news-segment-anything-2-1-meta-spirit-lm-layer-skip-salsa-sona/
3. facebookresearch/spiritlm. "MODEL_CARD.md" (Spirit LM model card). GitHub. https://github.com/facebookresearch/spiritlm/blob/main/MODEL_CARD.md
4. facebookresearch/spiritlm. "Inference code for the paper 'Spirit-LM Interleaved Spoken and Written Language Model'." GitHub repository. https://github.com/facebookresearch/spiritlm
5. Tu Anh Nguyen et al. "SpiRit-LM: Interleaved Spoken and Written Language Model." Transactions of the Association for Computational Linguistics (TACL). https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00728/127457/SpiRit-LM-Interleaved-Spoken-and-Written-Language

