Large Concept Model
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 2,019 words
A Large Concept Model (LCM) is a research approach to language modeling, introduced by Meta AI's Fundamental AI Research (FAIR) group in December 2024, that generates text one sentence at a time instead of one token at a time. Rather than predicting the next subword from a fixed vocabulary, an LCM predicts the next "concept", defined in the paper as an abstract atomic idea that, in practice, corresponds to a whole sentence. Each sentence is represented as a fixed-size vector in the SONAR sentence-embedding space, and the model predicts the next vector autoregressively before SONAR's decoder turns the predicted vectors back into text or speech [1].
The work was published as the paper "Large Concept Models: Language Modeling in a Sentence Representation Space" on arXiv on 12 December 2024 (revised 15 December), with accompanying training code released under an MIT license at the facebookresearch/large_concept_model repository [1][2]. The authors describe the project as a proof of feasibility for an alternative to token-level large language models, not a finished system that matches flagship LLMs [1].
The paper's starting argument is that humans do not plan and write at the level of individual words. A researcher preparing a fifteen-minute talk outlines a flow of higher-level ideas; those ideas stay the same whether the talk is delivered in English or another language, and whether it is spoken or written. Standard LLMs, by contrast, are token-based and heavily English-centric, and they model the reasoning process only implicitly. The LCM authors argue that an explicit hierarchical architecture, operating on an abstract semantic representation that is independent of any particular language or modality, is better suited to producing coherent long-form output [1].
The architecture is related to the Joint Embedding Predictive Architecture (JEPA) proposed by Yann LeCun, which also predicts the representation of the next observation in an embedding space. The paper notes a difference in emphasis: JEPA focuses on learning a representation space in a self-supervised way, whereas the LCM focuses on accurate prediction within an existing, fixed embedding space [1].
The LCM does not learn its own embedding space. It operates directly on SONAR, a multilingual and multimodal fixed-size sentence embedding space released by Meta in 2023 (Duquenne, Schwenk, and Sagot). SONAR was trained as an encoder/decoder model with a fixed-size bottleneck (rather than cross-attention), combining a machine-translation objective with denoising auto-encoding and an MSE loss at the bottleneck. Its text encoder and decoder were initialized from No Language Left Behind (NLLB) weights, and the speech side was added with a teacher-student approach [1][3].
In the LCM setup, SONAR provides text input and output for 200 languages, speech input for 76 languages, and speech output in English; the authors also mention an experimental encoder for American Sign Language that is not used in the paper's experiments [1]. Because the encoder and decoder are frozen and never trained alongside the LCM, the same generated sequence of concepts can be decoded into different languages or modalities without rerunning the model, and a reasoning operation such as summarization can in principle be applied zero-shot to input in any supported language. SONAR's broad language coverage substantially exceeds that of contemporary LLMs; the paper reports that Llama 3 (400B) covers 8 languages in text and Gemini 47, against SONAR's 200 [1].
The pipeline is therefore: segment a document into sentences, encode each with SONAR into a concept, let the LCM generate a new sequence of concepts, then decode those concepts back into subwords with SONAR [1].
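The four stages above can be sketched as plain function composition. This is a minimal illustration only, not the released code: `segment`, `generate_concepts`, and `run_pipeline` are hypothetical names, and the `encode`, `predict_next`, `is_end`, and `decode` callables are placeholders for the frozen SONAR encoder, the LCM, a stopping rule, and the frozen SONAR decoder.

```python
import re
from typing import Callable, List, Sequence

Vector = List[float]


def segment(document: str) -> List[str]:
    """Naive splitter on sentence-final punctuation (stand-in for a real segmenter)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def generate_concepts(
    context: Sequence[Vector],
    predict_next: Callable[[Sequence[Vector]], Vector],
    is_end: Callable[[Vector], bool],
    max_new: int = 8,
) -> List[Vector]:
    """Autoregressive loop: each predicted concept is appended to the context."""
    concepts = list(context)
    generated: List[Vector] = []
    for _ in range(max_new):
        nxt = predict_next(concepts)
        if is_end(nxt):
            break
        concepts.append(nxt)
        generated.append(nxt)
    return generated


def run_pipeline(
    document: str,
    encode: Callable[[str], Vector],
    predict_next: Callable[[Sequence[Vector]], Vector],
    is_end: Callable[[Vector], bool],
    decode: Callable[[Vector], str],
) -> List[str]:
    """Segment, encode, generate in concept space, then decode back to text."""
    context = [encode(sentence) for sentence in segment(document)]
    new_concepts = generate_concepts(context, predict_next, is_end)
    return [decode(concept) for concept in new_concepts]
```

Because `decode` is only applied at the end, swapping it for a decoder in another language or modality would leave the generation loop untouched, which is the property the frozen SONAR decoder is meant to provide.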
Converting raw corpora into concept sequences requires reliable sentence segmentation. The authors compared the SpaCy segmenter against Segment any Text (SaT), each with and without a maximum sentence-length cap, and measured reconstruction quality with an AutoBLEU score (the BLEU between a segment and the text decoded from its SONAR embedding). They found that very long sentences degrade SONAR encoding and that capping sentence length at roughly 200 characters helps; LCM training data was prepared with SaT Capped [1].
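As a rough illustration of length capping, the sketch below greedily splits an over-long segment at word boundaries so that each piece stays within a character budget. The function name `cap_sentence` and the greedy word-boundary rule are assumptions made for this example; the paper's capped segmenters may split differently.

```python
from typing import List


def cap_sentence(sentence: str, max_chars: int = 200) -> List[str]:
    """Split a segment into chunks of at most max_chars, breaking at spaces.

    A single word longer than max_chars is kept whole as its own chunk,
    so the cap is best-effort for such tokens.
    """
    chunks: List[str] = []
    current = ""
    for word in sentence.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be encoded as its own concept, trading slightly more concepts per document for embeddings that SONAR can reconstruct more faithfully.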
A core difficulty is that the next concept must be a continuous vector, and many semantically different continuations can be plausible. The paper explores several ways to model this distribution, all evaluated at 1.6B parameters with training data on the order of 1.3 trillion tokens [1].
| Variant | Generation method | Notes |
|---|---|---|
| Base-LCM | MSE regression | Decoder-only transformer with a PreNet (normalizes and maps SONAR vectors into the hidden dimension) and a PostNet (maps back). Trained to regress the next embedding under mean-squared-error loss. Tends to average over plausible continuations, which hurts quality. |
| One-Tower diffusion | Diffusion | A single transformer backbone iteratively denoises a noisy next-concept embedding, conditioned on the clean preceding concepts. Trained efficiently by interleaving clean and noisy embeddings with a causal attention mask. |
| Two-Tower diffusion | Diffusion | Separates a "contextualizer" (a causal decoder-only transformer that encodes preceding context) from a "denoiser" (a stack of transformer blocks with cross-attention that refines the noisy next concept). Layers use adaptive layer norm (AdaLN) conditioned on the diffusion timestep. |
| Quant-LCM | Quantized SONAR | Quantizes SONAR embeddings into discrete units using residual vector quantization (RVQ), then predicts the next quantized concept. Two flavors: Quant-LCM-c and Quant-LCM-d. |
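To make the residual vector quantization step behind Quant-LCM concrete, here is a minimal pure-Python sketch. The codebooks are toy values chosen by hand and the function names are hypothetical; in practice the codebooks would be learned from SONAR embeddings and be far larger.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, vector: Vector) -> int:
    """Index of the codebook entry closest to vector in squared Euclidean distance."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize stage by stage: each codebook encodes the residual left by the previous one."""
    residual = list(vector)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct an approximation by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy two-stage example: a coarse codebook followed by a finer one.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes = rvq_encode([1.2, 0.8], toy_codebooks)
approx = rvq_decode(codes, toy_codebooks)
```

The sequence of indices in `codes` is the discrete unit a quantized model would predict, and each additional stage narrows the gap between `approx` and the original vector.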
The diffusion model variants use a variance-preserving forward noising process and predict the clean state directly (x0-prediction). The paper studies several noise schedules (cosine, quadratic, sigmoid), classifier-free guidance, and inference settings such as the guidance scale and initial noise level. To enable variable-length generation, training documents are suffixed with the sentence "End of text.", and inference stops when a generated embedding is close enough to that end marker or to the previous output [1].
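The sketch below illustrates two of the ingredients just described: a variance-preserving forward step under a cosine schedule, and a similarity-based stopping test. It rests on several assumptions: the schedule uses the common offset form with `s = 0.008`, closeness is measured by cosine similarity, and the thresholds and function names are illustrative placeholders, not values or APIs from the paper.

```python
import math
import random
from typing import List, Optional

Vector = List[float]


def alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level at time t in [0, 1] for a cosine schedule (assumed form)."""
    def f(u: float) -> float:
        return math.cos((u + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0.0)


def add_noise(x0: Vector, t: float, rng: random.Random) -> Vector:
    """Variance-preserving step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, I)."""
    a = alpha_bar(t)
    noise_scale = math.sqrt(max(0.0, 1.0 - a))
    return [math.sqrt(a) * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_stop(
    new: Vector,
    end_of_text: Vector,
    previous: Optional[Vector],
    eot_threshold: float = 0.9,
    repeat_threshold: float = 0.9,
) -> bool:
    """Stop when the new concept is near the end marker or nearly repeats the last output."""
    if cosine(new, end_of_text) >= eot_threshold:
        return True
    return previous is not None and cosine(new, previous) >= repeat_threshold
```

Because the squared coefficients sum to one, a unit-variance input keeps unit variance at every noise level, which is what "variance-preserving" refers to. With x0-prediction, the denoiser would be trained to output `x0` directly from the noisy `x_t` and the timestep.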
In the 1.6B ablations, the diffusion-based variants clearly outperformed Base-LCM and Quant-LCM on next-sentence-prediction metrics such as mutual information and contrastive accuracy. On an instruction-tuned story-generation test, the diffusion variants scored well on coherence, though a 1.4B Llama-style baseline ("smaLlama") edged them on ROUGE-L and coherence, which the authors attribute to LLMs producing more fluent text [1]:
| Model (1.6B class) | ROUGE-L | Coherence |
|---|---|---|
| Base-LCM | 23.69 | 0.482 |
| One-Tower | 33.40 | 0.968 |
| Two-Tower | 33.64 | 0.938 |
| Quant-LCM-c | 30.87 | 0.847 |
| Quant-LCM-d | 28.01 | 0.704 |
| smaLlama (1.4B) | 34.88 | 0.984 |
Because the diffusion variants performed best, the authors scaled the Two-Tower design to 7B parameters, choosing it over One-Tower for its smaller memory footprint on long contexts. The 7B model uses a 5-layer contextualizer and a 14-layer denoiser with a hidden dimension of 4096 and 32 attention heads. It was pre-trained on 2.3B documents (about 2.7 trillion tokens, or 142.4B sentences/concepts) on 256 NVIDIA A100 GPUs, with the context extended to 2048 concepts. The pre-trained model is called Two-Tower-7B, and the instruction-tuned version Two-Tower-7B-IT [1].
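A back-of-the-envelope calculation from the figures above gives a feel for what a concept-level context means in token terms. The derived ratios are approximations computed from the rounded numbers quoted in this article, not values reported by the paper.

```python
# Approximate pre-training figures quoted above.
TOKENS = 2.7e12          # about 2.7 trillion tokens
CONCEPTS = 142.4e9       # 142.4B sentences/concepts
DOCUMENTS = 2.3e9        # 2.3B documents
CONTEXT_CONCEPTS = 2048  # context length in concepts

tokens_per_concept = TOKENS / CONCEPTS
concepts_per_document = CONCEPTS / DOCUMENTS
context_in_tokens = CONTEXT_CONCEPTS * tokens_per_concept

print(f"{tokens_per_concept:.1f} tokens per concept")
print(f"{concepts_per_document:.1f} concepts per document")
print(f"{context_in_tokens:,.0f} tokens covered by a 2048-concept context")
```

On these numbers a sentence averages roughly 19 tokens, so a 2048-concept context corresponds to a token sequence on the order of tens of thousands of tokens, which helps explain why memory footprint on long contexts mattered when choosing between the two diffusion designs.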
Evaluation focused on generative tasks, since long-form generation is the main challenge for the approach. The team used multiple complementary automatic metrics: ROUGE-L, source-overlap (OVL-3), repetition (REP-4), a fluency classifier score (CoLA), and the SEAHORSE-based attribution and coverage scores (SH-4 and SH-5) [1].
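The two n-gram metrics can be approximated in a few lines. This sketch assumes OVL-3 is the share of output 3-grams that also occur in the source and REP-4 is the share of output 4-grams that repeat an earlier one; it uses lowercase whitespace tokenization, and the function names are hypothetical, so scores will not match the paper's exactly.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams; empty if the text has fewer than n tokens."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_rate(summary: str, source: str, n: int = 3) -> float:
    """Fraction of summary n-grams found in the source (higher means more extractive)."""
    summary_grams = ngrams(summary, n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source, n))
    return sum(g in source_grams for g in summary_grams) / len(summary_grams)


def repetition_rate(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences that duplicate an earlier n-gram in the same text."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(grams)
```

For example, `overlap_rate("the cat sat down", "the cat sat on the mat")` is 0.5, since one of the two summary 3-grams appears in the source. A lower overlap rate indicates a more abstractive summary.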
On CNN DailyMail and XSum, Two-Tower-7B-IT produced ROUGE-L scores competitive with instruction-tuned LLMs of similar size. On CNN DailyMail it reached 36.47, behind the specifically fine-tuned T5-3B (37.56) and slightly ahead of Mistral-7B-v0.3-IT (36.06), while on XSum it scored 23.71, the highest ROUGE-L among the compared models. Compared with the LLMs, the LCM produced more abstractive summaries (lower OVL-3) with fewer repetitions, but less fluent text as measured by CoLA [1]:
| Model | CNN DailyMail R-L | XSum R-L |
|---|---|---|
| T5-3B | 37.56 | 17.11 |
| Gemma-7B-IT | 31.14 | 18.20 |
| Mistral-7B-v0.3-IT | 36.06 | 21.22 |
| Llama-3.1-8B-IT | 34.97 | 20.35 |
| Two-Tower-7B-IT | 36.47 | 23.71 |
On the long-context LCFO benchmark (documents of roughly 5,000 words, summarized to 20%, 10%, and 5% of their length), the LCM outperformed Mistral-7B-v0.3-IT and Gemma-7B-IT on ROUGE-L for the 5% and 10% settings, despite having seen relatively few long documents in training [1].
The paper also introduces summary expansion: given a short summary, generate a longer, coherent document. The goal is not to recreate the original text but to extend the input meaningfully. Here the LLMs scored higher ROUGE-L (they tend to reproduce source wording), while the LCM generated more divergent text with lower fluency [1].
A central claim is zero-shot cross-lingual ability. Although the paper's LCMs were trained on English text only, the model can be applied to other languages through SONAR. On the multilingual XLSum benchmark, the authors report ROUGE-L for 42 languages (three were excluded as unsupported by SONAR). Two-Tower-7B-IT outperformed Llama-3.1-8B-IT on English (23.5 versus 20.7) and, averaged over the six languages officially supported by both models, scored 20.2 versus 19.7. It also generalized well to low-resource languages it had never seen, including Southern Pashto, Burmese, Hausa, and Welsh (all above 20 ROUGE-L), and reached 30.4 on Vietnamese [1].
Beyond sentence-level concepts, the paper sketches a higher level of abstraction for explicit planning. A "planning model" produces a high-level description of what should come next (potentially spanning a paragraph), and the LCM is conditioned on that plan. A simplified single-model version, trained to also predict paragraph-break concepts and plan concepts, is called a Large Planning Concept Model (LPCM) [1].
The authors are explicit that the LCM is a feasibility study with a long path to flagship-LLM performance. The limitations they discuss include [1]:
The paper is credited to the "LCM team" at FAIR at Meta. Listed core contributors (alphabetical) include Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, and Artyom Kozhevnikov, with a broader author list of Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, and Holger Schwenk [1]. Several authors overlap with the SONAR and NLLB projects.
The release drew wide coverage as an attempt to move language modeling "beyond tokens", with commentators framing it as an exploration of what a successor to token-based LLMs might look like rather than a drop-in replacement [4][5]. Subsequent academic work, such as SONAR-LLM (2025), built on the same idea of reasoning in a sentence-embedding space [6].