Large Concept Model

Updated Jul 16, 2026

A Large Concept Model (LCM) is a research approach to language modeling, introduced by Meta AI's Fundamental AI Research (FAIR) group in December 2024, that generates text one sentence at a time instead of one token at a time. Rather than predicting the next subword from a fixed vocabulary, an LCM predicts the next "concept", defined in the paper as an abstract atomic idea that, in practice, corresponds to a whole sentence. Each sentence is represented as a fixed-size vector in the SONAR sentence-embedding space, and the model predicts the next vector autoregressively before SONAR's decoder turns the predicted vectors back into text or speech ^[1].

The work was published as the paper "Large Concept Models: Language Modeling in a Sentence Representation Space" on arXiv on 12 December 2024 (revised 15 December), with accompanying training code released under an MIT license at the facebookresearch/large_concept_model repository ^[1]^[2]. The authors describe the project as a proof of feasibility for an alternative to token-level large language models, not a finished system that matches flagship LLMs ^[1].

Motivation

The paper's starting argument is that humans do not plan and write at the level of individual words. A researcher preparing a fifteen-minute talk outlines a flow of higher-level ideas; those ideas stay the same whether the talk is delivered in English or another language, and whether it is spoken or written. Standard LLMs, by contrast, are token-based and heavily English-centric, and they model the reasoning process only implicitly. The LCM authors argue that an explicit hierarchical architecture, operating on an abstract semantic representation that is independent of any particular language or modality, is better suited to producing coherent long-form output ^[1].

The architecture is related to the Joint Embedding Predictive Architecture (JEPA) proposed by Yann LeCun, which also predicts the representation of the next observation in an embedding space. The paper notes a difference in emphasis: JEPA focuses on learning a representation space in a self-supervised way, whereas the LCM focuses on accurate prediction within an existing, fixed embedding space ^[1].

The role of SONAR

The LCM does not learn its own embedding space. It operates directly on SONAR, a multilingual and multimodal fixed-size sentence embedding space released by Meta in 2023 (Duquenne, Schwenk, and Sagot). SONAR was trained as an encoder/decoder model with a fixed-size bottleneck (rather than cross-attention), combining a machine-translation objective with denoising auto-encoding and an MSE loss at the bottleneck. Its text encoder and decoder were initialized from No Language Left Behind (NLLB) weights, and the speech side was added with a teacher-student approach ^[1]^[3].

In the LCM setup, SONAR provides text input and output for 200 languages, speech input for 76 languages, and speech output in English; the authors also mention an experimental encoder for American Sign Language that is not used in the paper's experiments ^[1]. Because the encoder and decoder are frozen and never trained alongside the LCM, the same generated sequence of concepts can be decoded into different languages or modalities without rerunning the model, and a reasoning operation such as summarization can in principle be applied zero-shot to input in any supported language. SONAR's broad language coverage substantially exceeds that of contemporary LLMs; the paper reports that Llama 3 (400B) covers 8 languages in text and Gemini 47, against SONAR's 200 ^[1].

The pipeline is therefore: segment a document into sentences, encode each with SONAR into a concept, let the LCM generate a new sequence of concepts, then decode those concepts back into subwords with SONAR ^[1].
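
The four steps can be sketched as a small function that wires the stages together. This is only an illustration: `run_lcm_pipeline`, `ToyCodec`, `toy_segment`, and `toy_generate` are hypothetical names, the two-number "embedding" is a toy stand-in for a real SONAR vector, and the "model" simply repeats the last concept.

```python
from typing import Callable, List, Sequence

Concept = List[float]  # toy stand-in for a fixed-size sentence embedding


def run_lcm_pipeline(
    document: str,
    segment: Callable[[str], List[str]],
    encode: Callable[[str], Concept],
    generate: Callable[[Sequence[Concept]], List[Concept]],
    decode: Callable[[Concept], str],
) -> List[str]:
    """Segment, encode each sentence, generate new concepts, decode them."""
    sentences = segment(document)
    context = [encode(s) for s in sentences]
    new_concepts = generate(context)
    return [decode(c) for c in new_concepts]


def toy_segment(text: str) -> List[str]:
    # Naive splitter on full stops; a real system would use a trained segmenter.
    return [s.strip() + "." for s in text.split(".") if s.strip()]


class ToyCodec:
    """Hypothetical stand-in for a frozen sentence encoder/decoder pair."""

    def __init__(self) -> None:
        self.sentences: List[str] = []

    def _vector(self, index: int) -> Concept:
        return [float(index), float(len(self.sentences[index]))]

    def encode(self, sentence: str) -> Concept:
        if sentence not in self.sentences:
            self.sentences.append(sentence)
        return self._vector(self.sentences.index(sentence))

    def decode(self, concept: Concept) -> str:
        # Return the stored sentence whose toy vector is closest to the concept.
        def sq_dist(i: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(self._vector(i), concept))

        return self.sentences[min(range(len(self.sentences)), key=sq_dist)]


def toy_generate(context: Sequence[Concept]) -> List[Concept]:
    # Placeholder "model": repeats the final context concept.
    return [list(context[-1])]


codec = ToyCodec()
out = run_lcm_pipeline(
    "Cats sleep a lot. Dogs like walks.",
    toy_segment, codec.encode, toy_generate, codec.decode,
)
print(out)  # ['Dogs like walks.']
```

The point of the structure is that `generate` only ever sees and returns vectors; swapping the decoder for one that targets another language or modality would not require touching it.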

Data preparation

Converting raw corpora into concept sequences requires reliable sentence segmentation. The authors compared the SpaCy segmenter against Segment any Text (SaT), each with and without a maximum sentence-length cap, and measured reconstruction quality with an AutoBLEU score (the BLEU between a segment and the text decoded from its SONAR embedding). They found that very long sentences degrade SONAR encoding and that capping sentence length at roughly 200 characters helps; LCM training data was prepared with SaT Capped ^[1].
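
A length cap of this kind can be sketched as a post-processing step on segmenter output. The splitting rule below (break after commas, semicolons, or colons, then fall back to whitespace) is an assumption made for illustration, and `cap_sentence` is a hypothetical helper, not the procedure used in the paper.

```python
import re
from typing import List

MAX_CHARS = 200  # approximate cap mentioned in the text


def cap_sentence(sentence: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split an over-long sentence into pieces that fit under max_chars.

    Illustrative rule: pack whole clauses first, then pack word by word.
    A single word longer than the cap is kept whole.
    """
    if len(sentence) <= max_chars:
        return [sentence]
    clauses = re.split(r"(?<=[,;:])\s+", sentence)
    pieces: List[str] = []
    current = ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            pieces.append(current)
        current = ""
        # The clause did not fit next to the previous text: pack it by words.
        for word in clause.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chars and current:
                pieces.append(current)
                current = word
            else:
                current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Each resulting piece would then be encoded as its own concept, trading a little syntactic integrity for embeddings that reconstruct more faithfully.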

Model variants

A core difficulty is that the next concept must be a continuous vector, and many semantically different continuations can be plausible. The paper explores several ways to model this distribution, all evaluated at 1.6B parameters with training data on the order of 1.3 trillion tokens ^[1].
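
A toy calculation shows why multiple plausible continuations are a problem for a point estimate: under squared error, the best single prediction is the mean of the candidates, which may match none of them. The two-dimensional vectors below are invented for illustration and have nothing to do with real SONAR embeddings.

```python
from typing import List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred: Vector, targets: List[Vector]) -> float:
    """Average squared error of one prediction over equally likely targets."""
    return sum(mse(pred, t) for t in targets) / len(targets)


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Two equally plausible but very different next-sentence embeddings (toy values).
continuations = [[1.0, 0.0], [-1.0, 0.0]]
center = mean_vector(continuations)

print(center)                                        # [0.0, 0.0]
print(expected_mse(center, continuations))           # 0.5
print(expected_mse(continuations[0], continuations)) # 1.0
```

The averaged vector scores better than committing to either real continuation, even though it corresponds to neither, which is the motivation for modeling a distribution over next concepts rather than a single regression target.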

Variant	Generation method	Notes
Base-LCM	MSE regression	Decoder-only transformer with a PreNet (normalizes and maps SONAR vectors into the hidden dimension) and a PostNet (maps back). Trained to regress the next embedding under mean-squared-error loss. Tends to average over plausible continuations, which hurts quality.
One-Tower diffusion	Diffusion	A single transformer backbone iteratively denoises a noisy next-concept embedding, conditioned on the clean preceding concepts. Trained efficiently by interleaving clean and noisy embeddings with a causal attention mask.
Two-Tower diffusion	Diffusion	Separates a "contextualizer" (a causal decoder-only transformer that encodes preceding context) from a "denoiser" (a stack of transformer blocks with cross-attention that refines the noisy next concept). Layers use adaptive layer norm (AdaLN) conditioned on the diffusion timestep.
Quant-LCM	Quantized SONAR	Quantizes SONAR embeddings into discrete units using residual vector quantization (RVQ), then predicts the next quantized concept. Two flavors: Quant-LCM-c and Quant-LCM-d.
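
Residual vector quantization, mentioned for Quant-LCM, can be illustrated in a few lines: each stage picks the nearest codeword to whatever the previous stages left unexplained. The codebooks here are tiny and hand-written purely for illustration (real ones are learned from data), and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Index of the codeword with the smallest squared distance to vec."""
    def sq_dist(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(codebook[i], vec))

    return min(range(len(codebook)), key=sq_dist)


def rvq_encode(vec: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantize vec stage by stage, each stage coding the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hand-written toy codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.2, 0.9], CODEBOOKS)
print(codes)                         # [1, 1]
print(rvq_decode(codes, CODEBOOKS))  # [1.25, 1.0]
```

Each added stage shrinks the reconstruction error, so a short list of integer codes can stand in for a continuous embedding and be predicted with ordinary discrete-sequence machinery.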

The diffusion model variants use a variance-preserving forward noising process and predict the clean state directly (x0-prediction). The paper studies several noise schedules (cosine, quadratic, sigmoid), classifier-free guidance, and inference settings such as the guidance scale and initial noise level. To enable variable-length generation, training documents are suffixed with the sentence "End of text.", and inference stops when a generated embedding is close enough to that end marker or to the previous output ^[1].
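
The two mechanisms in this paragraph can be sketched with the standard library. The schedule below is a simplified cosine form chosen so that the squared signal and noise weights sum to one (the variance-preserving property); the use of cosine similarity as the closeness measure and the 0.9 thresholds are assumptions for illustration, as are all function names.

```python
import math
import random
from typing import List, Optional, Tuple

Vector = List[float]


def cosine_alpha_sigma(t: float) -> Tuple[float, float]:
    """Simplified cosine schedule for t in [0, 1]; alpha**2 + sigma**2 == 1."""
    angle = 0.5 * math.pi * t
    return math.cos(angle), math.sin(angle)


def vp_noise(x0: Vector, t: float, rng: random.Random) -> Vector:
    """Forward process: x_t = alpha_t * x0 + sigma_t * eps, eps ~ N(0, I)."""
    alpha, sigma = cosine_alpha_sigma(t)
    return [alpha * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


def cosine_similarity(a: Vector, b: Vector) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def should_stop(
    generated: Vector,
    eot_embedding: Vector,
    previous: Optional[Vector],
    eot_threshold: float = 0.9,     # assumed value
    repeat_threshold: float = 0.9,  # assumed value
) -> bool:
    """Stop when the output resembles the end marker or repeats the last one."""
    near_eot = cosine_similarity(generated, eot_embedding) >= eot_threshold
    repeats = (
        previous is not None
        and cosine_similarity(generated, previous) >= repeat_threshold
    )
    return near_eot or repeats


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
slightly_noisy = vp_noise(clean, 0.1, rng)  # mostly signal
mostly_noise = vp_noise(clean, 0.9, rng)    # mostly noise
```

With x0-prediction, a denoiser would be trained to map a noised vector such as `mostly_noise` (plus the timestep and context) straight back to `clean`, rather than to predict the noise that was added.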

In the 1.6B ablations, the diffusion-based variants clearly outperformed Base-LCM and Quant-LCM on next-sentence-prediction metrics such as mutual information and contrastive accuracy. On an instruction-tuned story-generation test, the diffusion variants scored well on coherence, though a 1.4B Llama-style baseline ("smaLlama") edged them on ROUGE-L and coherence, which the authors attribute to LLMs producing more fluent text ^[1]:

Model (1.6B class)	ROUGE-L	Coherence
Base-LCM	23.69	0.482
One-Tower	33.40	0.968
Two-Tower	33.64	0.938
Quant-LCM-c	30.87	0.847
Quant-LCM-d	28.01	0.704
smaLlama (1.4B)	34.88	0.984

Scaling to 7B parameters

Because the diffusion variants performed best, the authors scaled the Two-Tower design to 7B parameters, choosing it over One-Tower for its smaller memory footprint on long contexts. The 7B model uses a 5-layer contextualizer and a 14-layer denoiser with a hidden dimension of 4096 and 32 attention heads. It was pre-trained on 2.3B documents (about 2.7 trillion tokens, or 142.4B sentences/concepts) on 256 NVIDIA A100 GPUs, with the context extended to 2048 concepts. The pre-trained model is called Two-Tower-7B, and the instruction-tuned version Two-Tower-7B-IT ^[1].
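
Some rough ratios follow from the corpus figures quoted above. The calculation below is back-of-envelope arithmetic on those numbers only; the "token-equivalent" context length in particular is an approximation that assumes average-length sentences.

```python
tokens = 2.7e12       # about 2.7 trillion tokens
sentences = 142.4e9   # 142.4B sentences/concepts
documents = 2.3e9     # 2.3B documents
context_concepts = 2048

tokens_per_sentence = tokens / sentences        # roughly 19
sentences_per_document = sentences / documents  # roughly 62
context_in_tokens = context_concepts * tokens_per_sentence  # roughly 39,000

print(round(tokens_per_sentence, 1))
print(round(sentences_per_document, 1))
print(round(context_in_tokens))
```

On these averages, one concept replaces about nineteen tokens, so a 2048-concept context covers text that a token-level model would need a window in the tens of thousands of tokens to hold.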

Tasks evaluated

Evaluation focused on generative tasks, since long-form generation is the main challenge for the approach. The team used multiple complementary automatic metrics: ROUGE-L, source-overlap (OVL-3), repetition (REP-4), a fluency classifier score (CoLA), and the SEAHORSE-based attribution and coverage scores (SH-4 and SH-5) ^[1].
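
For orientation, three of these metrics can be approximated in a few lines. This is a simplified sketch: it uses lowercase whitespace tokenization, returns values on a 0 to 1 scale rather than the 0 to 100 scale used in the tables, and the n-gram definitions of overlap and repetition are assumptions about what OVL-3 and REP-4 measure rather than the paper's exact formulas.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def source_overlap(summary: str, source: str, n: int = 3) -> float:
    """Share of the summary's n-grams that also appear in the source."""
    grams = ngrams(summary.lower().split(), n)
    if not grams:
        return 0.0
    source_grams = set(ngrams(source.lower().split(), n))
    return sum(g in source_grams for g in grams) / len(grams)


def repetition(text: str, n: int = 4) -> float:
    """Share of n-grams in the text that duplicate another n-gram."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    return 1 - len(set(grams)) / len(grams)
```

A low overlap score indicates a more abstractive summary, and a low repetition score indicates less looping, which is how the two measures are read in the results that follow.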

Summarization

On CNN DailyMail and XSum, Two-Tower-7B-IT produced ROUGE-L scores competitive with instruction-tuned LLMs of similar size. On CNN DailyMail it reached 36.47, behind the specifically fine-tuned T5-3B (37.56) and slightly ahead of Mistral-7B-v0.3-IT (36.06), while on XSum it scored 23.71, the highest ROUGE-L among the compared models. The LCM's summaries were more abstractive (lower OVL-3) and contained fewer repetitions than those of the LLMs, but they were less fluent by the CoLA metric ^[1]:

Model	CNN DailyMail R-L	XSum R-L
T5-3B	37.56	17.11
Gemma-7B-IT	31.14	18.20
Mistral-7B-v0.3-IT	36.06	21.22
Llama-3.1-8B-IT	34.97	20.35
Two-Tower-7B-IT	36.47	23.71

On the long-context LCFO benchmark (documents of roughly 5,000 words, summarized to 20%, 10%, and 5% of their length), the LCM outperformed Mistral-7B-v0.3-IT and Gemma-7B-IT on ROUGE-L for the 5% and 10% settings, despite having seen relatively few long documents in training ^[1].

Summary expansion

The paper also introduces summary expansion: given a short summary, generate a longer, coherent document. The goal is not to recreate the original text but to extend the input meaningfully. Here the LLMs scored higher ROUGE-L (they tend to reproduce source wording), while the LCM generated more divergent text with lower fluency ^[1].

Zero-shot multilingual generalization

A central claim is zero-shot cross-lingual ability. Although the paper's LCM was trained on English text only, the model can be applied to other languages through SONAR. On the multilingual XLSum benchmark, the authors report ROUGE-L for 42 languages (three were excluded as unsupported by SONAR). Two-Tower-7B-IT outperformed Llama-3.1-8B-IT on English (23.5 versus 20.7) and, averaged over the six languages officially supported by both models, scored 20.2 versus 19.7. It also generalized well to low-resource languages it had never seen in training, including Southern Pashto, Burmese, Hausa, and Welsh (all above 20 ROUGE-L), and it reached 30.4 on Vietnamese ^[1].

Planning extension

Beyond sentence-level concepts, the paper sketches a higher level of abstraction for explicit planning. A "planning model" produces a high-level description of what should come next (potentially spanning a paragraph), and the LCM is conditioned on that plan. A simplified single-model version, trained to also predict paragraph-break concepts and plan concepts, is called a Large Planning Concept Model (LPCM) ^[1].

Limitations

The authors are explicit that the LCM is a feasibility study with a long path to flagship-LLM performance. The limitations they discuss include ^[1]:

The embedding space was not designed for this task. SONAR was trained on bitext machine-translation data with relatively short sentences, so it maintains good local geometry but offers no guarantee of the well-behaved global geometry that next-sentence prediction requires.
Fragility. Small perturbations to a SONAR vector can produce a large loss of meaning after decoding. Texts containing links, references, numbers, or code are particularly fragile, and a mismatch between SONAR's training data and typical LLM corpora can compromise factuality.
Continuous versus discrete. Sentences remain discrete combinatorial objects even when represented as continuous vectors, which makes diffusion modeling harder than for images. Continuous modeling also forgoes the contrastive softmax objective that helps token models on tasks needing precise answers.
Concept granularity and data sparsity. Most sentences in a corpus are unique, so next-sentence prediction faces a very wide space of valid continuations, and a fixed-size embedding for a long sentence is a coarse unit. The authors note that finer or alternative segmentation, and better concept representations than SONAR, are major directions for future work.

Authorship and reception

The paper is credited to the "LCM team" at FAIR at Meta. Listed core contributors (alphabetical) include Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, and Artyom Kozhevnikov, with a broader author list of Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, and Holger Schwenk ^[1]. Several authors overlap with the SONAR and NLLB projects.

The release drew wide coverage as an attempt to move language modeling "beyond tokens", with commentators framing it as an exploration of what a successor to token-based LLMs might look like rather than a drop-in replacement ^[4]^[5]. Subsequent academic work, such as SONAR-LLM (2025), built on the same idea of reasoning in a sentence-embedding space ^[6].

References

1. Barrault, L., Duquenne, P.-A., Elbayad, M., Kozhevnikov, A., et al. (LCM team, FAIR at Meta). "Large Concept Models: Language Modeling in a Sentence Representation Space." arXiv:2412.08821, December 2024. https://arxiv.org/abs/2412.08821
2. Meta AI / Facebook Research. "large_concept_model" GitHub repository. https://github.com/facebookresearch/large_concept_model
3. Duquenne, P.-A., Schwenk, H., Sagot, B. "SONAR: Sentence-Level Multimodal and Language-Agnostic Representations." arXiv:2308.11466, 2023. https://arxiv.org/abs/2308.11466
4. "Large Concept Models (LCMs) by Meta: The Era of AI After LLMs?" AI Papers Academy. https://aipapersacademy.com/large-concept-models/
5. "Large Concept Models: A Guide With Examples." DataCamp. https://www.datacamp.com/blog/large-concept-models
6. "SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens." arXiv:2508.05305, 2025. https://arxiv.org/abs/2508.05305
