InternLM
Last reviewed
Sources
13 citations
Review status
Source-backed
Revision
v2 · 2,019 words
InternLM (Chinese name Shusheng Puyu, 书生·浦语) is a family of open-weight large language models developed primarily by Shanghai AI Laboratory, with partners that include SenseTime, the Chinese University of Hong Kong, and Fudan University. First released in mid-2023 at 7B and 20B parameter sizes, the series now spans four generations (InternLM, InternLM2, InternLM2.5, and InternLM3) and is known for strong reasoning, long-context support (including a 1M-token variant), and Apache-2.0 licensing of both code and weights from the second generation onward. [1][2] The project is closely tied to the OpenCompass evaluation suite and to a broader "Intern" ecosystem that now extends to the scientific multimodal models Intern-S1 and Intern-S1-Pro. [2][11]
InternLM is a series of general-purpose large language models built and open-sourced by Shanghai AI Laboratory. The base and instruction-tuned weights are released openly, and from the second generation onward both the code and the weights have been published under the permissive Apache-2.0 license, which permits commercial use. [1][2] The lineup spans compact-to-mid sizes (commonly 1.8B, 7B, and 20B, plus an 8B flagship in InternLM3), and it is accompanied by multimodal and domain-specific siblings such as InternLM-XComposer (vision-language) and InternLM-Math (mathematical reasoning). [1][9]
The name Shusheng (书生, "scholar") is the umbrella brand Shanghai AI Laboratory uses for its Intern series of foundation models, and Puyu (浦语) is the language-model line within it. The first model to be described publicly was a 104B-parameter foundation model, presented in an early technical report and pre-trained on roughly 1.6T tokens of multilingual data. That 104B model was never released; instead, on 6 July 2023 the lab open-sourced a 7B derivative, InternLM-7B, providing a base model and a chat-tuned variant aimed at practical use. At launch the code was Apache-2.0 but commercial use of the weights required written permission, a restriction the lab relaxed in later updates so that the weights became free for commercial use after a registration step. [3][8]
InternLM-20B followed on 20 September 2023. It was a deliberately deeper network, 60 layers rather than the 32 to 40 layers typical of 7B to 13B models, pre-trained on over 2.3T tokens of English, Chinese, and code. Shanghai AI Laboratory positioned it against larger open models of the time, and its reported scores led the 13B-to-33B size band on several benchmarks. The 20B weights were published with terms stating that the code is Apache-2.0 while the weights are open for academic research and also allow free commercial usage. [4]
InternLM2 arrived in early 2024. The 7B and 20B models were released on 17 January 2024, with a 1.8B base and chat pair following on 31 January 2024. Each size shipped in several forms: a base model, an SFT-only chat model (often labeled internlm2-chat-*-sft), and a fully aligned chat model. A dedicated reward-model series, InternLM2-Reward in 1.8B, 7B, and 20B, was added on 19 July 2024. [1]
The accompanying InternLM2 Technical Report (arXiv:2403.17297) was submitted on 26 March 2024 with a large author list drawn from Shanghai AI Laboratory and collaborators. The abstract states that the model "outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks," and the report is unusually detailed about the engineering behind the models. [5][6] It describes the data preparation pipeline for text, code, and long-context data; pre-training procedures; and the alignment stack. For alignment, InternLM2 uses supervised fine-tuning (SFT) followed by a reinforcement-learning method the authors call Conditional Online RLHF (COOL RLHF), "a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking." [5] The architecture adopts Grouped-Query Attention (GQA) to keep the memory footprint manageable when serving long sequences. [6]
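To make the memory argument for Grouped-Query Attention concrete, the following is a minimal pure-Python sketch of the general technique, not InternLM2's implementation. Several query heads read from one shared key/value head, so fewer key and value vectors have to be cached per token. The head counts, vector sizes, and function names are toy assumptions chosen for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over cached tokens."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def grouped_query_attention(queries, kv_keys, kv_values):
    """queries: one vector per query head, for a single decoding position.
    kv_keys / kv_values: for each KV head, the cached vectors of past tokens.
    Consecutive groups of query heads share the same KV head."""
    n_q, n_kv = len(queries), len(kv_keys)
    if n_q % n_kv != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group = n_q // n_kv
    return [attend(q, kv_keys[h // group], kv_values[h // group])
            for h, q in enumerate(queries)]

# Toy setup (assumed values): 4 query heads share 2 KV heads over a 3-token cache.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
kv_keys = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
           [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.5]]]
kv_values = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
             [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]]
outputs = grouped_query_attention(queries, kv_keys, kv_values)

# Vectors held in the cache: keys plus values, per KV head, per token.
cache_vectors_gqa = 2 * len(kv_keys) * len(kv_keys[0])
# What the cache would hold if every query head kept its own keys and values.
cache_vectors_mha = 2 * len(queries) * len(kv_keys[0])
```

In this toy setup the cache holds half as many vectors as it would with one KV head per query head. The saving grows with the ratio of query heads to KV heads and applies to every cached token, which is why it matters most at long sequence lengths.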
InternLM2.5 was unveiled at the World Artificial Intelligence Conference (WAIC) in Shanghai in early July 2024 (the conference ran 4 to 6 July 2024). The 7B family (base, chat, and the 1M chat variant) was released on 3 July 2024, and the 1.8B and 20B models followed on 5 August 2024. The line kept the InternLM2 architecture and leaned on large volumes of synthetic data and an iterative "capability flywheel" to improve reasoning, with the lab citing roughly a 20 percent gain in reasoning over InternLM2 at the 20B scale, plus stronger tool use and the ability to gather and synthesize information from many web pages. [1][2][7]
InternLM3 narrowed the family to a single flagship size. InternLM3-8B-Instruct was released on 15 January 2025. Its headline claim is efficiency: it was trained on only 4 trillion high-quality tokens, which the lab says cuts training cost by more than 75 percent versus comparable models, while still beating Llama3.1-8B and Qwen2.5-7B on a range of reasoning and knowledge tasks. [1][2] InternLM3 also adds a dual-mode interface: a normal response mode for ordinary conversation and a deep thinking mode that produces a long chain-of-thought (allocating up to 8192 tokens to the reasoning trace) for harder problems. Long-context behavior is reported on the RULER benchmark across a 4K-to-128K range, with an average score of 87.9. [1][2]
| Version | First release | Sizes | Notable context | License |
|---|---|---|---|---|
| InternLM (1st gen) | 6 Jul 2023 (7B); 20 Sep 2023 (20B) | 7B, 20B | 16K (20B, via extrapolation) | Apache-2.0 code; weights free for commercial use after registration |
| InternLM2 | 17 Jan 2024 | 1.8B, 7B, 20B | 200K (needle-in-a-haystack) | Apache-2.0 |
| InternLM2.5 | 3 Jul 2024 | 1.8B, 7B, 20B | 1M (7B-Chat-1M) | Apache-2.0 |
| InternLM3 | 15 Jan 2025 | 8B | 128K (RULER) | Apache-2.0 |
A central theme of InternLM2 is long-context handling. During pre-training the model is first trained on 4K-token texts and then on high-quality 32K-token texts, and positional-encoding extrapolation extends usable context well beyond the training length. The report demonstrates the result with the "Needle-in-a-Haystack" retrieval test at 200K tokens, where the model reliably locates inserted facts. The lab also constructed 32K data during SFT and RLHF so that long-context ability survives alignment rather than degrading after instruction tuning. [5][6]
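The needle-in-a-haystack procedure can be sketched in a few lines. This is a simplified illustration, not the harness used in the report. It measures length in sentences rather than tokens, and `stub_model` is a hypothetical stand-in that only "reads" the last 2,000 characters of its prompt, so that the sweep shows both hits and misses without calling a real model. All names and the filler text are assumptions.

```python
def build_haystack(filler_sentences, needle, depth, n_sentences):
    """Return a long document with `needle` inserted at a relative depth (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    body = [filler_sentences[i % len(filler_sentences)] for i in range(n_sentences)]
    position = round(depth * n_sentences)
    body.insert(position, needle)
    return " ".join(body), position

def score_answer(answer, expected):
    """Count a retrieval as correct when the expected fact appears in the answer."""
    return expected.lower() in answer.lower()

def run_sweep(ask_model, filler_sentences, needle, question, expected, depths, lengths):
    """Query the model at each (length, depth) cell and record hit or miss."""
    results = {}
    for n in lengths:
        for d in depths:
            document, _ = build_haystack(filler_sentences, needle, d, n)
            prompt = f"{document}\n\nQuestion: {question}\nAnswer:"
            results[(n, d)] = score_answer(ask_model(prompt), expected)
    return results

FILLER = ["The sky was clear over the harbor that day."]
NEEDLE = "The secret code for the vault is 7421."
QUESTION = "What is the secret code for the vault?"
EXPECTED = "7421"

def stub_model(prompt, window=2000):
    """Hypothetical short-context model: it only sees the tail of the prompt."""
    visible = prompt[-window:]
    return "The code is 7421." if NEEDLE in visible else "I could not find it."

results = run_sweep(stub_model, FILLER, NEEDLE, QUESTION, EXPECTED,
                    depths=[0.0, 0.5, 1.0], lengths=[10, 200])
for (length, depth), hit in sorted(results.items()):
    print(f"sentences={length:4d} depth={depth:.1f} retrieved={hit}")
```

Replacing `stub_model` with a call to a real model, and sweeping lengths up to the target context size, gives the familiar length-by-depth grid. A model with robust long-context retrieval should score a hit in every cell.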
The long-context line was pushed further in the next generation. InternLM2.5 introduced a 7B chat variant, InternLM2.5-7B-Chat-1M, trained to operate over a 1M-token context. Shanghai AI Laboratory reported near-full accuracy on needle-in-a-haystack retrieval at 1M tokens and competitive results on long-document suites such as LongBench and L-Eval, while keeping the 1M model's general performance close to the standard 7B chat model. Running the full 1M context is resource-intensive: the model card notes it requires 4 x A100-80G GPUs. [1][7]
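A rough calculation suggests why a 1M-token context does not fit on a single 80 GB accelerator. The configuration below (32 layers, 8 key/value heads, head dimension 128, 16-bit values, about 7 billion parameters) is an assumed 7B-class setup for illustration and should be checked against the model's published configuration. The estimate also ignores activations and framework overhead, so it is a lower bound rather than a deployment recipe.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed configuration, not taken from the source article.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
N_TOKENS = 1_000_000                      # "1M" read as one million tokens
GPU_BYTES = 80 * 10**9                    # one 80 GB device

cache = kv_cache_bytes(N_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM)
weights = 7_000_000_000 * 2               # about 7B parameters at 2 bytes each
total_gb = (cache + weights) / 1e9
min_gpus = -(-(cache + weights) // GPU_BYTES)   # ceiling division

print(f"KV cache: {cache / 1e9:.1f} GB, weights: {weights / 1e9:.1f} GB")
print(f"total: {total_gb:.1f} GB, at least {min_gpus} x 80 GB devices")
```

Under these assumptions the key/value cache alone exceeds the capacity of one device, and weights plus cache need at least two. Activation memory during prefill and runtime overhead come on top of that, which is consistent with the model card's recommendation of four GPUs.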
From 2025 onward, Shanghai AI Laboratory extended the Intern brand beyond plain language models into scientific multimodal foundation models. Intern-S1 was open-sourced in 2025 and described in a technical report (arXiv:2508.15763, submitted 21 August 2025) as the first open-source general model with advanced scientific reasoning. It is a mixture-of-experts model with 241B total parameters and 28B activated parameters, built on a 235B MoE language backbone (Qwen3) and a 6B InternViT vision encoder, then further pre-trained on 5 trillion tokens including over 2.5 trillion from scientific domains. The abstract reports that "Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains," and the team reported it surpassing some closed-source systems on specialized scientific tasks. [11][12]
Intern-S1-Pro followed on 4 February 2026 as a much larger successor: a roughly 1-trillion-parameter MoE model with 512 experts that activates 8 experts (about 22B parameters) per token, aimed at "AI for Science" (AI4S). The lab reported it reaching gold-medal level on Olympiad mathematics and matching or beating closed-source commercial models on scientific-reasoning evaluations such as SciReasoner. [13]
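The "512 experts, 8 active per token" design can be illustrated with generic top-k gating. The source does not describe Intern-S1-Pro's actual router, so this is a sketch of the general mixture-of-experts technique under toy assumptions: the experts are simple scaling functions, the router logits are random, and all function names are hypothetical.

```python
import math
import random

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    weights = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, w in weights.items():
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, weights

rng = random.Random(0)
N_EXPERTS, TOP_K = 512, 8
# Toy experts: each one scales the input by a different constant.
experts = [(lambda x, s=i / N_EXPERTS: [s * v for v in x]) for i in range(N_EXPERTS)]
logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]

output, gate = moe_forward([1.0, 2.0], experts, logits, TOP_K)
active_fraction = TOP_K / N_EXPERTS
print(f"experts used: {sorted(gate)}")
print(f"share of experts active per token: {active_fraction:.4%}")
```

Only 8 of 512 expert networks run for a given token, which is how a model with a very large total parameter count keeps per-token compute close to that of a much smaller dense model. The share of activated parameters in a real model is typically higher than the share of activated experts, because attention layers and embeddings run for every token.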
InternLM is the language-model core of a broader Intern ecosystem. The most prominent multimodal sibling is InternLM-XComposer, a vision-language model for text-image comprehension and composition first described in arXiv:2309.15112 (2023). InternLM-XComposer2 (7B) was released on 26 January 2024, a 1.8B variant followed on 9 April 2024, and InternLM-XComposer-2.5 arrived in July 2024 with a 24K interleaved image-text context that extends to 96K. A separate multimodal series, InternVL, is developed in the same orbit (with OpenGVLab) and is often used together with InternLM language backbones. On the domain side, InternLM-Math is a bilingual model specialized for mathematical reasoning. The training, fine-tuning, and serving toolchain (including XTuner for fine-tuning and LMDeploy for inference) is maintained under the same GitHub organization. [9][10]
The tables below collect figures reported by Shanghai AI Laboratory on the Hugging Face model cards and the project repository. Evaluation setups differ between generations, so scores are best compared between models within a single table rather than across tables.
InternLM2.5-7B (base) compared with similarly sized open models:
| Benchmark | InternLM2.5-7B | Llama-3-8B | Yi-1.5-9B |
|---|---|---|---|
| MMLU (5-shot) | 71.6 | 66.4 | 71.6 |
| CMMLU (5-shot) | 79.1 | 51.0 | 74.1 |
| BBH (3-shot) | 70.1 | 59.7 | 71.1 |
| MATH (4-shot) | 34.0 | 16.4 | 31.9 |
| GSM8K (4-shot) | 74.8 | 54.3 | 74.5 |
| GPQA (0-shot) | 31.3 | 31.3 | 27.8 |
InternLM3-8B-Instruct compared with peer instruction models (MATH-500 score uses deep thinking mode):
| Benchmark | InternLM3-8B | Qwen2.5-7B | Llama3.1-8B | GPT-4o-mini |
|---|---|---|---|---|
| CMMLU | 83.1 | 75.8 | 53.9 | 66.0 |
| MMLU | 76.6 | 76.8 | 71.8 | 82.7 |
| MMLU-Pro | 57.6 | 56.2 | 48.1 | 64.1 |
| GPQA-Diamond | 37.4 | 33.3 | 24.2 | 42.9 |
| MATH-500 | 83.0 | 72.4 | 48.4 | 74.0 |
| HumanEval | 82.3 | 85.4 | 72.0 | 86.6 |
| AlpacaEval 2.0 | 51.1 | 30.3 | 25.0 | 50.7 |
For reference, the first-generation InternLM-20B reported MMLU 62.05, C-Eval 58.8, HumanEval 25.61, and MBPP 35.6, which the lab noted were the best results in the 13B-to-33B band at release. [1][2][4][7]
From InternLM2 onward, the project states plainly that "Code and model weights are licensed under Apache-2.0," which permits commercial use subject to including the license text and noting any modifications. [1][2] The first-generation models used a split arrangement common to Chinese open-weight releases of 2023: the code was Apache-2.0, while the weights were open for academic research and allowed free commercial use, initially after seeking written permission and later via a registration or application step rather than a per-use fee. Commercial users are pointed to a contact address (internlm@pjlab.org.cn) for licensing questions. The practical effect across the family is that the weights can be downloaded, fine-tuned, and deployed commercially without royalties. [1][2][4]
InternLM is tightly associated with OpenCompass, the open-source large-model evaluation platform maintained within the same Shanghai AI Laboratory ecosystem. OpenCompass provides the standardized benchmark suites (covering knowledge, reasoning, math, code, and long context) used to report InternLM scores, and the InternLM model cards reference OpenCompass-style evaluations such as MMLU, CMMLU, C-Eval, GSM8K, MATH, BBH, GPQA, and RULER. Because OpenCompass and InternLM are developed in close coordination, the project's published numbers are typically reproducible with the OpenCompass toolchain. [1][2][7]