Skywork R1V
Last reviewed: Jun 8, 2026
Sources: 11 citations
Review status: Source-backed
Revision: v1 · 1,645 words
Skywork R1V is a family of open-weight multimodal vision-language models built for chain-of-thought reasoning, developed by Skywork AI, the large-model team of the Chinese technology company Kunlun Tech (Kunlun Wanwei). The series brings the long, step-by-step "thinking" style popularized by text reasoning models such as DeepSeek-R1 to inputs that combine images and text, letting the model reason through math and science diagrams, charts, and other visual problems before answering. Three open-source generations were released over 2025: Skywork-R1V in March, Skywork-R1V2 in April, and Skywork-R1V3 in July, each a 38-billion-parameter model published with weights, inference code, and a technical report on Hugging Face and GitHub.[1][2][3] A fourth, closed-source agentic variant, Skywork-R1V4-Lite, followed in November 2025.[4]
The R1V models address a gap that opened in early 2025: text-only reasoning models like DeepSeek-R1 and OpenAI's o1 showed that long reinforcement-learned chains of thought could dramatically improve performance on math and code, but those gains did not automatically extend to problems posed as images. Skywork's stated goal was to transfer this reasoning ability into the multimodal setting efficiently, without retraining a large language backbone from scratch, so that a model could examine a figure or diagram and reason about it in the same deliberate, multi-step way.[1][5]
Each open-weight generation is a 38B-parameter model that pairs a vision encoder with a reasoning-capable language backbone, connected by a lightweight projector (or "connector") module, and is post-trained with reinforcement learning tailored to reasoning. Skywork releases these models under the permissive MIT license and reports benchmark results that position them among the strongest open multimodal reasoners; as with all vendor-reported figures, the numbers below are Skywork's own claims.[2][3][6]
Kunlun Tech, also known as Kunlun Wanwei, is a Beijing-based technology company founded in 2008 and listed on the Shenzhen Stock Exchange since 2015. Its Skywork (Tiangong) team builds a broad portfolio of open-weight and proprietary AI systems.[7] Earlier work includes Skywork-13B, a bilingual (Chinese and English) large language model pre-trained on 3.2 TB of text and code and released in 2023, and Skywork-MoE, a 146B-parameter mixture-of-experts model with roughly 22B activated parameters that was upcycled from the dense Skywork-13B checkpoints.[8]
The R1V series sits alongside Skywork's text reasoning line. In February 2025 the team released Skywork-o1, an early Chinese reasoning model, and in April 2025 it open-sourced the Skywork-OR1 (Open Reasoner 1) series, including math- and code-focused 7B and 32B models whose performance Skywork reported as approaching DeepSeek-R1 on reasoning tasks.[5][9] Where OR1 targets text, R1V extends the same reasoning philosophy to vision-language inputs.
The first model, Skywork-R1V, was open-sourced in March 2025, with an accompanying technical report ("Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought," arXiv 2504.05599) posted in April 2025.[1] It combines an InternViT-6B vision encoder (InternViT-6B-448px-V2_5) with a DeepSeek-R1-Distill-Qwen-32B language backbone, joined by a lightweight visual projector. The central idea is that reasoning capability can be transferred from the text LLM into the multimodal model by training only the projector and aligning the two modalities, rather than retraining the encoder or the backbone. Skywork combined iterative supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO) and introduced an adaptive-length chain-of-thought distillation method to generate reasoning data and control how long the model "thinks."[1][6]
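A rough parameter count illustrates why training only the projector is cheap. The sketch below is a minimal illustration, not Skywork's implementation: the encoder and backbone sizes are the nominal 6B and 32B implied by the model names, while the projector shape (a two-layer MLP mapping 3200 to 5120 dimensions), the `Component` class, and the `mlp_params` helper are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    params: int
    trainable: bool


def mlp_params(in_dim, out_dim):
    """Parameters of a two-layer MLP: in_dim -> out_dim -> out_dim, with biases."""
    return (in_dim * out_dim + out_dim) + (out_dim * out_dim + out_dim)


# Encoder and backbone sizes are nominal figures taken from the model names.
# The projector shape (3200 -> 5120 -> 5120) is a hypothetical assumption.
model = [
    Component("vision_encoder", 6_000_000_000, trainable=False),
    Component("projector", mlp_params(3200, 5120), trainable=True),
    Component("language_backbone", 32_000_000_000, trainable=False),
]

total = sum(c.params for c in model)
trainable = sum(c.params for c in model if c.trainable)
trainable_fraction = trainable / total

print(f"{trainable:,} of {total:,} parameters receive gradients "
      f"({trainable_fraction:.3%})")
```

Under these assumptions only a small fraction of one percent of the weights is updated, which is the sense in which the encoder and backbone are reused rather than retrained.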
Skywork-R1V2 was released on April 24, 2025, with the report "Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning" (arXiv 2504.16656).[2] It pairs an InternViT-6B vision encoder with the QwQ-32B reasoning backbone. The headline contribution is a hybrid reinforcement-learning recipe that jointly applies two methods: Mixed Preference Optimization (MPO), which combines a reward model (R1V-RM) with rule-based constraints such as format and factual correctness, and GRPO, which scores each candidate answer relative to the other answers sampled for the same prompt. To keep training efficient, R1V2 adds a Selective Sample Buffer (SSB) that caches high-value examples and reintroduces them to counter GRPO's "vanishing advantages" problem, in which many sampled answers receive a near-zero learning signal. Skywork reported that MPO lowered the model's hallucination rate to 8.7%, compared with 12.6% for Direct Preference Optimization and 18.4% for plain SFT.[2]
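The vanishing-advantages problem and the buffer idea can be shown with a small sketch. It assumes the common GRPO formulation in which each reward is standardized against the mean and standard deviation of its own group. The `SelectiveSampleBuffer` class, its threshold, its capacity, and its eviction rule are hypothetical illustrations and are not taken from Skywork's code.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical cache of samples whose advantage is clearly non-zero."""

    def __init__(self, threshold=1e-3, capacity=1024):
        self.threshold = threshold
        self.capacity = capacity
        self.items = []

    def add_group(self, prompt, responses, rewards):
        advantages = group_relative_advantages(rewards)
        for response, adv in zip(responses, advantages):
            if abs(adv) > self.threshold:
                self.items.append((prompt, response, adv))
        # Keep only the most recent entries once the cache is full.
        self.items = self.items[-self.capacity:]

    def sample(self, k, rng=random):
        """Draw up to k cached samples to mix back into a training batch."""
        return rng.sample(self.items, min(k, len(self.items)))


buffer = SelectiveSampleBuffer()
# Every answer is correct: all advantages are zero, so nothing is cached.
buffer.add_group("easy", ["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0])
# Mixed outcomes: each sample has a non-zero advantage and is cached.
buffer.add_group("hard", ["w", "x", "y", "z"], [1.0, 0.0, 0.0, 0.0])
replayed = buffer.sample(2, random.Random(0))
```

In this toy run the "easy" group contributes no gradient because every reward equals the group mean, while the "hard" group fills the buffer with samples that can be replayed in later batches.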
Skywork-R1V3-38B was open-sourced on July 9, 2025, with its technical report (arXiv 2507.06167) following on July 11, 2025.[3][10] Unlike its predecessors, R1V3 is built on the InternVL3-38B architecture and concentrates its gains in post-training: Skywork describes the recipe as relying mainly on a reinforcement-learning stage, preceded by a "fine-grained cold-start" supervised phase that prepares the model for RL. The team built a high-quality multimodal reasoning training set using rejection sampling, with cold-start data distilled from the previous generation. R1V3 can also adapt the length of its reasoning chain to the difficulty of the input, which avoids "overthinking" simple questions. Skywork positioned R1V3's reported MMMU result as a new open-source record that approaches human-expert level on that benchmark.[3][11]
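Rejection sampling for data construction can be sketched generically: draw several candidate reasoning traces per problem and keep only those whose final answer passes a check. The example below is a toy illustration under that assumption. The `rejection_sample` function, the arithmetic "model", and the verifier are all hypothetical stand-ins and do not reflect Skywork's actual pipeline or filtering criteria.

```python
import random


def rejection_sample(problems, generate, is_correct, k=4):
    """Keep only generated reasoning traces whose final answer verifies."""
    kept = []
    for problem in problems:
        for _ in range(k):
            trace, answer = generate(problem)
            if is_correct(problem, answer):
                kept.append({"question": problem["question"], "trace": trace})
    return kept


# Toy stand-ins: a "model" that adds two numbers and is sometimes off by one.
rng = random.Random(0)


def toy_generate(problem):
    a, b = problem["operands"]
    answer = a + b + rng.choice([0, 0, 1])
    return f"{a} + {b} = {answer}", answer


def toy_is_correct(problem, answer):
    return answer == problem["target"]


problems = [
    {"question": "2 + 3?", "operands": (2, 3), "target": 5},
    {"question": "4 + 4?", "operands": (4, 4), "target": 8},
]
dataset = rejection_sample(problems, toy_generate, toy_is_correct)
```

Every trace that survives ends in a verified answer, so the resulting set contains only reasoning chains that reached the correct result.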
In November 2025 Skywork introduced Skywork-R1V4-Lite, a smaller and more agentic successor built on the Qwen3-VL-30B-A3B-Instruct base (a 30B mixture-of-experts model with roughly 3B activated parameters). Unlike the earlier open-weight releases, R1V4-Lite was offered as a closed-source API service through the Skywork platform, adding tool-use features such as code execution and web search.[4] Because it is not open-weight, it represents a departure from the open-source ethos of R1V through R1V3.
Across the open-weight generations, Skywork's design treats the vision encoder and the language backbone as largely fixed, capable components and focuses effort on (1) aligning them through a small connector and (2) eliciting reasoning through reinforcement learning rather than expensive multimodal pre-training. This "efficient transfer" philosophy is what lets a 38B model inherit the reasoning behavior of a strong text model while gaining the ability to see.[1][6]
The reinforcement-learning methods evolved across versions: R1V used iterative SFT plus GRPO with adaptive-length CoT distillation; R1V2 introduced the MPO-plus-GRPO hybrid with the Selective Sample Buffer; and R1V3 emphasized cold-start fine-tuning followed by RL on a curated, rejection-sampled reasoning dataset. A recurring theme is controlling reasoning length so the model spends more tokens on hard visual-math problems and fewer on easy ones.[2][3]
The table below summarizes the open-weight R1V generations. Benchmark figures are Skywork's reported values and use the metrics named in each report.
| Attribute | Skywork-R1V | Skywork-R1V2 | Skywork-R1V3 |
|---|---|---|---|
| Open-source release | March 2025 | April 24, 2025 | July 9, 2025 |
| Technical report | arXiv 2504.05599 | arXiv 2504.16656 | arXiv 2507.06167 |
| Parameters | 38B | 38B | 38B |
| Vision encoder | InternViT-6B (V2_5) | InternViT-6B | from InternVL3-38B |
| Language backbone | DeepSeek-R1-Distill-Qwen-32B | QwQ-32B | from InternVL3-38B |
| Training emphasis | Iterative SFT + GRPO, adaptive CoT distillation | Hybrid RL: MPO + GRPO + SSB | Cold-start SFT + RL post-training |
| License | MIT | MIT | MIT |
| MMMU (val) | 69.0 | 73.6 | 76.0 |
| MathVista (mini) | 67.5 | 74.0 | 77.1 |
All scores below are reported by Skywork in the respective technical reports and Hugging Face model cards, and should be read as vendor claims.
For Skywork-R1V, Skywork reported 69.0 on MMMU (validation) and 67.5 on MathVista (mini), alongside strong text-math results of 72.0 on AIME 2024 and 94.0 on MATH-500, illustrating that the model retained the backbone's reasoning while gaining visual capability.[1][6]
For Skywork-R1V2, the reported figures include 73.6 on MMMU, 52.0 on MMMU-Pro, 74.0 on MathVista, 62.6 on OlympiadBench, 78.9 on AIME 2024, and 63.6 on LiveCodeBench. Skywork stated that the MMMU result outperformed several proprietary systems available at the time, such as Claude 3.5 Sonnet (70.4) and Gemini 2 Flash (70.7).[2]
For Skywork-R1V3, Skywork reported 76.0 on MMMU (validation), 77.1 on MathVista (mini), 78.5 on the MMK12 multimodal reasoning set, and 59.6 on MathVerse (vision-only). The company highlighted the 76.0 MMMU figure as a new open-source state of the art that, per its comparisons, edged out closed models including Claude 3.7 Sonnet (75.0) and GPT-4.5 (74.4) on that benchmark.[3][11] Such cross-vendor comparisons depend on evaluation conditions and should be treated with caution.
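The generation-over-generation gains implied by these figures can be computed directly. The short sketch below uses only the Skywork-reported MMMU and MathVista scores quoted in this article; the `generation_gains` helper is an illustrative name, and the results inherit the caveats that apply to vendor-reported numbers.

```python
# Skywork-reported scores from the technical reports (vendor claims).
scores = {
    "MMMU (val)": {"R1V": 69.0, "R1V2": 73.6, "R1V3": 76.0},
    "MathVista (mini)": {"R1V": 67.5, "R1V2": 74.0, "R1V3": 77.1},
}


def generation_gains(by_model):
    """Return the point gain between each consecutive pair of generations."""
    names = list(by_model)
    return {
        f"{prev}->{curr}": round(by_model[curr] - by_model[prev], 1)
        for prev, curr in zip(names, names[1:])
    }


for benchmark, by_model in scores.items():
    print(benchmark, generation_gains(by_model))
```

On both benchmarks the step from R1V to R1V2 is larger than the step from R1V2 to R1V3, although the two steps also involve different backbones and training recipes.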
Skywork R1V is one of the more visible examples of open multimodal reasoning models emerging from Chinese labs in 2025, a wave that also includes Alibaba's QvQ, OpenGVLab's InternVL line, and Moonshot's Kimi-VL. By open-sourcing 38B models under the MIT license, Skywork made visual chain-of-thought reasoning accessible to researchers and developers who could not run or fine-tune proprietary multimodal systems.[2][3] Its technical contributions, particularly the efficient transfer of text reasoning into the multimodal setting and the MPO-plus-GRPO hybrid RL recipe with the Selective Sample Buffer, were positioned by Skywork as ways to raise reasoning quality while limiting hallucination.[2]
The series also illustrates how quickly the open multimodal-reasoning field moved: across roughly four months in 2025, the reported MMMU figure for the family rose from 69.0 to 76.0, and the team shifted from a DeepSeek-R1-distilled backbone to QwQ-32B to the InternVL3 base. The November 2025 move to a closed-source, tool-using R1V4-Lite marked a turn toward agentic, productized deployment alongside the open research line.[3][4]