Skywork-R1V
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,965 words
Skywork-R1V is an open-weight family of multimodal reasoning models released by Skywork AI, the AGI and AIGC division of Beijing Kunlun Tech Co., Ltd. (Kunlun Wanwei).[^1][^2] First announced on March 18, 2025, the initial Skywork-R1V-38B extended the chain-of-thought reasoning paradigm of DeepSeek-R1 from text into vision by coupling an InternViT vision encoder with a DeepSeek-R1-Distill language backbone through a lightweight projector.[^1][^3] Skywork promoted the model as the industry's first open-source multimodal reasoning model with visual chain-of-thought capabilities.[^2][^4] The line has since grown to include Skywork-R1V2 (April 2025), Skywork-R1V3-38B (July 2025), and Skywork-R1V4-Lite (November 2025). For the open-weight releases, model weights and AWQ and GGUF quantizations are published under the MIT License, alongside technical reports on arXiv.[^1][^5][^6][^7]
| Property | Value |
|---|---|
| Developer | Skywork AI (Kunlun Tech / Kunlun Wanwei) |
| First public release | March 18, 2025 (Skywork-R1V-38B)[^1][^2] |
| Total parameters (R1V/R1V2/R1V3) | 38 billion[^3][^5][^7] |
| Vision encoder | InternViT-6B-448px-V2_5[^3][^5] |
| Language backbone (R1V) | DeepSeek-R1-Distill-Qwen-32B[^3] |
| Language backbone (R1V2) | Qwen/QwQ-32B[^5] |
| Base checkpoint (R1V3) | InternVL3-38B (pretrained)[^7] |
| Connector | Lightweight MLP visual projector[^8] |
| License | MIT[^3][^5][^7] |
| arXiv (R1V) | 2504.05599[^8] |
| arXiv (R1V2) | 2504.16656[^9] |
| GitHub | SkyworkAI/Skywork-R1V[^1] |
| HuggingFace org | Skywork[^10] |
Beijing Kunlun Tech Co., Ltd. (Kunlun Wanwei, ticker SZ:300418) is a Beijing-based internet company founded in 2008 by Zhou Yahui that operates businesses in distribution, social networking, games, and, since the early 2020s, generative AI.[^11] In June 2023, the company launched the "Tiangong" / "Skywork" large language model brand and was included on China's "Next Tens of Billions of AIGC Products" list.[^11] In October 2023 it open-sourced the Skywork-13B bilingual foundation model under the Skywork Community License, accompanied by the 150B-token SkyPile Chinese corpus.[^12][^13] Through later integration, the AGI and AIGC business was consolidated under the Skywork AI subsidiary, which also produces the SkyReels video models, the SkyMusic / Mureka music platform, and the Skyo real-time voice assistant.[^11][^14]
The release of DeepSeek-R1 in January 2025 and its companion DeepSeek-R1-Distill series demonstrated that long chain-of-thought reasoning, trained primarily with rule-based reinforcement learning, could transfer to dense student models in the 1.5B to 70B parameter range.[^15] Skywork-R1V was conceived as the multimodal extension of that paradigm: rather than retraining a vision-language model from scratch, Skywork's researchers attached an existing vision tower to a reasoning-distilled language model and fine-tuned only the connecting modules and the language model's vision-conditioned behavior.[^8][^3] The team frames this as an "efficient multimodal transfer method" that preserves the textual reasoning of the R1-series LLM while granting it the ability to read images.[^8]
Skywork-R1V-38B was released on March 18, 2025 via the SkyworkAI GitHub organization and the Skywork/Skywork-R1V-38B repository on Hugging Face, with a follow-on AWQ quantized release on March 26, 2025 enabling single-GPU inference on accelerators with at least 30 GB of memory.[^1][^3] The technical report, titled Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought by Peng et al., was posted to arXiv as 2504.05599 on April 8, 2025.[^8]
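The single-GPU claim for the AWQ release can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the memory needed for the weights alone at 16-bit and 4-bit precision, assuming the nominal 38B parameter count and assuming every weight is stored at the stated width. Real AWQ checkpoints also store per-group scales and may keep some modules at higher precision, and inference needs additional memory for activations and the KV cache, so these figures are lower bounds rather than measured requirements. The helper name `weight_memory_gb` is illustrative and not part of any Skywork tooling.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 38e9  # nominal size: ~6B vision tower + ~32B language model

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # unquantized half precision
awq4_gb = weight_memory_gb(TOTAL_PARAMS, 4)   # idealized 4-bit quantization

print(f"fp16 weights:  {fp16_gb:.1f} GB")  # 76.0 GB
print(f"4-bit weights: {awq4_gb:.1f} GB")  # 19.0 GB
```

Under these assumptions the 4-bit weights occupy roughly 19 GB, which leaves headroom for quantization metadata and runtime buffers within the 30 GB guidance, whereas the half-precision weights alone would exceed it.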
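The model is a modular vision-language model in which a frozen vision tower, a learned multilayer perceptron (MLP) adapter, and a reasoning-capable LLM are wired in series.[^3][^8] Specifically, the vision tower is InternViT-6B-448px-V2_5, the adapter is a lightweight MLP visual projector, and the LLM is DeepSeek-R1-Distill-Qwen-32B.[^3][^8]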
Adding the 6B vision tower to the 32B language model gives an aggregate parameter count of approximately 38B, which the team uses as the canonical model size.[^3] The full system processes interleaved image-text inputs with a 16,384-token context length.[^8]
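The series wiring of vision tower, MLP adapter, and LLM can be illustrated with a toy data-flow sketch. Everything here is a stand-in under stated assumptions: the dimensions are tiny placeholder values, the "encoder" and "embedding" functions are dummies, and the two-layer ReLU projector is an assumed shape for the adapter, not the architecture documented in the report. The point is only that projected image features end up in the same embedding space, and the same input sequence, as text tokens.

```python
import random

random.seed(0)

# Toy widths (assumptions for illustration; the real models are far wider).
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12


def init_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight, bias):
    """Compute x @ weight + bias for a single input vector x."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def frozen_vision_encoder(image_patches):
    """Dummy vision tower: one fixed feature vector per patch, no trainable state."""
    return [[sum(p) / len(p) / 255.0] * VISION_DIM for p in image_patches]


class MLPProjector:
    """Hypothetical two-layer adapter mapping vision features to LLM width."""

    def __init__(self):
        self.w1, self.b1 = init_matrix(VISION_DIM, HIDDEN_DIM), [0.0] * HIDDEN_DIM
        self.w2, self.b2 = init_matrix(HIDDEN_DIM, LLM_DIM), [0.0] * LLM_DIM

    def __call__(self, feature):
        hidden = [max(0.0, h) for h in linear(feature, self.w1, self.b1)]  # ReLU
        return linear(hidden, self.w2, self.b2)


def embed_text(token_ids):
    """Dummy stand-in for the LLM's token embedding table."""
    return [[(t % 7) / 7.0] * LLM_DIM for t in token_ids]


def build_llm_inputs(image_patches, token_ids, projector):
    """Project patch features, then prepend them to the text embeddings."""
    visual_tokens = [projector(f) for f in frozen_vision_encoder(image_patches)]
    return visual_tokens + embed_text(token_ids)


projector = MLPProjector()
inputs = build_llm_inputs([[0, 128, 255], [10, 20, 30]], [101, 2023, 7], projector)
print(len(inputs), len(inputs[0]))  # 5 12
```

Two image patches and three text tokens yield a five-position sequence in which every position has the LLM's embedding width, which is what lets the language model attend over both modalities without changes to its own layers.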
The R1V technical report describes a three-stage training recipe designed to graft visual grounding onto an already-reasoning-capable LLM without disturbing its text-only reasoning quality.[^8]
Layered on top of these stages is an adaptive-length chain-of-thought distillation procedure that dynamically adjusts the length of the reasoning trace to avoid "overthinking" on simple visual questions while preserving long traces for hard ones.[^4][^8] Together, these are the two contributions the report highlights in its abstract: a hybrid SFT+GRPO optimization strategy and adaptive-length CoT distillation.[^8]
Skywork reports the following results for Skywork-R1V-38B, mixing standard multimodal benchmarks with text-only reasoning benchmarks to verify that the vision grafting does not degrade the underlying R1-distill reasoning ability.[^3][^8]
| Benchmark | Skywork-R1V-38B |
|---|---|
| MMMU (val) | 69.0[^3][^8] |
| MathVista (mini) | 67.5[^3][^8] |
| GPQA (pass@1) | 61.6[^3] |
| MATH-500 (pass@1) | 94.0[^3][^8] |
| AIME 2024 (pass@1) | 72.0[^3][^8] |
Skywork's own write-up positions the MMMU and MathVista numbers as comparable to closed-source reasoning systems such as Gemini 2.0 and Kimi K1.5 for visual question answering, while the MATH-500 and AIME 2024 numbers are intended to show that text-only reasoning is preserved relative to the underlying DeepSeek-R1-Distill-Qwen-32B checkpoint.[^2][^3][^4] Independent benchmark trackers and the model card on Hugging Face report the same headline numbers.[^3][^16]
The phrase "Cross-modal Self-Iterative Adaptive Reasoning" does not appear in the R1V technical report or in the model card; the closest documented constructs are the iterative SFT + GRPO loop and the adaptive-length CoT distillation described above, and this article restricts itself to those documented terms.[^8]
Skywork-R1V2 was announced on April 24, 2025, with the technical report Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning (Wang et al., arXiv 2504.16656) posted on April 23, 2025.[^5][^9] An AWQ quantized release followed on April 28, 2025.[^1] The model retains the 38B size and the InternViT-6B-448px-V2_5 vision encoder of the original R1V, but switches the language backbone from DeepSeek-R1-Distill-Qwen-32B to Alibaba's QwQ-32B reasoning model and overhauls the training procedure.[^5][^9]
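R1V2 frames training as a hybrid reinforcement learning problem that jointly leverages two complementary objectives: reward-model-guided Mixed Preference Optimization (MPO) and rule-based Group Relative Policy Optimization (GRPO).[^9][^17]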
The team also introduces a Selective Sample Buffer (SSB), a replay mechanism that caches high-quality training examples with non-zero advantage and reintroduces them during later policy updates. SSB is presented as a fix for the "vanishing advantages" problem in GRPO, where most rollouts in a group end up with similar rewards and contribute weak gradients.[^9][^17] Caching and replaying informative samples increases gradient density and, according to the report, encourages deeper chains of reasoning.[^9][^17] The authors additionally report that overly strong reinforcement signals can induce visual hallucinations and discuss calibration thresholds to mitigate this trade-off between reasoning depth and visual faithfulness.[^9][^17]
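The vanishing-advantages problem and the buffering idea can be made concrete with a minimal sketch. It uses the standard group-relative advantage from GRPO (each reward minus the group mean, divided by the group standard deviation) and a simple buffer that keeps only samples whose advantage is non-zero. The class name `SelectiveSampleBuffer`, the `capacity` and `threshold` parameters, the keep-most-recent eviction rule, and the uniform replay sampling are all assumptions made for illustration; the report's actual selection and replay policy may differ.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical replay buffer that caches only samples with non-zero advantage."""

    def __init__(self, capacity=1000, threshold=1e-3):
        self.capacity = capacity
        self.threshold = threshold
        self.samples = []

    def add_group(self, prompt, responses, rewards):
        """Score one rollout group and cache its informative samples."""
        advantages = group_relative_advantages(rewards)
        for response, advantage in zip(responses, advantages):
            if abs(advantage) > self.threshold:
                self.samples.append((prompt, response, advantage))
        self.samples = self.samples[-self.capacity:]  # keep the most recent entries
        return advantages

    def sample(self, k, rng=random):
        """Draw up to k cached samples to mix into a later policy update."""
        return rng.sample(self.samples, min(k, len(self.samples)))


buffer = SelectiveSampleBuffer(capacity=8)
rollouts = ["r1", "r2", "r3", "r4"]

# Every rollout earns the same reward, so the group carries no learning signal.
uniform = buffer.add_group("easy prompt", rollouts, [1.0, 1.0, 1.0, 1.0])
print(uniform)              # [0.0, 0.0, 0.0, 0.0]
print(len(buffer.samples))  # 0

# Mixed rewards give non-zero advantages, so all four rollouts are cached.
mixed = buffer.add_group("hard prompt", rollouts, [1.0, 0.0, 0.0, 1.0])
print(len(buffer.samples))  # 4
replayed = buffer.sample(2, rng=random.Random(0))
```

The first group shows why uniform rewards contribute nothing to the gradient, and the second shows what the buffer retains; replaying such samples alongside fresh rollouts is the mechanism the report credits with raising gradient density.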
The R1V2 model card and arXiv report the following figures for Skywork-R1V2-38B, with R1V-38B numbers included for reference.[^5][^9][^17]
| Benchmark | R1V2-38B | R1V-38B |
|---|---|---|
| MMMU (val) | 73.6 | 68.0[^5] |
| MathVista (mini) | 74.0 | 67.0[^5] |
| OlympiadBench | 62.6 | 40.4[^5][^9] |
| AIME 2024 | 78.9 | 72.0[^5][^9] |
| LiveCodeBench | 63.6 | not reported[^5][^9] |
| GPQA | 61.6 | 61.6[^5] |
The most striking jumps are on OlympiadBench, where R1V2 lifts the score from 40.4 to 62.6, and on MMMU, where it climbs from 68.0 to 73.6. Skywork frames the latter as the highest then-reported score for any open-source 38B-class multimodal model.[^5][^17]
Skywork-R1V3-38B was released on July 9, 2025 with the model card hosted at Skywork/Skywork-R1V3-38B.[^1][^7] Unlike R1V2's switch to QwQ-32B, R1V3 is built directly on the InternVL3-38B pretrained checkpoint and emphasizes post-training reinforcement learning rather than reasoning-focused pretraining.[^7] The R1V3 model card highlights several methodological choices: a fine-grained cold-start SFT used to prime the model for RL, a connector-only fine-tuning step that further boosts performance after RL, and an "Entropy of Critical Reasoning Tokens" metric used to select checkpoints.[^7] Reported benchmarks include 76.0 on MMMU (val), 77.1 on MathVista (mini), 55.4 on MMMU-Pro, and 28.5 on VisuLogic, which Skywork describes as state of the art among open multimodal reasoning models at the time of release.[^7]
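The model card names an entropy-based checkpoint metric without, in the material summarized here, spelling out its formula. As a rough illustration of what such a metric could look like, the sketch below computes the Shannon entropy of the next-token distribution at a chosen set of positions and averages it. How "critical reasoning tokens" are actually identified, and how the resulting value is used to rank checkpoints, are not specified here, so `critical_positions` and both function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_critical_entropy(step_distributions, critical_positions):
    """Average entropy over the positions flagged as critical (assumed given)."""
    entropies = [token_entropy(step_distributions[i]) for i in critical_positions]
    return sum(entropies) / len(entropies) if entropies else 0.0


# Toy next-token distributions for three generation steps.
distributions = [
    [1.0, 0.0, 0.0],    # fully confident step: zero entropy
    [0.5, 0.5, 0.0],    # two equally likely continuations
    [0.25, 0.25, 0.5],  # more spread-out distribution
]

# Assume steps 1 and 2 were flagged as critical reasoning tokens.
score = mean_critical_entropy(distributions, critical_positions=[1, 2])
print(f"{score:.4f}")
```

A metric of this shape could be computed per checkpoint on a fixed evaluation set and compared across checkpoints, though whether higher or lower values are preferred depends on the selection rule Skywork actually uses.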
Skywork-R1V4-Lite was announced on November 18, 2025. Unlike its predecessors, it is closed-source, served only through Skywork's platform API and via OpenRouter, and is described as a lightweight reasoning model built on Qwen3-VL-30B-A3B-Instruct (a 30B mixture-of-experts model with about 3B activated parameters).[^1] Skywork emphasizes agentic capabilities: code execution, deep research via search-tool integration, streaming output, and multi-turn reasoning. Reported headline numbers include 91.8 on HRBench-4K FSP and 71.4 on MME-Real Overall.[^1]
Skywork-R1V sits inside a wider open-source release program that began with the company's 2023 bilingual base model. The table below summarizes the major model families the Skywork organization has published on Hugging Face and GitHub.[^10]
| Family | Year(s) | Description | Sources |
|---|---|---|---|
| Skywork-13B | 2023 | Bilingual (Chinese/English) LLM trained on 3.2T tokens; includes Base, Chat, and Math variants; released with SkyPile-150B corpus. | [^12][^13] |
| Skywork-MoE | 2024 | 146B-parameter mixture-of-experts model with 16 experts and ~22B active parameters; initialized from Skywork-13B; introduces gating logit normalization and adaptive auxiliary loss coefficients. | [^18][^19] |
| Skywork-o1-Open-PRM | late 2024 | Open process reward model series for step-by-step reasoning supervision, based on a Qwen-2.5-1.5B backbone. | [^10] |
| Skywork-R1V series | 2025 | InternViT + R1-style backbone multimodal reasoning models (R1V, R1V2, R1V3, R1V4-Lite). | [^1] |
| Skywork-OR1 | April / May 2025 | "Open Reasoner 1" math and code reasoning models at 7B and 32B; trained with large-scale rule-based reinforcement learning; released under Apache 2.0. | [^20] |
| Skywork-VL Reward | May 2025 | Multimodal VLM reward model based on Qwen2.5-VL-7B-Instruct with a value head; achieves state-of-the-art VL-RewardBench results. | [^21] |
| Skywork-Reward / Reward-V2 | 2024 / July 2025 | Text reward model series; the V2 release in July 2025 includes 8 models from 0.6B to 8B parameters and tops Reward Bench v1/v2, RM-Bench, JudgeBench and other reward benchmarks. | [^22] |
| Skywork-UniPic, UniPic2, UniPic3 | 2025 | Unified autoregressive vision-language image generation and multi-image composition models. | [^10] |
| Skywork-SWE-32B | 2025 | Software engineering reasoning model studying scaling laws for SWE tasks. | [^10] |
| SkyReels-V1 / V3 / V4 | 2025 | Human-centric video foundation and multimodal video-audio generation models. | [^10] |
| Mureka / Skyo | 2025 | Mureka O1 music reasoning model and the Skyo real-time voice assistant launched alongside Skywork 4.0. | [^14][^23] |
Note: Skywork has not, as of the references checked here, released a model branded as "Skywork-Audio." The company's audio and speech work appears under the Skyo voice assistant and the Mureka AI music brand rather than a "Skywork-Audio" product line, so this article does not claim such a model exists.[^14][^23]
Skywork-R1V is one of the earliest fully open-weight attempts to port the long-CoT, RL-trained reasoning paradigm popularized by OpenAI o1 and DeepSeek-R1 into vision. Three aspects make it methodologically interesting:
The R1V technical report and the R1V2 follow-up explicitly note several limitations.[^8][^9][^17]
Skywork-R1V's design sits at the intersection of three lines of work: