Skywork-R1V


Updated Jun 9, 2026


Skywork-R1V is a family of multimodal reasoning models, most of them open-weight, released by Skywork AI, the AGI and AIGC division of Beijing Kunlun Tech Co., Ltd. (Kunlun Wanwei).^[1]^[2] First announced on March 18, 2025, the initial Skywork-R1V-38B extended the chain-of-thought reasoning paradigm of DeepSeek-R1 from text into vision by coupling an InternViT vision encoder with a DeepSeek-R1-Distill language backbone through a lightweight projector.^[1]^[3] Skywork promoted the model as the industry's first open-source multimodal reasoning model with visual chain-of-thought capabilities.^[2]^[4] The line has since grown to include Skywork-R1V2 (April 2025), Skywork-R1V3-38B (July 2025), and Skywork-R1V4-Lite (November 2025). The weights of R1V, R1V2, and R1V3 are publicly released under the MIT License, with AWQ and GGUF quantizations and arXiv technical reports also available; R1V4-Lite is closed-source and served only through an API.^[1]^[5]^[6]^[7]

Infobox

Property	Value
Developer	Skywork AI (Kunlun Tech / Kunlun Wanwei)
First public release	March 18, 2025 (Skywork-R1V-38B)^[1]^[2]
Total parameters (R1V/R1V2/R1V3)	38 billion^[3]^[5]^[7]
Vision encoder	InternViT-6B-448px-V2_5^[3]^[5]
Language backbone (R1V)	DeepSeek-R1-Distill-Qwen-32B^[3]
Language backbone (R1V2)	Qwen/QwQ-32B^[5]
Language backbone (R1V3)	InternVL3-38B (pretrained)^[7]
Connector	Lightweight MLP visual projector^[8]
License	MIT^[3]^[5]^[7]
arXiv (R1V)	2504.05599^[8]
arXiv (R1V2)	2504.16656^[9]
GitHub	SkyworkAI/Skywork-R1V^[1]
HuggingFace org	Skywork^[10]

Background

Kunlun Tech and Skywork AI

Beijing Kunlun Tech Co., Ltd. (Kunlun Wanwei, ticker SZ:300418) is a Beijing-based internet company founded in 2008 by Zhou Yahui that operates businesses in distribution, social networking, games, and, since the early 2020s, generative AI.^[11] In June 2023, the company launched the "Tiangong" / "Skywork" large language model brand and was included on China's "Next Tens of Billions of AIGC Products" list.^[11] In October 2023 it open-sourced the Skywork-13B bilingual foundation model under the Skywork Community License, accompanied by the 150B-token SkyPile Chinese corpus.^[12]^[13] The AGI and AIGC business was later consolidated under the Skywork AI subsidiary, which also produces the SkyReels video models, the SkyMusic / Mureka music platform, and the Skyo real-time voice assistant.^[11]^[14]

From text reasoning to multimodal reasoning

The release of DeepSeek-R1 in January 2025 and its companion DeepSeek-R1-Distill series demonstrated that long chain-of-thought reasoning, trained primarily with rule-based reinforcement learning, could transfer to dense student models in the 1.5B to 70B parameter range.^[15] Skywork-R1V was conceived as the multimodal extension of that paradigm: rather than retraining a vision-language model from scratch, Skywork's researchers attached an existing vision tower to a reasoning-distilled language model and fine-tuned only the connecting modules and the language model's vision-conditioned behavior.^[8]^[3] The team frames this as an "efficient multimodal transfer method" that preserves the textual reasoning of the R1-series LLM while granting it the ability to read images.^[8]

Skywork-R1V (March 2025)

Skywork-R1V-38B was released on March 18, 2025 via the SkyworkAI GitHub organization and the Skywork/Skywork-R1V-38B repository on Hugging Face, with a follow-on AWQ quantized release on March 26, 2025 enabling single-GPU inference on accelerators with at least 30 GB of memory.^[1]^[3] The technical report, titled Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought by Peng et al., was posted to arXiv as 2504.05599 on April 8, 2025.^[8]

Architecture

The model is a modular vision-language model in which a frozen vision tower, a learned multilayer perceptron (MLP) adapter, and a reasoning-capable LLM are wired in series.^[3]^[8] Specifically:

The visual backbone is InternViT-6B-448px-V2_5, a roughly 6-billion-parameter Vision Transformer derived from Shanghai AI Lab's InternVL family that ingests 448 × 448-pixel image tiles.^[3]^[5]
The language backbone is DeepSeek-R1-Distill-Qwen-32B, the 32B DeepSeek-R1-Distill checkpoint built on the Qwen 2.5 architecture.^[3]^[8]
A lightweight MLP visual projector maps the vision encoder's output space into the language model's input space; this projector is the only component initialized from scratch.^[8]

Adding the 6B vision tower to the 32B language model gives an aggregate parameter count of approximately 38B, which the team uses as the canonical model size.^[3] The full system processes interleaved image-text inputs with a 16,384-token context length.^[8]
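As a rough illustration of the wiring described above, the following sketch pushes toy "vision tokens" through a two-layer MLP projector and prepends the result to text embeddings. The layer layout (linear, GELU, linear), the toy dimensions, and every function name here are assumptions made for illustration; none of them is taken from the R1V report or code.

```python
import math
import random

D_VISION, D_LLM = 8, 12  # toy widths (assumption); the real hidden sizes are far larger


def make_linear(d_in, d_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out


def linear(x, layer):
    """Apply one dense layer to a single token vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def project(vision_tokens, fc1, fc2):
    """Map each vision token from the encoder width to the LLM embedding width."""
    return [linear([gelu(h) for h in linear(tok, fc1)], fc2) for tok in vision_tokens]


def build_llm_inputs(vision_tokens, text_embeddings, fc1, fc2):
    """Prepend projected image tokens to the text embeddings (one simple layout)."""
    return project(vision_tokens, fc1, fc2) + text_embeddings


rng = random.Random(0)
fc1 = make_linear(D_VISION, D_LLM, rng)
fc2 = make_linear(D_LLM, D_LLM, rng)
# Stand-ins for the frozen encoder's output and the LLM's token embeddings.
image_tokens = [[rng.random() for _ in range(D_VISION)] for _ in range(4)]
text_tokens = [[rng.random() for _ in range(D_LLM)] for _ in range(3)]

sequence = build_llm_inputs(image_tokens, text_tokens, fc1, fc2)
print(len(sequence), len(sequence[0]))  # 7 12
```

The point of the sketch is only the shape contract: whatever the encoder emits, the projector must hand the language model vectors of its own embedding width so that image and text tokens can share one input sequence.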

Training pipeline

The R1V technical report describes a three-stage training recipe designed to graft visual grounding onto an already-reasoning-capable LLM without disturbing its text-only reasoning quality:^[8]

MLP initialization (alignment proxy). The InternViT-6B encoder is first aligned to a non-reasoning substitute language model, Qwen2.5-32B-Instruct, using only the MLP projector and a standard image-caption / VQA-style objective. This step trains the projector to translate vision tokens into a space the Qwen family can read.^[8]
Model re-assembly. The trained MLP is then transferred and used to splice the same InternViT to the reasoning-distilled DeepSeek-R1-Distill-Qwen-32B backbone. Because R1-Distill and Qwen2.5-Instruct share the same Qwen 2.5 architecture, the projected vision tokens slot in without requiring full retraining of either tower.^[8]
Hybrid optimization. The assembled model is trained for visual reasoning with a hybrid loop combining iterative supervised fine-tuning (four iterations of SFT on curated multimodal reasoning data) and Group Relative Policy Optimization (GRPO), the same reinforcement learning algorithm used for DeepSeek-R1.^[8] Reported training hyperparameters include an initial learning rate of 2 × 10^-4, a refinement learning rate of 4 × 10^-5, and a batch size of 512.^[8]
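The re-assembly step in the recipe above hinges on one structural fact: the proxy and the reasoning backbone expose the same embedding width, so a projector trained against one can be attached to the other. The sketch below models only that compatibility check. The class names, the `reassemble` helper, and the `HIDDEN` value are hypothetical and are not drawn from the report; the report's actual tooling is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageModel:
    name: str
    hidden_size: int  # width of the token-embedding space


@dataclass(frozen=True)
class Projector:
    out_features: int  # width of the vectors the projector emits
    attached_to: str   # name of the LLM it currently feeds


HIDDEN = 5120  # placeholder width (assumption); the two backbones only need to share it

proxy = LanguageModel("Qwen2.5-32B-Instruct", HIDDEN)
reasoner = LanguageModel("DeepSeek-R1-Distill-Qwen-32B", HIDDEN)


def reassemble(projector: Projector, target: LanguageModel) -> Projector:
    """Reuse a trained projector with another LLM if the embedding widths match."""
    if projector.out_features != target.hidden_size:
        raise ValueError(
            f"projector emits {projector.out_features}, "
            f"{target.name} expects {target.hidden_size}"
        )
    return Projector(projector.out_features, attached_to=target.name)


# Stage 1: projector aligned against the non-reasoning proxy.
stage1 = Projector(out_features=proxy.hidden_size, attached_to=proxy.name)
# Stage 2: the same projector spliced onto the reasoning backbone.
stage2 = reassemble(stage1, reasoner)
print(stage2.attached_to)  # DeepSeek-R1-Distill-Qwen-32B
```

Matching widths make the swap possible but do not guarantee that the reasoning model interprets the projected tokens well, which is one way to read the need for the third, hybrid optimization stage.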

Layered on top of these stages is an adaptive-length chain-of-thought distillation procedure that dynamically adjusts the length of the reasoning trace to avoid "overthinking" on simple visual questions while preserving long traces for hard ones.^[4]^[8] Together, these are the two contributions the report highlights in its abstract: a hybrid SFT+GRPO optimization strategy and adaptive-length CoT distillation.^[8]

Benchmark results

Skywork reports the following results for Skywork-R1V-38B, mixing standard multimodal benchmarks with text-only reasoning benchmarks to verify that the vision grafting does not degrade the underlying R1-distill reasoning ability.^[3]^[8]

Benchmark	Skywork-R1V-38B
MMMU (val)	69.0^[3]^[8]
MathVista (mini)	67.5^[3]^[8]
GPQA (pass@1)	61.6^[3]
MATH-500 (pass@1)	94.0^[3]^[8]
AIME 2024 (pass@1)	72.0^[3]^[8]

Skywork's own write-up positions the MMMU and MathVista numbers as comparable to closed-source reasoning systems such as Gemini 2.0 and Kimi K1.5 for visual question answering, while the MATH-500 and AIME 2024 numbers are intended to show that text-only reasoning is preserved relative to the underlying DeepSeek-R1-Distill-Qwen-32B checkpoint.^[2]^[3]^[4] Independent benchmark trackers and the model card on Hugging Face report the same headline numbers.^[3]^[16]

The phrase "Cross-modal Self-Iterative Adaptive Reasoning" is sometimes associated with the model, but it does not appear in the R1V technical report or in the model card. The closest documented constructs are the iterative SFT + GRPO loop and the adaptive-length CoT distillation described above, and this article restricts itself to those documented terms.^[8]

Skywork-R1V2 (April 2025)

Skywork-R1V2 was announced on April 24, 2025, with the technical report Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning (Wang et al., arXiv 2504.16656) posted on April 23, 2025.^[5]^[9] An AWQ quantized release followed on April 28, 2025.^[1] The model retains the 38B size and the InternViT-6B-448px-V2_5 vision encoder of the original R1V, but switches the language backbone from DeepSeek-R1-Distill-Qwen-32B to Alibaba's QwQ-32B reasoning model and overhauls the training procedure.^[5]^[9]

Hybrid reinforcement learning

R1V2 frames training as a hybrid reinforcement learning problem that jointly leverages two complementary objectives:^[9]^[17]

Mixed Preference Optimization (MPO), a preference-learning objective that combines reward signals from the dedicated Skywork-VL Reward (R1V-RM) model with rule-based constraints on format correctness, factual consistency, and step-by-step reasoning completeness.^[9]^[17]
Group Relative Policy Optimization (GRPO), the same on-policy RL algorithm used in R1V, retained to drive exploration of harder reasoning trajectories.^[9]^[17]

The team also introduces a Selective Sample Buffer (SSB), a replay mechanism that caches high-quality training examples with non-zero advantage and reintroduces them during later policy updates. SSB is presented as a fix for the "vanishing advantages" problem in GRPO, where most rollouts in a group end up with similar rewards and contribute weak gradients.^[9]^[17] Caching and replaying informative samples increases gradient density and, according to the report, encourages deeper chains of reasoning.^[9]^[17] The authors additionally report that overly strong reinforcement signals can induce visual hallucinations and discuss calibration thresholds to mitigate this trade-off between reasoning depth and visual faithfulness.^[9]^[17]
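To make the "vanishing advantages" problem concrete, the sketch below computes group-relative advantages in the usual GRPO style (each reward standardized against its own rollout group) and keeps only rollouts whose advantage is non-zero. The `SelectiveSampleBuffer` class, its capacity and threshold parameters, and the sampling policy are illustrative assumptions; the report's actual buffer design may differ.

```python
import random
from collections import deque
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical replay buffer that keeps only rollouts with non-zero advantage."""

    def __init__(self, capacity=1024, threshold=1e-3):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first
        self.threshold = threshold

    def add_group(self, prompt, responses, rewards):
        """Score one rollout group and cache the informative rollouts."""
        for response, adv in zip(responses, group_advantages(rewards)):
            if abs(adv) > self.threshold:
                self.items.append((prompt, response, adv))

    def sample(self, k, rng=random):
        """Draw up to k cached rollouts to mix into a later policy update."""
        return rng.sample(list(self.items), min(k, len(self.items)))


buf = SelectiveSampleBuffer()
# Every rollout is correct: all advantages are zero, so nothing is cached.
buf.add_group("q1", ["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0])
# Mixed outcomes: every rollout carries signal and is cached.
buf.add_group("q2", ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 1.0])
print(len(buf.items))  # 4
```

When a group's rewards are identical, every advantage is zero and the group contributes no gradient; replaying cached non-zero-advantage rollouts is one way to keep update batches informative as the policy improves.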

Benchmark results

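The R1V2 model card and arXiv report give the following figures for Skywork-R1V2-38B, with R1V-38B numbers included for reference. The R1V values shown here for MMMU and MathVista (68.0 and 67.0) are slightly lower than the 69.0 and 67.5 listed in the R1V benchmark table above.^[5]^[9]^[17]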

Benchmark	R1V2-38B	R1V-38B
MMMU (val)	73.6	68.0^[5]
MathVista (mini)	74.0	67.0^[5]
OlympiadBench	62.6	40.4^[5]^[9]
AIME 2024	78.9	72.0^[5]^[9]
LiveCodeBench	63.6	not reported^[5]^[9]
GPQA	61.6	61.6^[5]

The most striking jumps are on OlympiadBench, where R1V2 lifts the score from 40.4 to 62.6, and on MMMU, where it climbs from 68.0 to 73.6. Skywork frames the latter as the highest then-reported score for any open-source 38B-class multimodal model.^[5]^[17]

Later releases

Skywork-R1V3-38B (July 2025)

Skywork-R1V3-38B was released on July 9, 2025 with the model card hosted at Skywork/Skywork-R1V3-38B.^[1]^[7] Unlike R1V2's switch to QwQ-32B, R1V3 is built directly on the InternVL3-38B pretrained checkpoint and emphasizes post-training reinforcement learning rather than reasoning-focused pretraining.^[7] The R1V3 model card highlights several methodological choices: a fine-grained cold-start SFT used to prime the model for RL, a connector-only fine-tuning step that further boosts performance after RL, and an "Entropy of Critical Reasoning Tokens" metric used to select checkpoints.^[7] Reported benchmarks include 76.0 on MMMU (val), 77.1 on MathVista (mini), 55.4 on MMMU-Pro, and 28.5 on VisuLogic, which Skywork describes as state of the art among open multimodal reasoning models at the time of release.^[7]

Skywork-R1V4-Lite (November 2025)

Skywork-R1V4-Lite was announced on November 18, 2025. Unlike its predecessors, it is closed-source, served only through Skywork's platform API and via OpenRouter, and is described as a lightweight reasoning model built on Qwen3-VL-30B-A3B-Instruct (a 30B mixture-of-experts model with about 3B activated parameters).^[1] Skywork emphasizes agentic capabilities: code execution, deep research via search-tool integration, streaming output, and multi-turn reasoning. Reported headline numbers include 91.8 on HRBench-4K FSP and 71.4 on MME-Real Overall.^[1]

Skywork's broader catalog

Skywork-R1V sits inside a wider open-source release program that began with the company's 2023 bilingual base model. The table below summarizes the major model families the Skywork organization has published on Hugging Face and GitHub.^[10]

Family	Year(s)	Description	Sources
Skywork-13B	2023	Bilingual (Chinese/English) LLM trained on 3.2T tokens; includes Base, Chat, and Math variants; released with SkyPile-150B corpus.	^[12]^[13]
Skywork-MoE	2024	146B-parameter mixture-of-experts model with 16 experts and ~22B active parameters; initialized from Skywork-13B; introduces gating logit normalization and adaptive auxiliary loss coefficients.	^[18]^[19]
Skywork-o1-Open-PRM	late 2024	Open process reward model series for step-by-step reasoning supervision, based on a Qwen-2.5-1.5B backbone.	^[10]
Skywork-R1V series	2025	Multimodal reasoning models (R1V, R1V2, R1V3, R1V4-Lite); the first three use the InternViT vision encoder, while R1V4-Lite builds on Qwen3-VL.	^[1]
Skywork-OR1	April / May 2025	"Open Reasoner 1" math and code reasoning models at 7B and 32B; trained with large-scale rule-based reinforcement learning; released under Apache 2.0.	^[20]
Skywork-VL Reward	May 2025	Multimodal VLM reward model based on Qwen2.5-VL-7B-Instruct with a value head; achieves state-of-the-art VL-RewardBench results.	^[21]
Skywork-Reward / Reward-V2	2024 / July 2025	Text reward model series; the V2 release in July 2025 includes 8 models from 0.6B to 8B parameters and tops Reward Bench v1/v2, RM-Bench, JudgeBench and other reward benchmarks.	^[22]
Skywork-UniPic, UniPic2, UniPic3	2025	Unified autoregressive vision-language image generation and multi-image composition models.	^[10]
Skywork-SWE-32B	2025	Software engineering reasoning model studying scaling laws for SWE tasks.	^[10]
SkyReels-V1 / V3 / V4	2025	Human-centric video foundation and multimodal video-audio generation models.	^[10]
Mureka / Skyo	2025	Mureka O1 music reasoning model and the Skyo real-time voice assistant launched alongside Skywork 4.0.	^[14]^[23]

Note: Skywork has not, as of the references checked here, released a model branded as "Skywork-Audio." The company's audio and speech work appears under the Skyo voice assistant and the Mureka AI music brand instead.^[14]^[23]

Significance

Skywork-R1V is one of the earliest fully open-weight attempts to port the long-CoT, RL-trained reasoning paradigm popularized by OpenAI o1 and DeepSeek-R1 into vision. Three aspects make it methodologically interesting:

Reasoning-preserving multimodal transfer. By freezing the vision tower and a reasoning-distilled LLM and training only an MLP projector before the joint SFT/GRPO loop, the team aims for "near-lossless" preservation of the underlying chain-of-thought behavior on text benchmarks like MATH-500 and AIME 2024 while still gaining visual grounding.^[3]^[8]
Hybrid RL for VLMs. R1V2's combination of MPO with GRPO and the Selective Sample Buffer extends the GRPO recipe (originally text-only in DeepSeek) to vision-language settings, and surfaces the now-well-known trade-off between reasoning depth and visual hallucination under strong reward signals.^[9]^[17]
Open weights, MIT license. Together with the contemporaneous Skywork-OR1, Skywork-VL Reward, and Skywork-Reward-V2 releases, R1V is part of an unusually transparent open ecosystem for reasoning research, including model weights, AWQ and GGUF quantizations, training procedures, and benchmark scripts.^[1]^[10]^[20]^[21]

Limitations and criticisms

The R1V technical report and the R1V2 follow-up explicitly note several limitations.^[8]^[9]^[17]

Visual hallucinations under strong RL signals. R1V2 reports that excessively aggressive reinforcement signals push the model toward longer but more hallucinated reasoning traces, motivating calibrated reward thresholds.^[9]^[17]
Dependence on a non-reasoning proxy. R1V's projector is first aligned with Qwen2.5-32B-Instruct rather than the actual reasoning backbone, and the report acknowledges that this two-step transfer is partly a workaround for the difficulty of training projectors directly against an already-CoT-trained LLM.^[8]
Hardware footprint. In BF16, the 38B-parameter R1V models require roughly 80 GB of GPU memory for inference. AWQ and GGUF quantizations were released specifically to bring them onto single GPUs with 30 GB or more of memory and onto CPU inference setups, but the unquantized form remains heavy.^[1]^[3]
Benchmark coverage gaps. Some benchmarks named in third-party descriptions (for example, AI2D and OlympiadBench for the original R1V) are not reported in the official R1V technical report or model card, so cross-version comparisons must use only the benchmarks each report actually publishes.^[3]^[8] For R1V, the documented set is MMMU, MathVista, MATH-500, AIME 2024, and GPQA; for R1V2, it adds OlympiadBench and LiveCodeBench.^[3]^[5]^[8]^[9]
License nuance for R1V4-Lite. R1V4-Lite is closed-source and API-only, breaking the open-weight pattern of R1V, R1V2, and R1V3, although its base (Qwen3-VL-30B-A3B-Instruct) remains Apache 2.0.^[1]
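The hardware-footprint figures in the list above follow from simple arithmetic on weight storage. The sketch below is a back-of-the-envelope estimate only: it counts weights, ignores activations, KV cache, and quantization metadata, and treats 1 GB as 10^9 bytes. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


PARAMS = 38e9  # aggregate parameter count quoted for R1V, R1V2, and R1V3

bf16_gb = weight_memory_gb(PARAMS, 16)  # two bytes per weight
int4_gb = weight_memory_gb(PARAMS, 4)   # nominal 4-bit quantization (assumed bit width)

print(bf16_gb)  # 76.0
print(int4_gb)  # 19.0
```

About 76 GB of BF16 weights is consistent with the "roughly 80 GB" figure once runtime overhead is added, and about 19 GB of 4-bit weights leaves headroom for activations and cache on a 30 GB accelerator.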

Related work

Skywork-R1V's design sits at the intersection of three lines of work:

The R1-style reasoning paradigm: DeepSeek-R1, DeepSeek-R1-Distill, QwQ-32B, OpenAI o1, and Kimi K1.5, which rely on long chain-of-thought traces for math, code, and science tasks, together with GRPO, the reinforcement learning algorithm used to train DeepSeek-R1.^[15]^[24]
The InternViT / InternVL family of open vision encoders from Shanghai AI Laboratory, whose InternViT-6B-448px-V2_5 encoder is reused by R1V, R1V2, and R1V3, and whose InternVL3-38B serves as the full base for R1V3.^[3]^[7]
Other contemporaneous Chinese open multimodal models such as DeepSeek-VL2, DeepSeek Janus, Qwen2.5-VL, and MiniCPM-V, which together populate the open-weights side of the multimodal AI landscape.

References

1. SkyworkAI, "Skywork-R1V (GitHub README)", GitHub, 2025-11-18. https://github.com/SkyworkAI/Skywork-R1V. Accessed 2026-05-21.
2. AIBase, "Kunlun Wanwei Open-Sources Skywork R1V Visual Reasoning Chain Model", AIBase, 2025-03-18. https://www.aibase.com/news/16387. Accessed 2026-05-21.
3. Skywork, "Skywork/Skywork-R1V-38B (model card)", Hugging Face, 2025-04-08. https://huggingface.co/Skywork/Skywork-R1V-38B. Accessed 2026-05-21.
4. AIBase, "Game Changer! Kunlun Wanwei's Skywork R1V Multimodal Reasoning Model Open-Sourced!", AIBase, 2025-03-19. https://www.aibase.com/news/16394. Accessed 2026-05-21.
5. Skywork, "Skywork/Skywork-R1V2-38B (model card)", Hugging Face, 2025-04-23. https://huggingface.co/Skywork/Skywork-R1V2-38B. Accessed 2026-05-21.
6. Skywork, "Skywork/Skywork-R1V2-38B-AWQ (model card)", Hugging Face, 2025-04-28. https://huggingface.co/Skywork/Skywork-R1V2-38B-AWQ. Accessed 2026-05-21.
7. Skywork, "Skywork/Skywork-R1V3-38B (model card)", Hugging Face, 2025-07-09. https://huggingface.co/Skywork/Skywork-R1V3-38B. Accessed 2026-05-21.
8. Peng, Y. et al., "Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought", arXiv:2504.05599, 2025-04-08. https://arxiv.org/abs/2504.05599. Accessed 2026-05-21.
9. Wang, P. et al., "Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning", arXiv:2504.16656, 2025-04-23. https://arxiv.org/abs/2504.16656. Accessed 2026-05-21.
10. Skywork, "Skywork organization page", Hugging Face, 2026-05-21. https://huggingface.co/Skywork. Accessed 2026-05-21.
11. Kunlun Tech, "Kunlun Tech Launched The 'SkyWork' Large Language Model And Was Selected Into The List of China's 'Next Tens of Billions of AIGC Products'", PR Newswire, 2023-06-08. https://www.prnewswire.com/news-releases/kunlun-tech-launched-the-skywork-large-language-model-and-was-selected-into-the-list-of-chinas-next-tens-of-billions-of-aigc-products-301836921.html. Accessed 2026-05-21.
12. Kunlun Tech, "Kunlun Tech releases open source 13B high-quality commercial large model, ahead of Llama2 and Baichuan2", Kunlun Tech News, 2023-10-30. http://www.kunlun.com/2023/en_mnews_1030/328.html. Accessed 2026-05-21.
13. Wei, T. et al., "Skywork: A More Open Bilingual Foundation Model", arXiv:2310.19341, 2023-10-30. https://arxiv.org/abs/2310.19341. Accessed 2026-05-21.
14. TMTPost, "Kunlun Tech Launches Skywork 4.0 AI Model and Skyo Real-Time Voice Assistant", TMTPost, 2025-04-18. https://en.tmtpost.com/news/7345973. Accessed 2026-05-21.
15. DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948, 2025-01-22. https://arxiv.org/abs/2501.12948. Accessed 2026-05-21.
16. Hugging Face, "Paper page: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought", Hugging Face Papers, 2025-04-08. https://huggingface.co/papers/2504.05599. Accessed 2026-05-21.
17. MarkTechPost, "Skywork AI Advances Multimodal Reasoning: Introducing Skywork R1V2 with Hybrid Reinforcement Learning", MarkTechPost, 2025-04-25. https://www.marktechpost.com/2025/04/25/skywork-ai-advances-multimodal-reasoning-introducing-skywork-r1v2-with-hybrid-reinforcement-learning/. Accessed 2026-05-21.
18. Skywork, "Skywork/Skywork-MoE-Base (model card)", Hugging Face, 2024-06-10. https://huggingface.co/Skywork/Skywork-MoE-Base. Accessed 2026-05-21.
19. Wei, T. et al., "Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models", arXiv:2406.06563, 2024-06-10. https://arxiv.org/abs/2406.06563. Accessed 2026-05-21.
20. SkyworkAI, "Skywork-OR1 (GitHub README)", GitHub, 2025-05-13. https://github.com/SkyworkAI/Skywork-OR1. Accessed 2026-05-21.
21. Wang, X. et al., "Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning", arXiv:2505.07263, 2025-05-12. https://arxiv.org/abs/2505.07263. Accessed 2026-05-21.
22. Skywork, "Skywork-Reward-V2: Leading the New Milestone for Open-Source Reward Models", PR Newswire, 2025-07-04. https://www.prnewswire.com/news-releases/skywork-reward-v2-leading-the-new-milestone-for-open-source-reward-models-302498377.html. Accessed 2026-05-21.
23. Music Business Worldwide, "China's $6B-valued Kunlun Tech debuts 'world's first' music reasoning model, claims it can outperform Suno", Music Business Worldwide, 2025-03-26. https://www.musicbusinessworldwide.com/chinas-6b-valued-kunlun-tech-debuts-worlds-first-music-reasoning-model-claims-it-can-outperform-suno/. Accessed 2026-05-21.
24. Qwen Team, "QwQ-32B (model card)", Hugging Face, 2025-03-06. https://huggingface.co/Qwen/QwQ-32B. Accessed 2026-05-21.
