Skywork R1V

Updated Jun 8, 2026

Skywork R1V is a family of open-weight multimodal vision-language models built for chain-of-thought reasoning, developed by Skywork AI, the large-model team of the Chinese technology company Kunlun Tech (Kunlun Wanwei). The series brings the long, step-by-step "thinking" style popularized by text reasoning models such as DeepSeek-R1 to inputs that combine images and text, letting the model reason through math and science diagrams, charts, and other visual problems before answering. Three open-source generations were released over 2025: Skywork-R1V in March, Skywork-R1V2 in April, and Skywork-R1V3 in July, each a 38-billion-parameter model published with weights, inference code, and a technical report on Hugging Face and GitHub.^[1]^[2]^[3] A fourth, closed-source agentic variant, Skywork-R1V4-Lite, followed in November 2025.^[4]

Overview

The R1V models address a gap that opened in early 2025: text-only reasoning models like DeepSeek-R1 and OpenAI's o1 showed that long reinforcement-learned chains of thought could dramatically improve performance on math and code, but those gains did not automatically extend to problems posed as images. Skywork's stated goal was to transfer this reasoning ability into the multimodal setting efficiently, without retraining a large language backbone from scratch, so that a model could examine a figure or diagram and reason about it in the same deliberate, multi-step way.^[1]^[5]

Each generation is a 38B-parameter model that pairs a vision encoder with a reasoning-capable language backbone, connected by a lightweight projector or "connector" module, and is post-trained with reinforcement learning tailored to reasoning. Skywork releases the models under the permissive MIT license and reports benchmark results positioning them among the strongest open multimodal reasoners; as with all vendor-reported figures, the numbers below are Skywork's own claims.^[2]^[3]^[6]

Skywork and Kunlun Tech

Kunlun Tech, also known as Kunlun Wanwei, is a Beijing-based technology company founded in 2008 and listed on the Shenzhen Stock Exchange since 2015. Its Skywork (Tiangong) team builds a broad portfolio of open-weight and proprietary AI systems.^[7] Earlier work includes Skywork-13B, a bilingual (Chinese and English) large language model pre-trained on more than 3.2 trillion tokens of text and code and released in 2023, and Skywork-MoE, a 146B-parameter mixture-of-experts model with roughly 22B activated parameters that was upcycled from the dense Skywork-13B checkpoints.^[8]

The R1V series sits alongside Skywork's text reasoning line. In late 2024 the team released Skywork-o1, an early Chinese reasoning model, and in April 2025 it open-sourced the Skywork-OR1 (Open Reasoner 1) series, including math- and code-focused 7B and 32B models whose performance Skywork reported as approaching DeepSeek-R1 on reasoning tasks.^[5]^[9] Where OR1 targets text, R1V extends the same reasoning philosophy to vision-language inputs.

The R1V models

Skywork-R1V

The first model, Skywork-R1V, was open-sourced in March 2025, with an accompanying technical report ("Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought," arXiv 2504.05599) posted in April 2025.^[1] It combines an InternViT-6B vision encoder (InternViT-6B-448px-V2_5) with a DeepSeek-R1-Distill-Qwen-32B language backbone, joined by a lightweight visual projector. The central idea is that reasoning capability can be transferred from the text LLM into the multimodal model by training only the projector and aligning the two modalities, rather than retraining the encoder or the backbone. Skywork combined iterative supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO) and introduced an adaptive-length chain-of-thought distillation method to generate reasoning data and control how long the model "thinks."^[1]^[6]
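
The data flow described above can be illustrated with a toy sketch. It is not Skywork's code: the tiny dimensions, the two-layer MLP with a ReLU, the random weights, and names such as `project_visual_tokens` are all assumptions made for illustration. Only the overall layout (a fixed encoder, a small trainable projector, and a fixed language backbone) comes from the description above.

```python
import random

def matvec(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]

def make_matrix(rows, cols, rng):
    """Small random matrix standing in for learned projector weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project_visual_tokens(visual_tokens, w1, w2):
    """Map each vision-encoder token into the language model's embedding space."""
    return [matvec(w2, relu(matvec(w1, tok))) for tok in visual_tokens]

rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # toy sizes, not the real ones

# Hypothetical projector weights: the only parameters that would be trained.
w1 = make_matrix(HIDDEN_DIM, VISION_DIM, rng)
w2 = make_matrix(LLM_DIM, HIDDEN_DIM, rng)

# Stand-ins for encoder output (4 image tokens) and text embeddings (3 tokens).
visual_tokens = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(3)]

projected = project_visual_tokens(visual_tokens, w1, w2)

# The backbone sees one sequence: projected image tokens followed by text tokens.
llm_input = projected + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

Because the projected image tokens have the same width as the text embeddings, the language backbone can consume them without any change to its own weights, which is what makes projector-only alignment possible in principle.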

Skywork-R1V2

Skywork-R1V2 was released on April 24, 2025, with the report "Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning" (arXiv 2504.16656).^[2] It uses an InternViT-6B vision encoder paired with the QwQ-32B reasoning backbone. The headline contribution is a hybrid reinforcement-learning recipe that jointly applies Mixed Preference Optimization (MPO), which blends a reward model (R1V-RM) with rule-based constraints such as format and factual correctness, and GRPO, which scores candidate answers relative to others in the same group. To keep training efficient, R1V2 adds a Selective Sample Buffer (SSB) that caches high-value examples and reintroduces them to counter GRPO's "vanishing advantages" problem, where many sampled answers receive near-zero learning signal. Skywork reported that MPO substantially lowered the model's hallucination rate (8.7%) compared with Direct Preference Optimization (12.6%) and plain SFT (18.4%).^[2]
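
The "vanishing advantages" issue and the role of a sample buffer can be sketched in a few lines. This is a simplified illustration, not the R1V2 implementation: the standardization formula is the commonly described GRPO group-relative advantage, while the buffer's capacity, threshold, and highest-magnitude-first refill policy are assumptions, as are all class and function names.

```python
from collections import deque
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

class SelectiveSampleBuffer:
    """Toy buffer that caches samples whose advantage is clearly non-zero."""

    def __init__(self, capacity=1024, threshold=1e-3):
        self.items = deque(maxlen=capacity)  # oldest entries drop out first
        self.threshold = threshold

    def add_group(self, prompt, responses, advantages):
        """Cache only the responses that carry a usable learning signal."""
        for response, adv in zip(responses, advantages):
            if abs(adv) > self.threshold:
                self.items.append((prompt, response, adv))

    def refill(self, batch, target_size):
        """Top up a batch with cached samples, largest |advantage| first."""
        needed = max(0, target_size - len(batch))
        cached = sorted(self.items, key=lambda item: abs(item[2]), reverse=True)
        return batch + cached[:needed]

buffer = SelectiveSampleBuffer()

# Mixed outcomes: half the group is rewarded, so advantages are informative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
buffer.add_group("prompt-A", ["a1", "a2", "a3", "a4"], mixed)

# Every answer rewarded equally: all advantages collapse to zero.
solved = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
buffer.add_group("prompt-B", ["b1", "b2", "b3", "b4"], solved)

# The prompt-B group contributes nothing, so the batch is refilled from the cache.
refilled = buffer.refill(batch=[], target_size=2)
print(solved, len(buffer.items), len(refilled))
```

When every sampled answer in a group earns the same reward, the standardized advantages are all zero and the group yields no gradient; reusing cached informative samples is one way to keep each update batch useful.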

Skywork-R1V3

Skywork-R1V3-38B was open-sourced on July 9, 2025, with its technical report (arXiv 2507.06167) following on July 11, 2025.^[3]^[10] Unlike its predecessors, R1V3 is built on the InternVL3-38B architecture and concentrates its gains in post-training: Skywork describes the recipe as relying mainly on a reinforcement-learning stage, preceded by a "fine-grained cold-start" supervised phase that prepares the model for RL. The team constructed a high-quality multimodal reasoning training set using rejection sampling and distilled cold-start data from the previous generation. R1V3 can adapt the length of its reasoning chain to input difficulty to avoid "overthinking" simple questions. Skywork positioned R1V3's reported MMMU result as setting a new open-source record and as approaching human-expert level on that benchmark.^[3]^[11]
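
As a rough illustration of how rejection sampling can filter reasoning data, the sketch below draws several candidate answers and keeps only those whose final answer matches a reference. The acceptance rule, the "Answer:" output format, the preference for shorter correct chains, and the canned `toy_generate` function are all assumptions for this example; the report's actual filtering criteria may differ.

```python
from itertools import cycle

def extract_final_answer(response, marker="Answer:"):
    """Return the text after the last marker, or None if the marker is absent."""
    if marker not in response:
        return None
    return response.rsplit(marker, 1)[1].strip()

def rejection_sample(problem, reference, generate, num_candidates=8, keep=2):
    """Sample candidates, reject wrong or unparseable ones, keep the shortest."""
    kept = []
    for _ in range(num_candidates):
        response = generate(problem)
        is_correct = extract_final_answer(response) == reference
        if is_correct and response not in kept:  # also drop exact duplicates
            kept.append(response)
    kept.sort(key=len)  # assumed preference: shorter correct chains first
    return kept[:keep]

# Hypothetical stand-in for a model: cycles through fixed responses.
_canned = cycle([
    "The triangle has base 6 and height 4, so area = 6 * 4 / 2. Answer: 12",
    "Area = 6 * 4. Answer: 24",
    "Half of base times height. Answer: 12",
    "I cannot read the diagram.",
])

def toy_generate(problem):
    return next(_canned)

cold_start = rejection_sample(
    "What is the area of the triangle in the figure?", "12", toy_generate
)
for example in cold_start:
    print(example)
```

Only verified chains survive the filter, so the resulting set can serve as supervised cold-start data before reinforcement learning begins.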

Skywork-R1V4-Lite

In November 2025 Skywork introduced Skywork-R1V4-Lite, a smaller and more agentic successor built on the Qwen3-VL-30B-A3B-Instruct base (a 30B mixture-of-experts model with roughly 3B activated parameters). Unlike the earlier open-weight releases, R1V4-Lite was offered as a closed-source API service through the Skywork platform, adding tool-use features such as code execution and web search.^[4] Because it is not open-weight, it represents a departure from the open-source ethos of R1V through R1V3.

Multimodal reasoning approach

Across the open-weight generations, Skywork's design treats the vision encoder and the language backbone as largely fixed, capable components and focuses effort on (1) aligning them through a small connector and (2) eliciting reasoning through reinforcement learning rather than expensive multimodal pre-training. This "efficient transfer" philosophy is what lets a 38B model inherit the reasoning behavior of a strong text model while gaining the ability to see.^[1]^[6]
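
A back-of-the-envelope parameter budget shows why this approach is cheap relative to full retraining. The 6B and 32B figures are the nominal sizes implied by the component names for the first-generation model; the projector size of 0.1B is a placeholder assumption, since the article does not give one, and the `Component` type is purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    name: str
    params_billion: float
    trainable: bool

components = [
    Component("vision encoder (InternViT-6B)", 6.0, trainable=False),
    Component("projector (size assumed)", 0.1, trainable=True),
    Component("language backbone (32B)", 32.0, trainable=False),
]

total = sum(c.params_billion for c in components)
trainable = sum(c.params_billion for c in components if c.trainable)

print(f"total: about {total:.1f}B parameters")
print(f"trainable: about {trainable:.1f}B ({trainable / total:.2%} of total)")
```

Under these assumptions the nominal components add up to roughly the 38B total quoted for the models, while the portion updated during projector-only alignment is well under one percent of it.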

The reinforcement-learning methods evolved across versions: R1V used iterative SFT plus GRPO with adaptive-length CoT distillation; R1V2 introduced the MPO-plus-GRPO hybrid with the Selective Sample Buffer; and R1V3 emphasized cold-start fine-tuning followed by RL on a curated, rejection-sampled reasoning dataset. A recurring theme is controlling reasoning length so the model spends more tokens on hard visual-math problems and fewer on easy ones.^[2]^[3]

Specifications

The table summarizes the open-weight R1V generations. Benchmark figures are Skywork's reported values and use the metrics named in each report.

Attribute	Skywork-R1V	Skywork-R1V2	Skywork-R1V3
Open-source release	March 2025	April 24, 2025	July 9, 2025
Technical report	arXiv 2504.05599	arXiv 2504.16656	arXiv 2507.06167
Parameters	38B	38B	38B
Vision encoder	InternViT-6B (V2_5)	InternViT-6B	from InternVL3-38B
Language backbone	DeepSeek-R1-Distill-Qwen-32B	QwQ-32B	from InternVL3-38B
Training emphasis	Iterative SFT + GRPO, adaptive CoT distillation	Hybrid RL: MPO + GRPO + SSB	Cold-start SFT + RL post-training
License	MIT	MIT	MIT
MMMU (val)	69.0	73.6	76.0
MathVista (mini)	67.5	74.0	77.1

Sources: ^[1]^[2]^[3]^[6].
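
The generation-over-generation changes implied by the table can be computed directly. The scores below are copied from the table (Skywork's reported values); the helper name `generation_gains` is just for this example.

```python
scores = {
    "MMMU (val)": {"R1V": 69.0, "R1V2": 73.6, "R1V3": 76.0},
    "MathVista (mini)": {"R1V": 67.5, "R1V2": 74.0, "R1V3": 77.1},
}

def generation_gains(row):
    """Point gain between each consecutive pair of generations."""
    names = list(row)
    return {
        f"{prev}->{curr}": round(row[curr] - row[prev], 1)
        for prev, curr in zip(names, names[1:])
    }

for benchmark, row in scores.items():
    gains = generation_gains(row)
    overall = round(row["R1V3"] - row["R1V"], 1)
    print(benchmark, gains, "overall:", overall)
```

On these reported figures, both benchmarks show a larger jump from R1V to R1V2 than from R1V2 to R1V3.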

Benchmarks

All scores below are reported by Skywork in the respective technical reports and Hugging Face model cards, and should be read as vendor claims.

For Skywork-R1V, Skywork reported 69.0 on MMMU (validation) and 67.5 on MathVista (mini), alongside strong text-math results of 72.0 on AIME 2024 and 94.0 on MATH-500, illustrating that the model retained the backbone's reasoning while gaining visual capability.^[1]^[6]

For Skywork-R1V2, the reported figures include 73.6 on MMMU, 52.0 on MMMU-Pro, 74.0 on MathVista, 62.6 on OlympiadBench, 78.9 on AIME 2024, and 63.6 on LiveCodeBench. Skywork stated that the MMMU result outperformed several proprietary systems available at the time, such as Claude 3.5 Sonnet (70.4) and Gemini 2 Flash (70.7).^[2]

For Skywork-R1V3, Skywork reported 76.0 on MMMU (validation), 77.1 on MathVista (mini), 78.5 on the MMK12 multimodal reasoning set, and 59.6 on MathVerse (vision-only). The company highlighted the 76.0 MMMU figure as a new open-source state of the art that, per its comparisons, edged out closed models including Claude 3.7 Sonnet (75.0) and GPT-4.5 (74.4) on that benchmark.^[3]^[11] Such cross-vendor comparisons depend on evaluation conditions and should be treated with caution.

Significance

Skywork R1V is one of the more visible examples of open multimodal reasoning models emerging from Chinese labs in 2025, a wave that also includes Alibaba's QVQ, OpenGVLab's InternVL line, and Moonshot's Kimi-VL. By open-sourcing 38B models under the MIT license, Skywork made visual chain-of-thought reasoning accessible to researchers and developers who could not run or fine-tune proprietary multimodal systems.^[2]^[3] Its technical contributions, particularly the efficient transfer of text reasoning into the multimodal setting and the MPO-plus-GRPO hybrid RL recipe with the Selective Sample Buffer, were positioned by Skywork as ways to raise reasoning quality while limiting hallucination.^[2]

The series also illustrates how quickly the open multimodal-reasoning field moved: across roughly four months in 2025, the reported MMMU figure for the family rose from 69.0 to 76.0, and the team shifted from a DeepSeek-R1-distilled backbone to QwQ-32B to the InternVL3 base. The November 2025 move to a closed-source, tool-using R1V4-Lite marked a turn toward agentic, productized deployment alongside the open research line.^[3]^[4]

References

1. Skywork AI. "Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought." arXiv:2504.05599, April 2025. https://arxiv.org/abs/2504.05599
2. Wang, Chris, et al. "Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning." arXiv:2504.16656, April 2025. https://arxiv.org/abs/2504.16656
3. Shen, Wei, et al. "Skywork-R1V3 Technical Report." arXiv:2507.06167, July 2025. https://arxiv.org/abs/2507.06167
4. "Kunlun Wanwei releases Skywork-R1V4-Lite." Skywork-R1V GitHub repository (release notes), November 18, 2025. https://github.com/SkyworkAI/Skywork-R1V
5. "Kunlun Wanwei Open-Sources Skywork-OR1 Series Models: Exceptional Mathematical and Coding Capabilities." AIBase, April 2025. https://news.aibase.com/news/17080
6. "Skywork/Skywork-R1V-38B." Hugging Face model card. https://huggingface.co/Skywork/Skywork-R1V-38B
7. "Kunlun Tech Launched the 'SkyWork' Large Language Model." Kunlun Tech / PR Newswire, June 2023. https://www.prnewswire.com/news-releases/kunlun-tech-launched-the-skywork-large-language-model-and-was-selected-into-the-list-of-chinas-next-tens-of-billions-of-aigc-products-301836921.html
8. "SkyworkAI/Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models." GitHub. https://github.com/SkyworkAI/Skywork-MoE
9. "Kunlun Tech Launches Skywork o1, a Sophisticated AI Model with Advanced Reasoning Capabilities." TMTPost, 2024. https://en.tmtpost.com/news/7355212
10. "Skywork/Skywork-R1V3-38B." Hugging Face model card. https://huggingface.co/Skywork/Skywork-R1V3-38B
11. "Kunlun Wanwei releases and open-sources Skywork-R1V 3.0, multimodal reasoning capability approaches human expert level." 1ai.net, July 2025. https://www.1ai.net/en/39088.html
