# Skywork-R1V

> Source: https://aiwiki.ai/wiki/skywork_r1
> Updated: 2026-06-09
> Categories: Chinese AI, Multimodal AI, Reasoning Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

## Skywork-R1V

**Skywork-R1V** is an open-weight family of multimodal reasoning models released by Skywork AI, the AGI and AIGC division of Beijing Kunlun Tech Co., Ltd. (Kunlun Wanwei).[^1][^2] First announced on March 18, 2025, the initial Skywork-R1V-38B extended the chain-of-thought reasoning paradigm of [DeepSeek-R1](/wiki/deepseek_r1) from text into vision by coupling an InternViT vision encoder with a [DeepSeek-R1-Distill](/wiki/deepseek_r1_distill) language backbone through a lightweight projector.[^1][^3] Skywork promoted the model as the industry's first open-source multimodal reasoning model with visual [chain-of-thought](/wiki/chain_of_thought) capabilities.[^2][^4] The line has since grown to include Skywork-R1V2 (April 2025), Skywork-R1V3-38B (July 2025), and Skywork-R1V4-Lite (November 2025). The open-weight releases (R1V, R1V2, and R1V3) are published under the MIT License, with AWQ and GGUF quantizations and arXiv technical reports also publicly available; R1V4-Lite, by contrast, is closed-source and API-only.[^1][^5][^6][^7]

### Infobox

| Property | Value |
|---|---|
| Developer | Skywork AI (Kunlun Tech / Kunlun Wanwei) |
| First public release | March 18, 2025 (Skywork-R1V-38B)[^1][^2] |
| Total parameters (R1V/R1V2/R1V3) | 38 billion[^3][^5][^7] |
| Vision encoder | InternViT-6B-448px-V2_5[^3][^5] |
| Language backbone (R1V) | DeepSeek-R1-Distill-Qwen-32B[^3] |
| Language backbone (R1V2) | Qwen/QwQ-32B[^5] |
| Base model (R1V3) | InternVL3-38B (pretrained)[^7] |
| Connector | Lightweight MLP visual projector[^8] |
| License | MIT[^3][^5][^7] |
| arXiv (R1V) | 2504.05599[^8] |
| arXiv (R1V2) | 2504.16656[^9] |
| GitHub | SkyworkAI/Skywork-R1V[^1] |
| HuggingFace org | Skywork[^10] |

## Background

### Kunlun Tech and Skywork AI

Beijing Kunlun Tech Co., Ltd. (Kunlun Wanwei, ticker SZ:300418) is a Beijing-based internet company founded in 2008 by Zhou Yahui that operates businesses in distribution, social networking, games, and, since the early 2020s, generative AI.[^11] In June 2023, the company launched the "Tiangong" / "Skywork" large language model brand and was included on China's "Next Tens of Billions of AIGC Products" list.[^11] In October 2023 it open-sourced the Skywork-13B bilingual foundation model under the Skywork Community License, accompanied by the 150B-token SkyPile Chinese corpus.[^12][^13] Through later integration, the AGI and AIGC business was consolidated under the Skywork AI subsidiary, which also produces the SkyReels video models, the SkyMusic / Mureka music platform, and the Skyo real-time voice assistant.[^11][^14]

### From text reasoning to multimodal reasoning

The release of DeepSeek-R1 in January 2025 and its companion DeepSeek-R1-Distill series demonstrated that long chain-of-thought reasoning, trained primarily with rule-based reinforcement learning, could transfer to dense student models in the 1.5B to 70B parameter range.[^15] Skywork-R1V was conceived as the multimodal extension of that paradigm: rather than retraining a vision-language model from scratch, Skywork's researchers attached an existing vision tower to a reasoning-distilled language model and fine-tuned only the connecting modules and the language model's vision-conditioned behavior.[^8][^3] The team frames this as an "efficient multimodal transfer method" that preserves the textual reasoning of the R1-series LLM while granting it the ability to read images.[^8]

## Skywork-R1V (March 2025)

Skywork-R1V-38B was released on March 18, 2025 via the SkyworkAI GitHub organization and the `Skywork/Skywork-R1V-38B` repository on Hugging Face, with a follow-on AWQ quantized release on March 26, 2025 enabling single-GPU inference on accelerators with at least 30 GB of memory.[^1][^3] The technical report, titled *[Skywork R1V](/wiki/skywork_r1v): Pioneering Multimodal Reasoning with Chain-of-Thought* by Peng et al., was posted to arXiv as 2504.05599 on April 8, 2025.[^8]

### Architecture

The model is a modular [vision-language model](/wiki/vision_language_model) in which a frozen vision tower, a learned multilayer perceptron (MLP) adapter, and a reasoning-capable [LLM](/wiki/llm) are wired in series.[^3][^8] Specifically:

- The visual backbone is **InternViT-6B-448px-V2_5**, a roughly 6-billion-parameter [Vision Transformer](/wiki/vision_transformer) derived from Shanghai AI Lab's [InternVL](/wiki/internvl) family that ingests images at 448-pixel input resolution.[^3][^5]
- The language backbone is **DeepSeek-R1-Distill-Qwen-32B**, the 32B DeepSeek-R1-Distill checkpoint built on the [Qwen](/wiki/qwen) 2.5 architecture.[^3][^8]
- A lightweight MLP **visual projector** maps the vision encoder's output space into the language model's input space; this projector is the only component initialized from scratch.[^8]

Adding the 6B vision tower to the 32B language model gives an aggregate parameter count of approximately 38B, which the team uses as the canonical model size.[^3] The full system processes interleaved image-text inputs with a 16,384-token context length.[^8]
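
As a rough illustration of the wiring described above, the following Python sketch pushes a few toy "vision tokens" through a two-layer MLP projector and places the result in front of some text embeddings. The dimensions, the ReLU activation, the two-layer depth, the image-before-text ordering, and every function name are illustrative assumptions rather than values from the R1V report; only the encoder, projector, LLM order and the 6B + 32B parameter arithmetic come from the text.

```python
import random

# Toy sizes chosen for readability; the real hidden sizes are far larger.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12


def make_linear(in_dim, out_dim, rng):
    """Create a dense layer as (weights, bias) with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    """Apply a dense layer to one vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def project(vision_tokens, fc1, fc2):
    """Map each vision token into the language model's embedding width."""
    return [linear(fc2, relu(linear(fc1, token))) for token in vision_tokens]


rng = random.Random(0)
fc1 = make_linear(VISION_DIM, HIDDEN_DIM, rng)
fc2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Stand-ins for the vision encoder output and the LLM's text embeddings.
vision_tokens = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]

# Assumed ordering: projected image tokens first, then text tokens.
llm_input = project(vision_tokens, fc1, fc2) + text_embeddings

# Parameter arithmetic from the text: ~6B vision tower + 32B LLM.
PARAMS_BILLIONS = {"vision encoder": 6, "language model": 32}
total_billions = sum(PARAMS_BILLIONS.values())
print(len(llm_input), "tokens of width", len(llm_input[0]), "| total ~", total_billions, "B")
```

The projector's own parameters are small next to the two towers, which is why the rounded total stays at about 38B.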

### Training pipeline

The R1V technical report describes a three-stage training recipe designed to graft visual grounding onto an already-reasoning-capable LLM without disturbing its text-only reasoning quality:[^8]

1. **MLP initialization (alignment proxy).** The InternViT-6B encoder is first aligned to a non-reasoning substitute language model, **Qwen2.5-32B-Instruct**, using only the MLP projector and a standard image-caption / VQA-style objective. This step trains the projector to translate vision tokens into a space the Qwen family can read.[^8]
2. **Model re-assembly.** The trained MLP is then transferred and used to splice the same InternViT to the reasoning-distilled DeepSeek-R1-Distill-Qwen-32B backbone. Because R1-Distill and Qwen2.5-Instruct share the same Qwen 2.5 architecture, the projected vision tokens slot in without requiring full retraining of either tower.[^8]
3. **Hybrid optimization.** The assembled model is trained for visual reasoning with a hybrid loop combining iterative [supervised fine-tuning](/wiki/supervised_fine-tuning) (four iterations of SFT on curated multimodal reasoning data) and [Group Relative Policy Optimization (GRPO)](/wiki/grpo), the same reinforcement learning algorithm used for DeepSeek-R1.[^8] Reported training hyperparameters include initial learning rate 2 x 10^-4, refinement learning rate 4 x 10^-5, and batch size 512.[^8]
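
The three stages above can be sketched as a sequence of configurations. This is a schematic under stated assumptions, not Skywork's training code: the `Assembly` class, the `projector` label, and the stage-3 schedule tuple are hypothetical names introduced for illustration, and the mapping of the two learning rates to specific iterations is not spelled out in the text, so they are stored only as named fields.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Assembly:
    """Which components are wired together and which ones receive gradients."""
    vision_encoder: str
    projector: str        # hypothetical label for the projector weights
    language_model: str
    trainable: tuple


# Stage 1: align the projector against a non-reasoning proxy LLM.
stage1 = Assembly(
    vision_encoder="InternViT-6B-448px-V2_5",
    projector="mlp-projector",
    language_model="Qwen2.5-32B-Instruct",
    trainable=("projector",),
)

# Stage 2: re-assembly. Same encoder and projector, reasoning LLM swapped in.
# No training happens in this step, so nothing is marked trainable.
stage2 = replace(
    stage1,
    language_model="DeepSeek-R1-Distill-Qwen-32B",
    trainable=(),
)

# Stage 3: hybrid optimization, read here as four SFT iterations then GRPO.
# The exact interleaving is an assumption based on the description above.
STAGE3_SCHEDULE = ("sft",) * 4 + ("grpo",)
STAGE3_HPARAMS = {
    "initial_learning_rate": 2e-4,
    "refinement_learning_rate": 4e-5,
    "batch_size": 512,
}

for name, stage in (("stage1", stage1), ("stage2", stage2)):
    print(name, "->", stage.language_model, "| trainable:", stage.trainable)
print("stage3 ->", STAGE3_SCHEDULE)
```

The point of the sketch is the swap in stage 2: the projector trained against the proxy is carried over unchanged, which only works because both language models share the Qwen 2.5 architecture.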

Layered on top of these stages is an **adaptive-length chain-of-thought distillation** procedure that dynamically adjusts the length of the reasoning trace to avoid "overthinking" on simple visual questions while preserving long traces for hard ones.[^4][^8] Together, these are the two contributions the report highlights in its abstract: a hybrid SFT+GRPO optimization strategy and adaptive-length CoT distillation.[^8]

### Benchmark results

Skywork reports the following results for Skywork-R1V-38B, mixing standard multimodal benchmarks with text-only reasoning benchmarks to verify that the vision grafting does not degrade the underlying R1-distill reasoning ability.[^3][^8]

| Benchmark | Skywork-R1V-38B |
|---|---|
| [MMMU](/wiki/mmmu) (val) | 69.0[^3][^8] |
| [MathVista](/wiki/mathvista) (mini) | 67.5[^3][^8] |
| [GPQA](/wiki/gpqa) (pass@1) | 61.6[^3] |
| [MATH-500](/wiki/math_500) (pass@1) | 94.0[^3][^8] |
| [AIME 2024](/wiki/aime_2024) (pass@1) | 72.0[^3][^8] |

Skywork's own write-up positions the MMMU and MathVista numbers as comparable to closed-source reasoning systems such as [Gemini 2.0](/wiki/gemini) and [Kimi K1.5](/wiki/kimi_k1_5) for visual question answering, while the MATH-500 and AIME 2024 numbers are intended to show that text-only reasoning is preserved relative to the underlying DeepSeek-R1-Distill-Qwen-32B checkpoint.[^2][^3][^4] Independent benchmark trackers and the model card on Hugging Face report the same headline numbers.[^3][^16]

The phrase "Cross-modal Self-Iterative Adaptive Reasoning" does not appear in the R1V technical report or in the model card; the closest documented constructs are the iterative SFT + GRPO loop and the adaptive-length CoT distillation described above.[^8]

## Skywork-R1V2 (April 2025)

Skywork-R1V2 was announced on April 24, 2025, with the technical report *Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning* (Wang et al., arXiv 2504.16656) posted on April 23, 2025.[^5][^9] An AWQ quantized release followed on April 28, 2025.[^1] The model retains the 38B size and the InternViT-6B-448px-V2_5 vision encoder of the original R1V, but switches the language backbone from DeepSeek-R1-Distill-Qwen-32B to Alibaba's [QwQ-32B](/wiki/qwq) reasoning model and overhauls the training procedure.[^5][^9]

### Hybrid reinforcement learning

R1V2 frames training as a hybrid reinforcement learning problem that jointly leverages two complementary objectives:[^9][^17]

- **Mixed Preference Optimization (MPO)**, a preference-learning objective that combines reward signals from the dedicated Skywork-VL Reward (R1V-RM) model with rule-based constraints on format correctness, factual consistency, and step-by-step reasoning completeness.[^9][^17]
- **Group Relative Policy Optimization (GRPO)**, the same on-policy RL algorithm used in R1V, retained to drive exploration of harder reasoning trajectories.[^9][^17]

The team also introduces a **Selective Sample Buffer (SSB)**, a replay mechanism that caches high-quality training examples with non-zero advantage and reintroduces them during later policy updates. SSB is presented as a fix for the "vanishing advantages" problem in GRPO, where most rollouts in a group end up with similar rewards and contribute weak gradients.[^9][^17] Caching and replaying informative samples increases gradient density and, according to the report, encourages deeper chains of reasoning.[^9][^17] The authors additionally report that overly strong reinforcement signals can induce visual hallucinations and discuss calibration thresholds to mitigate this trade-off between reasoning depth and visual faithfulness.[^9][^17]
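
A minimal sketch of the "vanishing advantages" problem and a buffer in the spirit of SSB is shown below. It assumes the common GRPO formulation in which each rollout's reward is normalized by its group's mean and standard deviation; the class name, the threshold, the capacity, and the oldest-first eviction rule are illustrative choices, not details taken from the R1V2 report.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Toy replay buffer that keeps only rollouts with non-zero advantage."""

    def __init__(self, capacity=1024, threshold=1e-3):
        self.capacity = capacity
        self.threshold = threshold
        self.items = []

    def add_group(self, prompt, rollouts, rewards):
        """Score one group and cache the rollouts that carry a learning signal."""
        advantages = group_relative_advantages(rewards)
        for rollout, advantage in zip(rollouts, advantages):
            if abs(advantage) > self.threshold:
                self.items.append((prompt, rollout, advantage))
        # Assumed eviction policy: keep only the most recent entries.
        self.items = self.items[-self.capacity:]
        return advantages

    def sample(self, k, rng):
        """Draw up to k cached rollouts to mix into a later policy update."""
        return rng.sample(self.items, min(k, len(self.items)))


buffer = SelectiveSampleBuffer()

# Every rollout earns the same reward: all advantages collapse to zero.
flat = buffer.add_group("q1", ["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0])

# Mixed rewards: every rollout has a non-zero advantage and is cached.
mixed = buffer.add_group("q2", ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 1.0])

replayed = buffer.sample(2, random.Random(0))
print("flat:", flat)
print("cached:", len(buffer.items), "replayed:", len(replayed))
```

In the uniform-reward group nothing is cached because no rollout contributes a gradient, while the mixed group fills the buffer; replaying from it is one way to keep informative samples in later batches.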

### Benchmark results

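The R1V2 model card and arXiv report give the following figures for Skywork-R1V2-38B, with R1V-38B numbers included for reference.[^5][^9][^17] The R1V-38B baselines listed here for MMMU and MathVista (68.0 and 67.0) are slightly lower than the 69.0 and 67.5 given in the R1V results table above, so the two tables should not be mixed when computing gains.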

| Benchmark | R1V2-38B | R1V-38B |
|---|---|---|
| [MMMU](/wiki/mmmu) (val) | 73.6 | 68.0[^5] |
| MathVista (mini) | 74.0 | 67.0[^5] |
| OlympiadBench | 62.6 | 40.4[^5][^9] |
| [AIME 2024](/wiki/aime_2024) | 78.9 | 72.0[^5][^9] |
| [LiveCodeBench](/wiki/livecodebench) | 63.6 | not reported[^5][^9] |
| [GPQA](/wiki/gpqa) | 61.6 | 61.6[^5] |

The largest jump is on OlympiadBench, where R1V2 lifts the score from 40.4 to 62.6 (+22.2 points). On MMMU the score climbs from 68.0 to 73.6 (+5.6 points), which Skywork frames as the highest then-reported score for any open-source 38B-class multimodal model.[^5][^17]
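
For readers who want the per-benchmark gains at a glance, the short script below recomputes them from the table. The scores are copied from the table as printed; LiveCodeBench is omitted because no R1V-38B value is reported.

```python
# (R1V2-38B, R1V-38B) scores copied from the comparison table.
SCORES = {
    "MMMU (val)": (73.6, 68.0),
    "MathVista (mini)": (74.0, 67.0),
    "OlympiadBench": (62.6, 40.4),
    "AIME 2024": (78.9, 72.0),
    "GPQA": (61.6, 61.6),
}

# Round to one decimal to match the precision of the reported scores.
gains = {name: round(new - old, 1) for name, (new, old) in SCORES.items()}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {gain:+.1f}")
```

Sorted this way, OlympiadBench leads by a wide margin, MathVista, AIME 2024, and MMMU follow with mid-single-digit gains, and GPQA is unchanged.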

## Later releases

### Skywork-R1V3-38B (July 2025)

Skywork-R1V3-38B was released on July 9, 2025 with the model card hosted at `Skywork/Skywork-R1V3-38B`.[^1][^7] Unlike R1V2's switch to QwQ-32B, R1V3 is built directly on the InternVL3-38B pretrained checkpoint and emphasizes post-training reinforcement learning rather than reasoning-focused pretraining.[^7] The R1V3 model card highlights several methodological choices: a fine-grained cold-start SFT used to prime the model for RL, a connector-only fine-tuning step that further boosts performance after RL, and an "Entropy of Critical Reasoning Tokens" metric used to select checkpoints.[^7] Reported benchmarks include 76.0 on [MMMU](/wiki/mmmu) (val), 77.1 on MathVista (mini), 55.4 on [MMMU-Pro](/wiki/mmmu-pro), and 28.5 on VisuLogic, which Skywork describes as state of the art among open multimodal reasoning models at the time of release.[^7]

### Skywork-R1V4-Lite (November 2025)

Skywork-R1V4-Lite was announced on November 18, 2025. Unlike its predecessors, it is closed-source, served only through Skywork's platform API and via OpenRouter, and is described as a lightweight reasoning model built on Qwen3-VL-30B-A3B-Instruct (a 30B mixture-of-experts model with about 3B activated parameters).[^1] Skywork emphasizes agentic capabilities: code execution, deep research via search-tool integration, streaming output, and multi-turn reasoning. Reported headline numbers include 91.8 on HRBench-4K FSP and 71.4 on MME-Real Overall.[^1]

## Skywork's broader catalog

Skywork-R1V sits inside a wider open-source release program that began with the company's 2023 bilingual base model. The table below summarizes the major model families the Skywork organization has published on Hugging Face and GitHub.[^10]

| Family | Year(s) | Description | Sources |
|---|---|---|---|
| Skywork-13B | 2023 | Bilingual (Chinese/English) [LLM](/wiki/llm) trained on 3.2T tokens; includes Base, Chat, and Math variants; released with SkyPile-150B corpus. | [^12][^13] |
| Skywork-MoE | 2024 | 146B-parameter [mixture-of-experts](/wiki/mixture_of_experts) model with 16 experts and ~22B active parameters; initialized from Skywork-13B; introduces gating logit normalization and adaptive auxiliary loss coefficients. | [^18][^19] |
| Skywork-o1-Open-PRM | late 2024 | Open process reward model series for step-by-step reasoning supervision, based on a Qwen-2.5-1.5B backbone. | [^10] |
| Skywork-R1V series | 2025 | Multimodal reasoning models: R1V and R1V2 pair InternViT with an R1-style reasoning LLM, R1V3 builds on InternVL3-38B, and R1V4-Lite builds on Qwen3-VL-30B-A3B-Instruct. | [^1] |
| Skywork-OR1 | April / May 2025 | "Open Reasoner 1" math and code reasoning models at 7B and 32B; trained with large-scale rule-based reinforcement learning; released under Apache 2.0. | [^20] |
| Skywork-VL Reward | May 2025 | Multimodal [VLM](/wiki/vlm) reward model based on Qwen2.5-VL-7B-Instruct with a value head; achieves state-of-the-art VL-RewardBench results. | [^21] |
| Skywork-Reward / Reward-V2 | 2024 / July 2025 | Text reward model series; the V2 release in July 2025 includes 8 models from 0.6B to 8B parameters and tops Reward Bench v1/v2, RM-Bench, JudgeBench and other reward benchmarks. | [^22] |
| Skywork-UniPic, UniPic2, UniPic3 | 2025 | Unified autoregressive vision-language image generation and multi-image composition models. | [^10] |
| Skywork-SWE-32B | 2025 | Software engineering reasoning model studying scaling laws for SWE tasks. | [^10] |
| SkyReels-V1 / V3 / V4 | 2025 | Human-centric video foundation and multimodal video-audio generation models. | [^10] |
| Mureka / Skyo | 2025 | Mureka O1 music reasoning model and the Skyo real-time voice assistant launched alongside Skywork 4.0. | [^14][^23] |

Note: Skywork has not, as of the references checked here, released a model branded as "Skywork-Audio." The company's audio and speech work instead appears under the **Skyo** voice assistant and the **Mureka** AI music brand.[^14][^23]

## Significance

Skywork-R1V is one of the earliest fully open-weight attempts to port the long-CoT, RL-trained reasoning paradigm popularized by [OpenAI o1](/wiki/o1) and DeepSeek-R1 into vision. Three aspects make it methodologically interesting:

1. **Reasoning-preserving multimodal transfer.** By freezing the vision tower and a reasoning-distilled LLM and training only an MLP projector before the joint SFT/GRPO loop, the team aims for "near-lossless" preservation of the underlying chain-of-thought behavior on text benchmarks like MATH-500 and AIME 2024 while still gaining visual grounding.[^3][^8]
2. **Hybrid RL for VLMs.** R1V2's combination of MPO with GRPO and the Selective Sample Buffer extends the GRPO recipe (originally text-only in DeepSeek) to vision-language settings, and surfaces the now-well-known trade-off between reasoning depth and visual hallucination under strong reward signals.[^9][^17]
3. **Open weights, MIT license.** Together with the contemporaneous Skywork-OR1, Skywork-VL Reward, and Skywork-Reward-V2 releases, R1V is part of an unusually transparent open ecosystem for reasoning research, including model weights, AWQ and GGUF quantizations, training procedures, and benchmark scripts.[^1][^10][^20][^21]

## Limitations and criticisms

The R1V technical report and the R1V2 follow-up explicitly note several limitations.[^8][^9][^17]

- **Visual hallucinations under strong RL signals.** R1V2 reports that excessively aggressive reinforcement signals push the model toward longer but more hallucinated reasoning traces, motivating calibrated reward thresholds.[^9][^17]
- **Dependence on a non-reasoning proxy.** R1V's projector is first aligned with Qwen2.5-32B-Instruct rather than the actual reasoning backbone, and the report acknowledges that this two-step transfer is partly a workaround for the difficulty of training projectors directly against an already-CoT-trained LLM.[^8]
- **Hardware footprint.** In BF16, the 38B-parameter R1V family requires roughly 80 GB of GPU memory for inference. AWQ and GGUF quantizations were released specifically to bring it onto single GPUs with 30 GB or more of memory and onto CPU inference setups, but the unquantized form remains heavy.[^1][^3]
- **Benchmark coverage gaps.** Some benchmarks named in third-party descriptions (for example, AI2D and OlympiadBench for the original R1V) are not reported in the official R1V technical report or model card, so cross-version comparisons must use only the benchmarks each report actually publishes.[^3][^8] For R1V, the documented set is MMMU, MathVista, MATH-500, AIME 2024, and GPQA; for R1V2, it adds OlympiadBench and LiveCodeBench.[^3][^5][^8][^9]
- **License nuance for R1V4-Lite.** R1V4-Lite is closed-source and API-only, breaking the open-weight pattern of R1V, R1V2, and R1V3, although its base (Qwen3-VL-30B-A3B-Instruct) remains Apache 2.0.[^1]
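
The hardware-footprint figures in the list above can be sanity-checked with a back-of-the-envelope estimate. The sketch below counts weights only and assumes 2 bytes per parameter for BF16 and 0.5 bytes per parameter for a 4-bit quantization; it ignores quantization scales, the KV cache, and activations, so real memory use will be higher than these numbers.

```python
PARAMS = 38e9  # approximate total parameter count stated for R1V, R1V2, and R1V3

# Assumed storage cost per parameter; 4-bit ignores scale/zero-point overhead.
BYTES_PER_PARAM = {
    "BF16": 2.0,
    "4-bit (AWQ-style)": 0.5,
}


def weight_memory_gb(params, bytes_per_param):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


footprint = {
    name: weight_memory_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gigabytes in footprint.items():
    print(f"{name:<20} ~{gigabytes:.0f} GB of weights")
```

Under these assumptions the BF16 weights alone come to about 76 GB, in line with the "roughly 80 GB" figure once runtime overhead is added, while 4-bit weights come to about 19 GB, which leaves headroom on a 30 GB accelerator.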

## Related work

Skywork-R1V's design sits at the intersection of three lines of work:

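- The **R1-style reasoning paradigm**: long chain-of-thought reasoners such as DeepSeek-R1, QwQ-32B, OpenAI o1, and [Kimi K1.5](/wiki/kimi_k1_5), trained largely with reinforcement learning on math, code, and science tasks; the DeepSeek-R1-Distill checkpoints that transfer those reasoning traces to smaller dense models; and GRPO, the RL algorithm used for DeepSeek-R1.[^15][^24]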
- The **InternViT / InternVL** family of open vision encoders from Shanghai AI Laboratory, whose InternViT-6B-448px-V2_5 encoder is reused by R1V, R1V2, and R1V3, and whose InternVL3-38B serves as the full base for R1V3.[^3][^7]
- Other contemporaneous Chinese open multimodal reasoning models such as [DeepSeek-VL2](/wiki/deepseek_vl2), [DeepSeek Janus](/wiki/deepseek_janus), [Qwen2.5-VL](/wiki/qwen2_5_vl), and [MiniCPM-V](/wiki/minicpm_v), which together populate the open-weights side of the [multimodal AI](/wiki/multimodal_ai) landscape.

## See also

- [DeepSeek-R1](/wiki/deepseek_r1)
- [DeepSeek-R1-Distill](/wiki/deepseek_r1_distill)
- [Group Relative Policy Optimization (GRPO)](/wiki/grpo)
- [Chain-of-Thought](/wiki/chain_of_thought)
- [QwQ](/wiki/qwq)
- [Qwen2.5-VL](/wiki/qwen2_5_vl)
- [InternVL](/wiki/internvl)
- [MMMU](/wiki/mmmu)
- [MathVista](/wiki/mathvista)
- [MATH-500](/wiki/math_500)
- [AIME 2024](/wiki/aime_2024)
- [LiveCodeBench](/wiki/livecodebench)
- [GPQA](/wiki/gpqa)
- [Vision language model](/wiki/vision_language_model)
- [Reasoning models](/wiki/reasoning_models)
- [Mixture of Experts (MoE)](/wiki/mixture_of_experts)
- [Knowledge Distillation](/wiki/knowledge_distillation)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Multimodal AI](/wiki/multimodal_ai)

## References

[^1]: SkyworkAI, "Skywork-R1V (GitHub README)", GitHub, 2025-11-18. https://github.com/SkyworkAI/Skywork-R1V. Accessed 2026-05-21.
[^2]: AIBase, "Kunlun Wanwei Open-Sources Skywork R1V Visual Reasoning Chain Model", AIBase, 2025-03-18. https://www.aibase.com/news/16387. Accessed 2026-05-21.
[^3]: Skywork, "Skywork/Skywork-R1V-38B (model card)", Hugging Face, 2025-04-08. https://huggingface.co/Skywork/Skywork-R1V-38B. Accessed 2026-05-21.
[^4]: AIBase, "Game Changer! Kunlun Wanwei's Skywork R1V Multimodal Reasoning Model Open-Sourced!", AIBase, 2025-03-19. https://www.aibase.com/news/16394. Accessed 2026-05-21.
[^5]: Skywork, "Skywork/Skywork-R1V2-38B (model card)", Hugging Face, 2025-04-23. https://huggingface.co/Skywork/Skywork-R1V2-38B. Accessed 2026-05-21.
[^6]: Skywork, "Skywork/Skywork-R1V2-38B-AWQ (model card)", Hugging Face, 2025-04-28. https://huggingface.co/Skywork/Skywork-R1V2-38B-AWQ. Accessed 2026-05-21.
[^7]: Skywork, "Skywork/Skywork-R1V3-38B (model card)", Hugging Face, 2025-07-09. https://huggingface.co/Skywork/Skywork-R1V3-38B. Accessed 2026-05-21.
[^8]: Peng, Y. et al., "Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought", arXiv:2504.05599, 2025-04-08. https://arxiv.org/abs/2504.05599. Accessed 2026-05-21.
[^9]: Wang, P. et al., "Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning", arXiv:2504.16656, 2025-04-23. https://arxiv.org/abs/2504.16656. Accessed 2026-05-21.
[^10]: Skywork, "Skywork organization page", Hugging Face, 2026-05-21. https://huggingface.co/Skywork. Accessed 2026-05-21.
[^11]: Kunlun Tech, "Kunlun Tech Launched The 'SkyWork' Large Language Model And Was Selected Into The List of China's 'Next Tens of Billions of AIGC Products'", PR Newswire, 2023-06-08. https://www.prnewswire.com/news-releases/kunlun-tech-launched-the-skywork-large-language-model-and-was-selected-into-the-list-of-chinas-next-tens-of-billions-of-aigc-products-301836921.html. Accessed 2026-05-21.
[^12]: Kunlun Tech, "Kunlun Tech releases open source 13B high-quality commercial large model, ahead of Llama2 and Baichuan2", Kunlun Tech News, 2023-10-30. http://www.kunlun.com/2023/en_mnews_1030/328.html. Accessed 2026-05-21.
[^13]: Wei, T. et al., "Skywork: A More Open Bilingual Foundation Model", arXiv:2310.19341, 2023-10-30. https://arxiv.org/abs/2310.19341. Accessed 2026-05-21.
[^14]: TMTPost, "Kunlun Tech Launches Skywork 4.0 AI Model and Skyo Real-Time Voice Assistant", TMTPost, 2025-04-18. https://en.tmtpost.com/news/7345973. Accessed 2026-05-21.
[^15]: DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948, 2025-01-22. https://arxiv.org/abs/2501.12948. Accessed 2026-05-21.
[^16]: Hugging Face, "Paper page: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought", Hugging Face Papers, 2025-04-08. https://huggingface.co/papers/2504.05599. Accessed 2026-05-21.
[^17]: MarkTechPost, "Skywork AI Advances Multimodal Reasoning: Introducing Skywork R1V2 with Hybrid Reinforcement Learning", MarkTechPost, 2025-04-25. https://www.marktechpost.com/2025/04/25/skywork-ai-advances-multimodal-reasoning-introducing-skywork-r1v2-with-hybrid-reinforcement-learning/. Accessed 2026-05-21.
[^18]: Skywork, "Skywork/Skywork-MoE-Base (model card)", Hugging Face, 2024-06-10. https://huggingface.co/Skywork/Skywork-MoE-Base. Accessed 2026-05-21.
[^19]: Wei, T. et al., "Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models", arXiv:2406.06563, 2024-06-10. https://arxiv.org/abs/2406.06563. Accessed 2026-05-21.
[^20]: SkyworkAI, "Skywork-OR1 (GitHub README)", GitHub, 2025-05-13. https://github.com/SkyworkAI/Skywork-OR1. Accessed 2026-05-21.
[^21]: Wang, X. et al., "Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning", arXiv:2505.07263, 2025-05-12. https://arxiv.org/abs/2505.07263. Accessed 2026-05-21.
[^22]: Skywork, "Skywork-Reward-V2: Leading the New Milestone for Open-Source Reward Models", PR Newswire, 2025-07-04. https://www.prnewswire.com/news-releases/skywork-reward-v2-leading-the-new-milestone-for-open-source-reward-models-302498377.html. Accessed 2026-05-21.
[^23]: Music Business Worldwide, "China's $6B-valued Kunlun Tech debuts 'world's first' music reasoning model, claims it can outperform Suno", Music Business Worldwide, 2025-03-26. https://www.musicbusinessworldwide.com/chinas-6b-valued-kunlun-tech-debuts-worlds-first-music-reasoning-model-claims-it-can-outperform-suno/. Accessed 2026-05-21.
[^24]: Qwen Team, "QwQ-32B (model card)", Hugging Face, 2025-03-06. https://huggingface.co/Qwen/QwQ-32B. Accessed 2026-05-21.