Reka Flash
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,866 words
Reka Flash is a family of multimodal large language models developed by Reka AI, a San Francisco Bay Area research company founded in 2022 by former researchers from Google DeepMind, Meta FAIR, and Google. The Flash line sits in the middle of Reka's three-tier model series, positioned between the larger Reka Core and the compact Reka Edge. The original Reka Flash, introduced in February 2024, was a 21 billion parameter model designed to process text, images, video, and audio inputs while running at lower cost than frontier models of the time.
The series became more widely known in March 2025, when Reka released Reka Flash 3, a 21 billion parameter reasoning model published under the Apache 2.0 license on Hugging Face. Reka Flash 3 was the company's first fully open weights release and was positioned as a general-purpose reasoning model competitive with OpenAI's o1-mini at a fraction of the deployment cost. The release made Reka one of a small number of frontier-focused labs to publish open weight reasoning models in early 2025, alongside DeepSeek, Qwen, and Mistral AI.
Reka AI announced its model lineup in stages through 2023 and 2024. The company emerged from stealth in June 2023 with $58 million in funding from DST Global Partners, Radical Ventures, and Snowflake Ventures, and a pitch focused on building efficient, enterprise-deployable multimodal models from scratch. Its first public model, Yasa-1, shipped in October 2023 as a multimodal assistant capable of processing images, audio, and short video clips alongside text.
In February 2024 Reka rolled out a structured family of three models intended to cover different performance and cost points. Reka Edge, at roughly 7 billion parameters, targeted on-device and resource-constrained deployments. Reka Flash, at 21 billion parameters, served as the workhorse model for cost-sensitive production workloads. Reka Core, the largest model in the series, was designed to compete with frontier-class systems such as GPT-4 and Claude 3 Opus on multimodal benchmarks. All three were trained from scratch rather than fine-tuned from a third party base, which the company emphasized as a differentiator from labs that relied on Llama or Mistral checkpoints.
The original Reka Flash entered public beta on February 12, 2024 via the Reka Playground. At launch the model accepted text and images, with video and audio support arriving over the following months. Reka described Flash as a "turbo-class" model trained on approximately 4.5 trillion deduplicated and filtered language tokens spanning more than 32 languages, including English, Chinese, Japanese, Spanish, Arabic, and Hindi. The standard context length at release was 8,000 tokens, with a 128,000 token long-context variant added later for retrieval and long-document tasks.
Reka published headline benchmark results showing Flash outperforming Gemini Pro 1.0 on the MMLU and GPQA evaluations and reaching competitive scores on GSM8K and HumanEval. On multimodal evaluations including MMMU, VQA-v2, VATEX video captioning, and Perception Test video question answering, Flash was reported as competitive with Gemini Pro across all four benchmarks. In a blind text chat human evaluation Flash placed ahead of GPT-3.5 Turbo, Claude 2.1, Mixtral 8x7B, and Gemini Pro, and in multimodal chat human evaluation it ranked second only to GPT-4V.
The technical details for Flash, Core, and Edge were consolidated in a single arXiv paper, "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models", posted on April 18, 2024 (arXiv:2404.12387). The paper was authored by the 25-person Reka team and submitted by researcher Max Bain. It described the shared training pipeline, the multimodal evaluation methodology, and ablations on data mix decisions, although Reka did not disclose full architectural details such as the exact number of attention heads or hidden dimensions.
On October 4, 2024 Reka shipped a major update to Reka Flash that the company referred to internally as Flash v1.5. The update raised the model's quality score from 66.1 percent to 72.2 percent on internal evaluations and added a 43-point gain in internal Elo rating. On the public LMSYS Chatbot Arena leaderboard Reka Flash climbed from an Elo of 1148 to 1204, a 56-point gain.
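To give a sense of what a 56-point Arena gain means in practice, the sketch below converts a rating gap into an expected head-to-head win rate. It assumes the standard logistic Elo formula with a 400-point scale; the function name is illustrative, and the leaderboard's own fitting procedure may differ in detail.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: base-10 logistic curve with a 400-point scale, which is the
    conventional Elo definition, not something stated in the article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings reported for Reka Flash before and after the October 2024 update.
old_rating, new_rating = 1148, 1204

gain = new_rating - old_rating
p_new_beats_old = elo_expected_score(new_rating, old_rating)

print(f"Elo gain: {gain}")
print(f"Expected win rate of updated model vs. previous: {p_new_beats_old:.3f}")
```

Under that assumption, the updated model would be expected to win about 58 percent of head-to-head comparisons against its predecessor.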
The October release expanded multimodal coverage in several ways. Image inputs gained better OCR and support for arbitrary resolutions and aspect ratios. Video inputs grew from one minute to three to five minutes per clip and gained native audio understanding rather than a separate transcription pass. Speech became a first-class input modality, and an experimental English speech output mode was added. Reka also positioned the new Flash as an agent backbone, introducing function calling and structured output, lifting output-format instruction accuracy from 40.4 percent to 83.6 percent, and reporting a 51.8 percent score on its internal MegaTask agent benchmark compared to 40.4 percent for Gemini 1.5 Flash and 25.9 percent for GPT-4o mini. The update was deployed via Reka Chat, the Reka API, and an NVIDIA NIM microservice in partnership with Nvidia.
Reka Flash 3 was released on March 10, 2025 and represented a substantial reorientation of the Flash brand. The original Flash had been a closed-weights multimodal model offered through Reka's hosted API. Reka Flash 3 was instead a text-only reasoning model published as open weights on Hugging Face under the Apache 2.0 license, with full model files available for download, fine-tuning, and self-hosting.
Reka Flash 3 keeps the 21 billion parameter scale of the original Flash line, but the release notes describe it as trained from scratch as a reasoning-focused successor rather than a continuation of the v1.5 multimodal weights. That makes it roughly 35 percent smaller than Qwen QwQ-32B, which Reka identified as the closest open weight reasoning peer at release. At full BF16 precision the checkpoint occupies 39 GB on disk. Reka also shipped guidance for 4-bit quantization that compresses the model to roughly 11 GB while preserving most reasoning performance, compared with a minimum of about 18 GB for QwQ-32B.
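The size figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that weight storage is simply parameter count times bits per parameter and that the reported 39 GB figure is in binary gigabytes (GiB); both are assumptions, and the helper name is illustrative.

```python
GIB = 1024 ** 3  # bytes per binary gigabyte

FLASH3_PARAMS = 21e9  # parameter count reported for Reka Flash 3
QWQ_PARAMS = 32e9     # parameter count of Qwen QwQ-32B


def weight_bytes(num_params: float, bits_per_param: float) -> float:
    """Lower bound on weight storage: parameters times bits, ignoring overhead."""
    return num_params * bits_per_param / 8


bf16_gib = weight_bytes(FLASH3_PARAMS, 16) / GIB  # 16 bits per weight
int4_gib = weight_bytes(FLASH3_PARAMS, 4) / GIB   # 4 bits per weight
size_reduction = 1 - FLASH3_PARAMS / QWQ_PARAMS   # fraction fewer parameters

print(f"BF16 weights:  {bf16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB (lower bound)")
print(f"Fewer parameters than QwQ-32B: {size_reduction:.1%}")
```

The BF16 estimate lands close to the reported 39 GB, and the parameter reduction comes out at about 34 percent, consistent with the rounded 35 percent figure. The 4-bit estimate is a lower bound: quantized files also store scaling factors and usually keep some tensors at higher precision, which is one plausible reason the reported figure is nearer 11 GB.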
The context window is 32,000 tokens. The tokenizer is OpenAI's cl100k_base without any added special tokens, which simplifies integration with tools that already understand that vocabulary. The model uses a chat template based on human: and assistant: turns separated by a <sep> token, and generation stops on <sep> or <|endoftext|>. System prompts are prepended to the first user turn rather than carried as a separate role marker. Reka also published the model in a Llama-compatible weight layout so that downstream tooling such as Hugging Face Transformers and vLLM can load it without custom code paths.
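The chat template described above can be sketched as a small prompt builder. This is a minimal illustration, not Reka's reference implementation: the function names are hypothetical, and the exact whitespace around `<sep>` and between the system prompt and the first user message is an assumption that should be checked against the model card or the tokenizer's bundled chat template.

```python
SEP = "<sep>"
STOP_STRINGS = (SEP, "<|endoftext|>")


def build_prompt(turns, system_prompt=None):
    """Render (role, text) turns into a human:/assistant: prompt string.

    Assumptions: single spaces around <sep>, and the system prompt joined to
    the first human turn with a single space.
    """
    if not turns or turns[0][0] != "human" or turns[-1][0] != "human":
        raise ValueError("conversation must start and end with a human turn")
    parts = []
    for i, (role, text) in enumerate(turns):
        if role not in ("human", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        if i == 0 and system_prompt:
            # No separate system role: prepend to the first human turn.
            text = f"{system_prompt} {text}"
        parts.append(f"{role}: {text} {SEP}")
    parts.append("assistant:")  # cue the model to produce the next reply
    return " ".join(parts)


def truncate_at_stop(generated: str) -> str:
    """Cut generated text at the earliest stop string, if any is present."""
    cut = len(generated)
    for stop in STOP_STRINGS:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].rstrip()


prompt = build_prompt(
    [("human", "Hi"), ("assistant", "Hello"), ("human", "Bye")],
    system_prompt="Be brief.",
)
print(prompt)
```

In practice, loading the tokenizer through Hugging Face Transformers and using its chat template is the safer route; the sketch only makes the turn structure explicit.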
Reka described the training pipeline as a three-stage process. The first stage was large-scale pretraining on a mix of public web data and curated synthetic datasets. The second stage was supervised instruction tuning on Reka-authored and filtered third-party instruction data. The third stage applied reinforcement learning using REINFORCE Leave-One-Out (RLOO) with a combination of model-based reward models and rule-based reward signals, with what Reka described as a deliberate focus on general reasoning improvements rather than specializing the model on any single domain such as competition math or code. The training data was largely English with some multilingual coverage.
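The core idea of RLOO can be shown in a few lines: for each prompt, several completions are sampled, and each completion's reward is compared against the mean reward of the other samples. The sketch below shows only that baseline computation in its general form; it is not Reka's training code, and the function name and example rewards are illustrative.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled completions of one prompt.

    Each sample's baseline is the mean reward of the other k - 1 samples,
    so no learned value function is required.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt,
# e.g. 1.0 where a rule-based checker accepted the final answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

The advantages sum to zero within each group, so completions are reinforced only relative to their siblings: above-average samples are pushed up and below-average samples are pushed down.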
The most novel design choice was the budget forcing mechanism, built around a pair of <reasoning> and </reasoning> tags that delimit chain-of-thought output. Users or downstream applications can stop the model after a chosen number of reasoning tokens, force it to close its reasoning trace, and have it produce a final answer immediately. This is intended to give application builders explicit control over the latency and cost of reasoning without retraining, and it complements the trend toward inference-time scaling pioneered by OpenAI's o1 and DeepSeek's R1.
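A budget-forcing step can be sketched as a text transformation applied between generation calls. The example below is a simplified illustration under stated assumptions: it counts whitespace-separated words as a stand-in for real tokenizer tokens, assumes the partial output begins with the opening tag, and uses a hypothetical function name. A production version would count tokens with the model's tokenizer.

```python
OPEN_TAG = "<reasoning>"
CLOSE_TAG = "</reasoning>"


def apply_reasoning_budget(generated: str, max_reasoning_tokens: int) -> str:
    """Close the reasoning trace once it reaches the token budget.

    Returns text the caller can feed back to the model as a prefix so that
    generation continues with the final answer.
    """
    if CLOSE_TAG in generated:
        return generated  # the model already closed its trace on its own
    body = generated.split(OPEN_TAG, 1)[-1]
    tokens = body.split()  # crude token proxy, not the real tokenizer
    if len(tokens) < max_reasoning_tokens:
        return generated  # still within budget; keep generating
    kept = " ".join(tokens[:max_reasoning_tokens])
    return f"{OPEN_TAG} {kept} {CLOSE_TAG}"


partial = "<reasoning> first check the units then convert then sum"
forced = apply_reasoning_budget(partial, max_reasoning_tokens=4)
print(forced)
```

The caller would generate in chunks, pass the accumulated output through this function after each chunk, and resume generation from the returned string once the closing tag has been appended.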
Reka has not published a full architecture diagram for either Reka Flash or Reka Flash 3. The April 2024 technical report describes the family at a high level as decoder-only transformer language models with a paired vision encoder for image and video frames, training jointly on text and visual tokens. The vision pathway accepts images at arbitrary resolution, with each image converted to a sequence of patch tokens that are interleaved with text tokens in the model's input.
For Reka Flash 3, the Hugging Face model card lists the architecture as Llama-compatible at the weight format level, which implies the same general decoder-only transformer layout with RoPE positional embeddings, grouped-query attention, and SwiGLU feedforward blocks used by Llama and related families. Reka has not confirmed the exact number of layers, attention heads, or hidden dimension. At 21 billion parameters the model is smaller than other mid-size open models such as QwQ-32B and Gemma 27B.
Reported scores for the original Reka Flash from the Reka Core, Flash, and Edge technical report:
| Benchmark | Score | Domain |
|---|---|---|
| MMLU | 75.9 | General knowledge |
| GSM8K | 85.8 | Grade school math |
| HumanEval | 72.0 | Python coding |
| GPQA | 34.0 | Graduate science QA |
| MMMU | 53.3 | Multimodal college-level QA |
| VQA-v2 | 78.4 | Visual question answering |
| Multimodal chat Elo | 1082 | Blind human eval |
The technical report contextualized these numbers by showing that Flash outperformed several substantially larger models on equivalent evaluations, including Llama 2 70B, Grok-1, and Mistral Medium, while running closer to Gemini Pro 1.0 in cost.
Third-party benchmark coverage of Reka Flash 3 reported the following numbers:
| Benchmark | Score | Domain |
|---|---|---|
| AIME 2024 | 51.0 | Competition math |
| LiveCodeBench | 43.5 | Coding |
| MMLU-Pro | 65.0 to 66.9 | General knowledge (harder) |
| WMT'23 | 83.2 COMET | Multilingual translation |
| Intelligence Index (Artificial Analysis) | 10 | Composite |
Reka itself noted in the release blog post that Reka Flash 3 was "not the best model for knowledge-intensive tasks" and recommended pairing it with web search or retrieval systems for factual questions. The model performed best on reasoning-heavy benchmarks where the budget forcing mechanism could be tuned to allow longer chains of thought.
Artificial Analysis ranked Reka Flash 3 at position 101 of 125 evaluated models on its composite intelligence index as of mid-2025, where it scored 10 against a cohort median of 15. The same analysis noted that hosted pricing of $0.20 per million input tokens and $0.80 per million output tokens on Reka's own API made the model relatively expensive compared with other open weight models of similar size, although that comparison does not apply to self-hosted inference.
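The hosted prices quoted above translate into per-request costs as follows. This is a minimal sketch: the function name and the example token counts are illustrative, and it assumes reasoning tokens are billed as ordinary output tokens, which the article does not state.

```python
INPUT_USD_PER_MILLION = 0.20   # hosted price per million input tokens
OUTPUT_USD_PER_MILLION = 0.80  # hosted price per million output tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the quoted hosted prices.

    Assumption: reasoning tokens count toward output_tokens.
    """
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# Hypothetical request: 2,000 prompt tokens, 6,000 reasoning-plus-answer tokens.
cost = request_cost_usd(2_000, 6_000)
print(f"${cost:.4f} per request")
```

Because output tokens cost four times as much as input tokens, long reasoning traces dominate the bill, which is where capping the reasoning budget has the most effect.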
Reka Flash 3 was the first Reka model published with downloadable weights. The model card on Hugging Face lists the license as Apache 2.0, which permits commercial use, modification, and redistribution without per-token royalties or usage restrictions. The release also made clear that the checkpoint is suitable for fine-tuning and that derivative models can be released under different licenses.
The choice of Apache 2.0 placed Reka Flash 3 in the same licensing tier as Mistral 7B, Falcon, and OLMo, rather than the more restrictive Llama 2 community license or the custom licenses, with export and use clauses, attached to some DeepSeek and Qwen releases. For developers and research labs the practical effect is that Reka Flash 3 can be deployed in commercial products with minimal legal review.
The original Reka Flash and its October 2024 update remain closed weights and are accessible only through Reka's hosted API, the Reka Chat product, and the NVIDIA NIM partnership. Reka has not indicated whether the multimodal Flash weights will be opened in the future.
Deployment guidance for Reka Flash 3 lists three common operating points:
| Configuration | Approx. memory | Typical hardware |
|---|---|---|
| BF16 full precision | 39 GB | Single A100 80GB or two A100 40GB |
| 8-bit quantization | ~22 GB | Single A100 40GB |
| 4-bit quantization | ~11 GB | Single 24 GB consumer GPU (RTX 4090) or a 48 GB L40S |
Reka has confirmed compatibility with vLLM, Hugging Face Transformers, and, through community-published GGUF conversions, llama.cpp. The model has also been served through inference providers including Fireworks AI, Together AI, and DeepInfra.
| Model | Parameters | Weights | License | Multimodal | Reasoning mode | Context |
|---|---|---|---|---|---|---|
| Reka Flash (Feb 2024) | 21B | Closed | Reka API | Text, image, video, audio | No | 8K, 128K long |
| Reka Flash 3 (Mar 2025) | 21B | Open | Apache 2.0 | Text only | Yes (budget forcing) | 32K |
| GPT-4o mini | Undisclosed | Closed | OpenAI API | Text, image, audio | No | 128K |
| Claude 3 Haiku | Undisclosed | Closed | Anthropic API | Text, image | No | 200K |
| Qwen QwQ-32B | 32B | Open | Apache 2.0 | Text only | Yes | 32K |
| DeepSeek R1-Distill-Qwen-32B | 32B | Open | MIT | Text only | Yes | 128K |
Several caveats apply to the comparison. GPT-4o mini and Claude 3 Haiku do not disclose parameter counts, so direct size comparisons are not possible; they are listed here because Reka's own marketing positioned the original Reka Flash against them. Reka Flash 3 is smaller than Qwen QwQ-32B by 11 billion parameters and ships with broadly similar reasoning performance, which was the headline efficiency claim at release. The DeepSeek R1 distillation models occupy the same open weight reasoning niche and competed directly with Reka Flash 3 in benchmark coverage during spring 2025.
Reka Flash 3 was text-only at release. Developers who want a multimodal open weight model in the same size class generally turn to Qwen2-VL, Llama 3.2 Vision, or the Pixtral models from Mistral.
Responses to the original Reka Flash in early 2024 were measured. The model received favorable coverage on technical AI blogs and from MarkTechPost and VentureBeat, which highlighted the strong benchmark numbers for a model in the 21 billion parameter range. Reviewers noted that the multimodal coverage of image, video, and audio in a single model of that size was uncommon at the time, with Gemini Pro 1.0 being the most direct comparison point. The closed-weights distribution and reliance on Reka's API limited independent verification of the published benchmarks.
Reka Flash 3 attracted more discussion in March 2025, in part because the open weights made independent evaluation straightforward. Coverage on MarkTechPost, DigiAlps, and several Medium technical posts emphasized the budget forcing mechanism as a practical feature for production deployments. The Hugging Face community produced quantizations within days of release and integrated the model into common inference stacks.
Critical reactions focused on three points. First, the model's MMLU-Pro score of around 65 to 67 was below the leading open weight reasoning models on knowledge-heavy benchmarks, and Reka itself acknowledged this limitation. Second, the text-only scope was a step back from the original Flash's multimodal capability, which some observers considered a strategic retreat. Third, the 32,000 token context window was shorter than the 128,000 or longer windows offered by several peers in 2025, which limited use for long-document analysis without retrieval augmentation.
For Reka the release served a different strategic purpose than chasing top benchmark scores. The company had spent 2023 and 2024 building a closed-source API business, and the Apache 2.0 release of Reka Flash 3 broadened developer awareness of the Reka stack ahead of the company's July 2025 funding round, in which Reka raised $110 million at a valuation above one billion dollars led by Nvidia and Snowflake. By mid-2025 Reka Flash 3 had become a commonly cited reference point for sub-30 billion parameter open weight reasoning models, alongside the DeepSeek distillations and Qwen QwQ.