Gemini 1.5 Flash
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,323 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,323 words
Add missing citations, update stale details, or suggest a clearer explanation.
Gemini 1.5 Flash is a lightweight, low-latency multimodal large language model developed by Google DeepMind as part of the Gemini family. It was built as a faster, cheaper companion to Gemini 1.5 Pro, trained to mimic the larger model through a process called online distillation, and was aimed at high-volume, latency-sensitive workloads such as summarization, chat, captioning, and data extraction.[1][2] Google introduced it at the I/O developer conference on May 14, 2024.[1][3]
Google first detailed the Gemini 1.5 generation in February 2024, when it released Gemini 1.5 Pro as a model capable of reasoning over very long inputs. Three months later, at I/O 2024, the company expanded the line with Gemini 1.5 Flash, positioning it as the fastest Gemini model offered through its API.[1][3] The motivation was practical: developers running production applications frequently needed a model that responded quickly and cheaply at scale, and Pro was more capability than many of those tasks required. Flash was the answer, described by Google as a model that is "lighter weight than 1.5 Pro" yet "highly capable of multimodal reasoning across vast amounts of information."[2]
The model shipped initially in public preview through Google's developer surfaces, alongside Gemini 1.5 Pro's expansion to a 2 million token window and a refreshed lineup of smaller open Gemma 2 models announced at the same event.[2]
Gemini 1.5 Flash was not trained from scratch as an independent system. According to the Gemini 1.5 technical report, it is "a more lightweight variant designed for efficiency with minimal regression in quality," created through online distillation in which the smaller model learns to imitate the behavior of its more capable counterpart, Gemini 1.5 Pro.[4] In Google's own framing, distillation transfers "the most essential knowledge and skills from a larger model" into a smaller, more efficient one.[2]
The report describes Flash as a dense Transformer-based model, in contrast to the sparse mixture-of-experts approach used for the Pro tier. Google has not published an official parameter count for the standard Flash model, and this encyclopedia does not state one. The design goal was to preserve as much of Pro's quality as possible while cutting the cost and latency of serving, which is what makes the model attractive for high-frequency tasks.[2][4]
Gemini 1.5 Flash launched with the same 1 million token context window that defined the 1.5 generation, allowing it to ingest large documents, long transcripts, and extended multimedia inputs in a single request.[2] In research evaluations reported in the technical report, the 1.5 models maintained near-perfect retrieval, above 99 percent, on long-context tasks out to at least 10 million tokens, although the context made available to users through the public API was held at the 1 to 2 million token range.[4] For Flash specifically, the production limit was 1 million tokens, whereas Pro was later raised to 2 million.[2]
Flash accepts multimodal input, handling text, images, audio, and video, and returns text output.[2] Google highlighted summarization, chat applications, image and video captioning, and data and structured extraction from long documents and tables as representative use cases.[2] The combination of a long context window with low serving cost was the central selling point: developers could feed large amounts of material to the model and get fast responses suitable for interactive or batch processing.
The model also supported text tuning. On August 8, 2024, Google completed the rollout of Gemini 1.5 Flash tuning to all developers through both the Gemini API and Google AI Studio, letting teams fine-tune the base model on their own data.[5]
Independent and reported benchmark figures placed Flash close to Pro on general knowledge tests despite its smaller footprint, consistent with the technical report's claim of minimal quality regression.[4] In September 2024, Google released updated -002 builds of both Flash and Pro and reported the following gains for the refreshed models over their May versions:[6]
| Benchmark area | Reported change (002 update) |
|---|---|
| MMLU-Pro | About +7% |
| MATH and HiddenMath | About +20% |
| Vision and code tasks | About +2% to +7% |
Google also said the updated models produced more concise default outputs and delivered roughly 2x faster output with about 3x lower latency on the new builds.[6]
Pricing fell sharply over the model's lifetime. On August 8, 2024, Google cut Flash input pricing by 78 percent to $0.075 per million tokens and output pricing by 71 percent to $0.30 per million tokens, for prompts under 128,000 tokens.[5] The September update raised rate limits and reduced 1.5 Pro pricing as well.[6] The standard Flash rate card is summarized below.
| Item | Price (prompts under 128K tokens) |
|---|---|
| Input | $0.075 per 1M tokens |
| Output | $0.30 per 1M tokens |
Usage through Google AI Studio was free of charge, with the paid rates applying to the Gemini API and Vertex AI.[6]
In the September 24, 2024 refresh, Google released an experimental Gemini 1.5 Flash-8B, then made it production ready on October 3, 2024, with billing for paid usage beginning October 14, 2024.[6][7] As the name suggests, Flash-8B is a smaller and faster variant of Flash; Google described it as a model that "nearly matches the performance of the 1.5 Flash model launched in May across many benchmarks," while costing half as much.[7]
Flash-8B carried the lowest published prices in the Gemini lineup at the time. For prompts under 128,000 tokens, it cost $0.0375 per million input tokens, $0.15 per million output tokens, and $0.01 per million cached tokens, with prices doubling for prompts above that threshold.[7] Google also doubled the variant's throughput ceiling to 4,000 requests per minute, twice the standard Flash limit.[7] The company recommended it for chat, transcription, long-context translation, and high-volume multimodal summarization.[7]
| Flash-8B item | Price (prompts under 128K tokens) |
|---|---|
| Input | $0.0375 per 1M tokens |
| Output | $0.15 per 1M tokens |
| Cached input | $0.01 per 1M tokens |
Gemini 1.5 Flash was distributed through three Google channels: Google AI Studio for prototyping, the Gemini API for developers, and Vertex AI for enterprise and Google Cloud customers.[2][6] It launched in public preview in May 2024 and reached general availability later in the year, with the stable build addressed by the version aliases gemini-1.5-flash and gemini-1.5-flash-002.[6]
The 1.5 generation has since been retired. Google listed September 24, 2025, as the discontinuation date for gemini-1.5-flash-002, the model that the gemini-1.5-flash alias pointed to, and the broader Gemini 1.0 and 1.5 families were subsequently shut down on the Gemini API.[8]
Gemini 1.5 Flash was directly succeeded by Gemini 2.0 Flash, which Google announced on December 11, 2024. Google called 1.5 Flash "our most popular model yet for developers" and said 2.0 Flash built on it with stronger performance at similarly fast response times, even outperforming 1.5 Pro on key benchmarks at twice the speed.[9] Gemini 2.0 Flash became the default model on January 30, 2025, with 1.5 Flash remaining available for a transition period before its later retirement.[9] Subsequent generations, including the Gemini 2.5 and 3.x Flash models, continued the line.