Gemini 2.0 Flash-Lite
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,197 words
Gemini 2.0 Flash-Lite is a large language model developed by Google DeepMind and released as the most cost-efficient member of the Gemini 2.0 model family. It was built to give developers the price and latency profile of Gemini 1.5 Flash while delivering higher quality across most benchmarks, positioning it as a drop-in upgrade for high-volume, latency-sensitive workloads such as captioning, summarization, classification, and other large-scale text tasks. [1][2]
Google introduced the Gemini 2.0 family on December 11, 2024, beginning with an experimental release of Gemini 2.0 Flash aimed at the "agentic era" of AI applications. [3] The lineup expanded on February 5, 2025, when Google announced three additional pieces: a production-ready update to Gemini 2.0 Flash, an experimental Gemini 2.0 Pro for coding and complex prompts, and a new variant, Gemini 2.0 Flash-Lite, described as the company's most cost-efficient model yet. [1][2] At that point Flash-Lite was offered only as a public preview through Google AI Studio and Vertex AI, under the model identifier gemini-2.0-flash-lite-preview-02-05. [4]
The model reached general availability about three weeks later. On February 25, 2025, Google released the stable gemini-2.0-flash-lite version for production use in the Gemini API through Google AI Studio and for enterprise customers on Vertex AI. [4][5] A gap between preview and general availability is common in the Gemini program. It matters here because the variant is sometimes incorrectly described as having reached general availability on the February 5 announcement date.
Like the rest of the 2.0 series, Flash-Lite is built on a sparse Mixture-of-Experts (MoE) Transformer architecture, an approach Google carried forward from Gemini 1.5 and refined for better training stability and computational efficiency. [6] It was trained and served on Trillium, Google's sixth-generation Tensor Processing Unit, which the company credits with substantial gains in inference throughput and energy efficiency. [6]
The product rationale was straightforward. Gemini 1.5 Flash had become a workhorse for developers who needed cheap, fast inference, and Google wanted to raise the quality ceiling for that segment without raising the bill. Flash-Lite was therefore tuned to match 1.5 Flash on speed and cost while beating it on quality, leaving the standard 2.0 Flash to occupy a slightly higher tier with broader feature support and stronger benchmark scores. [1][2] Within the 2.0 family, the ordering by capability and price runs from Flash-Lite at the bottom, to Flash in the middle, to the experimental Pro at the top. [1]
Flash-Lite is deliberately a reduced-feature model rather than simply a smaller one. Google's model card notes that it "does not include all of the same features as Gemini 2.0 Flash," and secondary documentation at launch indicated it did not support image or audio output, the Multimodal Live API, search as a tool, or code execution as a tool. [6][7] The trade-off is intentional: by stripping the more expensive capabilities, Google could hold the price floor while still shipping a genuine 2.0-generation model.
Gemini 2.0 Flash-Lite has a context window of 1,048,576 tokens, the same one-million-token window as the standard 2.0 Flash. [6] Its maximum output length is 8,192 tokens. [6][8] According to Google's model card, the model accepts text, image, audio, and video inputs and produces text output, and its knowledge cutoff is June 2024. [6] Some developer-facing materials summarize the input support more narrowly as text and images, reflecting the model's positioning around large-scale text workloads rather than full multimodal parity with 2.0 Flash. [8]
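
The two limits above can be turned into a simple pre-flight check. The following is a minimal sketch, not part of any Google SDK: the function name `fits_in_context` is hypothetical, and counting the requested output against the context window is a conservative assumption made here for illustration rather than documented behavior.

```python
# Limits quoted in this article for Gemini 2.0 Flash-Lite.
CONTEXT_WINDOW_TOKENS = 1_048_576
MAX_OUTPUT_TOKENS = 8_192


def fits_in_context(input_tokens: int,
                    requested_output_tokens: int = MAX_OUTPUT_TOKENS) -> bool:
    """Return True if a request stays within both quoted limits.

    Assumption (not from the model card): input and output tokens are
    budgeted together against the context window, which is the
    conservative reading.
    """
    if input_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if requested_output_tokens > MAX_OUTPUT_TOKENS:
        return False
    return input_tokens + requested_output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    print(fits_in_context(1_000_000))      # large prompt, default output budget
    print(fits_in_context(1_000, 10_000))  # output request above the 8,192 cap
```

A caller would obtain `input_tokens` from whatever token counter it already uses; this sketch only compares the numbers.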
Google framed the long context window as a cost story as much as a capability one. In a launch example, the company said Flash-Lite could generate a relevant one-line caption for roughly 40,000 unique photos for under one dollar on Google AI Studio's paid tier. [1] For prompts above 128,000 tokens, the company highlighted Flash-Lite as an especially economical choice for long-context work. [5]
Google published a benchmark table on the Gemini 2.0 Flash-Lite model card comparing the variant against Gemini 1.5 Flash, Gemini 1.5 Pro, and the GA release of Gemini 2.0 Flash. Flash-Lite performs better than 1.5 Flash on the majority of these benchmarks, though it trails on a few, notably long-context retrieval (MRCR) and Python code generation (LiveCodeBench). The headline figures are summarized below. [6]
| Benchmark | Capability | Gemini 1.5 Flash | Gemini 2.0 Flash-Lite | Gemini 2.0 Flash (GA) |
|---|---|---|---|---|
| MMLU-Pro | General reasoning | 67.3% | 71.6% | 77.6% |
| Bird-SQL (Dev) | Code (text-to-SQL) | 45.6% | 57.4% | 58.7% |
| LiveCodeBench (v5) | Code (Python) | 30.7% | 28.9% | 34.5% |
| GPQA (diamond) | Science reasoning | 51.0% | 51.5% | 60.1% |
| SimpleQA | Factuality | 8.6% | 21.7% | 29.9% |
| FACTS Grounding | Factuality | 82.9% | 83.6% | 84.6% |
| Global MMLU (Lite) | Multilingual | 73.7% | 78.2% | 83.4% |
| MATH | Mathematics | 77.9% | 86.8% | 90.9% |
| MRCR (1M) | Long context | 71.9% | 58.0% | 70.5% |
| MMMU | Image understanding | 62.3% | 68.0% | 71.7% |
The clearest gains over 1.5 Flash show up in factuality (SimpleQA rising from 8.6% to 21.7%), mathematics, and text-to-SQL generation. The regression on MRCR indicates that, despite sharing the one-million-token window, Flash-Lite is weaker than 1.5 Flash at recalling information spread across very long inputs, a relevant caveat for retrieval-heavy applications. [6]
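
The comparison in the table can be reproduced with a few lines of arithmetic. This sketch copies the Gemini 1.5 Flash and Gemini 2.0 Flash-Lite columns from the table above and computes the change in percentage points; the variable names are illustrative only.

```python
# (Gemini 1.5 Flash, Gemini 2.0 Flash-Lite) scores in percent,
# copied from the benchmark table in this article.
SCORES = {
    "MMLU-Pro": (67.3, 71.6),
    "Bird-SQL (Dev)": (45.6, 57.4),
    "LiveCodeBench (v5)": (30.7, 28.9),
    "GPQA (diamond)": (51.0, 51.5),
    "SimpleQA": (8.6, 21.7),
    "FACTS Grounding": (82.9, 83.6),
    "Global MMLU (Lite)": (73.7, 78.2),
    "MATH": (77.9, 86.8),
    "MRCR (1M)": (71.9, 58.0),
    "MMMU": (62.3, 68.0),
}

# Change in percentage points, rounded to one decimal like the table.
deltas = {name: round(lite - old, 1) for name, (old, lite) in SCORES.items()}

# Benchmarks where Flash-Lite scores below 1.5 Flash.
regressions = sorted(name for name, delta in deltas.items() if delta < 0)

# Three largest improvements, biggest first.
top_gains = sorted(deltas, key=deltas.get, reverse=True)[:3]

if __name__ == "__main__":
    for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:20s} {delta:+.1f} pts")
    print("Regressions:", regressions)
```

Under this calculation the three largest gains are SimpleQA, Bird-SQL, and MATH, and the only two regressions are LiveCodeBench and MRCR, which matches the summary in the text.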
On price, Flash-Lite launched at $0.075 per million input tokens and $0.30 per million output tokens, undercutting the standard 2.0 Flash. [8][9] A notable change from 1.5 Flash was pricing simplicity: the 2.0 models use a single price per input type and drop the earlier distinction between short-context and long-context requests, which Google said made the long-context window roughly a third more affordable for the affected prompts. [2][5]
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Gemini 2.0 Flash-Lite | $0.075 | $0.30 |
| Gemini 2.0 Flash | $0.10 (text/image/video) | $0.40 |
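
The launch prices in the table translate into per-job costs with straightforward arithmetic. The sketch below is a rough estimator, not an official calculator. The per-photo token counts (270 input tokens and 10 output tokens) are illustrative assumptions chosen to show how a captioning job of 40,000 photos could come in under one dollar; they are not figures from the sources cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Launch prices in US dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


# Prices copied from the table in this article. The Flash input price
# is the text/image/video rate.
PRICES = {
    "Gemini 2.0 Flash-Lite": Price(0.075, 0.30),
    "Gemini 2.0 Flash": Price(0.10, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a job from its total token counts."""
    price = PRICES[model]
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


# Hypothetical captioning job: the per-photo token counts are assumptions.
PHOTOS = 40_000
INPUT_TOKENS_PER_PHOTO = 270   # assumed: image plus a short prompt
OUTPUT_TOKENS_PER_PHOTO = 10   # assumed: a one-line caption

if __name__ == "__main__":
    cost = estimate_cost(
        "Gemini 2.0 Flash-Lite",
        PHOTOS * INPUT_TOKENS_PER_PHOTO,
        PHOTOS * OUTPUT_TOKENS_PER_PHOTO,
    )
    print(f"{cost:.2f}")  # prints 0.93
```

With the same assumed token counts, the standard Gemini 2.0 Flash rates give about $1.24, which shows the size of the gap between the two tiers for input-heavy workloads.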
At general availability the model was offered through Google AI Studio for individual developers and prototyping, the Gemini API for production deployments, and Vertex AI for enterprise customers. [4][5] It was a paid, proprietary model with no open weights.
Google later moved its naming and quality focus to the Gemini 2.5 generation, which introduced its own Flash-Lite tier. Gemini 2.0 Flash-Lite, along with the pinned gemini-2.0-flash-lite-001 snapshot, was subsequently deprecated and shut down on June 1, 2026, with Google directing users to migrate to newer models. [8][9]