Aya (Expanse / Vision)
Last reviewed
Jun 8, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 1,733 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 1,733 words
Add missing citations, update stale details, or suggest a clearer explanation.
Aya is an open multilingual model family and global research initiative from Cohere Labs (formerly Cohere For AI, or C4AI), the nonprofit research arm of the Canadian artificial intelligence company Cohere. Launched to close the language gap in large language models, Aya aims to bring high-quality multilingual AI to many of the world's languages, including under-served and low-resource ones. The family began with the Aya 101 model and the Aya Collection dataset in early 2024, then progressed through Aya 23 (2024), Aya Expanse (2024), and the multimodal Aya Vision (2025), with the compact Tiny Aya series following in 2026.[1][2] Aya models are released as open-weight research artifacts under a non-commercial license, and the project is widely regarded as one of the most significant open efforts in multilingual AI.[3][4]
The central goal of Aya is language inclusion. Most leading language models are trained predominantly on English and a handful of other high-resource languages, leaving thousands of languages poorly served. Aya addresses this gap on two fronts: by building open instruction-tuning datasets that span far more languages than prior collections, and by releasing instruction-tuned models that perform well across those languages while remaining openly available to researchers.[1][3]
The initiative is notable for its scale of collaboration. The original Aya project was a global open-science effort that, according to Cohere, brought together more than 3,000 researchers and collaborators from 119 countries to build training data through human annotation, translation, and curation.[1][5] Later Aya releases pair this community-built and synthetic data with Cohere's pre-trained Command family of models and a series of post-training techniques developed by Cohere Labs.[2][6]
Cohere Labs, known as Cohere For AI until a 2025 rebrand, is Cohere's nonprofit research lab focused on open and collaborative machine-learning research. The Aya project was its flagship multilingual program. The effort produced not only models but also one of the largest open multilingual instruction datasets at the time of release.[1][3]
The data foundation comprises two linked resources. The Aya Dataset is a collection of human-curated prompt-and-completion pairs contributed by fluent speakers across many languages. The broader Aya Collection aggregates that human data with templated and translated instances from existing sources, totaling roughly 513 million prompts and completions across 114 languages.[1][5] The Aya Dataset and the underlying research were recognized with a Best Paper Award at ACL 2024, and the dataset was highlighted by Stanford HAI among featured releases of 2024.[2]
Aya 101, released in February 2024, was the project's first model. It is a 13-billion-parameter encoder-decoder model built on the mT5-xxl architecture and instruction-tuned to cover 101 languages, more than double the language coverage of comparable open models at the time.[7][8] It was fine-tuned on a mix of data including the xP3x corpus, the Aya Dataset, and the Aya Collection, all filtered to the 101 languages supported by mT5. Cohere reported that Aya 101 outperformed prior massively multilingual open models such as mT0 and BLOOMZ across automatic and human evaluations despite covering roughly twice as many languages.[7] Unlike later releases, Aya 101 was published under the permissive Apache 2.0 license, and the accompanying research paper appeared in February 2024.[7][8]
Aya 23, announced in May 2024, marked a shift in approach: rather than spreading capacity across 101 languages, it concentrated on 23 widely spoken languages to improve depth and quality per language.[9] Released in 8-billion and 35-billion-parameter sizes, Aya 23 paired Cohere's pre-trained Command family of models with the Aya Collection as instruction data. Its 23 languages include Arabic, Chinese (simplified and traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.[9] Aya 23 was released as open weights under a CC-BY-NC license with an acceptable-use addendum.[9]
Aya Expanse, released on 24 October 2024 in 8-billion and 32-billion-parameter sizes, was positioned as a state-of-the-art open multilingual family covering the same 23 languages as Aya 23.[6][10] The 32B model offers a 128K-token context length and is built on the Command foundation, enhanced through a year of Cohere Labs research in multilingual post-training: data arbitrage, multilingual preference training, safety tuning, and model merging.[10][11] Cohere reported that Aya Expanse delivered leading multilingual performance for its size class (see Benchmarks below).[6][11]
Aya Vision, released on 3 March 2025, extended the family into multimodal territory as Cohere's first vision-language model, adding image understanding to Aya's multilingual text capabilities across 23 languages.[12][13] It comes in 8-billion and 32-billion-parameter variants: the 8B model is initialized from Cohere's Command R7B, while the 32B model is initialized from Aya Expanse 32B.[13][14] Aya Vision supports tasks such as image captioning, visual question answering, and translation involving images, and it was trained using synthetic data generation and cross-modal model merging.[12][14] Like the other recent releases, it is available as open weights under a non-commercial CC-BY-NC license, with access also offered through Hugging Face, Kaggle, and WhatsApp.[13]
In 2026, Cohere Labs introduced Tiny Aya, a compact multilingual family built on a 3.35-billion-parameter base model designed to run locally on consumer devices while covering more than 70 languages. It is offered in several regionally optimized variants.[1]
The table below summarizes the main Aya releases. Parameter sizes, language counts, and release dates are drawn from Cohere's announcements and Hugging Face model cards.[1][6][7][9][13]
| Model | Sizes | Languages | Modality | Base / architecture | Released | License |
|---|---|---|---|---|---|---|
| Aya 101 | 13B | 101 | Text | mT5-xxl | Feb 2024 | Apache 2.0 |
| Aya 23 | 8B, 35B | 23 | Text | Command family + Aya Collection | May 2024 | CC-BY-NC |
| Aya Expanse | 8B, 32B | 23 | Text | Command family | Oct 2024 | CC-BY-NC |
| Aya Vision | 8B, 32B | 23 | Text + image | Command R7B (8B); Aya Expanse 32B (32B) | Mar 2025 | CC-BY-NC |
| Tiny Aya | 3.35B base | 70+ | Text | Compact multilingual | 2026 | Open |
Aya's results rest on a combination of data work and post-training methods.
Open and synthetic data. The Aya Dataset and Aya Collection were assembled through a large open-science collaboration, gathering human annotations and curated instances across more than 100 languages. Later models supplemented this with synthetic data generation, including translated and templated instruction data and, for Aya Vision, synthetically generated multimodal data.[1][5][14]
Data arbitrage. Cohere Labs describes "data arbitrage" as a strategy for sampling high-quality training data across languages by drawing on the strongest available teacher signals for each language rather than relying on a single multilingual teacher, helping low-resource languages benefit from better data sources.[6][11]
Multilingual preference training and safety tuning. Aya Expanse used preference-based optimization to improve both general performance and safety across languages, extending alignment techniques beyond English.[6][11]
Model merging. The family makes extensive use of model merging, combining the parameters of separately trained checkpoints into a single model. Cohere reported that merging contributed measurable gains in general performance and safety in its Aya research, and Aya Vision applied cross-modal merging to fuse multilingual text and vision capabilities.[11][14]
Cohere reported strong results for Aya Expanse on multilingual evaluations, with the important caveat that these figures come from the developer and use model-based judging. On m-ArenaHard, a multilingual benchmark created by translating the ArenaHard prompts into all 23 supported languages and scored with GPT-4o as judge, Cohere reported that Aya Expanse 8B outperformed leading models in its parameter class, including Gemma 2 9B, Llama 3.1 8B, and Ministral 8B, with win rates ranging from 60.4% to 70.6%.[6][10] Cohere also reported that Aya Expanse 32B achieved roughly 25% higher average accuracy on low-resource language benchmarks compared with peers such as Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B.[6][11]
For the multimodal release, Cohere reported that on the AyaVisionBench and m-WildVision benchmarks, Aya Vision 8B reached win rates of up to 79% and outperformed the much larger Llama 3.2 90B Vision, while Aya Vision 32B reached win rates up to around 72% and outperformed larger models including Qwen2.5-VL 72B, Llama 3.2 90B Vision, and Molmo 72B.[12][13] As with all benchmark claims, these are first-party results and depend on the chosen evaluation sets and judging methodology.
The broader significance of Aya lies less in any single leaderboard number than in its contribution to open multilingual AI. By releasing both large open datasets spanning 100-plus languages and capable instruction-tuned models for under-served languages, the project lowered the barrier to multilingual research and set a reference point for inclusive, community-driven model development.[3][4][5]
Aya model weights are distributed openly through Hugging Face under the CohereLabs organization (Aya 101 under Apache 2.0; Aya 23, Aya Expanse, and Aya Vision under CC-BY-NC with an acceptable-use addendum) and are also available on Kaggle.[7][9][13] Aya Expanse and Aya Vision can additionally be accessed through Cohere's platform and demos, with Aya Vision usable via WhatsApp. Because the post-Aya-101 models carry non-commercial licenses, they are intended for research and non-commercial use rather than production deployment.[6][13] The Aya Collection and Aya Dataset are released openly for use in training and evaluating other multilingual systems.[1][5]