Cohere Command A
Last reviewed
May 17, 2026
Sources
25 citations
Review status
Source-backed
Revision
v2 ยท 6,133 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
25 citations
Review status
Source-backed
Revision
v2 ยท 6,133 words
Add missing citations, update stale details, or suggest a clearer explanation.
Cohere Command A is a 111 billion parameter dense large language model released by Cohere on March 13, 2025. It is the successor to the Command R and Command R+ series and was Cohere's flagship general-purpose model at the time of its launch. The model supports a 256,000 token context window, runs on as few as two H100 or A100 GPUs in 16-bit precision, and is optimized for enterprise workloads such as Retrieval-Augmented Generation, multi-step tool use, and multilingual agents covering 23 languages. Its open weights were published on Hugging Face under the Creative Commons Attribution-NonCommercial 4.0 license, while commercial access is provided through Cohere's hosted API and through cloud partners including Amazon Bedrock, Vertex AI, Microsoft Azure AI Foundry, and Oracle Cloud Infrastructure.
Cohere positioned Command A as a deliberate counterweight to larger frontier models like GPT-4o, Claude 3.7 Sonnet, and DeepSeek V3. Rather than chasing higher parameter counts or breaking new ground on academic chat benchmarks, the company emphasized hardware efficiency and the practical demands of business deployment. According to Cohere's own measurements, Command A delivers roughly 1.75 times the token streaming throughput of GPT-4o and 2.4 times that of DeepSeek V3 on long-context requests, while matching or exceeding both on tool-use and agentic benchmarks. The accompanying 91 page technical report, posted to arXiv as paper 2504.00698, was credited to 228 contributors at Cohere and Cohere Labs and described a decentralized training pipeline that combined supervised fine-tuning, preference optimization, and model merging across capability-specific expert checkpoints.
The Command A platform expanded over the following months into a family of specialized variants. Command A Vision, released July 31, 2025, added image input through a SigLIP2 encoder grafted onto the same text tower. Command A Reasoning, announced August 21, 2025, introduced a configurable thinking-token budget for deeper inference-time deliberation. Command A Translate, released August 28, 2025, was tuned specifically for high-quality machine translation across the same 23-language set. Each variant retained the two-GPU deployment footprint and the open-weights distribution model that defined the original release.
Cohere is a Canadian AI company headquartered in Toronto with additional offices in San Francisco, New York, London, Paris, Seoul, and Montreal. It was founded in 2019 by Aidan Gomez, Ivan Zhang, and Nick Frosst. Gomez was one of the eight co-authors of the 2017 paper "Attention Is All You Need," the work that introduced the Transformer architecture, and Frosst was a researcher at Google Brain. From the start the company concentrated on enterprise customers rather than consumer chatbots, providing models for search, document processing, customer support, and workflow automation.
The Command product line predates Command A by several years. The first generation, simply called Command and Command Light, served as general-purpose text generators. In March 2024 Cohere shifted strategy with the launch of Command R, a 35 billion parameter model purpose-built for Retrieval-Augmented Generation and tool use. Command R was followed a month later by the larger Command R+ at 104 billion parameters. An August 2024 refresh of both models brought throughput improvements and pricing reductions. In December 2024 Cohere added Command R7B at the small end, a 7 billion parameter model that was explicitly described as the final entry in the R series. Command A, announced three months later, was the first model in the next generation. It carries forward several R-series design choices, including the focus on grounded generation with inline citations and on enterprise-grade tool use, while changing the underlying architecture and dramatically increasing the context window.
Cohere has consistently differentiated itself from competitors on deployment flexibility. Customers can call Cohere's hosted API, deploy through one of the major cloud marketplaces, run the models inside a virtual private cloud, or install them on premises. This matters for regulated industries such as banking, healthcare, and government, where data residency and air-gapped deployments are often non-negotiable. The Bell Canada partnership announced in July 2025, which placed Cohere models inside Bell-owned Canadian data centers for use by public-sector clients, illustrates the pattern. By late 2025 Cohere had raised approximately 1.6 billion dollars in total funding, with a valuation reported at around 7 billion dollars after a September 2025 extension round, and counted RBC, Dell, LG, Ensemble Health Partners, Palantir, and Oracle among its disclosed customers. A September 2025 partnership with AMD committed the company to deeper optimization of Command A on AMD Instinct accelerators, broadening the supported hardware beyond Nvidia and Google TPUs.
The company's commercial positioning has shifted alongside the Command A family. Cohere has spent less effort competing for consumer mindshare and more on its North workspace platform, an agentic environment that uses Command A as its core reasoning engine and integrates with enterprise data sources such as email, document repositories, customer relationship management systems, and ticketing tools. Royal Bank of Canada announced in mid-2025 that it would co-develop a private deployment called North for Banking, optimized for wealth advisor and customer-support workloads. By the time Command A Reasoning shipped in August 2025, Cohere was describing North and Command A together as a vertically integrated stack rather than a single model release.
Command A is a dense decoder-only Transformer with 111 billion parameters and a 256,000 token context window. The architecture builds directly on the hybrid attention design that Cohere first introduced in the smaller Command R7B and scales it up. Three out of every four transformer blocks use sliding window self-attention with a 4,096 token window, while the fourth block uses full global attention. Sliding window blocks rely on Rotary Position Embedding (RoPE) for positional information, whereas the global blocks omit explicit positional encoding so that distant tokens can interact with one another without being filtered through a particular position bias.
The practical effect of this design is that the bulk of the model's compute stays cheap. Sliding window attention has linear complexity in the sequence length, which is what makes a 256K context tractable on modest hardware. The interleaved global attention blocks retain the model's ability to track relationships across the entire context, which matters for long-document RAG, repository-scale code understanding, and multi-step agent traces that revisit earlier observations. Cohere has described the global attention layers as the mechanism by which the model effectively notices a needle in a haystack while paying for it only once every four layers.
The model uses BFloat16 tensors in its native release and a proprietary chat template that delineates system, user, assistant, and tool turns with special tokens. Tokenization continues to use Cohere's own multilingual tokenizer, which the company has measured as producing fewer tokens than OpenAI's tokenizer for non-English text. For Japanese the difference is around 1.67 times fewer tokens; for many other non-English languages the gap is smaller but still meaningful. Lower token counts translate directly into lower API costs and longer effective context for non-English users.
A practical consequence of the dense rather than mixture-of-experts choice is that activation patterns are uniform across tokens. Mixture-of-experts models like DeepSeek V3 achieve large total parameter counts but require all experts to be resident in GPU memory, which inflates the hardware footprint even though only a fraction of parameters are activated per token. Cohere argued in the technical report that for a two-GPU serving target the dense layout was the better trade because it amortized communication overhead more predictably and made batched inference easier to schedule. The same reasoning informed the choice to publish only the dense 111B configuration rather than a family of sizes, since each additional configuration would have required its own performance and safety evaluation pass.
The Command A technical report describes a multi-stage training pipeline that the authors call decentralized training, in contrast to the more typical monolithic pretraining followed by a single round of instruction tuning. Pretraining ran on a large multilingual web and code corpus with a knowledge cutoff of June 1, 2024, after which the model was refined through a sequence of supervised fine-tuning, preference optimization, and model merging stages.
The supervised fine-tuning stage produced a set of capability-specific expert checkpoints, each optimized for a domain such as code, math, RAG, tool use, multilingual chat, or safety. Rather than serving these specialists as discrete fine-tunes or routing between them at inference time, Cohere merged the expert checkpoints into a single set of base weights using parameter-space averaging and conflict-resolution techniques. The result is a unified model whose behavior can be steered by the system prompt rather than by a model selector. Cohere's stated rationale was that enterprise workflows tend to combine domains, since a customer-service agent might call a code interpreter and a translation tool in the same session, and that a single set of weights eliminates the latency and infrastructure complexity of routing between specialists.
Preference optimization used two related losses. Self-Revising Preference Optimization, abbreviated SRPO, was introduced by Cohere to reinforce human-preferred style, tone, and formatting while teaching the model to self-refine its own outputs at inference time. SRPO normalizes the log-likelihoods of preferred and dispreferred completions by length, which controls for the well-known tendency of preference optimizers to prefer longer responses. Alongside SRPO, the team used Contrastive Preference Generation, abbreviated CoPG, with two generations per prompt and a mix of offline and online on-policy training. The two methods were interleaved in a ping-pong schedule that the authors said helped avoid the regressions and reward-model hacking patterns that plague single-pass preference training.
Reinforcement learning from human feedback was layered on top to further align outputs with human raters' preferences. Prompts for the RLHF stage were drawn from the same domains used in supervised fine-tuning so that improvements in alignment did not erode capability. Cohere also reported using synthetic preference data generated by earlier Command models, with care taken to avoid mode collapse from training on too narrow a slice of its own outputs.
A distillation pathway feeds back into Cohere's smaller multilingual line. Through a procedure the company calls Fusion, Command A is used as a teacher to synthesize safe and helpful responses that train the Aya Expanse 8B and 32B models. The Aya line, which targets the same 23 languages as the Command family, benefits from Command A's strong crosslingual generalization in safety behaviors that are difficult to teach in lower-resource languages directly. The arrangement is symbiotic: Aya research feeds tokenization, preference data, and evaluation infrastructure back to the Command line, and Command A's safety and refusal patterns are propagated downstream to the smaller open-source releases.
Command A is described by Cohere as a single model that performs well across the workloads its enterprise customers actually run, rather than one optimized for a particular leaderboard. The main capability buckets are summarized below.
| Capability | What it does | Typical use |
|---|---|---|
| Grounded generation with citations | Generates responses with inline citations mapping spans of text to specific source passages | Enterprise RAG pipelines, regulated document workflows |
| Single-step tool use | Selects a single tool from a list and produces JSON arguments matching the schema | Function calling, calculator, database lookups |
| Multi-step tool use | Plans and executes loops of action, observation, and reflection across multiple tools | Agentic AI workflows, customer service automation |
| Multilingual chat | Conversational fluency in 23 supported languages | Cross-border customer support, translation, localization |
| Code generation | Writes, explains, and translates code across languages, with strong SQL performance | Internal developer tools, analytics assistants |
| Long-context reasoning | Handles inputs up to 256K tokens with sliding window plus global attention | Repository-scale code review, contract analysis, multi-document summarization |
The 23 supported languages match the set introduced with Command R7B: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, Arabic, Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. Cohere noted improved handling of Arabic dialects compared with prior Command models, with internal evaluations placing Command A at 98.2 percent accuracy when responding in Arabic to English prompts, and on the FLoRES and WMT23 translation suites the model was reported as competitive with GPT-4o across the major European and East Asian languages.
Citations are a particular point of pride for the Command line and remain a built-in feature of Command A rather than a separate post-processing step. The model can return responses in two citation modes. "Accurate" mode generates the answer first and then produces citations that map back to the underlying passages, optimizing for precision. "Fast" mode injects citation tags inline as tokens stream, which is more suitable for interactive applications where latency matters. In both modes the citations reference document chunks that Cohere recommends sizing at 100 to 400 words, and the model's behavior was trained with both citation styles so that the same weights can switch between them based on the system prompt.
Multi-step tool use is the foundation of the model's agent capabilities. Given a list of tools defined by name, description, and JSON parameter schema, the model iterates through cycles of choosing actions, reading observations, and reflecting on whether to continue. Tools can be called in parallel when their inputs are independent, and the model is trained to recover from failed calls by retrying with different parameters or selecting an alternative tool. A built-in directly_answer tool lets it skip external calls when its own knowledge is sufficient. This loop is what powers Cohere's North workspace platform, where Command A coordinates calls into Gmail, Slack, Salesforce, and customer systems on behalf of business users.
The API exposes a Safety Modes parameter that lets developers choose between three behavioral profiles without changing the underlying weights. Contextual mode is the default and follows the safety expectations described in the system prompt, allowing the model to handle mature or sensitive topics in contexts where they are appropriate, such as medical question answering or legal analysis. Strict mode applies a stricter refusal pattern suited to consumer-facing or unmoderated environments. None mode disables the safety system prompt entirely and is intended for research and red-teaming workflows where developers want to probe the model's underlying behavior. The same safety pipeline carries through to the Command A Vision, Reasoning, and Translate variants, which inherit Command A's preference-trained refusal patterns rather than retraining their own.
Cohere published a wide array of benchmark and human-evaluation results in the Command A technical report and accompanying blog post. The headline claim is that Command A is "on par or better than GPT-4o and DeepSeek V3 across agentic enterprise tasks, with significantly greater efficiency." The table below collects the figures Cohere reported alongside competitor scores in March 2025. As with any vendor-published numbers, readers should treat them as best-case results rather than independent evaluations.
| Benchmark | Command A | GPT-4o | DeepSeek V3 | Notes |
|---|---|---|---|---|
| MMLU | On par | Reference | Reference | Cohere reported parity with GPT-4o on general knowledge |
| MBPPPlus | On par | Reference | Reference | Code generation parity reported by Cohere |
| SQL generation | Leads | Behind | Behind | Cohere's internal SQL evaluation |
| BFCL v3 | Leads | Behind | Behind | Berkeley Function Calling Leaderboard, agentic tool use |
| Tau-bench | Leads | Behind | Behind | Tau-bench, conversational agent benchmark |
| RepoQA | Dominates | Behind | Behind | Long-context code understanding |
| IFEval | Strong | Reference | Reference | Instruction following |
| MATH | Competitive | Reference | Reference | Mathematical reasoning |
| Long-context throughput | 73 tok/s | 38 tok/s | 32 tok/s | Cohere measurement at 100K context |
In human evaluations Cohere ran across business analysis, coding, and agentic task categories, Command A was reported as winning or tying against both GPT-4o and DeepSeek V3 in the majority of head-to-head comparisons. The company also reported that the model's 256K context window provides additional headroom for processing lengthy documents, extensive conversation histories, and large retrieval sets relative to GPT-4o's 128K window. On the Berkeley Function Calling Leaderboard, Command A's reported lead was particularly notable because BFCL is widely watched as a proxy for real-world function calling quality.
Independent reviewers reached more measured conclusions. The model holds up well on agentic and tool-use evaluations and on long-context tasks, where its hybrid attention design is a structural advantage. On standard chat-style benchmarks such as the LMSYS Chatbot Arena, Command A entered the leaderboard at a respectable but not class-leading position, climbing as high as rank 13 in 2025 but trailing several closed frontier models in raw preference ratings. Artificial Analysis, an independent benchmarking service, reported an Intelligence Index of 13 for Command A on its composite of academic and reasoning benchmarks, which placed the model in the middle of the 2025 flagship pack rather than at the top.
Cohere's public response to the gap between its internal numbers and the LMSYS rankings was to argue that Arena votes are dominated by stylistic preferences and verbose answers, neither of which is what enterprise buyers are paying for. Cohere Labs vice president Sara Hooker described the broader pattern of leaderboard gaming as a "crisis" for the field, and the company co-authored a 2025 paper titled "The Leaderboard Illusion" with researchers from Stanford, Princeton, Waterloo, the University of Washington, MIT, and the Allen Institute. The paper documented private pre-release testing patterns at major frontier labs, including the existence of 27 private Llama 4 variants tested before public release, and argued that the practice systematically advantaged well-resourced labs over open-source providers. Independent evaluations by integrators who tested Command A on internal RAG and agent pipelines tended to report results closer to Cohere's published claims, particularly on long-context tasks and tool-use accuracy.
The most distinctive technical claim Cohere made about Command A is that it can be served on as few as two GPUs. In Cohere's reference deployment, this means two H100 SXM modules or two A100 80GB modules running BFloat16 weights. Cohere has explicitly contrasted this with frontier models that require eight or more GPUs of comparable class to serve at the same precision. The 111 billion parameter count at BF16 takes roughly 222 gigabytes of memory before key-value cache, which fits inside two H100 modules with 80 gigabytes each only because Cohere also engineered the attention design and memory layout to keep activations and KV cache compact even at long context.
The efficiency claim is more than a bullet point. Two-GPU serving roughly halves the on-prem hardware footprint for an enterprise that wants to host a model behind its own firewall, and it makes the model accessible to mid-size companies that cannot justify the capital cost of an eight-GPU node per replica. It also has direct consequences for inference economics. Cohere's published throughput numbers, with 73 tokens per second of streaming output on a 100K context request, are achieved on this same two-GPU footprint, which is what allows the model to undercut larger competitors on price while remaining profitable for Cohere to serve.
For self-hosted deployments, the Hugging Face model card ships with a 128K context configuration by default but documents that the supported context length is 256K and that the configuration can be raised. Quantized variants released through Cohere Labs make it possible to run the model on a single 80 gigabyte GPU at reduced precision, although Cohere reserves its accuracy claims for the BF16 configuration. The technical report emphasizes that the hardware-efficiency story is the result of architecture decisions made early in the design rather than after-the-fact compression: the sliding window plus global attention layout, the choice of dense rather than mixture-of-experts, and the training pipeline were all chosen to optimize for the two-GPU deployment target.
Command A is accessible through Cohere's hosted API as command-a-03-2025, with the same Chat, RAG, and Tool Use endpoints used by the rest of the Command family. It is also available through Amazon Bedrock and SageMaker, Microsoft Azure AI Foundry, Google Cloud Vertex AI, and Oracle Cloud Infrastructure Generative AI. Each cloud partner exposes the model through its own SDK and billing path, but the underlying weights are the same.
List pricing on Cohere's own platform at launch was set to match the GPT-4o reference rate, which served the company's positioning as a same-price-better-efficiency alternative. The maximum output length is capped at 8,000 tokens per request, which is larger than several peer flagship models but shorter than the unlimited streaming offered by some open-source releases.
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Cohere API list | $2.50 | $10.00 |
| Enterprise volume | Negotiated | Negotiated |
| VPC and on-prem | License fee | License fee |
The per-token API rate is identical to GPT-4o's at the time of Command A's launch and to Command R+ 08-2024's refreshed price. Pricing across cloud partners varies slightly depending on the marketplace, with some providers including their own infrastructure markup or volume discounts. Enterprise customers that prefer a flat license over per-token billing can negotiate VPC or on-premises agreements directly with Cohere, in which case the weights are deployed inside the customer's controlled environment and Cohere never touches the inference traffic. This is one of the structural reasons that Cohere wins business in finance, healthcare, and government segments where token-metered SaaS access is not an acceptable model.
The API supports the same parameters as earlier Command R models: system preambles, tool definitions, RAG documents with chunk metadata, citation mode selection, and Safety Modes for adjusting moderation behavior. Existing R-series customers can usually swap the model name and reuse most of their integration code, although Cohere documents some differences in the chat template and recommends a brief evaluation before migrating production workloads.
Command A is released under a split licensing arrangement that has become Cohere's house style. The model weights themselves are distributed on Hugging Face as CohereLabs/c4ai-command-a-03-2025 under the Creative Commons Attribution-NonCommercial 4.0 (CC-BY-NC 4.0) license, together with the Cohere Labs Acceptable Use Policy. This permits research and non-commercial use freely: academics, students, and hobbyists can download the weights and run the model locally on their own hardware without paying Cohere anything.
Commercial use of the open weights is not granted by the CC-BY-NC license. Organizations that want to use Command A in a product or for any other revenue-generating purpose must either call the model through Cohere's hosted API, license it through one of the cloud partner marketplaces, or negotiate a separate commercial agreement with Cohere directly. In practice this means that the open-weight release functions primarily as a research artifact and as a means for prospective enterprise customers to evaluate the model in detail before committing to a paid contract.
The split between "open weights" and "open source" is also worth noting. Cohere publishes the trained weights and a technical report, but it does not release the pretraining dataset, the full training code, or the model-merging recipes used to produce the final checkpoint. By the criteria of the Open Source Initiative, the release would not qualify as open source. It is closer in spirit to Meta's Llama series than to fully open releases such as the Allen Institute's OLMo models. Cohere has been consistent on this point across the Command R, Command R+, Command R7B, and Command A releases, and the company has framed CC-BY-NC as a middle ground that captures community feedback while preserving the commercial value of the model.
Cohere did not stop at the base Command A model. Over the second half of 2025, the company released three specialized variants that share the underlying weights and tokenizer but add or repurpose capability. All three preserve the two-GPU deployment target and the CC-BY-NC weight release alongside hosted API access.
| Variant | Released | Parameters | Context | Specialty |
|---|---|---|---|---|
| Command A | March 13, 2025 | 111B | 256K | General agentic and RAG flagship |
| Command A Vision | July 31, 2025 | 112B | 128K | Image, chart, and document understanding |
| Command A Reasoning | August 21, 2025 | 111B | 256K | Hybrid reasoning with token budget control |
| Command A Translate | August 28, 2025 | 111B | 16K | High-fidelity machine translation |
Command A Vision pairs the 111 billion parameter Command A text tower with a SigLIP2 vision encoder, taking the total parameter count to roughly 112 billion. The vision variant supports English, Portuguese, Italian, French, German, and Spanish rather than the full 23-language set, reflecting where document-understanding training data was strongest. Cohere reported an average score of 83.1 percent across nine benchmarks, beating GPT-4.1 at 78.6 percent, Llama 4 Maverick at 80.5 percent, and Mistral Medium 3 at 78.3 percent on the same evaluations. The variant leads by 7.3 points over GPT-4.1 on DocVQA and by 6.7 points on OCRBench, two benchmarks that emphasize structured-document and OCR-style reasoning, although it trails on the MMMU academic-knowledge benchmark with a score of 65.3 percent against GPT-4.1's 74.8 percent. The vision tower is deployed on the same two-A100 footprint as the base model.
Command A Reasoning was Cohere's first dedicated reasoning model. It introduces a token_budget parameter that lets developers cap the number of thinking tokens the model generates before its final answer, trading deliberation for latency and cost. When the budget is exceeded, the model halts its chain-of-thought and produces a response immediately. The variant retains the 256K context window and is targeted at customer-service workflows, complex multi-step agent tasks, and decision-support scenarios where a few extra seconds of reasoning materially improve answer quality. Like the base model, it is hybrid: developers can also disable thinking entirely and run it as a fast non-reasoning chat model on the same weights.
Command A Translate is the most specialized of the three. It was tuned for high-fidelity translation across the same 23 languages and ships with a shorter 16K context window appropriate for document-translation workloads rather than long-context reasoning. Cohere claimed at launch that Command A Translate outperforms GPT-5, DeepSeek V3, DeepL Pro's LLM-backed product, and Google Translate across the supported language pairs, citing internal benchmarks scored with the MetricX-24-XL automatic translation metric. The company also submitted a system called CommandA-WMT to the WMT 2025 shared task, augmenting the production translation model with Minimum Bayes Risk decoding and a step-by-step reasoning post-edit pass.
The Hugging Face release of CohereLabs/c4ai-command-a-03-2025 quickly accumulated downstream artifacts. Within weeks of launch, third-party packagers including Unsloth, Bartowski, and LM Studio Community had published GGUF-quantized builds at precisions ranging from 2-bit to 8-bit, allowing the model to run on consumer or single-GPU setups at the cost of accuracy. Inference frameworks such as llama.cpp, vLLM, and TGI added support for the hybrid sliding-window plus global attention pattern in the months after release, which had been a sticking point for early adopters because the unusual attention layout required framework-level changes rather than configuration tweaks.
Fine-tuning interest on the open weights has been moderate rather than large, partly because the CC-BY-NC license forecloses many commercial applications and partly because the 111 billion parameter size makes full-finetune training expensive. Parameter-efficient methods such as LoRA and QLoRA are the more common approach in the community. The Vision variant in particular has been picked up for fine-tuning experiments on document-understanding tasks where its OCR and chart-reading strengths are a useful starting point. Cohere Labs has run periodic community events around the Aya line, which uses Command A as a teacher model through the Fusion distillation procedure, and the Aya releases have provided a venue for academic researchers to engage with the Command stack without needing to license commercial use of the larger model.
The table below places Command A alongside the other flagship general-purpose models that were available during 2025 and into early 2026, drawing on Cohere's published numbers, vendor documentation, and contemporaneous reporting.
| Model | Release | Parameters | Context | Input price (per 1M) | Output price (per 1M) | Open weights | License |
|---|---|---|---|---|---|---|---|
| Cohere Command A | March 2025 | 111B dense | 256K | $2.50 | $10.00 | Yes | CC-BY-NC 4.0 |
| GPT-4o | May 2024 | Undisclosed | 128K | $2.50 | $10.00 | No | Proprietary |
| Claude 3.7 Sonnet | February 2025 | Undisclosed | 200K | $3.00 | $15.00 | No | Proprietary |
| Claude Sonnet 4 | 2025 | Undisclosed | 200K | $3.00 | $15.00 | No | Proprietary |
| DeepSeek V3 | December 2024 | 671B MoE (37B active) | 128K | Varies | Varies | Yes | MIT-style |
| Llama 3.1 405B | July 2024 | 405B dense | 128K | Varies | Varies | Yes | Llama 3.1 Community License |
| Mistral Large 2 | July 2024 | 123B dense | 128K | $2.00 | $6.00 | Yes | Mistral Research License |
| Mistral Large 3 | December 2025 | 675B MoE (41B active) | 256K | $0.50 | $1.50 | Yes | Apache 2.0 |
A few patterns stand out. Command A is priced identically to GPT-4o while offering double the context window and downloadable weights, which is the headline marketing message. Against DeepSeek V3 it concedes raw parameter count and the more permissive license but argues a hardware-efficiency advantage, since DeepSeek V3 in its native configuration requires substantially more GPU memory to serve at full precision than Command A's two-GPU footprint. Against Llama 3.1 405B, the open-weight comparison is more direct: Llama is roughly 3.6 times larger, requires substantially more memory to serve, and lags on the agentic and grounded-generation tasks that Cohere has built its product around, although Llama's broader community ecosystem and more permissive license remain advantages for many users.
Mistral Large 3, released in December 2025 as Mistral AI's first sparse mixture-of-experts flagship, redrew the competitive landscape in several ways. Its 675 billion parameter total count with 41 billion active per token gave it a higher capacity-to-compute ratio than Command A, its Apache 2.0 license made it fully open source, and its public list pricing of $0.50 per million input tokens and $1.50 per million output tokens undercut Command A's API by roughly 80 percent. The trade-off is hardware: serving Mistral Large 3 at full precision requires enough GPU memory for all 675 billion parameters to be resident even though only 41 billion are activated per token, which Mistral itself recommends on multi-GPU deployments larger than the two-GPU footprint Cohere targets. For enterprise on-prem deployments where capital expenditure on GPUs is the binding constraint, Command A retains an efficiency edge; for hosted API workloads where the per-token bill matters more than infrastructure, Mistral Large 3 has the cleaner economics.
Against Anthropic's Claude line, Command A occupies a different niche. Claude 3.7 Sonnet and the Claude Sonnet 4 family from 2025 are priced at a premium to Command A and ship without downloadable weights, but they remain ahead on the academic chat benchmarks that drive Arena rankings and on long-form writing quality. Cohere's marketing has been explicit that Claude is not the comparison the company wants to encourage, since Claude is built for general assistant behavior across consumer and enterprise contexts and Command A is built for tool-using enterprise agents that need grounded citations.
The Gemini 2.5 family from Google released later in 2025 is a different kind of competitor again, with very long context windows and tight integration into Google Cloud, but its weights are not available for download. Direct head-to-head numbers between Command A and the Gemini 2.5 family were not part of Cohere's launch materials, so this article does not assert specific benchmark comparisons.
Coverage of the Command A launch focused heavily on the two-GPU efficiency claim, which functioned as a clear, quantifiable contrast with the larger and more compute-hungry frontier models. VentureBeat's coverage led with the framing that Cohere was targeting global enterprises with a highly multilingual model that requires only two GPUs, and HPCwire used the headline "max performance, minimal compute" directly from Cohere's marketing. Trade press across the financial-services and healthcare verticals picked up on the on-premises deployment story, which fit with existing narratives about regulated industries preferring vendors that offered hosted, VPC, and on-prem options under one product line.
Developer reaction on Hugging Face and Reddit was generally positive but cautious. Practitioners noted that the agentic and citation-heavy use cases were real strengths, particularly the model's tendency to refuse to fabricate citations when its grounding context did not actually support a claim. The 256K context window was widely tested and held up well for long-document tasks, although users sometimes ran into the default 128K configuration in the Hugging Face model card and had to update the configuration to access the full window. The CC-BY-NC license drew the same criticism it had drawn for earlier Command R releases: research-friendly but not usable for production deployment outside Cohere's commercial channels. For purists in the open-source community this was a continuing disappointment; for enterprise buyers comparing it with fully proprietary alternatives it was largely a non-issue.
On the leaderboard side, Command A's debut on the LMSYS Chatbot Arena landed below several closed frontier models, which prompted some commentary that Cohere's vendor-published benchmarks did not fully transfer to user-preference contexts. Cohere acknowledged the gap and reiterated its position that Arena ratings reward stylistic verbosity and that the agentic and BFCL-style evaluations were a better fit for the enterprise audience the model was designed for. The release of the "Leaderboard Illusion" paper later in 2025, which Cohere Labs co-authored, gave that response sharper rhetorical edges and turned the LMSYS gap into a broader argument about the credibility of crowd-sourced evaluation. Independent evaluations by integrators who tested the model on internal RAG and agent pipelines tended to report results closer to Cohere's published claims, particularly on long-context tasks and tool-use accuracy.
Follow-on releases in 2025 extended the Command A line into specialized variants. Command A Vision, released in July 2025, added image input as Cohere's first multimodal Command model. Command A Reasoning, announced in August 2025, was Cohere's first dedicated reasoning model, intended to think before generating final outputs and aimed at customer-service and complex enterprise tasks. Command A Translate, released the same month, was a specialized translation model covering the same 23-language set with a 16K context window optimized for translation workloads. The fact that Cohere chose to extend Command A rather than launch a numbered successor suggests that the base model has held up well in production through the second half of 2025 and into 2026.