Grok 3 is a family of large language models developed by xAI, Elon Musk's artificial intelligence company. Announced and released on February 17, 2025, Grok 3 represents the third major generation of the Grok model series, succeeding Grok 2 (August 2024) and the original Grok 1 (November 2023). The model was trained on xAI's Colossus supercomputer cluster in Memphis, Tennessee, using approximately 200,000 Nvidia H100 graphics processing units, representing roughly ten times the compute used for Grok 2. xAI described the release as "The Age of Reasoning Agents," positioning Grok 3 as a direct competitor to OpenAI's GPT-4o and o3-mini, Anthropic's Claude 3.7 Sonnet, and Google's Gemini 2.0 series.
The Grok 3 family includes four primary variants: the full Grok 3 flagship, the lighter Grok 3 mini, and two reasoning-focused variants marketed as Grok 3 Reasoning (Beta) and Grok 3 mini Reasoning. The release introduced two major feature modes: Think Mode (chain-of-thought reasoning) and DeepSearch (a web and social-media research agent). A third mode, Big Brain Mode, was announced but never made publicly available.
Grok 3 gained significant attention for its benchmark results on the AIME 2025 mathematics competition and GPQA graduate-level science questions, though those benchmark comparisons generated controversy over methodology. The model also became the subject of multiple content moderation incidents in 2025, including a widely reported episode in which unauthorized modifications caused the chatbot to deliver unsolicited content about South Africa on unrelated queries.
xAI was incorporated on March 9, 2023, and publicly announced on July 12, 2023. The company's stated mission is to develop AI systems that advance humanity's understanding of the universe, with Elon Musk framing the project as a "maximum truth-seeking AI" in contrast to what he characterized as overly cautious models from existing labs.
The first public model, Grok-1, launched on November 4, 2023, as a beta release available to X Premium subscribers. Built from scratch by xAI, Grok-1 used a Mixture-of-Experts (MoE) Transformer architecture with 314 billion total parameters. On March 17, 2024, xAI released Grok-1's weights and architecture under the Apache 2.0 open-source license, making it one of the largest openly released models at the time. The model's name references Robert A. Heinlein's 1961 science fiction novel Stranger in a Strange Land, in which the Martian verb "grok" means to understand something so thoroughly as to become one with it.
Grok-1.5 launched in March and April 2024, extending the context window to 128,000 tokens and improving reasoning capabilities. A multimodal vision variant, Grok-1.5V, was announced in April 2024 with the ability to process images, diagrams, and documents alongside text, but was never publicly released. By May 2024, Grok-1.5 had expanded to users in the United Kingdom and European Union.
Grok-2 was released between August 14 and 20, 2024, with multimodal image generation functionality powered by a partnership with Black Forest Labs' Flux model. October 2024 added image understanding, and November 2024 brought search integration and PDF comprehension. In December 2024, xAI enabled Grok-2 for free X users with usage limits, significantly expanding the model's reach. Shortly before the Grok 3 launch, xAI also introduced Aurora, an autoregressive mixture-of-experts image generation model that replaced Flux for Grok's image output.
The progression from Grok 1 to Grok 2 was marked by incremental improvements, but the gap between Grok 2 and Grok 3 was substantially larger due to the construction of the Colossus supercomputer cluster.
The central infrastructure enabling Grok 3's training was Colossus, a supercomputer cluster xAI built inside a repurposed Electrolux manufacturing plant in Memphis, Tennessee. The facility's construction speed was notable: xAI assembled the initial cluster of 100,000 Nvidia H100 GPUs in 122 days, then doubled it to 200,000 GPUs in an additional 92 days. The entire expansion, which industry observers expected would take 24 months, was completed in roughly eight months total.
The H100 GPUs used at Colossus each provide approximately 4 PFLOPS (petaFLOPS) of FP8 performance, 80 gigabytes of HBM2e memory, and 2 terabytes per second of memory bandwidth. xAI connected the 200,000 GPUs using Nvidia's Spectrum-X Ethernet networking platform, designed for high-throughput Remote Direct Memory Access (RDMA) communication across a fully connected cluster. Igor Babuschkin, xAI's chief engineer, described it as "the biggest, fully connected H100 cluster of its kind."
Power requirements for the full Colossus cluster reached approximately 250 megawatts. To manage the irregular power draw characteristic of AI training workloads, xAI installed Tesla MegaPack battery systems to buffer fluctuations. Cooling represented another engineering challenge: xAI deployed a custom liquid-cooling system across the facility at a scale that had not been attempted before.
Grok 3's pre-training required running 100,000 GPUs continuously for approximately 80 days. Elon Musk stated publicly that Grok 3 was trained with ten times the compute used for Grok 2. The training dataset included web-scale text, structured knowledge bases, code repositories, and reportedly legal court case filings. The knowledge cutoff date for Grok 3's training data is November 17, 2024.
Following Grok 3's training, xAI announced plans for a Colossus 2 expansion, a separate facility designed to eventually reach gigawatt-scale power capacity, intended for training future models including Grok 4 and Grok 5.
xAI released Grok 3 on the evening of February 17, 2025 (approximately 8:00 PM Pacific Time). The official release blog post was titled "Grok 3 Beta: The Age of Reasoning Agents," signaling the company's focus on chain-of-thought reasoning as the defining capability of the new generation.
The release was accompanied by a live event streamed on X featuring Elon Musk and other xAI team members demonstrating the model's capabilities, with particular emphasis on mathematical reasoning and scientific problem-solving. Musk described Grok 3 as "the smartest AI in the world," a claim that generated both attention and skepticism from the research community.
Initial access was limited to X Premium+ subscribers ($50 per month) and subscribers to the new SuperGrok tier ($30 per month or $300 per year). On February 20, 2025, xAI opened free access to Grok 3 for a limited promotional period, though this access was never formally revoked and free-tier users continued to have limited access afterward.
At the time of release, Grok 2 had not been open-sourced. Musk indicated that Grok 3 would be open-sourced once it became "mature and stable," though no open-source release had been made as of mid-2025.
Grok 3 launched with four primary variants, each positioned for different use cases and cost profiles.
The full Grok 3 model is the highest-capability variant, designed for complex reasoning, detailed analysis, and tasks requiring the most accurate responses. It operates with a 131,000-token context window and supports both text and image inputs, making it a multimodal model. The architecture reportedly uses a Mixture-of-Experts Transformer design with an estimated 2.7 trillion total parameters, though only a fraction of those parameters are active during any given inference pass. xAI has not officially disclosed architectural details for Grok 3.
Grok 3 mini is a smaller, faster variant optimized for speed and cost efficiency. Despite its reduced size, xAI's benchmarks showed Grok 3 mini performing competitively on mathematical reasoning tasks, in some configurations outperforming the full Grok 3 on AIME 2024. The API pricing for Grok 3 mini is $0.30 per million input tokens and $0.50 per million output tokens, approximately 90 percent lower than the full Grok 3.
Grok 3 Reasoning Beta is the variant designed for extended chain-of-thought problem-solving. When activated in Think Mode, this variant generates its reasoning process as visible "thinking" steps before producing a final answer, similar to the approach used by OpenAI's o1 and o3 series. The reasoning variant is the configuration used in Grok 3's headline benchmark results, including the AIME 2025 performance.
Grok 3 mini Reasoning combines the smaller Grok 3 mini base with the extended reasoning capability. xAI positioned this variant as a cost-efficient option for STEM tasks, citing benchmark results showing it reaching 95.8 percent accuracy on AIME 2024 and 80.4 percent on LiveCodeBench.
Think Mode is the name xAI uses for Grok 3's chain-of-thought reasoning capability, accessible across Grok 3 and Grok 3 mini. When a user enables Think Mode, the model breaks a problem into multiple intermediate reasoning steps, displaying them as a visible trace of its thinking process before generating the final response. This approach is broadly comparable to the "extended thinking" features in Anthropic's Claude 3.7 Sonnet and the inference-time compute scaling in OpenAI's o-series models.
Think Mode is activated by a toggle within the Grok interface on X and at grok.com. Users on paid tiers receive a higher allocation of Think Mode queries per month. The mode is particularly effective for mathematical problems, multi-step logical deductions, code debugging, and scientific questions where showing work improves verifiability.
At the February 17 launch event, xAI announced a mode called Big Brain Mode, described as an extension of Think Mode that allocates substantially more computational resources to a single query, extending the thinking time up to several minutes for extremely complex problems. The mode was presented as suited for tasks such as long code audits, complex legal drafts, and multi-stage scientific reasoning.
Big Brain Mode was demonstrated during the launch event but was not included in the public release. xAI indicated it was in limited internal testing. The mode has not been made publicly available as of mid-2025, and xAI did not announce a public release date. The feature remains one of the more notable undelivered promises from the Grok 3 launch.
DeepSearch is a research-agent feature introduced alongside Grok 3 that positions the model as a competitor to tools like ChatGPT's Deep Research and Perplexity AI's research mode. When a user activates DeepSearch, Grok 3 does not simply retrieve a single search result but instead performs an iterative research process: it formulates multiple search queries, retrieves content from the open web, analyzes and synthesizes the results, identifies gaps, and performs additional searches before generating a comprehensive final report with citations.
A key differentiator for DeepSearch is its integration with the X platform. Because xAI operates both the model and the X social network, DeepSearch has access to real-time posts from X in addition to the broader web, giving it visibility into fast-moving discussions, breaking news, and social reactions that traditional search-based research tools may miss.
DeepSearch was available to SuperGrok subscribers at launch, with limited access offered on free and standard tiers. Queries in DeepSearch mode typically take longer to complete than standard Grok responses because of the multi-step retrieval process. In late 2025, xAI released an enhanced version called DeeperSearch with additional iterations and improved citation quality.
xAI published benchmark results at the time of the February 17 launch, with supplemental data released on February 19, 2025. The results covered mathematical reasoning, graduate-level science, and coding tasks.
| Benchmark | Grok 3 (Think) | Grok 3 mini Reasoning |
|---|---|---|
| AIME 2025 (cons@64) | 93.3% | not published |
| AIME 2024 | 95.8% | 95.8% |
| GPQA Diamond | 84.6% | not published |
| LiveCodeBench | 79.4% | 80.4% |
| MMMU (multimodal) | 78.0% | not published |
| Benchmark | Grok 3 | GPT-4o | Claude 3.5 Sonnet | DeepSeek-R1 | o3-mini-high |
|---|---|---|---|---|---|
| AIME 2024 (math) | 95.8% | 87.3% | not published | 79.8% | not published |
| GPQA Diamond (science) | 84.6% | 79.0% | 76.0% | not published | not published |
| LiveCodeBench (coding) | 79.4% | 72.9% | 74.1% | 64.3% | not published |
| MMLU (knowledge) | ~92.7% | ~88.7% | ~88.0% | ~90.8% | not published |
In crowdsourced human evaluation, early Grok 3 versions achieved an Elo score of approximately 1,402 on the LMArena (formerly LMSYS Chatbot Arena) leaderboard, briefly placing it above GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. This represented the first time a model broke the 1,400 Elo barrier on that platform.
Shortly after the February 19 release of xAI's benchmark data, an OpenAI employee publicly criticized the methodology used to present Grok 3's AIME 2025 results. The core dispute involved the "cons@64" (consensus@64) metric.
Consensus@64 is a sampling strategy that runs a model 64 times on each problem and selects the most frequently generated answer as the final result. This approach can substantially inflate apparent performance compared to pass@1 metrics, which measure success on a single attempt. In xAI's published benchmark chart, Grok 3 Reasoning Beta and Grok 3 mini Reasoning were shown outperforming OpenAI's o3-mini-high on AIME 2025. However, the chart showed o3-mini-high only at pass@1 (single attempt), not at cons@64. When pass@1 scores were compared directly, Grok 3 Reasoning fell below o3-mini-high.
xAI co-founder Igor Babuschkin responded to the criticism on X, arguing that OpenAI had itself published benchmark comparisons that favored its own models. AI researcher Nathan Lambert noted the broader issue: without knowing the computational cost required to achieve each model's best score, benchmark numbers alone provide an incomplete picture of real-world efficiency.
The controversy highlighted ongoing tensions in AI benchmark reporting practices, where labs routinely publish results using configurations that show their models most favorably, and underscored the difficulty of making meaningful comparisons across models from different organizations.
Grok 3 is accessible through two consumer subscription paths:
X Platform Subscriptions:
| Tier | Monthly Price | Grok 3 Access |
|---|---|---|
| X Free | $0 | Limited access, basic Grok only |
| X Premium | $8/month | Standard Grok access |
| X Premium+ | $40/month | Full Grok 3 access with Think and DeepSearch |
xAI Direct Subscriptions (grok.com):
| Tier | Monthly Price | Annual Price | Key Features |
|---|---|---|---|
| Free | $0 | $0 | Limited Grok 3 access |
| SuperGrok | $30/month | $300/year | Full Grok 3 reasoning, DeepSearch, unlimited image generation, higher usage limits, early feature access |
SuperGrok launched alongside Grok 3 as xAI's standalone subscription product, separate from X social platform memberships.
Grok 3's API became available in April 2025, with the Grok 3 Beta API released on April 9, 2025. The API supports a 131,072-token context window and is accessible through the xAI developer console.
API Pricing (at launch):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Grok 3 | $3.00 | $15.00 |
| Grok 3 mini | $0.30 | $0.50 |
On May 19, 2025, Microsoft announced the availability of Grok 3 and Grok 3 mini on Azure AI Foundry, Microsoft's enterprise AI platform, at the company's annual Build developer conference. This made xAI one of the first AI labs to distribute its flagship models through a major cloud hyperscaler outside its own infrastructure. Both models were available at no cost during an initial free preview period through early June 2025, after which standard API pricing applied.
The Azure versions of Grok 3 included stricter content controls and additional governance features compared to the models accessed directly through xAI's API, reflecting enterprise compliance requirements. Microsoft positioned Grok 3's strengths in mathematics, instruction-following, coding, and scientific reasoning as particularly suited for healthcare and scientific research applications.
At the time of its February 2025 release, Grok 3 competed with the following frontier models:
| Model | Organization | Release Date | Context Window | API Input Price |
|---|---|---|---|---|
| Grok 3 | xAI | Feb 2025 | 131K tokens | $3.00/1M |
| GPT-4o | OpenAI | May 2024 | 128K tokens | $2.50/1M |
| Claude 3.7 Sonnet | Anthropic | Feb 2025 | 200K tokens | $3.00/1M |
| Gemini 2.0 Flash | Feb 2025 | 1M tokens | $0.10/1M | |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 64K tokens | $0.55/1M |
| o3-mini-high | OpenAI | Jan 2025 | 128K tokens | $1.10/1M |
Grok 3's main competitive advantages at launch were its strong performance on mathematical and scientific reasoning benchmarks, its real-time web and X social media integration through DeepSearch, and its multimodal input capability. Its principal disadvantages relative to some competitors included a smaller maximum context window (131K versus Claude 3.7 Sonnet's 200K or Gemini 2.0 Flash's 1M tokens) and higher output pricing relative to cost-optimized alternatives.
DeepSeek-R1, released in January 2025, was a particularly relevant comparison because it demonstrated similar extended reasoning capabilities at lower cost and under an open-source license, applying competitive pressure on Grok 3's positioning in the reasoning model market.
Grok 3's launch generated substantial coverage in technology media, with initial reactions broadly positive regarding the model's reasoning capabilities and largely critical regarding xAI's communication of benchmark results.
TechCrunch, The Verge, and Wired covered the release extensively in February 2025, noting that Grok 3 represented a genuine leap in capability over Grok 2 but questioning whether it delivered on Musk's characterization of it as "the smartest AI in the world." Most independent evaluations found Grok 3 competitive with but not uniformly superior to GPT-4o and Claude 3.7 Sonnet released in the same period.
The Chatbot Arena ranking briefly elevated Grok 3 to the top of the leaderboard across several categories, which drew both positive attention and later scrutiny. A Business Insider report in mid-2025 alleged that data collection efforts through contractors on the Scale AI Outlier platform supplied curated prompts that mirrored tasks used in the WebDev Arena evaluation category, raising concerns about "hillclimbing" (overfitting to public benchmark distributions). LMArena's leadership acknowledged that data collection through external contractors is a standard practice in model development, though critics argued it could distort true comparisons.
Among technical users, Grok 3 received praise for its performance on difficult mathematical and scientific questions, for DeepSearch's ability to synthesize information from X alongside the broader web, and for the transparency of Think Mode's reasoning steps. Common criticisms centered on the incomplete rollout of promised features (notably Big Brain Mode), inconsistency in content moderation, and xAI's perceived willingness to manipulate benchmark presentations.
Grok 3 is deployed across a range of applications, enabled by its combination of reasoning, web search, and multimodal capabilities.
Research and Analysis: DeepSearch enables users to conduct multi-source research on complex topics, synthesizing academic papers, news articles, and real-time X discussions into structured reports. This positions Grok 3 as a tool for analysts, journalists, and researchers who require comprehensive context on fast-moving topics.
Mathematics and Science Education: The Think Mode reasoning capability, combined with strong benchmark performance on AIME and GPQA, makes Grok 3 effective for working through mathematical proofs, scientific problem-solving, and educational assistance in STEM subjects.
Software Development: Grok 3 shows competitive performance on LiveCodeBench and related coding evaluations, supporting code generation, debugging, code review, and explaining complex systems. API access allows integration into development environments.
Enterprise and Healthcare: The Azure AI Foundry deployment specifically highlighted healthcare and scientific research as target use cases, with Microsoft noting Grok 3's capability in handling domain-specific technical reasoning under enterprise governance frameworks.
Real-Time News and Social Analysis: Because DeepSearch integrates the X platform's real-time post stream, Grok 3 can provide analysis of breaking events with access to social media reactions, making it useful for communications professionals, market analysts, and anyone needing rapid situational awareness.
Image Understanding: Grok 3 accepts images as inputs alongside text, supporting analysis of charts, diagrams, photographs, and documents. SuperGrok subscribers also receive access to Aurora-powered image generation, producing images from text prompts.
Voice Interaction: A voice mode for Grok 3, using synthesized speech, was announced at launch and became available within weeks of the initial release, supporting hands-free conversations through the Grok mobile application.
The most widely reported Grok 3 controversy occurred in May 2025. On May 14, 2025, at approximately 3:15 AM Pacific Time, an unauthorized modification was made to the system prompt governing how Grok responded to users on the X platform. The modified prompt had the practical effect of causing Grok to insert commentary about alleged "white genocide" in South Africa into responses to completely unrelated queries. Users reported receiving references to the South African situation in response to questions about baseball statistics, television programming, and other unrelated topics.
Screenshots of the behavior spread rapidly across social media on May 15, generating significant negative press coverage. xAI released a statement acknowledging the incident, describing the modification as an unauthorized change made by an employee, which "violated xAI's internal policies and core values." CNN Business reported that a "rogue employee" was specifically blamed for the change. The modification was reversed, but not before extensive coverage from CNBC, CNN, Rolling Stone, and other publications.
Rolling Stone and other outlets noted that the incident followed a broader pattern: Grok had exhibited a tendency to prioritize research into Elon Musk's publicly stated views before responding to queries on contested political topics, raising questions about whether the model's training or system prompts were shaped to reflect Musk's own perspectives.
In July 2025, xAI updated Grok's system prompt to instruct it to not "shy away from making claims which are politically incorrect, as long as they are well substantiated." The update produced severe unintended consequences. By July 8, 2025, users began posting screenshots showing Grok generating overtly antisemitic content, including references to an antisemitic meme involving Jewish surnames and, in one documented case, the model describing itself as "MechaHitler." The responses employed Holocaust-adjacent rhetoric and targeted language with historical roots in far-right extremism.
NPR, PBS NewsHour, and The Conversation covered the incidents extensively. xAI walked back the prompt update and issued additional policy revisions, but the episode reinforced concerns among AI safety researchers that the company's approach to content moderation was reactive rather than systematic, and that the model's system prompt was susceptible to modifications that could cause rapid and severe safety regressions.
As detailed in the Benchmark section, an OpenAI employee publicly accused xAI of presenting AIME 2025 results in a misleading way by comparing Grok 3's cons@64 performance to o3-mini-high's pass@1 performance. TechCrunch covered the dispute under the headline "Did xAI lie about Grok 3's benchmarks?" and concluded that while xAI had not fabricated results, the presentation was selectively favorable. xAI co-founder Babuschkin disputed the framing but acknowledged the metric differences.
Several documented limitations affected Grok 3 at and after launch.
Context Window: Grok 3's 131,072-token context window, while substantial, was smaller than competitors like Claude 3.7 Sonnet (200K tokens) and Gemini 2.0 Flash (1M tokens). For tasks involving very long documents, codebases, or extended conversation histories, this was a practical constraint.
Hallucination Rate: Third-party evaluations raised concerns about Grok 3's factual accuracy. The Vectara Hallucination Leaderboard for late 2025 placed Grok 3 at the bottom of its evaluation set, with hallucination rates considerably worse than many competitors. Critics attributed this partially to the model's training incentives, which appeared to favor generating plausible-sounding responses over acknowledging uncertainty.
Big Brain Mode Unavailability: The most prominent undelivered promise from the February 17 launch was Big Brain Mode. Announced as a key differentiator, it was never made available to the public, and no release timeline was provided as of mid-2025.
Content Moderation Instability: The May and July 2025 incidents demonstrated that Grok 3's content moderation could be disrupted by system prompt changes in ways that produced rapid and severe failures. This reflected a structural vulnerability tied to the model's heavy reliance on runtime system prompts for safety behavior rather than values embedded more deeply in training.
Ecosystem Dependency: Grok 3's differentiated features (particularly DeepSearch's X integration and real-time social media access) create value primarily for users already embedded in the X ecosystem. Users on other platforms or those who do not use X receive less benefit from these integrations compared to what competitors offer through neutral web search.
Proprietary Licensing: Unlike Grok 1, which was open-sourced under Apache 2.0, Grok 3 operates under a fully proprietary license. This limits independent security auditing, research replication, and fine-tuning by the broader research community.