Artificial Analysis is an independent benchmarking and analytics platform that evaluates artificial intelligence models and API providers across metrics including intelligence, speed, price, and latency. Founded in 2023 by Micah Hill-Smith and George Cameron, the platform has grown into one of the most widely referenced sources for comparing large language models (LLMs), image generation models, video generation models, text-to-speech systems, speech-to-text systems, and embedding models. Artificial Analysis serves both individual developers and enterprise customers, offering free public leaderboards alongside paid enterprise subscriptions. The platform is backed by investors including Nat Friedman and Daniel Gross (through AI Grant) as well as Andrew Ng.
Micah Hill-Smith and George Cameron first met while interning at Google. Both went on to pursue separate careers: Hill-Smith worked as a Business Analyst at McKinsey & Company before leaving to build a legal AI startup, while Cameron worked as a Senior Strategy Consultant at Altman Solon, focusing on technology and data centers.
In 2023, while building applications with AI models, both founders noticed a significant gap in the market. There was no reliable, independent resource for benchmarking LLMs across quality, speed, and price. AI labs frequently reported their own benchmark results using inconsistent methodologies. A notable example that motivated the founders was Google's reporting of Gemini 1.0 Ultra performance on MMLU, where Google used 32-shot chain-of-thought prompts to make the model appear superior to GPT-4, a methodology that differed from how other labs reported their scores.
Hill-Smith and Cameron built Artificial Analysis as a side project in 2023, initially expecting it to require only occasional updates. However, the rapid pace of new model releases and strong developer interest quickly outpaced their expectations. The platform launched publicly in January 2024 and gained significant traction after being featured on the Latent Space podcast. Developers embraced the tool so enthusiastically that both founders decided to pause their other projects and commit to Artificial Analysis full-time.
The pair joined AI Grant's fourth batch, an accelerator program run by Nat Friedman (former CEO of GitHub) and Daniel Gross, and relocated to San Francisco. AI Grant provided a $250,000 investment and access to a network of prominent AI investors and entrepreneurs. Andrew Ng also backed the company. As of late 2025, the team had grown to approximately 20 employees.
Artificial Analysis positions itself as an independent third party in the AI evaluation space. The platform runs its own evaluations rather than relying on self-reported numbers from AI labs. This independence is central to its value proposition: developers and enterprises need trustworthy, standardized data to make informed decisions about which models and providers to use.
The platform addresses several practical questions that developers face when building AI applications: Which model offers the best quality for a given task? Which API provider delivers the fastest response times? How much does it cost to run inference at scale? How do different providers compare when handling concurrent requests?
The LLM Leaderboard is the flagship product of Artificial Analysis, comparing over 100 AI models on intelligence, price, and performance (output speed and latency). The leaderboard covers proprietary models from companies like OpenAI, Anthropic, and Google, as well as open-source and open-weight models.
The leaderboard includes models from all major AI labs and many smaller ones. Proprietary models tracked include the GPT series from OpenAI, Claude models from Anthropic, Gemini models from Google, and others. Open-weight models include the Llama series from Meta, Mistral models, Qwen models from Alibaba, DeepSeek models, and many more. As the AI landscape has expanded, the number of frontier labs tracked on the platform has grown from four or five at launch to more than a dozen.
Artificial Analysis measures several performance indicators for each model and provider combination:
| Metric | Description | Unit |
|---|---|---|
| Time to First Token (TTFT) | The time between sending a request and receiving the first token of the response | Seconds |
| Output Speed | The average number of tokens received per second after the first token arrives | Tokens per second (t/s) |
| Total Response Time (100 tokens) | Synthetically calculated based on TTFT and output speed for a 100-token response | Seconds |
| End-to-End Response Time | Complete time from request to final token, including input processing and reasoning | Seconds |
| Context Window | Maximum number of tokens the model can process in a single request (input plus output) | Tokens |
For reasoning models that perform internal deliberation before generating a response, the platform also measures Time to First Answer Token, which excludes the thinking phase to provide a more meaningful latency figure.
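The synthetic 100-token total response time in the table above can be sketched as a function of TTFT and output speed. This is an illustrative formulation, not the platform's published formula; in particular, whether the first token is counted in the streamed portion is an assumption.

```python
def total_response_time(ttft_s: float, output_speed_tps: float,
                        n_tokens: int = 100) -> float:
    """Synthetic total response time: time to first token, plus time to
    stream the remaining tokens at the measured output speed.

    Assumes the first token is delivered at TTFT and the remaining
    n_tokens - 1 tokens arrive at output_speed_tps (an assumption)."""
    return ttft_s + (n_tokens - 1) / output_speed_tps

# e.g. a model with 0.5 s TTFT and 80 tokens/s output speed
t = total_response_time(0.5, 80.0)
```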
Pricing data tracks both input and output token costs separately. To simplify comparisons, the platform calculates a "blended price" that assumes a 3:1 ratio of input to output tokens. This ratio reflects typical usage patterns in many applications. Prices are reported per million tokens.
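The blended-price calculation described above reduces to a weighted average. A minimal sketch (the function name and signature are illustrative):

```python
def blended_price(input_price: float, output_price: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended price per million tokens, assuming a 3:1 ratio of
    input to output tokens (the platform's stated default)."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total

# e.g. $3.00/M input and $15.00/M output:
# (3.00 * 3 + 15.00 * 1) / 4 = $6.00/M blended
price = blended_price(3.0, 15.0)
```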
Different models use different tokenizers, which means the same text can produce different token counts across models. To ensure fair comparisons, Artificial Analysis uses OpenAI tokens as a standard unit of measurement through the tiktoken package (o200k_base tokenizer). The platform also tracks native token counts separately, since providers typically charge based on their own tokenization.
Performance metrics are collected through real API calls to production endpoints. The platform tests every API endpoint eight times per day for single-request benchmarks and twice per day for parallel-request benchmarks. Reported figures represent the median measurement over the prior 14 days, with P5, P25, P75, and P95 percentile values also available.
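The median and percentile summaries described above can be reproduced with Python's standard library. This is an illustrative sketch of the statistics, not the platform's actual measurement pipeline; the sample latencies are made up.

```python
import statistics

def summarize(measurements: list[float]) -> dict[str, float]:
    """Median plus P5/P25/P75/P95 over a window of measurements."""
    # n=20 yields 19 cut points: P5, P10, ..., P95
    q = statistics.quantiles(measurements, n=20)
    return {
        "median": statistics.median(measurements),
        "p5": q[0], "p25": q[4], "p75": q[14], "p95": q[18],
    }

# Hypothetical TTFT samples (seconds) from a 14-day window
latencies = [0.42, 0.45, 0.47, 0.51, 0.53, 0.55, 0.58, 0.60, 0.64, 0.71]
stats = summarize(latencies)
```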
The leaderboard supports six different workload configurations, varying across two dimensions:
| Dimension | Options |
|---|---|
| Prompt Length | ~100 tokens, ~1,000 tokens, ~10,000 tokens |
| Concurrency | 1 query, 10 parallel queries |
This allows developers to evaluate provider performance under conditions that match their specific use case, whether that involves short conversational exchanges or long document processing with high concurrency.
To prevent providers from giving preferential treatment to known benchmarking accounts, Artificial Analysis registers evaluation accounts without using their own domain. This "mystery shopper" approach ensures that the measured performance reflects what an ordinary customer would experience, rather than a specially optimized environment.
In addition to comparing models, Artificial Analysis maintains a separate Providers Leaderboard that compares over 500 API endpoints across different hosting services. This leaderboard helps developers understand how the same model performs differently depending on where it is hosted.
The platform benchmarks a wide range of API providers:
| Provider Category | Examples |
|---|---|
| First-party APIs | OpenAI, Anthropic, Google |
| Cloud platforms | Microsoft Azure, Amazon Bedrock, Google Cloud |
| Inference specialists | Groq, Fireworks, Together.ai, Cerebras, SambaNova |
| Other providers | DeepInfra, Nebius, Baseten, Databricks, Snowflake, Parasail, Cloudflare, Hyperbolic, FriendliAI, SiliconFlow, Eigen AI, Novita |
For each provider, the platform reports throughput, latency, pricing, and availability data, enabling developers to choose between first-party APIs (which may offer the latest model versions first) and third-party providers (which may offer better price-performance ratios).
The Artificial Analysis Intelligence Index (AAII) provides a composite measure of model quality, synthesizing multiple evaluation benchmarks into a single score on a 0-100 scale. The index has been updated several times since its introduction, with version 4.0.2 released in January 2026.
The Intelligence Index uses a four-category weighted framework, with each category contributing 25% to the overall score:
| Category | Weight | Evaluations Included |
|---|---|---|
| Agents | 25% | GDPval-AA (220 real-world knowledge work tasks), τ²-Bench Telecom (114 agent-user simulation tasks) |
| Coding | 25% | Terminal-Bench Hard (44 terminal-based agentic tasks), SciCode (338 scientific computing subproblems in Python) |
| General | 25% | AA-LCR (100 long-context reasoning questions at ~100k tokens each), AA-Omniscience (6,000 knowledge/hallucination questions), IFBench (294 instruction-following questions) |
| Scientific Reasoning | 25% | HLE/Humanity's Last Exam (2,158 frontier academic questions), GPQA Diamond (198 graduate-level science questions), CritPt (70 physics reasoning challenges) |
All models are tested under identical conditions. The methodology emphasizes standardization, avoidance of bias, zero-shot instruction prompting, and full transparency. Scoring predominantly uses pass@1 metrics, where models must succeed on their first attempt. Multiple runs are aggregated into single pass@1 scores, with estimated 95% confidence intervals of less than plus or minus 1%.
For reasoning models that generate internal "thinking" tokens before producing a visible answer, the platform assumes 2,000 reasoning tokens when the actual count is not available from the provider. This estimate is derived from testing across 60 prompts covering diverse topics including math, coding, and science.
AA-Omniscience is a proprietary benchmark developed by Artificial Analysis to measure factual recall and knowledge calibration. It consists of 6,000 questions derived from authoritative academic and industry sources, spanning 42 economically relevant topics across six domains:
| Domain | Example Topics |
|---|---|
| Business | Finance, management, economics |
| Health | Medical knowledge, clinical reasoning |
| Law | Legal reasoning, regulatory knowledge |
| Software Engineering | Programming languages, system design |
| Humanities and Social Sciences | History, philosophy, social studies |
| Science, Engineering, and Mathematics | Physics, chemistry, biology, mathematics |
The AA-Omniscience Index uses a scoring system bounded between -100 and +100. A correct answer earns +1, an incorrect answer costs -1, and abstaining from answering earns 0. This design rewards models that recognize their own limitations and refuse to answer rather than hallucinate. A score of 0 means the model answers correctly as often as it answers incorrectly.
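The scoring rule above can be sketched in a few lines. Normalizing the net score by the total question count to land in the stated -100 to +100 bounds is an assumption consistent with the described range, not a confirmed detail of the benchmark.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """AA-Omniscience-style score: +1 per correct answer, -1 per incorrect
    answer, 0 per abstention, scaled to a -100..+100 range.

    Dividing by the total question count is an assumption made to match
    the stated bounds."""
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

# A model with equal numbers of correct and incorrect answers scores 0,
# regardless of how often it abstains.
score = omniscience_index(correct=2400, incorrect=2400, abstained=1200)
```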
A key finding from AA-Omniscience is that all but three of the evaluated models are more likely to hallucinate than provide a correct answer when faced with difficult questions. The benchmark has been published as a research paper on arXiv and the dataset is publicly available on Hugging Face.
The Artificial Analysis Openness Index is a standardized, independently assessed measure of how "open" AI models are. The index evaluates models across two dimensions: availability and transparency.
| Dimension | Maximum Score | What It Measures |
|---|---|---|
| Availability | 6 points | API access, open weights for self-hosting, permissive licensing |
| Transparency | 12 points | Disclosure of methodology, training data, and approach documentation |
Each component is scored on a 0-3 qualitative scale based on best-fitting openness archetypes, with data elements averaged between pre-training and post-training phases. All component scores are summed (up to a maximum of 18 raw points) and normalized to a 0-100 scale. The index recognizes that "openness" in AI encompasses more than just the ability to download model weights; it also includes licensing terms, data transparency, and methodological disclosure.
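The normalization step can be expressed directly (function and parameter names here are illustrative):

```python
def openness_index(availability_pts: float, transparency_pts: float) -> float:
    """Sum the availability (max 6) and transparency (max 12) points
    and normalize the 0-18 raw total to a 0-100 scale."""
    raw = availability_pts + transparency_pts
    return 100.0 * raw / 18.0

# e.g. a model scoring 5/6 on availability and 6/12 on transparency
score = openness_index(5, 6)
```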
Artificial Analysis operates a Text-to-Image Arena where users compare pairs of images generated from the same prompt without knowing which model produced each image. Models are ranked using an Elo rating system derived from these blind comparisons. Higher Elo scores indicate that a model is preferred more often by human evaluators.
The image generation leaderboard also tracks generation speed and pricing across providers. In addition to text-to-image generation, the platform maintains an image editing leaderboard that evaluates models on their ability to modify existing images based on text instructions.
The Elo scores are informed by tens of thousands of human image preferences. The methodology applies a linear regression model similar to how LMSYS calculates Elo scores for Chatbot Arena. The Image Arena Leaderboard is also published as a Hugging Face Space for broader accessibility.
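As a simplified illustration of how pairwise votes translate into ratings, here is the classic online Elo update. Note that, as described above, the platform fits ratings with a regression over all votes (as LMSYS does) rather than updating sequentially; this sketch only conveys the underlying model, and the K-factor is arbitrary.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 4.0) -> tuple[float, float]:
    """Process one blind-comparison vote: move both ratings toward the
    observed outcome, weighted by how surprising it was."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Two equally rated models: the winner gains exactly what the loser drops
r_winner, r_loser = elo_update(1000.0, 1000.0, a_won=True)
```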
The video generation section includes both text-to-video and image-to-video leaderboards. Models are ranked through the same blind comparison methodology used for image generation: users compare two videos generated from the same prompt without knowing which model created each video, and an Elo rating is computed from the accumulated votes.
The platform tracks video generation models both with and without audio output, maintaining separate rankings for each category. The leaderboard also highlights open-weight models separately, allowing developers to evaluate self-hostable alternatives to proprietary services.
The Text-to-Speech (TTS) Arena uses a similar blind comparison methodology. Users listen to pairs of speech samples generated from the same text and select which sounds more natural. The resulting Elo ratings rank models across naturalness and quality.
The TTS leaderboard supports filtering by use case (knowledge sharing, assistants, entertainment, customer service) and accent preference (US, UK). It tracks both proprietary and open-weight models, with pricing data included for cost comparisons.
For speech recognition, Artificial Analysis developed its own accuracy metric called AA-WER (Artificial Analysis Word Error Rate). The platform evaluates dozens of speech-to-text models and ranks them by transcription accuracy. The leaderboard distinguishes between proprietary and open-weight models, providing developers with options for both cloud-hosted and self-hosted deployments.
Beyond software models and API providers, Artificial Analysis provides benchmarking of AI accelerator hardware for inference workloads. The platform measures how performance scales with concurrent load across different GPU systems, including NVIDIA H100, H200, and B200 configurations, as well as AMD MI300X and Google TPU v6e (Trillium) chips. Each system is benchmarked under three configurations:
| Configuration | Description |
|---|---|
| Max Throughput | Optimized for the highest sustained request volume |
| Minimum Latency | Tuned to deliver the fastest possible response times |
| Optimal | Balances throughput and latency for general-purpose use |
The cost per million tokens is calculated by combining system output throughput with the average cloud price per GPU per hour. Hardware benchmarks are conducted periodically, at least once per quarter, with full specifications published alongside results.
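The cost calculation described above can be sketched as follows; the example prices and throughput figures are hypothetical.

```python
def cost_per_million_tokens(throughput_tps: float,
                            gpu_price_per_hour: float,
                            n_gpus: int = 1) -> float:
    """Combine system output throughput with cloud GPU pricing:
    dollars per hour divided by tokens per hour, scaled to one
    million tokens."""
    tokens_per_hour = throughput_tps * 3600
    hourly_cost = gpu_price_per_hour * n_gpus
    return hourly_cost / tokens_per_hour * 1_000_000

# e.g. a hypothetical 8-GPU system at $2.50/GPU-hr sustaining
# 10,000 tokens/s: $20/hr over 36M tokens/hr ~= $0.56 per million tokens
cost = cost_per_million_tokens(10_000.0, 2.50, n_gpus=8)
```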
In December 2025, Artificial Analysis released Stirrup, an open-source lightweight framework for building AI agents. The framework was developed as part of the team's work on evaluating agentic capabilities through benchmarks like GDPval-AA.
Stirrup differs from many existing agent frameworks by letting models drive their own workflow rather than imposing rigid step-by-step processes. The framework provides models with essential tools including code execution environments, web search, web browsing, and bash command execution in a sandboxed environment. The design philosophy draws from analysis of leading agents such as Claude Code and Codex. Stirrup is available on GitHub under the Artificial Analysis organization and includes features like context management and MCP (Model Context Protocol) support.
Artificial Analysis publishes several of its leaderboards as Hugging Face Spaces, making them accessible to the broader AI research community. The LLM Performance Leaderboard was brought to Hugging Face in May 2024 as a collaborative effort. The Text-to-Image Arena Leaderboard and Text-to-Video Arena Leaderboard are also available as Hugging Face Spaces.
This integration gives researchers and developers access to the same data through the Hugging Face ecosystem, where it can be referenced alongside model cards, datasets, and other evaluation tools.
Artificial Analysis provides a free public API that gives developers programmatic access to its benchmark data. The API covers model intelligence evaluations, speed benchmarks, pricing data, and Elo ratings across different model categories. The free tier is rate-limited to 1,000 requests per day.
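A minimal sketch of calling such an API with Python's standard library. The endpoint path and header name below are assumptions made for illustration; consult the official API documentation for the actual values, and note the 1,000 requests/day limit on the free tier.

```python
import json
import urllib.request

API_KEY = "your-api-key"
# Assumed endpoint path -- check the official API docs
url = "https://artificialanalysis.ai/api/v2/data/llms/models"
# Assumed authentication header name
req = urllib.request.Request(url, headers={"x-api-key": API_KEY})

def fetch_models(request: urllib.request.Request) -> list:
    """Fetch benchmark data and return the payload's data list
    (response shape is an assumption)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["data"]
```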
The website and all public leaderboards are freely accessible without an account. Enterprise customers who require more detailed analysis, custom evaluations, or standardized reports subscribe to the Artificial Analysis Insights Platform.
Artificial Analysis generates revenue through two primary streams. The first is enterprise subscriptions, which provide standardized reports on model deployment decisions covering topics like serverless versus managed infrastructure versus leasing chips. The second is private benchmarking, where AI companies commission custom evaluations of their models. The founders have emphasized that no company pays to appear on the public website, maintaining the platform's independence.
The Stanford AI Index Report 2025 cited Artificial Analysis benchmarks as key reference data for understanding the AI model landscape. Academic papers on LLM pricing and performance have also referenced the platform's data. As of November 2025, the site recorded 27.56% month-over-month traffic growth, reflecting increasing adoption among developers and decision-makers.
Groq, the inference chip company, has publicly highlighted its performance on Artificial Analysis leaderboards, demonstrating the platform's influence on how AI companies market their products. Multiple cloud providers and inference startups reference their Artificial Analysis rankings in marketing materials and press releases.
Artificial Analysis occupies a distinct niche in the AI evaluation ecosystem. While platforms like LMSYS Chatbot Arena focus on human preference rankings through pairwise comparisons, and the Open LLM Leaderboard on Hugging Face focuses on academic benchmarks for open models, Artificial Analysis combines quality evaluation with real-world performance and pricing data.
| Platform | Primary Focus | Methodology |
|---|---|---|
| Artificial Analysis | Quality, speed, price across models and providers | Independent API testing, composite intelligence index, human preference arenas |
| LMSYS Chatbot Arena | Human preference rankings for chat models | Crowdsourced blind pairwise comparisons |
| Open LLM Leaderboard | Academic benchmark scores for open models | Standardized academic evaluations |
| MTEB | Text embedding model quality | Standardized embedding task evaluations |
The platform's breadth across multiple modalities (text, image, video, speech) and its focus on practical metrics like pricing and provider-level performance differentiate it from purely academic benchmarking efforts.