Arize Phoenix is an open-source AI observability and evaluation platform developed by Arize AI. First released on April 26, 2023, Phoenix provides developers and data scientists with tools for tracing AI application execution, evaluating large language model (LLM) outputs, managing datasets, running experiments, and optimizing prompts. Built on OpenTelemetry infrastructure and powered by the OpenInference instrumentation specification, Phoenix is designed to work with any LLM framework or provider without vendor lock-in. As of mid-2026 the project has accumulated more than 9,600 GitHub stars and the broader Phoenix platform receives over two million monthly downloads.
Arize AI, the company behind Phoenix, is a venture-backed startup headquartered in Berkeley, California. Founded in 2020 by Jason Lopatecki and Aparna Dhinakaran, Arize initially focused on machine learning observability for traditional predictive models before expanding into generative AI. The company operates two main products: Phoenix OSS, the open-source community offering, and Arize AX, an enterprise SaaS platform built on the same instrumentation foundation. Arize AI has raised over $130 million in total funding, including a $70 million Series C in February 2025.
Arize AI was founded in January 2020 by Jason Lopatecki and Aparna Dhinakaran. The two founders met through TubeMogul, a programmatic advertising technology company where Lopatecki served as Chief Strategy Officer and Chief Innovation Officer before the company was acquired by Adobe for over $500 million in 2016. Dhinakaran, who serves as co-founder and Chief Product Officer, previously worked at Uber on Michelangelo, Uber's internal machine learning platform team.
The founding premise was that machine learning models deployed in production were effectively black boxes: organizations could build and train models, but had limited tools to understand why models were failing, drifting, or producing unexpected outputs in live environments. Lopatecki described the problem as similar to the challenge of debugging distributed software systems, but with the added complexity of data-driven statistical behavior rather than deterministic code. The company's name reflects this aim: "Arize" evokes emergence and surfacing, just as the platform was designed to make hidden model behavior visible.
At launch, Arize AI also announced the acquisition of Y Combinator-backed Monitor ML, absorbing the entire Monitor ML team. Initial funding of $4 million was led by Foundation Capital, with participation from Brett Wilson (TubeMogul co-founder and former CEO) and Trinity Ventures.
In its early years Arize AI focused primarily on production observability for traditional machine learning models, addressing problems like data drift, model performance degradation, and bias monitoring. The platform targeted data science and MLOps teams at enterprises deploying models for use cases such as fraud detection, recommendation systems, and demand forecasting.
Series A funding of $19 million was raised in September 2021, led by Battery Ventures, with participation from existing investors Foundation Capital, Trinity Ventures, The House Fund, and Swift Ventures. Dharmesh Thakker of Battery Ventures joined the board as part of this round.
In September 2022 Arize raised a $38 million Series B led by TCV, with existing investors Battery Ventures, Foundation Capital, and Swift Ventures also participating. The company used this funding to expand its platform capabilities and enterprise go-to-market operations.
The rapid rise of large language models in 2022 and 2023 created a new class of observability problems distinct from traditional ML. LLMs produce unstructured text outputs, operate through multi-step reasoning chains, are frequently composed into retrieval-augmented generation (RAG) pipelines, and can exhibit subtle failure modes like hallucination that are difficult to detect with conventional monitoring metrics.
Arize responded to this shift by building Phoenix, a purpose-built open-source tool for LLM observability and evaluation. Phoenix was officially launched on April 26, 2023 at Arize's Arize:Observe 2023 summit. The press release described Phoenix as "the first open source library for evaluating large language models," and the launch received coverage in VentureBeat and other technology publications.
At launch, Phoenix operated primarily as a notebook-first tool, instantiated via a simple Python import and capable of ingesting Pandas dataframes for interactive analysis. Core features included LLM embedding visualization, drift detection, anomaly detection, dataset comparison for uncovering training data gaps, and an early version of LLM-based evaluation for detecting hallucinations and assessing response quality.
Harrison Chase, co-founder of LangChain, commented at the launch: "With Phoenix, Arize is offering an open source way to visualize complex LLM decision-making."
Phoenix continued rapid development through 2024 and into 2025, with the repository reaching over 8,400 commits and more than 690 releases. The platform expanded beyond notebook-based debugging to become a production-grade observability system capable of running as a standalone server with Docker and Kubernetes deployment options. Multi-agent tracing became a core capability, with Phoenix adding support for tracing across frameworks including CrewAI, LangGraph, OpenAI Agents SDK, and Claude Agent SDK.
In February 2025, Arize AI closed a $70 million Series C round led by Adams Street Partners, with participation from M12 (Microsoft's venture fund), Sinewave Ventures, OMERS Ventures, Datadog, PagerDuty, Industry Ventures, and Archerman Capital. Existing investors Foundation Capital, Battery Ventures, TCV, and Swift VC also participated. The total raised across all rounds exceeded $130 million.
The Series C announcement included the simultaneous launch of Arize AX, the company's consolidated enterprise platform, and Arize AI Copilot (later also called Alyx), an AI engineering assistant with over 50 built-in skills for debugging, evaluation, and optimization tasks. CEO Jason Lopatecki described growth over the prior two years as "explosive," attributing this to the democratization of AI development and the resulting surge in teams needing to monitor and evaluate AI systems in production.
By mid-2025 Phoenix had added a prompt management module allowing teams to create, version, store, and reuse prompt templates. The platform was processing data at a scale that Arize characterized as one trillion spans and 50 million evaluations per month across its user base.
Arize AI's funding rounds are as follows:

- Seed (2020): $4 million, led by Foundation Capital, with participation from Brett Wilson and Trinity Ventures
- Series A (September 2021): $19 million, led by Battery Ventures
- Series B (September 2022): $38 million, led by TCV
- Series C (February 2025): $70 million, led by Adams Street Partners

Total funding exceeds $130 million. The company is headquartered in Berkeley, California.
Arize AI operates a dual-product strategy with a clear open-core model: Phoenix OSS as the freely available community offering and Arize AX as the enterprise SaaS platform.
Phoenix OSS is the open-source tier, available on GitHub under a permissive license. It can be run locally, in a Jupyter notebook, as a Docker container, or on Kubernetes via Helm charts. Phoenix OSS is intended to remain permanently free and open-source. It includes the full observability and evaluation feature set: distributed tracing, the LLM-as-judge evaluation framework, dataset management, experiment tracking, prompt playground, and prompt management. The OSS edition uses PostgreSQL as its primary data store.
Phoenix OSS is suitable for individual developers, research teams, and organizations that prefer to manage their own infrastructure. There is no limit on the volume of data processed when self-hosting.
Arize AX is the enterprise SaaS platform, available in tiered pricing: a free tier (25,000 spans per month), a Pro tier ($50 per month), and an Enterprise tier with custom pricing. AX is built on the same OpenInference trace schema as Phoenix OSS, ensuring instrumentation code written for Phoenix works without modification when connecting to AX.
AX includes capabilities not available in the OSS edition, including:

- Arize AI Copilot (Alyx), the AI engineering assistant with more than 50 built-in skills for debugging, evaluation, and optimization
- Alerting rules and monitors that notify teams when metrics cross defined thresholds
- Support for traditional ML and computer vision models alongside LLM workloads
- Enterprise governance features such as HIPAA compliance, SOC 2 Type II certification, audit trails, and advanced access controls
The migration path from Phoenix OSS to AX is designed to be frictionless: because both platforms share the OpenInference instrumentation standard, teams can move from self-hosted Phoenix to cloud-managed AX by changing a single endpoint configuration without rewriting any instrumentation code.
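A minimal sketch of that endpoint-only switch, using the `phoenix.otel` register helper; the project name and endpoint values are illustrative placeholders, and the credentials required by Arize AX are not shown here.

```python
from phoenix.otel import register

# Self-hosted Phoenix: export spans to the local OTLP collector endpoint.
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Moving to a managed backend is, in principle, a matter of changing this
# endpoint (and supplying credentials); the instrumentation code that emits
# spans does not change.
```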
OpenInference is an open-source instrumentation specification and library developed by Arize AI. It extends OpenTelemetry, the widely adopted observability standard, with semantic conventions specifically designed for AI and LLM applications. OpenInference is the foundational layer that enables Phoenix and Arize AX to collect structured telemetry data from AI applications.
OpenTelemetry provides a vendor-neutral framework for distributed tracing, metrics, and logs across software systems. However, its generic span model does not natively represent concepts specific to AI workloads, such as LLM input and output messages, token counts, embedding vectors, retrieved documents, tool call arguments, or model parameters.
OpenInference addresses this gap by defining a complementary set of semantic conventions layered on top of OpenTelemetry. An OpenInference span carries all standard OpenTelemetry fields plus AI-specific attributes using dot-notation naming conventions. For example, `llm.model_name` identifies the model, `llm.input_messages` captures the prompt, `llm.output_messages` captures the completion, `llm.token_count.prompt` and `llm.token_count.completion` record token usage, and `llm.invocation_parameters` stores serialized model configuration.
Because OpenInference extends rather than replaces OpenTelemetry, traces produced by OpenInference instrumentation can be sent to any OpenTelemetry-compatible backend in addition to Phoenix and Arize AX. This design avoids vendor lock-in at the instrumentation layer.
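As a concrete illustration, the following sketch sets OpenInference attribute names on a plain OpenTelemetry span by hand. It assumes a tracer provider has already been configured (for example via Phoenix's register helper); the message contents and token counts are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Record a single LLM call as an OpenInference-flavored OpenTelemetry span.
with tracer.start_as_current_span("chat-completion") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-4o-mini")
    span.set_attribute("llm.input_messages.0.message.role", "user")
    span.set_attribute("llm.input_messages.0.message.content", "What is OpenInference?")
    # ... the actual model call would happen here ...
    span.set_attribute("llm.output_messages.0.message.role", "assistant")
    span.set_attribute("llm.output_messages.0.message.content", "An instrumentation spec.")
    span.set_attribute("llm.token_count.prompt", 12)
    span.set_attribute("llm.token_count.completion", 9)
```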
OpenInference defines a required `openinference.span.kind` attribute that identifies the semantic type of each operation. The supported span kinds are:

- CHAIN: a general processing step or composition of other operations
- LLM: a call to a language model
- TOOL: a tool or function invocation
- AGENT: the root of an agent's reasoning loop
- RETRIEVER: a document or vector-store retrieval
- EMBEDDING: an embedding generation call
- RERANKER: reordering of retrieved documents
- GUARDRAIL: a safety or policy check
- EVALUATOR: an evaluation of another span's output
- UNKNOWN: a fallback for operations that do not fit the other categories
These span kinds enable Phoenix and downstream analysis tools to precisely filter, aggregate, and visualize the specific components of complex AI pipelines.
OpenInference provides auto-instrumentation packages for three languages:
Python has the broadest support with over 40 packages covering major LLM frameworks (LangChain, LlamaIndex, DSPy, Haystack), LLM providers (OpenAI, Anthropic, Amazon Bedrock, Groq, Mistral, Cohere), agent frameworks (CrewAI, AutoGen, PydanticAI, Agno), and specialized tools (Guardrails AI, Portkey).
JavaScript and TypeScript offer more than 10 packages including integrations for the OpenAI SDK, Anthropic SDK, Amazon Bedrock, LangChain.js, Vercel AI SDK, and TanStack AI.
Java provides four packages targeting enterprise frameworks including LangChain4j and Spring AI.
Auto-instrumentation packages require only an import and a brief configuration call to activate. Manual instrumentation using the standard OpenTelemetry SDK is also supported for custom components not covered by existing packages.
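A minimal sketch of that import-and-configure pattern, assuming the arize-phoenix-otel and openinference-instrumentation-openai packages are installed and a Phoenix server is reachable at its default local endpoint.

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(project_name="demo")

# Activate auto-instrumentation: subsequent OpenAI SDK calls emit OpenInference spans.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```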
Phoenix operates as a server that receives OpenTelemetry-formatted trace data and stores it in a relational database (PostgreSQL for self-hosted deployments). Client applications instrument their AI code using OpenInference packages, which emit spans to the Phoenix server over the OTLP (OpenTelemetry Protocol) endpoint.
When an instrumented application handles a request, each logical operation within the AI pipeline generates a span. These spans form a trace: a hierarchical tree representing the full execution of a request. A RAG pipeline trace, for example, might contain a top-level CHAIN span enclosing a RETRIEVER span (vector search), an LLM span (the language model call with retrieved context), and optionally an EVALUATOR span (online evaluation of the response).
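The sketch below illustrates that hierarchy with the plain OpenTelemetry API and OpenInference span kinds; vector_search and call_llm are hypothetical stand-ins for a real retriever and model call.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def vector_search(question):   # hypothetical retrieval stub
    return ["doc-1", "doc-2"]

def call_llm(question, docs):  # hypothetical generation stub
    return "generated answer"

def answer(question: str) -> str:
    # Top-level CHAIN span encloses the whole pipeline.
    with tracer.start_as_current_span("rag-pipeline") as chain_span:
        chain_span.set_attribute("openinference.span.kind", "CHAIN")

        # Child RETRIEVER span: vector search over the knowledge base.
        with tracer.start_as_current_span("retrieve") as retriever_span:
            retriever_span.set_attribute("openinference.span.kind", "RETRIEVER")
            docs = vector_search(question)

        # Child LLM span: generation with the retrieved context.
        with tracer.start_as_current_span("generate") as llm_span:
            llm_span.set_attribute("openinference.span.kind", "LLM")
            return call_llm(question, docs)
```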
Phoenix stores traces with full fidelity, including input and output content, token counts, latency, model parameters, and any custom attributes set by the application. The web UI provides trace visualization, allowing engineers to inspect individual runs, identify slow or failing steps, and compare across sessions.
Phoenix's evaluation capabilities operate in two modes: offline evaluation against stored datasets and online evaluation running at inference time in production. Both modes use the same evaluation primitives.
Evaluations are attached to spans and traces as structured scores with optional explanatory labels. Results are stored in the Phoenix database and surfaced in the platform's experiment comparison views.
Phoenix supports systematic experimentation through dataset management. Traces can be grouped into versioned datasets of representative examples. Teams can then run different versions of their application (different prompts, models, retrieval strategies) against the same dataset and compare evaluation metrics across runs. This workflow supports data-driven iteration rather than ad hoc testing.
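A minimal sketch of this workflow with the Phoenix datasets and experiments APIs, assuming a running Phoenix server; the example data, task, and evaluator are toy placeholders.

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()

# Upload a small versioned dataset of representative examples.
dataset = client.upload_dataset(
    dataset_name="support-questions",
    dataframe=pd.DataFrame(
        {
            "question": ["How do I reset my password?"],
            "expected": ["Use the password reset link on the sign-in page."],
        }
    ),
    input_keys=["question"],
    output_keys=["expected"],
)

# The task is the application variant under test (toy placeholder here).
def task(example):
    return "Use the password reset link on the sign-in page."

# A simple evaluator comparing the task output to the expected answer.
def matches_expected(output, expected) -> float:
    return float(expected["expected"].lower() in output.lower())

# Run the experiment; results and scores appear in the Phoenix UI.
run_experiment(dataset, task, evaluators=[matches_expected])
```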
Added in April 2025, Phoenix's prompt management module provides version control, tagging, and experimentation for prompt templates. Prompts can be stored centrally, tagged with metadata, and pulled at runtime by applications, enabling prompt changes to be tracked and rolled back independently of application code deployments.
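A sketch of pulling a centrally stored prompt at runtime via the Phoenix client; the prompt identifier, variable names, and the exact shape of the returned object are assumptions for illustration and may differ across Phoenix versions.

```python
from phoenix.client import Client

client = Client()  # connects to the Phoenix server configured via environment variables

# Fetch the current (or a tagged) version of a centrally stored prompt.
prompt = client.prompts.get(prompt_identifier="support-answer")  # assumed prompt name

# Render the template with runtime variables before invoking the model.
invocation_kwargs = prompt.format(variables={"question": "How do I reset my password?"})
```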
Phoenix can be deployed in multiple configurations:

- Locally in a terminal or Jupyter notebook session for interactive debugging
- As a standalone server in a Docker container, backed by PostgreSQL
- On Kubernetes via Helm charts for production self-hosting
- As a managed service, by pointing instrumentation at Arize AX instead of a self-hosted server
Phoenix's tracing feature captures the full execution path of AI application requests. For simple LLM calls this means recording the input prompt, model parameters, output completion, and token usage. For complex multi-step applications such as agents or RAG pipelines, tracing captures every operation in the chain as a structured span hierarchy.
Distributed tracing support means Phoenix can track requests that span multiple services or processes. This is increasingly important for multi-agent systems where different agents may run as separate services and communicate via API calls or message queues.
Session-level tracing groups related traces from multi-turn conversations or task sequences, enabling analysis of agent behavior over an entire workflow rather than just individual interactions.
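A brief sketch of session-level grouping using the context manager from the openinference-instrumentation package; the session identifier and the handle_user_turn function are illustrative placeholders.

```python
from openinference.instrumentation import using_session

def handle_user_turn(message: str) -> str:  # hypothetical, already-instrumented app code
    return "It is sunny."

# Spans emitted inside this context carry the same session identifier, so
# Phoenix can group every turn of a conversation into a single session view.
with using_session(session_id="conversation-1234"):
    handle_user_turn("What's the weather in Berkeley?")
    handle_user_turn("And tomorrow?")
```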
Phoenix provides three types of evaluation: LLM-based evaluators, code-based (deterministic) evaluators, and human annotation.
Code-based evaluators apply rule-based logic, regular expressions, or structured checks to outputs. These are fast and cheap but limited to properties that can be defined algorithmically.
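For example, a minimal deterministic check might look like the following toy evaluator, which flags responses that leak an email address.

```python
import re

def contains_no_email(output: str) -> bool:
    """Deterministic check: fail if the response contains an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None
```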
Human annotation allows team members to manually label traces in the Phoenix UI, building gold-standard datasets for benchmarking automated evaluators.
LLM-based evaluation is Phoenix's primary evaluation mechanism for semantic quality assessment, described in detail in the Evaluation Framework section below.
Phoenix also integrates with third-party evaluation libraries including Ragas, Deepeval, and Cleanlab, allowing teams to use specialized evaluation tools while storing results in Phoenix for centralized tracking.
The Phoenix prompt playground allows engineers to select a span from a trace, modify the prompt or model configuration, and replay the call to compare outputs side by side. This enables rapid prompt iteration grounded in real production examples rather than artificially constructed test cases.
In production deployments, Phoenix provides ongoing monitoring of evaluation scores, latency, error rates, and token costs. Arize AX extends these capabilities with alerting rules that notify teams when metrics cross defined thresholds.
Arize Phoenix's evaluation framework is one of its most developed capabilities. The framework addresses a fundamental challenge of LLM applications: standard software quality metrics do not capture semantic correctness. A response can be syntactically valid, complete, and grammatically correct while still hallucinating facts, ignoring the user's question, or producing content that violates safety policies. Evaluating these properties at production scale requires an approach that goes beyond deterministic rules.
Phoenix uses a paradigm called LLM-as-judge (or LLM-as-a-judge), where a second language model is used to assess the outputs of the primary application model. The judge model receives the input, output, and optionally the retrieved context or other relevant information, along with a written rubric defining the evaluation criteria. It then returns a structured score and optionally an explanation.
Binary judgments (pass/fail) tend to produce more reliable results than granular numeric scoring, as binary classifications are less sensitive to rubric ambiguity. Chain-of-thought explanations from the judge model can improve alignment with human judgment, though at higher inference cost.
According to Arize's evaluation research, LLM-as-judge approaches typically achieve 75 to 90 percent agreement with human evaluators when applied to well-defined criteria, making them suitable for high-throughput automated evaluation at production scale.
Phoenix and Arize AX provide a library of pre-built evaluation templates tested against benchmarked datasets. These evaluators are available through the arize-phoenix-evals Python package on PyPI and are integrated into the Phoenix UI as one-click evaluation runs. The pre-built templates cover criteria including hallucination detection, question-and-answer correctness, retrieval (context) relevance, toxicity, summarization quality, and code generation correctness.
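A minimal sketch of running the pre-built hallucination evaluator from arize-phoenix-evals over a small dataframe; the column values and judge model choice are illustrative.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row pairs a question, the retrieved reference text, and the model's answer.
df = pd.DataFrame(
    {
        "input": ["When was Phoenix first released?"],
        "reference": ["Phoenix was launched on April 26, 2023."],
        "output": ["Phoenix was first released in 2021."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),          # the judge model
    template=HALLUCINATION_PROMPT_TEMPLATE,          # pre-built rubric
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,                        # chain-of-thought explanation
)
print(results[["label", "explanation"]])
```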
Phoenix supports both offline and online evaluation modes.
Offline evaluation runs against stored datasets after inference. This is suited for systematic benchmarking, experiment comparison, and regression testing. Engineers can batch-evaluate hundreds or thousands of traces to build a comprehensive picture of application quality before deploying changes.
Online evaluation runs at inference time in production, attaching evaluation scores to traces as they are generated. This provides continuous quality monitoring but must account for latency constraints, as calling an LLM evaluator adds time to the response path. Online evaluations in Phoenix are typically implemented asynchronously to avoid blocking the primary response.
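A minimal sketch of attaching evaluation results back to traces so they appear alongside the spans in the Phoenix UI; the scores dataframe here is a stub standing in for the output of an evaluator run over exported spans, and the span ID is a placeholder.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# Evaluation scores indexed by the span IDs of the traces being annotated.
evals_df = pd.DataFrame(
    {"label": ["factual"], "score": [1.0]},
    index=pd.Index(["0123456789abcdef"], name="context.span_id"),
)

# Log the scores so they are stored with the corresponding traces.
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
```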
Phoenix does not require teams to use its built-in evaluators. The platform integrates with Ragas (an evaluation framework specialized for RAG systems), Deepeval, and Cleanlab, allowing evaluation scores from these tools to be stored in Phoenix alongside traces for unified reporting.
Arize AI operates in a market that grew rapidly with the adoption of large language models in production applications beginning in 2023. The company's CEO Jason Lopatecki described growth as "explosive" over the 2023-2025 period, attributing it to falling barriers to AI development that brought a large new population of developers into production AI for the first time. As of the Series C announcement in February 2025, the company cited over two million monthly downloads for the Phoenix platform.
The broader LLM observability category attracted significant investment and multiple well-funded competitors. In the context of the Series C, Lopatecki noted that Galileo (which had raised $68 million) and Patronus AI (which had raised $20 million) were among the competitors, while observing that the involvement of strategic backers like Microsoft and Datadog reflected expectations that the market would grow rapidly: "It's kind of why I think the Microsofts and Datadogs are investing in us...I expect it to be a fast, growing, large market."
Co-founder Aparna Dhinakaran described the broader problem the company is solving: "Models are making billions of dollars of decisions...if we have no way of understanding why they're going wrong, we're not putting the best decision makers into our world." This framing positioned Arize not as a developer tooling company but as essential infrastructure for responsible AI deployment.
In connection with the Series C, Arize also announced two open-source research initiatives, OpenEvals and AgentEvals. OpenEvals is a repository of standardized LLM evaluation benchmarks and methodologies contributed to the broader community. AgentEvals focuses specifically on evaluation frameworks for autonomous agent behavior, including metrics for task completion, plan quality, and multi-step coherence.
The LLM observability and evaluation space includes several tools with different design philosophies, pricing models, and target users.
| Feature | Phoenix (Arize) | LangSmith | LangFuse | Helicone |
|---|---|---|---|---|
| License | Open source (self-hosted) + managed cloud | Proprietary SaaS | MIT (self-hosted) + managed cloud | Open source + managed cloud |
| Instrumentation | OpenTelemetry / OpenInference | LangChain SDK (also supports OTEL) | SDK-based | HTTP proxy |
| Framework lock-in | None | LangChain dependency for best experience | Minimal | Minimal |
| Self-hosted free tier | Unlimited | Not available | Unlimited | Limited |
| Cloud free tier | 25k spans/month (AX) | 5k traces/month | 50k events/month | 10k requests/month |
| LLM evaluation | Native, extensive pre-built evals | Yes | Yes (Boolean and Categorical LLM-as-Judge) | Limited |
| Agent tracing | Yes (10 span kinds, multi-agent) | Yes | Yes | Limited |
| Prompt management | Yes (added April 2025) | Yes | Yes | No |
| Setup complexity | 2-4 hours | 30 minutes (with LangChain) | 4-8 hours | 15 minutes |
| Enterprise compliance | SOC2, HIPAA (AX Enterprise) | Yes | Limited | Limited |
| Status (2026) | Active development | Active development | Active development | Acquired by Mintlify, maintenance mode |
LangSmith is developed by LangChain and benefits from deep integration with the LangChain and LangGraph ecosystems. Teams building primarily with LangChain can instrument with zero code changes and gain immediate debugging visibility. The tradeoff is that LangSmith is a closed-source proprietary service, and its per-trace pricing structure can scale steeply at high volumes. Teams not using LangChain derive less value from the platform.
LangFuse is a fully open-source platform licensed under the MIT license. It emphasizes prompt engineering workflows and collaborative development features. LangFuse's cloud pricing is relatively expensive at higher tiers, but the self-hosted option is free without volume limits. The platform uses ClickHouse as its backing store, providing strong analytics performance.
Helicone takes a different architectural approach, operating as an HTTP proxy layer in front of LLM providers rather than through SDK instrumentation. This makes setup extremely fast (as little as 15 minutes) but limits visibility to what can be captured at the HTTP layer. As of 2025, Helicone was acquired by Mintlify and has entered maintenance mode, receiving security updates but no new feature development. It is generally not recommended for new projects.
Phoenix's primary differentiators are its vendor-neutral instrumentation (based on open standards), its depth of evaluation capabilities, its strong support for complex agent architectures, and its clear upgrade path from open-source self-hosting to enterprise cloud. The main tradeoff relative to hosted alternatives is the operational overhead of managing self-hosted infrastructure and the learning curve associated with OpenTelemetry concepts.
Arize AI's customer base spans large technology companies, consumer internet platforms, and enterprises deploying AI in regulated industries. Named enterprise customers include DoorDash, Instacart, Reddit, Uber, Booking.com, PepsiCo, Handshake, TripAdvisor, Siemens, Flipkart, Condé Nast, Discord, Etsy, and Honeywell. Microsoft (via its M12 venture fund) and Datadog have invested in the company as strategic backers, adding credibility in the enterprise software and infrastructure monitoring markets.
Retrieval-augmented generation systems are among the most common production deployments of LLMs, combining document retrieval with language model generation to answer questions over private knowledge bases. RAG pipelines introduce multiple potential failure points: the retrieval system may return irrelevant documents, the model may hallucinate facts not present in the retrieved context, or the model may ignore the retrieved context entirely. Phoenix's tracing captures each stage of the pipeline separately, while its evaluation templates specifically address RAG quality metrics including context relevance, faithfulness, and answer correctness.
Autonomous AI agents that use tool calls, multi-step reasoning, and iterative planning present particular observability challenges. A single user request may trigger dozens of tool calls across multiple turns, with each tool call potentially triggering additional sub-agents or API requests. Phoenix's span kind taxonomy, especially the AGENT and TOOL span types, enables structured capture of agent trajectories. The platform's session-level observability allows teams to evaluate whether agents successfully complete multi-turn tasks, maintain context across turns, and achieve stated goals.
For teams running LLM applications in production, Phoenix provides continuous evaluation of response quality. By running asynchronous LLM-as-judge evaluations on sampled production traffic, teams can detect quality regressions introduced by model provider updates, prompt changes, or shifts in user query patterns without waiting for user complaints or downstream business metrics to surface problems.
Healthcare and financial services organizations deploying AI applications face regulatory requirements around auditability, data residency, and compliance. Arize AX's HIPAA compliance certification and SOC 2 Type II compliance make it suitable for these environments. The platform's comprehensive audit trails and access control features support the governance requirements of regulated deployments.
Large engineering organizations frequently operate AI applications built on heterogeneous stacks: one product may use LangChain with OpenAI, another may use LlamaIndex with Amazon Bedrock, and a third may use a custom agent framework with Anthropic's Claude. Because Phoenix's instrumentation is based on OpenTelemetry and OpenInference standards rather than proprietary SDKs, it can ingest traces from all of these configurations into a single platform without requiring different tools for different stacks. This unified view is valuable for platform and infrastructure teams that need to provide observability across an organization's entire AI footprint rather than for a single application.
Phoenix is also used during development and in continuous integration pipelines rather than only in production. By running offline evaluations against representative datasets as part of a pull request review process, teams can detect evaluation regressions before deploying changes to production. This shift-left approach to LLM quality assurance mirrors established practices in traditional software testing. Phoenix's dataset and experiment APIs expose programmatic interfaces that allow evaluation runs to be triggered from CI/CD systems and integrated with build status reporting.
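A sketch of triggering an evaluation run from a CI job, assuming a dataset named "regression-suite" already exists in Phoenix; the task and evaluator are toy placeholders, and how aggregate scores are extracted for a pass/fail gate varies by Phoenix version.

```python
import phoenix as px
from phoenix.experiments import run_experiment

def task(example):
    # Hypothetical call into the application build under test.
    return "placeholder answer"

def correctness(output, expected) -> float:
    # Toy exact-match evaluator against the dataset's expected output column.
    return float(output.strip() == str(expected.get("answer", "")).strip())

# Fetch the shared regression dataset and run the experiment; a CI wrapper can
# then inspect the resulting scores and fail the build if they regress.
dataset = px.Client().get_dataset(name="regression-suite")
run_experiment(dataset, task, evaluators=[correctness])
```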
Despite Phoenix's broad capabilities, several limitations are relevant for prospective adopters.
OpenTelemetry learning curve: Teams without existing OpenTelemetry experience face a non-trivial onboarding process. Understanding the span model, configuring exporters, and debugging instrumentation issues requires familiarity with distributed tracing concepts that may be new to ML-focused engineers.
Infrastructure management overhead: Self-hosted Phoenix requires teams to provision, operate, and maintain the server and database. At production scale this adds operational complexity compared to fully managed hosted solutions.
Limited operational metrics: Phoenix is primarily an observability tool oriented around trace analysis and quality evaluation. It does not provide the rich infrastructure and operational monitoring capabilities of dedicated infrastructure observability platforms. Teams typically use Phoenix alongside infrastructure monitoring tools rather than as a replacement.
Evaluation cost: Running LLM-as-judge evaluations at production scale incurs inference costs for the evaluator model. Evaluating every production trace in real time may be cost-prohibitive for high-volume applications, requiring teams to implement sampling strategies.
Enterprise feature separation: Several capabilities of practical importance in enterprise environments, including Arize AI Copilot/Alyx, ML and computer vision support, HIPAA compliance, and advanced security controls, are available only in the paid Arize AX tier rather than the open-source Phoenix distribution.
Relative maturity of agent evaluation: While Phoenix has expanded rapidly into agent tracing, the evaluation primitives for agentic workflows (trajectory evaluation, goal-completion scoring, multi-turn coherence) are less mature and standardized than those for simpler LLM call or RAG use cases. This is an active area of development across the entire LLM observability field, not unique to Phoenix.
Embedding analysis reduced prominence: Early versions of Phoenix placed heavy emphasis on embedding-based analysis for understanding LLM behavior, including t-SNE and UMAP visualizations of embedding clusters. While these features remain available, the shift toward text-based evaluation and structured tracing has reduced their prominence in the typical Phoenix workflow. Teams primarily evaluating LLM response quality rather than investigating embedding spaces may find these features less relevant to their daily work.
Phoenix maintains an active open-source community centered on its GitHub repository, which had accumulated over 9,600 stars and 851 forks as of mid-2026. The repository uses an issue tracker for feature requests and bug reports, and Arize has maintained a rapid release cadence, with over 690 releases published since 2023.
The OpenInference specification is maintained as a separate GitHub repository and has been adopted beyond the Arize ecosystem. Oracle's AI and Data Science platform announced integration with Phoenix and OpenInference, and Google's Agent Development Kit (ADK) documentation includes official Phoenix integration instructions. Amazon Web Services published a reference architecture for Amazon Bedrock Agents observability using Arize. These integrations indicate that OpenInference is gaining traction as a standard instrumentation layer in the AI ecosystem, not only within the Arize user base.
Arize hosts an annual user conference called Arize:Observe, which served as the launch venue for Phoenix in 2023. The company also publishes research through its blog and public repositories, including contributions to the emerging standards for LLM and agent evaluation methodology.