Arize Phoenix is an open-source AI observability and evaluation platform developed by Arize AI. First released on April 26, 2023, Phoenix provides developers and data scientists with tools for tracing AI application execution, evaluating large language model (LLM) outputs, managing datasets, running experiments, and optimizing prompts. Built on OpenTelemetry infrastructure and powered by the OpenInference instrumentation specification, Phoenix is designed to work with any LLM framework or provider without vendor lock-in. As of mid-2026 the project has accumulated more than 9,600 GitHub stars and the broader Phoenix platform receives over two million monthly downloads.
Arize AI, the company behind Phoenix, is a venture-backed startup headquartered in Berkeley, California. Founded in 2020 by Jason Lopatecki and Aparna Dhinakaran, Arize initially focused on machine learning observability for traditional predictive models before expanding into generative AI. The company operates two main products: Phoenix OSS, the open-source community offering, and Arize AX, an enterprise SaaS platform built on the same instrumentation foundation. Arize AI has raised over $130 million in total funding, including a $70 million Series C in February 2025.
Arize AI was founded in January 2020 by Jason Lopatecki and Aparna Dhinakaran. The two founders met through TubeMogul, a programmatic advertising technology company where Lopatecki served as Chief Strategy Officer and Chief Innovation Officer before the company was acquired by Adobe for over $500 million in 2016. Dhinakaran, who serves as co-founder and Chief Product Officer, previously worked at Uber on Michelangelo, Uber's internal machine learning platform team.
The founding premise was that machine learning models deployed in production were effectively black boxes: organizations could build and train models, but had limited tools to understand why models were failing, drifting, or producing unexpected outputs in live environments. Lopatecki described the problem as similar to the challenge of debugging distributed software systems, but with the added complexity of data-driven statistical behavior rather than deterministic code. The company's name reflects this aim: "Arize" evokes emergence and surfacing, just as the platform was designed to make hidden model behavior visible.
At launch, Arize AI also announced the acquisition of Y Combinator-backed Monitor ML, absorbing the entire Monitor ML team. Initial funding of $4 million was led by Foundation Capital, with participation from Brett Wilson (TubeMogul co-founder and former CEO) and Trinity Ventures.
In its early years Arize AI focused primarily on production observability for traditional machine learning models, addressing problems like data drift, model performance degradation, and bias monitoring. The platform targeted data science and MLOps teams at enterprises deploying models for use cases such as fraud detection, recommendation systems, and demand forecasting.
Series A funding of $19 million was raised in September 2021, led by Battery Ventures, with participation from existing investors Foundation Capital, Trinity Ventures, The House Fund, and Swift Ventures. Dharmesh Thakker of Battery Ventures joined the board as part of this round.
In September 2022 Arize raised a $38 million Series B led by TCV, with existing investors Battery Ventures, Foundation Capital, and Swift Ventures also participating. The company used this funding to expand its platform capabilities and enterprise go-to-market operations.
The rapid rise of large language models in 2022 and 2023 created a new class of observability problems distinct from traditional ML. LLMs produce unstructured text outputs, operate through multi-step reasoning chains, are frequently composed into retrieval-augmented generation (RAG) pipelines, and can exhibit subtle failure modes like hallucination that are difficult to detect with conventional monitoring metrics.
Arize responded to this shift by building Phoenix, a purpose-built open-source tool for LLM observability and evaluation. Phoenix was officially launched on April 26, 2023 at Arize's Arize:Observe 2023 summit. The press release described Phoenix as "the first open source library for evaluating large language models," and the launch received coverage in VentureBeat and other technology publications.
At launch, Phoenix operated primarily as a notebook-first tool, instantiated via a simple Python import and capable of ingesting Pandas dataframes for interactive analysis. Core features included LLM embedding visualization, drift detection, anomaly detection, dataset comparison for uncovering training data gaps, and an early version of LLM-based evaluation for detecting hallucinations and assessing response quality.
Harrison Chase, co-founder of LangChain, commented at the launch: "With Phoenix, Arize is offering an open source way to visualize complex LLM decision-making."
Phoenix continued rapid development through 2024 and into 2025, with the repository reaching over 8,400 commits and more than 690 releases. The platform expanded beyond notebook-based debugging to become a production-grade observability system capable of running as a standalone server with Docker and Kubernetes deployment options. Multi-agent tracing became a core capability, with Phoenix adding support for tracing across frameworks including CrewAI, LangGraph, OpenAI Agents SDK, and Claude Agent SDK.
In February 2025, Arize AI closed a $70 million Series C round led by Adams Street Partners, with participation from M12 (Microsoft's venture fund), Sinewave Ventures, OMERS Ventures, Datadog, PagerDuty, Industry Ventures, and Archerman Capital. Existing investors Foundation Capital, Battery Ventures, TCV, and Swift VC also participated. The total raised across all rounds exceeded $130 million.
The Series C announcement included the simultaneous launch of Arize AX, the company's consolidated enterprise platform, and Arize AI Copilot (later also called Alyx), an AI engineering assistant with over 50 built-in skills for debugging, evaluation, and optimization tasks. CEO Jason Lopatecki described growth over the prior two years as "explosive," attributing this to the democratization of AI development and the resulting surge in teams needing to monitor and evaluate AI systems in production.
By mid-2025 Phoenix had added a prompt management module allowing teams to create, version, store, and reuse prompt templates. The platform was processing data at a scale that Arize characterized as one trillion spans and 50 million evaluations per month across its user base.
Arize AI's funding rounds are as follows:

- Seed (2020): $4 million, led by Foundation Capital, with participation from Brett Wilson and Trinity Ventures
- Series A (September 2021): $19 million, led by Battery Ventures
- Series B (September 2022): $38 million, led by TCV
- Series C (February 2025): $70 million, led by Adams Street Partners

Total funding exceeds $130 million. The company is headquartered in Berkeley, California.
Arize AI operates a dual-product strategy with a clear open-core model: Phoenix OSS as the freely available community offering and Arize AX as the enterprise SaaS platform.
Phoenix OSS is the open-source tier, available on GitHub under a permissive license. It can be run locally, in a Jupyter notebook, as a Docker container, or on Kubernetes via Helm charts. Phoenix OSS is intended to remain permanently free and open-source. It includes the full observability and evaluation feature set: distributed tracing, the LLM-as-judge evaluation framework, dataset management, experiment tracking, prompt playground, and prompt management. The OSS edition uses PostgreSQL as its primary data store.
Phoenix OSS is suitable for individual developers, research teams, and organizations that prefer to manage their own infrastructure. There is no limit on the volume of data processed when self-hosting.
Arize AX is the enterprise SaaS platform, available in tiered pricing: a free tier (25,000 spans per month), a Pro tier ($50 per month), and an Enterprise tier with custom pricing. AX is built on the same OpenInference trace schema as Phoenix OSS, ensuring instrumentation code written for Phoenix works without modification when connecting to AX.
AX includes capabilities not available in the OSS edition, including:

- Arize AI Copilot (Alyx), the AI engineering assistant with more than 50 built-in skills for debugging, evaluation, and optimization
- Alerting rules and monitors that notify teams when metrics cross defined thresholds
- Support for traditional ML and computer vision models alongside LLM workloads
- Enterprise governance features such as HIPAA compliance, SOC 2 Type II certification, audit trails, and advanced access controls
The migration path from Phoenix OSS to AX is designed to be frictionless: because both platforms share the OpenInference instrumentation standard, teams can move from self-hosted Phoenix to cloud-managed AX by changing a single endpoint configuration without rewriting any instrumentation code.
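A minimal sketch of that endpoint-only switch, using the `phoenix.otel` register helper; the project name and endpoint values are illustrative placeholders, and the credentials required by Arize AX are not shown here.

```python
from phoenix.otel import register

# Self-hosted Phoenix: export spans to the local OTLP collector endpoint.
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Moving to a managed backend is, in principle, a matter of changing this
# endpoint (and supplying credentials); the instrumentation code that emits
# spans does not change.
```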
OpenInference is an open-source instrumentation specification and library developed by Arize AI. It extends OpenTelemetry, the widely adopted observability standard, with semantic conventions specifically designed for AI and LLM applications. OpenInference is the foundational layer that enables Phoenix and Arize AX to collect structured telemetry data from AI applications.
OpenTelemetry provides a vendor-neutral framework for distributed tracing, metrics, and logs across software systems. However, its generic span model does not natively represent concepts specific to AI workloads, such as LLM input and output messages, token counts, embedding vectors, retrieved documents, tool call arguments, or model parameters.
OpenInference addresses this gap by defining a complementary set of semantic conventions layered on top of OpenTelemetry. An OpenInference span carries all standard OpenTelemetry fields plus AI-specific attributes using dot-notation naming conventions. For example, `llm.model_name` identifies the model, `llm.input_messages` captures the prompt, `llm.output_messages` captures the completion, `llm.token_count.prompt` and `llm.token_count.completion` record token usage, and `llm.invocation_parameters` stores serialized model configuration.
Because OpenInference extends rather than replaces OpenTelemetry, traces produced by OpenInference instrumentation can be sent to any OpenTelemetry-compatible backend in addition to Phoenix and Arize AX. This design avoids vendor lock-in at the instrumentation layer.
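As a concrete illustration, the following sketch sets OpenInference attribute names on a plain OpenTelemetry span by hand. It assumes a tracer provider has already been configured (for example via Phoenix's register helper); the message contents and token counts are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Record a single LLM call as an OpenInference-flavored OpenTelemetry span.
with tracer.start_as_current_span("chat-completion") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-4o-mini")
    span.set_attribute("llm.input_messages.0.message.role", "user")
    span.set_attribute("llm.input_messages.0.message.content", "What is OpenInference?")
    # ... the actual model call would happen here ...
    span.set_attribute("llm.output_messages.0.message.role", "assistant")
    span.set_attribute("llm.output_messages.0.message.content", "An instrumentation spec.")
    span.set_attribute("llm.token_count.prompt", 12)
    span.set_attribute("llm.token_count.completion", 9)
```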
OpenInference defines a required `openinference.span.kind` attribute that identifies the semantic type of each operation. The supported span kinds are:

- CHAIN: a general processing step or composition of other operations
- LLM: a call to a language model
- TOOL: a tool or function invocation
- AGENT: the root of an agent's reasoning loop
- RETRIEVER: a document or vector-store retrieval
- EMBEDDING: an embedding generation call
- RERANKER: reordering of retrieved documents
- GUARDRAIL: a safety or policy check
- EVALUATOR: an evaluation of another span's output
- UNKNOWN: a fallback for operations that do not fit the other categories
These span kinds enable Phoenix and downstream analysis tools to precisely filter, aggregate, and visualize the specific components of complex AI pipelines.
OpenInference provides auto-instrumentation packages for three languages:
Python has the broadest support with over 40 packages covering major LLM frameworks (LangChain, LlamaIndex, DSPy, Haystack), LLM providers (OpenAI, Anthropic, Amazon Bedrock, Groq, Mistral, Cohere), agent frameworks (CrewAI, AutoGen, PydanticAI, Agno), and specialized tools (Guardrails AI, Portkey).
JavaScript and TypeScript offer more than 10 packages including integrations for the OpenAI SDK, Anthropic SDK, Amazon Bedrock, LangChain.js, Vercel AI SDK, and TanStack AI.
Java provides four packages targeting enterprise frameworks including LangChain4j and Spring AI.
Auto-instrumentation packages require only an import and a brief configuration call to activate. Manual instrumentation using the standard OpenTelemetry SDK is also supported for custom components not covered by existing packages.
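A minimal sketch of that import-and-configure pattern, assuming the arize-phoenix-otel and openinference-instrumentation-openai packages are installed and a Phoenix server is reachable at its default local endpoint.

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(project_name="demo")

# Activate auto-instrumentation: subsequent OpenAI SDK calls emit OpenInference spans.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```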
Phoenix operates as a server that receives OpenTelemetry-formatted trace data and stores it in a relational database (PostgreSQL for self-hosted deployments). Client applications instrument their AI code using OpenInference packages, which emit spans to the Phoenix server over the OTLP (OpenTelemetry Protocol) endpoint.
When an instrumented application handles a request, each logical operation within the AI pipeline generates a span. These spans form a trace: a hierarchical tree representing the full execution of a request. A RAG pipeline trace, for example, might contain a top-level CHAIN span enclosing a RETRIEVER span (vector search), an LLM span (the language model call with retrieved context), and optionally an EVALUATOR span (online evaluation of the response).
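The sketch below illustrates that hierarchy with the plain OpenTelemetry API and OpenInference span kinds; vector_search and call_llm are hypothetical stand-ins for a real retriever and model call.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def vector_search(question):   # hypothetical retrieval stub
    return ["doc-1", "doc-2"]

def call_llm(question, docs):  # hypothetical generation stub
    return "generated answer"

def answer(question: str) -> str:
    # Top-level CHAIN span encloses the whole pipeline.
    with tracer.start_as_current_span("rag-pipeline") as chain_span:
        chain_span.set_attribute("openinference.span.kind", "CHAIN")

        # Child RETRIEVER span: vector search over the knowledge base.
        with tracer.start_as_current_span("retrieve") as retriever_span:
            retriever_span.set_attribute("openinference.span.kind", "RETRIEVER")
            docs = vector_search(question)

        # Child LLM span: generation with the retrieved context.
        with tracer.start_as_current_span("generate") as llm_span:
            llm_span.set_attribute("openinference.span.kind", "LLM")
            return call_llm(question, docs)
```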
Phoenix stores traces with full fidelity, including input and output content, token counts, latency, model parameters, and any custom attributes set by the application. The web UI provides trace visualization, allowing engineers to inspect individual runs, identify slow or failing steps, and compare across sessions.
Phoenix's evaluation capabilities operate in two modes: offline evaluation against stored datasets and online evaluation running at inference time in production. Both modes use the same evaluation primitives.
Evaluations are attached to spans and traces as structured scores with optional explanatory labels. Results are stored in the Phoenix database and surfaced in the platform's experiment comparison views.
Phoenix supports systematic experimentation through dataset management. Traces can be grouped into versioned datasets of representative examples. Teams can then run different versions of their application (different prompts, models, retrieval strategies) against the same dataset and compare evaluation metrics across runs. This workflow supports data-driven iteration rather than ad hoc testing.
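A minimal sketch of this workflow with the Phoenix datasets and experiments APIs, assuming a running Phoenix server; the example data, task, and evaluator are toy placeholders.

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()

# Upload a small versioned dataset of representative examples.
dataset = client.upload_dataset(
    dataset_name="support-questions",
    dataframe=pd.DataFrame(
        {
            "question": ["How do I reset my password?"],
            "expected": ["Use the password reset link on the sign-in page."],
        }
    ),
    input_keys=["question"],
    output_keys=["expected"],
)

# The task is the application variant under test (toy placeholder here).
def task(example):
    return "Use the password reset link on the sign-in page."

# A simple evaluator comparing the task output to the expected answer.
def matches_expected(output, expected) -> float:
    return float(expected["expected"].lower() in output.lower())

# Run the experiment; results and scores appear in the Phoenix UI.
run_experiment(dataset, task, evaluators=[matches_expected])
```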
Added in April 2025, Phoenix's prompt management module provides version control, tagging, and experimentation for prompt templates. Prompts can be stored centrally, tagged with metadata, and pulled at runtime by applications, enabling prompt changes to be tracked and rolled back independently of application code deployments.
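A sketch of pulling a centrally stored prompt at runtime via the Phoenix client; the prompt identifier, variable names, and the exact shape of the returned object are assumptions for illustration and may differ across Phoenix versions.

```python
from phoenix.client import Client

client = Client()  # connects to the Phoenix server configured via environment variables

# Fetch the current (or a tagged) version of a centrally stored prompt.
prompt = client.prompts.get(prompt_identifier="support-answer")  # assumed prompt name

# Render the template with runtime variables before invoking the model.
invocation_kwargs = prompt.format(variables={"question": "How do I reset my password?"})
```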
Phoenix can be deployed in multiple configurations:

- Locally in a terminal or Jupyter notebook session for interactive debugging
- As a standalone server in a Docker container, backed by PostgreSQL
- On Kubernetes via Helm charts for production self-hosting
- As a managed service, by pointing instrumentation at Arize AX instead of a self-hosted server
Phoenix's tracing feature captures the full execution path of AI application requests. For simple LLM calls this means recording the input prompt, model parameters, output completion, and token usage. For complex multi-step applications such as agents or RAG pipelines, tracing captures every operation in the chain as a structured span hierarchy.
Distributed tracing support means Phoenix can track requests that span multiple services or processes. This is increasingly important for multi-agent systems where different agents may run as separate services and communicate via API calls or message queues.
Session-level tracing groups related traces from multi-turn conversations or task sequences, enabling analysis of agent behavior over an entire workflow rather than just individual interactions.
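A brief sketch of session-level grouping using the context manager from the openinference-instrumentation package; the session identifier and the handle_user_turn function are illustrative placeholders.

```python
from openinference.instrumentation import using_session

def handle_user_turn(message: str) -> str:  # hypothetical, already-instrumented app code
    return "It is sunny."

# Spans emitted inside this context carry the same session identifier, so
# Phoenix can group every turn of a conversation into a single session view.
with using_session(session_id="conversation-1234"):
    handle_user_turn("What's the weather in Berkeley?")
    handle_user_turn("And tomorrow?")
```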
Phoenix provides three types of evaluation: LLM-based evaluators, code-based (deterministic) evaluators, and human annotation.
Code-based evaluators apply rule-based logic, regular expressions, or structured checks to outputs. These are fast and cheap but limited to properties that can be defined algorithmically.
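For example, a minimal deterministic check might look like the following toy evaluator, which flags responses that leak an email address.

```python
import re

def contains_no_email(output: str) -> bool:
    """Deterministic check: fail if the response contains an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None
```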
Human annotation allows team members to manually label traces in the Phoenix UI, building gold-standard datasets for benchmarking automated evaluators.
LLM-based evaluation is Phoenix's primary evaluation mechanism for semantic quality assessment, described in detail in the Evaluation Framework section below.
Phoenix also integrates with third-party evaluation libraries including Ragas, Deepeval, and Cleanlab, allowing teams to use specialized evaluation tools while storing results in Phoenix for centralized tracking.
The Phoenix prompt playground allows engineers to select a span from a trace, modify the prompt or model configuration, and replay the call to compare outputs side by side. This enables rapid prompt iteration grounded in real production examples rather than artificially constructed test cases.
In production deployments, Phoenix provides ongoing monitoring of evaluation scores, latency, error rates, and token costs. Arize AX extends these capabilities with alerting rules that notify teams when metrics cross defined thresholds.
Arize Phoenix's evaluation framework is one of its most developed capabilities. The framework addresses a fundamental challenge of LLM applications: standard software quality metrics do not capture semantic correctness. A response can be syntactically valid, complete, and grammatically correct while still hallucinating facts, ignoring the user's question, or producing content that violates safety policies. Evaluating these properties at production scale requires an approach that goes beyond deterministic rules.
Phoenix uses a paradigm called LLM-as-judge (or LLM-as-a-judge), where a second language model is used to assess the outputs of the primary application model. The judge model receives the input, output, and optionally the retrieved context or other relevant information, along with a written rubric defining the evaluation criteria. It then returns a structured score and optionally an explanation.
Binary judgments (pass/fail) tend to produce more reliable results than granular numeric scoring, as binary classifications are less sensitive to rubric ambiguity. Chain-of-thought explanations from the judge model can improve alignment with human judgment, though at higher inference cost.
According to Arize's evaluation research, LLM-as-judge approaches typically achieve 75 to 90 percent agreement with human evaluators when applied to well-defined criteria, making them suitable for high-throughput automated evaluation at production scale.
Phoenix and Arize AX provide a library of pre-built evaluation templates tested against benchmarked datasets. These evaluators are available through the arize-phoenix-evals Python package on PyPI and are integrated into the Phoenix UI as one-click evaluation runs. The pre-built templates cover criteria including hallucination detection, question-and-answer correctness, retrieval (context) relevance, toxicity, summarization quality, and code generation correctness.
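A minimal sketch of running the pre-built hallucination evaluator from arize-phoenix-evals over a small dataframe; the column values and judge model choice are illustrative.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row pairs a question, the retrieved reference text, and the model's answer.
df = pd.DataFrame(
    {
        "input": ["When was Phoenix first released?"],
        "reference": ["Phoenix was launched on April 26, 2023."],
        "output": ["Phoenix was first released in 2021."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),          # the judge model
    template=HALLUCINATION_PROMPT_TEMPLATE,          # pre-built rubric
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,                        # chain-of-thought explanation
)
print(results[["label", "explanation"]])
```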
Phoenix supports both offline and online evaluation modes.
Offline evaluation runs against stored datasets after inference. This is suited for systematic benchmarking, experiment comparison, and regression testing. Engineers can batch-evaluate hundreds or thousands of traces to build a comprehensive picture of application quality before deploying changes.
Online evaluation runs at inference time in production, attaching evaluation scores to traces as they are generated. This provides continuous quality monitoring but must account for latency constraints, as calling an LLM evaluator adds time to the response path. Online evaluations in Phoenix are typically implemented asynchronously to avoid blocking the primary response.
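A minimal sketch of attaching evaluation results back to traces so they appear alongside the spans in the Phoenix UI; the scores dataframe here is a stub standing in for the output of an evaluator run over exported spans, and the span ID is a placeholder.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# Evaluation scores indexed by the span IDs of the traces being annotated.
evals_df = pd.DataFrame(
    {"label": ["factual"], "score": [1.0]},
    index=pd.Index(["0123456789abcdef"], name="context.span_id"),
)

# Log the scores so they are stored with the corresponding traces.
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
```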
Phoenix does not require teams to use its built-in evaluators. The platform integrates with Ragas (an evaluation framework specialized for RAG systems), Deepeval, and Cleanlab, allowing evaluation scores from these tools to be stored in Phoenix alongside traces for unified reporting.
Arize AI operates in a market that grew rapidly with the adoption of large language models in production applications beginning in 2023. The company's CEO Jason Lopatecki described growth as "explosive" over the 2023-2025 period, attributing it to falling barriers to AI development that brought a large new population of developers into production AI for the first time. As of the Series C announcement in February 2025, the company cited over two million monthly downloads for the Phoenix platform.
The broader LLM observability category attracted significant investment and multiple well-funded competitors. In the context of the Series C, Lopatecki noted that Galileo (which had raised $68 million) and Patronus AI (which had raised $20 million) were among the competitors, while observing that the involvement of strategic backers like Microsoft and Datadog reflected expectations that the market would grow rapidly: "It's kind of why I think the Microsofts and Datadogs are investing in us...I expect it to be a fast, growing, large market."
Co-founder Aparna Dhinakaran described the broader problem the company is solving: "Models are making billions of dollars of decisions...if we have no way of understanding why they're going wrong, we're not putting the best decision makers into our world." This framing positioned Arize not as a developer tooling company but as essential infrastructure for responsible AI deployment.
In connection with the Series C, Arize also announced two open-source research initiatives, OpenEvals and AgentEvals. OpenEvals is a repository of standardized LLM evaluation benchmarks and methodologies contributed to the broader community. AgentEvals focuses specifically on evaluation frameworks for autonomous agent behavior, including metrics for task completion, plan quality, and multi-step coherence.
The LLM observability and evaluation space includes several tools with different design philosophies, pricing models, and target users.
| Feature | Phoenix (Arize) | LangSmith | LangFuse | Helicone |
|---|---|---|---|---|
| License | Open source (self-hosted) + managed cloud | Proprietary SaaS | MIT (self-hosted) + managed cloud | Open source + managed cloud |
| Instrumentation | OpenTelemetry / OpenInference | LangChain SDK (also supports OTEL) | SDK-based | HTTP proxy |
| Framework lock-in | None | LangChain dependency for best experience | Minimal | Minimal |
| Self-hosted free tier | Unlimited | Not available | Unlimited | Limited |
| Cloud free tier | 25k spans/month (AX) | 5k traces/month | 50k events/month | 10k requests/month |
| LLM evaluation | Native, extensive pre-built evals | Yes | Yes (Boolean and Categorical LLM-as-Judge) | Limited |
| Agent tracing | Yes (10 span kinds, multi-agent) | Yes | Yes | Limited |
| Prompt management | Yes (added April 2025) | Yes | Yes | No |
| Setup complexity | 2-4 hours | 30 minutes (with LangChain) | 4-8 hours | 15 minutes |
| Enterprise compliance | SOC2, HIPAA (AX Enterprise) | Yes | Limited | Limited |
| Status (2026) | Active development | Active development | Active development | Acquired by Mintlify, maintenance mode |
LangSmith is developed by LangChain and benefits from deep integration with the LangChain and LangGraph ecosystems. Teams building primarily with LangChain can instrument with zero code changes and gain immediate debugging visibility. The tradeoff is that LangSmith is a closed-source proprietary service, and its per-trace pricing structure can scale steeply at high volumes. Teams not using LangChain derive less value from the platform.
LangFuse is a fully open-source platform licensed under the MIT license. It emphasizes prompt engineering workflows and collaborative development features. LangFuse's cloud pricing is relatively expensive at higher tiers, but the self-hosted option is free without volume limits. The platform uses ClickHouse as its backing store, providing strong analytics performance.
Helicone takes a different architectural approach, operating as an HTTP proxy layer in front of LLM providers rather than through SDK instrumentation. This makes setup extremely fast (as little as 15 minutes) but limits visibility to what can be captured at the HTTP layer. As of 2025, Helicone was acquired by Mintlify and has entered maintenance mode, receiving security updates but no new feature development. It is generally not recommended for new projects.
Phoenix's primary differentiators are its vendor-neutral instrumentation (based on open standards), its depth of evaluation capabilities, its strong support for complex agent architectures, and its clear upgrade path from open-source self-hosting to enterprise cloud. The main tradeoff relative to hosted alternatives is the operational overhead of managing self-hosted infrastructure and the learning curve associated with OpenTelemetry concepts.
Arize AI's customer base spans large technology companies, consumer internet platforms, and enterprises deploying AI in regulated industries. Named enterprise customers include DoorDash, Instacart, Reddit, Uber, Booking.com, PepsiCo, Handshake, TripAdvisor, Siemens, Flipkart, Condé Nast, Discord, Etsy, and Honeywell. Microsoft (via its M12 venture fund) and Datadog have invested in the company as strategic backers, adding credibility in the enterprise software and infrastructure monitoring markets.
Retrieval-augmented generation systems are among the most common production deployments of LLMs, combining document retrieval with language model generation to answer questions over private knowledge bases. RAG pipelines introduce multiple potential failure points: the retrieval system may return irrelevant documents, the model may hallucinate facts not present in the retrieved context, or the model may ignore the retrieved context entirely. Phoenix's tracing captures each stage of the pipeline separately, while its evaluation templates specifically address RAG quality metrics including context relevance, faithfulness, and answer correctness.
Autonomous AI agents that use tool calls, multi-step reasoning, and iterative planning present particular observability challenges. A single user request may trigger dozens of tool calls across multiple turns, with each tool call potentially triggering additional sub-agents or API requests. Phoenix's span kind taxonomy, especially the AGENT and TOOL span types, enables structured capture of agent trajectories. The platform's session-level observability allows teams to evaluate whether agents successfully complete multi-turn tasks, maintain context across turns, and achieve stated goals.
For teams running LLM applications in production, Phoenix provides continuous evaluation of response quality. By running asynchronous LLM-as-judge evaluations on sampled production traffic, teams can detect quality regressions introduced by model provider updates, prompt changes, or shifts in user query patterns without waiting for user complaints or downstream business metrics to surface problems.
Healthcare and financial services organizations deploying AI applications face regulatory requirements around auditability, data residency, and compliance. Arize AX's HIPAA compliance certification and SOC 2 Type II compliance make it suitable for these environments. The platform's comprehensive audit trails and access control features support the governance requirements of regulated deployments.
Large engineering organizations frequently operate AI applications built on heterogeneous stacks: one product may use LangChain with OpenAI, another may use LlamaIndex with Amazon Bedrock, and a third may use a custom agent framework with Anthropic's Claude. Because Phoenix's instrumentation is based on OpenTelemetry and OpenInference standards rather than proprietary SDKs, it can ingest traces from all of these configurations into a single platform without requiring different tools for different stacks. This unified view is valuable for platform and infrastructure teams that need to provide observability across an organization's entire AI footprint rather than for a single application.
Phoenix is also used during development and in continuous integration pipelines rather than only in production. By running offline evaluations against representative datasets as part of a pull request review process, teams can detect evaluation regressions before deploying changes to production. This shift-left approach to LLM quality assurance mirrors established practices in traditional software testing. Phoenix's dataset and experiment APIs expose programmatic interfaces that allow evaluation runs to be triggered from CI/CD systems and integrated with build status reporting.
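A sketch of triggering an evaluation run from a CI job, assuming a dataset named "regression-suite" already exists in Phoenix; the task and evaluator are toy placeholders, and how aggregate scores are extracted for a pass/fail gate varies by Phoenix version.

```python
import phoenix as px
from phoenix.experiments import run_experiment

def task(example):
    # Hypothetical call into the application build under test.
    return "placeholder answer"

def correctness(output, expected) -> float:
    # Toy exact-match evaluator against the dataset's expected output column.
    return float(output.strip() == str(expected.get("answer", "")).strip())

# Fetch the shared regression dataset and run the experiment; a CI wrapper can
# then inspect the resulting scores and fail the build if they regress.
dataset = px.Client().get_dataset(name="regression-suite")
run_experiment(dataset, task, evaluators=[correctness])
```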
Despite Phoenix's broad capabilities, several limitations are relevant for prospective adopters.
OpenTelemetry learning curve: Teams without existing OpenTelemetry experience face a non-trivial onboarding process. Understanding the span model, configuring exporters, and debugging instrumentation issues requires familiarity with distributed tracing concepts that may be new to ML-focused engineers.
Infrastructure management overhead: Self-hosted Phoenix requires teams to provision, operate, and maintain the server and database. At production scale this adds operational complexity compared to fully managed hosted solutions.
Limited operational metrics: Phoenix is primarily an observability tool oriented around trace analysis and quality evaluation. It does not provide the rich infrastructure and operational monitoring capabilities of dedicated infrastructure observability platforms. Teams typically use Phoenix alongside infrastructure monitoring tools rather than as a replacement.
Evaluation cost: Running LLM-as-judge evaluations at production scale incurs inference costs for the evaluator model. Evaluating every production trace in real time may be cost-prohibitive for high-volume applications, requiring teams to implement sampling strategies.
Enterprise feature separation: Several capabilities of practical importance in enterprise environments, including Arize AI Copilot/Alyx, ML and computer vision support, HIPAA compliance, and advanced security controls, are available only in the paid Arize AX tier rather than the open-source Phoenix distribution.
Relative maturity of agent evaluation: While Phoenix has expanded rapidly into agent tracing, the evaluation primitives for agentic workflows (trajectory evaluation, goal-completion scoring, multi-turn coherence) are less mature and standardized than those for simpler LLM call or RAG use cases. This is an active area of development across the entire LLM observability field, not unique to Phoenix.
Embedding analysis reduced prominence: Early versions of Phoenix placed heavy emphasis on embedding-based analysis for understanding LLM behavior, including t-SNE and UMAP visualizations of embedding clusters. While these features remain available, the shift toward text-based evaluation and structured tracing has reduced their prominence in the typical Phoenix workflow. Teams primarily evaluating LLM response quality rather than investigating embedding spaces may find these features less relevant to their daily work.
Phoenix maintains an active open-source community centered on its GitHub repository, which had accumulated over 9,600 stars and 851 forks as of mid-2026. The repository uses an issue tracker for feature requests and bug reports, and Arize has maintained a rapid release cadence, with over 690 releases published since 2023.
The OpenInference specification is maintained as a separate GitHub repository and has been adopted beyond the Arize ecosystem. Oracle's AI and Data Science platform announced integration with Phoenix and OpenInference, and Google's Agent Development Kit (ADK) documentation includes official Phoenix integration instructions. Amazon Web Services published a reference architecture for Amazon Bedrock Agents observability using Arize. These integrations indicate that OpenInference is gaining traction as a standard instrumentation layer in the AI ecosystem, not only within the Arize user base.
Arize hosts an annual user conference called Arize:Observe, which served as the launch venue for Phoenix in 2023. The company also publishes research through its blog and public repositories, including contributions to the emerging standards for LLM and agent evaluation methodology.