LangSmith
Last reviewed
Jun 5, 2026
Sources
36 citations
Review status
Source-backed
Revision
v2 · 7,432 words
LangSmith is a commercial observability, evaluation, and deployment platform for large language model (LLM) applications and AI agents, developed and operated by LangChain Inc. It provides developers and operations teams with tooling to trace, debug, test, evaluate, and monitor the behavior of LLM-powered systems throughout their lifecycle, from early prototyping through production. LangSmith is available as a hosted cloud service at smith.langchain.com and as a self-hosted enterprise deployment.
Launched in closed beta in August 2023 and reaching general availability in February 2024, LangSmith has become one of the most widely adopted platforms in the LLM observability category. By October 2025, when LangChain Inc. raised a $125 million Series B at a $1.25 billion valuation, the platform had processed more than one billion traces and demonstrated 12x year-over-year growth in monthly trace volume. As of 2025, LangSmith is used by organizations including Replit, Clay, Harvey, Rippling, Cloudflare, Workday, Cisco, LinkedIn, Uber, Rakuten, J.P. Morgan, and BlackRock.
Although LangSmith integrates most naturally with LangChain and LangGraph, it was designed from the outset to be framework-agnostic. Since March 2025 it has supported end-to-end OpenTelemetry, which lets it ingest trace data from any application that emits standard OpenTelemetry signals. Through late 2025 and the first half of 2026, LangSmith broadened from an observability and evaluation tool into a fuller agent engineering platform, adding a no-code Agent Builder (later renamed Fleet), a Rust-based observability database called SmithDB, and an autonomous debugging system called LangSmith Engine. [25][26]
LangChain began in October 2022 as an open-source Python library created by Harrison Chase, then an engineer at Robust Intelligence, in the weeks following OpenAI's launch of ChatGPT. The library gave developers a structured way to chain LLM calls, connect to external tools and data sources, and build applications that required multiple reasoning steps. It grew rapidly, accumulating tens of thousands of GitHub stars within months and establishing itself as the dominant framework for LLM application development in 2023.
As LangChain applications moved from demos into production, teams encountered a problem that had no adequate solution: LLM systems were opaque. Traditional software monitoring tools reported latency, error rates, and resource utilization, but they could not answer the questions that mattered most for LLM applications: Why did the model give a wrong answer? What context was sent to the model? Which retrieval step returned irrelevant documents? How does changing a prompt affect output quality across hundreds of examples?
Standard APM (application performance monitoring) tools such as Datadog or New Relic could observe the infrastructure around an LLM call but had no visibility into the semantic content of the call itself. Developers resorted to ad hoc logging, printing prompts and responses to console output, or writing custom evaluation scripts for each project.
LangChain Inc. formed as a company around the open-source library in early 2023, with Harrison Chase as chief executive officer. The team concluded that the observability gap was blocking adoption at scale and began designing LangSmith as a dedicated platform to address it.
LangChain Inc. is headquartered in San Francisco. Beyond the LangSmith platform, the company maintains LangChain (the open-source orchestration framework), LangGraph (a stateful agent orchestration library), and LangGraph Platform (a hosted deployment environment for LangGraph applications). The company's business model centers on LangSmith subscriptions and LangGraph Platform compute, with the open-source frameworks serving as the primary developer acquisition channel.
LangChain and LangGraph together reached 90 million combined monthly downloads as of late 2025. Approximately 35% of Fortune 500 companies were using LangChain products in some capacity by that time.
As of mid-2026, the October 2025 Series B remained LangChain Inc.'s most recent funding round, leaving cumulative funding at roughly $260 million across four rounds, and no Series C had been announced. The company had disclosed annual recurring revenue in the range of $12 to $16 million as of mid-2025 and indicated that ARR had continued to grow afterward, though it did not publish an updated figure. [27]
LangChain Inc. announced LangSmith on August 18, 2023, launching it in closed beta. The announcement framed LangSmith as "a unified platform for debugging, testing, evaluating, and monitoring your LLM applications." At the time, developers working with LangChain had no systematic way to inspect the intermediate steps of a chain or agent, compare the effect of prompt changes across a test set, or annotate model outputs for quality assessment.
The closed beta required invitation-based signup. LangChain Inc. positioned LangSmith as separate from the LangChain framework from the outset, committing to support any LLM framework rather than locking the tool to LangChain-native workflows. The team's reasoning was that observability infrastructure is only useful if it works regardless of how the underlying application was built.
On February 15, 2024, LangChain Inc. announced the general availability of LangSmith and simultaneously announced a $25 million Series A funding round led by Sequoia Capital, with Sonya Huang representing Sequoia on the deal. At general availability, the platform had accumulated more than 80,000 signups and was serving more than 5,000 monthly active teams. In January 2024 alone, users had logged 40 million traces through the platform.
The GA announcement introduced a formal pricing structure with a free Developer tier, a paid Plus tier aimed at teams, and an Enterprise tier for large organizations. Early named customers included Rakuten, Elastic, Moody's, and Retool.
The Series A valued LangChain Inc. at approximately $200 million post-money, establishing the company as a significant venture-backed player in the AI developer tools market.
Through 2024 and into 2025, LangSmith expanded beyond core tracing and evaluation to encompass prompt management, online evaluation of production traffic, annotation queue workflows for human review, and deployment capabilities for LangGraph agents. The platform began incorporating an AI assistant called Polly for trace analysis and an Insights Agent that automatically categorizes agent behavior patterns from production data.
In July 2025, LangSmith became available on AWS Marketplace, enabling enterprise customers to procure it through existing AWS accounts and committed cloud spend agreements.
In October 2025, LangChain Inc. announced a $125 million Series B funding round at a $1.25 billion valuation, achieving unicorn status. The round was led by IVP, with participation from existing investors Sequoia, Benchmark, and Amplify, and new investors including CapitalG, Sapphire Ventures, ServiceNow Ventures, Workday Ventures, Cisco Investments, Datadog Ventures, Databricks Ventures, and Frontline.
The company disclosed that LangSmith had logged more than one billion cumulative traces and had grown monthly trace volume by 12x year-over-year. LangChain Inc. disclosed annual recurring revenue (ARR) in the range of $12-16 million and reported serving approximately 1,000 paying customers.
The Series B announcement coincided with the release of LangChain 1.0, built on the LangGraph runtime, and the private preview launch of Agent Builder, a no-code tool for building agents through natural language descriptions.
The Agent Builder private preview first appeared in LangChain Inc.'s changelog on October 29, 2025, positioned as a way to build "real agents, not visual workflows" by delegating decision-making to an LLM rather than mapping out every step in a fixed flowchart. The tool was aimed primarily at internal productivity use cases and at users without a software engineering background. [28]
On December 2, 2025, LangChain Inc. opened Agent Builder to a public beta available to all LangSmith users. The beta added custom tools through Model Context Protocol (MCP) servers, workspace agents that teams could browse and clone, multi-model support spanning OpenAI and Anthropic models, programmatic invocation through an API, and event-based triggers from sources such as Gmail and Slack. The company highlighted use cases including sales research, competitor tracking, recruiting, issue creation, email triage, calendar management, and Slack summarization. [29][30]
Two further developer-facing tools shipped in December 2025. LangSmith Fetch, released on December 10, 2025, is a command-line tool that lets developers pull and inspect traces directly from the terminal. Pairwise Annotation Queues, released on December 17, 2025, generalized the existing annotation workflow to side-by-side comparison of two agent outputs for preference labeling. [25]
LangSmith Agent Builder reached general availability around January 13 to 14, 2026, roughly six weeks after the public beta. At GA, Agent Builder was made available on all LangSmith plans, including the free Developer tier, and was pitched as letting anyone build agents for complex daily tasks without writing code. The release was framed as part of LangChain Inc.'s push toward enterprise adoption following the October 2025 LangChain 1.0 milestone. [31][32]
On January 16, 2026, LangChain Inc. shipped LangSmith Self-Hosted v0.13, a major release that expanded feature parity between the self-hosted and cloud versions and brought the Insights capability to self-hosted deployments. [25]
A run of feature updates followed in February 2026. Cost Tracking, released on February 5, 2026, gave teams a unified view of cost across an entire agent workflow rather than LLM calls alone, including the ability to submit custom cost metadata. Configurable trace previews arrived on February 6, 2026, letting users choose which input and output components appear in the tracing table. The Insights Agent gained recurring schedules (daily, weekly, or custom cron expressions) on February 17, 2026. Agent Builder added chat, file upload, and a centralized tool registry on February 18, 2026. Baseline Experiments, released on February 19, 2026, let teams designate any experiment as a baseline so that later runs are automatically compared against it. [25]
On March 19, 2026, LangChain Inc. renamed Agent Builder to LangSmith Fleet, reflecting an emphasis on managing a fleet of agents across an organization with agent identity, sharing, and permissions. The rebrand kept the underlying no-code agent product and its capabilities intact. [25][26]
At Interrupt 2026, the company's second annual agent conference, held May 13 to 14, 2026 at The Midway in San Francisco's Dogpatch neighborhood, LangChain Inc. announced a wave of new LangSmith infrastructure aimed at building, debugging, and operating agents at production scale. The lineup included LangSmith Engine, SmithDB, Managed Deep Agents, Context Hub, LLM Gateway, and the general availability of Sandboxes, alongside expansions to Fleet and the Deep Agents library. [33][34]
LangSmith Engine, introduced in public beta on May 13, 2026, is an autonomous debugging system that closes the loop from production failure to fix. Engine continuously watches production traces, clusters failures into named issues, diagnoses root causes against the team's code, drafts pull requests with targeted prompt or code fixes, proposes custom online evaluators, and extracts failing traces into offline evaluation datasets so regressions do not recur. Teams connect a tracing project and optionally link a GitHub repository, after which Engine begins surfacing issues automatically; the initial analysis run takes up to roughly 20 minutes. Industry coverage framed Engine as an attempt to automate the manual cycle of reading traces and writing fixes, while noting that multi-model enterprises may still want a vendor-neutral observability layer given LangSmith's tie to the LangChain ecosystem. [34][35]
SmithDB, announced at Interrupt 2026 on May 14, 2026, is a database purpose-built for agent observability that now backs core LangSmith workloads. It is written in Rust on top of Apache DataFusion and the Vortex columnar format, with an object-storage backend and stateless query services. LangChain Inc. reported that SmithDB delivers up to 15x faster performance on core LangSmith experiences, with median (P50) trace tree loads at 92 milliseconds and P50 single run loads at 71 milliseconds, and said it was live for all United States cloud customers. [33][34]
Managed Deep Agents is an API-first hosted runtime for deep agents, exposing a /v1/deepagents API and supporting durable execution, persistent context, tool access, sandboxed code execution, and long-running task workflows. Context Hub is a central system for managing agent behavioral files such as AGENTS.md, skills, and policies, with versioning, environment tags (dev, staging, prod), and collaborative comments. LangSmith LLM Gateway, launched in private beta, is a runtime governance layer that enforces spend limits at the organization, workspace, user, and API-key levels and detects and redacts personally identifiable information and secrets before requests leave a customer's environment, with integrated audit logging. [33]
LangSmith Sandboxes reached general availability at Interrupt 2026. Sandboxes are hardware-virtualized microVM environments for safe code execution, supporting snapshots, copy-on-write forking, reusable blueprints, automatic pause when idle, command-line tooling, and an authentication proxy with custom callbacks. Fleet was extended with public-beta Sandbox integration for data analysis, file generation, and coding tasks, and shipped five prebuilt agents: a coding agent based on Open SWE, a go-to-market agent, an X content manager, an executive assistant, and a competitive researcher. The Developer and Plus plans began including free model usage backed by Fireworks inference. The Deep Agents library reached version 0.6 with a lightweight code interpreter, typed streaming for structured events, a DeltaChannel mechanism for efficient checkpoint storage, and frontend streaming support for React, Vue, Svelte, and Angular. LangChain Inc. also launched LangChain Labs, an applied research initiative working with partners including Harvey, NVIDIA, Prime Intellect, Fireworks, and Baseten on continual learning from agent traces, evaluation generation, and prompt optimization. [33]
LangSmith organizes observability data in a four-level hierarchy: organizations, workspaces, projects, and traces.
An organization is the top-level billing and administrative unit. Within an organization, workspaces provide logical isolation between environments (for example, development, staging, and production). The Plus tier allows up to three workspaces per organization; Enterprise accounts can configure custom workspace counts.
A project is a named container that groups all traces for a single application or service. Developers typically create one project per application or deployment environment.
A trace represents a single end-to-end operation: one user request processed through a chain or agent. Each trace is a tree of runs. A run is the atomic unit of observation in LangSmith, equivalent to a span in OpenTelemetry terminology. Runs can represent any discrete step: an LLM call, a retrieval operation, a tool invocation, a document formatting step, or a custom function. The root run of a trace represents the top-level entry point, and child runs represent nested operations invoked during execution. A single trace can contain up to 25,000 runs.
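The tree structure described above can be pictured with a small data-structure sketch. The `Run` class, its fields, and the run names below are hypothetical stand-ins chosen for illustration; they are not the LangSmith SDK's own types.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Run:
    """One node in a trace tree (illustrative stand-in, not the SDK class)."""

    name: str
    run_type: str  # e.g. "chain", "retriever", "llm", "tool"
    children: list[Run] = field(default_factory=list)

    def add_child(self, name: str, run_type: str) -> Run:
        """Attach a nested operation under this run and return it."""
        child = Run(name, run_type)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, run) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# One trace = one root run plus the child runs invoked during execution.
root = Run("answer_question", "chain")
root.add_child("fetch_documents", "retriever")
agent_step = root.add_child("plan_and_act", "chain")
agent_step.add_child("call_model", "llm")
agent_step.add_child("search_tool", "tool")

for depth, run in root.walk():
    print("  " * depth + f"{run.name} [{run.run_type}]")

run_count = sum(1 for _ in root.walk())
```

Counting the nodes yielded by `walk()` gives the number of runs in the trace, which is the quantity the per-trace run limit applies to.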
Multiple traces from a multi-turn conversation can be linked into a thread using a shared session_id, thread_id, or conversation_id identifier, enabling analysis of multi-turn interactions as a coherent unit.
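As a rough illustration of thread linking, the sketch below groups traces by whichever of the three identifiers appears in their metadata. The trace dictionaries and the helper names are assumptions made for the example, and the order in which the keys are checked is arbitrary rather than documented behavior.

```python
from __future__ import annotations

from collections import defaultdict

# Metadata keys named in the text as thread identifiers.
THREAD_KEYS = ("session_id", "thread_id", "conversation_id")


def thread_id_of(metadata: dict) -> str | None:
    """Return the first thread identifier found in the metadata, if any."""
    for key in THREAD_KEYS:
        if key in metadata:
            return metadata[key]
    return None


def group_into_threads(traces: list[dict]) -> dict[str, list[str]]:
    """Map each thread identifier to the names of its traces, in input order."""
    threads: dict[str, list[str]] = defaultdict(list)
    for trace in traces:
        tid = thread_id_of(trace["metadata"])
        if tid is not None:
            threads[tid].append(trace["name"])
    return dict(threads)


traces = [
    {"name": "turn-1", "metadata": {"thread_id": "chat-42"}},
    {"name": "turn-2", "metadata": {"thread_id": "chat-42"}},
    {"name": "other", "metadata": {"session_id": "chat-7"}},
    {"name": "one-off", "metadata": {}},
]
threads = group_into_threads(traces)
```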
The LangSmith backend uses three storage systems optimized for different workload types. ClickHouse stores high-volume trace and feedback data, optimized for the analytical queries used in dashboards and evaluation reports. PostgreSQL handles transactional and operational data including user accounts, project configurations, and access control records. Redis provides in-memory caching and queue management for low-latency operations.
Beginning in May 2026, LangChain Inc. moved the core trace and feedback workload onto SmithDB, a purpose-built columnar database written in Rust on Apache DataFusion and the Vortex format with an object-storage backend. SmithDB was introduced to handle the very high write volume of agent traces and to speed up the analytical reads that power the trace viewer and dashboards, and the company reported up to 15x faster performance on core experiences after the migration. SmithDB rolled out first to United States cloud customers. [33][34]
LangSmith provides official SDKs for Python and TypeScript. The Python SDK is distributed as the langsmith package on PyPI. Instrumentation can be applied at three levels of invasiveness:
The lowest-friction path for LangChain and LangGraph applications requires only two environment variables: LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY (current SDK releases also accept the LANGSMITH_TRACING and LANGSMITH_API_KEY names). With these set, all LangChain and LangGraph operations are traced automatically through the LangChain callback system, with no code changes.
For non-LangChain Python code, the @traceable decorator wraps any function to create a run automatically, capturing function inputs and outputs as run inputs and outputs.
For the most granular control, developers can use the RunTree API to construct trace hierarchies programmatically, explicitly creating parent and child runs and managing their relationships.
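The decorator path can be sketched as follows. The `retrieve` and `answer` functions and the tiny corpus are made up for illustration, and the `ImportError` fallback is only there so the sketch runs where the `langsmith` package is not installed. Whether any data is actually sent depends on the tracing environment variables being set.

```python
try:
    from langsmith import traceable
except ImportError:
    def traceable(*args, **kwargs):
        """No-op stand-in usable as @traceable or @traceable(...)."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@traceable(run_type="tool")
def retrieve(query: str) -> list:
    """Toy retrieval step: substring match over a two-document corpus."""
    corpus = ["LangSmith traces are trees of runs", "Datasets hold versioned examples"]
    return [doc for doc in corpus if query.lower() in doc.lower()]


@traceable
def answer(query: str) -> str:
    """Top-level function; calling retrieve() inside it nests a child run."""
    docs = retrieve(query)
    return docs[0] if docs else "no match"


result = answer("datasets")
print(result)
```

Because `retrieve` is called from inside `answer`, the two decorated calls would appear as a parent run with one child run when tracing is enabled.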
The SDK attaches metadata to traces, including tags (string labels for filtering), key-value metadata fields (for environment information, versions, or user identifiers), and feedback scores (numeric or categorical ratings attached to individual runs).
LangSmith Fetch, a command-line tool released in December 2025, complements the SDKs by letting developers retrieve and inspect traces directly from a terminal session for debugging. [25]
In March 2025, LangChain Inc. announced end-to-end OpenTelemetry (OTel) support for LangSmith. The integration works in both directions: LangSmith can export trace data to external OpenTelemetry-compatible backends such as Datadog, Grafana, and Jaeger, and it can ingest OpenTelemetry traces from any application that emits standard OTel signals.
For export, the LangSmith SDK converts LangSmith trace data into OpenTelemetry format and transmits it through the OpenTelemetry SDK. For ingest, any application instrumented with OpenTelemetry can send spans to LangSmith's OTel endpoint, which maps OTel span attributes to LangSmith's data model and renders them in the platform's LLM-specific visualization.
OpenTelemetry mode is enabled by setting the environment variables LANGSMITH_OTEL_ENABLED=true and LANGSMITH_TRACING=true and installing the langsmith[otel] extra. The OTel path carries slightly higher overhead than LangSmith's native binary tracing format; for workloads where LangSmith is the exclusive observability destination, the native format remains recommended for optimal performance.
The OTel support makes LangSmith interoperable with standard enterprise observability stacks and eliminates the requirement to choose between LangSmith and an existing OpenTelemetry pipeline.
The LangSmith web interface renders traces as interactive trees. Each node in the tree corresponds to a run, and users can expand any node to inspect its inputs, outputs, latency, token counts, and associated metadata. For LLM runs, the interface displays the full prompt sent to the model, the model's response, model name, temperature, token usage, and cost estimate.
The platform supports filtering traces by project, time range, metadata values, tags, feedback scores, error status, latency thresholds, and token counts. Saved filters can be applied as monitoring rules that trigger alerts or automated actions when matching traces appear. As of February 2026, the trace table itself is configurable, allowing teams to choose which input and output components are shown in the preview for each run. [25]
LangSmith includes Polly, an AI assistant embedded in the trace interface that analyzes complex multi-step traces and explains what happened in natural language. Users can ask Polly questions about a specific trace, such as why a retrieval step returned irrelevant documents or why an agent took an unexpected tool call sequence.
The Insights Agent operates at the project level rather than the individual trace level. It analyzes patterns across many traces and generates summaries of failure modes, common input categories, output quality trends, and usage patterns. Teams use the Insights Agent to identify systemic issues that would be difficult to detect through manual trace inspection. Since February 2026, the Insights Agent can run on a recurring schedule (daily, weekly, or a custom cron expression) so that these summaries are produced automatically rather than on demand. [25]
LangSmith Engine, announced in public beta in May 2026, extends production monitoring from passive observation toward automated remediation. Rather than waiting for an engineer to read traces, spot a pattern, and write a fix, Engine continuously clusters production failures into named issues, prioritizes them (for example, flagging an issue that affected a given percentage of sessions in a week), diagnoses the root cause against the connected codebase, and drafts a pull request with a proposed prompt or code fix. It also proposes a custom online evaluator for the issue and adds the failing traces to an offline evaluation dataset, so the fix can be regression-tested. Engine works from a connected tracing project and an optional linked GitHub repository, and its initial analysis run takes up to roughly 20 minutes. [34][35]
LangSmith provides monitoring dashboards that aggregate trace data into time-series metrics: request volume, error rate, median and p95 latency, token consumption, and estimated cost. Dashboards can be filtered by project, tag, or metadata to isolate metrics for specific user segments, deployment versions, or geographic regions. As of February 2026, cost dashboards span the full agent workflow rather than LLM calls alone, and teams can submit custom cost metadata so that non-LLM costs (such as tool or API usage) are reflected in the totals. [25]
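To make the dashboard metrics concrete, here is a minimal sketch of how such aggregates could be computed from a batch of trace records. The field names (`latency_ms`, `llm_cost_usd`, `custom_cost_usd`) and the nearest-rank percentile method are assumptions for the example, not a description of how LangSmith computes its dashboards.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces):
    """Aggregate request volume, error rate, latency, and total cost."""
    latencies = [t["latency_ms"] for t in traces]
    total_cost = sum(t["llm_cost_usd"] + t.get("custom_cost_usd", 0.0) for t in traces)
    return {
        "requests": len(traces),
        "error_rate": sum(t["error"] for t in traces) / len(traces),
        "median_latency_ms": median(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "total_cost_usd": round(total_cost, 6),
    }


traces = [
    {"latency_ms": 120, "error": False, "llm_cost_usd": 0.002},
    {"latency_ms": 340, "error": False, "llm_cost_usd": 0.004, "custom_cost_usd": 0.001},
    {"latency_ms": 95, "error": False, "llm_cost_usd": 0.001},
    {"latency_ms": 2100, "error": True, "llm_cost_usd": 0.010, "custom_cost_usd": 0.005},
    {"latency_ms": 410, "error": False, "llm_cost_usd": 0.003},
]
summary = summarize(traces)
print(summary)
```

The `custom_cost_usd` field stands in for non-LLM costs such as tool or API usage, so the total covers the whole workflow rather than model calls alone.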
Automation rules allow teams to define conditions, such as traces with latency above a threshold or feedback scores below a minimum, that trigger actions: routing the trace to an annotation queue for human review, adding the trace to a dataset, or sending an alert notification.
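The condition-and-action shape of an automation rule can be sketched in a few lines. The `Rule` class, the thresholds, and the action labels are hypothetical; in LangSmith itself, rules are configured in the platform rather than written this way.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A condition over a trace plus the action to take when it matches."""

    name: str
    condition: Callable[[dict], bool]
    action: str  # "alert", "annotation_queue", or "dataset" in this sketch


RULES = [
    Rule("slow_trace", lambda t: t["latency_ms"] > 5000, "alert"),
    Rule("low_feedback", lambda t: t.get("feedback_score", 1.0) < 0.5, "annotation_queue"),
    Rule("errored", lambda t: t.get("error", False), "dataset"),
]


def actions_for(trace: dict, rules=RULES) -> list:
    """Return the actions triggered by a trace, in rule order."""
    return [rule.action for rule in rules if rule.condition(trace)]


slow_and_bad = {"latency_ms": 7200, "feedback_score": 0.2}
healthy = {"latency_ms": 300, "feedback_score": 0.9}
print(actions_for(slow_and_bad))
print(actions_for(healthy))
```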
Beyond LangChain and LangGraph, LangSmith supports tracing for applications built with OpenAI's SDK, Anthropic's SDK, CrewAI, Vercel AI SDK, Pydantic AI, LlamaIndex, and any framework that emits OpenTelemetry spans. The Prompt Hub integration is specific to LangSmith-native prompts, but the tracing infrastructure is framework-agnostic.
LangSmith's evaluation framework distinguishes between offline evaluation, which runs before deployment using curated test datasets, and online evaluation, which monitors production traffic in real time. Both modes use the same evaluator types: human review, code-based evaluators, LLM-as-a-judge, and pairwise comparison.
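A dataset in LangSmith is a versioned collection of input-output examples used as the benchmark for evaluation experiments. Examples can be added from production traces, for instance through automation rules, failures flagged by online evaluation, or annotated traces routed from annotation queues.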
Datasets are versioned automatically: every edit or deletion creates a new version with a clean audit trail. Named version tags allow teams to mark versions that correspond to production releases or evaluation milestones. Dataset examples persist indefinitely, even after the underlying traces that generated them are deleted, making datasets the durable record of evaluation data.
Each example in a dataset contains an input field (what was sent to the application), optionally an expected output (the ground truth answer), and optional metadata fields. For structured outputs, datasets can store complex nested objects.
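The example structure and the versioning behavior can be modeled with a short sketch. The `Example` and `Dataset` classes are illustrative assumptions: each edit appends a new immutable snapshot, and a named tag pins one snapshot. This mirrors the behavior described above but is not LangSmith's storage model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Example:
    """One dataset row: inputs, optional expected outputs, optional metadata."""

    inputs: dict[str, Any]
    outputs: Optional[dict[str, Any]] = None
    metadata: dict[str, Any] = field(default_factory=dict)


class Dataset:
    """Every edit creates a new version; tags point at specific versions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.versions: list[tuple[Example, ...]] = [()]  # version 0 is empty
        self.tags: dict[str, int] = {}

    def add(self, example: Example) -> int:
        """Append an example and return the new version number."""
        self.versions.append(self.versions[-1] + (example,))
        return len(self.versions) - 1

    def tag(self, label: str, version: int) -> None:
        self.tags[label] = version

    def examples(self, tag: Optional[str] = None) -> tuple[Example, ...]:
        """Return the latest version, or the version a tag points at."""
        index = self.tags[tag] if tag is not None else len(self.versions) - 1
        return self.versions[index]


ds = Dataset("qa-regression")
v1 = ds.add(Example({"question": "Capital of France?"}, {"answer": "Paris"}))
ds.tag("release-1", v1)
ds.add(Example({"question": "Summarize this ticket", "ticket": {"id": 7, "tags": ["billing"]}}))
```

Reading `ds.examples("release-1")` still returns the single-example snapshot after the second example is added, which is the property that makes tagged versions usable as stable benchmarks.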
An experiment in LangSmith is a run of an application or function over a dataset, producing one output per example. Experiments are the primary unit of offline evaluation. The evaluate() function in the LangSmith SDK accepts an application function and a dataset name and handles execution, including configurable concurrency, repetitions for variability sampling, and result caching to avoid re-running identical inputs.
Each experiment produces a run tree, where the root run represents one dataset example and child runs represent the nested operations that produced the output. Evaluator scores are attached to runs as feedback, enabling granular debugging of which step in a multi-step application contributed to a score.
Experiment results are visualized as tables and charts in the platform UI. Teams can compare two or more experiments side by side to assess whether a prompt change, model upgrade, or architectural change improves or regresses performance across the dataset. The comparison view highlights examples where two experiments produce meaningfully different scores, surfacing the cases most worth investigating. As of February 2026, any experiment can be designated as a baseline, and subsequent experiments are then compared against that baseline automatically. [25]
Code-based evaluators are Python functions that receive a run's input, output, and optionally the expected output from the dataset and return a numeric score or categorical label. They are deterministic, cheap to run, and appropriate for objective criteria such as exact match, format validation, JSON schema conformance, or regex matching.
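Two such evaluators might look like the sketch below. The argument names (`outputs`, `reference_outputs`), the `answer` field, and the `{"key": ..., "score": ...}` return shape are assumptions about the calling convention; check the SDK reference for the exact signature expected by the installed version.

```python
import json


def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when the answer equals the expected answer, else 0."""
    matched = outputs["answer"] == reference_outputs["answer"]
    return {"key": "exact_match", "score": int(matched)}


def valid_json(outputs: dict) -> dict:
    """Score 1 when the answer parses as JSON, else 0."""
    try:
        json.loads(outputs["answer"])
        score = 1
    except (json.JSONDecodeError, TypeError):
        score = 0
    return {"key": "valid_json", "score": score}


print(exact_match({"answer": "Paris"}, {"answer": "Paris"}))
print(valid_json({"answer": '{"city": "Paris"}'}))
print(valid_json({"answer": "city: Paris"}))
```

Both functions are deterministic and make no network calls, which is what keeps this evaluator type cheap to run on every example.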
LLM-as-a-judge evaluators send the run's input and output to a second LLM with an evaluation prompt asking it to score a quality dimension such as factual accuracy, tone, helpfulness, or goal completion. LangSmith provides built-in judge prompts for common criteria and a prompt editor for customizing evaluation rubrics. Because LLM judges are themselves probabilistic, LangSmith supports running each judge evaluation multiple times and averaging the result to reduce variance.
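The variance-reduction idea can be shown without calling a model. In this sketch, `noisy_judge` is a stand-in that returns a fixed underlying score plus seeded random noise; a real judge would be an LLM call, and the scores and noise range here are invented for the example.

```python
import random
from statistics import mean


def noisy_judge(question: str, answer: str, rng: random.Random) -> float:
    """Stand-in for an LLM judge: an underlying score plus random noise."""
    base = 0.8 if "paris" in answer.lower() else 0.2
    return min(1.0, max(0.0, base + rng.uniform(-0.15, 0.15)))


def averaged_score(question: str, answer: str, repetitions: int, seed: int = 0) -> float:
    """Run the judge several times and average the results."""
    rng = random.Random(seed)
    return mean(noisy_judge(question, answer, rng) for _ in range(repetitions))


single = averaged_score("Capital of France?", "Paris", repetitions=1)
averaged = averaged_score("Capital of France?", "Paris", repetitions=50)
print(f"single run: {single:.3f}, mean of 50: {averaged:.3f}")
```

A single judgment can land anywhere in the noise band, while the mean over many repetitions clusters near the underlying score, at the cost of proportionally more judge calls.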
Pairwise evaluators present two outputs side by side to a judge (either a human or an LLM) and ask it to choose which is better. Pairwise evaluation is particularly useful when absolute quality scores are difficult to define but relative preference is clear, and it is commonly used for preference modeling and reinforcement learning from human feedback (RLHF) workflows.
Human evaluators can score any run directly in the LangSmith UI or through annotation queues.
Online evaluation applies evaluators to production traces as they are generated. Teams configure filters and sampling rates to control which fraction of production traffic is evaluated, managing the cost of running LLM-based judges at scale. Online evaluation results appear on monitoring dashboards alongside the raw production metrics, providing a continuous quality signal rather than a periodic benchmark.
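One way to implement filter-plus-sampling selection is sketched below, under the assumption that each trace has a stable string identifier. Hashing the identifier is a design choice of this example, made so that the same trace always gets the same decision; it is not a description of how LangSmith samples.

```python
import hashlib


def should_evaluate(trace_id: str, sampling_rate: float) -> bool:
    """Deterministically keep roughly `sampling_rate` of all trace ids."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sampling_rate


def select_for_evaluation(traces, sampling_rate, matches):
    """Apply the filter first, then sample the traces that pass it."""
    return [t for t in traces if matches(t) and should_evaluate(t["id"], sampling_rate)]


traces = [{"id": f"trace-{i}", "project": "prod" if i % 2 else "dev"} for i in range(1000)]


def is_prod(trace):
    return trace["project"] == "prod"


sampled = select_for_evaluation(traces, 0.1, is_prod)
print(f"{len(sampled)} of {sum(map(is_prod, traces))} production traces selected")
```

Lowering the sampling rate reduces the number of LLM-judge calls in direct proportion, which is the cost lever the paragraph above refers to.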
Failed traces identified by online evaluation can be automatically added to datasets, creating a feedback loop in which production failures become regression test cases for the next development iteration.
LangSmith evaluation experiments can be integrated into continuous integration pipelines using the LangSmith SDK's assertion utilities. Teams define score thresholds and run evaluate() as part of a test suite; the suite fails if any evaluator score falls below the threshold. This enables automated regression testing for LLM quality, analogous to unit tests in traditional software engineering.
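The threshold-gating pattern can be sketched with plain Python and a pytest-style test function. Everything here is a stand-in: `app` is a toy application, `run_experiment` replaces the SDK's `evaluate()` call so the example runs offline, and the dataset, evaluators, and thresholds are invented for illustration.

```python
from statistics import mean

DATASET = [
    {"inputs": "hello", "expected": "HELLO"},
    {"inputs": "trace", "expected": "TRACE"},
    {"inputs": "run", "expected": "RUN"},
]

EVALUATORS = {
    "exact_match": lambda output, expected: float(output == expected),
    "non_empty": lambda output, expected: float(bool(output)),
}

THRESHOLDS = {"exact_match": 0.9, "non_empty": 1.0}


def app(text: str) -> str:
    """Toy application under test."""
    return text.upper()


def run_experiment(app, dataset, evaluators) -> dict:
    """Run the app over the dataset and return the mean score per evaluator."""
    pairs = [(app(ex["inputs"]), ex["expected"]) for ex in dataset]
    return {name: mean(fn(out, exp) for out, exp in pairs) for name, fn in evaluators.items()}


def failing_evaluators(scores: dict, thresholds: dict) -> list:
    """Names of evaluators whose mean score is below its threshold."""
    return [name for name, minimum in thresholds.items() if scores[name] < minimum]


def test_quality_gate():
    """Fails the test suite when any evaluator drops below its threshold."""
    scores = run_experiment(app, DATASET, EVALUATORS)
    failures = failing_evaluators(scores, THRESHOLDS)
    assert not failures, f"scores below threshold: {failures}"
```

Run under pytest in a CI job, a regression in the application or prompt that drops a mean score below its threshold turns the build red, just as a failing unit test would.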
Annotation queues provide a structured interface for routing traces to human reviewers and collecting structured feedback at scale. They address the workflow problem of organizing human evaluation: without queues, reviewers must manually search for traces to review, there is no mechanism for assigning reviewers to specific tasks, and feedback collection is inconsistent.
LangSmith supports two annotation queue styles. Single-run queues present one trace at a time. The reviewer sees the full trace tree and is prompted to submit ratings on the rubric items configured for that queue. Rubric items can be numeric sliders (for continuous quality scores), categorical selections (for classification labels), or free-text fields (for corrections or comments).
Pairwise annotation queues (PAQs) present two traces side by side. The reviewer indicates which output is better or marks the two as equivalent, optionally providing a text explanation. Pairwise queues are particularly effective for preference modeling tasks where relative quality is easier to judge than absolute quality, and they map directly onto the data format required for RLHF training. LangChain Inc. shipped pairwise annotation queues in December 2025 as a general extension of the annotation workflow. [25]
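The mapping from pairwise judgments to preference-training data can be illustrated as follows. The `PairwiseJudgment` class and the `prompt`/`chosen`/`rejected` record layout are assumptions based on a common convention for preference datasets; they do not describe LangSmith's export format.

```python
from dataclasses import dataclass


@dataclass
class PairwiseJudgment:
    """One reviewer decision over two candidate outputs."""

    prompt: str
    output_a: str
    output_b: str
    preference: str  # "a", "b", or "tie"


def to_preference_pairs(judgments) -> list:
    """Convert judgments to chosen/rejected records, skipping ties."""
    pairs = []
    for j in judgments:
        if j.preference == "tie":
            continue  # a tie carries no chosen/rejected signal
        if j.preference == "a":
            chosen, rejected = j.output_a, j.output_b
        else:
            chosen, rejected = j.output_b, j.output_a
        pairs.append({"prompt": j.prompt, "chosen": chosen, "rejected": rejected})
    return pairs


judgments = [
    PairwiseJudgment("Summarize the ticket", "Short, accurate", "Long, off-topic", "a"),
    PairwiseJudgment("Draft a reply", "Curt reply", "Polite reply", "b"),
    PairwiseJudgment("Translate to French", "Bonjour", "Salut", "tie"),
]
pairs = to_preference_pairs(judgments)
```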
Administrators configure queues with a name, description, and rubric definition. Traces can be added to queues manually by individual developers, automatically through automation rules triggered by monitoring conditions, or through bulk operations on filtered trace queries. Queues display reviewer progress, showing how many traces have been reviewed and how many remain.
Feedback collected through annotation queues is stored as structured run feedback and is available for analysis, export, and incorporation into datasets. The ability to route annotated traces directly to datasets closes the loop between human evaluation and automated testing.
The Prompt Hub is LangSmith's centralized repository for storing, versioning, and sharing prompt templates. Each prompt is identified by a name and owner, and it stores the template text along with associated model configuration defaults such as model name, temperature, and maximum tokens.
Every time a prompt is pushed to the Hub, LangSmith generates a unique commit hash representing that exact version. Teams can reference a prompt by name (always resolving to the latest commit), by a specific commit hash (for reproducibility), or by a named tag (for environment-based release management). Tags such as staging and production can be attached to specific commits and moved to newer commits as part of a deployment workflow, enabling teams to control which prompt version runs in each environment without changing application code.
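The three ways of referencing a prompt can be modeled with a small in-memory sketch. The `PromptRepo` class is hypothetical, and deriving the commit hash from the parent hash plus the template text is an assumption of this example; the text above only states that each push receives a unique hash.

```python
import hashlib
from typing import Optional


class PromptRepo:
    """In-memory model of one prompt's commits and movable tags."""

    def __init__(self) -> None:
        self.commits = {}  # commit hash -> template text
        self.history = []  # commit hashes, oldest first
        self.tags = {}     # tag name -> commit hash

    def push(self, template: str) -> str:
        """Store a new version and return its commit hash."""
        parent = self.history[-1] if self.history else ""
        commit = hashlib.sha256((parent + template).encode("utf-8")).hexdigest()[:8]
        self.commits[commit] = template
        self.history.append(commit)
        return commit

    def tag(self, name: str, commit: str) -> None:
        """Attach a tag to a commit, or move an existing tag."""
        self.tags[name] = commit

    def pull(self, ref: Optional[str] = None) -> str:
        """Resolve no ref to the latest commit, else a tag or a commit hash."""
        if ref is None:
            commit = self.history[-1]
        else:
            commit = self.tags.get(ref, ref)
        return self.commits[commit]


repo = PromptRepo()
v1 = repo.push("Answer briefly: {question}")
repo.tag("production", v1)
v2 = repo.push("Answer in detail: {question}")

latest = repo.pull()
in_production = repo.pull("production")
pinned = repo.pull(v1)
```

Moving the `production` tag to `v2` changes what `repo.pull("production")` returns without any change to the calling code, which is the release-management pattern described above.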
LangSmith traces record which prompt commit was used for each execution, making it straightforward to correlate output quality changes with specific prompt modifications during debugging.
The Playground is an interactive testing environment integrated into LangSmith. Developers can load any prompt from the Hub, modify it, select a model provider and configuration, and run it against custom inputs or examples from a dataset. The Playground displays token counts and cost estimates for each run and allows side-by-side comparison of outputs from different prompts or models.
Prompt Canvas, a feature introduced in 2024, allows developers to request AI-assisted prompt rewrites with specific instructions (improve clarity, adjust reading level, change tone) and then preview a diff of the proposed changes before saving a new version to the Hub.
In 2025, LangSmith expanded its scope from pure observability to include deployment capabilities for LangGraph agents. The Deployment module enables human-in-the-loop workflows where agents pause and request human approval before executing sensitive or irreversible actions. It also supports background agent execution, where agents run asynchronously from user interactions, and multi-agent coordination patterns.
A centralized agent registry within LangSmith tracks every deployed agent version, supports rollback to earlier versions, and works with the A2A (Agent2Agent), MCP (Model Context Protocol), and Agent Protocol standards.
Fleet, a no-code agent system, allows non-technical users to build agents through natural language descriptions rather than code. Fleet agents learn from feedback collected through annotation queues and can request permissions from administrators when encountering actions outside their configured scope. The product launched in private preview in October 2025 as Agent Builder, opened to public beta on December 2, 2025, and reached general availability across all plans in mid-January 2026. On March 19, 2026, it was renamed LangSmith Fleet, and the release added agent identity, sharing, and organization-wide permissions. At Interrupt 2026, Fleet gained five prebuilt agents (a coding agent based on Open SWE, a go-to-market agent, an X content manager, an executive assistant, and a competitive researcher), public-beta Sandbox integration for data analysis and coding tasks, and free model usage backed by Fireworks inference on the Developer and Plus plans. [28][29][31][26][33]
The Agent Studio provides visual debugging with breakpoints, enabling developers to pause an agent mid-execution, inspect its state, modify variables, and resume, analogous to a traditional code debugger but for LLM agent workflows.
At Interrupt 2026, LangChain Inc. added a hosted runtime for deep agents called Managed Deep Agents, exposing a /v1/deepagents API and supporting durable execution, persistent context, tool access, and sandboxed code execution for long-running tasks. The company also brought LangSmith Sandboxes to general availability. Sandboxes are hardware-virtualized microVM environments for safe code execution, with snapshots, copy-on-write forking, reusable blueprints, automatic pause when idle, command-line tooling, and an authentication proxy. Context Hub, introduced at the same event, centralizes management of agent behavioral files such as AGENTS.md, skills, and policies, with versioning, environment tags, and collaborative comments. LLM Gateway, in private beta, adds a runtime governance layer that enforces spend limits and redacts sensitive data before requests leave the customer's environment. [33]
LangSmith offers three main pricing tiers as of 2025, with an additional startup program.
| Plan | Price | Seats | Included traces | Workspaces | Trace retention | Support |
|---|---|---|---|---|---|---|
| Developer | Free + pay-as-you-go | 1 | 5,000/month | 1 | 14 days | Community |
| Plus | $39/seat/month + pay-as-you-go | Unlimited | 10,000/month per org | Up to 3 | 14 days (base); 400 days (extended) | |
| Enterprise | Custom | Custom | Custom | Custom | Custom | SLA + dedicated engineers |
Overage traces on the Plus plan cost $2.50 per 1,000 traces at base retention (14 days) or $5.00 per 1,000 traces at extended retention (400 days). Developer plan overages are priced at $0.50 per 1,000 traces.
A notable pricing consideration: traces that receive feedback, annotations, or corrections are automatically upgraded to extended retention, incurring the higher $5.00 per 1,000 rate. For agent workflows with heavy annotation, this can substantially increase trace costs relative to a baseline estimate.
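The effect of annotation on trace cost can be worked through with the Plus rates quoted above. The following is a rough estimate, not LangSmith's billing logic. It assumes that the 10,000 included traces offset base-retention traces first, that every annotated trace is billed once at the extended rate, and that charges are proportional rather than rounded up to blocks of 1,000. The function name and parameters are hypothetical.

```python
def plus_trace_cost(
    total_traces: int,
    annotated_fraction: float,
    included: int = 10_000,          # included traces per month (from the table)
    base_per_1k: float = 2.50,       # base retention, 14 days
    extended_per_1k: float = 5.00,   # extended retention, 400 days
) -> float:
    """Estimate monthly Plus-plan trace charges in USD under the stated assumptions."""
    annotated = round(total_traces * annotated_fraction)
    base = total_traces - annotated
    # Assumption: the included allowance is applied to base-retention traces.
    billable_base = max(0, base - included)
    return billable_base / 1000 * base_per_1k + annotated / 1000 * extended_per_1k


no_annotation = plus_trace_cost(100_000, 0.0)    # 90,000 billable base traces
some_annotation = plus_trace_cost(100_000, 0.2)  # 70,000 base + 20,000 extended
```

Under these assumptions, 100,000 monthly traces cost $225 with no annotation and $275 when a fifth of them receive feedback. Seat fees are not included.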
The Plus plan bills deployment compute at $0.0007 per minute for development deployments and $0.0036 per minute for production deployments, and Fleet agents at $0.005 per run beyond the 500 runs included each month. As announced at Interrupt 2026, the Developer and Plus plans also include free model usage backed by Fireworks inference for agents built with Fleet, reducing the cost of running no-code agents for those tiers. [33]
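As a worked example of these per-minute and per-run rates, the sketch below estimates a month of usage. It assumes a deployment is metered for every minute of a 30-day month, which the pricing description above does not confirm. The helper names are hypothetical.

```python
DEV_PER_MIN = 0.0007        # development deployment, USD per minute
PROD_PER_MIN = 0.0036       # production deployment, USD per minute
FLEET_PER_RUN = 0.005       # USD per Fleet run beyond the included allowance
FLEET_INCLUDED_RUNS = 500   # included Fleet runs per month

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def deployment_cost(minutes: int, production: bool) -> float:
    """Compute charge for one deployment, assuming every minute is metered."""
    rate = PROD_PER_MIN if production else DEV_PER_MIN
    return minutes * rate


def fleet_overage(runs: int) -> float:
    """Charge for Fleet runs beyond the included monthly allowance."""
    return max(0, runs - FLEET_INCLUDED_RUNS) * FLEET_PER_RUN


prod_month = deployment_cost(MINUTES_PER_30_DAYS, production=True)   # about 155.52
dev_month = deployment_cost(MINUTES_PER_30_DAYS, production=False)   # about 30.24
fleet_month = fleet_overage(2_500)                                   # 2,000 billable runs
```

Under the always-on assumption, a production deployment costs roughly five times as much as a development deployment for the same uptime, which follows directly from the ratio of the two rates.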
The Enterprise tier includes custom single sign-on (SSO) and SAML integration, System for Cross-domain Identity Management (SCIM) for automated user provisioning, full role-based and attribute-based access control (RBAC/ABAC), data encryption at rest and in transit, audit logs, team training, and architectural guidance. Enterprise customers can also negotiate hybrid deployment or self-hosted deployment options.
The Startup Program offers discounted rates and credits for venture-capital-backed early-stage companies building agentic applications.
LangSmith's deepest integration is with LangChain and LangGraph, which are maintained by the same company.
For LangChain applications, enabling tracing requires only setting the LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY environment variables. Every chain, agent, LLM call, retrieval step, tool invocation, and output parser in a LangChain application generates a run automatically. No callback configuration or code modification is required.
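A minimal sketch of that configuration in Python follows. The two variable names come from the text above. The value `"true"`, the helper function, and the placeholder key are assumptions for illustration. Newer SDK documentation may use `LANGSMITH_`-prefixed names instead, so the names should be checked against the installed version.

```python
import os


def enable_langsmith_tracing(api_key: str) -> None:
    """Set the environment variables before any LangChain code runs.

    Hypothetical helper: the same effect is achieved by exporting the
    variables in the shell or a .env file.
    """
    os.environ["LANGCHAIN_TRACING_V2"] = "true"   # assumed value that turns tracing on
    os.environ["LANGCHAIN_API_KEY"] = api_key     # key issued in the LangSmith UI


# Placeholder key for illustration only; never hard-code a real key.
enable_langsmith_tracing("placeholder-api-key")
```

In practice the key would come from a secrets manager or the deployment environment rather than from source code.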
For LangGraph applications, tracing integrates with LangGraph's state management model. Each graph node execution becomes a child run within the trace, capturing the node name, the graph state before and after execution, and any tool calls the node made. LangGraph's human-in-the-loop checkpoints appear in traces as pause events, giving developers full visibility into where an agent paused for human input and what state it resumed from.
LangSmith datasets drawn from LangGraph traces can include full agent conversation threads, preserving the multi-turn structure required to evaluate dialog quality and agent goal completion over extended interactions.
Prompts stored in the LangSmith Prompt Hub are directly loadable in LangChain and LangGraph applications using the hub.pull() function, connecting the prompt management and tracing infrastructure within a single workflow.
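The sketch below builds a prompt identifier of the form `owner/name:version`, where the version part is a commit hash or a tag. The helper `prompt_ref` and the names `acme` and `summarizer` are hypothetical. The `hub.pull()` call appears only in a comment because it needs the `langchain` package, network access, and an API key, and the import path may differ between package versions.

```python
from __future__ import annotations


def prompt_ref(name: str, owner: str | None = None, version: str | None = None) -> str:
    """Build a Hub identifier; `version` may be a commit hash or a tag name."""
    ref = f"{owner}/{name}" if owner else name
    return f"{ref}:{version}" if version else ref


latest_ref = prompt_ref("summarizer")                                   # latest commit
tagged_ref = prompt_ref("summarizer", owner="acme", version="production")

# Assumed usage with the loader mentioned above (not executed here):
#   from langchain import hub
#   prompt = hub.pull(tagged_ref)
```

Pinning to a tag such as `production` keeps the application code stable while the tag is moved between commits in the Hub.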
The LLM observability market includes several competing platforms with different positioning and trade-offs.
| Feature | LangSmith | Langfuse | Helicone | Arize Phoenix |
|---|---|---|---|---|
| Open source | No | Yes (MIT) | Partial | Yes (Apache 2.0) |
| Self-hosting | Enterprise add-on | Free | No | Yes, free |
| Primary deployment model | Cloud SaaS | Cloud or self-hosted | Cloud SaaS (proxy) | Cloud or local |
| LangChain integration | Native, zero-config | SDK-based | SDK or proxy | SDK or OTel |
| OpenTelemetry support | Yes (March 2025) | Yes | Limited | Yes (primary) |
| Evaluation framework | Comprehensive | Growing | Basic | Comprehensive |
| Annotation queues | Yes | Yes | No | Limited |
| Prompt management | Yes (Hub + versioning) | Yes | No | Yes (April 2025) |
| Per-trace pricing | Yes | Yes | No (flat/seat) | Free (OSS) |
| Pairwise evaluation | Yes | No | No | Yes |
Langfuse is an open-source LLM observability platform whose core is released under the MIT license. Its primary differentiator is self-hosting: the complete Langfuse feature set can be deployed on a team's own infrastructure at no software cost, with feature parity between the self-hosted and cloud versions. Langfuse is well-suited for teams with strict data residency requirements, budget constraints, or governance policies that prohibit sending application data to third-party SaaS platforms. Langfuse's evaluation and prompt management capabilities have expanded substantially since its initial release, making it competitive with LangSmith on most core features. The main trade-offs relative to LangSmith are that Langfuse requires infrastructure management when self-hosted and has a less mature pairwise evaluation workflow.
Helicone operates as an HTTP proxy rather than an SDK instrumentation layer. Applications route their LLM API calls through Helicone's proxy, which logs requests and responses without requiring application code changes. This architecture makes Helicone fast to adopt, with typical setup times of under fifteen minutes, but it imposes a limitation: as a proxy, Helicone sees only the HTTP boundary of each LLM call and has no visibility into multi-step agent traces that span many LLM calls. It is best suited for applications where the primary observability need is tracking cost, latency, and volume across direct LLM API calls rather than debugging complex agent workflows. Helicone offers flat-rate pricing starting at $25 per month, which makes cost predictable for high-volume applications.
Arize Phoenix is an open-source observability platform built natively on OpenTelemetry, released under the Apache 2.0 license by Arize AI. Phoenix can run entirely locally or be deployed on team-managed infrastructure with no per-trace charges. It integrates with any framework that emits OpenTelemetry spans, making it the most framework-agnostic option in the category. Phoenix's evaluation framework supports complex multi-step agent evaluation and includes deep integration with the broader Arize AI ecosystem for production machine learning monitoring. In April 2025, Phoenix added a prompt management module with versioning and template reuse. The primary trade-off is operational: Phoenix requires infrastructure engineering to run at production scale (typically PostgreSQL and Kubernetes), and its UI is less polished than LangSmith's hosted interface. Arize AX, the enterprise SaaS product from Arize AI, is proprietary and separate from the open-source Phoenix project.
As LangSmith moved deeper into autonomous debugging and hosted agent runtimes in 2026, some industry analysts noted that its tight coupling to the LangChain ecosystem could be a drawback for multi-model enterprises that prefer a vendor-neutral observability layer spanning many frameworks and model providers. [35]
LangSmith is used across a broad range of industries and application types.
In software development tooling, companies such as Replit use LangSmith to trace and evaluate AI coding assistants, monitoring which retrieval contexts led to correct versus incorrect code suggestions and using annotation queues to collect developer feedback on code quality.
In financial services, organizations including J.P. Morgan and Moody's apply LangSmith's evaluation framework to validate that LLM outputs in regulatory and analytical contexts meet accuracy and compliance standards, maintaining audit trails of which prompt versions were active during specific time periods.
In cybersecurity, Elastic used LangGraph to orchestrate a multi-agent threat detection system and LangSmith to trace each step in the threat analysis workflow, enabling engineers to debug edge cases in detection logic and validate that agent behavior remained consistent after updates.
Rakuten built an enterprise-wide generative AI platform using LangGraph and LangSmith that enables employees across more than 70 business units to create and deploy AI agents. LangSmith provides the observability and evaluation layer that Rakuten's platform team uses to ensure quality across all deployed agents.
Customer-facing enterprise applications at companies such as Klarna, Vodafone, and Home Depot use LangSmith to monitor LLM-powered support and automation workflows, track quality metrics over time, and route edge cases to human review queues.
At Interrupt 2026, LangChain Inc. featured production case studies from teams at companies including Clay, Rippling, Workday, Toyota, Lyft, LinkedIn, Coinbase, Honeywell, monday.com, LATAM Airlines, Etsy, Abridge, and Bridgewater Associates, reflecting continued enterprise adoption of LangSmith and the broader LangChain agent stack. [34][36]
LangSmith has several limitations that prospective users should consider.
Vendor lock-in: LangSmith is closed-source software. Users cannot inspect or modify the underlying platform code, and the product roadmap is determined entirely by LangChain Inc. While the OpenTelemetry integration reduces instrumentation lock-in, the evaluation, annotation, and dataset management workflows are proprietary, and migrating accumulated datasets and experiment history to another platform requires export and reimport. The expansion into autonomous debugging (LangSmith Engine), hosted agent runtimes (Managed Deep Agents), and a proprietary observability database (SmithDB) in 2026 deepened the platform's footprint, which some analysts argued could increase coupling to the LangChain ecosystem for teams that adopt these features. [35]
Self-hosting restrictions: Self-hosted deployment is available only as an Enterprise add-on and requires Kubernetes for production deployments or Docker for development environments. The January 2026 Self-Hosted v0.13 release narrowed the feature gap with the cloud product, including bringing Insights to self-hosted installations, but the self-hosted option remains tied to the Enterprise tier. Teams that want data residency control without paying Enterprise rates must use a competing open-source platform such as Langfuse or Arize Phoenix. [25]
Pricing at scale: Per-trace pricing can become expensive at high trace volumes. The automatic upgrade of annotated traces to extended retention at the $5.00 per 1,000 rate means that teams running active annotation workflows may see trace costs significantly higher than a naive estimate based on the base trace volume alone. Enterprise customers with predictable high-volume workloads typically negotiate flat-rate contracts to avoid this unpredictability.
Data governance: LangSmith's hosted SaaS model means that detailed application data, including full prompts, model responses, and user inputs, is transmitted to LangChain Inc.'s cloud infrastructure. For applications handling sensitive personal data, health information, or legally privileged content, this raises data governance considerations that may require the Enterprise self-hosted option or an alternative open-source platform. The LLM Gateway introduced in private beta in 2026 adds runtime PII and secret redaction before requests leave a customer's environment, which can mitigate but does not eliminate these considerations.
Framework-agnostic experience gap: While LangSmith officially supports non-LangChain frameworks and provides OpenTelemetry ingest, the user experience is most fully realized for LangChain and LangGraph applications. Teams using other frameworks such as LlamaIndex, CrewAI, or custom orchestration code must invest more in integration configuration to achieve the same level of automatic instrumentation that LangChain applications receive out of the box.
Trace retention defaults: The default 14-day retention on base traces is short relative to the timescales of LLM application debugging and compliance requirements. Teams that need longer trace histories must pay for extended retention or proactively export traces to external storage.