Vellum is an enterprise AI product development platform that helps engineering teams build, test, and deploy applications powered by large language models (LLMs). The platform provides an integrated suite covering prompt management, evaluation, workflow orchestration, deployment, and production monitoring. Vellum is operated by Vocify Inc. and headquartered in New York City. The company was founded in 2023 and graduated from Y Combinator's Winter 2023 batch. By July 2025 the company had raised $24.5 million in total funding and counted more than 150 companies among its customers, including Redfin, Drata, Swisscom, and Headspace.
Vellum occupies a category often called LLMOps (large language model operations), a space that also includes LangSmith, Langfuse, and Humanloop. Its positioning centers on test-driven development for AI: the principle that production AI systems require the same systematic evaluation and quality gates that software engineering has long applied to traditional code. Where many LLMOps tools address one part of the lifecycle (observability, or evaluation, or orchestration), Vellum attempts to cover the full arc from initial prompt prototyping through production monitoring in a single platform.
Vellum was founded by Akash Sharma, Sidd Seethepalli, and Noa Flaherty. All three had worked together at Dover, a recruiting automation startup that went through Y Combinator's Summer 2019 batch, where, starting in early 2020, they spent more than two years building production LLM applications for recruiting emails and job descriptions. That hands-on experience with generative AI in production, before adequate tooling existed to manage it properly, shaped the problem they set out to solve.
Noa Flaherty and Sidd Seethepalli are MIT-trained engineers. Flaherty previously worked on DataRobot's MLOps team, and Seethepalli worked on Quora's ML Platform team. Akash Sharma studied at UC Berkeley and spent five years at McKinsey's Silicon Valley office before joining Dover. Sharma serves as CEO; Seethepalli is co-founder and CTO.
The founding insight was that traditional MLOps had developed robust tooling for managing machine learning models in production, but the LLM era created a different class of problem. Prompts were effectively the "code" of LLM applications, yet teams managed them in browser tabs and spreadsheets. There was no version control, no systematic testing, and no way to run a prompt change through a battery of test cases before deploying it to users. Flaherty and Seethepalli recognized the gap from their MLOps backgrounds; Sharma recognized the business opportunity from his strategy consulting experience at McKinsey.
The three founders also recognized that the mainstream MLOps tooling they had used at DataRobot and Quora had no equivalent for generative AI. Classical ML pipelines involved training runs, model artifacts, and well-defined feature inputs. LLM applications involved natural language prompts, subjective outputs, and provider APIs that changed behavior with each new model version. The gap between building an LLM prototype and keeping it reliable in production was the problem Vellum was designed to close.
Vellum was accepted into Y Combinator's Winter 2023 batch and announced publicly on February 2, 2023. The initial product launched with three core components. The Playground let teams experiment with prompts across multiple test cases and model parameters in a structured environment rather than an ad hoc browser session. The Manage layer acted as an API proxy that captured all LLM requests and enabled version control, audit logging, and rollback. The Optimize module handled automated fine-tuning using production data; one early customer achieved a 94% reduction in LLM costs through the fine-tuning pipeline.
Within five months of launch, Vellum had 50 paying customers. The founders reported 25-30% monthly revenue growth in the period leading up to their seed round announcement.
In July 2023, Vellum raised a $5 million seed round. Investors included Y Combinator, Rebel Fund, Pioneer Fund, and Eastlink Capital. Angel investors included Arash Ferdowsi (co-founder of Dropbox), Dharmesh Shah (co-founder of HubSpot), and Divya Bhat, a former Y Combinator group partner and two-time CEO. TechCrunch covered the round and described Vellum as a "prompt engineering startup," noting the company's 40 paying customers at announcement time and monthly pricing in the $300-$400 range.
The seed round gave the team runway to expand the product from its original three-part structure into a more unified platform. Over 2023 and 2024, Vellum added Test Suites, semantic search for retrieval-augmented generation, and a visual workflow canvas. Each of these features was driven in part by customer requests from teams that had deployed prompts to production and then needed more infrastructure to manage the resulting operational complexity.
In July 2025, Vellum raised $20 million in a Series A round led by Leaders Fund, with participation from Socii Capital, Y Combinator, Rebel Fund, Pioneer Fund, and Eastlink Capital. The round brought total funding to $24.5 million. Alongside the funding announcement, Vellum declared its platform generally available.
Leaders Fund's investment thesis described the market gap as the absence of a "modern software engineering equivalent for GenAI" -- the kind of CI/CD pipelines, unified tooling, and safeguards that software teams take for granted. The fund praised the founders for executing "with velocity, clarity, and precision" and cited Vellum's inbound-driven go-to-market approach as evidence of enterprise trust.
Y Combinator publicly noted at the time of the Series A that Vellum was "defining the standard for enterprise AI development" and that companies like Swisscom, Drata, Redfin, and Headspace were using the platform to ship AI products faster.
Speaking at the announcement, CEO Akash Sharma said: "We provide infrastructure bringing rigor, predictability and repeatability to AI development, so teams can build with confidence."
Vellum's platform is organized around the lifecycle of an AI feature: design, evaluate, deploy, and monitor. Engineers can build through a visual canvas, through a code-first SDK, or by moving between both, with the platform keeping the two representations synchronized so that a change made in either environment is reflected in the other. The company positions this architecture as enabling cross-functional collaboration: product managers and domain experts can iterate in the visual builder while engineers work in code, without the two tracks diverging.
Vellum's prompt engineering environment is a shared workspace where teams can test prompts against multiple LLM providers, compare outputs side by side, and iterate on variants across a bank of test cases. Unlike an ad hoc playground, every prompt in Vellum is versioned; the platform keeps a full history of changes and supports one-click rollback to any prior version.
The platform treats prompt deployments as first-class objects, separate from application code. A team defines a deployment and gives it a name; the application calls Vellum's API using that name. When a new prompt version is ready, it can be published to that deployment and all callers immediately receive the updated behavior without a code change or redeploy. This separation means product teams can update prompts independently of engineering release cycles.
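For illustration, the decoupling might look like the following sketch from the application side. The endpoint path, header name, payload shape, and the "customer-support-triage" deployment name are assumptions for the example, not Vellum's documented API.

```python
# Sketch of the deployment-name pattern: the application references a named
# deployment, so publishing a new prompt version changes behavior here
# without any code change or redeploy. Endpoint and payload are assumed.
import os
import requests

VELLUM_API_KEY = os.environ.get("VELLUM_API_KEY", "")

def run_prompt_deployment(deployment_name: str, inputs: dict) -> str:
    """Call a named prompt deployment; the prompt text itself lives in Vellum."""
    response = requests.post(
        "https://api.vellum.ai/v1/execute-prompt",  # assumed endpoint path
        headers={"X-API-KEY": VELLUM_API_KEY},
        json={
            "prompt_deployment_name": deployment_name,
            "inputs": [
                {"name": k, "type": "STRING", "value": v} for k, v in inputs.items()
            ],
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["outputs"][0]["value"]  # assumed response shape

# Usage (requires a valid API key and an existing deployment):
# answer = run_prompt_deployment(
#     "customer-support-triage",
#     {"ticket_body": "I was charged twice for my subscription."},
# )
```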
The deployment layer also enables A/B testing. A deployment can route a configurable percentage of traffic to a challenger prompt while the rest continues on the control version, and both variants are logged for comparison. Once sufficient data accumulates, the winning variant can be promoted with a single action.
A March 2025 product update added prompt version diffing: a side-by-side view of the changes between any two prompt versions, similar to a code diff. The same update introduced deployment release reviews, a workflow in which team members can approve or reject a prompt or workflow version before it goes live, modeled after the pull request review process in software development. This feature was added specifically for teams operating in regulated industries where change management documentation is required.
Collaboration is a design priority throughout the prompt management layer. Prompt history, comments, and test results are visible to all workspace members, making it practical for domain experts and product managers to participate in prompt iteration alongside engineers.
Vellum's evaluation system is organized around datasets: collections of inputs with expected outputs or scoring criteria. Teams assemble datasets that represent the range of inputs a production feature is likely to encounter, run their prompts or full workflows against those datasets, and score the results using one of several methods.
Scoring options include exact match, regex match, semantic similarity, and LLM-as-judge with custom rubrics. LLM-as-judge evaluation allows teams to define qualitative criteria, for example whether a response follows a particular tone of voice or correctly cites a source, and have a separate language model score each output against those criteria. Human labeling is also supported for high-stakes cases where automated scoring alone is insufficient.
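A rough sketch of what these scoring styles reduce to in code follows; the rubric text and the judge_model placeholder are illustrative stand-ins, not Vellum's built-in scorers.

```python
# Three scoring styles from the list above: exact match, regex match, and
# LLM-as-judge against a qualitative rubric. judge_model is any callable
# that sends a prompt to a separate model and returns its text output.
import re

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def regex_match(output: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, output) else 0.0

def llm_as_judge(output: str, rubric: str, judge_model) -> float:
    """Ask a separate model to grade the output against a qualitative rubric."""
    prompt = (
        "Score the response from 0 to 1 against the rubric. Reply with only the number.\n"
        f"Rubric: {rubric}\nResponse: {output}"
    )
    return float(judge_model(prompt))

# Example rubric: "Cites the refund policy and uses a calm, professional tone."
```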
Evaluation integrates directly with deployment. Before publishing a new prompt or workflow version, teams can run it against a test dataset and compare scores against the currently deployed version. This creates a quality gate: if the challenger performs worse than the baseline on the configured evaluation suite, the deployment can be blocked. Vellum calls this the test-driven development approach for AI, and it is the framework the company emphasizes most heavily in its positioning.
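Conceptually, the quality gate is a comparison between the deployed baseline and the challenger on the same dataset, as in this sketch; the scores and the gate() helper are hypothetical.

```python
# Block promotion if the challenger scores worse than the current baseline
# on the shared evaluation dataset.
from statistics import mean

def gate(baseline_scores: list[float], challenger_scores: list[float],
         min_margin: float = 0.0) -> bool:
    """Return True if the challenger may be promoted."""
    return mean(challenger_scores) >= mean(baseline_scores) + min_margin

baseline = [0.92, 0.88, 0.95, 0.90]    # scores for the currently deployed version
challenger = [0.94, 0.91, 0.93, 0.92]  # scores for the candidate version

if gate(baseline, challenger):
    print("Challenger meets the bar; promote the new version.")
else:
    raise SystemExit("Challenger regressed on the eval suite; blocking deployment.")
```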
Online evaluations extend the same scoring logic to production traffic. Every execution in production can be scored against configured metrics, and those scores feed into monitoring dashboards alongside latency and cost data. Teams can configure automatic test suite generation based on production traffic patterns, seeding new evaluation datasets from real usage without manual curation.
The March 2025 update added automatic evaluation setup, which generates starter test suites from existing production executions, lowering the barrier for teams that have never formally structured their evaluation process.
Vellum's workflow builder is a visual canvas where teams connect nodes to define the control flow of an AI feature. Nodes represent discrete operations: invoking a prompt, calling a tool or external API, performing retrieval from a vector database, executing custom Python or TypeScript code, applying guardrails, branching on conditions, running steps in parallel, or invoking a subworkflow. The canvas natively supports loops, recursion, and error-handling paths.
The March 2025 update introduced native retry and try functionality as node-level adornments. A Retry wrapper causes a node to keep retrying until it succeeds or reaches a maximum attempt count. A Try wrapper allows a workflow to continue gracefully if a node fails, rather than aborting the entire execution. Both patterns are common in production reliability engineering and were previously implemented with custom workarounds.
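In plain Python, the two patterns reduce to something like the sketch below; the with_retry and with_try helpers are illustrative stand-ins for the node-level adornments, not SDK functions.

```python
# Retry: re-run a flaky step until it succeeds or attempts are exhausted.
# Try: let the workflow continue when a non-critical step fails.
import time
from typing import Callable, Optional

def with_retry(fn: Callable[[], str], max_attempts: int = 3, backoff_s: float = 1.0) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the original error
            time.sleep(backoff_s * attempt)

def with_try(fn: Callable[[], str]) -> Optional[str]:
    try:
        return fn()
    except Exception:
        return None  # downstream nodes can branch on the missing value
```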
PDF support was also added, allowing direct upload of PDF documents as prompt inputs for document analysis workflows, compatible with vision-capable models.
The visual representation has a code mirror. Vellum's Workflows SDK, released as generally available in early 2025 and open-sourced under the MIT license, defines AI systems as declarative graphs: nodes represent tasks, edges define control flow. The SDK executes locally, with some nodes making round-trips to Vellum's servers for managed operations like prompt invocation and RAG retrieval. Sync is bidirectional: teams can build workflows visually in the canvas and pull the code representation using the CLI command vellum workflows pull, or build workflows locally in Python and push them to the canvas for debugging and collaboration. The SDK repository is publicly available at github.com/vellum-ai/vellum-python-sdks.
This bidirectional parity means engineers who prefer code can work entirely in the SDK while non-engineers iterate visually, and both stay synchronized without manual reconciliation.
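The sketch below conveys the declarative node-and-edge idea in generic Python; the Node and Graph classes are illustrative stand-ins, not the Workflows SDK's actual interfaces (see the vellum-ai/vellum-python-sdks repository for the real API).

```python
# Minimal declarative-graph sketch: nodes are tasks, edges are control flow,
# and executing the graph threads a shared state through each node in order.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    name: str
    run: Callable[[dict], dict]

@dataclass
class Graph:
    nodes: list[Node] = field(default_factory=list)

    def then(self, node: Node) -> "Graph":
        self.nodes.append(node)  # linear edge from the previous node
        return self

    def execute(self, state: dict) -> dict:
        for node in self.nodes:
            state = node.run(state)
        return state

retrieve = Node("retrieve", lambda s: {**s, "context": "docs matching " + s["query"]})
draft = Node("draft", lambda s: {**s, "answer": f"Answer using: {s['context']}"})

workflow = Graph().then(retrieve).then(draft)
print(workflow.execute({"query": "refund policy"}))
```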
RAG (retrieval-augmented generation) pipelines have dedicated support. Vellum provides a managed vector database and embeddings service, so teams can ingest documents, configure retrieval parameters, and wire up search nodes in workflows without managing separate vector infrastructure. Drata, a compliance automation company, used this feature to build a system managing more than 28,000 separate vector databases, one per customer environment.
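The wiring of a retrieval step into a prompt is conceptually simple, as in this self-contained sketch; the keyword scorer below only stands in for the managed vector search, which Vellum handles server-side.

```python
# RAG wiring: retrieve the most relevant chunks, then build the prompt.
def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    terms = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(terms & set(c.lower().split())), reverse=True)
    return scored[:top_k]

docs = [
    "Refunds are issued within 14 days of purchase.",
    "Enterprise plans include single sign-on and VPC deployment.",
    "Support is available by email 24/7.",
]
question = "How long do refunds take?"
context = "\n".join(retrieve(docs, question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be passed to a prompt node or deployment
```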
Agents are workflows that include LLM-driven decision-making loops. Vellum's agent builder allows teams to define what tools an agent can call, how it should reason about tool selection, and what conditions terminate the loop. The platform includes a natural language agent creation capability that accepts a plain-language description of what an agent should do and generates a starting workflow graph. "Vellum for Agents," announced as a platform extension in 2024, emphasizes building many specialized agents over building one monolithic agent, and includes 100-plus native integrations for reading from and writing to external systems.
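Stripped to its essentials, the agent loop looks something like the following; choose_action() is a placeholder for the model's tool-selection step, and the tools are invented for illustration.

```python
# Bare-bones agent loop: the model picks a tool, the tool runs, and the loop
# ends on a stop condition or an iteration cap.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda arg: f"Order {arg}: shipped yesterday",
    "issue_refund": lambda arg: f"Refund started for order {arg}",
}

def choose_action(goal: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM's tool-selection step; returns (tool_name, argument)
    or ("finish", final_answer)."""
    if not history:
        return "lookup_order", "A1234"
    return "finish", "The order shipped yesterday; no refund needed."

def run_agent(goal: str, max_steps: int = 5) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = choose_action(goal, history)
        if tool == "finish":
            return arg
        history.append(TOOLS[tool](arg))
    return "Stopped: step limit reached."

print(run_agent("Check on order A1234 and refund it if it is late."))
```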
Vellum treats every workflow and prompt as a versioned artifact. The deployment system supports stable, canary, and shadow release patterns. A stable release serves all production traffic. A canary release sends a configurable percentage of traffic to a new version while the remainder continues on the current release. A shadow release runs a new version in parallel with production, capturing inputs and outputs for comparison without affecting user responses.
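The three release patterns reduce to a routing decision plus an optional mirrored call, as in this sketch; the percentage, version labels, and call_version() stub are illustrative, since Vellum configures this on the deployment rather than in application code.

```python
# Stable/canary/shadow in miniature: most traffic hits stable, a slice hits
# canary, and the shadow version runs silently for comparison only.
import random

def call_version(version: str, request: dict) -> str:
    return f"[{version}] reply to: {request['text']}"  # stand-in for a real invocation

def route(canary_pct: float = 0.1) -> str:
    """Pick which release serves this request."""
    return "canary" if random.random() < canary_pct else "stable"

def handle(request: dict) -> str:
    serving = call_version(route(), request)       # stable or canary serves the user
    shadowed = call_version("shadow", request)     # mirrored run, never returned
    print("shadow comparison:", serving, "|", shadowed)
    return serving

print(handle({"text": "What is your refund policy?"}))
```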
Version labels allow teams to tag named releases. Rolling back to a prior version requires no code change; a team member with appropriate permissions can revert a deployment from the Vellum UI. Deployment release reviews, added in early 2025, allow teams to require explicit approval from one or more reviewers before a new version goes live.
The platform is available as a hosted SaaS service, as a VPC (virtual private cloud) deployment within a customer's own cloud account, and as a fully private self-hosted installation for organizations with strict data residency requirements. Vellum holds SOC 2 Type II certification and supports HIPAA compliance, which is relevant for customers in healthcare and financial services.
Vellum's observability layer captures each execution as a trace. For workflow executions, the trace shows every node's inputs and outputs, the raw provider request and response, token counts, cost, and latency. The trace view is an interactive replay of the execution graph: teams can step through a production run to understand what happened at each step and why.
A March 2025 update overhauled the monitoring views with a Grafana-based redesign for self-hosted and VPC deployments, providing faster load times and improved filtering. The same update added a Webhooks integration and native Datadog forwarding for teams that route observability data into existing infrastructure monitoring stacks.
Aggregate monitoring surfaces latency percentiles (p50 and p90), error rates, token usage, and cost trends, broken down by deployment, workflow, or model. Quality trends from online evaluations appear in the same dashboard alongside infrastructure metrics. Feedback attached to individual executions, such as a thumbs-down from a user, feeds into the monitoring views so teams can correlate user signals with the specific prompts and conditions that produced a poor output.
A workflow execution cost tracking feature, added in 2025, breaks down per-execution spend by node, allowing teams to identify the most expensive steps in a workflow and optimize accordingly.
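Per-node attribution amounts to grouping trace spans by node and summing cost, roughly as below; the trace structure shown is illustrative, not Vellum's actual schema.

```python
# Attribute per-execution spend to individual workflow nodes.
from collections import defaultdict

trace = [
    {"node": "retrieve", "cost_usd": 0.0004},
    {"node": "draft_answer", "cost_usd": 0.0123},
    {"node": "guardrail_check", "cost_usd": 0.0011},
    {"node": "draft_answer", "cost_usd": 0.0118},  # a retried step is billed twice
]

by_node: dict[str, float] = defaultdict(float)
for span in trace:
    by_node[span["node"]] += span["cost_usd"]

for node, cost in sorted(by_node.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node:16s} ${cost:.4f}")
```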
Vellum connects to major LLM providers through a unified API abstraction. Supported providers include OpenAI (including GPT-4o and GPT-4o mini), Anthropic (Claude), Google (Gemini, including Gemini 2.5 Pro), Cohere, Mistral, and Meta (LLaMA variants including LLaMA 3.3 70B). Hosted inference services include Azure OpenAI, AWS Bedrock, Fireworks, Perplexity, and Cerebras. Teams can bring their own API keys to maintain direct billing relationships with providers.
The provider abstraction enables model routing and fallback. A workflow can specify a primary model and one or more fallback models; if the primary returns an error or exceeds a latency threshold, Vellum routes the request to the fallback automatically. This pattern is common in production systems where a single provider outage would otherwise degrade the user experience.
Routing logic can also be cost-driven. Teams can configure a workflow to send simpler requests to a smaller, less expensive model and reserve more capable models for complex inputs, with routing decisions based on input characteristics or token count.
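Combined, the two routing ideas look roughly like this sketch; the model names, the 500-token threshold, and the call_model() stub are assumptions for illustration.

```python
# Cost-aware routing plus fallback: short inputs go to a cheaper model, and any
# provider error triggers automatic failover to a secondary model.
def call_model(model: str, prompt: str) -> str:
    if model == "primary-large":
        raise TimeoutError("simulated provider outage")  # forces the fallback path
    return f"[{model}] {prompt[:40]}..."

def estimate_tokens(prompt: str) -> int:
    return max(1, len(prompt) // 4)  # rough heuristic, ~4 characters per token

def routed_call(prompt: str) -> str:
    primary = "small-cheap" if estimate_tokens(prompt) < 500 else "primary-large"
    try:
        return call_model(primary, prompt)
    except Exception:
        return call_model("fallback-model", prompt)  # automatic failover

print(routed_call("Summarize this short customer email about a billing issue."))
```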
Model-level support is updated as new models are released. The March 2025 product update added support for Gemini 2.5 Pro, LLaMA 3.3 70B, and QwQ variants, reflecting the platform's intent to track the major provider release cadence closely.
Vellum offers four plan tiers. The Free plan is available at no cost and supports one user, 50 builder credits per month, hosted app publishing, and a knowledge base with up to 20 documents. The Pro plan is $25 per month per user and expands credits to 200 and the knowledge base to 1,000 documents. The Business plan is $50 per month and supports up to five users, 500 credits, and 10 GB of execution history. Enterprise pricing is custom and includes unlimited users, unlimited credits, role-based access control, single sign-on, and VPC or fully private deployment options.
The entry-level tiers make the platform accessible to individual developers and small teams prototyping AI features. The Enterprise tier targets organizations running AI in production at scale, where SOC 2 compliance, VPC isolation, SSO, and RBAC are procurement requirements.
Vellum competes in the LLMOps category with several platforms that address overlapping parts of the prompt-to-production workflow.
| Feature | Vellum | LangSmith | Langfuse | Humanloop |
|---|---|---|---|---|
| Prompt management | Yes | Yes | Yes | Yes |
| Visual workflow builder | Yes (canvas + code) | No (code-first via LangGraph) | No | Limited |
| Dataset-backed evaluation | Yes | Yes | Yes | Yes (sunsetting) |
| LLM-as-judge scoring | Yes | Yes | Yes | Yes |
| Online evaluations in production | Yes | Limited | Limited | Yes |
| Trace replay (graph view) | Yes | Yes | Yes | Yes |
| Managed RAG / vector DB | Yes | No | No | No |
| Multi-provider routing and fallback | Yes | Limited | No | No |
| VPC / self-hosted deployment | Yes | Limited | Yes (open source) | Limited |
| Deployment release reviews | Yes | No | No | No |
| Open source | SDK (MIT license) | Partial | Yes | No |
| SOC 2 Type II | Yes | Yes | Yes | Yes |
| Status (mid-2025) | Active | Active | Active | Sunsetting |
LangSmith is the observability and evaluation layer from LangChain, designed for applications built on the LangChain and LangGraph frameworks. It has mature tracing infrastructure and strong OpenTelemetry alignment. Its primary limitation as a standalone choice is tight coupling to the LangChain ecosystem; teams that build directly against provider SDKs or use other orchestration frameworks get less value from LangSmith's tooling. LangSmith does not include a visual workflow canvas; workflow definition is code-first through LangGraph.
Langfuse is an open-source LLM observability platform that emphasizes cost tracking, collaborative debugging, and framework-agnostic tracing. Its open-source model gives teams with strict data requirements the option to self-host entirely on their own infrastructure. Langfuse focuses on observability and evaluation rather than workflow orchestration, so teams that need a visual builder or managed RAG components will find it incomplete as a standalone solution.
Humanloop was a prompt management and evaluation platform that competed directly with Vellum through 2024. In 2025, Anthropic acqui-hired Humanloop's team, and the platform announced it was sunsetting with active migration guidance for existing customers. Several third-party migration guides published in 2025 named Vellum as a primary recommended destination for Humanloop customers with evaluation-heavy workflows.
The clearest differentiator Vellum holds over its remaining competitors is the scope of integration. Vellum covers prompt management, workflow orchestration with a visual canvas, evaluation (both offline and online), managed RAG, deployment with version control and release reviews, and production monitoring in a single product. Competing platforms tend to excel in one or two of these areas and require additional tools to complete the stack. Leaders Fund's investment thesis explicitly framed Vellum's value as eliminating the fragmented-tools problem: "teams currently struggle with fragmented point solutions, brittle systems vulnerable to prompt changes, unpredictable outputs from non-deterministic LLMs."
By the time of its Series A in July 2025, Vellum counted more than 150 customers across industries. Publicly disclosed customers include the following.
Redfin deployed "Ask Redfin," a conversational search experience, to millions of users across 14 real estate markets. Redfin's domain experts used Vellum's evaluation framework with thousands of test cases to validate the agent's behavior before rollout, and continued to use it to govern changes after launch.
Drata, a compliance automation platform, uses Vellum to power secure AI workflows across more than 7,000 customer environments. Each customer's data is stored in an isolated vector database provisioned through Vellum's managed RAG component, resulting in more than 28,000 active vector databases managed through the platform. Drata VP of Engineering Lior Solomon described Vellum's test-driven approach as helping the team "catch regressions early and iterate quickly."
DeepScribe, a medical documentation company, uses Vellum to power clinical note generation workflows. DeepScribe reported reducing the time clinicians spent iterating on notes by 20-40% through the combination of LLM-powered drafts and Vellum's feedback loop and regression testing tools.
Headspace, the meditation and mental health app, uses Vellum across multiple AI initiatives including automating customer support, extracting data, and building content generation agents.
Swisscom, the Swiss telecommunications provider, incorporated Vellum into its AI platform offering for Swiss banks and government agencies, citing the platform's security controls and reliability as requirements for those regulated deployments.
Rely Health used Vellum to accelerate deployment of healthcare workflow automation, reducing time from initial build to production from multi-month efforts to days.
GravityStack used Vellum to automate credit agreement review, reporting a 200% improvement in review speed.
Common use cases across the customer base include document parsing and data extraction, customer-facing chatbots and conversational agents, clinical and legal document review, content generation pipelines, compliance workflow automation, and internal process automation. Encore, a customer mentioned at the time of the seed round, reported 3x engineering productivity gains from integrating Vellum into its LLM-powered workflows.
Vellum's 2025 State of AI Development report, based on a survey of more than 1,250 AI developers published in January 2025, found that document parsing (59.7%) and customer service chatbots (51.4%) were the most common production AI use cases at surveyed organizations. The report also found that only 25.1% of organizations had deployed AI applications to production at all, that hallucinations were the top concern (cited by 57% of respondents), and that most teams still relied on manual checks for evaluation despite the availability of automated tools. OpenAI and Azure were the most common API providers, used by 63.3% and 33.8% of respondents respectively, followed by Anthropic and Google.
Vellum's approach to test-driven AI development attracted attention following the Series A announcement in July 2025. Industry coverage in SiliconANGLE, Business Wire, and AI-focused newsletters described the platform as addressing the prototype-to-production gap that had stalled many enterprise AI projects. The "10x faster time-to-market" metric cited by the company and the specific customer outcome data (Redfin's 14-market deployment, DeepScribe's 20-40% time reduction, GravityStack's 200% improvement in review speed) gave the announcement more concrete grounding than many funding announcements in the AI infrastructure category.
Leaders Fund's characterization of Vellum as the "modern software engineering equivalent for GenAI" resonated with developers who had experienced the gap firsthand. The firm's description of the market gap -- teams struggling with fragmented tools, brittle systems, and subjective evaluation -- matched a widely shared frustration in the developer community building LLM applications.
User reviews on G2 and SourceForge cited the visual workflow builder, bidirectional UI-to-code synchronization, and integrated evaluation tooling as the platform's strongest features. The ability to combine domain experts' knowledge with engineers' workflow infrastructure within a single shared workspace received consistent positive feedback from teams building AI features collaboratively across technical and non-technical roles.
The Humanloop sunset in 2025 benefited Vellum's positioning. With Humanloop customers actively seeking migration paths, third-party comparison guides published throughout 2025 frequently named Vellum as the recommended alternative for teams that relied on Humanloop's evaluation capabilities.
Vellum's 2025 State of AI Development report generated coverage in DevOps, enterprise IT, and AI newsletters. The finding that only 25% of organizations had deployed AI to production despite the broad interest in the technology reinforced the market opportunity the company was addressing.
Vellum has several limitations documented by users and independent reviewers.
Enterprise pricing is not publicly listed. Detailed pricing for VPC and private cloud deployments requires direct engagement with the sales team. This complicates pre-purchase cost modeling for procurement-heavy organizations and has been noted as a friction point in evaluations.
The evaluation suite, while comprehensive when used as part of Vellum's integrated workflow, is less deep than dedicated evaluation platforms. Teams with highly specialized evaluation needs, or those already invested in a purpose-built evaluation tool running on custom infrastructure, may find Vellum's built-in capabilities adequate for standard use cases but limiting for advanced ones.
The platform's agent capabilities built around the Model Context Protocol (MCP) were introduced relatively recently and were still maturing as of mid-2025. Teams evaluating advanced agent patterns were advised by independent reviewers to run a proof-of-concept test to assess coverage before committing.
Because Vellum routes LLM requests through its own infrastructure in the SaaS deployment model, teams with strict latency requirements may observe additional round-trip overhead compared to calling providers directly. The VPC and self-hosted deployment options reduce this concern for teams where performance is a constraint.
The Workflows SDK documentation was noted as substantially stronger for Python than for JavaScript/TypeScript, though both languages are supported. Teams working primarily in TypeScript may need to refer to the Python documentation for patterns not yet fully covered in TypeScript-specific guides.
Finally, as with any platform that abstracts over multiple LLM providers, Vellum's integrations for newer or less common models may lag provider-direct access by days or weeks when a new model releases. The company has historically added major model support quickly (e.g., Gemini 2.5 Pro and LLaMA 3.3 70B appeared in the March 2025 update), but teams with very early-access requirements may need to work around the integration delay.