LiteLLM
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,019 words
LiteLLM is an open-source Python SDK and proxy server (commonly called the "LiteLLM Proxy" or "LiteLLM Gateway") that exposes more than 100 large-language-model provider APIs through the OpenAI Chat Completions request and response format.[1] Released in August 2023 by BerriAI, a Y Combinator Winter 2023 startup founded by Ishaan Jaffer and Krrish Dholakia, the project consolidates calls to providers such as OpenAI, Anthropic, Amazon Bedrock, Google Vertex AI, Azure OpenAI, Cohere, Mistral AI, Together AI, Replicate, Hugging Face and Ollama behind a single litellm.completion() function call.[2][3] The accompanying proxy server adds routing, automatic fallbacks, retries, virtual API keys, budget and cost tracking, and pluggable observability callbacks for stacks such as Langfuse, Helicone and MLflow.[4][5] LiteLLM has become one of the most widely deployed components of the LLM-operations ("LLMOps") layer, with Netflix, Lemonade, Rocket Money and Adobe listed as production users on the project's homepage.[6][7]
| Attribute | Value |
|---|---|
| Developer | BerriAI (YC W23) |
| Founders | Ishaan Jaffer, Krrish Dholakia |
| Initial public release | August 2023 |
| License | MIT (open source core); commercial enterprise license |
| Language | Python |
| Repository | github.com/BerriAI/litellm |
| Latest release (as of writing) | v1.85.1 (21 May 2026) |
| Components | Python SDK, Proxy Server ("LiteLLM Gateway") |
| Supported providers | 100+ |
| Headquarters | San Francisco, California |
LiteLLM was built inside BerriAI, a company founded in 2023 by Krrish Dholakia and Ishaan Jaffer.[2][8] The founders were attempting to build a "chat-with-your-data" SaaS product and discovered that supporting multiple LLM back-ends, each with a different request schema, error model and streaming convention, was a substantial engineering burden; LiteLLM was extracted from that internal need.[8] BerriAI was accepted into Y Combinator's Winter 2023 batch.[9]
Y Combinator publicly launched LiteLLM via its Launch YC channel on 24 August 2023.[10] Y Combinator's announcement described the project as "an open source package that allows you to call 100+ LLM APIs (like Llama2, Anthropic, and Huggingface) using the OpenAI format."[10] At launch the library standardised inputs, outputs and exceptions across providers, shipped with more than 50 test cases, and already included logging integrations to Sentry, PostHog and Helicone.[10] The first tagged GitHub release of the SDK preceded the YC launch by approximately two weeks.[9]
BerriAI raised a $1.6 million seed round in 2023 co-led by Y Combinator with participation from Gravity Fund and Pioneer Fund.[11] The project grew rapidly: by mid-2025 InfoWorld reported "over 20,000 GitHub stars and 2,600 forks,"[7] and by May 2026 the official BerriAI/litellm repository reported approximately 47,800 stars, 8,200 forks, 21,300 dependent projects and 1,328 releases.[1] The LiteLLM homepage cites more than one billion requests proxied and over 240 million Docker pulls for the official container image, with more than 1,000 individual contributors to the project.[6][12] Y Combinator's company page lists ten employees in San Francisco and reports $7 million in annual recurring revenue for the company.[9]
The release cadence is unusually fast for an open-source infrastructure project: the BerriAI/litellm GitHub release archive listed 1,328 total releases as of 21 May 2026, an average of more than one tagged build per day over the project's roughly thirty-three-month public history, and the maintainers have at times shipped multiple patch releases within a single day to accommodate API changes from upstream providers.[1] The team explicitly markets "day-zero" coverage of new flagship models, and the homepage quotes Netflix engineer David Leen as saying that LiteLLM has let his team expose newly released models to internal users "usually within a day" of release.[6]
The central design choice of LiteLLM is to use OpenAI's /v1/chat/completions request and response schema as a lingua franca for all supported providers.[3][13] Each provider integration is a translation layer that maps incoming OpenAI-shaped messages, tool definitions, streaming chunks and error codes onto the provider's native API, then converts the response back into the OpenAI shape that the caller expects.[3] As the docs put it, LiteLLM "maps every provider's errors to the OpenAI exception types," so existing client code written against the OpenAI SDK works against any LiteLLM-supported back-end with only a change of the model string.[14]
LiteLLM's provider catalogue is organised across several endpoint families. The Python SDK exposes litellm.completion, litellm.acompletion (the async variant), litellm.embedding, litellm.image_generation, litellm.audio_transcription, litellm.responses (a translator for the OpenAI Responses API), litellm.batches and litellm.assistants calls, all routed through the unified mapper.[3][13] Streaming is normalised so that downstream code sees OpenAI-style delta chunks regardless of provider.[14]
This design choice has several engineering consequences. Because OpenAI's API is widely understood and well-tooled, applications can use existing OpenAI client libraries, retry middleware, and inspection utilities directly. Provider differences that lack an OpenAI analogue (for example, Anthropic's earlier system-prompt convention or the per-provider format for tool definitions and tool-result messages) are normalised in the translator, with provider-specific extensions accessible through dedicated parameters where needed. The translation layer also performs token accounting using each provider's tokenizer when one is available, so that cost and rate-limit calculations reflect the provider's actual billable units rather than a uniform but inaccurate approximation.[20]
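The translation step can be illustrated with a small sketch. It is not LiteLLM's actual code: the function name `to_top_level_system` is hypothetical, it handles only the system-prompt difference, and it assumes a target API that takes the system prompt as a top-level `system` field and that every message's `content` is a plain string.

```python
from typing import Any


def to_top_level_system(model: str, messages: list[dict[str, Any]],
                        max_tokens: int = 1024) -> dict[str, Any]:
    """Hypothetical translator: move OpenAI-style system messages into a
    top-level `system` field and keep the remaining turns as messages."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat_turns = [m for m in messages if m["role"] != "system"]
    payload: dict[str, Any] = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": chat_turns,
    }
    if system_parts:
        # Several system messages are joined into one prompt (an assumption).
        payload["system"] = "\n".join(system_parts)
    return payload


payload = to_top_level_system(
    "claude-3-5-sonnet-20241022",
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hello"},
    ],
)
```

A real translator would also have to map tool definitions, streaming chunks and error types, which this sketch leaves out.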
A minimal example mirrors the OpenAI Python SDK exactly except for the provider prefix in the model string:[14]
```python
import os

from litellm import completion

# Placeholder key; supply a real OpenAI API key before running.
os.environ["OPENAI_API_KEY"] = "..."

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Swapping providers is a one-line change:[14]
```python
# Assumes ANTHROPIC_API_KEY is set in the environment.
response = completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello"}],
)
```
The SDK ships an in-process Router class that can be configured with multiple deployments per model name, routing strategy, retries, cooldowns and fallback chains, making it usable as a library-level load balancer without standing up the proxy server.[15] Router behaviour is configured via a model list of dictionaries, each containing a public model_name (the logical alias clients will request), a litellm_params block (provider, deployment-specific API base, key and tuning parameters), plus optional metadata such as rpm, tpm and weight. Multiple entries can share the same model_name, in which case the Router treats them as deployments of a single virtual model and selects between them at request time according to the configured strategy.[15]
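A model list of this shape might look like the following sketch. The Azure deployment name, the environment-variable names and the `rpm` values are illustrative assumptions, and the exact placement of `rpm`, `tpm` and `weight` may differ between LiteLLM versions. The Router is built inside a function so that the data structure can be inspected without the library installed.

```python
import os

# Two deployments share one public alias, so the Router would treat them
# as deployments of a single virtual model named "gpt-4o".
model_list = [
    {
        "model_name": "gpt-4o",  # logical alias that clients request
        "litellm_params": {
            "model": "azure/my-gpt-4o-deployment",  # hypothetical deployment
            "api_base": os.environ.get("AZURE_API_BASE", ""),
            "api_key": os.environ.get("AZURE_API_KEY", ""),
        },
        "rpm": 600,  # illustrative requests-per-minute capacity
    },
    {
        "model_name": "gpt-4o",
        "litellm_params": {
            "model": "openai/gpt-4o",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
        },
        "rpm": 300,
    },
]


def build_router():
    """Create the Router lazily; requires the litellm package."""
    from litellm import Router

    return Router(
        model_list=model_list,
        routing_strategy="simple-shuffle",
        num_retries=2,
    )
```

With valid credentials, `build_router().completion(model="gpt-4o", messages=[...])` would send the request to one of the two deployments.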
The LiteLLM Proxy, also marketed as the "LiteLLM Gateway" and "AI Gateway," is a FastAPI-based service that exposes OpenAI-compatible HTTP endpoints (/chat/completions, /embeddings, /images/generations, /audio/transcriptions, /batches and others) and proxies them to any combination of configured back-end providers.[4][16] Because the surface is OpenAI-compatible, "any client that works with OpenAI works with the proxy, no code changes needed,"[4] including the official OpenAI SDKs, LangChain, LlamaIndex and Instructor.[16]
The gateway is configured via a YAML file (commonly config.yaml) that defines model groups, deployment ordering, routing strategy, fallbacks, budgets and observability sinks; deployment is typically via Docker, Helm or the BerriAI-managed "LiteLLM Cloud" SaaS, with PostgreSQL as the backing store for keys and spend logs and Redis used to share rate-limit and load-balancing state across replicas.[4][17] The published reference architecture, described in the official Docker quick-start tutorial, places one or more proxy replicas behind an HTTP load balancer, with PostgreSQL handling persistent state and Redis handling ephemeral, per-second counters such as token budgets and concurrency limits.[4]
The proxy advertises its OpenAI compatibility as a "drop-in" replacement: a developer can change the OPENAI_BASE_URL environment variable that the OpenAI SDK reads and the rest of the application becomes vendor-agnostic without further changes.[4][16] This makes the proxy attractive as a migration tool for organisations that already have substantial code targeting the OpenAI SDK and want to move some or all of their traffic to other providers without a code rewrite. The same pattern works for agentic frameworks such as the OpenAI Agents SDK, LangChain agents, and LlamaIndex query engines, all of which speak the OpenAI HTTP API natively.[16]
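The drop-in property comes down to the request shape staying the same while only the base URL changes. The sketch below builds (but does not send) such a request with the standard library; the proxy address `http://localhost:4000`, the key value and the helper name `build_chat_request` are assumptions for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build an OpenAI-shaped chat completion request for any base URL."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Hypothetical proxy address and virtual key.
PROXY_URL = "http://localhost:4000"
req = build_chat_request(PROXY_URL, "sk-example-virtual-key", "gpt-4o", "Hello")
# Sending is left to the caller, e.g. urllib.request.urlopen(req).
```

An application using the official OpenAI SDK would get the same effect by pointing `OPENAI_BASE_URL` at the proxy instead of building requests by hand.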
The Router supports several strategies, including a default simple-shuffle for low overhead, latency-based routing, usage-based routing (tokens per minute or requests per minute, shared across replicas via Redis), least-busy, cost-based, weighted random and pluggable custom strategies.[15] Deployments can be tagged with an integer order; the router tries deployments with the lowest order value first and escalates to higher values only on failure.[15] When a deployment fails, the router applies a per-deployment cooldown (default three failures per minute, five-second cooldown) so that a single misbehaving back-end is isolated rather than the whole model group being shut down.[15]
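The interaction between deployment order and cooldowns can be sketched in a few lines. This is a simplified model, not the Router's implementation: the class and function names are hypothetical, the defaults mirror the figures quoted above, and the sketch assumes that a cooldown starts once failures inside the window exceed the allowed count.

```python
import random
from collections import defaultdict, deque


class CooldownTracker:
    """Hypothetical per-deployment failure tracker with a cooldown."""

    def __init__(self, allowed_fails=3, window_s=60.0, cooldown_s=5.0):
        self.allowed_fails = allowed_fails
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._fails = defaultdict(deque)  # deployment id -> failure times
        self._cooling_until = {}          # deployment id -> resume time

    def record_failure(self, dep_id, now):
        fails = self._fails[dep_id]
        fails.append(now)
        # Forget failures that fell out of the sliding window.
        while fails and now - fails[0] > self.window_s:
            fails.popleft()
        if len(fails) > self.allowed_fails:
            self._cooling_until[dep_id] = now + self.cooldown_s
            fails.clear()

    def is_healthy(self, dep_id, now):
        return now >= self._cooling_until.get(dep_id, 0.0)


def pick_deployment(deployments, tracker, now, rng=random):
    """Pick randomly among healthy deployments with the lowest order."""
    healthy = [d for d in deployments if tracker.is_healthy(d["id"], now)]
    if not healthy:
        return None
    best = min(d["order"] for d in healthy)
    return rng.choice([d for d in healthy if d["order"] == best])


tracker = CooldownTracker()
deployments = [{"id": "primary", "order": 1}, {"id": "backup", "order": 2}]
first = pick_deployment(deployments, tracker, now=0.0)
for t in (1.0, 2.0, 3.0, 4.0):  # four failures within one minute
    tracker.record_failure("primary", t)
during_cooldown = pick_deployment(deployments, tracker, now=4.5)
after_cooldown = pick_deployment(deployments, tracker, now=9.5)
```

Only the failing deployment is skipped while it cools down; the rest of the model group keeps serving traffic.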
LiteLLM distinguishes three fallback families: standard fallbacks for general errors (rate limits, timeouts, 5xx), content-policy fallbacks that fire on provider content-policy violations, and context-window fallbacks that route to a larger-context model when the request exceeds a deployment's window.[18] Fallbacks are configured as ordered lists per model name, can be overridden per request via "fallbacks": [...] in the request body, and can be disabled entirely with "disable_fallbacks": true.[18] An enable_pre_call_checks mode lets the router filter out deployments that cannot satisfy a request's context length or region requirements before the call is made.[18]
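The three fallback families can be modelled as separate ordered chains that are consulted according to the kind of error. The sketch below is a simplified illustration rather than LiteLLM's code: `ProviderError`, `complete_with_fallbacks` and the flat dictionary layout of `config` are assumptions, and the model names are placeholders.

```python
class ProviderError(Exception):
    """Hypothetical error carrying the fallback family it belongs to."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind  # "general", "content_policy" or "context_window"


FALLBACK_KEYS = {
    "general": "fallbacks",
    "content_policy": "content_policy_fallbacks",
    "context_window": "context_window_fallbacks",
}


def complete_with_fallbacks(model, call, config, disable_fallbacks=False):
    """Try `model`, then the ordered chain matching the error family."""
    try:
        return call(model)
    except ProviderError as err:
        if disable_fallbacks:
            raise
        chain = config.get(FALLBACK_KEYS[err.kind], {}).get(model, [])
        last_error = err
        for alternative in chain:
            try:
                return call(alternative)
            except ProviderError as alt_err:
                last_error = alt_err
        raise last_error


config = {
    "fallbacks": {"gpt-4o": ["claude-sonnet"]},
    "context_window_fallbacks": {"gpt-4o": ["gpt-4o-long"]},
}


def fake_call(model):
    # Stand-in for a provider call: the primary model rejects the request
    # because it is too long; every other model answers.
    if model == "gpt-4o":
        raise ProviderError("context_window")
    return f"answered by {model}"


result = complete_with_fallbacks("gpt-4o", fake_call, config)
```

Because the error is a context-window failure, the request goes to the context-window chain and not to the general one.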
Once the proxy is connected to a PostgreSQL database and a "master key" is set, administrators can mint virtual API keys via the /key/generate endpoint.[19] Each virtual key can be scoped to a model allowlist, a team, a user and a budget, and the proxy enforces tokens-per-minute and requests-per-minute rate limits at the key, team and user levels.[19][20] Spend is computed using LiteLLM's completion_cost() function against a per-provider pricing table and is automatically attributed to the requesting key, team, user and organisation; the /global/spend/report endpoint aggregates spend by team, customer or API key, and per-call cost is also returned to the caller in the x-litellm-response-cost response header.[20] Enterprise-tier features layered on top include tag-based budgets, model-specific budgets per virtual key, temporary budget increases, soft-budget email alerts, and richer spend-logging metadata.[21]
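A key-minting call might be assembled as in the sketch below, which builds the request without sending it. The field names (`models`, `team_id`, `max_budget`, `rpm_limit`, `tpm_limit`) reflect a common reading of the endpoint and should be checked against the proxy's API reference; the URL, master key and helper name are hypothetical.

```python
import json
import urllib.request


def build_key_request(proxy_url, master_key, *, models, team_id,
                      max_budget, rpm_limit, tpm_limit):
    """Build a POST to /key/generate scoped by models, team and limits."""
    payload = {
        "models": models,          # model allowlist for the new key
        "team_id": team_id,        # team the spend is attributed to
        "max_budget": max_budget,  # budget in USD (assumed unit)
        "rpm_limit": rpm_limit,    # requests per minute
        "tpm_limit": tpm_limit,    # tokens per minute
    }
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {master_key}",
        },
        method="POST",
    )


key_request = build_key_request(
    "http://localhost:4000",      # hypothetical proxy address
    "sk-example-master-key",      # hypothetical master key
    models=["gpt-4o"],
    team_id="search-team",
    max_budget=50.0,
    rpm_limit=60,
    tpm_limit=100_000,
)
```

The response to such a request would carry the new virtual key, which applications then use in place of a provider key.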
Virtual keys also serve as the unit of model-aliasing: an administrator can configure a key so that requests for gpt-4 are rewritten to gpt-4o-mini (downgrade) or to a fine-tuned Claude model on Bedrock (cross-provider substitution), without the calling application needing to know about the change. Combined with model groups in the YAML configuration, this allows the platform team to perform centralised model deprecation, A/B testing of model upgrades, and cost-driven traffic shaping without coordinating code changes in every downstream application.[19] The proxy also exposes endpoints to list, update, regenerate and revoke virtual keys, supporting key-rotation workflows that include grace periods during which both old and new keys remain valid.[19]
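Conceptually, per-key aliasing is a lookup applied to the request before routing. The following sketch shows that idea only; the function name and the alias table are hypothetical and do not reflect how the proxy stores aliases internally.

```python
def apply_key_aliases(request_body: dict, key_aliases: dict[str, str]) -> dict:
    """Return a copy of the request with the model name rewritten
    according to the aliases attached to the calling key."""
    rewritten = dict(request_body)
    requested = request_body["model"]
    rewritten["model"] = key_aliases.get(requested, requested)
    return rewritten


# Hypothetical alias table: this key's gpt-4 traffic is downgraded.
key_aliases = {"gpt-4": "gpt-4o-mini"}

original = {"model": "gpt-4", "messages": [{"role": "user", "content": "Hi"}]}
rewritten = apply_key_aliases(original, key_aliases)
```

The caller's request is left unmodified, so the application continues to believe it is talking to the model it asked for.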
LiteLLM exposes three callback hooks: input_callback, success_callback and failure_callback. Each can be set to a list of named integrations; for example, litellm.success_callback = ["posthog", "helicone", "langfuse", "lunary"] and litellm.failure_callback = ["sentry", "lunary", "langfuse"].[22] Built-in callback targets include Langfuse, Lunary, Helicone, LangSmith, Traceloop, Athina, Sentry, PostHog, Slack, Arize, PromptLayer, MLflow, DeepEval, Braintrust, DataDog and OpenTelemetry, alongside others.[22] The proxy also exposes Prometheus metrics and supports per-team logging so that, for example, a team's traffic can be sent to its own Langfuse project.[21]
Because the callback API is uniform across providers, observability traces produced by LiteLLM carry the same fields (token counts, latency, cost, model, key, team, user) regardless of which back-end ultimately served the request. This makes the gateway a natural point for cross-provider analytics dashboards and for cross-provider evaluations, where the same prompt is sent to several models in parallel and the resulting traces are compared after the fact. Langfuse's own integration documentation describes LiteLLM as a recommended source for ingest, and observability vendors such as Helicone and Lunary publish reciprocal walkthroughs that pair their platforms with LiteLLM-managed traffic.[22]
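The value of uniform trace fields is that aggregation code does not need to know which provider served a request. The sketch below assumes a hypothetical `TraceRecord` shape built from the fields listed above; the real payload delivered to callbacks has its own structure and naming.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical provider-independent trace record."""
    model: str
    team: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float


def spend_by_team(records):
    """Sum cost per team, regardless of which back-end served each call."""
    totals = defaultdict(float)
    for record in records:
        totals[record.team] += record.cost_usd
    return dict(totals)


records = [
    TraceRecord("openai/gpt-4o", "search", 120, 40, 310.0, 0.25),
    TraceRecord("anthropic/claude-3-5-sonnet-20241022", "search", 90, 60, 420.0, 0.5),
    TraceRecord("openai/gpt-4o", "support", 200, 80, 290.0, 0.5),
]
totals = spend_by_team(records)
```

The same loop works for latency or token totals, which is what makes cross-provider dashboards straightforward to build on top of the gateway.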
LiteLLM's MIT-licensed open-source core already includes 100+ provider integrations, virtual keys, budgets, teams, load balancing and guardrail hooks at no cost.[23] BerriAI sells a separate enterprise tier, recommended for organisations running the gateway "at scale, 100+ users or 10+ production AI use-cases," that adds single sign-on for the admin UI (Okta, Azure AD, Google Workspace, OIDC and SAML), JWT-based authentication, audit logs with configurable retention, role-based access control, IP allowlisting, automated key rotations and integrations with secret managers including AWS KMS, Azure Key Vault and HashiCorp Vault.[17][21] Governance features include a four-tier multi-tenant hierarchy (organisations, teams, projects, keys), tag-based spend tracking, programmatic spend reports and per-key or per-team guardrails for secret redaction and content moderation.[21]
BerriAI also operates a managed "LiteLLM Cloud" deployment, which the company describes as SOC 2 Type 2 and ISO 27001 certified and load-tested at 1,000 requests per second.[17] Enterprise support includes dedicated Slack or Teams channels with service-level agreements ranging from a one-hour response for production-blocking issues to twenty-four hours for non-urgent matters.[17] Pricing is custom and quoted on request rather than published.[23]
The split between open-source core and paid enterprise tier follows a common LLMOps "open-core" pattern. Critical features such as the unified API, the Router, virtual keys, the admin UI and basic spend tracking are available to anyone using the public Docker image or the PyPI package, while features tied to compliance, governance and large-team operations are gated behind a license key that the enterprise distribution requires at startup.[17][23] BerriAI's published guidance is that small teams and individual developers should stay on the open-source core, while organisations exceeding roughly 100 users or 10 production AI use-cases will typically need the enterprise feature set to operate the gateway safely at scale.[17]
LiteLLM's provider matrix spans hosted frontier APIs, hyperscaler model marketplaces, regional clouds, open-source inference servers, image-generation services and audio APIs. A non-exhaustive list, drawn from the official provider index, includes:[24]
| Category | Representative providers |
|---|---|
| Hosted frontier LLMs | OpenAI, Anthropic, xAI, Cohere, AI21, Mistral AI |
| Hyperscaler marketplaces | AWS Bedrock, AWS SageMaker, Azure OpenAI, Azure AI, Google Vertex AI, Google AI Studio / Gemini |
| Multi-tenant inference clouds | Together AI, Replicate, Fireworks AI, DeepInfra, Groq, Databricks |
| Open-source / self-hosted runtimes | Ollama, vLLM, LM Studio, Llamafile, Xinference |
| Image and audio APIs | Stability AI, Black Forest Labs (FLUX), Recraft, ElevenLabs, Deepgram |
| Aggregator and developer tools | Hugging Face, OVHCloud, Volcano Engine, DataRobot |
Any service that exposes an OpenAI-compatible HTTP interface can additionally be invoked simply by passing openai/<model-name> along with a custom api_base, which is the mechanism LiteLLM uses for community-run inference servers and OpenAI-compatible gateways.[24]
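A call against such a server might look like the sketch below. The model name, the endpoint address and the placeholder key are assumptions for illustration, and the call is wrapped in a function because it needs the litellm package and a running server.

```python
# Arguments for an OpenAI-compatible server that LiteLLM has no
# dedicated integration for.
request_kwargs = {
    "model": "openai/my-local-model",        # hypothetical model name
    "api_base": "http://localhost:8000/v1",  # hypothetical endpoint
    "api_key": "placeholder-key",            # many local servers ignore it
    "messages": [{"role": "user", "content": "Hello"}],
}


def call_custom_endpoint():
    """Send the request; requires litellm and a reachable server."""
    from litellm import completion

    return completion(**request_kwargs)
```

The `openai/` prefix tells LiteLLM to speak the OpenAI wire format, while `api_base` redirects the call to the custom server.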
LiteLLM is most often compared with three other components of the "AI gateway" layer: OpenRouter, Portkey and Helicone.[25][26]
OpenRouter is an API aggregator that fronts roughly 300 models from more than 60 providers and bills callers directly for usage (typically adding a 5% markup), making it appealing for prototypes and consumer apps that prefer not to manage individual provider accounts.[25] LiteLLM, Portkey and Helicone instead sit between an organisation's application and the providers, using the organisation's own API keys; payment flows to each provider directly and the gateway is paid for separately (or self-hosted).[25]
Portkey is a closed-core commercial gateway whose main differentiation is "production safety" features such as built-in guardrails, PII redaction, jailbreak detection and audit trails, starting at $49/month for managed use.[25] Helicone is primarily an observability platform that also functions as a lightweight proxy; it is open source, written largely in Rust, and emphasises load-balancing performance and analytics.[25][26] LiteLLM occupies the "maximum customisation, self-hostable" position: the open-source core is free with unlimited self-hosted use (callers pay only their upstream providers and any infrastructure costs), and the proxy is more configurable than Portkey or OpenRouter at the cost of more setup time.[25][26]
| Tool | Model | Hosting | Pricing model |
|---|---|---|---|
| LiteLLM | Open source proxy + SDK | Self-hosted or LiteLLM Cloud | Free OSS; enterprise quoted |
| OpenRouter | Aggregator, single billing | Hosted | ~5% markup on traffic |
| Portkey | Commercial gateway | Hosted (and self-host option) | From $49/month |
| Helicone | Observability + lightweight proxy | Self-hosted or hosted | Open source; SaaS tiers |
LiteLLM is also adjacent to higher-level orchestration libraries like LangChain and LlamaIndex, which typically call out to model providers through LiteLLM rather than competing with it, and to inference-server projects such as vLLM and Hugging Face Text Generation Inference, which sit one layer below LiteLLM and supply the actual model serving.[14][24]
LiteLLM markets itself primarily to two audiences: individual developers and small teams that want to keep a single code path while experimenting with multiple models, and platform teams that need to expose models to an entire engineering organisation with central governance.[4][23] On the corporate side, the LiteLLM homepage and Y Combinator profile cite Rocket Money, Samsara, Lemonade, Adobe and Netflix as production users; Netflix Staff Software Engineer David Leen is quoted saying that "LiteLLM has let my team provide the latest LLM models to our users usually within a day" of their release, and Lemonade Principal Architect Mark Koltnuk has called the experience "outstanding."[6][9] InfoWorld additionally describes the project as offering "day-zero access to new models with minimal overhead" for organisations including Netflix, Lemonade and Rocket Money.[7]
Common deployment patterns include using the SDK directly in a single application for portable provider selection, running the proxy as an internal AI gateway behind which all teams' applications make OpenAI-shaped calls (a pattern that yields organisation-wide cost tracking, central guardrails and central key management), and using the proxy as the back-end for tools that already speak the OpenAI protocol such as IDE assistants, RAG frameworks and agentic systems built with the Model Context Protocol.[4][7][16]
Another common deployment pattern uses LiteLLM as a translation layer in front of self-hosted inference: a fleet of open-source models served by vLLM, Hugging Face Text Generation Inference, Ollama or LM Studio is registered as deployments behind a LiteLLM Proxy, which then presents an OpenAI-compatible surface to applications. This decouples the choice of inference runtime from the application layer and allows organisations to mix hosted frontier models with self-hosted open-weight models behind the same set of virtual keys and budgets.[24]
LiteLLM has been adopted as the model-routing layer in several agentic and developer-tooling stacks. The project's own README cites integrations with LangGraph and the Google Vertex AI Agent Engine, and notes Model Context Protocol support so that any LLM accessible through LiteLLM can call MCP tool servers.[1] The combination of unified provider access, automatic fallbacks and centralised budgets makes the gateway particularly useful in long-running agent workflows, where transient provider outages or capacity exhaustion would otherwise cascade into broken agent traces.[1][15]
Several limitations and risks have been documented in independent coverage and in the project's own changelog.
Comparative reviews note that LiteLLM's flexibility comes at a configuration cost; one 2026 comparison estimates 15 to 30 minutes of YAML configuration to stand up a production proxy versus under five minutes for OpenRouter or Portkey.[25] Because LiteLLM normalises every provider to the OpenAI schema, provider-specific features that have no OpenAI analogue are exposed through ad-hoc pass-through fields, and behaviour can drift when providers add new response fields faster than mappings are updated.[14]
A high-impact incident occurred in March 2026, when malicious versions 1.82.7 and 1.82.8 of the litellm package were published to PyPI through a compromised upstream maintainer account (the broader incident also affected the unrelated trivy CI/CD package).[27][28] The malicious release contained a litellm_init.pth file that executed a base64-encoded payload on Python import, performed credential exfiltration to an attacker-controlled endpoint, encrypted captured material with RSA, and attempted Kubernetes lateral movement and installation of a persistent systemd service.[28][29] An unintended fork bomb in the payload aided detection. PyPI quarantined the package roughly thirty minutes after the first vulnerability report, by which point the package had been live for about forty-six minutes; researcher Simon Willison subsequently estimated that approximately 47,000 downloads occurred during the exposure window across the two tainted versions, and noted that 88% of the 2,337 downstream packages that depend on LiteLLM did not pin a safe version range.[27][28] The maintainers published a post-mortem on Hacker News and rotated credentials and signing keys.[28][29] The incident is frequently cited as evidence of the importance of dependency pinning, SBOM hygiene and supply-chain controls for the LLMOps tool ecosystem rather than as a flaw in LiteLLM's runtime architecture per se.[27][29]
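One practical control that follows from the incident is to check an environment for the affected releases. The sketch below uses only the standard library; the version list is taken from the account above, and the helper names are hypothetical.

```python
from importlib import metadata

# Releases named in the March 2026 incident described above.
COMPROMISED_VERSIONS = frozenset({"1.82.7", "1.82.8"})


def is_compromised(version: str) -> bool:
    """Return True if the given version string is a known-bad release."""
    return version.strip() in COMPROMISED_VERSIONS


def installed_litellm_version():
    """Return the installed litellm version, or None if absent."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> str:
    """Summarise whether this environment holds a known-bad release."""
    version = installed_litellm_version()
    if version is None:
        return "litellm is not installed"
    if is_compromised(version):
        return f"litellm {version} is a known-bad release"
    return f"litellm {version} is not on the known-bad list"
```

A check like this complements, and does not replace, pinning exact versions in requirements files and verifying package hashes at install time.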
Beyond supply-chain risk, the gateway pattern itself introduces operational considerations. Every request to a LiteLLM-managed model adds one extra network hop; while the LiteLLM homepage claims an 8 millisecond P95 overhead at 1,000 requests per second, that figure depends on co-location with the proxy, careful tuning of Python workers and the use of Redis for shared state.[1][6] Self-hosted deployments are also responsible for the availability of the proxy itself: an outage of the LiteLLM Proxy will block traffic to every underlying provider, so operators typically run multiple replicas behind a load balancer and treat the Redis and PostgreSQL dependencies as production-critical components.[4]