BentoML
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,419 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,419 words
Add missing citations, update stale details, or suggest a clearer explanation.
BentoML is an open-source Python framework for packaging, serving, and deploying machine learning and AI models as production inference services. It was first released in 2019 by Chaoyu Yang and a small team operating under the corporate entity Atalaya Tech Inc., and the company has built a commercial product, BentoCloud, on top of the open-source library.[1][2] The framework standardizes model packaging through an artifact format called a Bento, which bundles model files, Python source code, dependencies, and runtime configuration into a single deployable unit.[3] BentoML is widely used alongside or as an alternative to NVIDIA's Triton Inference Server, Ray Serve, KServe, and managed inference platforms such as Modal and Replicate. In February 2026, BentoML was acquired by Modular, the AI infrastructure company founded by Chris Lattner, and the open-source project continues under the Apache 2.0 license.[4][5]
| Field | Detail |
|---|---|
| Initial release | 2019 (open source)[1] |
| Founder / CEO | Chaoyu Yang[2] |
| Corporate entity | Atalaya Tech Inc. (dba BentoML)[6] |
| Headquarters | San Francisco, California[6] |
| License (OSS) | Apache 2.0[4] |
| Primary language | Python (Python 3.9 or newer)[7] |
| Source repository | github.com/bentoml/BentoML[3] |
| Current OSS version | v1.4.39 (released May 7, 2026)[3] |
| Commercial product | BentoCloud (managed inference platform)[8] |
| Related projects | OpenLLM, Yatai, BentoVLLM, BentoTriton[9][10] |
| Funding | Seed: $9M (June 2023, DCM Ventures lead); Series A: $9M (reported July 2024, Greylock lead)[11][12] |
| Acquirer | Modular (announced February 10, 2026)[4] |
BentoML's founder, Chaoyu Yang, joined Databricks as a software engineer in 2014, where he worked on the company's unified analytics platform and served as an early product manager on the open-source MLflow project.[13] At Databricks, Yang observed that even sophisticated machine learning customers struggled to take trained models into production, which left a gap between training pipelines and reliable, scalable model serving. In 2018, Yang left Databricks and founded a company initially called Atalaya Tech Inc. to address that gap.[13][11]
BentoML itself was open-sourced in 2019.[1][2] The early library focused on a small but important set of problems: how to save a trained scikit-learn, XGBoost, PyTorch, or TensorFlow model along with its preprocessing code; how to expose that artifact as a REST endpoint with minimal boilerplate; and how to build a Docker image around it for portable deployment. Early adopters included the Japanese messaging platform Line, where engineer Sungjun Kim integrated BentoML in 2019 across shopping search, content recommendations, and user targeting use cases. The South Korean internet conglomerate Naver was also among the early production users.[14]
After several years of public 0.x releases, the project's first stable major version, BentoML 1.0, was announced on July 12, 2022.[15] The 1.0 release was a redesign that introduced three central abstractions: a Bento as the standardized, immutable deployable artifact; a model store with framework-aware save_model and load_model primitives; and a new Runner abstraction designed to parallelize inference workloads across independent worker processes.[15] BentoML 1.0 was explicitly not backwards compatible with the earlier 0.13.x line, which the team treated as a maintenance branch.[15]
Alongside 1.0, the company released Yatai, a Kubernetes-native deployment system for BentoML services. Yatai 1.0 was announced on November 7, 2022.[16] Yatai exposes a BentoDeployment Custom Resource Definition (CRD) that lets operators describe a Bento deployment declaratively and lets the controller reconcile pods, services, horizontal pod autoscalers, and ingresses in any Kubernetes cluster.[16][10] Yatai automatically splits a Bento into separately scalable microservices, so a model's GPU-bound inference Runner can scale on GPU nodes while CPU-bound preprocessing or postprocessing services scale on cheaper hardware.[16] The latest stable Yatai release at the time of writing is v1.1.13 from October 2023; the project has since been positioned as a community-maintained option, with managed deployment now offered through BentoCloud.[10]
On June 20, 2023, the BentoML team announced OpenLLM, an open-source wrapper specifically for large language model serving.[17] OpenLLM originally provided one-command launching for open-weights models including Flan-T5, Dolly v2, ChatGLM, StarCoder, Falcon, and StableLM, and it explicitly integrated with both BentoML for service definitions and LangChain for downstream application code.[17] Over subsequent releases the project pivoted to expose any supported model as an OpenAI-compatible HTTP API, with a built-in chat UI at /chat, support for multiple inference backends including vLLM, and a default model repository covering the Llama 3 series, Mistral 7B and Mistral-Large variants, Qwen 2.5, DeepSeek R1, Gemma, Phi, and Pixtral.[9] OpenLLM is maintained in the same GitHub organization as BentoML and uses BentoML's serving engine under the hood, with optional managed hosting on BentoCloud.[9]
In June 2023, the company also closed a $9 million Seed financing round led by DCM Ventures, with Bow Capital participating; Hurst Lin of DCM joined the board.[11][18] At the time of the seed round, BentoML reported more than 3,000 community users and customers including Naver and Line, along with named users such as Microsoft.[18][11]
The company introduced BentoCloud, a managed serverless inference platform built on the BentoML open-source engine, during 2023 and brought it to general availability in 2024.[19] BentoCloud added enterprise features that the OSS project does not implement on its own, including scale-to-zero, concurrency-based autoscaling, optimized cold-start handling for GPU instances, an external request queue, and observability dashboards.[20] A Bring-Your-Own-Cloud (BYOC) mode lets enterprises run BentoCloud's control plane while keeping the data plane (containers, models, and serving traffic) entirely inside their own AWS, Azure, or Google Cloud VPC, which the company markets to regulated industries that cannot send model weights or inference traffic to a third-party SaaS.[8][19]
BentoML 1.2 was released on February 19, 2024.[21] That release replaced the explicit Runner/Service split with a single Python class decorated by @bentoml.service, in which inference methods are annotated with @bentoml.api, dependencies between services are declared with bentoml.depends(), and configuration moved from a separate YAML file into Python decorators.[21] The 1.2 line also added typed I/O for common ML primitives including numpy.ndarray, torch.Tensor, pandas.DataFrame, Pydantic models, and pathlib.Path for binary payloads, and consolidated the build and deploy workflow into a single bentoml deploy . command.[21]
In a 2023 year-in-review post, the company reported that BentoML had crossed 15,000 GitHub stars, was used by more than 1,300 projects on GitHub, and that its community channels (Slack and Discord) had grown past 4,000 members.[22] By the time of the Modular acquisition in 2026, the company stated that more than 10,000 organizations relied on BentoML to ship models to production, including more than 50 Fortune 500 companies.[4]
On February 10, 2026, Modular and BentoML announced that Modular had acquired BentoML.[4][5] Modular, founded by Chris Lattner (creator of LLVM, Clang, Swift, and MLIR) and former Google TPU lead Tim Davis, develops the MAX inference engine and the Mojo programming language as a unified, hardware-portable AI compute stack. The stated rationale for the acquisition was to combine Modular's hardware-portable inference kernels (MAX, Mammoth) with BentoML's deployment and operations layer, so that customers can move workloads across NVIDIA and AMD accelerators without rebuilding their serving infrastructure.[4] Both companies emphasized that the open-source BentoML project would continue under the Apache 2.0 license without changes to its public APIs, governance, or roadmap, and that BentoCloud customer contracts would be honored unchanged.[4][5] An "Ask Me Anything" session with Chris Lattner and Chaoyu Yang was held on February 17, 2026, to answer community questions about the integration roadmap.[4]
A Bento is the central artifact in the BentoML world: an immutable, versioned bundle that contains everything required to reproduce a model service in any environment. Concretely, a Bento packages source code (typically a service.py module), Python dependencies (either pinned directly or pulled from a requirements.txt), Python interpreter version, model weights pulled from BentoML's model store, environment variables, and other runtime configuration.[7][23] A Bento is built with the bentoml build command, which produces a versioned tag, and a .bentoignore file (analogous to .gitignore) can exclude files from the bundle.[23] Any Bento can be turned into a portable OCI image with bentoml containerize <tag>, which auto-generates a Dockerfile, installs dependencies, copies the bundle, and configures the entrypoint; the resulting image runs in any Docker-compatible runtime.[23]
The Bento format is deliberately framework-agnostic. The BentoML model store can save models from PyTorch, TensorFlow, the Hugging Face Transformers library, scikit-learn, XGBoost, ONNX, and roughly fifteen other libraries, with each integration handling that framework's preferred serialization format.[7]
Since BentoML 1.2, the canonical service definition is a Python class decorated with @bentoml.service. HTTP endpoints are regular Python methods decorated with @bentoml.api, and resource configuration (GPU type, replicas, concurrency, timeout) is passed as arguments to the service decorator.[21][24] The framework reads Python type hints on the API method signature to generate request and response schemas: primitive types, Pydantic models, NumPy arrays, PyTorch tensors, Pandas DataFrames, and pathlib.Path for file uploads are all natively supported.[21]
For applications that compose multiple models, BentoML 1.2's bentoml.depends() declaration lets one service depend on another. The dependent service can then call methods on its dependency as if they were local Python function calls, but at runtime each service is deployed as a separately scalable process or pod with its own resource allocation.[21] This pattern is used to build inference graphs in which, for example, a CPU-bound text preprocessor, a GPU-bound embedding model, a retriever, and an LLM each scale independently.[24]
In BentoML 1.0 and 1.1, the analogous abstraction was the Runner, a unit of computation that ran in its own Python worker process and could be parallelized across multiple workers. A bentoml.Service declared a list of Runners and exposed APIs that delegated to them; the framework handled inter-process communication, adaptive batching, and resource binding.[25] Runners are still supported and a bentoml.runner_service() helper is provided for migrating 1.1 code into the 1.2 class-based model with minimal changes.[25]
BentoML's serving engine implements adaptive micro-batching: independent client requests that arrive within a configurable latency budget are batched together into a single forward pass, which dramatically improves GPU utilization for workloads that do not naturally arrive batched.[15] The 1.x line also added native async def API support, streaming responses (for token-by-token LLM output), long-running background jobs via task queues, WebSocket endpoints, and gRPC alongside REST.[7][24]
BentoCloud uses the same BentoML engine but layers on infrastructure features specifically designed for GPU-backed model serving. The platform autoscales on concurrency (the number of in-flight requests per replica) rather than CPU or memory, because for batched GPU inference concurrency correlates with load more directly than the system-level metrics that traditional autoscalers use.[20] Cold starts are mitigated by pre-loading model weights into the container image at build time and by stream-loading large weights directly into GPU memory; an external request queue can buffer traffic during scale-up.[20] BYOC deployments give the customer a Kubernetes cluster in their own cloud account that the BentoCloud control plane orchestrates, so model weights and inference traffic never leave the customer's VPC.[8]
The BentoML GitHub organization (github.com/bentoml) maintains a number of related repositories beyond the core library.[9][10]
| Project | Repository | Purpose |
|---|---|---|
| BentoML | bentoml/BentoML | Core open-source serving library[3] |
| OpenLLM | bentoml/OpenLLM | One-command serving of open-source LLMs with OpenAI-compatible APIs[9] |
| Yatai | bentoml/Yatai | Kubernetes deployment operator with the BentoDeployment CRD[10] |
| yatai-deployment | bentoml/yatai-deployment | Companion controller for launching Bentos in a cluster[10] |
| BentoVLLM | bentoml/BentoVLLM | Reference example for self-hosting LLMs with vLLM and BentoML[9] |
| BentoTriton | bentoml/BentoTriton | Reference example for running NVIDIA Triton models inside a Bento[3] |
| build-bento-action | bentoml/build-bento-action | GitHub Action to build a Bento from a repository[3] |
OpenLLM and BentoVLLM together cover the LLM serving path: OpenLLM provides a turnkey OpenAI-compatible server with a chat UI, while BentoVLLM is a customizable template for users who want to build their own vLLM-backed inference service inside the standard BentoML framework.[9]
BentoML's publicly named customers, drawn from the company's own customer page and from journalistic coverage, include the Japanese messaging company Line, the South Korean internet group Naver, the location intelligence company TomTom, the consumer credit card issuer Mission Lane, the structured-knowledge platform Yext, the computer-vision startup Neurolabs, the experiences marketplace GetYourGuide, the generative visuals company Jabali AI, and Ben Labs.[26][14][18] Microsoft has also been named publicly as a BentoML user.[18] The company says more than 10,000 organizations and 50-plus Fortune 500 companies use BentoML in production, with one named customer (Yext) reportedly scaling to over 150 production models across multiple regions.[4][26]
The open-source project has consistently been one of the larger model-serving repositories on GitHub. The BentoML team reported crossing 15,000 stars in 2023 and the main repository now lists 8.7k stars (note that an organization fork at github.com/bentoai/BentoML also exists; the canonical repository is github.com/bentoml/BentoML).[22][3] OpenLLM and related repositories add further visibility, and BentoML packages on PyPI have accumulated millions of monthly downloads.[3]
BentoML occupies the "Python-first, framework-agnostic model server" position in a crowded landscape of inference systems. The most commonly cited alternatives are NVIDIA Triton, Ray Serve, KServe, Modal, and Replicate; each makes different trade-offs.
| System | Primary abstraction | Best fit | Kubernetes-native |
|---|---|---|---|
| BentoML | Bento (Python service + packaging) | Python-first teams needing portable serving across clouds[27][7] | Optional (via Yatai or BentoCloud)[10] |
| NVIDIA Triton Inference Server | Model repository (multi-framework) | GPU-heavy multi-model workloads, especially on NVIDIA hardware[27] | Via KServe or operators |
| Ray Serve | Python deployments on a Ray cluster | Teams already standardized on the Ray distributed computing stack[27] | Indirectly (via KubeRay) |
| KServe | Kubernetes InferenceService CRD | Cloud-native organizations with a Kubernetes-first platform[28] | Yes (built on Knative) |
| Modal | Serverless Python on Modal-managed GPUs | Teams that want zero infrastructure and accept platform lock-in[28] | No (proprietary scheduler) |
| Replicate | Hosted API for community models (Cog) | Quickly calling popular open-source models without self-hosting[27] | No (managed only) |
Several of these systems are complementary rather than strictly competitive. BentoML can wrap an NVIDIA Triton model server (via the BentoTriton example) so that the Triton model runs as a Runner inside a larger Bento.[3] OpenLLM and BentoVLLM use vLLM as their inference backend while BentoML provides the HTTP layer, scaling, and packaging.[9] BentoCloud's deployment plane can run on top of any Kubernetes cluster, so users with existing KServe or Triton infrastructure can adopt BentoML for the developer-facing API layer without replacing their orchestration tooling.[8] Independent comparisons place BentoML and Triton on the "self-managed" end of the spectrum, with Modal and Replicate at the fully managed end and Ray Serve and KServe sitting between as cluster-resident frameworks.[27][28]
A 2024 BentoML engineering benchmark on Llama 3 (8B and 70B) used BentoML as the integration shell to compare five different LLM inference backends (vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI) on an NVIDIA A100 80GB GPU at three concurrency levels (10, 50, 100 users). The study reported that LMDeploy achieved the highest peak token generation rate (up to roughly 4,000 tokens per second on Llama 3 8B at 100 concurrent users) and the lowest TTFT at low load, while vLLM held the most consistent TTFT across concurrency levels.[29] The benchmark also illustrated BentoML's positioning as a backend-agnostic serving framework: the same BentoML service definition was used to expose all five backends behind a consistent REST API.[29]
BentoML and its extensions are used to serve a wide range of model types:
bentoml.depends().[24]The company explicitly markets BentoML at AI startups that want to go from a local notebook to a production API quickly, and at regulated enterprises that need to keep model weights and inference traffic inside their own VPC via BYOC.[8][19]
BentoML's position has trade-offs that the project itself documents openly:
InferenceService and Cog's image format), and the value of the abstraction depends on how invested a team is in BentoML-specific tooling.[28]