# BentoML

> Source: https://aiwiki.ai/wiki/bentoml
> Updated: 2026-07-16
> Categories: Developer Tools, MLOps, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**BentoML** is an open-source Python framework for packaging, serving, and deploying machine learning and AI models as production inference services. It was first released in 2019 by Chaoyu Yang and a small team operating under the corporate entity Atalaya Tech Inc., and the company has built a commercial product, **BentoCloud**, on top of the open-source library.[^1][^2] The framework standardizes model packaging through an artifact format called a **Bento**, which bundles model files, Python source code, dependencies, and runtime configuration into a single deployable unit.[^3] BentoML is widely used alongside or as an alternative to NVIDIA's [Triton Inference Server](/wiki/nvidia_triton_inference_server), Ray Serve, KServe, and managed inference platforms such as Modal and [Replicate](/wiki/replicate). In February 2026, BentoML was acquired by Modular, the AI infrastructure company founded by Chris Lattner, and the open-source project continues under the Apache 2.0 license.[^4][^5]

## Overview

| Field | Detail |
|---|---|
| Initial release | 2019 (open source)[^1] |
| Founder / CEO | Chaoyu Yang[^2] |
| Corporate entity | Atalaya Tech Inc. (dba BentoML)[^6] |
| Headquarters | San Francisco, California[^6] |
| License (OSS) | Apache 2.0[^4] |
| Primary language | Python (Python 3.9 or newer)[^7] |
| Source repository | github.com/bentoml/BentoML[^3] |
| Current OSS version | v1.4.39 (released May 7, 2026)[^3] |
| Commercial product | BentoCloud (managed inference platform)[^8] |
| Related projects | OpenLLM, Yatai, BentoVLLM, BentoTriton[^9][^10] |
| Funding | Seed: $9M (June 2023, DCM Ventures lead); Series A: $9M (reported July 2024, Greylock lead)[^11][^12] |
| Acquirer | Modular (announced February 10, 2026)[^4] |

## History

### Origins and the Atalaya years (2018 to 2019)

BentoML's founder, Chaoyu Yang, joined Databricks as a software engineer in 2014, where he worked on the company's unified analytics platform and served as an early product manager on the open-source [MLflow](/wiki/mlflow) project.[^13] At [Databricks](/wiki/databricks), Yang observed that even sophisticated machine learning customers struggled to take trained models into production, which left a gap between training pipelines and reliable, scalable model serving. In 2018, Yang left Databricks and founded a company initially called Atalaya Tech Inc. to address that gap.[^13][^11]

BentoML itself was open-sourced in 2019.[^1][^2] The early library focused on a small but important set of problems: how to save a trained [scikit-learn](/wiki/scikit_learn), [XGBoost](/wiki/xgboost), [PyTorch](/wiki/pytorch), or [TensorFlow](/wiki/tensorflow) model along with its preprocessing code; how to expose that artifact as a REST endpoint with minimal boilerplate; and how to build a Docker image around it for portable deployment. Early adopters included the Japanese messaging platform Line, where engineer Sungjun Kim integrated BentoML in 2019 across shopping search, content recommendations, and user targeting use cases. The South Korean internet conglomerate Naver was also among the early production users.[^14]

### Version 1.0 and Yatai (2022)

After several years of public 0.x releases, the project's first stable major version, BentoML 1.0, was announced on July 12, 2022.[^15] The 1.0 release was a redesign that introduced three central abstractions: a **Bento** as the standardized, immutable deployable artifact; a model store with framework-aware `save_model` and `load_model` primitives; and a new **Runner** abstraction designed to parallelize inference workloads across independent worker processes.[^15] BentoML 1.0 was explicitly not backwards compatible with the earlier 0.13.x line, which the team treated as a maintenance branch.[^15]

Alongside 1.0, the company released **Yatai**, a Kubernetes-native deployment system for BentoML services. Yatai 1.0 was announced on November 7, 2022.[^16] Yatai exposes a `BentoDeployment` Custom Resource Definition (CRD) that lets operators describe a Bento deployment declaratively and lets the controller reconcile pods, services, horizontal pod autoscalers, and ingresses in any [Kubernetes](/wiki/nvidia_triton_inference_server) cluster.[^16][^10] Yatai automatically splits a Bento into separately scalable microservices, so a model's GPU-bound inference Runner can scale on GPU nodes while CPU-bound preprocessing or postprocessing services scale on cheaper hardware.[^16] The latest stable Yatai release at the time of writing is v1.1.13 from October 2023; the project has since been positioned as a community-maintained option, with managed deployment now offered through BentoCloud.[^10]

### OpenLLM and the LLM era (2023)

On June 20, 2023, the BentoML team announced **OpenLLM**, an open-source wrapper specifically for large language model serving.[^17] OpenLLM originally provided one-command launching for open-weights models including Flan-T5, Dolly v2, ChatGLM, StarCoder, Falcon, and StableLM, and it explicitly integrated with both BentoML for service definitions and LangChain for downstream application code.[^17] Over subsequent releases the project pivoted to expose any supported model as an OpenAI-compatible HTTP API, with a built-in chat UI at `/chat`, support for multiple inference backends including [vLLM](/wiki/vllm), and a default model repository covering the [Llama 3](/wiki/llama_3) series, [Mistral 7B](/wiki/mistral_7b) and Mistral-Large variants, Qwen 2.5, DeepSeek R1, Gemma, Phi, and Pixtral.[^9] OpenLLM is maintained in the same GitHub organization as BentoML and uses BentoML's serving engine under the hood, with optional managed hosting on BentoCloud.[^9]

In June 2023, the company also closed a $9 million Seed financing round led by DCM Ventures, with Bow Capital participating; Hurst Lin of DCM joined the board.[^11][^18] At the time of the seed round, BentoML reported more than 3,000 community users and customers including Naver and Line, along with named users such as Microsoft.[^18][^11]

### BentoCloud and growth (2023 to 2025)

The company introduced **BentoCloud**, a managed serverless inference platform built on the BentoML open-source engine, during 2023 and brought it to general availability in 2024.[^19] BentoCloud added enterprise features that the OSS project does not implement on its own, including scale-to-zero, concurrency-based autoscaling, optimized cold-start handling for GPU instances, an external request queue, and observability dashboards.[^20] A **Bring-Your-Own-Cloud (BYOC)** mode lets enterprises run BentoCloud's control plane while keeping the data plane (containers, models, and serving traffic) entirely inside their own AWS, Azure, or Google Cloud VPC, which the company markets to regulated industries that cannot send model weights or inference traffic to a third-party SaaS.[^8][^19]

BentoML 1.2 was released on February 19, 2024.[^21] That release replaced the explicit Runner/Service split with a single Python class decorated by `@bentoml.service`, in which inference methods are annotated with `@bentoml.api`, dependencies between services are declared with `bentoml.depends()`, and configuration moved from a separate YAML file into Python decorators.[^21] The 1.2 line also added typed I/O for common ML primitives including `numpy.ndarray`, `torch.Tensor`, `pandas.DataFrame`, Pydantic models, and `pathlib.Path` for binary payloads, and consolidated the build and deploy workflow into a single `bentoml deploy .` command.[^21]

In a 2023 year-in-review post, the company reported that BentoML had crossed 15,000 GitHub stars, was used by more than 1,300 projects on GitHub, and that its community channels (Slack and Discord) had grown past 4,000 members.[^22] By the time of the Modular acquisition in 2026, the company stated that more than 10,000 organizations relied on BentoML to ship models to production, including more than 50 Fortune 500 companies.[^4]

### Acquisition by Modular (2026)

On February 10, 2026, Modular and BentoML announced that Modular had acquired BentoML.[^4][^5] Modular, founded by Chris Lattner (creator of LLVM, Clang, Swift, and MLIR) and former Google TPU lead Tim Davis, develops the MAX inference engine and the Mojo programming language as a unified, hardware-portable AI compute stack. The stated rationale for the acquisition was to combine Modular's hardware-portable inference kernels (MAX, Mammoth) with BentoML's deployment and operations layer, so that customers can move workloads across NVIDIA and AMD accelerators without rebuilding their serving infrastructure.[^4] Both companies emphasized that the open-source BentoML project would continue under the Apache 2.0 license without changes to its public APIs, governance, or roadmap, and that BentoCloud customer contracts would be honored unchanged.[^4][^5] An "Ask Me Anything" session with Chris Lattner and Chaoyu Yang was held on February 17, 2026, to answer community questions about the integration roadmap.[^4]

## Technical details

### The Bento packaging format

A **Bento** is the central artifact in the BentoML world: an immutable, versioned bundle that contains everything required to reproduce a model service in any environment. Concretely, a Bento packages source code (typically a `service.py` module), Python dependencies (either pinned directly or pulled from a `requirements.txt`), Python interpreter version, model weights pulled from BentoML's model store, environment variables, and other runtime configuration.[^7][^23] A Bento is built with the `bentoml build` command, which produces a versioned tag, and a `.bentoignore` file (analogous to `.gitignore`) can exclude files from the bundle.[^23] Any Bento can be turned into a portable OCI image with `bentoml containerize <tag>`, which auto-generates a Dockerfile, installs dependencies, copies the bundle, and configures the entrypoint; the resulting image runs in any Docker-compatible runtime.[^23]

The Bento format is deliberately framework-agnostic. The BentoML model store can save models from [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), the Hugging Face [Transformers](/wiki/transformers_library) library, [scikit-learn](/wiki/scikit_learn), [XGBoost](/wiki/xgboost), ONNX, and roughly fifteen other libraries, with each integration handling that framework's preferred serialization format.[^7]

### Services, APIs, and dependencies

Since BentoML 1.2, the canonical service definition is a Python class decorated with `@bentoml.service`. HTTP endpoints are regular Python methods decorated with `@bentoml.api`, and resource configuration (GPU type, replicas, concurrency, timeout) is passed as arguments to the service decorator.[^21][^24] The framework reads Python type hints on the API method signature to generate request and response schemas: primitive types, Pydantic models, NumPy arrays, PyTorch tensors, Pandas DataFrames, and `pathlib.Path` for file uploads are all natively supported.[^21]

For applications that compose multiple models, BentoML 1.2's `bentoml.depends()` declaration lets one service depend on another. The dependent service can then call methods on its dependency as if they were local Python function calls, but at runtime each service is deployed as a separately scalable process or pod with its own resource allocation.[^21] This pattern is used to build inference graphs in which, for example, a CPU-bound text preprocessor, a GPU-bound embedding model, a retriever, and an LLM each scale independently.[^24]

### Runners (legacy 1.0 / 1.1)

In BentoML 1.0 and 1.1, the analogous abstraction was the **Runner**, a unit of computation that ran in its own Python worker process and could be parallelized across multiple workers. A `bentoml.Service` declared a list of Runners and exposed APIs that delegated to them; the framework handled inter-process communication, adaptive batching, and resource binding.[^25] Runners are still supported and a `bentoml.runner_service()` helper is provided for migrating 1.1 code into the 1.2 class-based model with minimal changes.[^25]

### Adaptive batching, async, and streaming

BentoML's serving engine implements **adaptive micro-batching**: independent client requests that arrive within a configurable latency budget are batched together into a single forward pass, which dramatically improves GPU utilization for workloads that do not naturally arrive batched.[^15] The 1.x line also added native `async def` API support, streaming responses (for token-by-token LLM output), long-running background jobs via task queues, WebSocket endpoints, and gRPC alongside REST.[^7][^24]

### BentoCloud serving stack

BentoCloud uses the same BentoML engine but layers on infrastructure features specifically designed for GPU-backed model serving. The platform autoscales on **concurrency** (the number of in-flight requests per replica) rather than CPU or memory, because for batched GPU inference concurrency correlates with load more directly than the system-level metrics that traditional autoscalers use.[^20] Cold starts are mitigated by pre-loading model weights into the container image at build time and by stream-loading large weights directly into GPU memory; an external request queue can buffer traffic during scale-up.[^20] BYOC deployments give the customer a Kubernetes cluster in their own cloud account that the BentoCloud control plane orchestrates, so model weights and inference traffic never leave the customer's VPC.[^8]

## Ecosystem and related projects

The BentoML GitHub organization (github.com/bentoml) maintains a number of related repositories beyond the core library.[^9][^10]

| Project | Repository | Purpose |
|---|---|---|
| BentoML | bentoml/BentoML | Core open-source serving library[^3] |
| OpenLLM | bentoml/OpenLLM | One-command serving of open-source LLMs with OpenAI-compatible APIs[^9] |
| Yatai | bentoml/Yatai | Kubernetes deployment operator with the `BentoDeployment` CRD[^10] |
| yatai-deployment | bentoml/yatai-deployment | Companion controller for launching Bentos in a cluster[^10] |
| BentoVLLM | bentoml/BentoVLLM | Reference example for self-hosting LLMs with [vLLM](/wiki/vllm) and BentoML[^9] |
| BentoTriton | bentoml/BentoTriton | Reference example for running NVIDIA Triton models inside a Bento[^3] |
| build-bento-action | bentoml/build-bento-action | GitHub Action to build a Bento from a repository[^3] |

OpenLLM and BentoVLLM together cover the LLM serving path: OpenLLM provides a turnkey OpenAI-compatible server with a chat UI, while BentoVLLM is a customizable template for users who want to build their own vLLM-backed inference service inside the standard BentoML framework.[^9]

## Adoption

BentoML's publicly named customers, drawn from the company's own customer page and from journalistic coverage, include the Japanese messaging company Line, the South Korean internet group Naver, the location intelligence company TomTom, the consumer credit card issuer Mission Lane, the structured-knowledge platform Yext, the computer-vision startup Neurolabs, the experiences marketplace GetYourGuide, the generative visuals company Jabali AI, and Ben Labs.[^26][^14][^18] Microsoft has also been named publicly as a BentoML user.[^18] The company says more than 10,000 organizations and 50-plus Fortune 500 companies use BentoML in production, with one named customer (Yext) reportedly scaling to over 150 production models across multiple regions.[^4][^26]

The open-source project has consistently been one of the larger model-serving repositories on GitHub. The BentoML team reported crossing 15,000 stars in 2023 and the main repository now lists 8.7k stars (note that an organization fork at github.com/bentoai/BentoML also exists; the canonical repository is github.com/bentoml/BentoML).[^22][^3] OpenLLM and related repositories add further visibility, and BentoML packages on PyPI have accumulated millions of monthly downloads.[^3]

## Comparison with adjacent systems

BentoML occupies the "Python-first, framework-agnostic model server" position in a crowded landscape of inference systems. The most commonly cited alternatives are NVIDIA Triton, Ray Serve, KServe, Modal, and Replicate; each makes different trade-offs.

| System | Primary abstraction | Best fit | Kubernetes-native |
|---|---|---|---|
| BentoML | Bento (Python service + packaging) | Python-first teams needing portable serving across clouds[^27][^7] | Optional (via Yatai or BentoCloud)[^10] |
| NVIDIA Triton Inference Server | Model repository (multi-framework) | GPU-heavy multi-model workloads, especially on NVIDIA hardware[^27] | Via KServe or operators |
| Ray Serve | Python deployments on a Ray cluster | Teams already standardized on the Ray distributed computing stack[^27] | Indirectly (via KubeRay) |
| KServe | Kubernetes `InferenceService` CRD | Cloud-native organizations with a Kubernetes-first platform[^28] | Yes (built on Knative) |
| Modal | Serverless Python on Modal-managed GPUs | Teams that want zero infrastructure and accept platform lock-in[^28] | No (proprietary scheduler) |
| [Replicate](/wiki/replicate) | Hosted API for community models (Cog) | Quickly calling popular open-source models without self-hosting[^27] | No (managed only) |

Several of these systems are complementary rather than strictly competitive. BentoML can wrap an NVIDIA Triton model server (via the BentoTriton example) so that the Triton model runs as a Runner inside a larger Bento.[^3] OpenLLM and BentoVLLM use [vLLM](/wiki/vllm) as their inference backend while BentoML provides the HTTP layer, scaling, and packaging.[^9] BentoCloud's deployment plane can run on top of any Kubernetes cluster, so users with existing KServe or Triton infrastructure can adopt BentoML for the developer-facing API layer without replacing their orchestration tooling.[^8] Independent comparisons place BentoML and Triton on the "self-managed" end of the spectrum, with Modal and Replicate at the fully managed end and Ray Serve and KServe sitting between as cluster-resident frameworks.[^27][^28]

A 2024 BentoML engineering benchmark on Llama 3 (8B and 70B) used BentoML as the integration shell to compare five different LLM inference backends (vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI) on an NVIDIA A100 80GB GPU at three concurrency levels (10, 50, 100 users). The study reported that [LMDeploy](/wiki/lmdeploy) achieved the highest peak token generation rate (up to roughly 4,000 tokens per second on Llama 3 8B at 100 concurrent users) and the lowest TTFT at low load, while [vLLM](/wiki/vllm) held the most consistent TTFT across concurrency levels.[^29] The benchmark also illustrated BentoML's positioning as a backend-agnostic serving framework: the same BentoML service definition was used to expose all five backends behind a consistent REST API.[^29]

## Applications

BentoML and its extensions are used to serve a wide range of model types:

- Classical ML and tabular models (e.g., gradient-boosted trees and [scikit-learn](/wiki/scikit_learn) pipelines) deployed as low-latency REST or gRPC services.[^7]
- Computer-vision pipelines, including the [Stable Diffusion](/wiki/stable_diffusion) family of text-to-image models served through BentoML.[^11][^7]
- LLM serving via OpenLLM and BentoVLLM, exposing any open-weights model behind an OpenAI-compatible HTTP API with a chat UI.[^9]
- Retrieval-augmented generation systems that compose embedding models, vector retrieval, and LLM generation as a graph of dependent services.[^24]
- Multi-agent systems where multiple model-backed agents are deployed as cooperating services using `bentoml.depends()`.[^24]

The company explicitly markets BentoML at AI startups that want to go from a local notebook to a production API quickly, and at regulated enterprises that need to keep model weights and inference traffic inside their own VPC via BYOC.[^8][^19]

## Limitations and criticisms

BentoML's position has trade-offs that the project itself documents openly:

- The framework is Python-only on the service side. Teams that want to write services in Go, Rust, or other languages cannot use BentoML's service decorators directly, although a Bento can wrap a non-Python inference backend (such as a Triton model server) as a Runner.[^7]
- The Yatai Kubernetes operator has moved more slowly than the core library. As of the most recent release (v1.1.13, October 2023), the project page states that "Yatai for BentoML 1.2 is currently under construction"; users who want a fully supported Kubernetes deployment story are now generally directed to BentoCloud (including BYOC) rather than self-managed Yatai.[^10]
- The 1.0 to 1.2 evolution introduced two breaking changes within roughly eighteen months (the 0.x to 1.0 redesign in 2022 and the Runner-to-class redesign in 2024), which has required existing users to migrate code across major versions.[^15][^21]
- The "Bento" packaging format duplicates some work already done by general container build systems and by other model-serving projects' standard formats (e.g., KServe's `InferenceService` and Cog's image format), and the value of the abstraction depends on how invested a team is in BentoML-specific tooling.[^28]

## See also

- [MLflow](/wiki/mlflow)
- [NVIDIA Triton Inference Server](/wiki/nvidia_triton_inference_server)
- [vLLM](/wiki/vllm)
- [LMDeploy](/wiki/lmdeploy)
- [TensorFlow Serving](/wiki/tensorflow_serving)
- [Amazon SageMaker](/wiki/amazon_sagemaker)
- [Replicate](/wiki/replicate)
- [Ollama](/wiki/ollama)
- [Hugging Face](/wiki/hugging_face)
- [Hugging Face Transformers](/wiki/transformers_library)
- [LangChain](/wiki/langchain)
- [LlamaIndex](/wiki/llamaindex)
- [Databricks](/wiki/databricks)
- [TensorRT](/wiki/tensorrt)
- [Inference optimization](/wiki/inference_optimization)
- [Online inference](/wiki/online_inference)
- [Offline inference](/wiki/offline_inference)

## References

[^1]: BentoML, "Past and Present (BentoML)", BentoML Blog, 2023. https://www.bentoml.com/blog/bentoml-past-and-present. Accessed 2026-05-21.

[^2]: Kyle Wiggers, "BentoML scores $9M funding to expedite AI app development", TechCrunch, 2023-06-26. https://techcrunch.com/2023/06/26/bentoml-scores-9m-funding-to-expedite-ai-app-development/. Accessed 2026-05-21.

[^3]: BentoML Contributors, "bentoml/BentoML: The easiest way to serve AI apps and models", GitHub, 2026-05-07. https://github.com/bentoml/BentoML. Accessed 2026-05-21.

[^4]: Modular, "BentoML Joins Modular", Modular Blog, 2026-02-10. https://www.modular.com/blog/bentoml-joins-modular. Accessed 2026-05-21.

[^5]: BentoML, "BentoML Is Joining Modular", BentoML Blog, 2026-02-10. https://www.bentoml.com/blog/bentoml-is-joining-modular. Accessed 2026-05-21.

[^6]: The SaaS News, "BentoML Raises $9 Million in Seed Round", The SaaS News, 2023-06-26. https://www.thesaasnews.com/news/bentoml-raises-9-million-in-seed-round. Accessed 2026-05-21.

[^7]: BentoML, "BentoML Documentation: Unified Inference Platform", BentoML Docs, 2026. https://docs.bentoml.com/en/latest/. Accessed 2026-05-21.

[^8]: BentoML, "Bring Your Own Cloud (BentoCloud)", BentoML Docs, 2026. https://docs.bentoml.com/en/latest/scale-with-bentocloud/administering/bring-your-own-cloud.html. Accessed 2026-05-21.

[^9]: BentoML Contributors, "bentoml/OpenLLM: Run any open-source LLMs as OpenAI-compatible API endpoints", GitHub, 2025-04-21. https://github.com/bentoml/OpenLLM. Accessed 2026-05-21.

[^10]: BentoML Contributors, "bentoml/Yatai: Model Deployment at Scale on Kubernetes", GitHub, 2023-10. https://github.com/bentoml/Yatai. Accessed 2026-05-21.

[^11]: Kyle Wiggers, "BentoML scores $9M funding to expedite AI app development", TechCrunch, 2023-06-26. https://techcrunch.com/2023/06/26/bentoml-scores-9m-funding-to-expedite-ai-app-development/. Accessed 2026-05-21.

[^12]: SalesTools, "BentoML raises $9M Series A at $50M valuation", SalesTools.io, 2024-07-09. https://salestools.io/en/report/bentoml-9m-series-a. Accessed 2026-05-21.

[^13]: Cerebral Valley, "BentoML's unique approach to AI inference", Cerebral Valley Beehiiv, 2024. https://cerebralvalley.beehiiv.com/p/bentomls-unique-approach-ai-inference. Accessed 2026-05-21.

[^14]: BentoML, "Past and Present (BentoML)", BentoML Blog, 2023. https://www.bentoml.com/blog/bentoml-past-and-present. Accessed 2026-05-21.

[^15]: BentoML, "Introducing BentoML 1.0", BentoML Blog, 2022-07-12. https://bentoml.com/blog/introducing-bentoml-10. Accessed 2026-05-21.

[^16]: BentoML, "Yatai 1.0: Model Deployment On Kubernetes Made Easy", BentoML Blog, 2022-11-07. https://bentoml.com/blog/yatai-10-model-deployment-on-kubernetes-made-easy. Accessed 2026-05-21.

[^17]: BentoML, "Announcing OpenLLM: An Open-Source Platform for Running Large Language Models in Production", BentoML Blog, 2023-06-20. https://www.bentoml.com/blog/announcing-open-llm-an-open-source-platform-for-running-large-language-models-in-production. Accessed 2026-05-21.

[^18]: The SaaS News, "BentoML Raises $9 Million in Seed Round", The SaaS News, 2023-06-26. https://www.thesaasnews.com/news/bentoml-raises-9-million-in-seed-round. Accessed 2026-05-21.

[^19]: BentoML, "BentoCloud: Fast and Customizable GenAI Inference in Your Cloud", BentoML Blog, 2024. https://www.bentoml.com/blog/introducing-bentocloud. Accessed 2026-05-21.

[^20]: BentoML, "Concurrency and autoscaling (BentoCloud)", BentoML Docs, 2026. https://docs.bentoml.com/en/latest/scale-with-bentocloud/scaling/autoscaling.html. Accessed 2026-05-21.

[^21]: BentoML, "Introducing BentoML 1.2", BentoML Blog, 2024-02-19. https://www.bentoml.com/blog/introducing-bentoml-1-2. Accessed 2026-05-21.

[^22]: BentoML, "BentoML 2023 Year in Review", BentoML Blog, 2024-01. https://www.bentoml.com/blog/bentoml-2023-year-in-review. Accessed 2026-05-21.

[^23]: BentoML, "Packaging for deployment", BentoML Docs, 2026. https://docs.bentoml.com/en/latest/get-started/packaging-for-deployment.html. Accessed 2026-05-21.

[^24]: BentoML, "Create online API Services", BentoML Docs, 2026. https://docs.bentoml.com/en/latest/build-with-bentoml/services.html. Accessed 2026-05-21.

[^25]: BentoML, "Runners (BentoML 1.1)", BentoML Docs, 2024. https://docs.bentoml.com/en/1.1/concepts/runner.html. Accessed 2026-05-21.

[^26]: BentoML, "Customer Stories", BentoML, 2026. https://www.bentoml.com/customers. Accessed 2026-05-21.

[^27]: Index.dev, "BentoML vs Ray Serve vs Triton: Model Serving for AI Teams 2026", Index.dev, 2026. https://www.index.dev/skill-vs-skill/ai-bentoml-vs-ray-serve-vs-triton. Accessed 2026-05-21.

[^28]: Northflank, "6 best BentoML alternatives for self-hosted AI model deployment (2026)", Northflank Blog, 2026. https://northflank.com/blog/bentoml-alternatives. Accessed 2026-05-21.

[^29]: BentoML Engineering, "Benchmarking LLM Inference Backends", BentoML Blog, 2024. https://bentoml.com/blog/benchmarking-llm-inference-backends. Accessed 2026-05-21.