NVIDIA Triton Inference Server

From AI Wiki
Revision as of 14:20, 29 March 2023 by Daikon Radish (talk | contribs)

Introduction

NVIDIA Triton Inference Server is open-source software that standardizes model deployment and execution for production inference. As part of the NVIDIA AI platform, Triton enables teams to deploy, run, and scale trained AI models from any supported framework on GPU- or CPU-based infrastructure, offering high-performance inference across cloud, on-premises, edge, and embedded devices.

Features

Support for Multiple Frameworks

Triton supports major training and inference frameworks, including TensorFlow, NVIDIA TensorRT, PyTorch, Python, ONNX, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more. This flexibility allows AI researchers and data scientists to choose the right framework for their projects without affecting production deployment.

High-Performance Inference

Triton runs inference on NVIDIA GPUs, x86 and Arm CPUs, and AWS Inferentia. To maximize throughput and utilization, it provides dynamic batching, concurrent model execution, optimal model configuration, model ensembles, and streaming audio/video inputs.
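Dynamic batching and concurrent execution are enabled per model in its config.pbtxt file. The following minimal Python sketch renders such a file as text. The helper name render_model_config, the model name, and all numeric values are illustrative assumptions; input and output tensor definitions are omitted for brevity.

```python
def render_model_config(name, backend, max_batch_size, preferred_batch_sizes,
                        max_queue_delay_us, instance_count,
                        instance_kind="KIND_GPU"):
    """Return config.pbtxt text that enables dynamic batching and runs
    several instances of the model concurrently."""
    preferred = ", ".join(str(size) for size in preferred_batch_sizes)
    return "\n".join([
        f'name: "{name}"',
        f'backend: "{backend}"',
        f"max_batch_size: {max_batch_size}",
        # Let the server combine individual requests into larger batches.
        "dynamic_batching {",
        f"  preferred_batch_size: [ {preferred} ]",
        f"  max_queue_delay_microseconds: {max_queue_delay_us}",
        "}",
        # Run several copies of the model at the same time.
        "instance_group [",
        "  {",
        f"    count: {instance_count}",
        f"    kind: {instance_kind}",
        "  }",
        "]",
    ])


# Hypothetical model and values, chosen only for illustration.
config_text = render_model_config(
    name="example_model",
    backend="onnxruntime",
    max_batch_size=8,
    preferred_batch_sizes=[4, 8],
    max_queue_delay_us=100,
    instance_count=2,
)
print(config_text)
```

Suitable values depend on the model and hardware, so these settings are usually tuned by measurement rather than fixed up front.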

Designed for DevOps and MLOps

Triton integrates with Kubernetes for orchestration and scaling, exports Prometheus metrics for monitoring, supports live model updates, and is compatible with major public cloud AI and Kubernetes platforms. It is also integrated into many MLOps software solutions.
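Prometheus metrics are exported as plain text, one sample per line. As a rough sketch of how a monitoring script might read them, the Python below parses a small hand-written sample in the Prometheus text exposition format. The sample values are made up, the model name is hypothetical, and the metric names are assumed to resemble those a Triton server exposes; the parser handles only the simple label syntax shown here.

```python
import re

# Hand-written sample in the Prometheus text format; values are made up.
SAMPLE_METRICS = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="example_model",version="1"} 42
nv_inference_request_failure{model="example_model",version="1"} 3
"""

LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append(
            (match.group("name"), labels, float(match.group("value")))
        )
    return samples


samples = parse_metrics(SAMPLE_METRICS)
for name, labels, value in samples:
    print(name, labels, value)
```

In a real deployment the text would come from the server's metrics endpoint, and a Prometheus server would normally do the scraping and parsing itself.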

Supports Model Ensembles and Pipelines

Modern inference often requires multiple models with preprocessing and post-processing for a single query. Triton supports model ensembles and pipelines, allowing for the execution of ensemble parts on CPU or GPU and using multiple frameworks within the ensemble.
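The data flow of an ensemble can be pictured as a series of steps that read and write named tensors. The Python sketch below is a conceptual model of that wiring, not Triton's implementation: each step maps its own input and output names onto shared ensemble tensor names. The three toy "models" and all tensor names are hypothetical.

```python
def run_ensemble(steps, request_tensors):
    """Run steps in order, passing tensors between them by name."""
    tensors = dict(request_tensors)
    for step in steps:
        # Map ensemble tensor names onto the step's own input names.
        step_inputs = {
            inner: tensors[outer]
            for inner, outer in step["input_map"].items()
        }
        step_outputs = step["model"](step_inputs)
        # Publish the step's outputs under ensemble tensor names.
        for inner, outer in step["output_map"].items():
            tensors[outer] = step_outputs[inner]
    return tensors


# Toy stand-ins for a preprocessing model, a main model, and postprocessing.
def preprocess(inputs):
    return {"TOKENS": inputs["TEXT"].lower().split()}


def classify(inputs):
    return {"SCORE": float(len(inputs["TOKENS"]))}


def postprocess(inputs):
    return {"LABEL": "long" if inputs["SCORE"] > 3 else "short"}


steps = [
    {"model": preprocess,
     "input_map": {"TEXT": "RAW_TEXT"},
     "output_map": {"TOKENS": "tokens"}},
    {"model": classify,
     "input_map": {"TOKENS": "tokens"},
     "output_map": {"SCORE": "score"}},
    {"model": postprocess,
     "input_map": {"SCORE": "score"},
     "output_map": {"LABEL": "FINAL_LABEL"}},
]

result = run_ensemble(steps, {"RAW_TEXT": "Triton serves many models"})
print(result["FINAL_LABEL"])
```

In Triton itself the equivalent wiring is declared in the ensemble's model configuration, and each step can be served by a different backend on CPU or GPU.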

Fast and Scalable AI in Every Application

Triton enables high-throughput inference by executing multiple models concurrently on a single GPU or CPU. It optimizes serving for real-time inferencing under strict latency constraints with dynamic batching and supports batch inferencing to maximize GPU and CPU utilization. Triton also includes built-in support for audio and video streaming input and supports live updates of models in production without restarting the server or application.
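To make the latency trade-off of dynamic batching concrete, the following Python sketch groups request arrival times into batches under two limits: a maximum batch size and a maximum time the oldest queued request may wait. It is a simplified illustration that ignores execution time and instance availability, and it should not be read as Triton's actual scheduling algorithm. The function name and the arrival times are assumptions.

```python
def form_batches(arrivals_us, max_batch_size, max_queue_delay_us):
    """Group sorted arrival times (microseconds) into batches.

    A batch is closed when it is full, or when the next request arrives
    after the oldest queued request has waited longer than the allowed delay.
    """
    batches = []
    current = []
    for arrival in arrivals_us:
        if current and arrival - current[0] > max_queue_delay_us:
            batches.append(current)  # delay limit reached: dispatch as is
            current = []
        current.append(arrival)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full: dispatch immediately
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times: a slow trickle, a burst, then stragglers.
arrivals = [0, 20, 40, 300, 310, 320, 330, 340, 900]
batches = form_batches(arrivals, max_batch_size=4, max_queue_delay_us=100)
for batch in batches:
    print(batch)
```

A longer allowed delay tends to produce fuller batches and higher throughput, at the cost of extra waiting time for the earliest request in each batch.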

Scalability and Ease of Integration

Triton is available as a Docker container and integrates with Kubernetes for orchestration, metrics, and autoscaling. It exposes standard HTTP and gRPC interfaces for connecting with other applications, such as load balancers, and can scale out across additional servers to handle increasing inference loads.
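As an illustration of the HTTP interface, the Python sketch below builds an inference request in the JSON form used by the KServe v2 inference protocol, which Triton's HTTP endpoint follows. It uses only the standard library and does not send anything. The server address, model name, tensor name, and data are assumptions made for the example.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, input_name, shape, datatype, data):
    """Build (but do not send) a POST request for one input tensor."""
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": datatype,
                "data": data,
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address, model, and tensor.
request = build_infer_request(
    base_url="http://localhost:8000",
    model_name="example_model",
    input_name="INPUT0",
    shape=[1, 4],
    datatype="FP32",
    data=[0.1, 0.2, 0.3, 0.4],
)
print(request.get_method(), request.full_url)
```

Against a running server, the request could be sent with urllib.request.urlopen(request), and the JSON response would carry the output tensors. NVIDIA also publishes client libraries that wrap both the HTTP and gRPC interfaces.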

Native Support in Python

PyTriton provides a simple interface for Python developers to use Triton Inference Server for serving models, processing functions, or entire inference pipelines. This native support in Python enables rapid prototyping and testing of machine learning models with high hardware utilization.
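A minimal sketch of serving a Python function with PyTriton might look like the following. It assumes the pytriton and numpy packages are installed and follows the pattern of PyTriton's published quick-start. The model name "Doubler", the tensor names, and the function double_values are hypothetical. The PyTriton imports are placed inside serve() so that the inference function can be read and tested on its own.

```python
def double_values(data):
    """Hypothetical inference function: doubles every value in the batch."""
    return {"result": data * 2}


def serve():
    """Bind double_values to a Triton server and block while serving."""
    # Imported here so the module loads even where PyTriton is not installed.
    import numpy as np
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from pytriton.triton import Triton

    with Triton() as triton:
        triton.bind(
            model_name="Doubler",
            # The batch decorator hands the function whole batches as arrays.
            infer_func=batch(double_values),
            inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=16),
        )
        triton.serve()  # blocks until the process is interrupted
```

Calling serve() starts the server, after which clients can reach the model over the same HTTP and gRPC interfaces as any other Triton model.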

Model Orchestration with Management Service

Triton offers new model orchestration functionality for efficient multi-model inference. This production service loads models on demand, unloads models when not in use, and allocates GPU resources efficiently by placing as many models as possible on a single GPU server. The model orchestration feature is in private early access.
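The idea of loading models on demand and unloading idle ones can be sketched with a small least-recently-used pool. This is purely a conceptual illustration in Python: the eviction policy, the class name OnDemandModelPool, and the fixed capacity are assumptions, and nothing here describes how the management service actually decides what to load or unload.

```python
from collections import OrderedDict


class OnDemandModelPool:
    """Keep at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> placeholder handle
        self.events = []             # log of ("load" | "unload", name)

    def infer(self, model_name):
        if model_name in self.loaded:
            # Already resident: mark it as most recently used.
            self.loaded.move_to_end(model_name)
        else:
            if len(self.loaded) >= self.capacity:
                evicted, _ = self.loaded.popitem(last=False)
                self.events.append(("unload", evicted))
            self.loaded[model_name] = True
            self.events.append(("load", model_name))
        return f"{model_name}: ok"


# Hypothetical request sequence against a pool that fits two models.
pool = OnDemandModelPool(capacity=2)
for name in ["a", "b", "a", "c"]:
    pool.infer(name)
print(pool.events)
```

In this run, model "b" is unloaded to make room for "c" because "a" was used more recently.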

Large Language Model Inference

Triton supports inference for large language models, such as GPT-3 and Megatron, which may not fit on a single GPU. It can partition the model into smaller files and execute each on a separate GPU within or across servers.
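One common way to split a model that does not fit on a single GPU is to assign contiguous groups of layers to different devices. The Python sketch below shows only that bookkeeping step. It is an assumption about one possible partitioning scheme, since the text above does not specify how the model is divided, and the layer and GPU counts are arbitrary.

```python
def partition_layers(num_layers, num_gpus):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal
    stages, one stage per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    stages = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs each take one additional layer.
        size = base + (1 if gpu < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Hypothetical model with 10 layers spread over 4 GPUs.
stages = partition_layers(num_layers=10, num_gpus=4)
for gpu, stage in enumerate(stages):
    print(f"GPU {gpu}: layers {list(stage)}")
```

Other schemes split individual layers across devices instead of whole layers; which one applies depends on the backend used to serve the model.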

Optimal Model Configuration with Model Analyzer

Triton's Model Analyzer is a tool that automatically evaluates model deployment configurations, such as batch size, precision, and concurrent execution instances on the target processor. It helps select the optimal configuration to meet application quality-of-service (QoS) constraints and reduces the time needed to find the optimal configuration.
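The kind of search described above can be sketched as a sweep over candidate configurations that keeps the highest-throughput option within a latency budget. The Python below is a toy illustration, not Model Analyzer's algorithm: toy_measure is a made-up performance model standing in for real measurements, and the candidate values and budget are arbitrary assumptions.

```python
from itertools import product


def toy_measure(batch_size, instance_count):
    """Made-up performance model standing in for real measurements.
    Returns (throughput in inferences per second, latency in ms)."""
    latency_ms = 2.0 + 0.5 * batch_size + 1.5 * (instance_count - 1)
    throughput = instance_count * batch_size * 1000.0 / latency_ms
    return throughput, latency_ms


def best_config(batch_sizes, instance_counts, latency_budget_ms, measure):
    """Return the highest-throughput configuration within the latency budget,
    or None if no candidate meets it."""
    best = None
    for batch_size, instance_count in product(batch_sizes, instance_counts):
        throughput, latency_ms = measure(batch_size, instance_count)
        if latency_ms > latency_budget_ms:
            continue  # violates the quality-of-service constraint
        if best is None or throughput > best["throughput"]:
            best = {
                "batch_size": batch_size,
                "instance_count": instance_count,
                "throughput": throughput,
                "latency_ms": latency_ms,
            }
    return best


choice = best_config(
    batch_sizes=[1, 2, 4, 8, 16],
    instance_counts=[1, 2, 4],
    latency_budget_ms=8.0,
    measure=toy_measure,
)
print(choice)
```

With real measurements in place of the toy model, the same loop shape applies; an automated tool mainly saves the time of running and comparing each candidate by hand.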

Tree-Based Model Inference with Forest Inference Library (FIL) Backend

The Forest Inference Library (FIL) backend in Triton provides support for high-performance inference of tree-based models with explainability (SHAP values) on CPUs and GPUs. It supports models from XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS cuML RandomForest, and others in Treelite format.

Ecosystem Integrations

Triton is supported by a variety of cloud platforms and services, including Alibaba Cloud, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), Amazon SageMaker, Google Kubernetes Engine (GKE), Google Vertex AI, HPE Ezmeral, Microsoft Azure Kubernetes Service (AKS), Azure Machine Learning, and Oracle Cloud Infrastructure Data Science Platform.

Success Stories

Companies such as Amazon, American Express, and Siemens Energy have adopted NVIDIA Triton for use cases that include improving customer satisfaction, detecting fraud, and carrying out physical inspections through AI-based remote monitoring.

Resources and Support

To help organizations scale AI in production, NVIDIA provides global enterprise support for Triton through its NVIDIA AI Enterprise offering. This includes guaranteed response times, priority security notifications, regular updates, and access to NVIDIA AI experts. Additional resources, such as whitepapers, on-demand sessions, and blog posts, are available to learn more about Triton and its capabilities.

NVIDIA LaunchPad

NVIDIA LaunchPad provides access to hosted infrastructure and allows users to experience Triton Inference Server through free curated labs.

Community and Updates

The Triton community offers a platform to stay current on the latest feature updates, bug fixes, and more. By joining the community, users can connect with other professionals, share experiences, and learn from best practices.

Extensive Use Cases and Applications

NVIDIA Triton is widely used across various industries and applications, such as healthcare, finance, retail, manufacturing, and logistics. It accelerates workloads for speech recognition, recommender systems, medical imaging, natural language processing, and more.

Developer Documentation and Tutorials

NVIDIA provides comprehensive documentation and tutorials to help developers get started with Triton Inference Server. These resources cover topics such as installation, configuration, model deployment, performance optimization, and integration with popular frameworks and services.

Contributions and Open-Source Development

Because Triton is an open-source project, NVIDIA encourages contributions from developers and researchers to enhance its capabilities, performance, and stability. By actively participating in the project's development, users can help shape the future of AI inference and model deployment.

Future Developments

NVIDIA continues to invest in Triton's development, incorporating new features and improvements based on user feedback and industry needs. Upcoming advancements may include additional framework support, improved orchestration capabilities, enhanced performance optimization, and more.