I'm a Senior MLOps Engineer with 5+ years of experience in building end-to-end Machine Learning products. From my industry experience, I write long-form articles on MLOps to help you build real-world AI systems.
Join us in a fascinating exploration of BentoML, where every concept, from the simplest to the most complex, is presented with clarity and simplicity, making your learning journey enjoyable and insightful! 🎉
Without further ado, let’s get started: Model Serving and Deployment are terms often used interchangeably in the machine learning world, yet they encapsulate distinct phases in the transition from model development to production.
Model deployment is the process of transitioning a machine learning model into a production environment, ensuring an appropriate format for practical use and, if necessary, establishing additional infrastructure like servers and databases to support it.
Model serving, on the other hand, is the practice of making a machine learning model available for use through APIs, which allow users to submit data and receive predictions.
On the free tier, BentoML focuses solely on the model serving phase. While it does define how a model will interact with its production environment, the tool aims specifically at encapsulating the model into a Docker image that can easily be deployed in a production environment such as Google Cloud Run or Kubernetes 🐳
Model serving and deployment are vital in machine learning workflows, bridging the gap between experimental models and practical applications by enabling models to deliver real-world predictions and insights.
In this article, we will focus exclusively on BentoML’s model serving capabilities. If you want to explore the deployment options available for your model, feel free to read this article.
Join us in the next section, where we analyze the capabilities of BentoML, ensuring you gain a robust understanding of its functionalities and how it can be a game-changer in your ML journey! 🚀
BentoML Demystified
In this section, we’ll give a brief introduction to BentoML’s functionalities and features, and how it can be evaluated for various machine learning workflows.
Understanding BentoML
BentoML is a library that simplifies the process of deploying machine learning models. It encapsulates models, regardless of their originating framework, into a format that can be deployed, whether in cloud environments, on local machines, or edge devices, offering a versatile approach to model deployment.
By generating a Docker Image of the packaged model, BentoML facilitates a flexible array of deployment options.
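As a concrete preview (the Bento concept behind it is covered later in this article), the commands below sketch this flow. The iris_classifier tag is the example service used throughout this article, and the exact image tag printed on your machine will differ:

# Package the service described by bentofile.yaml into a Bento
bentoml build

# Turn the latest Bento into a Docker image
bentoml containerize iris_classifier:latest

# Run the image like any other container; by default it serves the model on port 3000
# (replace <generated-tag> with the tag printed by the previous command)
docker run --rm -p 3000:3000 iris_classifier:<generated-tag>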
Evaluating BentoML
Understanding the strengths and weaknesses of BentoML is crucial to determine if it fits your use case, as evaluating the pros and cons of any tool is essential for informed decision-making.
Advantages of BentoML
BentoML brings several advantages to the table when it comes to deploying machine learning models:
- Easy Serving: BentoML streamlines the serving process, enabling a smooth transition of ML models into production-ready APIs.
- Integration Capabilities: It offers robust integration, working seamlessly with various platforms and tools such as ZenML, Airflow, Spark, MLflow and more.
- Optimized Performance: Through the use of micro-batching, BentoML maximizes resource usage and allows for separate scaling specifically for model inference.
- Consistent Format: BentoML provides a consistent format for model serving and deployment, ensuring uniformity across different use cases.
- Platform Flexibility: Not limited to Kubernetes, BentoML supports deployment across a variety of platforms, offering notable flexibility.
Limitations of BentoML
As BentoML focuses specifically on the containerization of machine learning models, it’s essential to note a few drawbacks:
- Limited Experimentation: BentoML leans heavily towards deployment, leaving experimentation aspects to be managed by additional tools like MLflow.
- Scaling Concerns: Horizontal scaling is not handled by default in BentoML, which might require additional configurations or tools.
- Lack of Advanced Features: Certain advanced features, such as multi-model serving and A/B testing, are not supported.
- Basic Monitoring: While it does provide monitoring and logging, additional effort is required to establish a fully functional system.
Is BentoML the Right Choice for Your Team?
BentoML is a fitting choice for teams that prioritize quick and straightforward model deployment without the need for advanced deployment features. However, it may not suit teams that require a more complex deployment process, especially those seeking advanced features like multi-model serving and A/B testing.
Fundamental Principles of BentoML
Let’s explore the fundamental principles of BentoML, ensuring a thorough understanding of its key features and functionalities. This section will be as straightforward as possible to keep the various concepts clear.
BentoML Models
In BentoML, a model contains the algorithms and learned parameters from training, enabling predictions on new data.
Model Store
BentoML’s Model Store is a local repository for saving and managing models. Key operations include:
- Saving a model: save a trained model to the local Model Store.

import bentoml

# clf is a trained scikit-learn classifier
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
- Retrieving a model: load a model from the Model Store.

import bentoml
from sklearn.base import BaseEstimator

model: BaseEstimator = bentoml.sklearn.load_model("iris_clf:latest")
- Managing models: the following operations are available from the BentoML CLI:
bentoml models list
bentoml models get
bentoml models delete
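The same management operations are also exposed through BentoML’s Python API. Here is a minimal sketch, assuming a model tagged iris_clf already exists in the local Model Store:

import bentoml

# List every model in the local Model Store
for model in bentoml.models.list():
    print(model.tag)

# Retrieve a single model entry (gives access to its metadata and path on disk)
bento_model = bentoml.models.get("iris_clf:latest")
print(bento_model.path)

# Delete a model from the local Model Store
bentoml.models.delete("iris_clf:latest")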
Model Runners
Runners handle model inference, simplifying direct model interactions. After loading a saved model, you can establish a runner for local inference:
import bentoml

# Retrieve the saved model from the local Model Store
bento_model = bentoml.models.get("iris_clf:latest")

# Create a runner from the model
my_runner = bento_model.to_runner()

# Initialize the Runner in the current process (for development and testing only)
my_runner.init_local()

# Use the runner for inference (hypothetical example, with input_data being a NumPy array)
predictions = my_runner.predict.run(input_data)
Model Signature
In BentoML, the model signature specifies the model’s expected input and output formats. It ensures data consistency during inference and aids in error-free deployment by validating and transforming inputs.
bentoml.pytorch.save_model(
    "demo_mnist",     # Model name in the local Model Store
    trained_model,    # Model instance being saved
    signatures={      # Model signatures for Runner inference
        "classify": {
            "batchable": False,
        }
    },
)
Batching
In BentoML, batching allows multiple inputs to be handled simultaneously for faster inference. By setting the batchable parameter to True in a model’s signature, multiple calls can be merged into one batched call for efficiency:
bentoml.pytorch.save_model(
    "demo_mnist",     # Model name in the local Model Store
    trained_model,    # Model instance being saved
    signatures={      # Model signatures for Runner inference
        "__call__": {
            "batchable": True,
            "batch_dim": 0,
        },
    },
)
The batch_dim parameter determines the input’s batching dimension. If set to 0, inputs [1, 2] and [3, 4] become [[1, 2], [3, 4]]. If set to 1, they merge as [1, 2, 3, 4].
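To make this merging behaviour concrete, here is a small illustration of the two stacking strategies using plain NumPy. The actual batching is performed internally by the BentoML runner, so this is only a conceptual sketch:

import numpy as np

# Two requests arriving around the same time, each carrying one row of features
call_a = np.array([[1, 2]])
call_b = np.array([[3, 4]])

# batch_dim=0: requests are stacked along the first axis
print(np.concatenate([call_a, call_b], axis=0))  # [[1 2]
                                                 #  [3 4]]

# batch_dim=1: requests are joined along the second axis
print(np.concatenate([call_a, call_b], axis=1))  # [[1 2 3 4]]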
Having explored BentoML models, let’s now turn our attention to how Service and APIs play a crucial role in utilizing these models effectively.
Services and APIs
Moving forward, we will explore how to create a service, understand the interaction with runners, dive into service APIs, learn about IO descriptors, and differentiate between synchronous and asynchronous APIs.
In BentoML, the service is the primary structure where users specify the logic for the model to interact with its deployment environment.
Creating a Service
A service is essentially a combination of Runners and APIs:
- Runners: Specialized components that handle model inference.
- APIs: Define how external requests interact with the models.
For instance, in the example below, a service named iris_classifier is created using a runner (iris_clf_runner) for a scikit-learn model:

iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])
After initializing the service, use the svc.api decorator to define APIs, set input/output formats, and link a function like classify:
import numpy as np
from bentoml.io import NumpyNdarray

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = iris_clf_runner.predict.run(input_series)
    return result
The Interaction with Runners
In BentoML, a Runner encapsulates the serving logic of a model, optimizing throughput and resource use. It can be easily created from a saved model:
runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
Runners adapt to the ML framework’s characteristics, ensuring efficient model inference. For debugging or manual serving, you can initialize and use runners as follows:
from service import svc

for runner in svc.runners:
    runner.init_local()

result = svc.apis["my_endpoint"].func(inp)
Service APIs
Inference APIs define how the service is called remotely. A service can host multiple APIs, each with its own input/output specification and function definition:
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_array: np.ndarray) -> np.ndarray:
    result = runner.run(input_array)
    return result
Using the @svc.api decorator, the function becomes an API endpoint. For instance, the above becomes an HTTP /predict endpoint. The request can be performed with:
curl -X POST -H "content-type: application/json" \
--data "[[5.9, 3, 5.1, 1.8]]" \
http://127.0.0.1:3000/predict
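The same request can also be sent from Python. Here is a minimal sketch using the requests library, assuming the service is running locally on port 3000:

import requests

response = requests.post(
    "http://127.0.0.1:3000/predict",
    headers={"content-type": "application/json"},
    data="[[5.9, 3, 5.1, 1.8]]",  # same payload as the curl example
)
print(response.json())  # predicted class for the input sample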
IO Descriptors
IO descriptors define the data type for an API’s input and output. They ensure data consistency and conversion between native types. For instance, the classify API uses bentoml.io.NumpyNdarray for both input and output:
import numpy as np
from bentoml.io import NumpyNdarray

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array: np.ndarray) -> np.ndarray:
    ...
BentoML offers various IO descriptors like PandasDataFrame, JSON, Image, Text and File, making it easy to use predefined types for common inputs.
IO descriptors help specify and validate expected data types and shapes. For instance, with the NumpyNdarray descriptor, you can define the data type and shape using the dtype and shape arguments. Enforcing strict validation is possible with enforce_shape and enforce_dtype:
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

svc = bentoml.Service("iris_classifier")

# Define IO using samples
output_descriptor = NumpyNdarray.from_sample(np.array([[1.0, 2.0, 3.0, 4.0]]))

@svc.api(
    input=NumpyNdarray(shape=(-1, 4), dtype=np.float32, enforce_dtype=True, enforce_shape=True),
    output=output_descriptor,
)
def classify(input_array: np.ndarray) -> np.ndarray:
    ...
Synchronous vs Asynchronous APIs
BentoML supports both synchronous and asynchronous APIs. While synchronous APIs are straightforward and suitable for many use cases, asynchronous APIs offer better performance, especially for IO-bound tasks or when invoking multiple runners:
# Synchronous API example
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_array: np.ndarray) -> np.ndarray:
    return runner.run(input_array)

# Asynchronous API example
import aiohttp
import asyncio

runner1 = bentoml.sklearn.get("iris_clf:version1").to_runner()
runner2 = bentoml.sklearn.get("iris_clf:version2").to_runner()

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(input_array: np.ndarray) -> np.ndarray:
    async with aiohttp.ClientSession() as session:
        features = await session.get('https://features/get', params=input_array[0])
    results = await asyncio.gather(
        runner1.predict.async_run(input_array, features),
        runner2.predict.async_run(input_array, features),
    )
    return combine_results(results)
Now that we have a grasp of managing services and APIs, let’s turn our attention to Bentos, exploring how to build, manage, test, and integrate them in various scenarios.
Exploring Bentos
A Bento is an archive containing everything needed to run a bentoml.Service, including source code, models, data, and configurations. While bentoml.Service defines the inference API, the Bento ensures it can be consistently run in production.
Building a Bento
The bentoml build CLI command creates a Bento using a bentofile.yaml build file:
service: "service:svc" # Same as the argument passed to `bentoml serve`
labels:
owner: bentoml-team
stage: dev
include:
- "*.py" # A pattern for matching which files to include in the bento
python:
packages: # Additional pip packages required by the service
- scikit-learn
- pandas
This file specifies the service, labels, included files, and required Python packages. Each Bento gets a unique version tag, but you can set a custom version with the --version argument if needed:
bentoml build --version 1.0.1
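Bentos can also be built programmatically, which is handy in CI pipelines. Below is a minimal sketch using bentoml.bentos.build with the same fields as the bentofile.yaml above; the argument names mirror the YAML keys, so double-check them against your BentoML version:

import bentoml

bento = bentoml.bentos.build(
    service="service:svc",  # same entry point as in bentofile.yaml
    labels={"owner": "bentoml-team", "stage": "dev"},
    include=["*.py"],
    python={"packages": ["scikit-learn", "pandas"]},
)
print(bento.tag)  # the freshly generated Bento tag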
Managing Bentos
A Bento can be managed locally using bentoml CLI commands, in the same fashion as managing models:
bentoml list
bentoml get
bentoml delete
Bentos can also be managed with the Python API:
import bentoml

bento = bentoml.get("iris_classifier:latest")
Testing Bentos
Before deploying, it is crucial to test Bentos locally to ensure correct behavior.
There are several ways to test a Bento:
- BentoML CLI: Serve a Bento using the command line (replace BENTO_TAG with your tag, e.g., iris_classifier:latest):
bentoml serve BENTO_TAG
- bentoml.Server API: For a programmatic approach, use the Python API. Especially useful for debugging:
from bentoml import HTTPServer
import numpy as np

server = HTTPServer("iris_classifier:latest", production=True, port=3000, host='0.0.0.0')
client = server.get_client()

with server.start() as client:
    result = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))
    print(result)
Pushing & Pulling Bentos
Yatai, an additional tool built by the same company, offers a Bento repository with APIs and a Web UI, storing Bentos on cloud storage like AWS S3 or GCS. It can auto-build Docker images for new Bentos:
bentoml push iris_classifier:latest
bentoml pull iris_classifier:nvjtj7wwfgsafuqj
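If you prefer scripting these operations, pushing and pulling can also be done from Python, assuming a Yatai endpoint has already been configured (for example via bentoml yatai login):

import bentoml

# Upload a local Bento to the configured Yatai instance
bentoml.push("iris_classifier:latest")

# Download a Bento from Yatai into the local Bento store
bentoml.pull("iris_classifier:nvjtj7wwfgsafuqj")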
Directory Structure
To view a Bento’s generated files, use:
» cd $(bentoml get iris_classifier:latest -o path)
» tree
.
├── README.md
├── apis
│ └── openapi.yaml
├── bento.yaml
├── env
│ ├── docker
│ │ ├── Dockerfile
│ │ └── entrypoint.sh
│ └── python
│ ├── requirements.lock.txt
│ ├── requirements.txt
│ └── version.txt
├── models
│ └── iris_clf
│ ├── latest
│ └── nb5vrfgwfgtjruqj
│ ├── model.yaml
│ └── saved_model.pkl
└── src
├── locustfile.py
├── service.py
└── train.py
Where:

- src: Files from bentofile.yaml’s include field, relative to the code’s current working directory. It allows relative module imports and file paths in user code.
- models: Contains models needed by the Service, determined from the Service’s runners.
- apis: API specs generated from the Service’s API definitions.
- env: Environment files from the Bento Build Options in bentofile.yaml.
Now that we’ve explored BentoML’s features in detail, let’s wrap up with some closing thoughts.
Conclusion
In our exploration of BentoML, we’ve dissected its core functionalities, highlighting its capacity to streamline the transition of models from their developmental stage right through to their practical application. 🚀
Its ability to encapsulate models into Docker images not only simplifies deployment across various platforms but also ensures that models are readily accessible and usable in diverse production environments.
This tool offers a robust framework for model serving. Unfortunately, it leaves deployment concerns entirely to the user 🤔 While it provides foundational monitoring and logging, users must craft a more comprehensive monitoring setup to fully harness its capabilities in varied contexts.
I warmly encourage you to try BentoML! It enabled me to quickly package models, deploy them promptly, and create value.
I’m sincerely grateful for your time in exploring BentoML together, and I hope it gave you the insights you were searching for. 🔍