I'm a Senior MLOps Engineer with 5+ years of experience in building end-to-end Machine Learning products. From my industry experience, I write long-form articles on MLOps to help you build real-world AI systems.
Deploying machine learning models is hard. Scaling them is even harder! One ML model instance is usually not enough to handle production workloads. You need a way to adjust the number of machines running your models based on the traffic.
Model Serving Platforms are the solution. They provide utilities to scale your models while optimizing costs. Open-source options (like KServe) run on Kubernetes, whereas fully managed alternatives such as SageMaker or Vertex AI handle the infrastructure for you.
Read this guide to learn:
- What a Model Serving Platform is.
- The key features to look for when choosing one.
- The pros and cons of open-source and fully managed options.
- A detailed comparison of the best platforms, including KServe, SageMaker, Vertex AI, BentoCloud, and Seldon Core.
- A feature comparison table for these platforms.
By the end of the article, you will know how to choose the best platform for your team and project. Let’s get started!
What is a Model Serving Platform?
Model Serving Platforms are designed to manage the infrastructure needed to scale and deploy machine learning models based on your application’s traffic and response time requirements. You can use open-source serving platforms, such as KServe and Seldon, or proprietary ones, like Vertex AI or Amazon SageMaker.
When there are no requests, a serving platform can shut down all the model servers (scale-to-zero). Then, when requests come in, the platform will determine how many instances are needed to meet your service level objective (SLO), spawn them, and distribute the requests to the instances.
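To make the sizing logic concrete, here is a back-of-the-envelope sketch (illustrative only, not any platform’s actual algorithm): the number of in-flight requests follows from Little’s law, and dividing it by per-replica concurrency gives a replica count.

```python
import math

def replicas_needed(requests_per_second: float,
                    seconds_per_request: float,
                    concurrency_per_replica: int) -> int:
    """Estimate replica count: in-flight requests (Little's law)
    divided by how many requests one replica handles concurrently."""
    in_flight = requests_per_second * seconds_per_request
    return max(1, math.ceil(in_flight / concurrency_per_replica))

# Example: 200 req/s at 100 ms each, 4 concurrent requests per replica
print(replicas_needed(200, 0.1, 4))  # -> 5
```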
Serving platforms are complex systems that will thrive in some scenarios and are unnecessary in others. Read our in-depth model serving platform overview to fully understand when and how to use them to scale your models.
How to Choose a Model Serving Platform?
Choosing a serving platform can be overwhelming: there are many options available, each with its own feature set, and many criteria to weigh.
Main Criteria for Comparison
To find the right platform, you need to understand all the key features these serving tools offer. Let’s review them one by one.
ML Framework Support
When choosing a serving platform, make sure it supports your preferred machine learning frameworks, such as PyTorch, Keras, and TensorFlow.
Also check if it supports any high-performance runtimes you plan to use, such as TensorRT or ONNX.
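For example, if you train in PyTorch but want to serve on a high-performance runtime, you typically export the model first. A minimal sketch (the model and input shape are placeholders to adapt):

```python
import torch
import torchvision

# Placeholder model; substitute your own trained network.
model = torchvision.models.resnet18(weights=None)
model.eval()

# Export to ONNX so runtimes like ONNX Runtime or TensorRT can load it.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=17)
```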
Deployment Strategies
Most early ML teams use a simple deployment process: they directly replace the model in production with a new one.
However, after the initial stages, you might want to use more advanced deployment strategies such as multi-armed bandit, canary, or A/B testing. Again, check that the serving platforms you consider support the strategies you want to use.
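To illustrate what a canary rollout does under the hood, here is a toy sketch (in practice, the serving platform handles this routing at the infrastructure level; `stable_model` and `canary_model` are hypothetical stand-ins):

```python
import random

def route_request(request, stable_model, canary_model, canary_share=0.1):
    """Send a small share of traffic to the new model; promote the canary
    once its metrics look healthy, or roll it back otherwise."""
    model = canary_model if random.random() < canary_share else stable_model
    return model.predict(request)
```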
Integration
If you already have an MLOps stack, consider the effort needed to integrate it with any serving platforms you are evaluating.
For example, if you are already maintaining a Kubernetes cluster, choosing a serving platform that runs on K8s will be easier than using a fully managed platform like SageMaker.
Learning Curve
Some platforms, like Seldon Core, are complex and demand more engineering effort than fully managed platforms like SageMaker.
Make sure your team has the right skills for the platform you choose, especially if it has a steep learning curve.
Maintenance
Open-source platforms that require you to build and maintain the underlying infrastructure will demand far more maintenance than proprietary platforms like Vertex AI.
When choosing between different platforms, remember to include maintenance costs in your overall budget.
Model Monitoring
Model monitoring allows you to track your deployed models’ performance. Common model monitoring features include:
- Data quality: Checks the quality of the input data that your model receives.
- Model drift: Detects when model performance degrades over time as the distribution of the data changes compared to your training data.
- Bias detection: Finds biases that cause poor model performance on some data clusters.
- Feature attribution: Determines which input features most impact your model’s decisions.
Choose a platform according to the monitoring features that matter most to you. Some platforms, like KServe, do not support out-of-the-box model monitoring, while others, like Vertex AI, do.
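To give a feel for the kind of check a drift detector runs, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy (a common approach, not any specific platform’s implementation):

```python
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.05):
    """Flag a feature whose live distribution differs significantly
    from the training distribution (two-sample KS test)."""
    _statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha
```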
Scaling Features
There are two major features to consider when scaling machine learning models:
- Auto-scaling: Automatically adjusts resources (number of model servers) based on demand to optimize performance and cost.
- Scale-to-zero: Shuts down all machines when there are no incoming requests.
Every platform on our list supports auto-scaling. However, depending on the platform, tuning the scaling according to your needs can vary in complexity.
Most platforms do not support scaling to zero for real-time endpoints. Consider this when selecting the best serving platform, as it can save you a lot of money.
Vendor Lock-In
Evaluate how much a serving platform ties you to its ecosystem. Vendor-locked platforms will limit your ability to migrate to other production environments. Weigh the pros and cons of such dependencies before making a decision.
Cost
Open-source platforms are free, but you still need to pay for the infrastructure to run them. Proprietary platforms usually charge based on usage or have subscription fees.
Evaluate the total costs, including infrastructure, development, and maintenance, not just the direct platform fees.
Open Source vs Fully Managed
Deciding between an open-source or a fully managed platform depends on factors such as the project’s requirements, the team’s size and skills, and your budget.
The open-source serving platforms covered here all run on Kubernetes. Let’s review the implications of using this orchestration system.
Kubernetes Pros and Cons
Powerful Scaling Capabilities
Kubernetes (K8s) excels at orchestrating containerized applications, making it a robust solution for scaling machine learning models. Its capabilities for designing and managing microservice applications are unmatched. Plus, its active community and well-documented resources make it a go-to choice for many organizations.
Learning Curve
Kubernetes is powerful but has a steep learning curve, which can be a barrier for many teams. Mastering its complexities requires a significant investment in engineering and cloud resources. This can slow down your development process, especially for teams new to container orchestration.
Significant Maintenance
Kubernetes requires constant effort to build and maintain the underlying infrastructure. Keeping it robust and up to date demands ongoing time and expertise, which can challenge teams looking to minimize operational overhead.
After reviewing this tool’s strengths and weaknesses, deciding on your deployment stack really comes down to the following question: K8s or no K8s?
Open-Source Platforms
The main benefits of open-source tools such as KServe or Seldon are:
- Customization: High customization capabilities for advanced deployment strategies.
- Lower cloud costs: Infrastructure costs will be dramatically reduced compared to proprietary platforms. However, during development, the salary cost of the employees will, in most cases, largely outweigh the cost of a fully managed tool.
- Integration: Easier to integrate with various MLOps tools.
I would choose an open-source Kubernetes platform over a managed one when:
- The company requires a high degree of flexibility and customization regarding MLOps features.
- You know in advance that data volume and model inference traffic will be high, so spending extra engineering budget now will save long-term cloud costs.
- Your team already has an established MLOps stack, where an open-source tool’s stronger integration capabilities make it a wiser choice.
- Your team has the expertise to manage a Kubernetes cluster.
Fully-Managed Platforms
Fully-managed tools, such as Vertex AI or Amazon SageMaker, are alternatives to avoid the complexity of Kubernetes. These tools are designed for ease and efficiency, making them a favorable choice for certain project environments.
The main advantages are:
- Simplicity: Ideal for teams searching for straightforward and fast-paced development.
- Fast adoption: Their documentation is great, and the learning curve is much more approachable than that of K8s.
- No infrastructure management: You don’t have to set up and manage the infrastructure yourself, saving a lot of time and effort.
You might wonder why it is worth considering open-source tools when fully managed ones have such benefits. The short answer is cloud costs and customization.
Best Model Serving Platforms
I originally wrote this analysis for Neptune AI in their best model serving tools article. In this guide, I focus specifically on serving platforms.
It’s important to note that most of these platforms offer features for the whole lifecycle of ML, from data preparation to model monitoring. However, in this context, we will focus primarily on the deployment and serving features.
Best Open-Source Platforms
KServe
KServe is an open-source tool providing a Kubernetes-native platform for deploying and scaling machine learning models.
This serving platform was originally named KFServing and was part of the Kubeflow project. The community later split KServe from Kubeflow to make it a standalone tool.
Advantages of KServe
- Out-of-the-box auto-scaling: KServe automatically scales your model instances to handle incoming request variations. It can also scale to zero when there is no traffic to reduce costs (see the sketch at the end of this section).
- Real-time inference: The KServe architecture supports fast online prediction.
- Batch inference: KServe offers an inference batcher for efficient batch prediction.
- Supports complex inference graphs: You can build complex inference graphs with many models using this platform.
- ML frameworks integration: KServe supports an impressive number of ML frameworks and high-performance runtimes.
- Integrates with many tools: KServe can be used with tools like ZenML, Kafka, Nvidia Triton, and Grafana.
- Advanced deployment strategies: KServe supports multi-armed bandits, A/B testing, and canary deployments to help you deploy models safely.
- Active community: The KServe community is active and ready to help if you need support.
Disadvantages of KServe
- Requires Kubernetes: To use KServe, you need to deploy and manage your own Kubernetes cluster. This can be hard if you do not have a team with Kubernetes skills.
- No built-in monitoring: KServe does not have a monitoring tool included. KServe containers create logs for Prometheus, but you have to set up and manage a Prometheus server yourself. However, adding Prometheus should not be too much work if you already use Kubernetes.
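To give a feel for KServe’s deployment API, here is a minimal InferenceService sketch applied with the official Kubernetes Python client (the model bucket path is a placeholder, and the exact spec may vary across KServe versions):

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-demo", "namespace": "default"},
    "spec": {
        "predictor": {
            "minReplicas": 0,  # scale-to-zero when there is no traffic
            "maxReplicas": 5,  # upper bound for auto-scaling
            "sklearn": {"storageUri": "gs://your-bucket/sklearn-model"},
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)
```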
Summary for KServe
KServe is a strong choice for teams with solid Kubernetes skills that want advanced deployment capabilities and the flexibility to customize their MLOps stack.
Seldon Core
Seldon Core is a serving platform that deploys and scales ML models on Kubernetes. It is known for supporting advanced deployment strategies.
Before January 22, 2024, Seldon Core was free and open-source. But now it uses the Business Source License (BSL) 1.1. Companies have to pay $18,000 per year to use Seldon Core versions released after January 22, 2024, in commercial products.
Advantages of Seldon Core
- Real-time inference: Seldon Core has a strong online prediction system. It integrates with Kafka natively.
- Batch inference: This platform provides an elegant architecture for batch predictions.
- Advanced deployment: Seldon Core supports multi-armed bandits, canary deployments, and A/B testing.
Disadvantages of Seldon Core
- High cost: The Seldon Core subscription starts at $18,000 per year, and this does not include support from Seldon.
- Needs Kubernetes skills: To use Seldon Core, you have to deploy and manage a Kubernetes cluster. This can be difficult if you do not have a DevOps team to work with.
- Limited auto-scaling: Auto-scaling with Seldon Core is not straightforward. You have to set it up using KEDA. Seldon Core also does not support scaling to zero instances.
Summary for Seldon Core
Before moving to a proprietary model, Seldon Core was a strong alternative to KServe, but given its new pricing and KServe’s advantages, I believe there are now better open-source options in most cases.
Best Fully Managed Platforms
Fully managed platforms are ideal for teams that:
- Lack MLOps/DevOps skills: They enable smaller teams without much MLOps expertise to deliver quickly.
- Want to outsource maintenance: Fits teams that prefer delegating infrastructure maintenance.
- Have sufficient budget: Works well for teams that can handle higher cloud costs.
However, these platforms are not a good fit for teams that:
- Have an existing complex infrastructure: Less suited for those with pre-existing systems incompatible with Vertex AI or other cloud-provider platforms.
- Want flexibility: Not ideal for teams looking for a decentralized, flexible MLOps platform.
Vertex AI
Vertex AI, a serving platform available on the Google Cloud Platform (GCP), offers a managed environment for deploying and scaling machine learning models.
Advantages of Vertex AI
- AutoML: Vertex AI includes AutoML, allowing teams to create custom ML models with minimal expertise, streamlining the model development process for various data types including tabular, image, video, and text.
- Managed infrastructure: Involves minimal setup and maintenance for deploying and scaling ML models as the platform manages the necessary compute resources.
- Cloud services integration: Vertex AI is completely integrated with GCP. It makes it easy to use machine learning models with Google’s services, such as CloudSQL, Secret Manager, and more.
- Auto-scaling: Vertex AI automatically scales the number of machines running your models according to incoming traffic (see the sketch after this list).
- Built-in model monitoring: Vertex AI comes with integrated model metrics, eliminating the need for additional monitoring infrastructure.
- Robust data infrastructure: Vertex AI integrates seamlessly with Google Cloud’s advanced data tools and BigQuery, providing a strong foundation for handling structured datasets.
- Support: Offers extensive support from the provider, available with the Google customer care subscription.
- Documentation: The documentation is well-designed and gets regular updates.
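As a sketch of how deployment and scaling bounds look with the `google-cloud-aiplatform` SDK (the project, region, and model ID are placeholders to adapt):

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

# Reference an already-uploaded model by its resource name (placeholder ID).
model = aiplatform.Model(
    "projects/your-project/locations/us-central1/models/1234567890"
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,  # cannot be 0: Vertex AI has no scale-to-zero
    max_replica_count=5,  # auto-scaling upper bound
)
print(endpoint.resource_name)
```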
Disadvantages of Vertex AI
- Expensive: Vertex AI is more expensive than a self-managed infrastructure, particularly for applications requiring high-end GPUs.
- Complex pricing: Estimating costs is tricky since they depend on factors like hardware type, machine storage, and network data transfer.
- No scaling-to-zero: Vertex AI cannot scale resources down to zero when idle; the feature has been requested since 2021 but is still not on the roadmap.
- Operational restrictions: Imposes specific methods of operation, reducing flexibility and customizability.
- Vendor lock-in: Creates a strong dependence on GCP, making migration to other platforms difficult.
Summary for Vertex AI
Vertex AI is particularly beneficial for small to medium-sized teams that do not have extensive DevOps skills. It’s well-suited for teams looking to outsource infrastructure management and who have the budget to handle higher operational costs.
Amazon SageMaker
Amazon SageMaker is a cloud-based machine learning platform provided by Amazon Web Services, similar to Google’s Vertex AI.
Advantages of Amazon SageMaker
- Managed infrastructure: You don’t need to manage infrastructure. SageMaker handles this for you, letting you focus on building models.
- Advanced auto-scaling: SageMaker offers flexible scaling options. You can define your own scaling policies, use step scaling for complex scenarios, or set up scheduled scaling. It lets you customize metrics, instance limits, and cooldown periods to match your workload (see the sketch after this list).
- Built-in monitoring: Includes model monitoring out of the box, helping you track how well your models perform in production.
- AWS integration: It works seamlessly with other AWS services, creating a unified cloud ecosystem.
- Pre-built containers: Supports a wide variety of pre-built model containers. These containers are ready to use, which speeds up deployment.
- Cost efficiency: It’s often more cost-effective than Vertex AI, especially for larger workloads or when using asynchronous endpoints that can scale to zero.
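To show what a custom scaling policy looks like, here is a sketch using boto3’s Application Auto Scaling API with target tracking on invocations per instance (the endpoint name, capacity bounds, and target value are placeholders to adapt):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/your-endpoint/variant/AllTraffic"

# Register the endpoint variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy: hold invocations per instance near the target.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```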
Disadvantages of Amazon SageMaker
- Steeper learning curve: It has a steeper learning curve than Vertex AI. You might need more time to become proficient with SageMaker.
- Vendor lock-in: Using SageMaker ties you closely to AWS. This vendor lock-in can make it hard to switch to other cloud providers later.
- Operational restrictions: It imposes certain ways of working. This might limit your flexibility if you have specific workflow needs.
- No scaling-to-zero for real-time endpoints: The documentation is confusing because there are many types of inference endpoints. For real-time inference, SageMaker does not support scaling to zero.
- Less user-friendly: Vertex AI feels easier to use than SageMaker, according to Superwise.ai.
Summary for SageMaker
SageMaker is a good fit for teams of all sizes who need a robust, scalable ML platform. It’s especially useful if you’re already using AWS or if you need advanced scaling customization.
If you’re hesitating between the two, consider your priority:
- Choose Vertex AI if you need robust data management tools, especially for tabular data.
- Opt for SageMaker if advanced scaling features are crucial for your project.
BentoCloud
BentoCloud is a model-serving platform that scales BentoML containers efficiently and cost-effectively. Developed by the same company behind BentoML, this platform provides pre-built model containers and high-level APIs. With BentoCloud, you can scale machine learning models with just a few lines of code.
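For a sense of the workflow, here is a minimal BentoML 1.2-style service sketch; it assumes a scikit-learn model already saved to the local BentoML store under the hypothetical tag `iris_clf:latest`. Once it runs locally, the `bentoml deploy` CLI pushes it to BentoCloud.

```python
import bentoml

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class IrisClassifier:
    def __init__(self):
        # Load a model previously saved with bentoml.sklearn.save_model(...)
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def predict(self, rows: list[list[float]]) -> list[int]:
        return self.model.predict(rows).tolist()
```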
Advantages of BentoCloud
- Ease of use: BentoCloud offers a simple yet powerful CLI experience, allowing developers to deploy BentoML containers effortlessly across various cloud providers.
- Supports complex inference graphs: With BentoCloud, you can build distributed inference graphs with multiple models.
- Built-in auto-scaling: BentoCloud auto-scales your model instances based on the traffic. It can even reduce the number of instances to zero when there are no requests.
- ML framework support: BentoCloud can scale BentoML containers. This means it supports all the ML frameworks integrated with BentoML.
- No vendor lock-in: Teams can deploy BentoCloud on their desired cloud provider. As BentoML Docker images can be deployed outside BentoCloud, migrating to another serving platform is easier.
- Built-in monitoring: You can see your model metrics from the BentoCloud UI. No extra setup is needed.
Disadvantages of BentoCloud
- Costly: BentoCloud is proprietary. The managed version uses pay-as-you-go pricing. There is an enterprise version you can deploy on your own cloud, but its price is not publicly communicated.
- Only works with BentoML: BentoCloud only supports models packaged with the BentoML model-serving runtime.
- No advanced deployment strategies: BentoCloud’s documentation on deployment strategies appears to include A/B testing, but there is no support for multi-armed bandits or canary deployments.
Summary for BentoCloud
BentoCloud is a good choice if you already use BentoML and want a simple way to deploy and scale your models. But it might not be the best option if you use many serving runtimes (like TorchServe or Triton) or need to use advanced deployment strategies.
Feature Comparison Table
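Here is a summary of the platforms, based solely on the points covered in this guide (— = not covered here):

| Feature | KServe | Seldon Core | Vertex AI | SageMaker | BentoCloud |
|---|---|---|---|---|---|
| License / pricing | Open source | BSL 1.1, from $18,000/year | Usage-based | Usage-based | Pay-as-you-go |
| Requires Kubernetes | Yes | Yes | No | No | No |
| Auto-scaling | Yes, out of the box | Via KEDA | Yes | Yes, highly customizable | Yes |
| Scale-to-zero | Yes | No | No | No (real-time) | Yes |
| Advanced deployment strategies | Yes (bandits, A/B, canary) | Yes (bandits, A/B, canary) | — | — | No (A/B only) |
| Built-in monitoring | No (Prometheus logs only) | — | Yes | Yes | Yes |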
Conclusion
Scaling machine learning models is crucial for real-time applications and high-volume batch pipelines. Model Serving Platforms are perfect for solving this problem as they provide clear abstractions and utilities to adjust the number of model instances based on your application traffic.
There are many criteria to consider when choosing a serving platform. The most important ones are the machine learning frameworks and runtimes supported, the deployment strategies available, the integration with your existing stack, and the maintenance and cloud costs.
Kubernetes is the system behind most open-source platforms. It is powerful for building complex pipelines but has a steep learning curve. On the other hand, fully managed platforms, such as Vertex AI or SageMaker, are easier to use but are more expensive and less flexible.
The effort needed to adopt these platforms varies greatly. Before deciding, consider the size and technical skills of your team.
Scaling ML models is a complex challenge, but I hope this guide will help you make informed decisions. Good luck putting your models in production!