Universe Scale Serving of Generative Models

And Why You Might Not Want To Use FastAPI For That Purpose

Posted by Javier Mermet

on April 12, 2023 · 9 mins read

New frontiers

If you're a fan of our blog, you know we're constantly pushing ourselves to be at the frontier of AI/ML tech. In our latest technical post, we introduced Stable Diffusion. Not because of all the hype around it, but because our brand-new Research Squad has been doing ✨ magic ✨ with it!

But building on top of state of the art models is not enough by itself if you plan on using it in production. Very much alike other ML projects, for this endeavor to thrive, we had to build the capabilities for serving, monitoring and scaling the infrastructure that will serve as the backbone of the model. This was new ground for us, so we took it one step at a time. We started building from the ground up, learning along the way different alternatives towards getting the most out of available hardware in serving GPU intensive models such as Stable Diffusion.

Goals of our tests

Before doing any benchmarks, we need to know what we'll be optimizing for. In serving a model such as Stable Diffusion, the GPU is most likely to be your hardware bottleneck, whether you are going with cloud providers or on-premise. So, constrained to a given GPU, we want to get the highest throughput possible. That's to say: the most images per second or the most requests per second. These two are similar premises but have a nuanced difference: a request might demand more than one image. So, let's focus on RPS under the assumption that each request requires to generate just one image.

Slow start: FastAPI

FastAPI is a wrapper on top of Starlette, which has gained popularity over the last few years in the Python ecosystem. Many tutorials for Machine Learning practitioners teach you how to serve your models through a RESTful interface using FastAPI, but most lack a complete outlook into the MLOps aspect of it.

FastAPI falls short in many aspects for our purposes:

Performance
- Ranking: This benchmark is quite popular, and you might notice something not so curious. Python frameworks do not rank well. So, already, the language of choice is not optimal.
- (Micro) Batching: Since FastAPI is not built with the explicit purpose of serving ML models, microbatching is not a built-in feature. You will have to either build it yourself or use a third party package.
Observability: FastAPI is just a web framework. As such, it doesn't bundle features that enable MLOps/Devops teams to monitor the near real-time status of an app.
ASGI goes only as far as your GPU/s: The hardware bottleneck in your system will most likely be the GPU, so you want to make the most out of it. There are some optimizations to be made in the API itself, but you'll have to consider how to use the GPU to it's fullest.

Addressing performance

FastAPI optimizations

Don't use StreamingResponse. Seriously. Most online tutorials return the generated images using StreamingResponse, but there are several reasons not to. Using a plain Response, we saw improvements ranging from 5% for the single user scenario and up to 60% on median latency for the multi user scenario.
Use microbatching. We had to implement this ourselves, but the idea is that you can generate several images with different prompts all at once (provided some other parameters are the same) and we make better use of the GPU. Generation is blocking, so we can serve more requests at once.
- We set two parameters here: the maximum time to wait before fetching the next batch and the maximum number of prompts to fetch. 500ms and 8 prompts seemed like sensible defaults to start with.

Uvicorn workers

Use uvloop. By adding uvicorn as a dependency with the standard extras, you get uvloop, which increases performance.

PyTorch optimizations

Enable attention slicing for VRAM savings
Enable memory efficient attention for performance
Enable cuDNN auto-tuner
Disable gradients calculation
Set PyTorch's num_threads to 1

Implementing observability

Out of the box, FastAPI has no observability capabilities. But there are lots of packages that add features and can be seamlessly integrated.

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

It is widely used to collect metrics as time series and query them through PromQL.

We will be using cadvisor to export metrics from our containers into prometheus, dcgm-exporter to export GPU metrics and as we will see later, metrics straight from FastAPI. For each of these, we use the default scrape_interval of 15 seconds.

Grafana

We set up two dashboards, for different purposes: monitoring the GPU and the API container.

api dashboard — The dashboard for monitoring the API container.

gpu dashboard — The dashboard for monitoring GPU usage.

These are a must have for monitoring in a production environment. You need to be able to tell what's wrong with a quick glance.

FastAPI Prometheus instrumentator

We used prometheus-fastapi-instrumentator to add a /metrics endpoint which exports metrics in a prometheus-friendly format.

We had to add latency and requests metrics, but it was a breeze. Read the docs and you will find most use cases explained.

Stress testing

We evaluated several options to stress test the API under different scenarios and workloads. We needed to easily set scenarios to each endpoint so that we could compare later.

Artillery. Great performance, configurable with code.
Apache Bench. Easy setup, wrappable in scripts for different scenarios.
Apache JMeter. A classic choice.
Locust. For the pythonista. Easy setup. Configurable with code.

Although it's better suited for more complex workflows than ours, we went with locust because of ease of setup, scenario writing, we get performance measures out of the box (on top of prometheus) and we've used it in the past for several projects.

Locust

The setup for each scenario looked similar to this snippet:

import random

from locust import FastHttpUser, task


class T2IHighRes(FastHttpUser):
    @task
    def fetch_image(self):
        self.client.post(
            "/t2i",
            json={
                "seed": random.randint(0, 1_000_000),
                "negative_prompt": "photography",
                "prompt": "digital painting of bronze (metal raven), automaton, all metal, glowing magical eye, intricate details, perched on workshop bench, cinematic lighting, by greg rutkowski, in the style of midjourney, ({steampunk})",
                "width": 512,
                "height": 512,
                "inference_steps": 20,
                "guidance_scale": 8.2,
                # "n": 1,
            },
        )

(Which produces stunning images, by the way)

You'll notice that we get many metrics that we already had in grafana. We used both as a double check, but with grafana/prometheus you can get more stylized metrics, such as 5 minutes windows moving averages.

Experiments & Results

We conducted all experiments on g4dn.xlarge spot instances provided by AWS, using the Amazon Linux 2 AMI with Nvidia drivers. Initially, we tried deploying on vast.ai, but we found that it was not possible to run Docker containers due to environment limitations (docker in docker is not available). So, it was not suited for our choice of deployment.

g4dn.xlarge instances have 16GB of RAM, 4 vCPUs and a Nvidia Tesla T4 GPU.

There are five endpoints to generate images, four of them are based on combining these traits:

Sync vs Async: When using FastAPI, you usually define endpoints as async functions, but many times, the underlying service is not async (for instance, sqlalchemy querying). We tested whether defining the endpoint+service as async functions made any difference.
Stream response vs Response: As stated before, we felt this was necessary to test due to how ubiquitous using a StreamingResponse to return images is on online tutorials and reference implementations. We implemented the same endpoint twice, once using a plain Response object and another using StreamingResponse

And the fifth one is the microbatching endpoint, which uses a custom priority queue to serve a configurable amount of responses at the same time. As said before, the batch timeout are set to 500ms by default and the max batchsize is set to 8. This means that every 500ms, if no group of requests has accumulated 8 requests, we fetch the group with the oldest request and serve those first. Requests are grouped by width, height, inference_steps, and guidance_scale.

For all scenarios we got 0 failures, thus they are not reported. Metrics reported here come from Locust, but where revised against the aforementioned dashboards.