PyTorch Hugging Face Diffusers - Stable Diffusion Text to Image

Deploy Stable Diffusion for scalable, high fidelity, text-to-image generation on CoreWeave Cloud
The Stable Diffusion model takes a text prompt as input and generates high-quality, photorealistic images. It is an open-source model built by our friends at Stability AI. Stability also offers a UI for the model and an API service via DreamStudio.
In this example, we will step through deploying Stable Diffusion as an auto-scaling InferenceService on CoreWeave Cloud, which provides an HTTP API for generating images from a text prompt.
An image generated from the prompt: "Red forest, digital art, trending"
View the example code on GitHub:


The following tools must be installed and configured prior to running the example:


Build and push the Docker images

We require two images:
  1. The Downloader image. This downloads the model to a shared storage volume; the individual inference Pods load the model from this storage instead of downloading it over the internet every time they scale up.
  2. The Model image. This is what runs CompVis/stable-diffusion-v1-4.
The default Docker tag is latest. Using this tag is strongly discouraged, because containers are cached on the nodes and in other parts of the CoreWeave stack. Always use a unique tag, and never push to the same tag twice.
Below, we use simple versioning by using the tag 1 for the first iteration of the image.
When running the following commands, be sure to replace the example username with your Docker Hub username.
From the kubernetes-cloud/online-inference/stable-diffusion directory, run the following commands:
$ docker login
$ export DOCKER_USER=coreweave
$ docker build -t $DOCKER_USER/model-downloader:1 -f Dockerfile.downloader .
$ docker build -t $DOCKER_USER/stable-diffusion:1 -f Dockerfile .
$ docker push $DOCKER_USER/model-downloader:1
$ docker push $DOCKER_USER/stable-diffusion:1
This example assumes a public Docker registry. To use a private registry, an imagePullSecret must be defined.
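As a minimal sketch, assuming Docker Hub and a hypothetical secret name of my-registry-secret, an imagePullSecret can be created like this, then referenced from the Pod spec in each manifest:

```shell
# Hypothetical secret name; substitute your own registry and credentials.
kubectl create secret docker-registry my-registry-secret \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=$DOCKER_USER \
  --docker-password='<access-token>'

# Each Pod spec then references the secret:
#   spec:
#     imagePullSecrets:
#       - name: my-registry-secret
```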

Deploy the Kubernetes resources


Note: Before continuing, either point the image: field in the following manifests to the image you just built in the previous steps, or use the publicly available image already referenced in the manifests.
To create a PVC in which to store the model, run the following command from the kubernetes-cloud/online-inference/stable-diffusion directory:
$ kubectl apply -f 00-model-pvc.yaml

Model Repository Registration

Due to the generative power of this model, it is necessary to register your contact information via HuggingFace before the model can be used.
While logged in, visit the HuggingFace Model Repository page, review the terms, and click the Access Repository button at the bottom of the page.
Stable Diffusion HuggingFace repository registration


If you have not already done so, create a HuggingFace account and API Token.
Once you have a token, copy and Base64 encode it:
$  echo -n "TOKEN_HERE" | base64
Replace TOKEN_HERE with your HuggingFace API Token.
Note the extra space before the echo command; in most shells, this prevents the command (and, as a result, your HuggingFace API token) from being recorded in your shell history.
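The encoding can be sanity-checked by round-tripping it through base64 --decode (the placeholder token is shown; use your real token with the leading space as above):

```shell
# Encode the token, then decode it again to confirm it round-trips.
TOKEN="TOKEN_HERE"
ENCODED=$(echo -n "$TOKEN" | base64)
echo "$ENCODED"
echo "$ENCODED" | base64 --decode
```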
Take the Base64-encoded value of your token from the above command, and replace the value in the token field of the 01-huggingface-secret.yaml file with it, then create the Secret:
$ kubectl create -f 01-huggingface-secret.yaml

Model job download

To deploy the job that downloads the model to the PVC, run the following command from the kubernetes-cloud/online-inference/stable-diffusion/ directory:
$ kubectl apply -f 02-model-download-job.yaml
To check if the model has finished downloading, wait for the job to be in a Completed state:
$ kubectl get pods
NAME                              READY   STATUS      RESTARTS   AGE
stable-diffusion-download-vsznr   0/1     Completed   0          3h14m
Or, follow the job logs to monitor progress:
$ kubectl logs -l job-name=stable-diffusion-download --follow
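Alternatively, kubectl can block until the Job completes. A sketch, assuming the Job is named stable-diffusion-download, matching the job-name label above; the timeout is an example value:

```shell
# Wait for the model download Job to reach the Complete condition.
kubectl wait --for=condition=complete job/stable-diffusion-download --timeout=60m
```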


Once the model is downloaded, the InferenceService can be deployed by invoking:
$ kubectl apply -f 03-inference-service.yaml
Loading the model into GPU memory may take a couple of minutes. To monitor progress, watch for the KServe workers to start in the Pod logs by invoking:
$ kubectl logs -l serving.kubeflow.org/inferenceservice=stable-diffusion --container kfserving-container --follow
Alternatively, you can wait for the InferenceService to show that READY is True, and that it has a URL:
$ kubectl get isvc stable-diffusion
stable-diffusion True 100 stable-diffusion-predictor-default-00001 64m
Using the provided URL, you can make an HTTP request via your preferred means.
Here is a simple cURL example, where $URL is the InferenceService URL obtained above:
$ curl $URL \
    -d '{"prompt": "California sunset on the beach, red clouds, Nikon DSLR, professional photography", "parameters": {"seed": 424242, "width": 768}}' \
    --output sunset.png \
  && open sunset.png
An image generated from the prompt: "California sunset on the beach, red clouds, Nikon DSLR, professional photography"
The following per-request parameters are supported:
- guidance_scale
- num_inference_steps
- seed
- width
- height
Parameters may be passed per-request as follows:
$ curl $URL \
    -d '{"prompt": "California sunset on the beach, red clouds, Nikon DSLR, professional photography", "parameters": {"guidance_scale": 14.0, "num_inference_steps": 100, "seed": 424242, "width": 1024, "height": 768}}' \
    --output sunset.png \
  && open sunset.png
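For scripted generation, the same request can be issued in a loop. A minimal sketch, assuming $URL holds the InferenceService URL from kubectl get isvc, varying only the seed per request:

```shell
URL="<inference-service-url>"  # placeholder: use the URL from `kubectl get isvc`

# Generate three variations of the same prompt by changing the seed.
for seed in 1 2 3; do
  curl -sS "$URL" \
    -d "{\"prompt\": \"Red forest, digital art, trending\", \"parameters\": {\"seed\": $seed}}" \
    --output "forest-$seed.png"
done
```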

Hardware and Performance

This example is configured to use one NVIDIA A40 for producing higher-resolution images. Inference takes around 4.78 seconds at the default 512x512 resolution with 50 steps. Larger resolutions take longer; for example, a 1024x768 image takes around 47 seconds.
Multi-GPU Inference is not supported.
Depending on the use case, GPUs with as little as 8GB of VRAM, such as the Quadro RTX 4000, will also work; however, the output resolution will be limited by memory to 512x512.
The graph and table below compare recent GPU benchmark inference speeds for Stable Diffusion processing on different GPUs:
[Benchmark chart: Stable Diffusion inference speed on the Quadro RTX 4000, Quadro RTX 5000, A100 40GB PCIe, and AWS A100 40GB]
Additional Resources
Refer to the Node Types page for all available GPUs and their selectors.


Scaling is controlled in the InferenceService configuration. This example is set to always run one replica, regardless of the number of requests.
Increasing the number of maxReplicas will allow the CoreWeave infrastructure to automatically scale up replicas when there are multiple outstanding requests to your endpoints. Replicas will automatically be scaled down as demand decreases.


minReplicas: 1
maxReplicas: 1
By setting minReplicas to 0, Scale To Zero can be enabled, which will completely scale down the InferenceService when there have been no requests for a period of time.
When a service is scaled to zero, no cost is incurred.
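For example, to enable Scale To Zero while allowing bursts of up to five replicas, the autoscaling section of 03-inference-service.yaml could be adjusted as follows (the maximum shown here is illustrative):

```yaml
minReplicas: 0
maxReplicas: 5
```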