Scalable Inference with DALL-E Mini

Deploy DALL-E Mini for scalable inference on CoreWeave Cloud

The DALL-E Mini and Mega models accept a text prompt as input and generate an image as output. The following example includes code to submit a text prompt and receive a generated image using DALL-E Mini, along with the code and manifests needed to download the DALL-E Mini model and deploy an autoscaling Inference Service on CoreWeave Cloud.

The end result is an HTTP API that can be used to generate images from text input in a concurrent, highly scalable fashion.

Prerequisites

This tutorial presumes you have an active CoreWeave Cloud account. The following additional tools must be installed and configured prior to running the example:

Example source code

To follow along with this example, clone the source code repository:

Note

dalle-mini may be replaced with dalle-mega in the YAML manifests and following commands, should you wish to deploy the larger version of the model.

Build the Docker images

Important

The default Docker tag is latest. Using this tag is strongly discouraged, as containers are cached on the nodes and in other parts of the CoreWeave stack. Always use a unique tag, and never push to the same tag twice.

Two images are used in this example:

The downloader image

Defined in Dockerfile.downloader, this image downloads the model to a shared storage volume when built. Individual inference Pods then load the model from the storage volume, instead of downloading it over the Internet each time they scale up.
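This download-once, mount-many pattern can be sketched as a Kubernetes Job that mounts the shared storage volume and runs the downloader image. The fragment below is illustrative only; the mount path, PVC name, and image tag are assumptions, and the actual definition lives in 01-model-download-job.yaml:

```yaml
# Illustrative sketch of the download Job pattern; see
# 01-model-download-job.yaml in the repository for the real manifest.
apiVersion: batch/v1
kind: Job
metadata:
  name: dalle-mini-download
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: coreweave/model-downloader:1   # or $DOCKER_USER/model-downloader:1
          volumeMounts:
            - name: model-storage
              mountPath: /models                # hypothetical mount path
      restartPolicy: OnFailure
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: dalle-mini-model-cache   # hypothetical PVC name
```

Because the Job writes to a shared volume, subsequent inference Pods only need to mount the same PVC read-only to load the model.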

The model image

This image runs DALL-E Mini.

Docker Hub

Note

When running the following commands, be sure to replace the coreweave DOCKER_USER with your own Docker Hub username.

Below, simple versioning is used, with the tag 1 denoting the first iteration of each image.

First, change directories to kubernetes-cloud/online-inference/dalle-mini. Then, from this directory, use docker login to log in to Docker Hub.

$ docker login

Be sure to export your own Docker Hub username as the DOCKER_USER.

$ export DOCKER_USER=<your Docker Hub username>

Next, build the model-downloader image with the tag 1, by specifying the Dockerfile.downloader Dockerfile.

$ docker build -t $DOCKER_USER/model-downloader:1 -f Dockerfile.downloader . 

Then, build the dalle-mini image with the tag 1, by specifying the Dockerfile named Dockerfile.

$ docker build -t $DOCKER_USER/dalle-mini:1 -f Dockerfile .

Once both images are built, push both the model-downloader image with tag 1 and the dalle-mini image with tag 1 to your Docker Hub account.

$ docker push $DOCKER_USER/model-downloader:1
$ docker push $DOCKER_USER/dalle-mini:1

Deploy the Kubernetes resources

Define image

In each of the manifests below, image may be set either to your newly built Docker images for the model-downloader and the InferenceService respectively, or you may use the publicly available images as provided in the following manifests:

Build the Persistent Volume Claim (PVC)

From the kubernetes-cloud/online-inference/dalle-mini directory, apply the 00-model-pvc.yaml manifest using kubectl apply to deploy a new PVC in which to store the model.

$ kubectl apply -f 00-model-pvc.yaml
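For reference, a PVC of this kind generally looks like the following. The name, size, and storage class here are illustrative assumptions; the actual values are in 00-model-pvc.yaml:

```yaml
# Illustrative PVC sketch; see 00-model-pvc.yaml for the actual manifest.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dalle-mini-model-cache        # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                   # shared by the downloader Job and inference Pods
  resources:
    requests:
      storage: 30Gi                   # illustrative size
  storageClassName: shared-nvme-ord1  # illustrative CoreWeave storage class
```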

Apply the model downloader

From the kubernetes-cloud/online-inference/dalle-mini/ directory, use kubectl apply to deploy the model downloader job.

$ kubectl apply -f 01-model-download-job.yaml

Use kubectl get pods, or add the --watch flag to follow updates, to check the status of the newly-deployed Pods. Wait until the downloader Pod is in a Completed state.

$ kubectl get pods

NAME                        READY   STATUS      RESTARTS   AGE
dalle-mini-download-hkws6   0/1     Completed   0          1h

Or, use kubectl logs to monitor its progress.

$ kubectl logs -l job-name=dalle-mini-download --follow

Deploy the Inference Service

Once the model is downloaded, deploy the InferenceService by using kubectl apply on the 02-inference-service.yaml manifest.

$ kubectl apply -f 02-inference-service.yaml

Loading the model into GPU memory may take a couple of minutes. To monitor progress, watch for the KServe workers to start in the Pod logs by using kubectl logs with a label selector.

$ kubectl logs -l serving.kubeflow.org/inferenceservice=dalle-mini --container kfserving-container

Or, you can wait for the InferenceService to show that READY is True, and that a URL has propagated:

$ kubectl get isvc dalle-mini

NAME         URL                                                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                  AGE
dalle-mini   http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud   True           100                              dalle-mini-predictor-default-00001   19h

Note

isvc is the object type for CoreWeave's InferenceService.

Make an HTTP request

Using the URL provided by kubectl get isvc, make a test HTTP request to the InferenceService.

Here is a simple example using cURL, which also opens the output image.

$ curl http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud/v1/models/dalle-mini:predict -d '{"prompt": "Photorealistic galaxy"}' --output prediction.png && open prediction.png

cURL parameters

The model inference-time parameters may be adjusted by invoking curl with a parameters map, such as:

$ curl http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud/v1/models/dalle-mini:predict -d '{"prompt": "Photorealistic galaxy", "parameters": {"top_k": 100, "top_p": 2.0, "temperature": 1.5, "condition_scale": 15.0}}' --output prediction.png && open prediction.png

The following parameters are supported:

  • top_k

  • top_p

  • temperature

  • condition_scale
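The same request can be made from Python. The sketch below, using only the standard library, builds the JSON payload with the supported parameters and posts it to the InferenceService; the URL is a placeholder for your own endpoint, and the parameter values are only examples:

```python
import json
from urllib import request

# Placeholder endpoint: substitute the URL reported by `kubectl get isvc dalle-mini`.
URL = "http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud/v1/models/dalle-mini:predict"


def build_payload(prompt, **parameters):
    """Build the request body. Any of top_k, top_p, temperature, and
    condition_scale may be passed as keyword arguments."""
    body = {"prompt": prompt}
    if parameters:
        body["parameters"] = parameters
    return body


def predict(prompt, out_path="prediction.png", **parameters):
    """POST the payload to the InferenceService and save the returned PNG."""
    data = json.dumps(build_payload(prompt, **parameters)).encode("utf-8")
    with request.urlopen(request.Request(URL, data=data)) as response:
        png = response.read()
    with open(out_path, "wb") as f:
        f.write(png)
    return out_path
```

For example, `predict("Photorealistic galaxy", top_k=100, temperature=1.5)` writes the generated image to prediction.png.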

Reference

Hardware

Additional Resources

For all available GPU types and their respective selector labels, refer to the Node Types page.

In this example, one RTX A6000 is used. Inference times are between 5 and 10 seconds on RTX A6000 and on A40 GPUs. Multi-GPU inference is supported, which provides higher inference speeds. To enable multi-GPU inference, increase the number of GPUs defined in the limits key of the InferenceService manifest.
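For example, the GPU count can be raised in the predictor's resource limits. The fragment below is an illustrative sketch; the surrounding keys should match the structure of 02-inference-service.yaml:

```yaml
# Illustrative fragment: raising the GPU count for multi-GPU inference.
spec:
  predictor:
    containers:
      - resources:
          limits:
            nvidia.com/gpu: 2   # increase from 1 to enable multi-GPU inference
```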

GPUs with less VRAM, such as the Quadro RTX 5000, may also work for this exercise, provided they have at least 16GB of VRAM.

Autoscaling

Autoscaling is controlled in the InferenceService configuration. By default, this example is set to always run 1 replica, regardless of the number of requests.

Increasing the number of maxReplicas allows the CoreWeave infrastructure to automatically scale up replicas when there are multiple outstanding requests to your endpoints. Replicas will automatically be scaled down as demand decreases.

Example

spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1

By setting minReplicas to 0, Scale-To-Zero can be enabled. This feature completely scales down the InferenceService after no requests have been received for a set period of time.
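In terms of the example spec above, Scale-To-Zero corresponds to:

```yaml
spec:
  predictor:
    minReplicas: 0   # enables Scale-To-Zero
    maxReplicas: 3   # illustrative upper bound for scale-up
```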

Note

When the Service is scaled to zero, no cost is incurred.
