Scalable Inference with DALL-E Mini
Deploy DALL-E Mini for scalable inference on CoreWeave Cloud
The DALL-E Mini and Mega models accept a text prompt as input and generate an image as output. The following example includes the code and manifests needed to download the DALL-E Mini model, deploy an autoscaling Inference Service on CoreWeave Cloud, and submit text prompts to generate images.
The end result is an HTTP API that can be used to generate images from text input in a concurrent, highly scalable fashion.
Prerequisites
This tutorial presumes you have an active CoreWeave Cloud account. The following additional tools must be installed and configured prior to running the example:
- kubectl
- Docker
- A Docker Hub account, with a public Docker registry. (To use a private registry, an imagePullSecret must be defined; see the sketch below.)
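If you do use a private registry, a minimal sketch of creating and referencing the pull secret looks like the following. The secret name docker-hub-cred is hypothetical; the manifests in this example would need to be updated to reference whatever name you choose.
$kubectl create secret docker-registry docker-hub-cred \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username=<your Docker Hub username> \
    --docker-password=<your Docker Hub password or access token>
The secret is then referenced from the Pod spec of the relevant manifest (the Job's Pod template, or the predictor spec of the InferenceService):
spec:
  imagePullSecrets:
    - name: docker-hub-cred   # hypothetical secret name created above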
Example source code
To follow along with this example, clone the source code repository from the CoreWeave GitHub. dalle-mini may be replaced with dalle-mega in the YAML manifests and following commands, should you wish to deploy the larger version of the model.
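Assuming the repository in question is CoreWeave's kubernetes-cloud examples repository (the directory paths used throughout this guide come from it), cloning and entering the example directory looks like this:
$git clone https://github.com/coreweave/kubernetes-cloud.git
$cd kubernetes-cloud/online-inference/dalle-mini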
Build the Docker images
The default Docker tag is latest. Using this tag is strongly discouraged, as containers are cached on the nodes and in other parts of the CoreWeave stack. Always use a unique tag; once you have pushed to a tag, do not push to that tag again.
Two images are used in this example:
The downloader image
Defined in Dockerfile.downloader, this image downloads the model to a shared storage volume when run as a Job. Individual inference Pods then load the model from the storage volume instead of downloading it over the Internet each time they scale up.
The model image
This image runs DALL-E Mini.
Docker Hub
When running the following commands, be sure to set DOCKER_USER to your own Docker Hub username rather than the example coreweave username.
Below, simple versioning is used, in which the tag 1 denotes the first iteration of the image.
First, change directories to kubernetes-cloud/online-inference/dalle-mini. Then, from this directory, use docker login to log in to Docker Hub.
$docker login
Be sure to export your own Docker Hub username as the DOCKER_USER environment variable.
$export DOCKER_USER=<your Docker Hub username>
Next, build the model-downloader image with the tag 1 by specifying the Dockerfile.downloader Dockerfile.
$docker build -t $DOCKER_USER/model-downloader:1 -f Dockerfile.downloader .
Then, build the dalle-mini image with the tag 1 by specifying the Dockerfile named Dockerfile.
$docker build -t $DOCKER_USER/dalle-mini:1 -f Dockerfile .
Once both images are built, push both the model-downloader image and the dalle-mini image, each with tag 1, to your Docker Hub account.
$docker push $DOCKER_USER/model-downloader:1
$docker push $DOCKER_USER/dalle-mini:1
Deploy the Kubernetes resources
Define image
In each of the manifests below, image may be set to your newly built Docker images (model-downloader for the download Job and dalle-mini for the InferenceService, respectively), or you may use the publicly available images already provided in the manifests.
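As a rough sketch of where this field lives, the download Job sets image on its Pod template container, and the InferenceService sets an analogous image field on its predictor container; the exact structure and default values come from 01-model-download-job.yaml and 02-inference-service.yaml in the repository.
# Sketch only: substitute your own Docker Hub username and tag if you built the images yourself.
spec:
  template:
    spec:
      containers:
        - name: model-downloader        # illustrative container name
          image: <DOCKER_USER>/model-downloader:1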
Build the Persistent Volume Claim (PVC)
From the kubernetes-cloud/online-inference/dalle-mini directory, apply the 00-model-pvc.yaml manifest using kubectl apply to deploy a new PVC in which to store the model.
$kubectl apply -f 00-model-pvc.yaml
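To confirm the claim was created and bound, list the PVCs in your namespace (the claim's exact name is whatever 00-model-pvc.yaml defines):
$kubectl get pvc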
Apply the model downloader
From the kubernetes-cloud/online-inference/dalle-mini/ directory, use kubectl apply to deploy the model downloader job.
$kubectl apply -f 01-model-download-job.yaml
Use kubectl get pods (optionally with the --watch flag) to check the status of the newly-deployed Pods. Wait until the downloader Pod is in a Completed state.
$kubectl get pods
NAME                        READY   STATUS      RESTARTS   AGE
dalle-mini-download-hkws6   0/1     Completed   0          1h
Or, use kubectl logs to monitor its progress.
$kubectl logs -l job-name=dalle-mini-download --follow
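If you would rather block until the download finishes, kubectl wait should also work here; the Job name dalle-mini-download is inferred from the job-name label used above, and the timeout value is illustrative:
$kubectl wait --for=condition=complete job/dalle-mini-download --timeout=30m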
Deploy the Inference Service
Once the model is downloaded, deploy the InferenceService by using kubectl apply on the 02-inference-service.yaml manifest.
$kubectl apply -f 02-inference-service.yaml
Loading the model into GPU memory may take a couple of minutes. To monitor progress, watch for the KServe workers to start in the Pod logs by using kubectl logs with a label selector (-l).
$kubectl logs -l serving.kubeflow.org/inferenceservice=dalle-mini --container kfserving-container
Or, you can wait for the InferenceService to show that READY is True and that a URL has propagated:
$kubectl get isvc dalle-mini
NAME         URL                                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                  AGE
dalle-mini   http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud   True           100                              dalle-mini-predictor-default-00001   19h
isvc is the short name for the InferenceService object type.
Make an HTTP request
Using the URL provided by kubectl get isvc, make a test HTTP request to the InferenceService.
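If you prefer not to copy the URL by hand, it can be captured into a shell variable; .status.url is the standard status field on InferenceService objects:
$export URL=$(kubectl get isvc dalle-mini -o jsonpath='{.status.url}')
The examples below spell out the full hostname, but $URL/v1/models/dalle-mini:predict works the same way.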
Here is a simple example using cURL, which also opens the output image. (The open command is available on macOS; on Linux, substitute a tool such as xdg-open.)
$curl http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud/v1/models/dalle-mini:predict -d '{"prompt": "Photorealistic galaxy"}' --output prediction.png && open prediction.png
cURL parameters
The model inference-time parameters may be adjusted by invoking curl with a parameters map, such as:
$curl http://dalle-mini.tenant-my-namespace.knative.ord1.coreweave.cloud/v1/models/dalle-mini:predict -d '{"prompt": "Photorealistic galaxy", "parameters": {"top_k": 100, "top_p": 2.0, "temperature": 1.5, "condition_scale": 15.0}}' --output prediction.png && open prediction.png
The following parameters are supported:
- top_k
- top_p
- temperature
- condition_scale
Reference
Hardware
For all available GPU types and their respective selector labels, refer to the Node Types page.
In this example, one RTX A6000 is used. Inference times are between 5 and 10 seconds on RTX A6000 and A40 GPUs. Multi-GPU inference is supported, which provides higher inference speeds. To enable multi-GPU inference, increase the number of GPUs defined in the limits key of the InferenceService manifest.
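As a rough sketch (the real container definition lives in 02-inference-service.yaml), the GPU count is the nvidia.com/gpu resource limit on the predictor container:
# Sketch only; other required fields (image, memory, CPU, node affinity) are omitted.
spec:
  predictor:
    containers:
      - name: kfserving-container
        resources:
          limits:
            nvidia.com/gpu: 2   # increase this value for multi-GPU inference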
GPUs with less VRAM may also work for this exercise, down to a minimum of 16GB (for example, the Quadro RTX 5000).
Autoscaling
Autoscaling is controlled in the InferenceService configuration. By default, this example is set to always run one replica, regardless of the number of requests.
Increasing maxReplicas allows the CoreWeave infrastructure to automatically scale up replicas when there are multiple outstanding requests to your endpoint. Replicas are automatically scaled down as demand decreases.
Example
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1
By setting minReplicas to 0, Scale-to-Zero can be enabled. This feature completely scales down the InferenceService after no requests have been received for a set period of time.
When the Service is scaled to zero, no cost is incurred.
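For example, a Scale-to-Zero configuration that still allows scaling out under load might look like the following; the maxReplicas value here is illustrative:
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5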