Fine-tune Stable Diffusion Models with CoreWeave Cloud

Fine-tune and train Stable Diffusion models using Argo Workflows

Fine-tuning and training Stable Diffusion can be computationally expensive, but CoreWeave Cloud allows you to train Stable Diffusion models with on-demand compute resources and infrastructure that scale down to zero active pods, incurring no charges, after training is complete.

This guide is a reference example of how to use an Argo Workflow to create a pipeline at CoreWeave to fine-tune and train Stable Diffusion models. It's a working demonstration to get you started, but it's not intended to be a production application.

important

This article covers both the DreamBooth and Textual Inversion training methods. Most of the steps are the same for both methods. Where they differ, tabbed sections indicate which training method applies.

Prerequisites

This guide contains all the information required to train Stable Diffusion, but assumes that you have already followed the process to set up the CoreWeave Kubernetes environment. If you have not done so already, follow the steps in Cloud Account and Access before proceeding.

It also assumes you are familiar with the topics covered in these articles.

Resources

Hardware

This reference example uses the following container configuration for training Stable Diffusion models, but you can use any configuration that meets the minimum requirements. This configuration currently costs $1.52 per hour under CoreWeave's resource-based pricing model.

  • 8 vCPU (AMD EPYC)
  • 32GB RAM
  • NVIDIA A40/A6000 GPUs (48GB VRAM)

An optional test Inference endpoint can be enabled and deployed automatically when the model completes fine-tuning. This Inference container defaults to the following configuration, which currently costs $0.65 per hour with resource-based pricing.

  • 4 vCPU
  • 8GB RAM
  • NVIDIA Quadro RTX 5000 (16GB VRAM)
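
As a rough illustration of the arithmetic, the two hourly rates quoted above can be combined into a cost estimate for a run. This is only a sketch: the function name and the hour counts are made up for the example, and the rates are the ones quoted in this guide, which may change.

```python
# Hourly rates quoted in this guide; actual prices may differ.
TRAINER_RATE_USD = 1.52    # 8 vCPU, 32GB RAM, A40/A6000
INFERENCE_RATE_USD = 0.65  # 4 vCPU, 8GB RAM, Quadro RTX 5000


def estimate_cost(train_hours: float, inference_hours: float = 0.0) -> float:
    """Return the estimated cost in USD, rounded to cents."""
    total = train_hours * TRAINER_RATE_USD + inference_hours * INFERENCE_RATE_USD
    return round(total, 2)


# Hypothetical run: 2 hours of training, then 1 hour of test inference.
print(estimate_cost(2, 1))
```

Because the inference service scales to zero when idle, only the hours during which pods are active would count toward `inference_hours` in this estimate.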

GitHub repository

To follow this guide, clone the latest version of the CoreWeave kubernetes-cloud repository and navigate to the project directory for your preferred fine-tuning method:

Understanding the Argo Workflows

Each of the Argo Workflow templates used in the examples has a similar structure. They consist of three important sections:

  • Workflow parameters
  • Main template
  • Individual step templates

Throughout the file, you will see many template tags, surrounded by double braces: {{ and }}. Many of these are simple variable substitutions using workflow and step parameters. Expression template tags that start with {{= contain expr code.
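
To make the simple-substitution case concrete, here is a toy Python sketch of how a tag that names a workflow parameter could be replaced by its value. It is not Argo's implementation: `render` and `SIMPLE_TAG` are names invented for this example, and only tags of the form `{{workflow.parameters.<name>}}` are handled. Expression tags that start with `{{=` are left untouched.

```python
import re

# Matches simple tags such as {{workflow.parameters.run_name}}.
# Expression tags ({{= ... }}) do not match and are left alone.
SIMPLE_TAG = re.compile(r"\{\{\s*workflow\.parameters\.([A-Za-z0-9_]+)\s*\}\}")


def render(text: str, params: dict[str, str]) -> str:
    """Replace simple workflow-parameter tags with their values."""
    return SIMPLE_TAG.sub(lambda m: params[m.group(1)], text)


params = {"pvc": "db-finetune-data", "run_name": "example-dog"}
template = "/{{workflow.parameters.pvc}}/finetunes/{{workflow.parameters.run_name}}"
print(render(template, params))
# prints /db-finetune-data/finetunes/example-dog
```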

Parameters

All of the Workflow parameters and their default values are defined at the top of the workflow templates, and cover the following categories:

  • Fine-tuning hyperparameters
  • File paths
  • Workflow step controls
  • Container images

All the parameters have suitable defaults, but make sure to review them and adjust according to your needs.

Main template

The workflow is defined by a main template that lists the steps in the order they should run. Each step has its parameters defined, and some steps include a when value that tells the workflow when the step should be skipped.

Step templates

The step templates define how the job will be run, including the container image, resource requests, command, etc.

The step that creates the inference service is different because it applies a manifest to the cluster instead of running a custom job. This manifest defines the inference service. For this reason, the guide will ask you to create a service account with permission to create inference services. The workflow will then use this service account to apply the manifest.

The inference service step has custom-defined success and failure conditions. Without these, Argo will mark the step as succeeded as soon as it successfully applies the manifest. With these custom conditions, Argo monitors the new inference service and only considers the step complete after the inference service starts successfully. This makes it easy to run additional steps afterwards that use the inference service, such as creating batches of images.
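
The idea behind such conditions can be sketched as a small classification function. This is a toy model written for illustration, not the conditions used in the workflow template: it assumes the resource reports a list of conditions under `status.conditions`, each with a `type` and a `status` string, and it treats a `Ready` condition of `False` as a failure, which a real template may handle differently.

```python
def step_outcome(resource: dict) -> str:
    """Classify a resource as 'succeeded', 'failed', or 'pending'.

    Assumption: the resource exposes status.conditions, a list of
    dicts with 'type' and 'status' keys (values "True"/"False"/"Unknown").
    """
    conditions = resource.get("status", {}).get("conditions", [])
    ready = next((c for c in conditions if c.get("type") == "Ready"), None)
    if ready is None or ready.get("status") == "Unknown":
        # The manifest was applied, but the service has not reported yet.
        return "pending"
    return "succeeded" if ready.get("status") == "True" else "failed"


# A freshly applied manifest has no status, so the step keeps waiting.
print(step_outcome({"kind": "InferenceService"}))
print(step_outcome({"status": {"conditions": [{"type": "Ready", "status": "True"}]}}))
```

Without a check of this kind, the step would be considered finished at the first line of output, before the service is able to answer requests.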

Triggering the Argo Workflows

This guide offers two ways to deploy everything needed to trigger the workflows.

The first is through the Argo Workflows UI. From there, you can see all of the deployed workflow templates. Clicking on one allows you to submit a new run after editing any of the parameters' default values.

The second is through the Argo REST API's /api/v1/events/<namespace>/<discriminator> endpoint. The discriminator is defined in a WorkflowEventBinding deployed alongside each WorkflowTemplate.

note

You can view all of the available endpoints in your Argo Workflows deployment by clicking on the API Docs button in the sidebar of the UI.

About the fine-tuning methods

This guide explains how to deploy an Argo Workflow to fine-tune a Stable Diffusion base model on a custom dataset, then use the fine-tuned model in an inference service.

The base model can be provided directly in a PVC (PersistentVolumeClaim), or as a Stable Diffusion model identifier from Hugging Face's model repository. The training dataset needs to be in the same PVC, in text and image format.

As described earlier, you can choose one of two methods to train the base model: DreamBooth or Textual Inversion. Here's a short orientation to each before getting started.

DreamBooth method

The DreamBooth method allows you to fine-tune Stable Diffusion on a small number of examples to produce images containing a specific object or person. This method for fine-tuning diffusion models was introduced in a paper published in 2022, DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. A lighter introductory text was also released along with the paper in this blog post.

To summarize, the DreamBooth method is a way to teach a diffusion model about a specific object or style using approximately three to five example images. After the model is fine-tuned on a specific object using DreamBooth, it can produce images containing that object in new settings.

The DreamBooth method uses "Prior Preservation Loss", which means a class-specific loss is combined with the loss from your custom dataset. For example, when using the DreamBooth method to teach the model about a specific dog, the model will also be fine-tuned against generic images of dogs. This helps prevent the model from forgetting what normal dogs look like.
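
The way the two losses combine can be sketched in a few lines of Python. This is a simplified illustration, not the fine-tuner's code: it uses plain mean squared error on toy numbers, the function names are invented here, and `prior_loss_weight` mirrors the workflow parameter of the same name.

```python
def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target, class_pred, class_target,
                    prior_loss_weight: float = 1.0) -> float:
    """Loss on the custom (instance) images plus the weighted class loss."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss


# Toy numbers standing in for model outputs on one instance and one class image.
print(dreambooth_loss([0.5, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 3.0]))
# prints 2.125
```

Setting `prior_loss_weight` to 0 in this sketch removes the class term entirely, which corresponds to fine-tuning on the custom dataset alone.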

In the paper a special token, "sks", is used in the prompts for the custom dataset. It is not necessary to use a special token like "sks", but it allows you to use this token in inference prompts to create images containing the dog in the custom dataset. The "sks" token was chosen because it appears very rarely in the data used to train the text encoder.

Textual Inversion method

The Textual Inversion training method captures new concepts from a small number of example images and associates the concepts with words from the pipeline's text encoder. The model then uses these words and concepts to create images from text prompts with fine-grained control. Textual Inversion was introduced in the 2022 paper, An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion.

The Textual Inversion examples in this guide allow you to fine-tune Stable Diffusion with your own dataset using the same technique used for pre-training.

Example templates

The GitHub repository for this guide has template files for both training methods. Refer to the tables below to learn about each file.

DreamBooth Templates

Filename                         Description
db-workflow-template.yaml        The Argo Workflow Template itself.
db-workflow-event-binding.yaml   The Event Binding used to trigger the Workflow via an API call.
inference-role.yaml              The inference role you configured earlier.
db-finetune-pvc.yaml             The model storage volume described earlier.
huggingface-secret.yaml          The Hugging Face token used to download a base model.
wandb-secret.yaml                The Weights and Biases token used for reporting during fine-tuning.

Required components

The following Kubernetes-based components are required for this guide. Deploy each of them before proceeding to the dataset setup step.

Argo Workflows

Deploy Argo Workflows using the Application Catalog.

From the application deployment menu, click on the Catalog tab, then search for argo-workflows to find and deploy the application.

Argo Workflows

PVC

Create a ReadWriteMany PVC storage volume from the Storage menu. By default, this workflow uses a specific PVC depending on your fine-tune method:

note
  • The DreamBooth PVC name should be db-finetune-data
  • The Textual Inversion PVC name should be sd-finetune-data

This name can be changed in the configuration after you are familiar with the workflow.

1TB to 2TB is recommended for training Stable Diffusion models, depending on the size of the dataset and how many fine-tunes you wish to run. If you later require more space, it's easy to increase the size of the PVC as needed.

The PVC can be shared between multiple fine-tune runs. We recommend using HDD type storage, because the fine-tuner does not require high performance storage.

Configuring a PVC storage volume from the Cloud UI

If you prefer, you can also deploy the PVC with the YAML snippet for your method below, then use kubectl apply -f to apply it.

DreamBooth YAML

db-finetune-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-finetune-data
spec:
  storageClassName: shared-hdd-las1
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2000Gi

Kubernetes Secrets

The workflow requires two Kubernetes Secrets: one created from a Weights and Biases API key, the other from a Hugging Face user access token. These are used to log metrics during fine-tuning, and (potentially) to download the base model from Hugging Face.

To create the Weights and Biases Secret, encode your Weights and Biases API key with base64. Pass -n before the string so that echo does not append a newline to the encoded value.

$ echo -n "example-wandb-api-key" | base64

Output:

ZXhhbXBsZS13YW5kYi1hcGkta2V5

Add the encoded string to wandb-secret.yaml at line 3 as shown. The string will be unique to the API key.

wandb-secret.yaml
apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5kYi1hcGkta2V5
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque
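
A quick way to double-check an encoded value before applying a Secret is to decode it and compare it with the original token. The following is a minimal Python sketch using only the standard library; the helper names are invented for this example and the key is the placeholder used in this guide.

```python
import base64


def encode_secret(token: str) -> str:
    """Base64-encode a token exactly as given, with no trailing newline."""
    return base64.b64encode(token.encode("utf-8")).decode("ascii")


def decode_secret(encoded: str) -> str:
    """Decode a value taken from the data field of a Secret."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret("example-wandb-api-key")
print(encoded)

# Round-trip check: a stray newline or flag would make this comparison fail.
assert decode_secret(encoded) == "example-wandb-api-key"
```

If the decoded value contains extra characters, such as a trailing newline, authentication with the encoded token is likely to fail even though the Secret applies cleanly.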

Next, create the Hugging Face Secret by encoding the user access token from your account. As before, pass -n before the string.

$ echo -n "hugging-face-token" | base64

Output:

aHVnZ2luZy1mYWNlLXRva2Vu

Add the string to huggingface-secret.yaml. The string is unique to the token.

huggingface-secret.yaml
apiVersion: v1
data:
  token: aHVnZ2luZy1mYWNlLXRva2Vu
kind: Secret
metadata:
  name: huggingface-token-secret
type: Opaque

Apply both manifests to create the Secrets.

$ kubectl apply -f wandb-secret.yaml
$ kubectl apply -f huggingface-secret.yaml

Optional Component: filebrowser

The filebrowser component is optional, but may make your interaction easier. This application allows you to easily transfer files and folders to and from your PVC. You can deploy filebrowser from the Application Catalog.

We recommend using a short name, such as finetune, for the filebrowser application to avoid SSL CNAME issues. When deploying filebrowser, make sure to add the PVC created earlier to the filebrowser list of mounts.

tip

You may prefer to use a Virtual Server to interact with the PVC via ssh, or use some other mechanism. This flexibility is one of CoreWeave's key advantages.

The filebrowser application

Dataset Setup

At this point, you should have a PVC set up that is accessible via the filebrowser application or some other mechanism. Now it's time to populate the PVC with your dataset.

Select the tab for your chosen fine-tuning method.

DreamBooth Dataset

For each dataset you want to use, create a directory with a meaningful name such as data/example-dog and place your dataset images in that directory.

An example dataset containing images of a dog

The fine-tuner uses Prior Preservation Loss, which means "generic" images (also known as class images) are used during fine-tuning. The fine-tuner generates these "generic" images before starting the training loop, using the base model and a provided prompt, but you can also upload them to a separate folder in the PVC. For example, if you are fine-tuning the model on pictures of your dog, you would use images of random dogs as the "generic" images. By default, the workflow uses 100 class images.

An example class images dataset of generic dogs
note

These generic datasets can be reused for different fine-tuned models.
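
If you upload class images yourself, it can be useful to check how many are present compared with the number the workflow expects. The sketch below counts image files in a folder and reports the shortfall. It is an illustration under assumptions: the list of file extensions is a guess, the function names are invented here, and how the fine-tuner treats a partially filled folder is not confirmed by this guide.

```python
from pathlib import Path
import tempfile

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of formats


def count_images(folder: Path) -> int:
    """Count files in folder whose extension looks like an image."""
    return sum(1 for p in folder.iterdir()
               if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)


def class_images_missing(folder: Path, num_class_images: int = 100) -> int:
    """How many class images are still needed to reach the target."""
    return max(0, num_class_images - count_images(folder))


# Demo on a throwaway directory standing in for a class image folder on the PVC.
with tempfile.TemporaryDirectory() as tmp:
    class_dir = Path(tmp)
    for i in range(3):
        (class_dir / f"dog-{i}.png").touch()
    (class_dir / "notes.txt").touch()  # ignored: not an image
    missing = class_images_missing(class_dir)
    print(missing)
# prints 97
```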

Permissions Setup

In order to automatically create an InferenceService, the Argo Workflow job needs special permissions, which are declared in inference-role.yaml. Apply that manifest to grant the permissions.

$ kubectl apply -f inference-role.yaml

Deploy the Workflow template

To deploy the workflow template, you can use kubectl or the Argo CLI.

Select the tab for your chosen fine-tuning method.

DreamBooth Workflow deployment

To use kubectl, run the following command:

kubectl apply -f db-workflow-template.yaml

To use Argo's CLI, run the following command:

argo template create db-workflow-template.yaml

Run the Workflow

You can trigger runs of the workflow from the Argo UI, or by setting up a webhook.

Use Argo Workflows UI

Once the workflow template is applied, you should see it in the Argo Workflows UI. An example of the Textual Inversion method is shown below. If you use the DreamBooth method, everything is the same except that the name will be db-finetune-template.

Deployed Workflow Template in the Argo UI

To trigger a new run of the workflow through the UI, click on the template, then the submit button, then change the necessary parameters. The most common parameters are shown below, but there are many other workflow parameters you may want to review.

DreamBooth parameters

Parameter          Description
run_name           The workflow name, also used in WandB logs.
instance_dataset   The populated dataset directory.
instance_prompt    The prompt with identifier specifying the instance.
class_dataset      The path where generic images are located in the filebrowser.
class_prompt       The prompt to specify images in the same class as provided instance images.
output             The output directory.

Use Webhook

To trigger workflow runs by calling an Argo REST endpoint, you first need to deploy a WorkflowEventBinding. This sets a custom discriminator that tells Argo how to map an endpoint to the workflow template you just deployed.

Select the tab for your chosen fine-tuning method.

DreamBooth WorkflowEventBinding

The WorkflowEventBinding is defined in db-workflow-event-binding.yaml and the discriminator is set to db-finetune. It also maps a few values in the body of the HTTP request to parameters in the workflow template as an example.

To deploy the WorkflowEventBinding, run the following:

kubectl apply -f db-workflow-event-binding.yaml

Now you can trigger the workflow with the /api/v1/events/<namespace>/db-finetune endpoint.

The domain used for the API is the same one used to navigate to the UI. You can find the URL by running kubectl get ingress.

The namespace in the URL is the Kubernetes namespace where you've deployed all of the resources. To find your default namespace, run:

kubectl config view --minify --output 'jsonpath={..namespace}'

The Argo API uses the same authentication that you used to log in to the UI. For more information about generating the token, see Get Started with Workflows.

Use the information you've collected above to complete the bash commands below, which will hit the endpoint to trigger workflow runs.

DreamBooth Endpoint

export ARGO_API="<enter the Argo domain, without https://>"
export NAMESPACE="<enter your k8s namespace>"
export ARGO_TOKEN="<enter your Argo token>"

export INSTANCE_DATASET="<path to instance dataset>"
export INSTANCE_PROMPT="<instance prompt to use>"
export CLASS_DATASET="<path to the class dataset>"
export CLASS_PROMPT="<class prompt to use>"
export NUM_CLASS_IMAGES="<number of class images to generate and/or use>"
export OUTPUT="<path to folder where the model will be saved>"

curl --location "https://${ARGO_API}/api/v1/events/${NAMESPACE}/db-finetune" \
--header "Authorization: ${ARGO_TOKEN}" \
--header 'Content-Type: application/json' \
--data "{
\"run_name\": \"example-dog\",
\"instance_dataset\": \"${INSTANCE_DATASET}\",
\"instance_prompt\": \"${INSTANCE_PROMPT}\",
\"class_dataset\": \"${CLASS_DATASET}\",
\"class_prompt\": \"${CLASS_PROMPT}\",
\"num_class_image\": \"${NUM_CLASS_IMAGES}\",
\"output\": \"${OUTPUT}\"
}"
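
The same request can also be prepared from Python, which avoids hand-escaping quotes in the JSON body. This is a sketch under assumptions rather than an official client: the host and token are placeholders, the function name is invented here, the token is assumed to be sent in an Authorization header, and the body keys simply mirror the curl example above. The request is built but not sent.

```python
import json
import urllib.request


def build_event_request(argo_api: str, namespace: str, token: str,
                        params: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the db-finetune event."""
    url = f"https://{argo_api}/api/v1/events/{namespace}/db-finetune"
    body = json.dumps(params).encode("utf-8")  # json handles quoting and escaping
    headers = {"Authorization": token, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_event_request(
    "argo.example.com",       # placeholder domain
    "tenant-example",
    "Bearer example-token",   # placeholder token
    {
        "run_name": "example-dog",
        "instance_dataset": "data/example-dog",
        "instance_prompt": "a photo of sks dog",
        "class_dataset": "generic/dogs-2",
        "class_prompt": "a photo of dog",
        "num_class_image": "100",
        "output": "finetunes/docs-example",
    },
)
print(request.full_url)
# To actually trigger a run: urllib.request.urlopen(request)
```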

Observe the Workflow

At this point, we can observe the running workflow via several mechanisms.

argo list

Using the argo list command, you can see information about all of the workflows. Use this command to find the name of the workflow that was just launched.

You can also filter by statuses. To get all running workflows, run:

$ argo list --status Running
  • For DreamBooth, the output should look like this:
NAME                         STATUS    AGE   DURATION   PRIORITY   MESSAGE
db-finetune-template-4fe7b   Running   3m    3m         0
  • For Textual Inversion, the output should look like this:
NAME                         STATUS    AGE   DURATION   PRIORITY   MESSAGE
sd-finetune-template-5zx10   Running   2m    2m         0

argo watch

Invoking argo watch <workflow name> tells Argo that we want to watch the job as it goes through all of its stages. Here is example output:

DreamBooth Output

Name:                db-finetune-template-4fe7b
Namespace: tenant-example
ServiceAccount: inference
Status: Running
Conditions:
PodRunning True
Created: Mon Apr 10 11:32:53 -0400 (3 minutes ago)
Started: Mon Apr 10 11:32:53 -0400 (3 minutes ago)
Duration: 3 minutes 22 seconds
Progress: 1/2
ResourcesDuration: 13s*(1 cpu),1m26s*(100Mi memory)
Parameters:
run_name: example-dog
pvc: db-finetune-data
model: stabilityai/stable-diffusion-2-1-base
instance_dataset: data/example-dog
instance_prompt: a photo of sks dog
prior_loss_weight: 1
class_dataset: generic/dogs-2
class_prompt: a photo of dog
output: finetunes/docs-example
num_class_images: 100
lr: 2e-6
lr_scheduler: constant
lr_warmup_steps: 0
batch_size: 1
epochs: 4
seed: 42
checkpointing_steps: 200
image_log_steps: 100
image_log_amount: 4
resolution: 512
use_tensorizer: true
run_inference: true
inference_only: false
region: LAS1
trainer_gpu: A40
trainer_gpu_count: 1
inference_gpu: Quadro_RTX_5000
downloader_image: ghcr.io/wbrown/gpt_bpe/model_downloader
downloader_tag: e2ef65f
finetuner_image: navarrepratt/sd-finetuner
finetuner_tag: df-14
serializer_image: navarrepratt/sd-serializer
serializer_tag: df-14
inference_image: navarrepratt/sd-inference
inference_tag: df-14-3

STEP TEMPLATE PODNAME DURATION MESSAGE
● db-finetune-template-4fe7b main
├───✔ downloader(0) model-downloader db-finetune-template-4fe7b-model-downloader-956898090 49s
└───● finetuner model-finetuner db-finetune-template-4fe7b-model-finetuner-1551742686 2m

argo logs

Invoking argo logs -f <workflow name> watches the logs in real time.

important

If this process appears to hang while outputting the message Loading the model, this is due to a bug in the terminal display code which is exposed during initial model download and caching. To fix this, kill the relevant pod or job, then resubmit it. This should result in the proper progress display.

During fine-tuning, the elapsed time is displayed alongside the expected time to completion. Checkpointing and loss are also reported in the logs and in WandB.

note

You can instantly watch a submitted workflow by using the --watch option when running the submit command:

argo submit --watch

WandB Logging

Logs for the fine-tuning workflow can be tracked and visualized using Weights & Biases (WandB).

Generated samples during fine-tuning

The Media tab shows images generated during fine-tuning, sampled once every image_log_steps steps. Adjust this parameter to change how often the model is sampled during fine-tuning.

Performance metrics

The Performance tab shows GPU throughput, measured in samples per second.

Fine-tuning metrics

The Training tab records a range of fine-tuning metrics that indicate whether the workflow is making progress by reducing loss over time. These metrics are useful for judging whether the model has reached convergence.
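
One simple way to read a loss curve is to smooth it and check whether the smoothed value is still dropping. The sketch below is a rough heuristic written for illustration, not something the workflow or WandB provides: the function names, the window size, and the tolerance are all arbitrary choices, and the loss values are made up.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; returns len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def has_stopped_improving(losses: list[float], window: int = 3,
                          tolerance: float = 0.01) -> bool:
    """True when the smoothed loss dropped by less than `tolerance`
    between the last two windows. Both defaults are arbitrary."""
    smoothed = moving_average(losses, window)
    if len(smoothed) < 2:
        return False  # not enough data to compare two windows
    return smoothed[-2] - smoothed[-1] < tolerance


# Made-up loss values: still falling early on, flat at the end.
print(has_stopped_improving([0.9, 0.6, 0.4, 0.3]))
print(has_stopped_improving([0.9, 0.6, 0.4, 0.3, 0.25, 0.25, 0.25, 0.25]))
```

A flat curve by this measure only suggests convergence; the sample images in the Media tab remain the more direct check of output quality.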

Web UI

You can access your Argo Workflow application via the web UI to see all the fine-tuner jobs, and to check their statuses.

Argo Workflow Web UI

Artifacts

Once the model completes fine-tuning, the model artifacts are available in the respective directory:

  • For DreamBooth, the directory is supplied to the workflow as the output parameter.
  • For Textual Inversion, the directory name pattern is {{pvc}}/finetunes/{{run_name}}.

You can download the model from the respective location.

Inference

If you set up the inference service, you can query the URL endpoint to test the model.

  • If the Knative client is installed, get the URL by invoking kn service list.
  • Retrieve the URL without the Knative client by executing kubectl get ksvc.

See the example output for your fine-tune method:

DreamBooth Output

NAME                                      URL                                                                                        LATESTCREATED                                   LATESTREADY                                     READY   REASON
inference-example-dog-predictor-default   https://inference-example-dog-predictor-default.tenant-example.knative.chi.coreweave.com   inference-example-dog-predictor-default-00001   inference-example-dog-predictor-default-00001   True
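
If you want to script the lookup, the URL column can be pulled out of that output and combined with the predict path used in the test query. The following Python sketch assumes whitespace-separated columns with the name first and the URL second, as in the example output; the function name and the model name `my-model` are placeholders invented for this example.

```python
def predict_url(ksvc_output: str, service_name: str, model_name: str) -> str:
    """Find service_name in `kubectl get ksvc` output and build its predict URL."""
    for line in ksvc_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[0] == service_name:
            return f"{fields[1]}/v1/models/{model_name}:predict"
    raise LookupError(f"service {service_name!r} not found")


sample = (
    "NAME URL LATESTCREATED LATESTREADY READY REASON\n"
    "inference-example-dog-predictor-default "
    "https://inference-example-dog-predictor-default.tenant-example.knative.chi.coreweave.com "
    "inference-example-dog-predictor-default-00001 "
    "inference-example-dog-predictor-default-00001 True\n"
)
url = predict_url(sample, "inference-example-dog-predictor-default", "my-model")
print(url)
```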

Test Query

To run a test query, run the curl example for your fine-tune method with the URL retrieved from the previous step.

DreamBooth Test Query

Replace <output parameter> with the value from your run.

curl https://inference-example-dog-predictor-default.tenant-example.knative.chi.coreweave.com/v1/models/<output parameter>:predict \
-d '{"prompt": "A photo of sks dog at the beach", "parameters": {"seed": 42, "width": 512, "height": 512}}' \
--output beach_dog.png

The above command should produce an image similar to:

A photo of sks dog at the beach

The model and dataset have now been run through the fine-tuning process, allowing test inferences against the new model.

Cleaning Up

Once you are finished with the example, you can delete all of the resources that were created.

First, run kubectl delete -f <file name> for each of the YAML files that you used to deploy resources.

note

For the PVC to be deleted, everything using it will need to be deleted first.

You can delete the Argo deployment through the CoreWeave Cloud UI from the Application page.

To delete all of the inference services that were created from the workflow runs, you need to use the kubectl delete isvc <inference service name> command. In order to see all of the current inference services, you can run kubectl get isvc.

note

By default, each inference service scales down to zero active pods after 30 minutes without use. While scaled down to zero, it incurs no charges because no compute is in use.

This concludes the demonstration. Now that you know how to run a simple Argo workflow at CoreWeave, you can expand this example for production jobs.