Transformers Accelerate: BigScience BLOOM

Deploy BigScience BLOOM as an Inference Service using the Accelerate library

BigScience BLOOM is an autoregressive Large Language Model (LLM), trained on vast amounts of text data using industrial-scale computational resources to generate new text from a prompt.

It is capable of outputting coherent, functional, and stylistically consistent text in 46 human languages as well as 13 programming languages. BLOOM can also be instructed to perform text-based tasks for which it hasn't been explicitly trained, by casting the tasks as text generation tasks.

The following tutorial demonstrates how to deploy BigScience BLOOM as an Inference Service on CoreWeave Cloud, complete with a simple HTTP API for Text Generation, while leveraging Hugging Face's Accelerate library. The deployment runs on NVIDIA A100 GPUs with autoscaling and Scale-to-Zero enabled.

Prerequisites

This tutorial assumes that kubectl is already installed on your system.

It also assumes you already have an active CoreWeave Cloud account, with account credentials configured and a working Kubeconfig file installed on your system.

Example source code

To follow along with this tutorial, first clone the BLOOM example code from the CoreWeave kubernetes-cloud repository.

The online-inference/bloom-176b directory contains all of the manifests used to deploy the resources, as well as everything needed to build your own Docker image, if desired.

Advanced options

In this tutorial, public images of the BLOOM container are used. However, if you would prefer to build and push your own image, see "Build your own image" below. For those instructions, a public registry on Docker Hub is used.

Container images may alternatively be pushed to CoreWeave Object Storage.

Note

To use a private registry, an imagePullSecret must be defined. This is outside the scope of this tutorial.

Hardware and performance

This example is set to use five NVIDIA A100 PCIe GPUs.

Please be aware that text-generation performance with the current codebase is suboptimal. We are actively working to integrate optimizations into the BLOOM example. This tutorial is primarily for demonstration purposes; optimized performance should not currently be expected.

The highest-performing GPU configuration for a production-grade deployment of BLOOM is 8x A100 NVLINK 80GB GPUs. To access these, please contact support.

Procedure

Once the tutorial source code is cloned to your local system, change directories to kubernetes-cloud/online-inference/bloom-176b.

$ cd kubernetes-cloud/online-inference/bloom-176b

This directory contains the following files:

  • model: A directory containing the files used to build the BLOOM container image

  • 00-bloom-176b-pvc.yaml: The Kubernetes manifest used to create the PVC, in which the model will be stored

  • 01-bloom-176b-download-job.yaml: The Kubernetes manifest used to deploy the download job, which downloads the model into the created PVC

  • 02-bloom-176b-inferenceservice.yaml: The Kubernetes manifest used to deploy the Inference Service that serves the model

The model directory contains the following files, used to create the BLOOM container image:

  • Dockerfile: The Dockerfile used to create the BLOOM container image

  • bloom.py: Utilized by the download-job manifest to download the model into the PVC

  • requirements.txt: The Python requirements for bloom.py

  • scripts/download-model: A script that automates downloading the Hugging Face model

Deploy the Kubernetes resources

Note

The manifests in this tutorial's source code are configured by default to point to publicly available BLOOM images. If you are building and pushing your own image instead, be sure to adjust the value of image in the download-job and inferenceservice manifests so that image points to your image source prior to deploying the Kubernetes resources.

Next, deploy each of the Kubernetes resources. All resource manifests are contained within the online-inference/bloom-176b directory.

Deploy the PVC

In order to store the model, a PVC is needed. To create the PVC, apply the 00-bloom-176b-pvc.yaml manifest.

$ kubectl apply -f 00-bloom-176b-pvc.yaml
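For orientation, a PVC manifest of this general shape is what the file defines. The sketch below is illustrative only: the metadata name, storage class, access mode, and size shown here are assumptions, and the authoritative values are in 00-bloom-176b-pvc.yaml in the repository.

```yaml
# Illustrative sketch of a PVC for storing the model weights.
# All concrete values here are assumptions; see 00-bloom-176b-pvc.yaml
# in the repository for the actual manifest.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bloom-176b-model-cache   # hypothetical name
spec:
  accessModes:
    - ReadWriteMany              # shared by the download job and the Inference Service
  storageClassName: shared-nvme-ord1   # hypothetical CoreWeave storage class
  resources:
    requests:
      storage: 500Gi             # the model itself is roughly 329Gi
```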

Deploy the model download job

Note

The model is quite large at 329Gi, so the download job may take around 30 minutes to complete.

Next, the model will be downloaded to the PVC. To deploy the job to download the model, apply the 01-bloom-176b-download-job.yaml manifest.

$ kubectl apply -f 01-bloom-176b-download-job.yaml

To check the model's download status, follow the job logs to monitor progress using kubectl logs:

$ kubectl logs -l job-name=bloom-176b-download --follow

Or, wait for the job to be in a Completed state:

$ kubectl get pods

NAME                        READY   STATUS      RESTARTS   AGE
bloom-176b-download-hkws6   0/1     Completed   0          1h

Deploy the Inference Service

Once the model has finished downloading, deploy the Inference Service by applying the final manifest, 02-bloom-176b-inferenceservice.yaml.

$ kubectl apply -f 02-bloom-176b-inferenceservice.yaml

Note

Due to the size of the model, loading into GPU memory can take around 5 minutes.

To monitor the loading progress, either watch for the KServe workers to start in the Pod logs using kubectl logs:

$ kubectl logs -f -l serving.kubeflow.org/inferenceservice=bloom-176b -c kfserving-container

Or, wait for the InferenceService to show that READY is True, and that it has a URL:

$ kubectl get inferenceservices

NAME         URL                                                                            READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                  AGE
bloom-176b   http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud   True           100                              bloom-176b-predictor-default-00001   9m2s

Submit a request

The URL shown in the Inference Service description is the endpoint to which HTTP requests may be sent using your preferred means.

For example, here is a request submitted using cURL:

$ curl http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud/v1/models/bigscience-bloom:predict -d '{"instances": ["That was fun"]}' | jq
{
  "predictions": [
    [
      {
        "generated_text": "That was fun.\n- Yeah, it was good to see you guys again.\nYeah, you too.\nYou know what?\nI think I'm gonna go home and get some sleep.\nI'm beat.\n-"
      }
    ]
  ]
}

The following parameters are supported:

  • min_length

  • max_length

  • temperature

  • top_k

  • top_p

  • repetition_penalty

Model parameters are adjusted by including a parameters object in the request body. For example:

$ curl http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud/v1/models/bigscience-bloom:predict -d '{"instances": ["This will generate some text"], "parameters":{"max_length": 20, "min_length": 20}}'
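The same request can also be made programmatically. The following is a minimal Python sketch of the request and response shapes shown above; build_payload, extract_generated_text, and predict are illustrative helper names (not part of the tutorial's code), the predict helper assumes the third-party requests package, and the endpoint host must be replaced with the URL reported by kubectl get inferenceservices.

```python
def build_payload(prompt, **params):
    """Build a KServe v1 :predict request body with optional generation parameters
    (min_length, max_length, temperature, top_k, top_p, repetition_penalty)."""
    body = {"instances": [prompt]}
    if params:
        body["parameters"] = params
    return body

def extract_generated_text(response_json):
    """Pull the generated_text string out of a :predict response."""
    return response_json["predictions"][0][0]["generated_text"]

def predict(endpoint, prompt, **params):
    """POST a prompt to the Inference Service; requires the third-party `requests` package."""
    import requests  # pip install requests
    url = f"{endpoint}/v1/models/bigscience-bloom:predict"
    resp = requests.post(url, json=build_payload(prompt, **params))
    resp.raise_for_status()
    return extract_generated_text(resp.json())

# Example (replace the host with your own Inference Service URL):
# predict("http://bloom-176b.tenant-example.knative.ord1.coreweave.cloud",
#         "This will generate some text", max_length=20, min_length=20)
```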

Autoscaling options (advanced)

Scaling is controlled in the InferenceService manifest. For this tutorial, the manifest is set to always run one replica, regardless of the number of requests.

Increasing maxReplicas allows the CoreWeave infrastructure to automatically create additional replicas when there are multiple outstanding requests to your endpoint.

For example:

spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4

Replicas will then automatically be scaled down as demand decreases.

Scale-to-Zero

By setting minReplicas to 0, Scale-to-Zero is enabled, which will completely scale down the InferenceService if there have been no requests for a period of time.

When a service is scaled to zero, no cost is incurred.
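In terms of the manifest snippet shown above, enabling Scale-to-Zero is a one-line change:

```yaml
spec:
  predictor:
    minReplicas: 0   # enables Scale-to-Zero
    maxReplicas: 4
```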

Note

Due to the size of the BLOOM model, enabling Scale-to-Zero may lead to very long request completion times if the model has to be scaled up from zero replicas. This can take up to around 5 minutes.

Build your own image (advanced)

Log in to Docker Hub (if applicable)

If you are using Docker Hub, log in now.

$ docker login

Set your Docker Hub username as the environment variable DOCKER_USER, substituting your own username for coreweave in the example below.

$ export DOCKER_USER=coreweave

Build the Docker image

Warning

CoreWeave strongly discourages using the default Docker tag latest. Containers are cached on the nodes and in other parts of the CoreWeave stack, so using this tag can cause issues. Instead, use a simple versioning system.

Similarly, once you have pushed to a tag, do not push to the same tag again.

Using the provided Dockerfile, build the BLOOM image. For this tutorial, a very simple versioning scheme of single numbers is used, with the tag 1 denoting the first iteration of the image.

$ docker build -t $DOCKER_USER/bloom-176b:1 -f Dockerfile .

Next, push the built and tagged image to the Docker Hub container registry.

$ docker push $DOCKER_USER/bloom-176b:1

Once the image is pushed to the registry, continue the rest of this tutorial, starting at Deploy the Kubernetes resources.
