Transformers DeepSpeed: BigScience BLOOM

An example of a Hugging Face Transformers implementation of the BigScience BLOOM 176B parameter model, optimized with Microsoft's DeepSpeed and pre-sharded model weights.

This example demonstrates how to deploy BLOOM as an InferenceService with a simple HTTP API to perform text generation, while leveraging Hugging Face's Transformers and Accelerate libraries, including the DeepSpeed plugin for Accelerate.

The deployment runs a DeepSpeed-optimized, pre-sharded version of the model on CoreWeave Cloud NVIDIA A100 80GB GPUs networked by NVLink, with autoscaling and Scale To Zero. This example uses the Hugging Face BLOOM Inference Server under the hood, wrapping it as an InferenceService on CoreWeave.

note

Please contact CoreWeave Support to access NVIDIA A100 80GB GPUs.

What is BLOOM?

BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources.

– Hugging Face BigScience BLOOM

BigScience BLOOM is able to output coherent text in 46 human languages and 13 programming languages. BLOOM can also be instructed to perform text tasks that it hasn't been explicitly trained for by casting them as text generation tasks.

What is DeepSpeed?

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

– Microsoft DeepSpeed

tip

Code samples are available on CoreWeave's kubernetes-cloud repository.

Pre-requisites

If you'd like to follow along with this demo, clone the manifests provided on GitHub.

The following pieces must be installed and configured prior to running the example.

kubectl
docker
A CoreWeave Account (with kubectl configured to use your CoreWeave Kubeconfig)
A Docker Hub Account

Overview

No modifications are needed to any of the files to follow along with this demo. The general procedure for this example is:

Build and push the Docker images to a container registry, in this case Docker Hub.
Deploy the Kubernetes resources:
1. A PVC in which to store the model.
2. A Batch Job to download the model. The model is quite large at roughly 329Gi, and the download takes around 30 minutes to complete.
3. The CoreWeave InferenceService.
Perform Text Generation using the model by sending HTTP requests to the InferenceService.

Procedure

Build and push the Docker images

First, enter the kubernetes-cloud/online-inference/bloom-deepspeed directory. From here, build and push the Docker images; we need one for the model downloader, and one to run the model.

important

The default Docker tag is latest. We strongly discourage using this tag, as containers are cached on the nodes and in other parts of the CoreWeave stack.

Once you have pushed to a tag, do not push to that tag again. Below, we use simple versioning by using tag 1 for the first iteration of the image.

note

In the following commands, be sure to replace the example username with your Docker Hub username.

From the kubernetes-cloud/online-inference/bloom-deepspeed directory, run the following commands:

docker login
export DOCKER_USER=coreweave
docker build -t $DOCKER_USER/huggingface-hub-downloader:1 -f Dockerfile.downloader .
docker push $DOCKER_USER/huggingface-hub-downloader:1
docker build -t $DOCKER_USER/microsoft-bloom-deepspeed-inference-fp16:1 -f Dockerfile .
docker push $DOCKER_USER/microsoft-bloom-deepspeed-inference-fp16:1

note

This example assumes a public Docker registry. To use a private registry, an imagePullSecret must be defined.

Deploy the Kubernetes resources

PVC

note

Before continuing, you may either point the image: field in the following manifests to the images built in the previous steps, or keep the publicly available images already referenced in them:

01-download-job.yaml
02-inference-service.yaml

To create a PVC to store the model, from the kubernetes-cloud/online-inference/bloom-deepspeed/ directory, run:

kubectl apply -f 00-pvc.yaml

Model job download

To deploy the job to download the model to the PVC, from the kubernetes-cloud/online-inference/bloom-deepspeed/ directory, run:

kubectl apply -f 01-download-job.yaml

note

The model is quite large at roughly 329Gi, and the download job may take around 30 minutes to complete.

To check whether the model has finished downloading, wait for the job's pod to reach the Completed status:

kubectl get po -l job-name=microsoft-bloom-deepspeed-inference-fp16-download
NAME                                                      READY   STATUS    RESTARTS   AGE
microsoft-bloom-deepspeed-inference-fp16-download-5mdd2   0/1     Pending   0          48s

Or, follow the job logs to monitor progress:

kubectl logs -l job-name=microsoft-bloom-deepspeed-inference-fp16-download --follow
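The completion check above can also be scripted. The following is a minimal sketch, not part of the original example: it assumes the Job is named microsoft-bloom-deepspeed-inference-fp16-download (inferred from the job-name label used above) and that kubectl is already configured. The helper names job_succeeded and fetch_job are hypothetical. A Job reports finished pods in its status.succeeded field, so the check only needs that one value.

```python
import json
import subprocess

# Assumption: the Job name matches the job-name label used in the commands above.
JOB_NAME = "microsoft-bloom-deepspeed-inference-fp16-download"


def job_succeeded(job: dict) -> bool:
    """Return True if a Job object (parsed from JSON) reports at least one succeeded pod."""
    return (job.get("status", {}).get("succeeded") or 0) >= 1


def fetch_job(name: str = JOB_NAME) -> dict:
    """Read the Job from the cluster. Requires kubectl and a configured kubeconfig."""
    result = subprocess.run(
        ["kubectl", "get", "job", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Offline illustration with hand-written status fragments (not real cluster output).
print(job_succeeded({"status": {"active": 1}}))     # download still running
print(job_succeeded({"status": {"succeeded": 1}}))  # download finished
```

Against a live cluster, job_succeeded(fetch_job()) can be polled in a loop with a sleep between attempts until it returns True.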

InferenceService

Once the model is downloaded, the InferenceService can be deployed by invoking:

kubectl apply -f 02-inference-service.yaml

Due to the size of the model, loading it into GPU memory can take around 5 minutes. To monitor progress, follow the pod logs until the KServe workers start:

kubectl logs -f -l serving.kubeflow.org/inferenceservice=microsoft-bloom-deepspeed-inference-fp16 -c kfserving-container

Alternatively, you can wait for the InferenceService to show that READY is True, and that it has a URL:

kubectl get inferenceservices
NAME                                       URL                                                                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                                               AGE
microsoft-bloom-deepspeed-inference-fp16   http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud   True           100                              microsoft-bloom-deepspeeda7e1fc0ba9c8977d6db7956f04d85acf-00001   16m
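The readiness check can be automated in the same spirit. This sketch assumes the InferenceService object, as returned by kubectl get inferenceservice with -o json, exposes status.url and a status.conditions list containing a Ready condition. The function name isvc_ready_url is hypothetical, and the sample object is hand-written for illustration rather than captured from a cluster.

```python
from typing import Optional


def isvc_ready_url(isvc: dict) -> Optional[str]:
    """Return the service URL once the Ready condition is "True", otherwise None."""
    status = isvc.get("status", {})
    for condition in status.get("conditions", []):
        if condition.get("type") == "Ready" and condition.get("status") == "True":
            return status.get("url")
    return None


# Hand-written sample mirroring the READY and URL columns shown above.
sample = {
    "status": {
        "url": "http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud",
        "conditions": [{"type": "Ready", "status": "True"}],
    }
}
print(isvc_ready_url(sample))
```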

Using the provided URL, you can make an HTTP request via your preferred means.

Here is a simple cURL example:

curl http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud/generate/ -H 'Content-Type: application/json' -d '{"text": ["Deepspeed is"], "repetition_penalty": 10.0}'
{"method":"generate","num_generated_tokens":[75],"query_id":1,"text":["Deepspeed is a leading provider of high-performance, low-latency data center connectivity. We offer the fastest and most reliable Internet access available in North America.\nWe are committed to providing our customers with unmatched performance at an affordable price point through innovative technology solutions that deliver exceptional value for their business needs\nOur network consists entirely on dark fiber routes between major metropolitan areas across Canada & USA"],"total_time_taken":"4.86 secs"}

tip

For a complete list of available request parameters, check the Hugging Face GitHub.

Parameters can be set per request by adding them as keys, with the desired values, to the request data:

curl http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud/generate/ -H 'Content-Type: application/json' -d '{"text": ["Deepspeed is"], "repetition_penalty": 10.0, "top_k": 10, "do_sample": true}'
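For callers who prefer Python to cURL, the same request can be sketched with the standard library alone. The /generate/ path, the text key, and the sampling parameters are taken from the cURL examples above. The function names and the 600-second default timeout are assumptions made for illustration; the timeout is chosen only to tolerate the roughly 5 minute model load mentioned earlier.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompts: list, **params) -> urllib.request.Request:
    """Build a POST request for the /generate/ endpoint used in the cURL examples."""
    payload = {"text": prompts, **params}
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompts: list, timeout: float = 600.0, **params) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(base_url, prompts, **params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (needs a running InferenceService; take the URL from `kubectl get inferenceservices`):
# result = generate(url, ["Deepspeed is"], repetition_penalty=10.0, top_k=10, do_sample=True)
# print(result["text"][0])
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any GPU time is spent.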

Autoscaling

Scaling is controlled in the InferenceService configuration. This example is set to always run one replica, regardless of the number of requests.

Increasing maxReplicas allows the CoreWeave infrastructure to automatically scale up replicas when there are multiple outstanding requests to your endpoints. Replicas are automatically scaled down as demand decreases.

Example

spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5

Setting minReplicas to 0 enables Scale To Zero, which completely scales down the InferenceService when there have been no requests for a period of time.

When a service is scaled to zero, no cost is incurred. Note that, due to the size of the BLOOM model, the first request after scaling from zero must wait for the model to load before it can complete. This can take around 5 minutes.

Hardware and Performance

This example is set to use eight NVIDIA A100 80GB NVLink GPUs, as required by Microsoft's pre-sharded weights. This combination offers the highest available throughput for a production-grade deployment.
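A back-of-the-envelope calculation shows why this hardware fits the model. The sketch below assumes every weight is stored in fp16 (2 bytes per parameter, as the image name suggests) and that the weights are split evenly across the eight GPUs. Both are simplifications, not statements about how the DeepSpeed shards are actually laid out. Under those assumptions the weights come to about 328 GiB, close to the roughly 329Gi download, or 44 GB per GPU.

```python
PARAMS = 176e9        # parameter count stated at the top of this example
BYTES_PER_PARAM = 2   # assumption: all weights stored in fp16
NUM_GPUS = 8          # GPUs used by this example
GPU_MEMORY_GB = 80    # nominal memory of one A100 80GB

weights_bytes = PARAMS * BYTES_PER_PARAM
weights_gib = weights_bytes / 2**30
per_gpu_gb = weights_bytes / NUM_GPUS / 1e9   # assumption: even split across GPUs
headroom_gb = GPU_MEMORY_GB - per_gpu_gb

print(f"fp16 weights: {weights_gib:.0f} GiB in total")
print(f"per GPU: {per_gpu_gb:.0f} GB of weights, {headroom_gb:.0f} GB left for everything else")
```

The remaining memory on each GPU has to hold activations, the attention cache, and framework overhead, which is one reason batch size cannot grow without bound.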

DeepSpeed offers a dramatic speedup over vanilla Transformers with Accelerate, as the benchmarks below indicate. They were run on CoreWeave Cloud using BLOOM's inference scripts.

DeepSpeed benchmarks

*** Running benchmark

*** Performance stats:
Throughput per token including tokenize: 62.73 msecs
Start to ready to generate: 129.698 secs
Tokenize and generate 500 (bs=1) tokens: 6.280 secs
Start to finish: 135.978 secs

*** Running benchmark

*** Performance stats:
Throughput per token including tokenize: 7.58 msecs
Start to ready to generate: 122.540 secs
Tokenize and generate 4000 (bs=8) tokens: 6.088 secs
Start to finish: 128.628 secs

Hugging Face Transformers with Accelerate benchmarks

*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 318.38 msecs
Start to ready to generate: 338.782 secs
Tokenize and generate 500 (bs=1) tokens: 39.511 secs
Start to finish: 378.292 secs

*** Running benchmark

*** Performance stats:
Throughput per token including tokenize: 57.81 msecs
Start to ready to generate: 353.200 secs
Tokenize and generate 4000 (bs=8) tokens: 56.108 secs
Start to finish: 409.308 secs
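To put a number on the speedup, the per-token and startup figures from the two benchmark sets can be compared directly. The sketch below only divides values copied from the output above; the speedup helper is hypothetical. With these numbers DeepSpeed is roughly 5x faster per token at batch size 1 and roughly 7.6x faster at batch size 8.

```python
def speedup(baseline: float, optimized: float) -> float:
    """Ratio of a baseline time to an optimized time; values above 1 mean faster."""
    return baseline / optimized


# batch size -> (msecs per token including tokenize, secs from start to ready)
deepspeed = {1: (62.73, 129.698), 8: (7.58, 122.540)}
accelerate = {1: (318.38, 338.782), 8: (57.81, 353.200)}

for batch_size in sorted(deepspeed):
    ds_token_ms, ds_ready_s = deepspeed[batch_size]
    acc_token_ms, acc_ready_s = accelerate[batch_size]
    print(
        f"bs={batch_size}: "
        f"{speedup(acc_token_ms, ds_token_ms):.1f}x faster per token, "
        f"{speedup(acc_ready_s, ds_ready_s):.1f}x faster to become ready"
    )
```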
