PyTorch Hugging Face Transformers DeepSpeed - BigScience BLOOM

An example of a Hugging Face Transformers implementation of the BigScience Bloom 176B parameter model, optimized by Microsoft's DeepSpeed and pre-sharded model weights.
This example demonstrates how to deploy BLOOM as an InferenceService with a simple HTTP API to perform Text Generation. The deployment will run a DeepSpeed optimized, pre-sharded version of the model on CoreWeave Cloud NVIDIA A100 80GB GPUs networked by NVLink with autoscaling and Scale To Zero. This example uses the Hugging Face BLOOL Inference Server under the hood, wrapping it as a Inference Service on CoreWeave.
Please contact CoreWeave Support to access NVIDIA A100 80GB GPUs.
To follow along, please clone the manifests from GitHub

What is BLOOM?

BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources.
BigScience BLOOM is able to output coherent text in 46 human languages and 13 programming languages. BLOOM can also be instructed to perform text tasks that it hasn't been explicitly trained for by casting them as text generation tasks.

What is DeepSpeed?

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

View the example code on GitHub:


If you'd like to follow along with this demo, clone the manifests provided on GitHub.
The following pieces must be installed and configured prior to running the example.


No modifications are needed to any of the files to follow along with this demo. The general procedure for this example is:
  1. 1.
    Build and push the Docker images to a container registry, in this case Docker Hub.
  2. 2.
    1. 1.
      A PVC in which to store the model.
    2. 2.
      A Batch Job to download the model. The model is quite large at roughly 329Gi, and will take around 30 minutes to complete the download.
  3. 3.
    Perform Text Generation using the model by sending HTTP requests to the InferenceService.


Build and push the Docker images

First, enter the kubernetes-cloud/online-inference/bloom-deepspeed directory. From here, build and push the Docker images; we need one for the model downloader, and one to run the model.
Important The default Docker tag is latest. We strongly discourage you to use this, as containers are cached on the nodes and in other parts of the CoreWeave stack.
Once you have pushed to a tag, do not push to that tag again. Below, we use simple versioning by using tag 1 for the first iteration of the image.
Note ****In the following commands, be sure to replace the example username with your Docker Hub username.
From the kubernetes-cloud/online-inference/bloom-deepspeed directory, run the following commands:
$ docker login
$ export DOCKER_USER=coreweave
$ docker build -t $DOCKER_USER/huggingface-hub-downloader:1 -f Dockerfile.downloader .
$ docker push $DOCKER_USER/huggingface-hub-downloader:1
$ docker build -t $DOCKER_USER/microsoft-bloom-deepspeed-inference-fp16:1 -f Dockerfile .
$ docker push $DOCKER_USER/microsoft-bloom-deepspeed-inference-fp16:1
Note This example assumes a public docker registry. To use a private registry, an imagePullSecret needs to be defined.

Deploy the Kubernetes resources


Note Before continuing, you may either point the image: in the following manifests to the images we just built in the previous steps, or you may use the publicly-available image found in the following manifests:
  • 01-download-job.yaml
  • 02-inference-service.yaml
To create a PVC to store the model, from the kubernetes-cloud/online-inference/bloom-deepspeed/ directory, run:
$ kubectl apply -f 00-pvc.yaml

Model job download

To deploy the job to download the model to the PVC, from the kubernetes-cloud/online-inference/bloom-deepspeed/ directory, run:
$ kubectl apply -f 01-download-job.yaml
Note ****The model is quite large at 329Gi, and may take around 30 minutes for the download job to complete.
To check if the model has finished downloading, wait for the job to be in a Completed state:
$ kubectl get po -l job-name=microsoft-bloom-deepspeed-inference-fp16-download
microsoft-bloom-deepspeed-inference-fp16-download-5mdd2 0/1 Pending 0 48s
Or, follow the job logs to monitor progress:
kubectl logs -l job-name=microsoft-bloom-deepspeed-inference-fp16-download --follow


Once the model is downloaded, the InferenceService can be deployed by invoking:
kubectl apply -f 02-inference-service.yaml
Due to the size of the model, loading into GPU memory can take around 5 minutes. To monitor the progress of this, you can wait to see the KServe workers start in the pod logs by invoking:
kubectl logs -f -l kfserving-container
Alternatively, you can wait for the InferenceService to show that READY is True, and that it has a URL:
$ kubectl get inferenceservices
microsoft-bloom-deepspeed-inference-fp16 True 100 microsoft-bloom-deepspeeda7e1fc0ba9c8977d6db7956f04d85acf-00001 16m
Using the provided URL, you can make an HTTP request via your preferred means.
Here is a simple cURL example:
$ curl -H 'Content-Type: application/json' -d '{"text": ["Deepspeed is"], "repetition_penalty": 10.0}'
{"method":"generate","num_generated_tokens":[75],"query_id":1,"text":["Deepspeed is a leading provider of high-performance, low-latency data center connectivity. We offer the fastest and most reliable Internet access available in North America.\nWe are committed to providing our customers with unmatched performance at an affordable price point through innovative technology solutions that deliver exceptional value for their business needs\nOur network consists entirely on dark fiber routes between major metropolitan areas across Canada & USA"],"total_time_taken":"4.86 secs"}
A complete list of available request parameters can be found here:
Parameters may be modified with each request by supplying the parameters found at the above link as keys, along with the desired value in your request data:
curl -H 'Content-Type: application/json' -d '{"text": ["Deepspeed is"], "repetition_penalty": 10.0, "top_k": 10, "do_sample": true}'


Scaling is controlled in the InferenceService configuration. This example is set to always run one replica, regardless of number of requests.
Increasing the number of maxReplicas will allow the CoreWeave infrastructure to automatically scale up replicas when there are multiple outstanding requests to your endpoints. Replicas will automatically be scaled down as demand decreases.


minReplicas: 1
maxReplicas: 5
By setting minReplicas to 0, Scale To Zero can be enabled, which will completely scale down the InferenceService when there have been no requests for a period of time.
When a service is scaled to zero, no cost is incurred. Please note that due to the size of the BLOOM model, Scale to Zero will lead to very long request completion times if the model has to be scaled from zero. This can take around 5 minutes.

Hardware and Performance

This example is set to use eight NVIDIA A100 80GB NVLink GPUs, as required by Microsoft's pre-sharded weights. This combination offers the highest available throughput for a production grade deployment.
DeepSpeed offers a dramatic speedup to the model over vanilla transformers accelerate as indicated by benchmark testing. The benchmarks below were run on CoreWeave Cloud using BLOOM's inference scripts.

DeepSpeed benchmarks

*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 62.73 msecs
Start to ready to generate: 129.698 secs
Tokenize and generate 500 (bs=1) tokens: 6.280 secs
Start to finish: 135.978 secs
*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 7.58 msecs
Start to ready to generate: 122.540 secs
Tokenize and generate 4000 (bs=8) tokens: 6.088 secs
Start to finish: 128.628 secs

HuggingFace transformers with accelerate benchmarks

*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 318.38 msecs
Start to ready to generate: 338.782 secs
Tokenize and generate 500 (bs=1) tokens: 39.511 secs
Start to finish: 378.292 secs
*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 57.81 msecs
Start to ready to generate: 353.200 secs
Tokenize and generate 4000 (bs=8) tokens: 56.108 secs
Start to finish: 409.308 secs