Transformers DeepSpeed: BigScience BLOOM
An example of a Hugging Face Transformers implementation of the BigScience BLOOM 176B-parameter model, optimized with Microsoft's DeepSpeed and pre-sharded model weights.
This example demonstrates how to deploy BLOOM as an InferenceService with a simple HTTP API to perform Text Generation, while leveraging Hugging Face's Transformers and Accelerate libraries, including the DeepSpeed plugin for Accelerate.
The deployment runs a DeepSpeed-optimized, pre-sharded version of the model on CoreWeave Cloud NVIDIA A100 80GB GPUs networked by NVLink, with autoscaling and Scale To Zero. This example uses the Hugging Face BLOOM Inference Server under the hood, wrapping it as an InferenceService on CoreWeave.
Please contact CoreWeave Support to access NVIDIA A100 80GB GPUs.
What is BLOOM?
BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources.
– Hugging Face BigScience BLOOM
BigScience BLOOM is able to output coherent text in 46 human languages and 13 programming languages. BLOOM can also be instructed to perform text tasks that it hasn't been explicitly trained for by casting them as text generation tasks.
What is DeepSpeed?
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Code samples are available in CoreWeave's kubernetes-cloud repository.
Prerequisites
If you'd like to follow along with this demo, clone the manifests provided on GitHub.
The following pieces must be installed and configured prior to running the example.
- kubectl
- docker
- A CoreWeave Account (with kubectl configured to use your CoreWeave kubeconfig)
- A Docker Hub Account
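Before starting, it can be worth confirming that kubectl is pointed at your CoreWeave namespace. A quick, generic sanity check (not specific to this example) is:
$kubectl config current-context
$kubectl get pods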
Overview
No modifications are needed to any of the files to follow along with this demo. The general procedure for this example is:
- Build and push the Docker images to a container registry, in this case Docker Hub.
- Deploy the Kubernetes resources:
- A PVC in which to store the model.
- A Batch Job to download the model. The model is quite large at roughly 329Gi, and will take around 30 minutes to complete the download.
- Deploy the CoreWeave InferenceService.
- Perform Text Generation using the model by sending HTTP requests to the InferenceService.
Procedure
Build and push the Docker images
First, enter the kubernetes-cloud/online-inference/bloom-deepspeed directory. From here, build and push the Docker images: one for the model downloader, and one to run the model.
The default Docker tag is latest. We strongly discourage using it, as containers are cached on the nodes and in other parts of the CoreWeave stack.
Once you have pushed to a tag, do not push to that tag again. Below, we use simple versioning, with tag 1 for the first iteration of the image.
In the following commands, be sure to replace the example username with your Docker Hub username.
From the kubernetes-cloud/online-inference/bloom-deepspeed directory, run the following commands:
$docker login
$export DOCKER_USER=coreweave
$docker build -t $DOCKER_USER/huggingface-hub-downloader:1 -f Dockerfile.downloader .
$docker push $DOCKER_USER/huggingface-hub-downloader:1
$docker build -t $DOCKER_USER/microsoft-bloom-deepspeed-inference-fp16:1 -f Dockerfile .
$docker push $DOCKER_USER/microsoft-bloom-deepspeed-inference-fp16:1
This example assumes a public Docker registry. To use a private registry, an imagePullSecret needs to be defined.
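One common way to do this with standard Kubernetes tooling is to create a docker-registry Secret and attach it to the namespace's default ServiceAccount so pods can pull the private images. This is a sketch only; the secret name docker-hub-creds is a placeholder, and your manifests or cluster may reference the secret differently:
$kubectl create secret docker-registry docker-hub-creds --docker-username=$DOCKER_USER --docker-password=<your-access-token>
$kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "docker-hub-creds"}]}'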
Deploy the Kubernetes resources
PVC
Before continuing, you may either point the image: values in the following manifests to the images you just built in the previous steps, or use the publicly-available images already referenced in these manifests:
- 01-download-job.yaml
- 02-inference-service.yaml
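If you use your own images, the edit in each manifest is just the container's image: field. A rough sketch of what that field might look like in 01-download-job.yaml is shown below; the surrounding structure and container name are illustrative, not copied from the actual manifest:
spec:
  template:
    spec:
      containers:
        - name: downloader   # illustrative; match the container name used in the manifest
          image: <your-dockerhub-username>/huggingface-hub-downloader:1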
To create a PVC to store the model, run the following from the kubernetes-cloud/online-inference/bloom-deepspeed/ directory:
$kubectl apply -f 00-pvc.yaml
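Before moving on, you can optionally confirm that the claim was created; the PVC defined in 00-pvc.yaml should appear in the output (typically with a STATUS of Bound, though this can depend on the storage class's binding mode):
$kubectl get pvc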
Model download job
To deploy the job that downloads the model to the PVC, run the following from the kubernetes-cloud/online-inference/bloom-deepspeed/ directory:
$kubectl apply -f 01-download-job.yaml
The model is quite large at 329Gi, and the download job may take around 30 minutes to complete.
To check whether the model has finished downloading, wait for the job to reach a Completed state:
$kubectl get po -l job-name=microsoft-bloom-deepspeed-inference-fp16-download
NAME                                                       READY   STATUS    RESTARTS   AGE
microsoft-bloom-deepspeed-inference-fp16-download-5mdd2    0/1     Pending   0          48s
Or, follow the job logs to monitor progress:
$kubectl logs -l job-name=microsoft-bloom-deepspeed-inference-fp16-download --follow
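As an alternative to polling, kubectl wait can block until the job finishes. A minimal sketch (the timeout value here is arbitrary):
$kubectl wait --for=condition=complete --timeout=45m job/microsoft-bloom-deepspeed-inference-fp16-download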
InferenceService
Once the model is downloaded, the InferenceService can be deployed by invoking:
$kubectl apply -f 02-inference-service.yaml
Due to the size of the model, loading it into GPU memory can take around 5 minutes. To monitor progress, watch for the KServe workers to start in the pod logs by invoking:
$kubectl logs -f -l serving.kubeflow.org/inferenceservice=microsoft-bloom-deepspeed-inference-fp16 kfserving-container
Alternatively, you can wait for the InferenceService to show that READY is True, and that it has a URL:
$kubectl get inferenceservices
NAME                                       URL                                                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                                               AGE
microsoft-bloom-deepspeed-inference-fp16   http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud   True           100                              microsoft-bloom-deepspeeda7e1fc0ba9c8977d6db7956f04d85acf-00001   16m
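If you prefer a single command that blocks until the service is ready, something like the following should also work (a sketch; the timeout value is arbitrary):
$kubectl wait --for=condition=Ready --timeout=15m inferenceservice/microsoft-bloom-deepspeed-inference-fp16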
Using the provided URL, you can make an HTTP request via your preferred means.
Here is a simple cURL example:
$curl http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud/generate/ -H 'Content-Type: application/json' -d '{"text": ["Deepspeed is"], "repetition_penalty": 10.0}'
{"method":"generate","num_generated_tokens":[75],"query_id":1,"text":["Deepspeed is a leading provider of high-performance, low-latency data center connectivity. We offer the fastest and most reliable Internet access available in North America.\nWe are committed to providing our customers with unmatched performance at an affordable price point through innovative technology solutions that deliver exceptional value for their business needs\nOur network consists entirely on dark fiber routes between major metropolitan areas across Canada & USA"],"total_time_taken":"4.86 secs"}
For a complete list of available request parameters, check the Hugging Face GitHub.
Parameters may be modified with each request by supplying the parameters found at the above link as keys, along with the desired value in your request data:
$curl http://microsoft-bloom-deepspeed-inference-fp16.tenant-demo.knative.ord1.coreweave.cloud/generate/ -H 'Content-Type: application/json' -d '{"text": ["Deepspeed is"], "repetition_penalty": 10.0, "top_k": 10, "do_sample": true}'
Autoscaling
Scaling is controlled in the InferenceService configuration. This example is set to always run one replica, regardless of the number of requests.
Increasing maxReplicas will allow the CoreWeave infrastructure to automatically scale up replicas when there are multiple outstanding requests to your endpoints. Replicas will automatically be scaled down as demand decreases.
Example
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
By setting minReplicas to 0, Scale To Zero can be enabled, which will completely scale down the InferenceService when there have been no requests for a period of time.
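For example, a minimal sketch of a predictor configured for Scale To Zero (the maxReplicas value here is arbitrary):
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5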
When a service is scaled to zero, no cost is incurred. Please note that due to the size of the BLOOM model, Scale to Zero will lead to very long request completion times if the model has to be scaled from zero. This can take around 5 minutes.
Hardware and Performance
This example is set to use eight NVIDIA A100 80GB NVLink GPUs, as required by Microsoft's pre-sharded weights. This combination offers the highest available throughput for a production-grade deployment.
DeepSpeed offers a dramatic speedup to the model over vanilla transformers with accelerate, as indicated by benchmark testing. The benchmarks below were run on CoreWeave Cloud using BLOOM's inference scripts.
DeepSpeed benchmarks
*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 62.73 msecs
Start to ready to generate: 129.698 secs
Tokenize and generate 500 (bs=1) tokens: 6.280 secs
Start to finish: 135.978 secs
*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 7.58 msecs
Start to ready to generate: 122.540 secs
Tokenize and generate 4000 (bs=8) tokens: 6.088 secs
Start to finish: 128.628 secs
HuggingFace transformers with accelerate benchmarks
*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 318.38 msecs
Start to ready to generate: 338.782 secs
Tokenize and generate 500 (bs=1) tokens: 39.511 secs
Start to finish: 378.292 secs
*** Running benchmark
*** Performance stats:
Throughput per token including tokenize: 57.81 msecs
Start to ready to generate: 353.200 secs
Tokenize and generate 4000 (bs=8) tokens: 56.108 secs
Start to finish: 409.308 secs