Transformers Accelerate: BigScience BLOOM
Deploy BigScience BLOOM as an Inference Service using the Accelerate Library
BigScience BLOOM is an autoregressive Large Language Model (LLM), trained on vast amounts of text data using industrial-scale computational resources to continue text from a prompt.
It is capable of outputting coherent, functional, and stylistically consistent text in 46 human languages as well as 13 programming languages. BLOOM can also be instructed to perform text-based tasks for which it hasn't been explicitly trained, by casting the tasks as text generation tasks.
The following tutorial demonstrates how to deploy BigScience BLOOM as an Inference Service on CoreWeave Cloud, complete with a simple HTTP API to perform Text Generation, while leveraging Hugging Face's Transformers and Accelerate libraries. The deployment runs on NVIDIA A100 GPUs and supports autoscaling, including Scale-to-Zero.
Prerequisites
This tutorial assumes that kubectl is already installed on your system.
It also assumes you already have an active CoreWeave Cloud account, with account credentials configured and a working Kubeconfig file installed on your system.
Example source code
To follow along with this tutorial, first clone the BLOOM example code from the CoreWeave kubernetes-cloud repository.
The online-inference/bloom-176b directory contains all of the manifests used to deploy the resources, as well as everything needed to build your own Docker image, if desired.
Advanced options
In this tutorial, public images of the BLOOM container are used. However, if you would prefer to build and push your own image, see "Build your own image" below. For those instructions, a public registry on Docker Hub is used.
Container images may alternatively be pushed to CoreWeave Object Storage.
To use a private registry, an imagePullSecret must be defined. This is outside the scope of this tutorial.
Hardware and performance
This example is set to use five NVIDIA A100 PCIe GPUs.
Please be aware that text generation performance with the current codebase is suboptimal. We are actively working to integrate optimizations into the BLOOM example. This tutorial is primarily for demonstration purposes; optimized performance should not currently be expected.
The highest-performing GPU configuration for a production-grade deployment of BLOOM is 8x A100 NVLINK 80GB GPUs. To access these, please contact support.
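As a rough illustration of why several A100 GPUs are needed, the sketch below estimates the memory taken by the model weights alone. It assumes roughly 176 billion parameters (inferred from the bloom-176b name) stored at 2 bytes each (bf16 or fp16); neither figure is stated in this tutorial, and the function names are hypothetical. Under those assumptions the result lands close to the 329Gi model size quoted in the download step later in this tutorial. Activations and other runtime buffers are ignored, so treat it as a lower bound.

```python
# Back-of-the-envelope sizing for model weights only.
# Assumptions (not stated in this tutorial): ~176e9 parameters, 2 bytes each.
ASSUMED_PARAMS = 176e9
ASSUMED_BYTES_PER_PARAM = 2  # bf16 / fp16


def weights_bytes(params: float = ASSUMED_PARAMS,
                  bytes_per_param: int = ASSUMED_BYTES_PER_PARAM) -> float:
    """Bytes needed to hold the weights, ignoring activations and caches."""
    return params * bytes_per_param


def weights_gib(params: float = ASSUMED_PARAMS,
                bytes_per_param: int = ASSUMED_BYTES_PER_PARAM) -> float:
    """Same figure expressed in GiB (2**30 bytes)."""
    return weights_bytes(params, bytes_per_param) / 2**30


def fits_in_gpu_memory(num_gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit in the combined nominal GPU memory.

    gb_per_gpu is the nominal (decimal) capacity, e.g. 80 for an 80GB A100.
    """
    return num_gpus * gb_per_gpu * 1e9 >= weights_bytes()


print(f"weights: about {weights_gib():.0f} GiB")
print("fits on 8 x 80GB:", fits_in_gpu_memory(8, 80))
```

The tutorial does not state the per-GPU memory of the five-GPU configuration, so pass whichever capacity applies to your hardware.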
Procedure
Once the tutorial source code is cloned to your local system, change directories to kubernetes-cloud/online-inference/bloom-176b.
cd kubernetes-cloud/online-inference/bloom-176b
This directory contains the following files:
Filename | Description |
---|---|
model | A directory containing additional files |
00-bloom-176b-pvc.yaml | The Kubernetes manifest used to create the PVC, in which the model will be stored |
01-bloom-176b-download-job.yaml | The Kubernetes manifest used to deploy the download job, which will download the model into the created PVC |
02-bloom-176b-inferenceservice.yaml | The Kubernetes manifest used to deploy the Inference Service, which serves the model |
The model directory contains the following files, used for creating the BLOOM container image:
Filename | Description |
---|---|
Dockerfile | The Dockerfile used to create the BLOOM container image |
bloom.py | Utilized by the download-job manifest to download the model into the PVC |
requirements.txt | Requirements for bloom.py |
scripts/download-model | Script used to automate downloading the Hugging Face model |
Deploy the Kubernetes resources
The manifests in this tutorial's source code are configured by default to point to publicly available BLOOM images. If you are building and pushing your own image instead, be sure to adjust the value of image in the download-job and inferenceservice manifests so that image points to your image source prior to deploying the Kubernetes resources.
Next, deploy each of the Kubernetes resources. All resource manifests are contained within the online-inference/bloom-176b directory.
Deploy the PVC
In order to store the model, a PVC is needed. To create the PVC, apply the 00-bloom-176b-pvc.yaml manifest.
kubectl apply -f 00-bloom-176b-pvc.yaml
Deploy the model download job
The model is quite large at 329Gi, so it may take around 30 minutes for the download job to complete.
Next, the model will be downloaded to the PVC. To deploy the job to download the model, apply the 01-bloom-176b-download-job.yaml manifest.
kubectl apply -f 01-bloom-176b-download-job.yaml
To check the model's download status, follow the job logs using kubectl logs:
kubectl logs -l job-name=bloom-176b-download --follow
Or, wait for the job to be in a Completed state:
kubectl get pods
NAME READY STATUS RESTARTS AGE
bloom-176b-download-hkws6 0/1 Completed 0 1h
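If you would rather script this check than read the table by eye, the following is a minimal sketch that parses the text printed by `kubectl get pods`. It assumes the default column layout shown above and that download Pod names start with `bloom-176b-download`; the function names are hypothetical helpers, not part of the tutorial's source code.

```python
from typing import Dict


def pod_statuses(kubectl_output: str) -> Dict[str, str]:
    """Map Pod name -> STATUS from the default `kubectl get pods` table."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_idx, status_idx = header.index("NAME"), header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_idx]] = fields[status_idx]
    return statuses


def download_finished(kubectl_output: str,
                      prefix: str = "bloom-176b-download") -> bool:
    """True if any Pod whose name starts with `prefix` reports Completed."""
    return any(
        status == "Completed"
        for name, status in pod_statuses(kubectl_output).items()
        if name.startswith(prefix)
    )


sample = """\
NAME READY STATUS RESTARTS AGE
bloom-176b-download-hkws6 0/1 Completed 0 1h
"""
print(download_finished(sample))
```

In practice the input could come from `subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True).stdout`.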
Deploy the Inference Service
Once the model has finished downloading, deploy the Inference Service by applying the final manifest, 02-bloom-176b-inferenceservice.yaml.
kubectl apply -f 02-bloom-176b-inferenceservice.yaml
Due to the size of the model, loading into GPU memory can take around 5 minutes.
To monitor the loading progress, either wait to see the KServe workers start in the Pod logs by using kubectl logs:
kubectl logs -f -l serving.kubeflow.org/inferenceservice=bloom-176b -c kfserving-container
Or, wait for the InferenceService to show that READY is True, and that it has a URL:
kubectl get inferenceservices
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
bloom-176b http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud True 100 bloom-176b-predictor-default-00001 9m2s
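The readiness check can also be scripted. The sketch below reads the text printed by `kubectl get inferenceservices` and returns the service URL only once READY is True. It relies on the first three columns (NAME, URL, READY) appearing in the order shown above, which is an assumption about the default output format; `service_endpoint` is a hypothetical helper name.

```python
from typing import Optional


def service_endpoint(kubectl_output: str,
                     name: str = "bloom-176b") -> Optional[str]:
    """Return the service URL if `name` is listed with READY == True.

    Only the first three whitespace-separated fields are used, because
    later columns may be empty and shift position.
    """
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        if (len(fields) >= 3
                and fields[0] == name
                and fields[1].startswith("http")
                and fields[2] == "True"):
            return fields[1]
    return None


sample = """\
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
bloom-176b http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud True 100 bloom-176b-predictor-default-00001 9m2s
"""
print(service_endpoint(sample))
```

A `None` result means the service is not listed, has no URL yet, or is not ready, so a caller can simply retry after a short delay.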
Submit a request
The URL shown in the Inference Service listing is the endpoint to which HTTP requests are sent. Use any HTTP client you prefer.
For example, here is a request submitted using cURL:
curl http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud/v1/models/bigscience-bloom:predict -d '{"instances": ["That was fun"]}' | jq
{
  "predictions": [
    [
      {
        "generated_text": "That was fun.\n- Yeah, it was good to see you guys again.\nYeah, you too.\nYou know what?\nI think I'm gonna go home and get some sleep.\nI'm beat.\n-"
      }
    ]
  ]
}
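The same request can be issued from Python. The following is a minimal sketch using only the standard library; it mirrors the path and the request and response shapes of the cURL example above. The function names, the JSON `Content-Type` header, and the generous default timeout are assumptions made for illustration, not part of the tutorial's source code.

```python
import json
import urllib.request
from typing import List

MODEL_NAME = "bigscience-bloom"  # taken from the path in the cURL example


def predict_url(base_url: str, model: str = MODEL_NAME) -> str:
    """Build the :predict endpoint from the Inference Service URL."""
    return f"{base_url.rstrip('/')}/v1/models/{model}:predict"


def extract_texts(response: dict) -> List[str]:
    """Flatten the nested `predictions` lists into generated strings."""
    return [
        candidate["generated_text"]
        for candidates in response["predictions"]
        for candidate in candidates
    ]


def generate(base_url: str, prompt: str, timeout: float = 600.0) -> List[str]:
    """POST one prompt and return the generated text(s).

    The timeout is deliberately long (an assumption) because a cold
    start can take several minutes for a model of this size.
    """
    body = json.dumps({"instances": [prompt]}).encode("utf-8")
    request = urllib.request.Request(
        predict_url(base_url),
        data=body,
        headers={"Content-Type": "application/json"},  # assumed to be accepted
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_texts(json.load(resp))
```

For example, `generate("http://<your-service-url>", "That was fun")` would return a list of generated strings, where the placeholder stands for the URL reported for your own Inference Service.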
The following parameters are supported:
min_length
max_length
temperature
top_k
top_p
repetition_penalty
You can modify the model parameters by adjusting the curl command. For example:
curl http://bloom-176b.tenant-sta-tweldon-workbench.knative.ord1.coreweave.cloud/v1/models/bigscience-bloom:predict -d '{"instances": ["This will generate some text"], "parameters":{"max_length": 20, "min_length": 20}}'
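When building request bodies programmatically, it can help to reject parameter names the service does not list as supported before sending anything. The sketch below is one way to do that; `build_payload` is a hypothetical helper, and the set of names is copied from the supported parameters listed above. Value ranges are not checked, since the tutorial does not specify them.

```python
import json

# Parameter names listed as supported in this tutorial.
SUPPORTED_PARAMETERS = {
    "min_length", "max_length", "temperature",
    "top_k", "top_p", "repetition_penalty",
}


def build_payload(prompt: str, **parameters) -> dict:
    """Build a request body, refusing parameter names not listed above."""
    unknown = set(parameters) - SUPPORTED_PARAMETERS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    payload = {"instances": [prompt]}
    if parameters:
        payload["parameters"] = parameters
    return payload


# Same body as the cURL example above.
body = build_payload("This will generate some text", max_length=20, min_length=20)
print(json.dumps(body))
```

The printed JSON string can be passed to `curl -d` or sent with any HTTP client.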
Autoscaling options (advanced)
Scaling is controlled in the InferenceService manifest. For this tutorial, the manifest is set to always run one replica, regardless of the number of requests.
Increasing the value of maxReplicas allows the CoreWeave infrastructure to automatically create additional replicas when there are multiple outstanding requests to your endpoints.
For example:
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
Replicas will then automatically be scaled down as demand decreases.
Scale-to-Zero
By setting minReplicas to 0, Scale-to-Zero is enabled, which will completely scale down the InferenceService if there have been no requests for a period of time.
When a service is scaled to zero, no cost is incurred.
Please note that due to the size of the BLOOM model, enabling Scale-to-Zero may lead to very long request completion times if the model has to be scaled up from zero replicas. This can take up to around 5 minutes.
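As an alternative to editing the manifest by hand, the replica bounds could be generated and validated in code. The sketch below produces a JSON document with the same `spec.predictor` fields used in this section; `scaling_patch` is a hypothetical helper, and using the result as a merge patch is an assumption rather than a procedure from this tutorial.

```python
import json


def scaling_patch(min_replicas: int, max_replicas: int) -> str:
    """Return a JSON document setting the predictor's replica bounds.

    min_replicas == 0 corresponds to Scale-to-Zero as described above.
    """
    if min_replicas < 0:
        raise ValueError("min_replicas must be >= 0")
    if max_replicas < max(min_replicas, 1):
        raise ValueError("max_replicas must be >= 1 and >= min_replicas")
    return json.dumps({
        "spec": {
            "predictor": {
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
            }
        }
    })


# Scale-to-Zero with up to four replicas under load.
print(scaling_patch(0, 4))
```

One possible way to apply the output, assuming your cluster permits it, is `kubectl patch inferenceservice bloom-176b --type merge -p '<json>'`. Editing the manifest and re-applying it achieves the same result.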
Build your own image (advanced)
Log in to Docker Hub (if applicable)
If you are using Docker Hub, log in to Docker Hub now.
docker login
Set your Docker Hub username as the environment variable $DOCKER_USER.
export DOCKER_USER=coreweave
Build the Docker image
CoreWeave strongly discourages using the default Docker tag latest. Containers are cached on the nodes and in other parts of the CoreWeave stack, so using this tag can cause issues. Instead, use a simple versioning system.
Similarly, once you have pushed to a tag, do not push to the same tag again.
Using the provided Dockerfile, build the Hugging Face image. For this tutorial, a very simple versioning scheme of single numbers is used. The tag 1 is used for the first iteration of the image.
docker build -t $DOCKER_USER/bloom-176b:1 -f Dockerfile .
Next, push the built and tagged image to the Docker Hub container registry.
docker push $DOCKER_USER/bloom-176b:1
Once the image is pushed to the registry, continue the rest of this tutorial, starting at Deploy the Kubernetes resources.
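The tagging advice above (never use latest, never push to the same tag twice) can be enforced with a small helper when image names are generated by a script. This is a minimal sketch; `image_ref` and `next_version` are hypothetical names, and the single-number scheme simply follows the one used in this tutorial.

```python
def image_ref(user: str, version, name: str = "bloom-176b") -> str:
    """Build `<user>/<name>:<tag>`, refusing an empty or `latest` tag."""
    tag = str(version).strip()
    if not tag or tag.lower() == "latest":
        raise ValueError("use an explicit version tag, not 'latest'")
    return f"{user}/{name}:{tag}"


def next_version(current: int) -> int:
    """Single-number versioning: each new build gets the next integer."""
    return current + 1


first = image_ref("coreweave", 1)
second = image_ref("coreweave", next_version(1))
print(first, second)
```

The resulting string is the value to use with `docker build -t`, `docker push`, and the `image` field of the manifests.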