Real World Impact Benchmark

Compare Tensorizer using a real-world scenario

This benchmark tutorial constructs two otherwise identical services to serve GPT-J-6B: one loads the model with Tensorizer, and the other loads it with Hugging Face Transformers.

Serving inference traffic with machine learning models requires a trade-off between cost, accuracy, and latency. The illustration below showcases how these metrics can be optimized by using Tensorizer.

Additional Resources

To learn more about Tensorizer before embarking on this tutorial, check out our blog post, "Decrease PyTorch Model Load Times with CoreWeave’s Tensorizer," or our slideshow presentation of the same title.

Prerequisites

This guide presumes that kubectl and python are installed on the host system, and that the user has some basic familiarity with Kubernetes.

Example source code

To follow along with this tutorial, first clone the source code from CoreWeave's kubernetes-cloud repository.

Deploy all resources

After cloning the source code, change directories to tensorizer-isvc. From this directory, provision the Persistent Volume Claim (PVC) defined in pvc.yaml:

$ kubectl apply -f pvc.yaml
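
Before continuing, it can help to confirm that the claim was created and bound. The claim's name comes from pvc.yaml; kubectl get pvc lists all claims in the current namespace:

$ kubectl get pvc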

Next, download the model to the newly deployed PVC by deploying the model download job located at model-download/model-download-job.yaml.

$ kubectl apply -f model-download/model-download-job.yaml
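
The download may take a while for a model of this size. To watch its progress, list the Job and follow its logs. The Job name below is a placeholder; the actual name is defined in model-download/model-download-job.yaml:

$ kubectl get jobs --watch
$ kubectl logs -f job/<JOB_NAME>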

Now, run the Hugging Face InferenceService by deploying its manifest at tensorizer_hf_isvc/kserve/hf-isvc.yaml. In this example, KServe is used as the server.

$ kubectl apply -f tensorizer_hf_isvc/kserve/hf-isvc.yaml

Next, run the Tensorizer InferenceService by deploying the manifest at tensorizer_hf_isvc/kserve/tensorizer-isvc.yaml. In this example, KServe is once again used as the server.

$ kubectl apply -f tensorizer_hf_isvc/kserve/tensorizer-isvc.yaml
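
Both services may take a few minutes to pull their images and scale up. To check their status before testing, list the InferenceServices and their Pods (isvc is KServe's short name for the InferenceService resource):

$ kubectl get isvc
$ kubectl get pods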

Acquire the InferenceService's URL

Use kubectl get to view the InferenceService deployment's information and acquire its URL, listed under the URL field:

$ kubectl get ksvc
Note: http:// may be required instead of https:// when connecting to the given URL.
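
As a convenience, the URL can also be extracted directly with a JSONPath expression. The service name placeholder below should be replaced with the name shown by kubectl get ksvc:

$ kubectl get ksvc <SERVICE_NAME> -o jsonpath='{.status.url}'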

Test the InferenceService

The KServe services use KServe's V1 protocol. The basic POST command below may be used to test the Service when served with KServe:

$ curl http://<URL>/v1/models/gptj:predict -X POST -H 'Content-Type: application/json' -d '{"instances": ["Hello!"]}'
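
Because the V1 protocol's instances field is a JSON list, multiple prompts may be sent in a single request, provided the model server iterates over the list. For example:

$ curl http://<URL>/v1/models/gptj:predict -X POST -H 'Content-Type: application/json' -d '{"instances": ["Hello!", "Once upon a time"]}'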

The Flask services simply encode queries into the URL path component. The basic curl command below may be used to test the Service when served with Flask:

$ curl http://<URL>/predict/Hello%21
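
Since the Flask variant reads the prompt from the URL path, any spaces or special characters in the prompt must be percent-encoded, as in the Hello%21 example above. For instance:

$ curl http://<URL>/predict/Once%20upon%20a%20time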

Run the benchmark

Use python to run the benchmark test. The load_test.py script defaults to sending asynchronous requests with aiohttp. In this case, KServe is used as the server:

$ python benchmark/load_test.py --kserve --url=<ISVC_URL> --requests=<NUMBER_OF_REQUESTS>

As an alternative to asynchronous requests, the --sync option may be added to the command line to send requests sequentially.
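
The script's default mode relies on aiohttp, so make sure it is installed in the local Python environment. A concrete sequential run, combining the options described above (the request count of 100 is only an example), might look like this:

$ pip install aiohttp
$ python benchmark/load_test.py --kserve --sync --url=<ISVC_URL> --requests=100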

Delete the InferenceService

To remove the InferenceService, use kubectl delete to target the same manifest file applied earlier. For example:

$ kubectl delete -f tensorizer_hf_isvc/<...>/<...>-isvc.yaml
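
To tear down everything created in this tutorial, the InferenceServices, the model download Job, and the PVC can all be deleted the same way, using the manifests applied earlier:

$ kubectl delete -f tensorizer_hf_isvc/kserve/hf-isvc.yaml
$ kubectl delete -f tensorizer_hf_isvc/kserve/tensorizer-isvc.yaml
$ kubectl delete -f model-download/model-download-job.yaml
$ kubectl delete -f pvc.yaml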

InferenceService containers

It is worth noting that each InferenceService manifest (those whose filenames end in -isvc.yaml) runs a container defined in a Dockerfile located in the same directory. For example, tensorizer_hf_isvc/kserve/Dockerfile.

These containers may be changed and rebuilt to customize the behavior of the InferenceService. The build context for each Dockerfile is its parent directory, so build commands are structured as follows:

$ docker build ./tensorizer_hf_isvc -f ./tensorizer_hf_isvc/kserve/Dockerfile
$ docker build ./tensorizer_hf_isvc -f ./tensorizer_hf_isvc/flask/Dockerfile
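
To use a rebuilt image with an InferenceService, tag it and push it to a registry the cluster can pull from, then update the image reference in the corresponding -isvc.yaml manifest to match. The registry, image name, and tag below are placeholders:

$ docker build -t <REGISTRY>/<IMAGE>:<TAG> ./tensorizer_hf_isvc -f ./tensorizer_hf_isvc/kserve/Dockerfile
$ docker push <REGISTRY>/<IMAGE>:<TAG>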

Results

RAM Usage

In "Plaid" mode, Tensorizer loads the model directly into the GPU while using only as much RAM as the size of the largest single tensor. As models grow larger, Tensorizer becomes increasingly cost-effective, because the entire model never needs to be held in CPU RAM before being transferred to the GPU. Scaling inference services during bursts of traffic therefore minimizes RAM costs.

Model load times

CoreWeave’s Tensorizer outperforms SafeTensors and Hugging Face for model load times on OPT-30B with NVIDIA A100s.

Here, "model load times" refers to completely initializing the model from a checkpoint - not swapping in weights from a checkpoint to an already initialized model.

Smaller model metrics

CoreWeave’s Tensorizer outperforms SafeTensors and Hugging Face for model load times on GPT-J-6B with NVIDIA A40s.

Larger model metrics

CoreWeave’s Tensorizer outperforms SafeTensors and Hugging Face for model load times on OPT-30B with NVIDIA A100s.

Handling burst requests

In CoreWeave's tests of Tensorizer's performance, Tensorizer delivered approximately five times lower average latency across the burst of requests and required fewer Pod spin-ups than Hugging Face.

The burst involved 100 concurrent requests on GPT-J using NVIDIA A40s.

In these results, the data shown reflects how the InferenceService scaled from one idle GPU with zero requests to 100 requests sent at once. The InferenceService ran on NVIDIA A40s.

The average latency per request for Tensorizer end-to-end inference on GPT-J is 0.43 seconds, compared to 2.45 seconds for Hugging Face. Tensorizer provides cost-effectiveness, lower latency, and minimal RAM usage as model size grows.