
Tensorizer

Get extremely fast PyTorch model loads from HTTP/HTTPS and S3 endpoints with CoreWeave Tensorizer

CoreWeave Tensorizer is a serializer and deserializer for PyTorch modules, models, and tensors that makes it possible to load models at high speeds from HTTP, HTTPS, and S3 endpoints. Tensorizer enables faster load times and lower resource costs whether models are loaded over the network or from local disk volumes.

What is serialization?

In machine learning, serialization is the process of converting a data object (e.g. a model) into a format that can be stored or transmitted with minimal processing power. This process is typically quite slow, since it happens infrequently and is not optimized for write speed; the cost is most noticeable with large models that must eventually be loaded into GPU memory.

Deserialization is the same process in the opposite direction: that is, recreating a serialized data object when needed. Deserialization occurs more frequently than serialization, but is still traditionally a slow process.

How Tensorizer works

Tensorizer significantly reduces latency and resource usage thanks to its "zero-copy" model loading approach.

Instead of loading the entire model into RAM before transferring it to the GPU, Tensorizer streams the model in chunk by chunk. It only needs a buffer the size of the largest tensor, plus a small amount of metadata used to locate each tensor.

This "tensor streaming" process is made possible by Tensorizer's bespoke serialization format, which puts all necessary metadata at the beginning of a single binary file. This file can then be loaded quickly and efficiently whether it comes from local storage, an HTTP/HTTPS endpoint, or an S3 bucket.

Note

CoreWeave's Tensorizer serializes a model into a single binary file with all metadata stored up front, and supports bfloat16. Loading is safe, since no arbitrary code is saved in the file or executed during deserialization.
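
In code, the round trip is only a few calls. The sketch below is adapted in spirit from the Tensorizer open-source documentation: it serializes a small Hugging Face model to a local file and then streams the weights back into an empty module skeleton. The model ID is an arbitrary example, and keyword names such as device may differ slightly between tensorizer versions.

Example
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorSerializer, TensorDeserializer
from tensorizer.utils import no_init_or_tensor

model_ref = "EleutherAI/gpt-neo-125M"  # a small model, chosen purely for illustration

# Serialize: write every tensor, with the metadata up front, into one binary file.
model = AutoModelForCausalLM.from_pretrained(model_ref)
serializer = TensorSerializer("model.tensors")
serializer.write_module(model)
serializer.close()

# Deserialize: build an empty module skeleton (no weight allocation), then
# stream the weights straight onto the GPU, one buffered chunk at a time.
config = AutoConfig.from_pretrained(model_ref)
with no_init_or_tensor():
    empty_model = AutoModelForCausalLM.from_config(config)

deserializer = TensorDeserializer("model.tensors", device="cuda")
deserializer.load_into_module(empty_model)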

Larger model metrics

When tested with a larger model on a higher-performing GPU, the impact of Tensorizer on model load time becomes more pronounced. The results below show that CoreWeave's Tensorizer outperformed both SafeTensors and Hugging Face on average model load time for OPT-30B on NVIDIA A100 Tensor Core GPUs.

  • CoreWeave's Tensorizer: 23.23 sec. (median); 22.8 sec. (average)
  • SafeTensors: 36.75 sec. (median); 39.3 sec. (average)
  • Hugging Face: 35.18 sec. (median); 32.1 sec. (average)

Benchmarks

Currently, two benchmarks are available for Tensorizer, which can be reproduced to see first-hand how Tensorizer works.

Real world impact benchmark

The real world impact benchmark sets up two Inference Services serving GPT-J-6B: one loads the model with Tensorizer, the other with Hugging Face Transformers. Starting from idle with 1 GPU each, both Services are hit with 100 concurrent requests, and metrics on average response time and autoscaling behavior are collected for comparison.

This benchmark may be replicated by following the real world impact benchmark tutorial.

Comparison benchmark

The comparison benchmark measures raw model load time from a PVC across vanilla Hugging Face, SafeTensors, and Tensorizer, for comparison between all three. It is essentially a lab-style test of raw load performance.

The code for this benchmark may be viewed on CoreWeave's GitHub.

Features

📈 Reduction in resource usage

With faster model load times for LLMs and reduced GPU memory utilization, Tensorizer accelerates model instance spin-up while reducing the cost of serving inference. Transfers to the GPU are nearly instantaneous when using Tensorizer, which lowers the amount of CPU and RAM the instance needs and, in turn, the resources consumed and costs incurred.

Tensorizer's average latency per request was more than 5x lower than Hugging Face's when scaling from zero. Thanks to its "lazy loading" capability, Tensorizer also required fewer pod spin-ups and significantly less RAM than the alternatives.

👍 S3-compatible

Tensorizer is S3- and HTTP-compatible: it serializes a model and its associated tensors into a single file that can be loaded quickly and efficiently from an HTTP, HTTPS, or S3 endpoint. Serialized models can be stored in CoreWeave's S3-compatible Object Storage, enabling models to be streamed directly from S3 into the container without first downloading them to the container's local filesystem.

🚀 Extremely fast model loading speeds

Because the model is not embedded in the container image, Tensorizer shrinks the image and decreases model load times. The impact scales with model size, becoming particularly important for models that are already large, such as EleutherAI/gpt-neox-20B, which weighs in at ~40GB.

✏️ Flexible iteration

By decoupling the model from the container image, model updates do not require rebuilding the container image, which allows for quick iteration on the model itself. It also allows new model versions to be deployed without waiting for the container image to build or for the container image cache to be populated.

🔒 Tensor weight encryption

Tensorizer encrypts tensor weights using a modified XSalsa20-Poly1305 symmetric authenticated encryption scheme. This process occurs in the background while the model loads, and is parallelized over multiple CPU cores, resulting in lower latency compared to other encryption methods. The encryption algorithm is applied to each data "chunk" independently, allowing authentication of the integrity of data within each "chunk".

Warning

Tensorizer only encrypts tensor weights, not the model in its entirety. Tensorizer does not provide encryption or authentication for any other information about the model and its architecture, including metadata about the tensor, such as the name, dtype, shape, size, or non-keyed hashes. Additionally, the tensor weight encryption does not provide message authentication for metadata, and does not protect against reordering or truncation of chunks.

For more information, including reference and tutorials, see the Tensorizer open-source documentation.
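
As a rough sketch of how this looks in practice, the snippet below follows the encryption helpers described in the Tensorizer open-source documentation; class and keyword names may differ between tensorizer versions. A random symmetric key is generated at serialization time, and the same key must be supplied again at load time.

Example
from transformers import AutoModelForCausalLM
from tensorizer import (
    TensorSerializer,
    TensorDeserializer,
    EncryptionParams,
    DecryptionParams,
)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

# Generate a random symmetric key; persist encryption_params.key securely,
# since the same key is required to decrypt the weights later.
encryption_params = EncryptionParams.random()

serializer = TensorSerializer("model.tensors", encryption=encryption_params)
serializer.write_module(model)
serializer.close()

# At load time, supply the same key; each chunk is decrypted and
# authenticated in the background while it streams in.
decryption_params = DecryptionParams.from_key(encryption_params.key)
deserializer = TensorDeserializer(
    "model.tensors", encryption=decryption_params, device="cuda"
)
# ...then deserializer.load_into_module(...) on an empty module, as usual.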

🏡 Local filesystem support

Tensorizer also supports loading and serializing models from a local filesystem at high speeds, following the same principles used to provide fast load times for HTTP, HTTPS, and S3 endpoints.

Pre-Tensorized models on CoreWeave

CoreWeave Cloud provides multiple pre-Tensorized models, which can be used with the TensorDeserializer class. Object Storage support defaults to the accel-object.ord1.coreweave.com endpoint and uses the tensorized bucket.

Access all available pre-Tensorized models in the source code GitHub repository's README file.
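
Below is a sketch of streaming one of these pre-Tensorized models with the TensorDeserializer class. The GPT-J-6B object path is illustrative (the exact paths are listed in the repository README), and helper parameter names such as s3_endpoint follow the tensorizer stream_io utilities and may vary by version.

Example
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorDeserializer, stream_io
from tensorizer.utils import no_init_or_tensor

model_ref = "EleutherAI/gpt-j-6B"
s3_uri = f"s3://tensorized/{model_ref}/fp16/model.tensors"  # illustrative path

# Build an empty module skeleton without allocating or initializing weights.
config = AutoConfig.from_pretrained(model_ref)
with no_init_or_tensor():
    model = AutoModelForCausalLM.from_config(config)

# Public reads go through the accelerated Object Storage endpoint;
# no credentials are required for the tensorized bucket.
stream = stream_io.open_stream(
    s3_uri,
    "rb",
    s3_endpoint="accel-object.ord1.coreweave.com",
)
deserializer = TensorDeserializer(stream, device="cuda")
deserializer.load_into_module(model)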

Additional Resources

Run the tutorial using custom serialization


Tensorizer models can be loaded from and serialized to a local filesystem or network storage. When using a Tensorizer model for inference, we recommend uploading it to an S3 endpoint, as described below.

Generate an S3 key

First, generate an S3 key from the Object Storage section of the CoreWeave Cloud App.

Create an Object Storage bucket

Next, create a new Object Storage bucket using the s3cmd tool:

Example
$ s3cmd mb s3://YOURBUCKET
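
If s3cmd is not yet configured for CoreWeave Object Storage, a minimal ~/.s3cfg along the following lines will point it at the ORD1 endpoint; the keys are placeholders and the endpoint is an assumption, so adjust both for your account and region.

title="~/.s3cfg"
[default]
access_key = <YOUR ACCESS KEY>
secret_key = <YOUR SECRET KEY>
host_base = object.ord1.coreweave.com
host_bucket = %(bucket)s.object.ord1.coreweave.com
use_https = True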

Install the S3 secrets and endpoint hostname

To install the S3 access and secret keys created earlier, first base64-encode each of the key values.

Example
$ echo -n "<your key>" | base64

For example:

Example
$ echo -n "<YOUR ACCESS KEY>" | base64
QUNDRVNTX0tFWV9IRVJF
$ echo -n "<YOUR SECRET KEY>" | base64
U0VDUkVUX0tFWV9IRVJF

Then, in the 00-optional-s3-secret.yaml file, replace the access key and secret key placeholders in the .data.access_key and .data.secret_key fields with your base64-encoded keys, respectively. For example:

title="00-optional-s3-secret.yaml"
apiVersion: v1
data:
  access_key: QUNDRVNTX0tFWV9IRVJF
kind: Secret
metadata:
  name: s3-access-key
type: Opaque
---
apiVersion: v1
data:
  secret_key: U0VDUkVUX0tFWV9IRVJF
kind: Secret
metadata:
  name: s3-secret-key
type: Opaque

The S3 endpoint URL of the new S3 bucket must also be included in the 00-optional-s3-secret.yaml file.

First, base64-encode the endpoint URL. The endpoint URL should correspond to the region in which your new bucket is hosted. In this example, the ORD1 region is used, which means the hostname of the Object Storage endpoint URL is object.ord1.coreweave.com.

Example
$ echo -n "object.ord1.coreweave.com" | base64
b2JqZWN0Lm9yZDEuY29yZXdlYXZlLmNvbQ==

Replace the host URL placeholder (.data.url) with the base64-encoded S3 endpoint URL of the new bucket. For example:

title="00-optional-s3-secret.yaml"
apiVersion: v1
data:
  url: b2JqZWN0Lm9yZDEuY29yZXdlYXZlLmNvbQ==
kind: Secret
metadata:
  name: s3-host-url
type: Opaque

Once these values are replaced, create the Secrets by applying the 00-optional-s3-secret.yaml file using kubectl.

Example
$ kubectl apply -f 00-optional-s3-secret.yaml
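
To confirm that all three Secrets exist (the names below match the manifests above):

Example
$ kubectl get secret s3-access-key s3-secret-key s3-host-url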

Serialize the model

01-optional-s3-serialize-job.yaml runs the serialization Job for the model when deployed.

Before deploying the Job, adjust the following command arguments in the 01-optional-s3-serialize-job.yaml file.

  • Replace the value of the command option --dest-bucket with the name of the bucket to which the model will be serialized.
  • Replace the value of the command option --hf-model-id with the ID of the model you would like to serialize. (By default, the model ID is set to runwayml/stable-diffusion-v1-5. Additional model IDs are available on Hugging Face.)

Once these values are added, deploy the Job using kubectl to run it:

Example
$ kubectl apply -f 01-optional-s3-serialize-job.yaml
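
Serialization can take several minutes for larger models. The Job's progress can be followed with kubectl; the Job name used below is a placeholder for whatever name is set in the manifest:

Example
$ kubectl get jobs --watch
$ kubectl logs -f job/<serialize-job-name>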

Run the Inference Service

To run the Inference Service, replace the model's URI in the 02-inference-service.yaml file with the S3 URI pointing to your custom model.

title="02-inference-service.yaml"
containers:
  - name: kfserving-container
    image: ghcr.io/coreweave/ml-containers/sd-inference:amercurio-sd-overhaul-7d29c61
    command:
      - "python3"
      - "/app/service.py"
      - "--model-uri=s3://tensorized/runwayml/stable-diffusion-v1-5"
      - "--precision=float16"
      - "--port=80"
If you are using a custom Inference Service image built with the Dockerfile provided in the tutorial repository, also replace the image in .containers.image so that it points to your custom image.


Finally, start the Inference Service using kubectl.

Example
$ kubectl apply -f 02-inference-service.yaml
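
Once applied, the Inference Service's readiness and URL can be checked with kubectl; KServe exposes these as inferenceservices resources:

Example
$ kubectl get inferenceservices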
