Tensorizer
Get extremely fast PyTorch model loads from HTTP/HTTPS and S3 endpoints with CoreWeave Tensorizer
CoreWeave Tensorizer is a PyTorch module, model, and tensor serializer and deserializer, which makes it possible to load models at high speeds from HTTP, HTTPS, and S3 endpoints. Tensorizer enables faster load times and lower resource costs, whether models are loaded over a network or from local disk volumes.
What is serialization?
In machine learning, serialization is the process of converting a data object (e.g. a model) into a format that can be stored or transmitted with minimal processing power. Serialization is typically quite slow, especially for large models, because it happens infrequently and is therefore rarely optimized for speed.
Deserialization is the same process in the opposite direction: recreating a serialized data object when it is needed, for example when loading a model into GPU memory to serve inference. Deserialization occurs more frequently than serialization, but has traditionally been a slow process as well.
How Tensorizer works
Tensorizer significantly reduces latency and resource usage thanks to its "zero-copy" model loading method.
Instead of loading the entire model into RAM before transferring it to the GPU, Tensorizer streams the model in "chunk by chunk." It only needs a buffer the size of the largest tensor, plus a small amount of additional metadata used to locate each tensor.
This "tensor streaming" process is made possible by Tensorizer's bespoke serialization format, which puts all necessary metadata at the beginning of a single binary file. This file can then be loaded quickly and efficiently whether it comes from local storage, an HTTP/HTTPS endpoint, or an S3 bucket.
CoreWeave's Tensorizer serializes a model into a single binary file with all metadata stored up front, and it supports bfloat16. The format is also safe to load, since no arbitrary code is saved alongside the tensors.
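To make the format concrete, below is a minimal sketch of serializing a small PyTorch module to a single .tensors file and loading it back. The file path and the stand-in model are placeholders; see the Tensorizer open-source documentation for the full TensorSerializer and TensorDeserializer API.

```python
import torch
from tensorizer import TensorSerializer, TensorDeserializer

# Stand-in for a real model; any torch.nn.Module works the same way.
model = torch.nn.Linear(1024, 1024)

# Serialize: every tensor and its metadata is written into one binary
# file, with the metadata placed at the beginning of the file.
serializer = TensorSerializer("model.tensors")
serializer.write_module(model)
serializer.close()

# Deserialize: the same argument could be an HTTP/HTTPS or S3 URI, in
# which case the tensors are streamed in chunk by chunk.
fresh_model = torch.nn.Linear(1024, 1024)
deserializer = TensorDeserializer("model.tensors", device="cpu")
deserializer.load_into_module(fresh_model)
deserializer.close()
```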
Larger model metrics
When tested on a larger model with a higher-performing GPU, the impact of Tensorizer on model load time becomes even more pronounced. The results below show that CoreWeave's Tensorizer outperformed both SafeTensors and Hugging Face on model load times for OPT-30B with NVIDIA A100 Tensor Core GPUs.
- CoreWeave's Tensorizer: 23.23 sec. (median); 22.8 sec. (average)
- SafeTensors: 36.75 sec. (median); 39.3 sec. (average)
- Hugging Face: 35.18 sec. (median); 32.1 sec. (average)
Benchmarks
Currently, two benchmarks are available for Tensorizer, which can be reproduced to see first-hand how Tensorizer works.
Real world impact benchmark
The real world impact benchmark sets up two Inference Services serving GPT-J-6B: one uses Tensorizer to load the model, and the other uses Hugging Face Transformers. With both Services sitting idle at 1 GPU, each is then hit with 100 concurrent requests, and metrics on average response time and autoscaling behavior are collected for comparison.
This benchmark may be replicated by following the real world impact benchmark tutorial.
Comparison benchmark
The comparison benchmark measures raw model load time from a PVC across vanilla Hugging Face, SafeTensors, and Tensorizer. This comparison is much like a lab test of raw loading performance.
The code for this benchmark may be viewed on CoreWeave's GitHub.
Features
📈 Reduction in resource usage
With faster model load times for LLMs and reduced GPU memory utilization, Tensorizer helps accelerate model instance spin-up times while reducing the resource cost of serving inference. Transfers to the GPU are nearly instantaneous, which reduces the amount of CPU and RAM the instance needs and therefore the cost it incurs.
Tensorizer's average latency per request was more than 5x lower than Hugging Face's when scaling from zero. Due to its "lazy loading" capability, Tensorizer also required fewer pod spin-ups and significantly less RAM than the alternatives.
👍 S3-compatible
Tensorizer is S3- and HTTP-compatible, enabling models to be streamed directly from S3 into the container without first downloading them to the container's local filesystem. Tensorizer serializes a model and its associated tensors into a single file, which can then be loaded quickly and efficiently from an HTTP/HTTPS or S3 endpoint. Serialized models can also be stored in CoreWeave's S3-compatible Object Storage.
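As a rough sketch of what this looks like in code, a serialized model in a private S3 bucket could be streamed into a module as shown below. The bucket, object key, and credentials are placeholders, and the open_stream keyword arguments shown are assumptions to double-check against the Tensorizer open-source documentation.

```python
import torch
from tensorizer import TensorDeserializer, stream_io

# Placeholder URI and credentials for a private S3 bucket (assumed names).
stream = stream_io.open_stream(
    "s3://YOURBUCKET/model.tensors",
    mode="rb",
    s3_access_key_id="<YOUR ACCESS KEY>",
    s3_secret_access_key="<YOUR SECRET KEY>",
    s3_endpoint="object.ord1.coreweave.com",
)

# Stand-in module; in practice this matches the serialized architecture.
model = torch.nn.Linear(1024, 1024)
deserializer = TensorDeserializer(stream, device="cpu")
deserializer.load_into_module(model)
deserializer.close()
```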
⚡ Extremely fast model loading speeds
Tensorizer decreases model load times by reducing the container image size, as it avoids embedding the model in the container image. The impact of this scales with the size of the model, becoming particularly important for models that are already large, such as EleutherAI/gpt-neox-20B, which weighs in at ~40GB.
✏️ Flexible iteration
By decoupling the model from the container image, model updates do not require rebuilding of container images, which allows for quick iterations on the model itself. This also allows for the deployment of new model versions without waiting for the container image to build, or for the container image cache to be populated.
🔒 Tensor weight encryption
Tensorizer encrypts tensor weights using a modified XSalsa20-Poly1305 symmetric authenticated encryption scheme. This process occurs in the background while the model loads, and is parallelized over multiple CPU cores, resulting in lower latency compared to other encryption methods. The encryption algorithm is applied to each data "chunk" independently, allowing authentication of the integrity of data within each "chunk".
Tensorizer only encrypts tensor weights, not the model in its entirety. Tensorizer does not provide encryption or authentication for any other information about the model and its architecture, including metadata about the tensor, such as the name, dtype, shape, size, or non-keyed hashes. Additionally, the tensor weight encryption does not provide message authentication for metadata, and does not protect against reordering or truncation of chunks.
For more information, including reference and tutorials, see the Tensorizer open-source documentation.
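As a hedged sketch only: the snippet below encrypts tensor weights at serialization time and decrypts them at load time. It assumes the EncryptionParams and DecryptionParams helpers and the encryption keyword described in the Tensorizer open-source documentation; verify the exact names and signatures there.

```python
import torch
from tensorizer import (
    TensorSerializer,
    TensorDeserializer,
    EncryptionParams,   # assumed helper from the Tensorizer docs
    DecryptionParams,   # assumed helper from the Tensorizer docs
)

model = torch.nn.Linear(256, 256)  # stand-in model

# Generate a random symmetric key; store it securely, since the same key
# is required again at load time.
encryption = EncryptionParams.random()

serializer = TensorSerializer("encrypted.tensors", encryption=encryption)
serializer.write_module(model)
serializer.close()

# Only the tensor weights are encrypted; tensor names, dtypes, and shapes
# are not, as noted above.
decryption = DecryptionParams.from_key(encryption.key)
deserializer = TensorDeserializer(
    "encrypted.tensors", device="cpu", encryption=decryption
)
restored = torch.nn.Linear(256, 256)
deserializer.load_into_module(restored)
deserializer.close()
```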
🏡 Local filesystem support
Tensorizer also supports loading and serializing models from a local filesystem at high speeds, following the same principles used to provide fast load times for HTTP, HTTPS, and S3 endpoints.
Pre-Tensorized models on CoreWeave
CoreWeave Cloud provides multiple pre-Tensorized models, which can be used with the TensorDeserializer class. Object Storage support defaults to the accel-object.ord1.coreweave.com endpoint and uses the tensorized bucket.
Access all available pre-Tensorized models in the README file of the source code's GitHub repository.
Read more about CoreWeave Object Storage endpoints.
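For illustration, a pre-Tensorized model in the public tensorized bucket could be pulled straight from its S3 URI as sketched below. The object path is a placeholder to be taken from the repository's README, and the dict-style access and lazy_load flag are assumptions to confirm in the Tensorizer open-source documentation.

```python
from tensorizer import TensorDeserializer

# Placeholder object path; the real paths are listed in the README of
# the Tensorizer GitHub repository.
uri = "s3://tensorized/<MODEL_PATH>/model.tensors"

# The public bucket requires no credentials; reads default to the
# accel-object.ord1.coreweave.com endpoint.
deserializer = TensorDeserializer(uri, device="cpu", lazy_load=True)

# The deserializer behaves like a mapping of tensor names to tensors.
state_dict = {name: tensor for name, tensor in deserializer.items()}
deserializer.close()
```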
Run the tutorial using custom serialization
Tensorizer models can be loaded from and serialized to a local filesystem or network storage. When using a Tensorizer model for inference, we recommend uploading it to an S3 endpoint, as described below.
Generate an S3 key
First, generate an S3 key from the Object Storage section of the CoreWeave Cloud App.
Create an Object Storage bucket
Next, create a new Object Storage bucket using the s3cmd tool:
```bash
$ s3cmd mb s3://YOURBUCKET
```
Install the S3 secrets and endpoint hostname
To install the S3 access and secret keys created earlier, first base64-encode each of the key values.
$echo -n "<your key>" | base64"
For example:
$echo -n "<YOUR ACCESS KEY>" | base64QUNDRVNTX0tFWV9IRVJF$echo -n "<YOUR SECRET KEY>" | base64U0VDUkVUX0tFWV9IRVJF
Then, in the 00-optional-s3-secret.yaml file, replace the access key and secret key placeholders in the .data.access_key and .data.secret_key fields with your base64-encoded keys, respectively. For example:
```yaml
apiVersion: v1
data:
  access_key: QUNDRVNTX0tFWV9IRVJF
kind: Secret
metadata:
  name: s3-access-key
type: Opaque
---
apiVersion: v1
data:
  secret_key: U0VDUkVUX0tFWV9IRVJF
kind: Secret
metadata:
  name: s3-secret-key
type: Opaque
```
The S3 endpoint URL of the new S3 bucket must also be included in the 00-optional-s3-secret.yaml file.
First, base64-encode the endpoint URL. The endpoint URL should correspond to the region in which your new bucket is hosted. In this example, the ORD1 region is used, which means the hostname of the Object Storage endpoint URL is object.ord1.coreweave.com.
$echo -n "object.ord1.coreweave.com" | base64b2JqZWN0Lm9yZDEuY29yZXdlYXZlLmNvbQ==
Replace the host URL placeholder (.data.url) with the base64-encoded S3 endpoint URL of the new bucket. For example:
```yaml
apiVersion: v1
data:
  url: b2JqZWN0Lm9yZDEuY29yZXdlYXZlLmNvbQ==
kind: Secret
metadata:
  name: s3-host-url
type: Opaque
```
Once these values are replaced, create the Secrets by applying the 00-optional-s3-secret.yaml file using kubectl.
```bash
$ kubectl apply -f 00-optional-s3-secret.yaml
```
Serialize the model
01-optional-s3-serialize-job.yaml runs the serialization Job for the model when deployed.
Before deploying the Job, adjust the following command arguments in the 01-optional-s3-serialize-job.yaml file:
- Replace the value of the command option --dest-bucket with the name of the bucket to which the model will be serialized.
- Replace the value of the command option --hf-model-id with the ID of the model you would like to serialize. (By default, the model ID is set to runwayml/stable-diffusion-v1-5. Additional model IDs are available on Hugging Face.)
Once these values are added, deploy the Job using kubectl to run it:
```bash
$ kubectl apply -f 01-optional-s3-serialize-job.yaml
```
Run the Inference Service
To run the Inference Service, replace the model's URI in the 02-inference-service.yaml file with the S3 URI pointing to your custom model.
```yaml
containers:
  - name: kfserving-container
    image: ghcr.io/coreweave/ml-containers/sd-inference:amercurio-sd-overhaul-7d29c61
    command:
      - "python3"
      - "/app/service.py"
      - "--model-uri=s3://tensorized/runwayml/stable-diffusion-v1-5"
      - "--precision=float16"
      - "--port=80"
```
If you are using a custom-built Inference Service image based on the Dockerfile provided in the tutorial repository, additionally replace the URL in .containers.image to point to the custom image.
Finally, start the Inference Service using kubectl.
```bash
$ kubectl apply -f 02-inference-service.yaml
```