Tensorizer

Get extremely fast PyTorch model loads from HTTP/HTTPS and S3 endpoints with CoreWeave Tensorizer

CoreWeave Tensorizer is a PyTorch module, model, and tensor serializer and deserializer that makes it possible to load models extremely quickly from HTTP/HTTPS and S3 endpoints. It also speeds up loads over the network in general, as well as loads from local disk volumes.

What is serialization?

In machine learning, serialization is the process of converting a data object (e.g. a model) into a format that can be stored or transmitted easily.

Serialization is typically slow: because it happens infrequently, it is rarely optimized for write speed. Deserialization is the same process in the opposite direction, that is, recreating a serialized data object when it is needed. Deserialization happens far more often than serialization, yet it has traditionally been slow as well.

How Tensorizer works

Loading a very large model into GPU memory can be extremely slow. With Tensorizer, latency and resource usage are significantly reduced, thanks to Tensorizer's “zero-copy” model loading.

This means that instead of loading the whole model into RAM before transferring it to the GPU, Tensorizer pulls the model over "chunk by chunk." It needs only a buffer the size of the largest tensor, plus some additional metadata used to find the locations of the tensors.

This "tensor streaming" process is made possible by Tensorizer's bespoke serialization format, which puts all necessary metadata at the beginning of a single binary file. This file can then be loaded quickly and efficiently from local storage, an HTTP/HTTPS endpoint, or an S3 bucket.
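To make the "metadata first, then stream" idea concrete, here is a minimal sketch of a toy file layout using only the Python standard library. It is not Tensorizer's actual on-disk format: the function names `write_toy_file` and `stream_toy_file`, the JSON index, and the 8-byte length prefix are all assumptions made for illustration. The point it shows is that once the index is read from the front of the file, each tensor can be read in turn into one reusable buffer sized to the largest tensor.

```python
import io
import json
import struct


def write_toy_file(tensors: dict[str, bytes]) -> bytes:
    """Pack named byte blobs into one buffer with all metadata up front."""
    index = []
    offset = 0
    for name, payload in tensors.items():
        index.append({"name": name, "offset": offset, "length": len(payload)})
        offset += len(payload)
    header = json.dumps(index).encode("utf-8")
    # Layout: 8-byte little-endian header length, header, then payloads in order.
    return struct.pack("<Q", len(header)) + header + b"".join(tensors.values())


def stream_toy_file(stream):
    """Yield (name, data) one tensor at a time, reusing a single buffer."""
    (header_len,) = struct.unpack("<Q", stream.read(8))
    index = json.loads(stream.read(header_len))
    # One buffer, sized to the largest tensor, serves every read.
    buffer = bytearray(max((entry["length"] for entry in index), default=0))
    for entry in index:
        view = memoryview(buffer)[: entry["length"]]
        filled = 0
        while filled < len(view):
            # readinto may return fewer bytes than requested on network streams.
            read = stream.readinto(view[filled:])
            if not read:
                raise EOFError("file ended before all tensor data was read")
            filled += read
        # Copying out of the buffer stands in for the transfer to the GPU.
        yield entry["name"], bytes(view)


blob = write_toy_file({"weight": b"\x01\x02\x03\x04", "bias": b"\x05\x06"})
loaded = dict(stream_toy_file(io.BytesIO(blob)))
```

Because the payloads are stored in index order, a reader never has to seek; the recorded offsets would only matter for a reader that wants random access to a single tensor.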

Note

CoreWeave’s Tensorizer serializes a model into a single binary file, with the metadata placed up front, and supports bfloat16. The process is safe, since no arbitrary code is saved.

Larger model metrics

When tested on a larger model using a higher-performing GPU, Tensorizer's impact on model load time increased significantly. The figures below show that CoreWeave’s Tensorizer outperformed both SafeTensors and Hugging Face on average model load times on OPT-30B with NVIDIA A100 Tensor Core GPUs.

  • CoreWeave’s Tensorizer: 23.23 sec. (median); 22.8 sec. (average)

  • SafeTensors: 36.75 sec. (median); 39.3 sec. (average)

  • Hugging Face: 35.18 sec. (median); 32.1 sec. (average)

Benchmarks

Currently, two benchmarks are available for Tensorizer, which can be reproduced to see first-hand how Tensorizer works.

Real world impact benchmark

The real world impact benchmark sets up two inference services: one that uses Tensorizer to load the model, and one that does not. Then, with both services sitting idle with 1 GPU, each is hit with 100 concurrent requests. Metrics on average response time and autoscaling behavior are extracted for comparison.

This benchmark may be replicated by following the real world impact benchmark tutorial.

Comparison benchmark

The comparison benchmark measures raw model load time from a PVC across vanilla Hugging Face, SafeTensors, and Tensorizer, so that all three can be compared directly. This comparison is much like a lab test of performance.

The code for this benchmark may be viewed on CoreWeave's GitHub.
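As a rough illustration of what a load-time comparison involves, the sketch below times repeated calls to a loader and reports the median and average, the two statistics quoted earlier on this page. It is not CoreWeave's benchmark code: `time_loads` and `load_fn` are hypothetical names, and `load_fn` stands in for whichever loader is being measured.

```python
import statistics
import time
from typing import Callable


def time_loads(load_fn: Callable[[], object], runs: int = 5) -> dict[str, float]:
    """Call load_fn `runs` times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        load_fn()  # hypothetical loader under test
        durations.append(time.perf_counter() - start)
    return {
        "median": statistics.median(durations),
        "average": statistics.fmean(durations),
    }


# Placeholder workload; a real comparison would pass one loader per framework.
summary = time_loads(lambda: sum(range(10_000)), runs=3)
```

A fair comparison would also need to control for caching between runs, for example by loading from cold storage each time, which this sketch does not attempt.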

Features

📈 Reduction in resource usage

With faster model load times for LLMs and reduced GPU memory utilization, Tensorizer helps accelerate model instance spin-up times while reducing the overall cost of serving inference. Because transfers to the GPU occur nearly instantly when using Tensorizer, the instance needs less CPU and RAM, which lowers the costs incurred. When scaling from zero, the average latency per request was more than 5x faster for Tensorizer than for Hugging Face, and Tensorizer required fewer pod spin-ups and significantly less RAM thanks to its "lazy loading" capability.

👍 S3-compatible

Tensorizer is S3/HTTP-compatible. It serializes models and their associated tensors into a single file, which can then be loaded quickly and efficiently from an HTTP/HTTPS or S3 endpoint. Serialized models may also be stored in CoreWeave's S3-compatible Object Storage, enabling model streams directly from S3 into the container without having to download the model to the container's local filesystem.

Extremely fast model loading speeds

Keeping the model out of the container image greatly reduces the image size and, with it, the time it takes to load the model. This is especially important for models that are already large, such as EleutherAI/gpt-neox-20B, which weighs in at ~40GB.

✏️ Flexible iteration

Because the model is decoupled from the container image, model updates do not require rebuilding container images. This allows for quick iterations on the model itself, and new versions can be deployed without waiting for the container image to build or for the container image cache to be populated.

🏡 Local filesystem support

Tensorizer also has support for loading models from a local filesystem, so it can be used to serialize models locally and load them locally. This option provides extremely fast load times, as the same principles that make it fast for HTTP/HTTPS and S3 endpoints also apply to local filesystems.

Tip

The same is true of HTTP/HTTPS endpoints, as S3 is really just another HTTP/HTTPS endpoint.
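The local serialize-then-load path described above can be sketched as follows. This assumes the `tensorizer` and `torch` packages are installed; the function name `roundtrip_local`, the default file name, and the tiny `Linear` module are placeholders chosen for illustration, and the calls follow the basic usage shown in the Tensorizer README as best understood here.

```python
def roundtrip_local(path: str = "model.tensors"):
    """Serialize a small module to a local file, then load it into a fresh module."""
    # Imported here so this sketch can be read or imported without the
    # heavy dependencies installed.
    import torch
    from tensorizer import TensorDeserializer, TensorSerializer

    model = torch.nn.Linear(4, 2)  # placeholder for a real model

    # Write every tensor in the module to a single file.
    serializer = TensorSerializer(path)
    serializer.write_module(model)
    serializer.close()

    # Load the tensors back into a freshly constructed module of the same shape.
    restored = torch.nn.Linear(4, 2)
    deserializer = TensorDeserializer(path, device="cpu")
    deserializer.load_into_module(restored)
    deserializer.close()
    return model, restored
```

After the call, `restored` should hold the same weights as `model`; the same pattern is intended to apply when `path` is replaced by an HTTP/HTTPS or S3 location.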

Pre-Tensorized models on CoreWeave

Several pre-Tensorized models are available on CoreWeave Cloud for free, and can be used with the TensorDeserializer class. Object Storage support defaults to the accel-object.ord1.coreweave.com endpoint, and the bucket defaults to tensorized.
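A minimal sketch of loading one of these models with `TensorDeserializer` might look like the following. Several details are assumptions rather than facts from this page: the helper names `tensorized_uri` and `load_pretensorized`, the `<model>/<precision>/model.tensors` path layout inside the tensorized bucket, and the choice of EleutherAI/gpt-j-6B as the example. It also assumes the `tensorizer`, `torch`, and `transformers` packages are installed. Check the repository README for the models and paths that actually exist.

```python
def tensorized_uri(model_ref: str, precision: str = "fp16") -> str:
    """Build an S3 URI in the `tensorized` bucket (path layout is an assumption)."""
    return f"s3://tensorized/{model_ref}/{precision}/model.tensors"


def load_pretensorized(model_ref: str = "EleutherAI/gpt-j-6B"):
    """Stream a pre-Tensorized model from Object Storage into a fresh module."""
    # Imported here so the URI helper stays usable without these packages.
    import torch
    from tensorizer import TensorDeserializer
    from tensorizer.utils import no_init_or_tensor
    from transformers import AutoConfig, AutoModelForCausalLM

    # Build the model skeleton without allocating or initializing weights.
    config = AutoConfig.from_pretrained(model_ref)
    with no_init_or_tensor():
        model = AutoModelForCausalLM.from_config(config)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    deserializer = TensorDeserializer(tensorized_uri(model_ref), device=device)
    deserializer.load_into_module(model)
    deserializer.close()
    return model
```

Building the module under `no_init_or_tensor` avoids spending time on random weight initialization that the deserializer would immediately overwrite.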

Additional Resources

Read more about CoreWeave Object Storage endpoints.

See all available pre-Tensorized models in the README file of the source code GitHub repository.
