Get extremely fast PyTorch model loads from HTTP/HTTPS and S3 endpoints with CoreWeave Tensorizer
CoreWeave Tensorizer is a serializer and deserializer for PyTorch modules, models, and tensors that makes it possible to load models extremely quickly from HTTP/HTTPS and S3 endpoints. It also speeds up loads over the network and from local disk volumes.
In machine learning, serialization is the process of converting a data object (e.g. a model) into a format that can be stored or transmitted easily.
Serialization is typically slow: because it happens infrequently, it is rarely optimized for write speed. Deserialization is the same process in the opposite direction, recreating the data object from its serialized form when needed. Deserialization happens far more often than serialization, yet is traditionally slow as well.
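As a minimal illustration of the two directions, here is a plain-Python round trip using the standard-library pickle module (Tensorizer itself uses its own binary format, not pickle):

```python
import pickle

# A stand-in for a model's state: parameter names mapped to values.
state = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0]}

# Serialization: convert the in-memory object to bytes for storage or transport.
blob = pickle.dumps(state)

# Deserialization: recreate an equivalent object from the bytes.
restored = pickle.loads(blob)

assert restored == state
```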
Loading a very large model into GPU memory is normally extremely slow. With Tensorizer, latency and resource usage are significantly reduced, thanks to Tensorizer's "zero-copy" model loading.
How Tensorizer fits into the flow of the request in the inference engine on CoreWeave Cloud
This means that instead of loading the whole model into RAM before transferring it to the GPU, Tensorizer pulls the model over "chunk by chunk." Tensorizer allocates a buffer the size of the largest tensor, plus a small amount of additional metadata used to locate each tensor in the file.
This "tensor streaming" process is made possible by Tensorizer's bespoke serialization format, which puts all necessary metadata at the beginning of a single binary file. This file can then be loaded quickly and efficiently from local storage, an HTTP/HTTPS endpoint, or an S3 bucket.
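To make the metadata-first idea concrete, the sketch below (an illustrative format, not Tensorizer's actual on-disk layout) writes an index of (name, offset, length) entries at the start of a file, so a reader can fetch any single tensor's bytes by seeking directly to it rather than reading the whole file:

```python
import io
import json
import struct

def write_file(tensors: dict) -> bytes:
    """Write a metadata-first binary file: [header length][JSON index][tensor data]."""
    index, data = {}, bytearray()
    for name, raw in tensors.items():
        index[name] = {"offset": len(data), "length": len(raw)}
        data.extend(raw)
    header = json.dumps(index).encode()
    # 8-byte little-endian header length, then the index, then the payload.
    return struct.pack("<Q", len(header)) + header + bytes(data)

def read_tensor(f, name: str) -> bytes:
    """Read one tensor by consulting the index, then seeking straight to its bytes."""
    (header_len,) = struct.unpack("<Q", f.read(8))
    index = json.loads(f.read(header_len))
    entry = index[name]
    f.seek(8 + header_len + entry["offset"])
    return f.read(entry["length"])

blob = write_file({"weight": b"\x01\x02\x03\x04", "bias": b"\x05\x06"})
assert read_tensor(io.BytesIO(blob), "bias") == b"\x05\x06"
```

The same seek-based access pattern maps onto HTTP range requests, which is what makes streaming individual tensors from a remote endpoint efficient.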
When tested on a larger model size using a higher-performing GPU, the impact of Tensorizer on model load time significantly increased. The chart below shows that CoreWeave’s Tensorizer outperformed both SafeTensors and Hugging Face on average model load times on OPT-30B with NVIDIA A100 Tensor Core GPUs.
- CoreWeave’s Tensorizer: 23.23 sec. (median); 22.8 sec. (average)
- SafeTensors: 36.75 sec. (median); 39.3 sec. (average)
- Hugging Face: 35.18 sec. (median); 32.1 sec. (average)
Currently, two benchmarks are available for Tensorizer, which can be reproduced to see first-hand how Tensorizer works.
The real-world impact benchmark sets up two inference services, one that uses Tensorizer to load the model and one that does not. Then, with both services sitting idle with 1 GPU, each is hit with 100 concurrent requests. Metrics on average response time and autoscaling behavior are collected for comparison.
The comparison benchmark measures raw model load time from a PVC across vanilla Hugging Face, Safetensors, and Tensorizer, for comparison between all three. This benchmark is much like a lab test of raw performance.
The code for this benchmark may be viewed on CoreWeave's GitHub.
With faster model load times for LLMs and reduced GPU memory utilization, Tensorizer accelerates model instance spin-up while reducing the overall cost of serving inference. Because transfers to the GPU occur nearly instantly, the instance also needs less CPU and RAM, further lowering costs through reduced resource usage. In the benchmark, average latency per request was more than 5x faster for Tensorizer than for Hugging Face when scaling from zero, and Tensorizer required fewer pod spin-ups and significantly less RAM thanks to its "lazy loading" capability.
Tensorizer is S3/HTTP-compatible: it serializes a model and its associated tensors into a single file, which can then be loaded quickly and efficiently from an HTTP/HTTPS or S3 endpoint. Serialized models may also be stored in CoreWeave's S3-compatible Object Storage and streamed directly from S3 into the container, without having to download the model to the container's local filesystem.
Avoiding embedding the model in the container image greatly reduces the image size, and with it the time it takes to load the model. This is especially important for models that are already large, such as EleutherAI/gpt-neox-20B, which weighs in at
By decoupling the model from the container image, model updates do not require rebuilding the container image. This allows quick iteration on the model itself, and new versions can be deployed without waiting for the container image to build or for the image cache to be populated.
Tensorizer also supports loading models from a local filesystem, so models can be serialized and loaded entirely locally. This provides extremely fast load times, as the same principles that make it fast for HTTP/HTTPS and S3 endpoints also apply to local filesystems.
The same is true of HTTP/HTTPS endpoints, as S3 is really just another HTTP/HTTPS endpoint.
Several pre-Tensorized models are available on CoreWeave Cloud for free, and can be used with the TensorDeserializer class. Object Storage support defaults to the accel-object.ord1.coreweave.com endpoint, and the bucket to use
See all available pre-Tensorized models in the source code GitHub repository's