Maximizing read performance with CoreWeave AI Object Storage is critical for keeping your GPUs fully utilized. This guide explains how to get the most out of reads and how to use the LOTA (Local Object Transport Accelerator) cache effectively. Object storage has two distinct types of performance bottleneck: metadata path issues and data path issues. Identifying which category your bottleneck falls into is the first step toward resolving it.
## Metadata path optimization
Each AI Object Storage request requires a metadata lookup to authenticate the request and obtain the location of the data. If the metadata lookup is slow, the response is slow. To optimize the metadata path, you can use the following techniques.

### Avoid key-range hot-spotting
Hot-spotting occurs when many concurrent requests target the same narrow range of object keys (an object's name within a bucket). Sequential object keys, such as `sample_000001`, `sample_000002`, are a common cause: object storage systems partition data by key range, and concentrated access patterns can overload individual partitions.
Use hashed prefixes. Rather than organizing objects with sequential or predictable key names, prepend a hash to distribute keys evenly across partitions.
hashing-prefixes.py
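The original snippet is not shown here; the following is a minimal sketch of the idea, with the helper name and prefix length chosen for illustration:

```python
import hashlib

def hashed_key(base_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hex hash so sequential names spread across key ranges."""
    digest = hashlib.sha256(base_key.encode()).hexdigest()[:prefix_len]
    return f"{digest}/{base_key}"

# Sequential keys such as sample_000001, sample_000002 now begin with
# unrelated prefixes, so concurrent reads no longer target one partition.
print(hashed_key("sample_000001"))
print(hashed_key("sample_000002"))
```

Any client that lists or reads by the original name must apply the same hashing function, so keep the helper in shared code.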
When multiple clients may create the same object, use a conditional write (`If-None-Match: *`) to guarantee that only one write succeeds.
### Avoid small object overhead
As object size decreases, the per-request metadata lookup becomes a larger proportion of total latency. The same problem arises when issuing many small range reads against a large object. In both cases, the overhead of the metadata path dominates and throughput suffers. Keep read sizes large: aim for at least 15 MB per request; performance begins to degrade noticeably below 1 MB. If your data is organized into many small files, consider the following strategies:

- Consolidate small files into larger archives. Formats like TAR, WebDataset, or TFRecord let you pack thousands of small samples into multi-megabyte or multi-gigabyte objects that can be read efficiently.
- Use large range reads. When reading portions of a large object, request contiguous ranges of at least 15 MB rather than many small, scattered offsets.
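As a sketch, the per-request ranges can be computed with a small helper; the 16 MB chunk size here is one choice above the 15 MB guideline, and each yielded range becomes one `GetObject` call:

```python
def byte_ranges(object_size: int, chunk: int = 16 * 1024 * 1024):
    """Yield contiguous HTTP Range headers covering an object in large chunks."""
    for start in range(0, object_size, chunk):
        end = min(start + chunk, object_size) - 1
        yield f"bytes={start}-{end}"

# With an S3 client such as boto3, each range is a single request, e.g.:
#   s3.get_object(Bucket=bucket, Key=key, Range=rng)
for rng in byte_ranges(40 * 1024 * 1024):
    print(rng)
```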
## Data path optimization
Each request to object storage writes data to, or retrieves data from, the backend storage service. To optimize the data path, you can use the following techniques.

### Minimize network contention with LOTA
LOTA (Local Object Transport Accelerator) is a caching proxy installed on every Node in your CKS cluster. When you use the LOTA endpoint (http://cwlota.com) instead of the primary endpoint (https://cwobject.com), reads are served from local NVMe SSDs attached to Nodes in the same cluster as your workload, eliminating network round-trips to the storage backend. The cache is shared across all Nodes in the cluster, scales with cluster size, uses LRU eviction, and maintains strong consistency.
LOTA also supports cross-region reads. When a workload reads from a bucket whose home region is elsewhere, LOTA fetches the object from the home region’s repository and caches it on the local Nodes. Subsequent reads are served from the local cache at full speed, so workloads in any region can access a single global dataset with local-like performance.
Switching to LOTA requires only an endpoint change:
| Scenario | Endpoint | Notes |
|---|---|---|
| Inside your CoreWeave cluster | http://cwlota.com | LOTA caches GET requests only. Write operations are proxied to the storage backend. |
| Outside CoreWeave | https://cwobject.com | No local caching; all requests go directly to the storage backend. |
### Maximize parallelism
AI Object Storage is designed for massively parallel access. Most high-performance applications already parallelize their reads, but it is worth verifying that your client is using the maximum concurrency available. Internal benchmarking with Warp shows that 9,000 concurrent operations across a 30-node cluster (300 per node) achieves strong throughput on CoreWeave AI Object Storage. The optimal concurrency varies by workload: start at 300 per node and adjust based on whether throughput is still climbing or has started to decline. Set `max_pool_connections` in the `Config` to match your desired concurrency level or higher; the default (10) is frequently too low for high-throughput workloads.
The following partial example shows how to configure multi-threaded GETs with Boto3 and concurrent.futures.
Before completing and running this Boto3 code example, make sure you have configured your CoreWeave credentials. Using a separate profile for CoreWeave AI Object Storage is recommended to avoid conflicts with your other AWS profiles and S3-compatible services; if you do not set up this configuration, you may encounter errors when using AI Object Storage.

To configure CoreWeave credentials:

1. Create a new credentials file and profile in your CoreWeave configuration directory.
2. When prompted for information, provide the following values:
   - AWS Access Key ID: the Access Key ID of your CoreWeave AI Object Storage Access Key.
   - AWS Secret Access Key: the Secret Key of your CoreWeave AI Object Storage Access Key.
   - Default region name: optional. To set a default region, refer to the CoreWeave Availability Zones.
   - Default output format: use `json` for JSON output.
3. Set the default endpoint URL to the appropriate endpoint for your use case:
   - The primary endpoint, `https://cwobject.com`, for use when running outside of a CoreWeave cluster.
   - The LOTA endpoint, `http://cwlota.com`, for use when running inside a CoreWeave cluster. The LOTA endpoint routes requests through the LOTA cache for best performance.
4. Set the S3 `addressing_style` to `virtual`.
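Assuming the standard AWS CLI/SDK configuration files and a profile named `coreweave` (both choices are illustrative; recent AWS SDK and CLI versions read `endpoint_url` from the config file), the result of these steps might look like:

```ini
# ~/.aws/credentials
[coreweave]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-key>

# ~/.aws/config
[profile coreweave]
endpoint_url = https://cwobject.com
s3 =
    addressing_style = virtual
```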
maximizing-parallelism.py
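The example itself is not reproduced here; the following is a sketch of what `maximizing-parallelism.py` might contain, assuming the `coreweave` profile described above. Bucket and key names are placeholders, and boto3 is imported lazily so the threading helper stands on its own:

```python
import concurrent.futures

CONCURRENCY = 300  # per-node starting point; tune per workload

def make_client(endpoint: str = "http://cwlota.com"):
    # Lazy import so the threading helper below is usable without boto3.
    import boto3
    from botocore.config import Config

    return boto3.Session(profile_name="coreweave").client(
        "s3",
        endpoint_url=endpoint,
        config=Config(max_pool_connections=CONCURRENCY),
    )

def fetch_all(keys, fetch, workers: int = CONCURRENCY):
    """Run fetch(key) for every key across a thread pool, preserving order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, keys))

# Usage (bucket and keys are placeholders):
#   s3 = make_client()
#   bodies = fetch_all(keys, lambda k: s3.get_object(Bucket="my-bucket", Key=k)["Body"].read())
```

Setting `max_pool_connections` to at least the thread count matters: otherwise threads queue behind the connection pool and the extra concurrency is wasted.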
## Optimize LOTA cache performance
When using the LOTA endpoint, requests get better performance because LOTA improves network efficiency and serves data from high-performance storage devices. The following techniques help you get the most out of LOTA caching:

- Pre-stage data in the cache
- Use multipart uploads for large objects
- Contact CoreWeave for large dataset caching
- Cross-region writes
### Pre-stage data in the cache
LOTA caches data on first read, but you can proactively warm the cache before a production workload starts by pre-staging objects with a HeadObject call. This eliminates cold-start latency on the first read. See Pre-stage the LOTA cache for instructions.

### Use multipart uploads for large objects
Uploading large objects using the S3 multipart API distributes data across LOTA partitions, optimizing performance. When you use multipart upload, CoreWeave AI Object Storage preserves the parts, which enables LOTA to spread the data across Nodes. Conversely, objects uploaded with a single `PutObject` call are stored on one Node, which can create a bottleneck for large objects.
LOTA automatically invalidates cached parts when they are updated, so no stale data is served during iterative uploads of the same object.
Use a minimum part size of 50 MB to reduce HTTP request overhead while still enabling efficient distribution. LOTA only caches objects larger than 4 MB. Smaller objects bypass the cache.
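As a sketch using boto3's high-level transfer API (the profile, endpoint, and names are assumptions carried over from the configuration above), a 50 MB part size can be set via `TransferConfig`; the `part_count` helper simply shows how many parts a given object size produces:

```python
def part_count(object_size: int, part_size: int = 50 * 1024 * 1024) -> int:
    """Parts produced by a multipart upload at a given part size (ceiling division)."""
    return max(1, -(-object_size // part_size))

def upload_multipart(path: str, bucket: str, key: str,
                     part_size: int = 50 * 1024 * 1024) -> None:
    # Lazy import keeps the helper above usable without boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    # upload_file switches to multipart automatically past multipart_threshold.
    cfg = TransferConfig(multipart_threshold=part_size, multipart_chunksize=part_size)
    s3 = boto3.Session(profile_name="coreweave").client(
        "s3", endpoint_url="https://cwobject.com"
    )
    s3.upload_file(path, bucket, key, Config=cfg)

# A 1 GiB object at 50 MB parts:
print(part_count(1024 * 1024 * 1024))  # 21 parts
```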