Metadata path optimization
Each AI Object Storage request requires a metadata lookup to authenticate the request and obtain the location of the data. If the metadata lookup is slow, the request response is slow. Optimizing the metadata path improves response times. For metadata path optimization, you can use the following techniques:
- Avoid key-range hot-spotting.
- Avoid small object overhead.
Avoid key-range hot-spotting
Hot-spotting occurs when many concurrent requests target the same narrow range of object keys (the name given to an object in a bucket). Sequential object keys, such as sample_000001 and sample_000002, are a common cause. Object storage systems partition data by key range, and concentrated access patterns can overload individual partitions.
Use hashed prefixes. Rather than organizing objects with sequential or predictable key names, prepend a hash to distribute keys evenly across partitions.
hashing-prefixes.py
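A minimal sketch of the hashed-prefix idea, assuming a short hex prefix derived from SHA-256 and a `/` separator. The function name `hashed_key`, the prefix length, and the choice of hash are illustrative assumptions, not requirements of AI Object Storage.

```python
import hashlib


def hashed_key(name: str, prefix_len: int = 4) -> str:
    """Return `name` with a short, deterministic hash prefix prepended.

    The prefix depends only on `name`, so a reader can recompute the full
    key from the logical name without keeping a lookup table.
    """
    # SHA-256 is used here only to spread keys evenly, not for security.
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{name}"


# Sequential names map to keys that no longer share a common leading range.
keys = [hashed_key(f"sample_{i:06d}") for i in range(1, 4)]
```

Because the prefix is deterministic, listing objects in their original sequential order is no longer possible by key alone, so keep a manifest of logical names if your workload needs ordered iteration.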
If-None-Match: *) to guarantee that only one write succeeds.
Avoid small object overhead
As object size decreases, the per-request metadata lookup becomes a larger proportion of total latency. The same problem arises when you issue many small range reads against a large object. In both cases, the overhead of the metadata path dominates and throttles throughput.
Keep read sizes large. Aim for at least 15 MB per request. Performance begins to degrade noticeably below 1 MB. If your data is organized into many small files, consider the following strategies:
- Consolidate small files into larger archives. Formats like TAR, WebDataset, or TFRecord let you pack thousands of small samples into multi-megabyte or multi-gigabyte objects that can be read efficiently.
- Use large range reads. When reading portions of a large object, request contiguous ranges of at least 15 MB rather than many small, scattered offsets.
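The consolidation strategy can be sketched with Python's standard `tarfile` module. This is one possible approach under stated assumptions: samples are held in memory as a name-to-bytes mapping, shards are uncompressed TAR archives, and the helper names (`group_into_shards`, `pack_samples`, `unpack_samples`) are hypothetical. The 15 MB target mirrors the read-size guidance above.

```python
import io
import tarfile

# Target shard size, following the "at least 15 MB per request" guidance.
TARGET_SHARD_BYTES = 15 * 1024 * 1024


def group_into_shards(samples: dict[str, bytes], target_bytes: int = TARGET_SHARD_BYTES):
    """Yield groups of samples whose combined payload reaches target_bytes.

    The final group may be smaller if the samples run out first.
    """
    shard, size = {}, 0
    for name, payload in samples.items():
        shard[name] = payload
        size += len(payload)
        if size >= target_bytes:
            yield shard
            shard, size = {}, 0
    if shard:
        yield shard


def pack_samples(samples: dict[str, bytes]) -> bytes:
    """Pack small in-memory samples into one uncompressed TAR archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack_samples(archive_bytes: bytes) -> dict[str, bytes]:
    """Read every regular file back out of a TAR archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        return {m.name: archive.extractfile(m).read() for m in archive if m.isfile()}
```

Each packed shard can then be uploaded as a single object, so one metadata lookup serves many samples.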
Data path optimization
Each request to object storage requires writing or retrieving data from the backend storage service. Optimizing the data path improves throughput and latency. For data path optimization, you can use the following techniques:
- Minimize network contention with LOTA.
- Maximize parallelism.
Minimize network contention with LOTA
LOTA (Local Object Transport Accelerator) is a caching proxy installed on every Node in your CoreWeave Kubernetes Service (CKS) cluster. When you use the LOTA endpoint (http://cwlota.com) instead of the primary endpoint (https://cwobject.com), reads come from local NVMe SSDs attached to Nodes in the same cluster as your workload. This eliminates network round-trips to the storage backend. The cache is shared across all Nodes in the cluster, scales with cluster size, uses LRU eviction, and maintains strong consistency.
LOTA also supports cross-region reads. When a workload reads from a bucket whose home region is elsewhere, LOTA fetches the object from the home region’s repository and caches it on the local Nodes. Later reads come from the local cache at full speed, so workloads in any region can access a single global dataset with local-like performance.
Switching to LOTA requires only an endpoint change:
| Scenario | Endpoint | Notes |
|---|---|---|
| Inside your CoreWeave cluster | http://cwlota.com | LOTA caches GET requests only. Write operations are proxied to the storage backend. |
| Outside CoreWeave | https://cwobject.com | No local caching. All requests go directly to the storage backend. |
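As a rough illustration of the table above, the following sketch picks an endpoint at startup. It assumes that the presence of the `KUBERNETES_SERVICE_HOST` environment variable (which Kubernetes sets inside pods) is a good enough signal that the code runs inside your CoreWeave cluster. That heuristic and the function name `choose_endpoint` are assumptions for illustration, so substitute whatever signal fits your deployment.

```python
import os

LOTA_ENDPOINT = "http://cwlota.com"        # inside a CoreWeave cluster
PRIMARY_ENDPOINT = "https://cwobject.com"  # outside CoreWeave


def choose_endpoint(environ=None) -> str:
    """Return the LOTA endpoint when running in a pod, else the primary one."""
    env = os.environ if environ is None else environ
    # Assumption: any Kubernetes pod sets this variable, so this check cannot
    # tell a CoreWeave cluster apart from another Kubernetes cluster.
    if env.get("KUBERNETES_SERVICE_HOST"):
        return LOTA_ENDPOINT
    return PRIMARY_ENDPOINT
```

The returned URL can be passed as the endpoint URL when you construct your S3 client.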
Maximize parallelism
AI Object Storage is designed for massively parallel access. Most high-performance applications already parallelize their reads, but it's worth verifying that your client uses the maximum concurrency available. Internal benchmarking with Warp shows that 9,000 concurrent operations across a 30-node cluster (300 per node) achieve high throughput on AI Object Storage. The optimal concurrency varies by workload. Start at 300 per node and adjust based on whether throughput is still climbing or declining. Set max_pool_connections in the Config to match your desired concurrency level or higher. The default (10) is frequently too low for high-throughput workloads.
The following partial example shows how to configure multi-threaded GETs with Boto3 and concurrent.futures.
Before completing and running this Boto3 code example, make sure you have configured your CoreWeave credentials.
We recommend using a separate profile for CoreWeave AI Object Storage to avoid conflicts with your other AWS profiles and S3-compatible services. If you don’t set up this configuration, you might encounter errors when using AI Object Storage.
Configure CoreWeave credentials
- Create a new credentials file and profile in your CoreWeave configuration directory.
- When prompted, provide the following values:
  - AWS Access Key ID: The Access Key ID of your CoreWeave AI Object Storage Access Key.
  - AWS Secret Access Key: The Secret Key of your CoreWeave AI Object Storage Access Key.
  - Default region name (Optional): To set a default region, see CoreWeave Availability Zones.
  - Default output format: Use json for JSON output.
- Set the default endpoint URL to the appropriate endpoint for your use case:
  - The primary endpoint, https://cwobject.com, for use outside a CoreWeave cluster.
  - The LOTA endpoint, http://cwlota.com, for use inside a CoreWeave cluster. The LOTA endpoint routes to the LOTA path for best performance.
- Set the S3 addressing_style to virtual.
maximizing-parallelism.py
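A minimal sketch of multi-threaded GETs with Boto3 and concurrent.futures, under stated assumptions: the profile name `cw`, the bucket name, and the helper names `make_client` and `fetch_all` are hypothetical, and the concurrency of 300 is only the suggested starting point. The connection pool is sized to match the thread count so that worker threads don't wait on pooled connections.

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 300  # suggested per-node starting point; tune for your workload


def make_client(endpoint_url="http://cwlota.com", profile_name="cw"):
    """Build an S3 client whose connection pool matches CONCURRENCY."""
    # Imported here so fetch_all() below stays usable without Boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(
        max_pool_connections=CONCURRENCY,    # the default of 10 is usually too low
        s3={"addressing_style": "virtual"},
    )
    session = boto3.Session(profile_name=profile_name)  # "cw" is a hypothetical profile
    return session.client("s3", endpoint_url=endpoint_url, config=config)


def fetch_all(fetch_one, keys, max_workers=CONCURRENCY):
    """Call fetch_one(key) for every key on a thread pool.

    Results are returned in the same order as `keys`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))


# Example wiring (requires configured credentials and a real bucket):
#
#   s3 = make_client()
#   def get(key):
#       return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
#   blobs = fetch_all(get, ["shard-0000.tar", "shard-0001.tar"])
```

Boto3 clients are generally safe to share across threads, which is why a single client serves the whole pool here.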
Optimize LOTA cache performance
When you use the LOTA endpoint, requests get better performance because LOTA improves network efficiency and serves data from high-performance storage devices. The following sections describe techniques that help you get the most out of LOTA caching:
- Pre-stage data in the cache.
- Use multipart uploads for large objects.
- Contact CoreWeave for large dataset caching.
- Handle cross-region writes.
Pre-stage data in the cache
LOTA caches data on first read, but you can proactively warm the cache before a production workload starts by pre-staging objects with a HeadObject call. This eliminates cold-start latency on the first read. See Pre-stage the LOTA cache for instructions.
Use multipart uploads for large objects
Uploading large objects using the S3 multipart API distributes data across LOTA partitions, optimizing performance. When you use multipart upload, AI Object Storage preserves the parts, which lets LOTA spread the data across Nodes. Conversely, objects uploaded with a single PutObject call reside on one Node, which can create a bottleneck for large objects.
LOTA automatically invalidates cached parts when they’re updated, so no stale data reaches clients during iterative uploads of the same object.
Use a minimum part size of 50 MB to reduce HTTP request overhead while still allowing efficient distribution. LOTA only caches objects larger than 4 MB. Smaller objects bypass the cache.
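The part-size guidance can be sketched with Boto3's managed transfer settings. This is an illustrative sketch, not a definitive implementation: the helper names `choose_part_size` and `upload_large` are hypothetical, MB is treated as 1024 * 1024 bytes, and the 10,000-part ceiling is the documented AWS S3 limit, assumed here to apply to AI Object Storage as well.

```python
import math
import os

MB = 1024 * 1024
MIN_PART_SIZE = 50 * MB  # minimum part size recommended above
MAX_PARTS = 10_000       # AWS S3 multipart limit; assumed to apply here too


def choose_part_size(object_size: int) -> int:
    """Return the smallest part size >= 50 MB that stays within MAX_PARTS."""
    return max(MIN_PART_SIZE, math.ceil(object_size / MAX_PARTS))


def upload_large(client, path: str, bucket: str, key: str) -> None:
    """Upload a local file with multipart parts of at least 50 MB."""
    # Imported here so choose_part_size() stays usable without Boto3 installed.
    from boto3.s3.transfer import TransferConfig

    part_size = choose_part_size(os.path.getsize(path))
    config = TransferConfig(
        multipart_threshold=part_size,  # files at or above this size use multipart
        multipart_chunksize=part_size,  # size of each uploaded part
    )
    client.upload_file(path, bucket, key, Config=config)
```

Files smaller than the threshold are still sent with a single PutObject call by the managed transfer.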