
Maximizing read performance with CoreWeave AI Object Storage is critical for keeping your GPUs fully utilized. This guide explains how to identify and resolve the bottlenecks that limit read performance and how to use the LOTA (Local Object Transport Accelerator) cache effectively. Object storage bottlenecks fall into two categories: metadata path issues and data path issues. Identifying which category your bottleneck falls into is the first step toward resolving it.

Metadata path optimization

Each AI Object Storage request requires a metadata lookup to authenticate the request and obtain the location of the data. If the metadata lookup is slow, the request response will be slow. Optimizing the metadata path ensures the best performance for requests. For metadata path optimization, you can use the following techniques:

Avoid key-range hot-spotting

Hot-spotting occurs when many concurrent requests target the same narrow range of object keys (the name given to an object in a bucket). Sequential object keys, such as sample_000001 and sample_000002, are a common cause: object storage systems partition data by key range, so concentrated access patterns can overload individual partitions.

Use hashed prefixes. Rather than organizing objects with sequential or predictable key names, prepend a hash to distribute keys evenly across partitions.
hashing-prefixes.py
import hashlib

def hashed_key(original_key: str) -> str:
    """Prepend a short hash to distribute keys across partitions."""
    prefix = hashlib.md5(original_key.encode()).hexdigest()[:6]
    return f"{prefix}/{original_key}"

# Instead of:
#   dataset/train/sample_000001.bin
#   dataset/train/sample_000002.bin

# Use:
#   a3f1b2/dataset/train/sample_000001.bin
#   7c9e4d/dataset/train/sample_000002.bin
Use exponential back-off. In rarer cases, many clients may hit the same small key range simultaneously; for example, multiple training jobs may read the same metadata index at startup. Hashing alone does not prevent this pattern. Retry with exponential back-off, since transient overloads often resolve when clients stagger their retries. For objects that every client needs (for example, shared index files), replicate them under different prefixes and distribute reads across the copies.

Hashed prefixes do not help with writes or checkpointing to a bucket, but they do help with the subsequent loading. When multiple workers write to the same checkpoint key, use conditional writes (If-None-Match: *) to guarantee that only one write succeeds.
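A minimal back-off sketch, assuming a generic fetch callable (for example, a wrapper around s3.get_object); the function name, retry limit, and delay values here are illustrative, not CoreWeave defaults:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=0.1):
    """Call fetch(), retrying transient failures with exponential back-off.

    Full jitter (a random delay in [0, base_delay * 2**attempt]) staggers
    retries so many clients do not hammer the same key range in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Demo: a fetch that fails twice before succeeding, standing in for a
# read against a transiently overloaded key range.
attempts = []

def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("503 Slow Down")
    return b"object bytes"

data = fetch_with_backoff(flaky_fetch, base_delay=0.001)
```

In production, prefer catching your client's specific throttling exception rather than bare Exception.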

Avoid small object overhead

As object size decreases, the per-request metadata lookup becomes a larger proportion of total latency. The same problem arises when issuing many small range reads against a large object. In both cases, the overhead of the metadata path dominates and throughput suffers.

Keep read sizes large. Aim for at least 15 MB per request; performance begins to degrade noticeably below 1 MB. If your data is organized into many small files, consider the following strategies:
  • Consolidate small files into larger archives. Formats like TAR, WebDataset, or TFRecord let you pack thousands of small samples into multi-megabyte or multi-gigabyte objects that can be read efficiently.
  • Use large range reads. When reading portions of a large object, request contiguous ranges of at least 15 MB rather than many small, scattered offsets.
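As a sketch of the consolidation strategy, the following packs many small samples into a single TAR archive using Python's standard library (here in memory; the sample names and payloads are illustrative) and reads one member back:

```python
import io
import tarfile

# Many small samples, each about 1 KB -- too small to read individually.
samples = {f"sample_{i:06d}.bin": bytes([i % 256]) * 1024 for i in range(100)}

# Pack them into one archive, which becomes a single multi-megabyte object.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in samples.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

archive = buf.getvalue()  # upload this as one object

# Readers download the whole archive (or a large range) and unpack locally.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
    member = tar.extractfile("sample_000042.bin").read()
```

Formats like WebDataset and TFRecord apply the same idea with streaming-friendly layouts for training pipelines.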

Data path optimization

Each request to object storage requires writing or retrieving data from the backend storage service. Optimizing the data path ensures the best performance for requests. For data path optimization, you can use the following techniques:

Minimize network contention with LOTA

LOTA (Local Object Transport Accelerator) is a caching proxy installed on every Node in your CKS cluster. When you use the LOTA endpoint (http://cwlota.com) instead of the primary endpoint (https://cwobject.com), reads are served from local NVMe SSDs attached to Nodes in the same cluster as your workload, eliminating network round-trips to the storage backend. The cache is shared across all Nodes in the cluster, scales with cluster size, uses LRU eviction, and maintains strong consistency.

LOTA also supports cross-region reads. When a workload reads from a bucket whose home region is elsewhere, LOTA fetches the object from the home region's repository and caches it on the local Nodes. Subsequent reads are served from the local cache at full speed, so workloads in any region can access a single global dataset with local-like performance.

Switching to LOTA requires only an endpoint change:
| Scenario | Endpoint | Notes |
| --- | --- | --- |
| Inside your CoreWeave cluster | http://cwlota.com | LOTA caches GET requests only. Write operations are proxied to the storage backend. |
| Outside CoreWeave | https://cwobject.com | No local caching; all requests go directly to the storage backend. |
See Attaching endpoints for setup details.

Maximize parallelism

AI Object Storage is designed for massively parallel access. Most high-performance applications already parallelize their reads, but it is worth verifying that your client is using the maximum concurrency available. Internal benchmarking with Warp shows that 9,000 concurrent operations across a 30-Node cluster (300 per Node) achieves strong throughput on CoreWeave AI Object Storage. The optimal concurrency varies by workload; start at 300 per Node and adjust based on whether throughput is still climbing or has begun to decline.

Set max_pool_connections in the Boto3 Config to match your desired concurrency level or higher. The default (10) is frequently too low for high-throughput workloads.

The following example shows how to configure multi-threaded GETs with Boto3 and concurrent.futures. Before completing and running it, make sure you have configured your CoreWeave credentials. Using a separate profile for CoreWeave AI Object Storage is recommended to avoid conflicts with your other AWS profiles and S3-compatible services; without this configuration, you may encounter errors when using AI Object Storage.
  1. Create a new credentials file and profile in your CoreWeave configuration directory.
    Create a new credentials file and profile
    AWS_SHARED_CREDENTIALS_FILE=~/.coreweave/cw.credentials aws configure --profile cw
    
  2. When prompted for information, provide the following values:
    • AWS Access Key ID: The Access Key ID of your CoreWeave AI Object Storage Access Key.
    • AWS Secret Access Key: The Secret Key of your CoreWeave AI Object Storage Access Key.
    • Default region name: Optional. To set a default region, refer to the CoreWeave Availability Zones.
    • Default output format: Use json for JSON output.
  3. Set the default endpoint URL to the appropriate endpoint for your use case:
    • The primary endpoint, https://cwobject.com, for use when running outside of a CoreWeave cluster.
    • The LOTA endpoint, http://cwlota.com, for use when running inside a CoreWeave cluster. The LOTA endpoint routes to the LOTA path for best performance.
    Set the primary endpoint for local development
    AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set endpoint_url https://cwobject.com --profile cw
    
  4. Set the S3 addressing_style to virtual:
    Set virtual addressing style
    AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set s3.addressing_style virtual --profile cw
    
maximizing-parallelism.py
import boto3
from botocore.client import Config
from concurrent.futures import ThreadPoolExecutor, as_completed

# This client picks up the `cw` profile configured above when AWS_PROFILE=cw,
# AWS_SHARED_CREDENTIALS_FILE, and AWS_CONFIG_FILE are exported in the environment.
s3 = boto3.client(
    's3',
    endpoint_url='http://cwlota.com',
    config=Config(
        max_pool_connections=50,  # Match or exceed the ThreadPoolExecutor max_workers below
        s3={'addressing_style': 'virtual'}
    )
)

object_keys = [f"dataset/shard-{i:05d}.tar" for i in range(1000)]

def download_object(key):
    response = s3.get_object(Bucket='[BUCKET-NAME]', Key=key)
    return response['Body'].read()

with ThreadPoolExecutor(max_workers=50) as executor:
    futures = {executor.submit(download_object, key): key for key in object_keys}
    for future in as_completed(futures):
        data = future.result()
        # Process data...

Optimize LOTA cache performance

When using the LOTA endpoint, requests complete faster because LOTA keeps traffic inside the cluster and serves data from high-performance local storage devices. The following techniques help you get the most out of LOTA caching:

Pre-stage data in the cache

LOTA caches data on first read, but you can proactively warm the cache before a production workload starts by pre-staging objects with a HeadObject call. This eliminates cold-start latency on the first read. See Pre-stage the LOTA cache for instructions.
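A minimal pre-staging sketch, assuming a head callable such as a wrapper around s3.head_object (the function name, worker count, and key list are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def prestage(head, keys, workers=32):
    """Warm the LOTA cache by issuing a HEAD request for every key.

    `head` is any callable taking one key, e.g.
        lambda key: s3.head_object(Bucket='[BUCKET-NAME]', Key=key)
    Returns the number of keys touched.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(1 for _ in pool.map(head, keys))

# Demo with a stand-in head callable that records each key it sees.
seen = []
warmed = prestage(seen.append,
                  [f"dataset/shard-{i:05d}.tar" for i in range(20)],
                  workers=4)
```

Run this once before the production workload starts so the first real reads hit a warm cache.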

Use multipart uploads for large objects

Uploading large objects using the S3 multipart API distributes data across LOTA partitions, optimizing performance. When you use multipart upload, CoreWeave AI Object Storage preserves the parts, which enables LOTA to spread the data across Nodes. Conversely, objects uploaded with a single PutObject call are stored on one Node, which can create a bottleneck for large objects. LOTA automatically invalidates cached parts when they are updated, so no stale data is served during iterative uploads of the same object.

Use a minimum part size of 50 MB to reduce HTTP request overhead while still enabling efficient distribution. Note that LOTA only caches objects larger than 4 MB; smaller objects bypass the cache.
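As a sketch, the part-splitting behind a multipart upload looks roughly like this; the 50 MB figure follows the guidance above, and the commented Boto3 call (which handles the splitting automatically via its transfer manager) assumes a configured client, bucket, and file name:

```python
PART_SIZE = 50 * 1024 * 1024  # 50 MB minimum part size, per the guidance above

def part_ranges(object_size, part_size=PART_SIZE):
    """Return (start, end) byte ranges covering each multipart upload part."""
    return [(start, min(start + part_size, object_size))
            for start in range(0, object_size, part_size)]

# With Boto3, the transfer manager performs this splitting for you:
#   from boto3.s3.transfer import TransferConfig
#   cfg = TransferConfig(multipart_threshold=PART_SIZE, multipart_chunksize=PART_SIZE)
#   s3.upload_file('checkpoint.bin', '[BUCKET-NAME]', 'checkpoint.bin', Config=cfg)
```

A 120 MB object, for example, splits into three parts (50 MB, 50 MB, 20 MB), each of which LOTA can cache on a different Node.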

Contact CoreWeave for large dataset caching

If you have a large dataset and want all of it resident in the LOTA cache, reach out to CoreWeave support. LOTA has a substantial amount of cache capacity available and can likely accommodate your entire dataset. The CoreWeave team will ensure your organization has the proper cache allocation and settings configured.

Cross-region writes

CoreWeave AI Object Storage supports cross-region writes, but only reads are cached by LOTA. If your workload writes across regions, ensure it tolerates the higher latencies associated with sending data between regions.

Data transfer optimization

To copy data from a PVC to AI Object Storage, CoreWeave recommends using the CoreWeave fork of s5cmd.
Last modified on April 30, 2026