Maximizing read performance with CoreWeave AI Object Storage is critical for keeping your GPUs busy. This guide is for engineers tuning training, inference, or data-pipeline workloads against AI Object Storage. It explains how to maximize read performance and how to use the LOTA (Local Object Transport Accelerator) cache effectively. Object storage has two types of performance bottlenecks: metadata path issues and data path issues. Identifying which category your bottleneck falls into is the first step toward resolving it. The following sections describe techniques for each category, then explain how to get the most out of the LOTA cache.

Metadata path optimization

Each AI Object Storage request requires a metadata lookup to authenticate the request and obtain the location of the data. If the metadata lookup is slow, the request response is slow. Optimizing the metadata path improves response times. For metadata path optimization, you can use the following techniques:
  • Avoid key-range hot-spotting.
  • Avoid small object overhead.

Avoid key-range hot-spotting

Hot-spotting occurs when many concurrent requests target the same narrow range of object keys (an object key is the name given to an object in a bucket). Sequential object keys, such as sample_000001 and sample_000002, are a common cause. Object storage systems partition data by key range, so concentrated access patterns can overload individual partitions.

Use hashed prefixes. Rather than organizing objects with sequential or predictable key names, prepend a hash to distribute keys evenly across partitions.
hashing-prefixes.py
import hashlib

def hashed_key(original_key: str) -> str:
    """Prepend a short hash to distribute keys across partitions."""
    prefix = hashlib.md5(original_key.encode()).hexdigest()[:6]
    return f"{prefix}/{original_key}"

# Instead of:
#   dataset/train/sample_000001.bin
#   dataset/train/sample_000002.bin

# Use:
#   a3f1b2/dataset/train/sample_000001.bin
#   7c9e4d/dataset/train/sample_000002.bin
Use exponential back-off. In rarer cases, many clients may hit the same small key range simultaneously. For example, multiple training jobs may read the same metadata index at startup. Hashing alone does not prevent this pattern. Retry with exponential back-off, since transient overloads often resolve when clients stagger their retries.

Replicate shared objects. For objects that every client needs (for example, shared index files), store copies under different prefixes and distribute reads across the copies.

Use conditional writes for checkpoints. Hashed prefixes do not speed up writes or checkpointing to a bucket, but they do help when the data is loaded later. When multiple workers write to the same checkpoint key, use conditional writes (If-None-Match: *) to guarantee that only one write succeeds.
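The retry advice above can be sketched as follows. This is a minimal illustration, not CoreWeave client logic: the names `backoff_delays` and `call_with_backoff`, their parameters, and the choice of "full jitter" (a random delay between zero and an exponentially growing bound) are assumptions made for the example.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=10.0, rng=random.random):
    """Yield one randomized delay (in seconds) per retry.

    The upper bound doubles on every retry (base, 2*base, 4*base, ...) up to
    `cap`. Drawing the actual delay uniformly below that bound ("full jitter")
    keeps clients that failed together from retrying together.
    """
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts=5, retry_on=(Exception,),
                      sleep=time.sleep, **delay_options):
    """Call `operation()` until it succeeds or `max_attempts` is reached."""
    delays = backoff_delays(max_attempts, **delay_options)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # Out of retries: surface the last error.
            sleep(delay)
```

In real use, `retry_on` would be narrowed to the throttling or overload errors that your client raises, so that permanent failures such as a missing key are not retried.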

Avoid small object overhead

As object size decreases, the per-request metadata lookup becomes a larger proportion of total latency. The same problem arises when you issue many small range reads against a large object. In both cases, the overhead of the metadata path dominates and throttles throughput.

Keep read sizes large. Aim for at least 15 MB per request. Performance begins to degrade noticeably below 1 MB. If your data is organized into many small files, consider the following strategies:
  • Consolidate small files into larger archives. Formats like TAR, WebDataset, or TFRecord let you pack thousands of small samples into multi-megabyte or multi-gigabyte objects that can be read efficiently.
  • Use large range reads. When reading portions of a large object, request contiguous ranges of at least 15 MB rather than many small, scattered offsets.
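As a rough illustration of the range-read strategy, the helper below splits an object into contiguous ranges of at least 15 MB each. The function name `plan_ranges` and the decision to fold the remainder into the last range are assumptions made for this sketch, not part of AI Object Storage.

```python
MIN_RANGE_BYTES = 15 * 1024 * 1024  # 15 MB, the per-request target named above


def plan_ranges(object_size, range_bytes=MIN_RANGE_BYTES):
    """Split an object into contiguous HTTP Range header values.

    Every range is `range_bytes` long except the last, which absorbs the
    remainder. No range is therefore smaller than `range_bytes` unless the
    whole object is.
    """
    if object_size <= 0:
        return []
    count = max(1, object_size // range_bytes)
    ranges = []
    for i in range(count):
        start = i * range_bytes
        # HTTP byte ranges are inclusive, hence the "- 1".
        end = object_size - 1 if i == count - 1 else start + range_bytes - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges
```

Each returned string can be passed as the `Range` argument of a Boto3 `get_object` call, and the calls can run in parallel.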

Data path optimization

Each request to object storage requires writing or retrieving data from the backend storage service. Optimizing the data path improves throughput and latency. For data path optimization, you can use the following techniques:
  • Minimize network contention with LOTA.
  • Maximize parallelism.

Minimize network contention with LOTA

LOTA (Local Object Transport Accelerator) is a caching proxy installed on every Node in your CoreWeave Kubernetes Service (CKS) cluster. When you use the LOTA endpoint (http://cwlota.com) instead of the primary endpoint (https://cwobject.com), reads come from local NVMe SSDs attached to Nodes in the same cluster as your workload. This eliminates network round-trips to the storage backend. The cache is shared across all Nodes in the cluster, scales with cluster size, uses LRU eviction, and maintains strong consistency.

LOTA also supports cross-region reads. When a workload reads from a bucket whose home region is elsewhere, LOTA fetches the object from the home region’s repository and caches it on the local Nodes. Later reads come from the local cache at full speed, so workloads in any region can access a single global dataset with local-like performance.

Switching to LOTA requires only an endpoint change:
Scenario | Endpoint | Notes
Inside your CoreWeave cluster | http://cwlota.com | LOTA caches GET requests only. Write operations are proxied to the storage backend.
Outside CoreWeave | https://cwobject.com | No local caching. All requests go directly to the storage backend.
See Attaching endpoints for setup details.
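If the same code runs both inside and outside the cluster, the endpoint choice can be made at startup. The sketch below is one possible approach under stated assumptions: `CW_OBJECT_ENDPOINT` is a hypothetical override variable invented for this example, and `KUBERNETES_SERVICE_HOST` (which Kubernetes sets in every Pod) is only a rough proxy for "inside the cluster", because it cannot tell a CoreWeave cluster from any other Kubernetes cluster.

```python
import os

LOTA_ENDPOINT = "http://cwlota.com"        # inside a CoreWeave cluster
PRIMARY_ENDPOINT = "https://cwobject.com"  # everywhere else


def select_endpoint(environ=None):
    """Pick the object storage endpoint from the environment.

    CW_OBJECT_ENDPOINT (hypothetical) wins if set. Otherwise the presence of
    KUBERNETES_SERVICE_HOST is treated as "running in a Pod" and selects LOTA.
    """
    env = os.environ if environ is None else environ
    override = env.get("CW_OBJECT_ENDPOINT")
    if override:
        return override
    if env.get("KUBERNETES_SERVICE_HOST"):
        return LOTA_ENDPOINT
    return PRIMARY_ENDPOINT
```

The result can be passed as `endpoint_url` when creating the client. An explicit override is useful when a Pod runs in a non-CoreWeave cluster and must use the primary endpoint.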

Maximize parallelism

AI Object Storage is designed for massively parallel access. Most high-performance applications already parallelize their reads, but it’s worth verifying that your client uses the maximum concurrency available. Internal benchmarking with Warp shows that 9,000 concurrent operations across a 30-node cluster (300 per node) achieve high throughput on AI Object Storage.

The optimal concurrency varies by workload. Start at 300 per node, then raise the value while throughput is still climbing and lower it once throughput starts to decline. When using Boto3, set max_pool_connections in the client Config to match your desired concurrency level or higher. The default (10) is frequently too low for high-throughput workloads.

The following partial example shows how to configure multi-threaded GETs with Boto3 and concurrent.futures. Before completing and running this Boto3 code example, make sure you have configured your CoreWeave credentials. We recommend using a separate profile for CoreWeave AI Object Storage to avoid conflicts with your other AWS profiles and S3-compatible services. If you don’t set up this configuration, you might encounter errors when using AI Object Storage.
  1. Create a new credentials file and profile in your CoreWeave configuration directory.
    Create a new credentials file and profile
    AWS_SHARED_CREDENTIALS_FILE=~/.coreweave/cw.credentials aws configure --profile cw
    
  2. When prompted, provide the following values:
    • AWS Access Key ID: The Access Key ID of your CoreWeave AI Object Storage Access Key.
    • AWS Secret Access Key: The Secret Key of your CoreWeave AI Object Storage Access Key.
    • Default region name (Optional): To set a default region, see CoreWeave Availability Zones.
    • Default output format: Use json for JSON output.
  3. Set the default endpoint URL to the appropriate endpoint for your use case:
    • The primary endpoint, https://cwobject.com, for use outside a CoreWeave cluster.
    • The LOTA endpoint, http://cwlota.com, for use inside a CoreWeave cluster. The LOTA endpoint routes to the LOTA path for best performance.
    Set the primary endpoint for local development
    AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set endpoint_url https://cwobject.com --profile cw
    
  4. Set the S3 addressing_style to virtual:
    Set virtual addressing style
    AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set s3.addressing_style virtual --profile cw
    
maximizing-parallelism.py
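import boto3
from botocore.client import Config
from concurrent.futures import ThreadPoolExecutor, as_completed

# Keep the connection pool at least as large as the thread pool so that
# worker threads do not wait on each other for a free connection.
MAX_CONCURRENCY = 50  # Adjust this value to match your desired concurrency level

s3 = boto3.client(
    's3',
    endpoint_url='http://cwlota.com',
    config=Config(
        max_pool_connections=MAX_CONCURRENCY,
        s3={'addressing_style': 'virtual'}
    )
)

object_keys = [f"dataset/shard-{i:05d}.tar" for i in range(1000)]

def download_object(key):
    """Fetch one object and return its contents as bytes."""
    response = s3.get_object(Bucket='[BUCKET-NAME]', Key=key)
    return response['Body'].read()

with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
    # Map each future back to its key so that results can be attributed.
    futures = {executor.submit(download_object, key): key for key in object_keys}
    for future in as_completed(futures):
        key = futures[future]
        data = future.result()  # Re-raises any exception from the worker thread
        # Process data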
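
To find the concurrency level at which throughput stops climbing, one option is to time the same set of reads at several levels. The sketch below is illustrative only: `measure_throughput`, `sweep_concurrency`, and the default levels are assumptions made for this example, and `fetch` stands for any function that takes a key and returns the object bytes (such as `download_object` above).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(fetch, keys, concurrency, clock=time.perf_counter):
    """Fetch every key with `concurrency` threads; return bytes per second."""
    start = clock()
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        total_bytes = sum(len(body) for body in executor.map(fetch, keys))
    elapsed = clock() - start
    return total_bytes / elapsed if elapsed > 0 else float("inf")


def sweep_concurrency(fetch, keys, levels=(75, 150, 300, 600)):
    """Measure each level in turn and return (best_level, results)."""
    results = {level: measure_throughput(fetch, keys, level) for level in levels}
    best_level = max(results, key=results.get)
    return best_level, results
```

Two caveats apply when interpreting the numbers. The client's max_pool_connections must be at least as large as the highest level tested, and the first pass over a set of keys may be slower than later passes because later passes can be served from the LOTA cache. Using a different sample of keys per level, or warming the cache first, keeps the comparison fair.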

Optimize LOTA cache performance

When you use the LOTA endpoint, requests get better performance because LOTA improves network efficiency and serves data from high-performance storage devices. The following sections describe techniques that help you get the most out of LOTA caching:
  • Pre-stage data in the cache.
  • Use multipart uploads for large objects.
  • Contact CoreWeave for large dataset caching.
  • Handle cross-region writes.

Pre-stage data in the cache

LOTA caches data on first read, but you can proactively warm the cache before a production workload starts by pre-staging objects with a HeadObject call. This eliminates cold-start latency on the first read. See Pre-stage the LOTA cache for instructions.
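As a rough sketch of what pre-staging many objects could look like, the function below fans out one HeadObject call per key. The name `prestage` and its parameters are assumptions made for this example. The sketch shows only how to parallelize the calls; whether pre-staging needs additional request options is covered in the linked instructions, not here.

```python
from concurrent.futures import ThreadPoolExecutor


def prestage(s3_client, bucket, keys, concurrency=32):
    """Issue one HeadObject per key; return {key: error} for the failures."""

    def head(key):
        try:
            s3_client.head_object(Bucket=bucket, Key=key)
            return key, None
        except Exception as exc:  # e.g. a missing key or a permissions error
            return key, exc

    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        outcomes = list(executor.map(head, keys))
    return {key: error for key, error in outcomes if error is not None}
```

`s3_client` is expected to be a Boto3 S3 client created against the LOTA endpoint. An empty return value means every call succeeded.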

Use multipart uploads for large objects

Uploading large objects with the S3 multipart API distributes their data across LOTA partitions, which improves read performance. When you use multipart upload, AI Object Storage preserves the parts, which lets LOTA spread the data across Nodes. In contrast, an object uploaded with a single PutObject call resides on one Node, which can create a bottleneck for large objects. LOTA automatically invalidates cached parts when they’re updated, so no stale data reaches clients during iterative uploads of the same object.

Use a minimum part size of 50 MB to reduce HTTP request overhead while still allowing efficient distribution. LOTA only caches objects larger than 4 MB. Smaller objects bypass the cache.
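With Boto3, the part size can be set through `TransferConfig`. The sketch below is a minimal example under stated assumptions: the helper names `part_count` and `upload_in_parts` are invented for illustration, and the multipart threshold is set equal to the part size, which is a choice made for this sketch and not a CoreWeave recommendation.

```python
PART_SIZE = 50 * 1024 * 1024  # 50 MB, the minimum part size suggested above


def part_count(object_size, part_size=PART_SIZE):
    """Number of parts an upload of `object_size` bytes would be split into."""
    return max(1, -(-object_size // part_size))  # ceiling division


def upload_in_parts(s3_client, path, bucket, key, part_size=PART_SIZE):
    """Upload `path` with Boto3's managed transfer, using `part_size` parts."""
    # Imported here so that the sizing helper above works without boto3.
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=part_size,  # switch to multipart at this size
        multipart_chunksize=part_size,  # size of each uploaded part
    )
    s3_client.upload_file(path, bucket, key, Config=config)
```

Files smaller than the threshold are still sent as a single PutObject by Boto3's managed transfer. `part_count` is handy for estimating how widely an object can be spread: a 1 GiB file yields 21 parts at this part size.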

Contact CoreWeave for large dataset caching

If you have a large dataset and want all of it resident in the LOTA cache, contact CoreWeave support. LOTA has ample cache capacity available and can likely accommodate your entire dataset. The CoreWeave team ensures that your organization has the proper cache allocation and settings configured.

Handle cross-region writes

AI Object Storage supports cross-region writes, but LOTA caches only reads. If your workload writes across regions, ensure that it tolerates the higher latencies associated with sending data between regions.

Data transfer optimization

When moving existing data into AI Object Storage, the transfer tool you choose affects throughput. For copying data from a PersistentVolumeClaim (PVC) to AI Object Storage, we recommend using the CoreWeave fork of s5cmd.
Last modified on June 4, 2026