GET, PUT, DELETE, LIST, STAT, and mixed workloads.
This guide shows you how to deploy Warp inside a CKS cluster and run read, write, mixed, and multipart benchmarks against CoreWeave AI Object Storage. Use these benchmarks to measure storage performance for your workloads, compare LOTA cache performance against the primary endpoint, and validate concurrency and object-size tuning before running production data pipelines.
Prerequisites
This guide assumes that you already have the following prerequisites in place:
- A CKS cluster
- A Node Pool with a Node to run the benchmarks on
- kubectl installed and configured to access your cluster
- A CoreWeave AI Object Storage access key and secret key
- The necessary permissions to create an AI Object Storage bucket
Create a dedicated benchmark bucket
Create a dedicated bucket for benchmarking using the Cloud Console. This guide names the bucket warp-benchmark-bucket and uses the US-EAST-04A availability zone.
Alternatively, you can create a dedicated bucket using S3-compatible clients. For example, you can create a bucket named warp-benchmark-bucket in the US-EAST-04A availability zone using the following command, which works only if you’ve already created a CoreWeave-specific configuration:
The rest of this guide uses the following bucket configuration values:
- Bucket name: warp-benchmark-bucket
- Availability zone: US-EAST-04A
Deploy Warp on a Node
Warp must run from inside your CKS cluster to produce meaningful results. Benchmarking from an external network measures network latency rather than storage performance, and the LOTA endpoint (http://cwlota.com) is only accessible from a CoreWeave cluster.
Deploy Warp as a Pod using the official Warp container image. Warp does not use GPU resources, so no GPU resource request or Node affinity is needed. LOTA is available on all Nodes in the cluster. For details on targeting specific Node types, see Target specific GPUs or CPUs.
warp-benchmark.yaml
- Apply the manifest:
- Wait for the Pod to be running:
- Open a shell inside the Pod:
- Verify that Warp is available:
Set environment variables
In your Pod, set your CoreWeave AI Object Storage credentials, endpoint, and region using environment variables for convenience:
Set environment variables
Measure read throughput with GET
Read throughput is the key benchmark for data loading. The following configuration runs a get benchmark against the LOTA endpoint with recommended starting parameters:
warp-test-lota-get.yaml
- Create the benchmark configuration file:
Create GET benchmark configuration file
- Run the benchmark using the following command in your Pod:
Run GET benchmark
Recommended benchmark parameters
The following table lists the recommended benchmark parameters for read throughput. You can adjust them as needed to optimize performance for your workload.

| Parameter | Value | Description |
|---|---|---|
| Concurrency Level | 300 | The number of parallel download operations to run. Start with 300 and adjust upward or downward. If you see throughput still climbing at 300, try increasing to 500 or more. If you see declining or unexpectedly low throughput, try decreasing the concurrency. |
| Object Size | 15 MiB | The size of the objects to use. Start with 15 MiB to stay above the threshold where metadata overhead becomes significant. |
| Number of Objects | 2,500 | The number of objects to use. Start with 2,500 to provide a pool for random selection. For most workloads, this is a good starting point. |
| Duration | 5 minutes | The duration of the benchmark. Start with 5 minutes to capture stable, representative throughput numbers. For most workloads, this is a good starting point. |
| Range Size | 1 MB | The size of the range to read. Start with 1 MB to avoid small-read overhead. Small reads are still served from the cache as long as the object size is greater than 4 MB. Objects smaller than 4 MB are not cached. |
| Autoterm | true | Automatically stop the benchmark when throughput stabilizes, preventing noisy results from incomplete warm-up periods. |
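As a quick sanity check on the starting values in the table above, the following Python sketch computes the total size of the benchmark object pool and confirms that the object size clears the 4 MB caching threshold. The function and constant names are hypothetical helpers, not part of Warp, and the sketch assumes that "4 MB" means 4 decimal megabytes.

```python
MIB = 1024 ** 2  # bytes in one mebibyte
GIB = 1024 ** 3  # bytes in one gibibyte
# Assumption: the "4 MB" threshold in the table is 4 decimal megabytes.
CACHE_THRESHOLD_BYTES = 4 * 1000 ** 2


def pool_size_gib(num_objects: int, object_size_mib: float) -> float:
    """Return the total size of the benchmark object pool in GiB."""
    return num_objects * object_size_mib * MIB / GIB


def is_cacheable(object_size_mib: float) -> bool:
    """Return True if an object of this size is larger than the caching threshold."""
    return object_size_mib * MIB > CACHE_THRESHOLD_BYTES


if __name__ == "__main__":
    # Starting values from the table: 2,500 objects of 15 MiB each.
    total = pool_size_gib(num_objects=2500, object_size_mib=15)
    print(f"Object pool: {total:.2f} GiB")
    print(f"15 MiB objects cacheable: {is_cacheable(15)}")
```

With the starting values, the pool works out to roughly 36.6 GiB, which is worth keeping in mind when you raise the object count or object size.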
Measure write throughput with PUT
Write throughput indicates how quickly you can load data into AI Object Storage, which is relevant for workloads such as checkpointing, dataset preparation, and log ingestion. The following configuration runs a put benchmark with recommended starting parameters:
warp-test-lota-put.yaml
- Create the benchmark configuration file:
Create PUT benchmark configuration file
- Run the benchmark using the following command in your Pod:
Run PUT benchmark
Measure mixed workload throughput
The mixed command simulates a realistic workload with a configurable mix of GET, PUT, STAT, and DELETE operations.
The following configuration runs a mixed benchmark with the default operation distribution:
warp-test-lota-mixed.yaml
- Create the benchmark configuration file:
Create mixed workload benchmark configuration file
- Run the benchmark using the following command in your Pod:
Run mixed workload benchmark
Configure the operation distribution
By default, the mixed benchmark uses the following operation distribution: 45% GET, 15% PUT, 30% STAT, 10% DELETE. To customize the distribution, set the distribution key in your YAML configuration file. The following partial example shows where to add it:
Mixed workload benchmark configuration
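Before writing a custom distribution into the configuration file, it can help to check what proportions a set of values implies. The following Python sketch converts operation weights into percentage shares. It is a standalone helper with hypothetical names, not Warp code, and this guide does not confirm how Warp treats values that do not sum to 100, so the safest choice is to pick values that do.

```python
# Default distribution stated in this guide: 45% GET, 15% PUT, 30% STAT, 10% DELETE.
DEFAULT_WEIGHTS = {"get": 45, "put": 15, "stat": 30, "delete": 10}


def operation_shares(weights: dict[str, float]) -> dict[str, float]:
    """Convert operation weights into percentage shares that sum to 100."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {op: 100 * w / total for op, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical read-heavy mix for a data-loading workload.
    custom = {"get": 70, "put": 10, "stat": 15, "delete": 5}
    for name, weights in (("default", DEFAULT_WEIGHTS), ("custom", custom)):
        shares = operation_shares(weights)
        print(name, {op: round(pct, 1) for op, pct in shares.items()})
```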
Measure multipart upload throughput
Multipart uploads split a single object into parts that are uploaded in parallel, which is useful for large objects such as model checkpoints or dataset archives. The following configuration runs a multipart-put benchmark with recommended starting parameters:
warp-test-lota-multipart-put.yaml
- Create the benchmark configuration file:
Create multipart upload benchmark configuration file
- Run the benchmark using the following command in your Pod:
Run multipart upload benchmark
Interpret Warp output
Warp reports include several important metrics:
Warp output
- Average and median throughput (GiB/s): Your sustained read or write rate. The median is more representative than the average when there are outliers.
- p50 / p90 / p99 request latency: These percentiles show how consistent performance is. A large gap between p50 and p99 may indicate contention or cache misses.
- Slowest 1-second window: Represents worst-case throughput. If this is much lower than the median, investigate potential sources of variability.
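The following Python sketch illustrates why the median and the latency percentiles in the list above are read together. It uses a nearest-rank percentile on a made-up set of request latencies. The sample values and function names are hypothetical, and Warp may compute its own percentiles differently.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(samples: list[float]) -> float:
    """Ratio of p99 to p50 latency; larger values indicate a heavier tail."""
    return percentile(samples, 99) / percentile(samples, 50)


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds, with a few slow outliers.
    latencies_ms = [20.0] * 90 + [25.0] * 8 + [400.0] * 2
    print("mean   :", statistics.mean(latencies_ms))
    print("median :", statistics.median(latencies_ms))
    for p in (50, 90, 99):
        print(f"p{p}    :", percentile(latencies_ms, p))
    print("p99/p50:", tail_ratio(latencies_ms))
```

In this made-up sample, two slow requests pull the mean above the median and produce a large p99/p50 ratio, which is the pattern to look for when investigating contention or cache misses.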
Compare LOTA and primary endpoint performance
Run the same benchmark against both cwlota.com and cwobject.com to quantify the performance benefit of LOTA caching. When benchmarking against cwobject.com, set tls: true in the remote section of your YAML configuration because the primary endpoint requires HTTPS. The LOTA endpoint uses HTTP from within the cluster, so tls: false is correct for cwlota.com. On the first run against LOTA, results reflect cache-miss performance. Run the benchmark a second time to see fully cached performance.
The following example configuration shows a GET benchmark for cwobject.com. You can modify the other benchmark configurations similarly to run against the primary endpoint:
warp-test-cwobject-get.yaml
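Once you have median throughput figures from the primary endpoint, the first LOTA run, and the second LOTA run, a small calculation makes the comparison explicit. The following Python sketch is one way to do it. The throughput numbers are illustrative placeholders, not measured results, and the function name is hypothetical.

```python
def speedup(baseline_gib_s: float, candidate_gib_s: float) -> float:
    """Return how many times faster the candidate run is than the baseline run."""
    if baseline_gib_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_gib_s / baseline_gib_s


if __name__ == "__main__":
    # Placeholder values in GiB/s; substitute the medians from your own Warp runs.
    primary = 2.0    # cwobject.com
    lota_cold = 2.5  # first run against cwlota.com (cache misses)
    lota_warm = 8.0  # second run against cwlota.com (cached)
    print(f"LOTA cold vs primary: {speedup(primary, lota_cold):.2f}x")
    print(f"LOTA warm vs primary: {speedup(primary, lota_warm):.2f}x")
    print(f"LOTA warm vs cold:    {speedup(lota_cold, lota_warm):.2f}x")
```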
Run distributed benchmarks
Warp supports distributed benchmarking, where multiple Warp client instances run in parallel across different Nodes and a coordinator aggregates the results. This is how you scale beyond a single Node to test higher concurrency levels. The concurrent value applies per client, so three clients with concurrent: 300 produce 900 total concurrent operations.
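Because the concurrent value is per client, planning a distributed run is a matter of simple multiplication. The following Python sketch works through that arithmetic in both directions. The helper names are hypothetical and not part of Warp.

```python
import math


def total_concurrency(clients: int, concurrent_per_client: int) -> int:
    """Total concurrent operations across all Warp clients."""
    return clients * concurrent_per_client


def clients_needed(target_total: int, concurrent_per_client: int) -> int:
    """Smallest number of clients that reaches at least the target total concurrency."""
    if concurrent_per_client <= 0:
        raise ValueError("concurrent_per_client must be positive")
    return math.ceil(target_total / concurrent_per_client)


if __name__ == "__main__":
    # Example from the text: three clients with concurrent: 300.
    print(total_concurrency(clients=3, concurrent_per_client=300))
    # Hypothetical target of 1,000 total operations at 300 per client.
    print(clients_needed(target_total=1000, concurrent_per_client=300))
```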
Running distributed Warp on Kubernetes requires deploying multiple Warp Pods with network connectivity between them. Warp provides a Helm chart and Kubernetes manifests that handle this using a StatefulSet for the client Pods and a Job for the coordinator.
To run distributed benchmarks on CKS, adapt the upstream Warp Kubernetes manifests with the following changes:
- Node scheduling: Add node affinity rules and any required tolerations to schedule Warp Pods on Nodes appropriate for your workload.
- Endpoint: Set the S3 host to cwlota.com instead of a MinIO server address.
- Credentials: Use your CoreWeave AI Object Storage access key and secret key.