Run SWE-bench evaluations in parallel using sandboxes on CoreWeave. This guide is for ML practitioners and researchers who want to evaluate language model code-fixing performance at scale. By the end, you will have a working setup that runs SWE-bench evaluations concurrently across many CoreWeave sandboxes, with results written locally for analysis.
What this does
SWE-bench tests whether language models can fix real bugs in real repositories. Each task applies a patch, runs the test suite, and checks if the fix works.
You can run SWE-bench locally with Docker, but throughput is then capped by your machine’s CPU, memory, and disk. CoreWeave sandboxes let you run evaluations at scale on CoreWeave infrastructure: spin up dozens or hundreds of sandboxes concurrently without managing the underlying infrastructure yourself. The script pulls pre-built images from Epoch AI’s registry on GHCR, so no local Docker build step is needed.
Setup
Complete the following steps to install the required tools and configure authentication before running any evaluations.
- Install the Python SDK and evaluation dependencies:
  uv pip install cwsandbox swebench datasets
- Clone the repo for the example evaluation scripts:
  git clone https://github.com/coreweave/cwsandbox-client.git
  cd cwsandbox-client
- Set CWSANDBOX_API_KEY to a CoreWeave API access token for authentication. Replace [ACCESS-TOKEN] with your CoreWeave API access token:
  export CWSANDBOX_API_KEY="[ACCESS-TOKEN]"
Quick start
Before you scale up, start with a single-instance run to confirm your environment, credentials, and network access to GHCR all work. The gold option uses the known-correct fix from the dataset:
uv run python examples/swebench/run_evaluation.py \
--predictions-path gold \
--instance-ids astropy__astropy-12907 \
--run-id test
This run should pass, which confirms that your sandbox can pull the image, apply the patch, and run the test suite end to end.
Run in parallel
Run multiple instances at once:
uv run python examples/swebench/run_evaluation.py \
--predictions-path gold \
--instance-ids \
astropy__astropy-12907 \
django__django-11039 \
django__django-11099 \
django__django-11283 \
matplotlib__matplotlib-23476 \
scikit-learn__scikit-learn-13142 \
sympy__sympy-13031 \
sympy__sympy-13647 \
--run-id parallel-test \
--max-workers 8
This command spins up eight sandboxes and runs them concurrently. All pass because gold patches are the correct fixes.
Evaluate model predictions
To evaluate patches generated by your own model, pass a predictions file instead of gold:
uv run python examples/swebench/run_evaluation.py \
--predictions-path predictions.json \
--instance-ids django__django-11039 scikit-learn__scikit-learn-13142 \
--run-id eval-run-1 \
--max-workers 10
The predictions file is a JSON array with one object per instance. Each object pairs an instance ID with the model name and the patch to evaluate:
[
{
"instance_id": "django__django-11039",
"model_name_or_path": "gpt-4",
"model_patch": "diff --git a/..."
}
]
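Before launching sandboxes, it can help to check the predictions file structure locally. The following is a minimal sketch, not part of the example scripts: validate_predictions and write_predictions are hypothetical helpers, and treating all three keys from the example above as required is an assumption.

```python
import json
from pathlib import Path

# Keys taken from the example predictions entry; treating all three as
# required is an assumption of this sketch.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def validate_predictions(predictions):
    """Return a list of problems; an empty list means the structure looks right."""
    if not isinstance(predictions, list):
        return ["top level must be a JSON array"]
    problems = []
    seen = set()
    for i, entry in enumerate(predictions):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"entry {i}: missing or empty {key!r}")
        instance_id = entry.get("instance_id")
        if isinstance(instance_id, str):
            if instance_id in seen:
                problems.append(f"entry {i}: duplicate instance_id {instance_id!r}")
            seen.add(instance_id)
    return problems


def write_predictions(predictions, path):
    """Validate, then write the predictions as a JSON array."""
    problems = validate_predictions(predictions)
    if problems:
        raise ValueError("; ".join(problems))
    Path(path).write_text(json.dumps(predictions, indent=2))
```

Calling write_predictions(entries, "predictions.json") produces a file you can pass to --predictions-path.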
Options
| Option | Default | Description |
|---|---|---|
| --predictions-path | Required | Path to predictions JSON, or gold for gold patches |
| --instance-ids | Required | Space-separated instance IDs |
| --run-id | Required | Identifier for this run |
| --max-workers | 10 | Max parallel sandboxes |
| --timeout | 1800 | Per-instance timeout in seconds (30 minutes) |
| --output-dir | logs/swebench | Where to write logs and reports |
| --dataset | princeton-nlp/SWE-bench_Lite | Hugging Face dataset name |
| --force | false | Re-run instances even if report.json exists |
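The --force description implies that instances with an existing report.json are skipped by default. One way to preview which instances a rerun would pick up is sketched below. The helper names are hypothetical, the path layout follows the {output-dir}/{run-id}/{model-name}/{instance-id}/ pattern this guide describes for results, and the actual skip logic in run_evaluation.py may differ.

```python
from pathlib import Path


def report_path(output_dir, run_id, model_name, instance_id):
    """Location of report.json, assuming the documented output layout."""
    return Path(output_dir) / run_id / model_name / instance_id / "report.json"


def instances_to_run(instance_ids, output_dir, run_id, model_name, force=False):
    """Return the instance IDs that still need a run.

    With force=True every instance is returned; otherwise instances that
    already have a report.json are left out.
    """
    if force:
        return list(instance_ids)
    return [
        iid
        for iid in instance_ids
        if not report_path(output_dir, run_id, model_name, iid).exists()
    ]
```

The model_name argument must match the directory name the script writes for your run, which you can read off the output directory after a first run.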
Adjust resources
The default is 2 CPUs and 4Gi of memory per sandbox. To change it, edit the SandboxDefaults in run_evaluation.py. For example, the following requests 4 CPUs and 8Gi of memory:
defaults = SandboxDefaults(
    tags=(f"swebench-{run_id}",),
    resources={"cpu": "4", "memory": "8Gi"},
)
How it works
This section explains the moving parts behind the script so you can adapt it to your own evaluation workflows.
Container images
Epoch AI hosts pre-built images on GHCR. Each instance has its own image with the repo checked out at the right commit, dependencies installed, and test environment ready.
Image format: ghcr.io/epoch-research/swe-bench.eval.x86_64.{instance_id}:latest
Sandboxes pull these directly from GHCR. No local builds are needed.
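As a small illustration of the naming scheme, the image reference for an instance can be derived from the format string above. This is a sketch: image_for is a hypothetical helper, and whether the script normalizes the instance ID before substitution is not stated here.

```python
# Format string copied from the image format described above.
IMAGE_TEMPLATE = "ghcr.io/epoch-research/swe-bench.eval.x86_64.{instance_id}:latest"


def image_for(instance_id: str) -> str:
    """Build the image reference for one SWE-bench instance (hypothetical helper)."""
    return IMAGE_TEMPLATE.format(instance_id=instance_id)


print(image_for("astropy__astropy-12907"))
```

Printing the reference for a failing instance gives you the exact name to look up in the registry.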
Evaluation flow
Each instance goes through these steps:
| Step | Where | What happens |
|---|---|---|
| 1. Load dataset | Local | Fetch instance metadata from Hugging Face |
| 2. Create sandbox | CoreWeave Sandbox | Start sandbox with the instance’s image |
| 3. Write patch | CoreWeave Sandbox | Write patch to /tmp/patch.diff |
| 4. Apply patch | CoreWeave Sandbox | Run git apply (falls back to patch if needed) |
| 5. Run tests | CoreWeave Sandbox | Execute /root/eval.sh |
| 6. Grade results | Local | Parse output with swebench.harness.grading |
| 7. Write report | Local | Save results to logs/swebench/ |
| 8. Cleanup | CoreWeave Sandbox | Stop sandbox |
Steps 2 to 5 and 8 run remotely. Steps 1, 6, and 7 run on your machine.
Parallel execution
The script uses ThreadPoolExecutor to run instance workflows concurrently. Each thread drives one instance through its workflow. While one instance runs tests, another can apply its patch, another grade locally, another start up. The overlap is where the speed comes from.
Results are collected with as_completed(), so each one is handled as soon as its workflow finishes rather than in submission order.
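The pattern can be sketched with the standard library alone. In this sketch, run_instance is a hypothetical stand-in for the per-instance workflow (create sandbox, apply patch, run tests, grade), and the error handling is one reasonable choice rather than a description of what run_evaluation.py does.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_instance(instance_id):
    """Hypothetical stand-in for one instance's full workflow."""
    return {"instance_id": instance_id, "resolved": True}


def run_all(instance_ids, max_workers=10, worker=run_instance):
    """Run one workflow per instance and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its instance ID so failures can be attributed.
        futures = {executor.submit(worker, iid): iid for iid in instance_ids}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:
                # A failure in one workflow should not abort the others.
                results[iid] = {"instance_id": iid, "resolved": False, "error": str(exc)}
    return results
```

Threads suit this workload because each workflow spends most of its time waiting on remote sandbox operations rather than using local CPU.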
Cleanup
The script uses a CoreWeave sandbox Session to track sandboxes. When the session exits (normal exit, exception, or Ctrl+C), the session cleans up all sandboxes.
with cwsandbox.Session(defaults=defaults) as session:
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # The session cleans up sandboxes created here when it exits
        ...
Output
Results go to {output-dir}/{run-id}/{model-name}/{instance-id}/:
| File | Contents |
|---|---|
| report.json | Resolved status, sandbox ID, duration |
| test_output.txt | Full test output |
| patch.diff | The applied patch |
Example report.json:
{
  "instance_id": {
    "resolved": true,
    "tests_status": {
      "PASSED": ["test_foo", "test_bar"],
      "FAILED": []
    }
  },
  "sandbox_id": "sb-abc123",
  "duration_seconds": 45.2
}
The format is compatible with standard SWE-bench tooling.
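To get a quick pass/fail summary for a run, you can walk the output directory and read each report.json. The sketch below assumes the report shape shown above, where the per-instance entry is the top-level value that contains a resolved field. summarize_run is a hypothetical helper, not part of the example scripts.

```python
import json
from pathlib import Path


def summarize_run(run_dir):
    """Collect resolved and unresolved instance IDs from every report.json under run_dir."""
    resolved, unresolved = [], []
    for path in sorted(Path(run_dir).rglob("report.json")):
        report = json.loads(path.read_text())
        for key, value in report.items():
            # Skip top-level metadata such as sandbox_id and duration_seconds.
            if isinstance(value, dict) and "resolved" in value:
                (resolved if value["resolved"] else unresolved).append(key)
    return {"resolved": resolved, "unresolved": unresolved}
```

For example, summarize_run("logs/swebench/eval-run-1") would cover every model and instance directory written for that run ID.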
Troubleshooting
Patch application fails
If you see APPLY_PATCH_FAIL in logs, the patch is probably malformed, targets the wrong commit, or has whitespace issues. Run git apply --check locally to see what’s wrong. Make sure the instance ID matches the prediction.
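The local check can be scripted if you have many patches to triage. This sketch assumes git is installed and that repo_dir is a local clone checked out at the instance's base commit (the base_commit field in the SWE-bench dataset). build_check_command and check_patch are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def build_check_command(patch_path):
    """Command that tests whether a patch applies, without modifying the tree."""
    # Resolve to an absolute path because git runs from the repository directory.
    return ["git", "apply", "--check", "--verbose", str(Path(patch_path).resolve())]


def check_patch(repo_dir, patch_path):
    """Return (ok, message); message carries git's explanation on failure."""
    proc = subprocess.run(
        build_check_command(patch_path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stderr.strip()
```

A failure message that names a specific file and line usually points to a wrong base commit, while errors about a corrupt patch suggest truncated or malformed model output.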
Test timeouts
Some test suites take longer than 30 minutes. Model-generated code might also have infinite loops. If needed, increase --timeout, or check the test output to see where it’s stuck.
Image pull errors
If the container fails to start with an image pull error, either the instance ID doesn’t exist in Epoch AI’s registry, or a network issue prevents reaching ghcr.io. Verify the instance ID is in the SWE-bench dataset.
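A quick way to rule out a mistyped instance ID is to compare your list against the dataset. In this sketch, missing_instance_ids and load_known_ids are hypothetical helpers. Reading the test split of the default dataset is an assumption about what the script evaluates, and load_known_ids downloads the dataset from Hugging Face on first use.

```python
def missing_instance_ids(requested, known):
    """Return the requested IDs that do not appear in the known IDs, in order."""
    known_set = set(known)
    return [iid for iid in requested if iid not in known_set]


def load_known_ids(dataset_name="princeton-nlp/SWE-bench_Lite", split="test"):
    """Fetch all instance IDs for a dataset split (requires network access)."""
    # Imported lazily so the pure helper above works without the dependency.
    from datasets import load_dataset

    return load_dataset(dataset_name, split=split)["instance_id"]


# Example usage (downloads the dataset):
#   unknown = missing_instance_ids(["django__django-11039"], load_known_ids())
#   print(unknown)
```

If the ID is present in the dataset but the pull still fails, the cause is more likely network access to ghcr.io or a missing image in the registry.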
Last modified on May 29, 2026