Run SWE-bench evaluations in parallel using sandboxes on CoreWeave. This guide is for ML practitioners and researchers who want to evaluate language model code-fixing performance at scale. By the end, you will have a working setup that runs SWE-bench evaluations concurrently across many CoreWeave sandboxes and writes results locally for analysis.

What this does

SWE-bench tests whether language models can fix real bugs in real repositories. For each task, the harness applies a candidate patch, runs the repository's test suite, and checks whether the fix works. You can run SWE-bench locally with Docker, but then your machine’s CPU, memory, and disk cap how many instances run at once. CoreWeave sandboxes move the evaluations onto CoreWeave infrastructure, so you can spin up dozens or hundreds of sandboxes concurrently without managing that infrastructure yourself. The script pulls pre-built images from Epoch AI’s registry on GHCR, so no local Docker build step is needed.

Setup

Complete the following steps to install the required tools and configure authentication before running any evaluations.
  1. Install the Python SDK and evaluation dependencies:
    uv pip install cwsandbox swebench datasets
    
  2. Clone the repo for the example evaluation scripts:
    git clone https://github.com/coreweave/cwsandbox-client.git
    cd cwsandbox-client
    
  3. Set your CWSANDBOX_API_KEY to a CoreWeave API access token for authentication. Replace [ACCESS-TOKEN] with your CoreWeave API access token:
    export CWSANDBOX_API_KEY="[ACCESS-TOKEN]"
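
Before launching anything, a short preflight check can confirm that the variable is visible to Python. This is a minimal sketch: `require_api_key` is a hypothetical helper written for this guide, not part of the cwsandbox SDK.

```python
import os


def require_api_key(env=None, name="CWSANDBOX_API_KEY"):
    """Return the API key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the unreplaced placeholder from the setup step as missing too.
    if not key or key == "[ACCESS-TOKEN]":
        raise RuntimeError(f"{name} is not set; export it before running evaluations")
    return key
```

Calling `require_api_key()` at the top of your own driver script fails fast instead of partway through a run.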
    

Quick start

Before you scale up, start with a single-instance run to confirm your environment, credentials, and network access to GHCR all work. The gold option uses the known-correct fix from the dataset:
uv run python examples/swebench/run_evaluation.py \
    --predictions-path gold \
    --instance-ids astropy__astropy-12907 \
    --run-id test
This run should pass. A passing result confirms that your sandbox can pull the image, apply the patch, and run the test suite end to end.

Run in parallel

Run multiple instances at once:
uv run python examples/swebench/run_evaluation.py \
    --predictions-path gold \
    --instance-ids \
        astropy__astropy-12907 \
        django__django-11039 \
        django__django-11099 \
        django__django-11283 \
        matplotlib__matplotlib-23476 \
        scikit-learn__scikit-learn-13142 \
        sympy__sympy-13031 \
        sympy__sympy-13647 \
    --run-id parallel-test \
    --max-workers 8
This command spins up eight sandboxes, one per instance, and runs them concurrently. All eight should pass because gold patches are the known-correct fixes.

Evaluate model predictions

To test custom model output:
uv run python examples/swebench/run_evaluation.py \
    --predictions-path predictions.json \
    --instance-ids django__django-11039 scikit-learn__scikit-learn-13142 \
    --run-id eval-run-1 \
    --max-workers 10
The predictions file is a JSON list with one object per instance. Each object pairs an instance ID with the model's patch:
[
  {
    "instance_id": "django__django-11039",
    "model_name_or_path": "gpt-4",
    "model_patch": "diff --git a/..."
  }
]
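
If your model output lives in memory as a mapping from instance ID to diff text, a file in this shape could be produced as follows. This is a sketch: `write_predictions` and the shape of its `patches` argument are assumptions for illustration, while the three JSON field names come from the example above.

```python
import json
from pathlib import Path


def write_predictions(patches, model_name, path):
    """Write a predictions file with one entry per instance ID.

    `patches` maps instance ID -> unified diff text (hypothetical input shape).
    """
    entries = [
        {
            "instance_id": instance_id,
            "model_name_or_path": model_name,
            "model_patch": patch,
        }
        for instance_id, patch in sorted(patches.items())
    ]
    Path(path).write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries
```

For example, `write_predictions({"django__django-11039": diff_text}, "gpt-4", "predictions.json")` writes a single-entry file that can be passed to `--predictions-path`.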

Options

| Option | Default | Description |
| --- | --- | --- |
| --predictions-path | Required | Path to predictions JSON, or gold for gold patches |
| --instance-ids | Required | Space-separated instance IDs |
| --run-id | Required | Identifier for this run |
| --max-workers | 10 | Max parallel sandboxes |
| --timeout | 1800 | Per-instance timeout in seconds (30 minutes) |
| --output-dir | logs/swebench | Where to write logs and reports |
| --dataset | princeton-nlp/SWE-bench_Lite | Hugging Face dataset name |
| --force | false | Re-run instances even if report.json exists |
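
To make the types and defaults concrete, here is how the same options could be declared with `argparse`. This is an illustrative sketch only: whether run_evaluation.py uses `argparse`, and how it declares each flag, is an assumption.

```python
import argparse


def build_parser():
    """Declare the options from the table above (sketch, not the script's code)."""
    parser = argparse.ArgumentParser(description="Run SWE-bench in sandboxes (sketch)")
    parser.add_argument("--predictions-path", required=True,
                        help="Path to predictions JSON, or 'gold' for gold patches")
    parser.add_argument("--instance-ids", nargs="+", required=True,
                        help="Space-separated instance IDs")
    parser.add_argument("--run-id", required=True, help="Identifier for this run")
    parser.add_argument("--max-workers", type=int, default=10)
    parser.add_argument("--timeout", type=int, default=1800,
                        help="Per-instance timeout in seconds")
    parser.add_argument("--output-dir", default="logs/swebench")
    parser.add_argument("--dataset", default="princeton-nlp/SWE-bench_Lite")
    parser.add_argument("--force", action="store_true",
                        help="Re-run instances even if report.json exists")
    return parser
```

With `nargs="+"`, every value up to the next flag is collected into one list, which matches how the parallel example passes eight IDs after a single `--instance-ids`.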

Adjust resources

By default, each sandbox gets 2 CPUs and 4Gi of memory. To change this, edit the SandboxDefaults in run_evaluation.py. For example, to double both:
defaults = SandboxDefaults(
    tags=(f"swebench-{run_id}",),
    resources={"cpu": "4", "memory": "8Gi"},
)

How it works

This section explains the moving parts behind the script so you can adapt it to your own evaluation workflows.

Container images

Epoch AI hosts pre-built images on GHCR. Each instance has its own image with the repo checked out at the right commit, dependencies installed, and test environment ready. Image format:
ghcr.io/epoch-research/swe-bench.eval.x86_64.{instance_id}:latest
Sandboxes pull these directly from GHCR. No local builds are needed.
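
The image reference for an instance follows directly from the format above. A minimal sketch, where `image_for` is a hypothetical helper name and the input check is an added precaution:

```python
IMAGE_TEMPLATE = "ghcr.io/epoch-research/swe-bench.eval.x86_64.{instance_id}:latest"


def image_for(instance_id):
    """Return the image reference for one SWE-bench instance."""
    # Reject empty IDs and IDs containing whitespace before building the name.
    if not instance_id or any(ch.isspace() for ch in instance_id):
        raise ValueError(f"invalid instance ID: {instance_id!r}")
    return IMAGE_TEMPLATE.format(instance_id=instance_id)
```

Printing `image_for(...)` for an instance that fails to start makes it easy to try the same reference by hand with `docker pull`.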

Evaluation flow

Each instance goes through these steps:
| Step | Where | What happens |
| --- | --- | --- |
| 1. Load dataset | Local | Fetch instance metadata from Hugging Face |
| 2. Create sandbox | CoreWeave Sandbox | Start sandbox with the instance’s image |
| 3. Write patch | CoreWeave Sandbox | Write patch to /tmp/patch.diff |
| 4. Apply patch | CoreWeave Sandbox | Run git apply (falls back to patch if needed) |
| 5. Run tests | CoreWeave Sandbox | Execute /root/eval.sh |
| 6. Grade results | Local | Parse output with swebench.harness.grading |
| 7. Write report | Local | Save results to logs/swebench/ |
| 8. Cleanup | CoreWeave Sandbox | Stop sandbox |
Steps 2 to 5 and 8 run remotely. Steps 1, 6, and 7 run on your machine.
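
The fallback in step 4 amounts to trying a list of commands until one succeeds. The sketch below shows that pattern. The `run` callable is hypothetical (it stands in for however the script executes a command inside the sandbox and reads its exit code), and the exact command flags are assumptions, not taken from run_evaluation.py.

```python
def apply_patch(run, patch_path="/tmp/patch.diff"):
    """Try each apply command in order; return the one that succeeded, or None.

    `run` is a hypothetical callable: it takes a shell command string,
    executes it in the sandbox, and returns the exit code.
    """
    commands = [
        f"git apply --verbose {patch_path}",
        # Fallback: GNU patch tolerates small context mismatches via --fuzz.
        f"patch --batch --fuzz=5 -p1 -i {patch_path}",
    ]
    for command in commands:
        if run(command) == 0:
            return command
    return None
```

A `None` result corresponds to the failure case covered under Troubleshooting: no command could apply the patch, so there is nothing to test.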

Parallel execution

The script uses ThreadPoolExecutor to run instance workflows concurrently. Each thread drives one instance through its workflow. While one instance runs tests, another can apply its patch, another grade locally, another start up. The overlap is where the speed comes from. Results come back as workflows finish through as_completed().
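
The pattern described above can be sketched with the standard library alone. Here `evaluate_instance` is a stand-in for the real per-instance workflow, and `run_all` is a hypothetical name. Only `ThreadPoolExecutor` and `as_completed` are taken from the description.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate_instance(instance_id):
    """Stand-in for the real workflow (create sandbox, patch, test, grade)."""
    return {"instance_id": instance_id, "resolved": True}


def run_all(instance_ids, max_workers=10, worker=evaluate_instance):
    """Run one workflow per instance and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its instance ID so failures can be attributed.
        futures = {executor.submit(worker, iid): iid for iid in instance_ids}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                results[iid] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                results[iid] = {"instance_id": iid, "resolved": False, "error": str(exc)}
    return results
```

Threads suit this workload because each workflow spends most of its time waiting on remote sandbox operations, not computing locally.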

Cleanup

The script uses a CoreWeave sandbox Session to track sandboxes. When the session exits (normal exit, exception, or Ctrl+C), the session cleans up all sandboxes.
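from concurrent.futures import ThreadPoolExecutor
import cwsandbox

# defaults and max_workers are defined earlier in run_evaluation.py
with cwsandbox.Session(defaults=defaults) as session:
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        ...  # The session cleans up sandboxes created here when it exits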
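
The cleanup guarantee rests on Python's context-manager protocol: `__exit__` runs whether the block ends normally or by an exception, including `KeyboardInterrupt`. The toy class below demonstrates only that mechanism. `ToySession` is a hypothetical stand-in and does not reflect the cwsandbox API.

```python
class ToySession:
    """Stand-in that mimics only the cleanup guarantee, not the cwsandbox API."""

    def __init__(self):
        self.active = []
        self.stopped = []

    def create(self, name):
        """Register a fake sandbox so the session can stop it later."""
        self.active.append(name)
        return name

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit, on exceptions, and on KeyboardInterrupt (Ctrl+C).
        while self.active:
            self.stopped.append(self.active.pop())
        return False  # Do not swallow the exception.
```

Because `__exit__` returns `False`, the original exception still propagates after cleanup, so a failed run is not silently reported as successful.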

Output

Results go to {output-dir}/{run-id}/{model-name}/{instance-id}/:
| File | Contents |
| --- | --- |
| report.json | Resolved status, sandbox ID, duration |
| test_output.txt | Full test output |
| patch.diff | The applied patch |
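
The directory layout also explains what `--force` controls: an instance with an existing report.json can be skipped. A sketch of that check, assuming the path template above maps directly onto path components (both function names are hypothetical, and the skip rule is inferred from the `--force` description):

```python
from pathlib import Path


def instance_dir(output_dir, run_id, model_name, instance_id):
    """Directory holding report.json, test_output.txt, and patch.diff."""
    return Path(output_dir) / run_id / model_name / instance_id


def should_run(output_dir, run_id, model_name, instance_id, force=False):
    """Skip instances that already have a report unless force is set."""
    report = instance_dir(output_dir, run_id, model_name, instance_id) / "report.json"
    return force or not report.exists()
```

Reusing the same `--run-id` after an interrupted run would then only re-evaluate the instances that never wrote a report.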

Report format

{
  "instance_id": {
    "resolved": true,
    "tests_status": {
      "PASSED": ["test_foo", "test_bar"],
      "FAILED": []
    }
  },
  "sandbox_id": "sb-abc123",
  "duration_seconds": 45.2
}
The format is compatible with standard SWE-bench tooling.
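
For a quick summary across many reports, the parsed JSON can be tallied in a few lines. This sketch assumes each report looks like the example above: one nested object carrying `resolved`, plus top-level `duration_seconds`. `summarize` is a hypothetical helper.

```python
def summarize(reports):
    """Count resolved instances and total duration across parsed report dicts."""
    resolved, total, seconds = 0, 0, 0.0
    for report in reports:
        seconds += float(report.get("duration_seconds", 0.0))
        for value in report.values():
            # The per-instance entry is the nested dict carrying "resolved".
            if isinstance(value, dict) and "resolved" in value:
                total += 1
                resolved += bool(value["resolved"])
    return {"resolved": resolved, "total": total, "duration_seconds": seconds}
```

Feed it the result of `json.loads` on each report.json found under the run directory.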

Troubleshooting

Patch application fails

If you see APPLY_PATCH_FAIL in logs, the patch is probably malformed, targets the wrong commit, or has whitespace issues. Run git apply --check against a local checkout of the repository to see what’s wrong. Also make sure the instance ID in the prediction matches the instance you are evaluating.

Test timeouts

Some test suites take longer than 30 minutes. Model-generated code might also have infinite loops. If needed, increase --timeout, or check the test output to see where it’s stuck.

Image pull errors

If the container fails to start with an image pull error, either the instance ID doesn’t exist in Epoch AI’s registry, or a network issue prevents reaching ghcr.io. Verify the instance ID is in the SWE-bench dataset.
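
Checking IDs against the dataset before launching sandboxes catches typos early. In the sketch below, `unknown_instance_ids` is a hypothetical helper that works on any iterable of rows with an `instance_id` field. The commented usage loads the real dataset and assumes the `test` split is the one being evaluated.

```python
def unknown_instance_ids(rows, instance_ids):
    """Return requested IDs that do not appear in the dataset rows, in order."""
    known = {row["instance_id"] for row in rows}
    return [iid for iid in instance_ids if iid not in known]


# Usage against the real dataset (needs network access and the datasets package):
#   from datasets import load_dataset
#   rows = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
#   print(unknown_instance_ids(rows, ["django__django-11039", "typo__id-1"]))
```

An empty list means every requested ID exists in the dataset, which leaves a registry or network problem as the likelier cause.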

Last modified on May 29, 2026