

Run SWE-bench evaluations in parallel using sandboxes on CoreWeave.

What this does

SWE-bench tests whether language models can fix real bugs in real repositories. Each task: apply a patch, run the test suite, see if the fix works. You can run SWE-bench locally with Docker, but you’re limited by your machine’s resources. CoreWeave sandboxes let you run evaluations at scale on CoreWeave infrastructure. Spin up dozens or hundreds of sandboxes concurrently without managing any of it yourself. The script pulls pre-built images from Epoch AI’s registry on GHCR, so there’s no local Docker build step.

Setup

Install the Python SDK and evaluation dependencies:
uv pip install cwsandbox swebench datasets
Clone the repo for the example evaluation scripts:
git clone https://github.com/coreweave/cwsandbox-client.git
cd cwsandbox-client
For authentication, set the CWSANDBOX_API_KEY environment variable to a CoreWeave API Access Token:
export CWSANDBOX_API_KEY="your-access-token"

Quick start

Test your setup with a single instance. The gold option uses the known-correct fix from the dataset:
uv run python examples/swebench/run_evaluation.py \
    --predictions-path gold \
    --instance-ids astropy__astropy-12907 \
    --run-id test
This should pass.

Running in parallel

Run multiple instances at once:
uv run python examples/swebench/run_evaluation.py \
    --predictions-path gold \
    --instance-ids \
        astropy__astropy-12907 \
        django__django-11039 \
        django__django-11099 \
        django__django-11283 \
        matplotlib__matplotlib-23476 \
        scikit-learn__scikit-learn-13142 \
        sympy__sympy-13031 \
        sympy__sympy-13647 \
    --run-id parallel-test \
    --max-workers 8
This spins up 8 sandboxes and runs them concurrently. All should pass since gold patches are the correct fixes.

Evaluating model predictions

To test custom model output:
uv run python examples/swebench/run_evaluation.py \
    --predictions-path predictions.json \
    --instance-ids django__django-11039 scikit-learn__scikit-learn-13142 \
    --run-id eval-run-1 \
    --max-workers 10
The predictions file maps instance IDs to patches:
[
  {
    "instance_id": "django__django-11039",
    "model_name_or_path": "gpt-4",
    "model_patch": "diff --git a/..."
  }
]
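
If you generate predictions programmatically, a minimal sketch like the following writes a file in that shape (the model name and patch text here are placeholders):
import json

# Placeholder prediction: swap in your model's name and the unified diff it produced.
predictions = [
    {
        "instance_id": "django__django-11039",
        "model_name_or_path": "my-model",
        "model_patch": "diff --git a/...",
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)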

Options

| Option | Default | Description |
| --- | --- | --- |
| --predictions-path | Required | Path to predictions JSON, or gold for gold patches |
| --instance-ids | Required | Space-separated instance IDs |
| --run-id | Required | Identifier for this run |
| --max-workers | 10 | Max parallel sandboxes |
| --timeout | 1800 | Per-instance timeout in seconds (30 min) |
| --output-dir | logs/swebench | Where to write logs and reports |
| --dataset | princeton-nlp/SWE-bench_Lite | HuggingFace dataset name |
| --force | false | Re-run instances even if report.json exists |

Adjusting resources

The default is 2 CPUs and 4Gi of memory per sandbox. To change it, edit the SandboxDefaults in run_evaluation.py:
defaults = SandboxDefaults(
    tags=(f"swebench-{run_id}",),
    resources={"cpu": "4", "memory": "8Gi"},
)

How it works

Container images

Epoch AI hosts pre-built images on GHCR. Each instance has its own image with the repo checked out at the right commit, dependencies installed, and the test environment ready. The image format is:
ghcr.io/epoch-research/swe-bench.eval.x86_64.{instance_id}:latest
Sandboxes pull these directly from GHCR. No local builds needed.
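
As a quick illustration, the image reference for a given instance can be built from its ID (a sketch, using an instance from the examples above):
# Build the GHCR image reference for a SWE-bench instance.
instance_id = "astropy__astropy-12907"
image = f"ghcr.io/epoch-research/swe-bench.eval.x86_64.{instance_id}:latest"
print(image)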

Evaluation flow

Each instance goes through these steps:
| Step | Where | What happens |
| --- | --- | --- |
| 1. Load dataset | Local | Fetch instance metadata from HuggingFace |
| 2. Create sandbox | CoreWeave Sandbox | Start a sandbox with the instance’s image |
| 3. Write patch | CoreWeave Sandbox | Write the patch to /tmp/patch.diff |
| 4. Apply patch | CoreWeave Sandbox | Run git apply (falls back to patch if needed) |
| 5. Run tests | CoreWeave Sandbox | Execute /root/eval.sh |
| 6. Grade results | Local | Parse output with swebench.harness.grading |
| 7. Write report | Local | Save results to logs/swebench/ |
| 8. Cleanup | CoreWeave Sandbox | Stop the sandbox |
Steps 2-5 and 8 run remotely. Steps 1, 6, and 7 run on your machine.

Parallel execution

The script uses ThreadPoolExecutor to run instance workflows concurrently. Each thread drives one instance through its workflow. While one instance runs tests, another can be applying its patch, another grading locally, another starting up. The overlap is where the speed comes from. Results come back as workflows finish via as_completed().
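
A minimal sketch of that pattern, assuming a run_instance() helper that drives one instance through the steps above (the helper here is a stand-in, not the script's actual function):
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_instance(instance):
    # Placeholder for the per-instance workflow: create a sandbox, write and
    # apply the patch, run the tests, grade the output, write report.json.
    return {"instance_id": instance["instance_id"], "resolved": True}

instances = [{"instance_id": "astropy__astropy-12907"},
             {"instance_id": "django__django-11039"}]

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(run_instance, inst): inst for inst in instances}
    for future in as_completed(futures):
        result = future.result()  # re-raises any exception from the worker thread
        print(result["instance_id"], "resolved" if result["resolved"] else "failed")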

Cleanup

The script uses a CoreWeave sandbox Session to track sandboxes. When the session exits (normal exit, exception, or Ctrl+C), all sandboxes get cleaned up.
with cwsandbox.Session(defaults=defaults) as session:
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Sandboxes created here get cleaned up when the session exits

Output

Results go to {output-dir}/{run-id}/{model-name}/{instance-id}/:
| File | Contents |
| --- | --- |
| report.json | Resolved status, sandbox ID, duration |
| test_output.txt | Full test output |
| patch.diff | The patch that was applied |

Report format

{
  "instance_id": {
    "resolved": true,
    "tests_status": {
      "PASSED": ["test_foo", "test_bar"],
      "FAILED": []
    }
  },
  "sandbox_id": "sb-abc123",
  "duration_seconds": 45.2
}
The format is compatible with standard SWE-bench tooling.
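
Because each instance gets its own report.json, you can summarize a run by walking the output directory. A small sketch, assuming the default layout and the report shape shown above:
import json
from pathlib import Path

# Default layout: logs/swebench/{run-id}/{model-name}/{instance-id}/report.json
run_dir = Path("logs/swebench/parallel-test")

resolved, total = 0, 0
for report_path in run_dir.glob("*/*/report.json"):
    with report_path.open() as f:
        report = json.load(f)
    instance_id = report_path.parent.name
    total += 1
    if report.get(instance_id, {}).get("resolved"):
        resolved += 1

print(f"{resolved}/{total} instances resolved")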

Troubleshooting

Patch application fails

If you see APPLY_PATCH_FAIL in the logs, the patch is probably malformed, targets the wrong commit, or has whitespace issues. Try git apply --check locally to see what’s wrong, and make sure the instance ID matches the prediction.
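
For example, you can dry-run a patch from predictions.json against a local checkout (the checkout path and instance ID here are placeholders):
import json
import subprocess

# Pull the patch for one instance out of predictions.json.
with open("predictions.json") as f:
    predictions = json.load(f)
patch = next(p["model_patch"] for p in predictions
             if p["instance_id"] == "django__django-11039")

# Dry-run the patch against a local clone checked out at the instance's base commit.
result = subprocess.run(
    ["git", "apply", "--check", "-"],
    input=patch, text=True, capture_output=True,
    cwd="/path/to/local/django",   # placeholder checkout path
)
print(result.returncode, result.stderr)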

Tests timeout

Some test suites genuinely take longer than 30 minutes. Model-generated code might also have infinite loops. Increase --timeout if needed, or check the test output to see where it’s hanging.

Image pull errors

If the container fails to start with an image pull error, either the instance ID doesn’t exist in Epoch AI’s registry or there’s a network issue reaching ghcr.io. Verify the instance ID is in the SWE-bench dataset.
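
To confirm an instance ID exists, one option is a quick check against the default --dataset:
from datasets import load_dataset

# Load the SWE-bench Lite test split and look up the instance ID.
ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
ids = set(ds["instance_id"])
print("astropy__astropy-12907" in ids)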
