Run SWE-bench in SUNK with Docker
SWE-bench is a benchmark for evaluating large language models on software issues collected from GitHub. SWE-bench uses Docker to create reproducible artifacts that can be ported to different platforms.
This guide explains how to run SWE-bench on SUNK with the following steps:
- Enable support for Docker in SUNK
- Select a node to run the benchmark on
- Install SWE-bench in a Python environment on the selected node
Tested versions
This guide was tested and verified on the following configurations:
- SUNK (cgroup/v1 and cgroup/v2)
  - v6.9.1
  - v7.1.0
- NVIDIA L40 and H100 GPUs
Prerequisites
To run SWE-bench on SUNK, you first need to enable Docker support. For instructions, see our guide on using Docker in SUNK.
Using Docker in SUNK requires enabling privileged pods and disabling the recommended AppArmor profile. This process grants elevated kernel capabilities and weakens isolation guarantees. See the known security risks section for more details.
It is your responsibility to verify that third-party code is safe to execute alongside your other workloads.
To run SWE-bench in SUNK, select a node to run the benchmark on, set up a Python environment on that node, and then install SWE-bench in the Python environment.
Acquire GPU resources
First, identify a node or partition on which to run the benchmark. In the following examples, we use an H100 node in the h100 partition.
Choose one of the following methods:
Option 1: Exec into an existing GPU pod
Run the following command to return a list of pods within your namespace:
$ kubectl get pods -n <namespace>
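The output lists the pods in your namespace. It looks similar to the following (pod names, statuses, and ages here are illustrative):

Example
NAME                   READY   STATUS    RESTARTS   AGE
h100-123-123           1/1     Running   0          2d
tenant-slurm-login-0   1/1     Running   0          30d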
In this example, the target pod is named h100-123-123. Use kubectl exec with the flags shown below to open an interactive terminal session inside the specified pod:
$ kubectl exec -it -n <namespace> h100-123-123 -- bash
Option 2: Start an interactive job within a Slurm login pod
In this example, our Slurm login pod is tenant-slurm-login-0:
$ kubectl exec -it -n <namespace> tenant-slurm-login-0 -- bash
Use srun to start an interactive session on your chosen partition. In this example, the partition is h100:
$ srun --nodes=1 --gres=gpu:1 --partition=h100 --pty bash
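Once the interactive session starts, you can optionally confirm that a GPU was allocated, for example with nvidia-smi (the reported model is illustrative):

Example
$ nvidia-smi --query-gpu=name --format=csv,noheader
NVIDIA H100 80GB HBM3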
Clone SWE-bench and set up Python
Next, clone SWE-bench and set up the Python environment. The following examples use uv to create a Python virtual environment. For venv and pip versions of this process, consult the Python documentation.
- Install uv with curl:

  Example
  $ curl -LsSf https://astral.sh/uv/install.sh | sh
  $ source $HOME/.local/bin/env

  Follow the instructions in the installer output about sourcing to add uv to your PATH.

- Clone the SWE-bench repository:

  Example
  $ git clone https://github.com/SWE-bench/SWE-bench.git
  $ cd SWE-bench

- Create a Python virtual environment:

  Example
  $ uv venv

- Install the current directory, SWE-bench, as a Python package in the virtual environment:

  Example
  $ uv pip install .

- Execute the benchmark inside the pod:

  Example
  $ uv run python -m swebench.harness.run_evaluation \
      --predictions_path gold \
      --max_workers 1 \
      --instance_ids sympy__sympy-20590 \
      --namespace '' \
      --run_id validate-gold

  The expected output is as follows:

  Example
  Built swebench @ file:///opt/nccl-tests/SWE-bench
  Uninstalled 2 packages in 4ms
  Installed 2 packages in 4ms
  <frozen runpy>:128: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
  Using gold predictions
  README.md: 3.67kB [00:00, 25.5MB/s]
  data/dev-00000-of-00001.parquet: 100%|████████| 107k/107k [00:00<00:00, 488kB/s]
  data/test-00000-of-00001.parquet: 100%|████████| 1.11M/1.11M [00:00<00:00, 13.0MB/s]
  Generating dev split: 100%|████████| 23/23 [00:00<00:00, 5326.84 examples/s]
  Generating test split: 100%|████████| 300/300 [00:00<00:00, 25575.54 examples/s]
  Building base image (sweb.base.py.x86_64:latest)
  Base images built successfully.
  Total environment images to build: 1
  All environment images built successfully.
  Running 1 instances...
  Evaluation: 100%|████████| 1/1 [00:56<00:00, 56.05s/it, ✓=1, ✖=0, error=0]
  All instances run.
  Evaluation: 100%|████████| 1/1 [00:56<00:00, 56.05s/it, ✓=1, ✖=0, error=0]
  Cleaning cached images...
  Removed 0 images.
  Total instances: 1
  Instances submitted: 1
  Instances completed: 1
  Instances incomplete: 0
  Instances resolved: 1
  Instances unresolved: 0
  Instances with empty patches: 0
  Instances with errors: 0
  Unstopped containers: 0
  Unremoved images: 0
  Report written to gold.validate-gold.json

  A successful run creates a report file named gold.validate-gold.json in the working directory.
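The report is plain JSON, so you can pretty-print it to review the run summary (it records counts such as the total, resolved, and unresolved instances printed at the end of the evaluation; the exact schema may vary across SWE-bench versions):

Example
$ python -m json.tool gold.validate-gold.json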
Known limitations
H200 GPU image build error
SWE-bench fails to build its base Docker image on an H200 GPU node. The benchmark terminates with the following error:
swebench.harness.docker_build.BuildImageError: Error building image sweb.base.py.x86_64:latest: The command '/bin/sh -c apt update && apt install -y wget git build-essential libffi-dev libtiff-dev python3 python3-pip python-is-python3 jq curl locales locales-all tzdata && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 255
Check (logs/build_images/base/sweb.base.py.x86_64__latest/build_image.log) for more information.
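The error message points to the Docker build log. To investigate the failure, inspect that log from the directory where you ran the benchmark:

Example
$ cat logs/build_images/base/sweb.base.py.x86_64__latest/build_image.log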
Non-InfiniBand node behavior
Enabling privileged pods on non-InfiniBand nodes may cause NCCL to fail to select eth0 correctly. We recommend forcing NCCL to use eth0 by setting the following environment variables:
export NCCL_SOCKET_IFNAME=eth0
export UCX_NET_DEVICES=eth0
export NCCL_COLLNET_ENABLE=0
export NCCL_IB_HCA=eth0
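Set these variables in the shell where you launch the benchmark so child processes inherit them. To make them persist across interactive sessions, one option is to append them to your shell profile (a minimal sketch, assuming a bash shell):

Example
$ cat >> ~/.bashrc <<'EOF'
export NCCL_SOCKET_IFNAME=eth0
export UCX_NET_DEVICES=eth0
export NCCL_COLLNET_ENABLE=0
export NCCL_IB_HCA=eth0
EOF
$ source ~/.bashrc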