
Run SWE-bench in SUNK with Docker

SWE-bench is a benchmark for evaluating large language models on software issues collected from GitHub. SWE-bench uses Docker to create reproducible artifacts that can be ported to different platforms.

This guide explains how to run SWE-bench on SUNK with the following steps:

  1. Enable support for Docker in SUNK
  2. Select a node to run the benchmark on
  3. Install SWE-bench in a Python environment on the selected node

Tested versions

This guide was tested and verified on the following configurations:

  • SUNK cgroup/v1 and cgroup/v2
    • v6.9.1
    • v7.1.0
  • NVIDIA L40 and H100 GPUs

Prerequisites

To run SWE-bench on SUNK, you first need to enable Docker support. For instructions, see our guide on using Docker in SUNK.

Warning

Using Docker in SUNK requires enabling privileged pods and disabling the recommended AppArmor profile. This process grants elevated kernel capabilities and weakens isolation guarantees. See the known security risks section for more details.

It is your responsibility to verify that third-party code is safe to execute alongside your other workloads.

To run SWE-bench in SUNK, select a node to run the benchmark on, set up a Python environment on that node, and then install SWE-bench in the Python environment.

Acquire GPU resources

First, identify a node or partition on which to run the benchmark. In the following examples, we use an H100 node in the h100 partition.
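If you are unsure which partitions exist in your cluster, sinfo (run from a Slurm login pod) summarizes them. The partition name h100 below is taken from this guide's examples and may differ on your cluster:

```shell
# Summarize all partitions and their node states.
sinfo --summarize

# Inspect a specific partition by name.
sinfo --partition=h100
```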

Choose one of the following methods:

Option 1: exec into an existing GPU pod

Run the following command to list the pods in your namespace:

Example
$
kubectl get pods -n <namespace>

In this example, the target pod is named h100-123-123. Use kubectl exec with the flags shown below to open an interactive terminal session inside the specified pod:

Example
$
kubectl exec -it -n <namespace> h100-123-123 -- bash
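Before proceeding, it can help to confirm the pod has the tooling the rest of this guide relies on. The following is a minimal sketch; the exact command list is an assumption about what the benchmark setup needs:

```shell
# Check that the binaries used in the following steps are on PATH.
node_ready() {
  for cmd in docker nvidia-smi git curl python3; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd: found"
    else
      echo "$cmd: MISSING"
    fi
  done
}

# Usage inside the pod:
#   node_ready
```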

Option 2: Start an interactive job within a Slurm login pod

In this example, our Slurm login pod is tenant-slurm-login-0:

Example
$
kubectl exec -it -n <namespace> tenant-slurm-login-0 -- bash

Use srun to start an interactive session on your chosen partition. In this example, the partition is h100:

Example
$
srun --nodes=1 --gres=gpu:1 --partition=h100 --pty bash
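Once the interactive session starts, a quick check confirms where the job landed and which GPUs it can see (the nvidia-smi step requires the NVIDIA driver on the node):

```shell
# Print the Slurm job ID and the hostname of the allocated node.
echo "Job: ${SLURM_JOB_ID:-<none>} on $(hostname)"

# List the GPUs visible to this job, if the driver is available.
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L || true
```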

Clone SWE-bench and set up Python

Next, clone SWE-bench and set up the Python environment. The following examples use uv to create a Python virtual environment. For venv and pip versions of this process, consult the Python documentation.

  1. Install uv with curl:

    Example
    $
    curl -LsSf https://astral.sh/uv/install.sh | sh
    source $HOME/.local/bin/env

    The installer's output explains how to add uv to your PATH; the source command shown above applies it to the current shell.

  2. Clone the SWE-bench repository:

    Example
    $
    git clone https://github.com/SWE-bench/SWE-bench.git
    $
    cd SWE-bench
  3. Create a Python virtual environment:

    Example
    $
    uv venv
  4. Install the current directory, SWE-bench, as a Python package in the virtual environment:

    Example
    $
    uv pip install .
  5. Execute the benchmark inside the pod:

    Example
    $
    uv run python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --namespace '' \
    --run_id validate-gold

    The expected output is as follows:

    Example
    Built swebench @ file:///opt/nccl-tests/SWE-bench
    Uninstalled 2 packages in 4ms
    Installed 2 packages in 4ms
    <frozen runpy>:128: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
    Using gold predictions
    README.md: 3.67kB [00:00, 25.5MB/s]
    data/dev-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107k/107k [00:00<00:00, 488kB/s]
    data/test-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.11M/1.11M [00:00<00:00, 13.0MB/s]
    Generating dev split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:00<00:00, 5326.84 examples/s]
    Generating test split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 25575.54 examples/s]
    Building base image (sweb.base.py.x86_64:latest)
    Base images built successfully.
    Total environment images to build: 1
    All environment images built successfully.
    Running 1 instances...
    Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:56<00:00, 56.05s/it, ✓=1, ✖=0, error=0]All instances run.
    Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:56<00:00, 56.05s/it, ✓=1, ✖=0, error=0]
    Cleaning cached images...
    Removed 0 images.
    Total instances: 1
    Instances submitted: 1
    Instances completed: 1
    Instances incomplete: 0
    Instances resolved: 1
    Instances unresolved: 0
    Instances with empty patches: 0
    Instances with errors: 0
    Unstopped containers: 0
    Unremoved images: 0
    Report written to gold.validate-gold.json

    A successful run creates a report file named gold.validate-gold.json in the working directory.
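    The report can also be inspected programmatically. The sketch below assumes the report's top-level keys mirror the summary counters the harness prints (total_instances, resolved_instances, error_instances); verify the names against your SWE-bench version before relying on them in automation:

    ```shell
    # Print headline counters from a SWE-bench evaluation report.
    summarize_report() {
      python3 -c '
    import json, sys

    report = json.load(open(sys.argv[1]))
    for key in ("total_instances", "resolved_instances", "error_instances"):
        print(key, report.get(key))
    ' "$1"
    }

    # Usage, after a run completes:
    #   summarize_report gold.validate-gold.json
    ```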

Known limitations

H200 GPU image build error

SWE-bench's base Docker image fails to build on H200 nodes. The build terminates with the following error:

Example
swebench.harness.docker_build.BuildImageError: Error building image sweb.base.py.x86_64:latest: The command '/bin/sh -c apt update && apt install -y wget git build-essential libffi-dev libtiff-dev python3 python3-pip python-is-python3 jq curl locales locales-all tzdata && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 255
Check (logs/build_images/base/sweb.base.py.x86_64__latest/build_image.log) for more information.

Non-InfiniBand node behavior

Enabling privileged pods on non-InfiniBand nodes may cause NCCL to fail to select the eth0 interface correctly. We recommend forcing NCCL and UCX onto eth0 by setting the following environment variables:

Example
export NCCL_SOCKET_IFNAME=eth0
export UCX_NET_DEVICES=eth0
export NCCL_COLLNET_ENABLE=0
export NCCL_IB_HCA=eth0
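After exporting the variables, a quick loop confirms they are set in the current shell before you launch anything:

```shell
# Print each NCCL/UCX override, or <unset> if it is missing.
check_nccl_env() {
  for var in NCCL_SOCKET_IFNAME UCX_NET_DEVICES NCCL_COLLNET_ENABLE NCCL_IB_HCA; do
    printf '%s=%s\n' "$var" "$(printenv "$var" || echo '<unset>')"
  done
}

# Usage, after exporting the variables above:
#   check_nccl_env
```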