Profile Python applications on SUNK

This guide covers how to profile Python applications running in Slurm jobs on SUNK using py-spy and Linux perf tools. It assumes you have a working SUNK deployment with an active Slurm job to profile. SUNK compute Pods already have the SYS_PTRACE capability, so profilers can attach to processes without additional Pod configuration.

Prerequisites

To use py-spy and perf, the Kubernetes Nodes need the kernel.yama.ptrace_scope=0 and kernel.perf_event_paranoid=-1 kernel parameters set. Deploy a DaemonSet that runs a privileged container on each Node to configure these parameters.

The kernel.yama.ptrace_scope=0 kernel parameter allows processes to attach and read memory from other processes.
The kernel.perf_event_paranoid=-1 kernel parameter allows unprivileged access to performance monitoring.

If you only plan to use py-spy, the ptrace_scope parameter is the only one required. The perf_event_paranoid parameter is only needed for Linux perf.

Create a file called py-perf-ds.yaml with the following content:

py-perf-ds.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: perf-debug
  namespace: tenant-slurm
spec:
  selector:
    matchLabels:
      name: perf-debug
  template:
    metadata:
      labels:
        name: perf-debug
    spec:
      tolerations:
        - key: "sunk.coreweave.com/lock"
          operator: "Exists"
        - key: "sunk.coreweave.com/node"
          operator: "Exists"
      containers:
      - name: perf-debug
        image: busybox
        # Set the kernel parameters, then sleep to keep the container running.
        command:
          - /bin/sh
          - -c
          - >
            sysctl -w kernel.perf_event_paranoid=-1 &&
            sysctl -w kernel.yama.ptrace_scope=0 &&
            sleep infinity
        securityContext:
          privileged: true

This DaemonSet sets the required kernel parameters on every Node.

The sleep infinity command keeps the container running after the parameters are set, so the Pod stays in the Running state instead of exiting and being restarted.
The securityContext section runs the container in privileged mode so that it can set the kernel parameters.

Deploy the DaemonSet.
```
kubectl apply -f py-perf-ds.yaml
```

Verify the DaemonSet is running on all Nodes.

kubectl get ds perf-debug -n tenant-slurm

Verify the Pods are running.

kubectl get pods -n tenant-slurm -l name=perf-debug

Verify the kernel parameters are set on one of the Pods. Replace [POD-NAME] with one of the Pod names from the previous step.

kubectl exec -n tenant-slurm [POD-NAME] -- \
  sysctl kernel.perf_event_paranoid kernel.yama.ptrace_scope

A successful output should show the kernel parameters are set to -1 and 0 respectively.

kernel.perf_event_paranoid = -1
kernel.yama.ptrace_scope = 0
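
The same check can be scripted from inside a container that has Python available. The sketch below is a minimal example, assuming the parameters are readable under /proc/sys at the standard kernel paths; the check_kernel_params function and its sys_root argument are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# Expected values from the Prerequisites section of this guide.
EXPECTED = {
    "kernel/yama/ptrace_scope": 0,
    "kernel/perf_event_paranoid": -1,
}


def check_kernel_params(sys_root="/proc/sys"):
    """Return {parameter: (actual, expected)} for parameters that differ.

    A parameter that cannot be read is reported with actual=None.
    """
    mismatches = {}
    for rel_path, expected in EXPECTED.items():
        try:
            actual = int((Path(sys_root) / rel_path).read_text().strip())
        except (OSError, ValueError):
            actual = None
        if actual != expected:
            mismatches[rel_path.replace("/", ".")] = (actual, expected)
    return mismatches


if __name__ == "__main__":
    problems = check_kernel_params()
    for name, (actual, expected) in problems.items():
        print(f"{name}: found {actual}, expected {expected}")
    if not problems:
        print("Kernel parameters match the values this guide expects.")
```

An empty result means both parameters match. The sys_root argument exists only so the function can be pointed at a different directory for testing.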

Use `py-spy`

py-spy is a sampling profiler specifically designed for Python. It shows Python-level stack traces with function names, file paths, and line numbers.
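
The sample outputs later in this section refer to an app.py with compute_hash, process_data, and main functions, which this guide does not show. The script below is a hypothetical stand-in with the same call chain, useful as a predictable CPU-bound target when trying the commands. The function bodies, the rounds and iterations parameters, and the record layout are assumptions, and the line numbers will not match the sample outputs.

```python
import hashlib
import json


def compute_hash(payload: bytes, rounds: int = 2000) -> str:
    """Hash the payload repeatedly to keep the CPU busy."""
    digest = payload
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def process_data(records):
    """Serialize each record to JSON and hash the result."""
    return [compute_hash(json.dumps(record).encode()) for record in records]


def main(iterations: int = 5):
    """Process the same batch of records several times."""
    records = [{"id": i, "value": i * i} for i in range(100)]
    hashes = []
    for _ in range(iterations):
        hashes = process_data(records)
    return hashes


if __name__ == "__main__":
    # Raise the iteration count so the process runs long enough to attach to.
    main(iterations=5)
```

With a larger iteration count, most samples should land in compute_hash, with process_data and main appearing as its callers.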

Install `py-spy`

To install py-spy, start a debug container attached to the compute Pod where your Slurm job is running, then install py-spy in the debug container.

Identify the compute Pod running your job. In SUNK, the Slurm node name matches the Kubernetes Pod name. Use squeue from a login node to find the node, then use that name as the Pod name.
```
squeue -u $USER -o "%.18i %.9P %.8j %.8T %.10M %.6D %R"
```
Set the Pod name from the squeue output.
```
COMPUTE_POD=[SLURM-NODE-NAME]
```

Start a debug container attached to the compute Pod.

kubectl debug $COMPUTE_POD -n tenant-slurm \
  --target=slurmd \
  --image=python:3.12-slim \
  --profile=general \
  -it -- bash

Inside the debug container, install py-spy.
```
pip install py-spy
```

Show live top view in `py-spy`

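In SUNK compute Pods, PID 1 is slurmd, not your Python application. Find your Python process PID with ps aux | grep python and use that PID in the commands below. If ps is not available in the debug image, install it with apt-get update && apt-get install -y procps.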
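
Because the python:3.12-slim debug image already includes Python, the PID lookup can also be done by reading /proc directly. This is a minimal sketch, assuming the debug container shares the target's process namespace (as it does with --target) and that the application's executable name contains "python"; find_python_processes and proc_root are hypothetical names.

```python
from pathlib import Path


def find_python_processes(proc_root="/proc"):
    """Return sorted (pid, command line) pairs for processes that look like Python."""
    matches = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # Skip non-process entries such as /proc/meminfo.
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # The process exited or is not readable.
        args = [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]
        # Match on the executable name only, for example "python3.12".
        if args and "python" in Path(args[0]).name:
            matches.append((int(entry.name), " ".join(args)))
    return sorted(matches)


if __name__ == "__main__":
    for pid, cmdline in find_python_processes():
        print(pid, cmdline)
```

The listing includes the interpreter running this script itself, so pick the entry whose command line matches your application.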

To show a live, continuously updated top view of the profiling data, run the following command:

py-spy top --pid [PYTHON-PID]

A successful output should show the real-time profiling data.

Collecting samples from 'python /app.py' (pid: 42)
Total Samples 1000
GIL: 100%, Active: 100%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  45.00%  45.00%   4.50s     4.50s   compute_hash (app.py:7)
  30.00%  75.00%   3.00s     7.50s   process_data (app.py:11)
  15.00%  15.00%   1.50s     1.50s   dumps (json/__init__.py:231)
   5.00%  95.00%   0.50s     9.50s   main (app.py:18)
   5.00%   5.00%   0.50s     0.50s   sleep (time.py:123)

The output shows the following:

%Own: Percentage of time spent in this function itself
%Total: Percentage of time spent in this function + functions it calls
OwnTime: Total time spent in this function itself
TotalTime: Total time spent in this function + functions it calls
GIL: Percentage of time holding the Global Interpreter Lock
Function (filename:line): Function name, filename, and line number
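
The difference between the own and total columns follows from how a sampling profiler counts stacks. The sketch below is not py-spy's implementation. It is a simplified illustration that uses hypothetical stack samples built from the function names in the output above, chosen so that the counts reproduce the same percentages.

```python
from collections import Counter


def own_and_total(samples):
    """Count own and total samples per function.

    Each sample is one captured stack, ordered from the outermost caller
    to the function that was executing (the last element).
    """
    own, total = Counter(), Counter()
    for stack in samples:
        own[stack[-1]] += 1        # Only the executing function gets "own".
        for func in set(stack):    # set() avoids double-counting recursion.
            total[func] += 1
    return own, total


# 100 hypothetical samples, so each count is also a percentage.
samples = (
    [["main", "process_data", "compute_hash"]] * 45
    + [["main", "process_data"]] * 30
    + [["main", "dumps"]] * 15
    + [["main"]] * 5
    + [["sleep"]] * 5
)

own, total = own_and_total(samples)
for func in ("compute_hash", "process_data", "main"):
    print(f"{func}: own {own[func]}%, total {total[func]}%")
# compute_hash: own 45%, total 45%
# process_data: own 30%, total 75%
# main: own 5%, total 95%
```

A function such as main has a small own share but a large total share because it is on the stack whenever its callees are running.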

Record to SVG flamegraph

To record profiling data and generate an SVG flamegraph, run the following command:

py-spy record --pid [PYTHON-PID] --duration 30 -o /tmp/profile.svg

A successful output should show the profiling data has been written to the /tmp/profile.svg file.

py-spy> Sampling process 100 times a second for 30 seconds. Press Control-C to exit early.
py-spy> Wrote flamegraph data to '/tmp/profile.svg'. Samples: 3000

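From a new terminal on your local machine, copy the SVG file to your local machine. The -c flag selects the container that kubectl cp reads from, so the file must be present in that container's filesystem.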
```
kubectl cp tenant-slurm/$COMPUTE_POD:/tmp/profile.svg ./profile.svg -c slurmd
```
Open profile.svg in a browser to see the flamegraph.

Record to Speedscope format

Speedscope is a web-based viewer for performance profiles. To record profiling data and generate a Speedscope JSON file, run the following command:

py-spy record --pid [PYTHON-PID] --duration 30 --format speedscope -o /tmp/profile.speedscope.json

From a new terminal on your local machine, copy the JSON file to your local machine.

kubectl cp tenant-slurm/$COMPUTE_POD:/tmp/profile.speedscope.json ./profile.speedscope.json -c slurmd

Upload the JSON file to Speedscope for analysis.
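
For a quick summary without the web viewer, the JSON file can also be inspected with a short script. The sketch below assumes the Speedscope file layout of a shared.frames list plus sampled profiles whose samples are lists of frame indices ordered from caller to callee. It has not been checked against every py-spy version, and top_own_frames is a hypothetical helper name.

```python
import json
import sys
from collections import Counter


def top_own_frames(profile, limit=5):
    """Return the frames most often found at the leaf of a sampled stack."""
    frames = profile["shared"]["frames"]
    weight_by_frame = Counter()
    for prof in profile["profiles"]:
        if prof.get("type") != "sampled":
            continue  # This sketch only handles sampled profiles.
        weights = prof.get("weights") or [1] * len(prof["samples"])
        for stack, weight in zip(prof["samples"], weights):
            if stack:
                weight_by_frame[stack[-1]] += weight  # Last index is the leaf.
    return [
        (frames[idx]["name"], frames[idx].get("file"), weight)
        for idx, weight in weight_by_frame.most_common(limit)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        for name, file, weight in top_own_frames(json.load(fh)):
            print(f"{weight:>10} {name} ({file})")
```

Run it with the path to profile.speedscope.json as the only argument. The weights are in whatever unit the profile declares, so treat them as relative values.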

Show thread activity

To show what each thread is doing, run the following command:

py-spy dump --pid [PYTHON-PID]

A successful output should show the thread activity.

Process 42: python /app.py
Thread 1 (active): "MainThread"
    compute_hash (app.py:7)
    process_data (app.py:11)
    main (app.py:18)
    <module> (app.py:28)

Only show threads holding GIL

The GIL (Global Interpreter Lock) is a mutex that allows only one thread to execute Python bytecode at a time. Filtering on the GIL helps you see which threads are actually consuming Python CPU time, as opposed to waiting on I/O. To show only threads holding the GIL, run the following command:

py-spy top --pid [PYTHON-PID] --gil

`py-spy` options

py-spy has several options that can be used to configure the profiling data.

`rate` option

The --rate option sets the sampling rate in samples per second. The default is 100 Hz. To sample at a higher rate, such as 500 Hz, run the following command:

py-spy record --pid [PYTHON-PID] --rate 500 -o profile.svg

`native` option

The --native option includes stack frames from native (C/C++) extensions alongside the Python frames. To include native frames, run the following command:

py-spy record --pid [PYTHON-PID] --native -o profile.svg

`idle` option

The --idle option includes idle threads in the results, which py-spy omits by default. To include idle threads, run the following command:

py-spy record --pid [PYTHON-PID] --idle -o profile.svg

`nonblocking` option

The --nonblocking option samples without pausing the target process. This reduces the impact on the target but may produce less accurate results. To run in non-blocking mode, run the following command:

py-spy record --pid [PYTHON-PID] --nonblocking -o profile.svg

Use `perf`

Linux perf is a powerful performance analysis tool that shows system-level and native code performance.

Install `perf`

To install perf, start a debug container attached to the compute Pod where your Slurm job is running, then install perf in the debug container.

If you have not already identified the compute Pod, find it using squeue from a login node and set the Pod name.
```
COMPUTE_POD=[SLURM-NODE-NAME]
```

Start a debug container with Ubuntu. The Ubuntu distribution has perf tools available.

kubectl debug $COMPUTE_POD -n tenant-slurm \
  --target=slurmd \
  --image=ubuntu:22.04 \
  --profile=general \
  -it -- bash

Inside the debug container, install perf.

apt-get update && apt-get install -y linux-tools-generic

Locate the perf binary. The location varies by kernel version.

PERF=$(find /usr/lib/linux-tools -name perf | head -1)

Show live top view in `perf`

To show real-time usage by function, run the following command:

$PERF top -p [PYTHON-PID]

A successful output should show real-time usage by function, sorted by overhead. The percentages below are illustrative.

Samples: 8K of event 'cycles', 4000 Hz, Event count (approx.): 2841251931 lost: 0/0 drop: 0/0
Overhead  Shared Object        Symbol
  12.50%  python3.12           [.] _PyEval_EvalFrameDefault
   8.30%  python3.12           [.] PyObject_GetAttr
   6.20%  [kernel]             [k] copy_user_enhanced_fast_string
   5.10%  python3.12           [.] _PyObject_GenericGetAttrWithDict
   4.80%  _hashlib.so          [.] EVP_MD_CTX_copy_ex
   3.90%  python3.12           [.] PyDict_GetItem
   3.20%  libc.so.6            [.] __memcpy_avx_unaligned
   2.80%  python3.12           [.] PyUnicode_AsUTF8AndSize

The output shows the following:

Overhead: CPU time percentage
Shared Object: The library/binary (python3.12, kernel, libc)
Symbol: Function name
[.]: User-space function
[k]: Kernel function

Record and generate a report

To record for a specific duration, then generate a report, use the record command, then the report command. This records at 99 Hz for 30 seconds with call graphs.

$PERF record -F 99 -p [PYTHON-PID] -g -- sleep 30

Use the report command to view the report.

$PERF report

The report command shows output similar to the following:

# Samples: 2K of event 'cycles'
# Event count (approx.): 1984327896
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .................................
#
    15.23%  python   python3.12        [.] _PyEval_EvalFrameDefault
            |
            ---_PyEval_EvalFrameDefault
               |--45.00%--compute_hash
               |--30.00%--process_data
               |--15.00%--json_dumps

     8.91%  python   _hashlib.so       [.] EVP_DigestUpdate
            |
            ---EVP_DigestUpdate
               HASH_Update
               _hashlib_openssl_sha256_update

Use arrow keys to navigate the report and press Enter to expand call chains.

Generate a text report

To generate a text report to save or share results, use the report command with the --stdio option.

$PERF report --stdio > perf_report.txt

View detailed statistics

To view detailed statistics, use the stat command. This shows statistics for the process with the given PID for the specified duration.

$PERF stat -p [PYTHON-PID] -- sleep 10

A successful output should show the statistics.

 Performance counter stats for process id '42':

         10,234.56 msec task-clock                #    1.023 CPUs utilized
             1,234      context-switches          #  120.567 /sec
                45      cpu-migrations            #    4.398 /sec
               123      page-faults               #   12.024 /sec
    38,456,789,012      cycles                    #    3.758 GHz
    24,123,456,789      instructions              #    0.63  insn per cycle
     5,678,901,234      branches                  #  554.932 M/sec
        12,345,678      branch-misses             #    0.22% of all branches

      10.003456789 seconds time elapsed
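
The ratios in the right-hand column are derived from the raw counters, which is useful to know when comparing runs. The worked example below recomputes three of them from the sample output above; derived_metrics is a hypothetical helper name, not part of perf.

```python
def derived_metrics(task_clock_ms, cycles, instructions, branches, branch_misses):
    """Recompute ratios that perf stat prints next to the raw counters."""
    seconds_on_cpu = task_clock_ms / 1000.0
    return {
        "ghz": cycles / seconds_on_cpu / 1e9,
        "insn_per_cycle": instructions / cycles,
        "branch_miss_pct": 100.0 * branch_misses / branches,
    }


# Counter values copied from the sample output above.
metrics = derived_metrics(
    task_clock_ms=10_234.56,
    cycles=38_456_789_012,
    instructions=24_123_456_789,
    branches=5_678_901_234,
    branch_misses=12_345_678,
)
print(f"{metrics['ghz']:.3f} GHz")                        # 3.758 GHz
print(f"{metrics['insn_per_cycle']:.2f} insn per cycle")  # 0.63 insn per cycle
print(f"{metrics['branch_miss_pct']:.2f}% of all branches")  # 0.22% of all branches
```

A low instructions-per-cycle value can indicate that the CPU is stalling, for example on memory access, which is a reason to look at native frames rather than Python-level hotspots alone.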

Generate flamegraph data

To generate flamegraph data, use the record command, then the script command. This will output the data as a script that can be used with the FlameGraph tool.

Record the data. This records at 99 Hz for 30 seconds with call graphs.
```
$PERF record -F 99 -p [PYTHON-PID] -g -- sleep 30
```

Use the script command to output the data as a script.

# Output as script format
$PERF script > perf.data.script

Copy the script file to your local machine.

kubectl cp tenant-slurm/$COMPUTE_POD:/path/to/perf.data.script ./perf.data.script -c slurmd

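Use the FlameGraph tool to generate a flamegraph from the script file. Its stackcollapse-perf.pl script folds the perf script output into one line per unique stack, and its flamegraph.pl script renders the folded stacks as an SVG.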
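
Flamegraph tooling works on a "folded" intermediate format: one line per unique stack, with frames joined by semicolons from root to leaf, followed by a sample count. The sketch below only illustrates that format using hypothetical stacks. It does not parse perf script output, which is the job of the FlameGraph scripts, and fold_stacks is a hypothetical name.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse stacks into 'root;...;leaf count' lines, sorted by stack."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{folded} {count}" for folded, count in sorted(counts.items())]


# Hypothetical stacks, ordered from the outermost caller to the leaf.
stacks = [
    ["main", "process_data", "compute_hash"],
    ["main", "process_data", "compute_hash"],
    ["main", "process_data"],
]
for line in fold_stacks(stacks):
    print(line)
# main;process_data 1
# main;process_data;compute_hash 2
```

Each folded line becomes one tower in the flamegraph, and the count determines its width.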

`perf` options

perf has several options that can be used to configure the profiling data.

`-F` option

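The -F option sets the sampling frequency in Hz. The examples in this guide use 99 Hz; if you omit -F, perf typically samples at a much higher rate (4000 Hz in recent versions). To sample at a higher rate than the examples, such as 999 Hz, run the following command: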

$PERF record -F 999 -p [PYTHON-PID] -g -- sleep 30

The -a option records activity on all CPUs instead of a single process.

$PERF record -F 99 -a -g -- sleep 30

The -e option can be used to record specific events.

$PERF record -e cycles,instructions -p [PYTHON-PID] -g -- sleep 30

The --call-graph dwarf option unwinds call stacks using DWARF debug information, which is more accurate but incurs more overhead.

$PERF record -F 99 -p [PYTHON-PID] --call-graph dwarf -- sleep 30

Comparison: `py-spy` vs `perf`

py-spy and perf are two different tools for profiling Python applications in Kubernetes. They have different strengths and weaknesses.

| Feature | py-spy | perf |
| --- | --- | --- |
| Focus | Python code only | All code (Python, C, kernel) |
| Output | Function names, file:line | Native symbols, may need debug symbols |
| Ease of use | Very easy, Python-specific | More complex, general purpose |
| Overhead | Very low (~1-2%) | Low (~1-5%) |
| Best for | Python performance issues | System/native code issues, CPU/cache analysis |
| GIL detection | Yes, built-in | No |
| Multi-threaded | Shows Python threads clearly | Shows all threads |
| Setup | Just install py-spy | Need kernel tools, may need debug symbols |
| Output formats | SVG, speedscope, text | Text report, script for flamegraphs |

Use py-spy when you need to profile pure Python code. It provides a quick, easy-to-read view of the Python code. It has very low overhead, so it’s suitable to use in production. It’s ideal when you need to see Python function names and line numbers, or want to understand GIL contention.

Use perf when you need to profile system-level performance. It provides a detailed view of the C extensions and kernel code, including CPU cache, branch prediction, and hardware counter data. It’s ideal when you suspect issues in native libraries (numpy, pandas C code, etc.), or need to correlate Python and kernel activity.

Use both tools when you need to profile complex performance issues and get a complete picture of the performance. This is particularly useful if your code uses significant C extensions.

Troubleshooting: “Process not found”

Both py-spy and perf must attach to a running target process. If the process has exited or the PID is incorrect, they fail with a “Process not found” error. To fix this, verify that the target process is running and the PID is correct.

Verify the target process is running: ps aux | grep python
Verify you are in the right container and namespace: kubectl get pods -n tenant-slurm
If using kubectl debug, verify that --target is set to slurmd.

Troubleshooting `py-spy`

“Permission Denied” error

You may encounter a “Permission Denied” error when trying to profile a process with py-spy. For example:

Error: Permission Denied: Try running again with elevated permissions

SUNK compute Pods already have the SYS_PTRACE capability, so this error typically means the kernel parameters are not set correctly. Verify the DaemonSet is running and the ptrace_scope parameter is configured.

Verify the DaemonSet is running: kubectl get pods -n tenant-slurm -l name=perf-debug

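Check the kernel parameters. kubectl exec does not accept a label selector, so replace [POD-NAME] with one of the Pod names from the previous command:

kubectl exec -n tenant-slurm [POD-NAME] -- \
  sysctl kernel.yama.ptrace_scope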

A successful output should show the kernel parameter is set to 0.

kernel.yama.ptrace_scope = 0

Use --profile=general when running kubectl debug.

“Failed to find python version” error

You may encounter a “Failed to find python version” error when trying to profile a process with py-spy. This means the target PID is not a Python process. For example:

Error: Failed to find python version from target process

In SUNK compute Pods, PID 1 is typically slurmd, not your Python application. Find the correct Python PID first:

ps aux | grep python

Then, use that PID to profile the application.

py-spy top --pid [PYTHON-PID]

Troubleshooting `perf`

“failed with EPERM” error

You may encounter a “failed with EPERM” error when trying to profile a process with perf.

Error: perf_event_open(...) failed with EPERM

To fix this, verify that the DaemonSet is running:

kubectl get pods -n tenant-slurm -l name=perf-debug

Next, verify the kernel parameters are set correctly. kubectl exec does not accept a label selector, so replace [POD-NAME] with one of the Pod names from the previous command:

kubectl exec -n tenant-slurm [POD-NAME] -- \
  sysctl kernel.perf_event_paranoid

The output should show kernel.perf_event_paranoid is set to -1.

No symbols found

If perf shows hex addresses instead of function names, debug symbols are probably missing. To fix this, install the debug symbols for Python.

apt-get install python3-dbg

Use the --call-graph dwarf option for better stack traces.

$PERF record --call-graph dwarf -p [PYTHON-PID] -- sleep 30

Tips for successful profiling

Start with py-spy. It’s easier to use than perf and usually sufficient for Python issues.
Use low sampling rates in production. 99-100 Hz is usually enough resolution for most applications.
Record for at least 30 seconds. This gives representative samples.
Profile during representative load. Idle or startup profiles aren’t useful.
Copy profiles out immediately. Debug containers are ephemeral and will be deleted after the profiling session.
Don’t leave the DaemonSet running in production long-term. It uses privileged access and it’s a security risk to leave it running after the profiling session.
Use flamegraphs. They’re much easier to understand than text output. Flamegraphs show the most time-consuming functions and their callers, making it easy to identify bottlenecks.
Compare before and after. Profile before optimizing to establish a baseline, then profile again afterward to measure the impact of your changes.

Typical workflow

Here’s a typical workflow for profiling a Python application running in a Slurm job on SUNK.

Deploy the DaemonSet to set the kernel parameters on all Nodes, using the py-perf-ds.yaml file from the Prerequisites section.
```
kubectl apply -f py-perf-ds.yaml
```

Find the compute Pod running your Slurm job. From a login node, use squeue to identify the Slurm node, then start a profiling session.

COMPUTE_POD=[SLURM-NODE-NAME]
kubectl debug $COMPUTE_POD -n tenant-slurm \
  --target=slurmd \
  --image=python:3.12-slim \
  --profile=general \
  -it -- bash

Inside the debug container, install py-spy, find your Python process, and check the top view.
```
pip install py-spy
ps aux | grep python
py-spy top --pid [PYTHON-PID]
```

Record a flamegraph for analysis.

py-spy record --pid [PYTHON-PID] --duration 60 -o /tmp/profile.svg

From another terminal, copy the flamegraph to your local machine. This is important because the debug container will be deleted after the profiling session.
```
kubectl cp tenant-slurm/$COMPUTE_POD:/tmp/profile.svg ./profile.svg -c slurmd
```

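If needed, dive deeper with perf. Run these commands in an Ubuntu-based debug container, as described in the Install perf section, because linux-tools-generic is an Ubuntu package and may not be available in the Debian-based python:3.12-slim image.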

apt-get update && apt-get install -y linux-tools-generic
PERF=$(find /usr/lib/linux-tools -name perf | head -1)
$PERF record -F 99 -p [PYTHON-PID] -g -- sleep 60
$PERF report

Clean up the DaemonSet when done.
```
kubectl delete -f py-perf-ds.yaml
```
