py-spy and Linux perf tools. Profiling helps you identify performance bottlenecks in CPU usage, function calls, and native code so you can optimize your application. The guidance is aimed at developers and operators who already have a working SUNK deployment with an active Slurm job to profile.
SUNK compute Pods already have the SYS_PTRACE capability, so profilers can attach to processes without additional Pod configuration.
The following sections describe the prerequisite kernel configuration, how to use each profiler, a comparison of the two tools, and troubleshooting guidance.
Prerequisites
Before you profile, you must configure the kernel parameters that perf requires on each Kubernetes Node. To use perf, the Kubernetes Nodes must have the kernel.yama.ptrace_scope=0 and kernel.perf_event_paranoid=-1 kernel parameters set. Deploy a DaemonSet that runs a privileged container on each Node to configure these parameters.
- The kernel.yama.ptrace_scope=0 kernel parameter lets processes attach to and read memory from other processes.
- The kernel.perf_event_paranoid=-1 kernel parameter allows unprivileged access to performance monitoring.
If you only plan to use py-spy, the ptrace_scope parameter is the only one required. The perf_event_paranoid parameter is only needed for Linux perf.
- Create a file called py-perf-ds.yaml that defines the DaemonSet. This DaemonSet sets the required kernel parameters on every Node.
  - It uses sleep infinity to keep the container running so that the kernel parameters persist while the container is running.
  - The securityContext section runs the container in privileged mode to let it set the kernel parameters.
- Deploy the DaemonSet.
- Verify the DaemonSet is running on all Nodes.
- Verify the Pods are running.
- Verify the kernel parameters are set on one of the Pods. Replace [POD-NAME] with one of the Pod names from the previous step. The expected values are -1 for kernel.perf_event_paranoid and 0 for kernel.yama.ptrace_scope.
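The verification step can also be scripted. The following is a minimal sketch, not part of the SUNK tooling, that reads the two values from the standard Linux /proc/sys tree and compares them with the values this section requires. The function name check_kernel_params and its sysctl_root parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path

# Expected values, taken from the Prerequisites section of this tutorial.
REQUIRED = {
    "kernel/yama/ptrace_scope": 0,
    "kernel/perf_event_paranoid": -1,
}


def check_kernel_params(sysctl_root="/proc/sys", required=REQUIRED):
    """Return {parameter: (actual, expected, ok)} for each required value.

    `actual` is None when the file is missing or cannot be parsed.
    `sysctl_root` is a hypothetical parameter that makes the check testable.
    """
    results = {}
    for rel_path, expected in required.items():
        try:
            actual = int((Path(sysctl_root) / rel_path).read_text().strip())
        except (OSError, ValueError):
            actual = None
        name = rel_path.replace("/", ".")
        results[name] = (actual, expected, actual == expected)
    return results


for name, (actual, expected, ok) in check_kernel_params().items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: actual={actual} expected={expected} [{status}]")
```

Run it in a context that sees the Node's kernel settings, such as one of the DaemonSet Pods. A MISMATCH line points at the parameter that still needs to be set.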
Use py-spy
py-spy is a sampling profiler designed for Python. It shows Python-level stack traces with function names, file paths, and line numbers. The following sections describe how to install py-spy in a debug container, view live profiling data, record flamegraphs, and configure common options.
Install py-spy
To install py-spy, start a debug container attached to the compute Pod where your Slurm job is running, then install py-spy in the debug container.
- Identify the compute Pod running your job. In SUNK, the Slurm node name matches the Kubernetes Pod name. Use squeue from a login node to find the node, then use that name as the Pod name.
- Set the Pod name from the squeue output.
- Start a debug container attached to the compute Pod.
- Inside the debug container, install py-spy.
Show live top view in py-spy
In SUNK compute Pods, PID 1 is slurmd, not your Python application. Find your Python process PID with ps aux | grep python and use that PID in the following commands.
To show a live view of the functions that consume the most time, use the py-spy top command.
The output includes the following fields:
- %Own: Percentage of time spent in this function itself.
- %Total: Percentage of time spent in this function and the functions it calls.
- OwnTime: Total time spent in this function itself.
- TotalTime: Total time spent in this function and the functions it calls.
- GIL: Percentage of time holding the Global Interpreter Lock.
- Function (filename:line): Function name, filename, and line number.
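To make the difference between the Own and Total columns concrete, here is a small, hypothetical workload (the function names are invented for this sketch). In a profile of this script, you would expect busy_inner to carry nearly all of the own time, while outer shows a similar total time but very little own time, because it only waits for busy_inner to return.

```python
import time


def busy_inner(deadline):
    """Spin on pure-Python arithmetic until the deadline.

    The loop body runs in this frame, so samples count as own time here.
    """
    total = 0
    while time.monotonic() < deadline:
        for i in range(1000):
            total += i * i
    return total


def outer(seconds):
    """Do almost no work itself; its time is inherited from busy_inner."""
    return busy_inner(time.monotonic() + seconds)
```

To try it, call outer with a duration long enough to attach the profiler, for example outer(60), and watch the top view while it runs.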
Record to SVG flamegraph
To record profiling data and generate an SVG flamegraph:
- Run py-spy record against your Python PID. The output shows that the profiling data is written to the /tmp/profile.svg file.
- Open a new terminal on your local machine and copy the SVG file to your local machine.
- Open profile.svg in a browser to see the flamegraph.
Record to Speedscope format
Speedscope is a web-based viewer for performance profiles. To record profiling data and generate a Speedscope JSON file:
- Run py-spy record with the speedscope output format (--format speedscope).
- Open a new terminal on your local machine and copy the JSON file to your local machine.
- Upload the JSON file to Speedscope for analysis.
Show thread activity
To show what each thread is doing, use the py-spy dump command.
Only show threads holding the GIL
The GIL (Global Interpreter Lock) is a mutex that protects the Python interpreter from concurrent execution. It ensures that only one thread executes Python code at a time. Monitoring the GIL helps you understand Python CPU usage (ignoring I/O wait). To show only the threads holding the GIL, use the --gil flag with py-spy top.
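The following sketch illustrates the kind of process where the GIL view is informative. It assumes a standard CPython build with the GIL enabled, and all names are hypothetical. One thread runs pure-Python arithmetic and therefore needs the GIL, while the other spends most of its time in time.sleep, which releases the GIL while waiting. With the GIL filter active, you would expect samples to be attributed almost entirely to the CPU-bound thread.

```python
import threading
import time


def cpu_bound(stop_at):
    """Pure-Python loop: executes bytecode, so it must hold the GIL to run."""
    count = 0
    while time.monotonic() < stop_at:
        count += 1
    return count


def io_like(stop_at):
    """Mostly sleeping: time.sleep releases the GIL while it waits."""
    wakeups = 0
    while time.monotonic() < stop_at:
        time.sleep(0.01)
        wakeups += 1
    return wakeups


def run_demo(seconds):
    """Run both workers side by side and return their iteration counts."""
    stop_at = time.monotonic() + seconds
    results = {}

    def record(name, fn):
        results[name] = fn(stop_at)

    threads = [
        threading.Thread(target=record, args=("cpu", cpu_bound), name="cpu-worker"),
        threading.Thread(target=record, args=("io", io_like), name="io-worker"),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_demo(60) leaves enough time to attach the profiler and compare the view with and without the GIL filter.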
py-spy options
py-spy has several options you can use to configure the profiling data.
rate option
Use the rate option to set the sampling rate. The default is 100 Hz. To sample at a higher rate, such as 500 Hz, pass --rate 500.
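As a rough guide to choosing a rate, the number of sampling ticks in a recording is the rate multiplied by the duration. The sketch below works through that arithmetic; the helper names are hypothetical, and the 30-second duration is the minimum recording length suggested in the tips later in this tutorial.

```python
def expected_samples(rate_hz, duration_s):
    """Number of sampling ticks in a recording: rate x duration."""
    return int(rate_hz * duration_s)


def smallest_share(rate_hz, duration_s):
    """Fraction of the recording that a single sample represents."""
    return 1.0 / expected_samples(rate_hz, duration_s)


# Default rate (100 Hz) versus the higher rate mentioned above (500 Hz),
# both over a 30-second recording.
print(expected_samples(100, 30))   # 3000
print(expected_samples(500, 30))   # 15000
print(f"{smallest_share(100, 30):.4%}")  # 0.0333%
```

At the default rate, one sample already corresponds to about 0.03% of a 30-second recording, so a higher rate mainly helps with short recordings or very short-lived functions.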
native option
Use the native option to show native (C/C++) extensions.
idle option
Use the idle option to show idle threads.
nonblocking option
Use the nonblocking option to run in non-blocking mode, which doesn’t pause the target process.
Use perf
Linux perf is a performance analysis tool that shows system-level and native code performance. The following sections describe how to install perf in a debug container, view live profiling data, record reports and flamegraph data, and configure common options.
Install perf
To install perf, start a debug container attached to the compute Pod where your Slurm job is running, then install perf in the debug container.
- If you have not already identified the compute Pod, find it using squeue from a login node and set the Pod name.
- Start a debug container with Ubuntu. The Ubuntu distribution has perf tools available.
- Inside the debug container, install perf.
- Locate the perf binary. The location varies by kernel version.
Show live top view in perf
To show real-time CPU usage by function, use the perf top command.
The output includes the following columns and markers:
- Overhead: CPU time percentage.
- Shared Object: The library or binary (python3.12, kernel, libc).
- Symbol: Function name.
- [.]: User-space function.
- [k]: Kernel function.
Record and generate a report
To record for a specific duration, then generate a report, use the record command, then the report command.
The record step samples at 99 Hz for 30 seconds with call graphs.
Use the report command to view the report.
Generate a text report
To generate a text report to save or share results, use the report command with the --stdio option.
View detailed statistics
To view detailed statistics, use the stat command. This shows statistics for the process with the given PID for the specified duration.
Generate flamegraph data
To generate flamegraph data, use the record command, then the script command. The script command writes the recorded samples as text that the FlameGraph tool can process.
- Record the data. This records at 99 Hz for 30 seconds with call graphs.
- Use the script command to output the data as a script.
- Copy the script file to your local machine.
- Use the FlameGraph tool to generate a flamegraph from the script file.
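Before drawing the graph, the FlameGraph tooling collapses the sampled stacks into a folded format: one line per unique stack, with frames joined by semicolons and followed by a sample count. The sketch below shows only that idea on hand-written stacks. It does not parse real perf script output and is not a replacement for the stackcollapse script in the FlameGraph repository. The function name and sample data are hypothetical.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse sampled stacks into 'frame;frame;frame count' lines.

    Each stack is a list of frame names ordered root first, leaf last.
    """
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{folded} {n}" for folded, n in sorted(counts.items())]


# Hypothetical samples, not output from a real perf recording.
samples = [
    ["main", "train_step", "matmul"],
    ["main", "train_step", "matmul"],
    ["main", "load_batch"],
]
for line in fold_stacks(samples):
    print(line)
# main;load_batch 1
# main;train_step;matmul 2
```

The width of each box in the flamegraph is proportional to these counts, which is why a longer recording gives a more stable picture.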
perf options
perf has several options you can use to configure the profiling data.
-F option
Use the -F option to set the sampling rate. The examples in this tutorial use 99 Hz. To sample at a higher rate, such as 999 Hz, pass -F 999.
-a option
Use the -a option to record all CPUs.
-e option
Use the -e option to record specific events.
--call-graph option
Use the --call-graph option to record with a call graph using dwarf, which is more accurate but also incurs more overhead.
Compare py-spy and perf
py-spy and perf are two different tools for profiling Python applications in Kubernetes. They have different strengths and weaknesses. The following table and guidance help you decide which tool to use for a given investigation.
| Feature | py-spy | perf |
|---|---|---|
| Focus | Python code only | All code (Python, C, kernel) |
| Output | Function names, file:line | Native symbols, may need debug symbols |
| Ease of use | Easy, Python-specific | More complex, general purpose |
| Overhead | Low (about 1% to 2%) | Low (about 1% to 5%) |
| Best for | Python performance issues | System or native code issues, CPU and cache analysis |
| GIL detection | Yes, built-in | No |
| Multi-threaded | Shows Python threads clearly | Shows all threads |
| Setup | Install py-spy | Need kernel tools, may need debug symbols |
| Output formats | SVG, speedscope, text | Text report, script for flamegraphs |
Use py-spy when you need to profile pure Python code. It provides a quick, easy-to-read view of the Python code. It has low overhead, so it’s suitable to use in production. It’s a good fit when you need to see Python function names and line numbers, or want to understand GIL contention.
Use perf when you need to profile system-level performance. It provides a detailed view of the C extensions and kernel code, including CPU cache, branch prediction, and hardware counter data. It’s a good fit when you suspect issues in native libraries (such as numpy or pandas C code), or need to correlate Python and kernel activity.
Use both tools when you need to profile complex performance issues and get a complete picture of the performance. This is useful if your code uses C extensions.
Troubleshoot “Process not found” errors
Both py-spy and perf can return a “Process not found” error when you profile a process.
Both tools must attach to the target process. If the target process is not running, or the PID is incorrect, the tools fail with a “Process not found” error.
To fix this, verify the target process is running and the PID is correct.
- Verify the target process is running: ps aux | grep python
- Verify you’re in the right container and namespace: kubectl get pods -n tenant-slurm
- If you use kubectl debug, verify that --target is set to slurmd.
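If Python is available in the debug container, the first check can also be done programmatically. This is a minimal sketch that uses the standard os.kill call with signal 0, which tests for existence without delivering a signal. The function name process_exists is hypothetical.

```python
import os


def process_exists(pid):
    """Return True if a process with this PID is visible from this namespace."""
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A False result for a PID that you copied from another terminal often means that terminal is in a different container or PID namespace.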
Troubleshoot py-spy
“Permission Denied” error
You might see a “Permission Denied” error when you profile a process with py-spy.
SUNK compute Pods already have the SYS_PTRACE capability, so this error usually means the kernel parameters are not set correctly. Verify the DaemonSet is running and the ptrace_scope parameter is configured.
- Verify the DaemonSet is running: kubectl get pods -n tenant-slurm -l name=perf-debug
- Check the kernel parameters. The output should show kernel.yama.ptrace_scope set to 0.
- Use --profile=general when you run kubectl debug.
“Failed to find python version” error
You might see a “Failed to find python version” error when you profile a process with py-spy. This means the target PID is not a Python process.
In SUNK compute Pods, PID 1 is slurmd, not your Python application. Find the correct Python PID first, for example with ps aux | grep python.
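As an alternative to scanning ps output by eye, the lookup can be sketched in Python by reading the standard /proc/[pid]/cmdline files. Unlike the grep approach, this sketch matches only on the executable name, so it skips processes that merely mention python in an argument. The function name find_python_pids and its proc_root parameter are hypothetical.

```python
import os


def find_python_pids(proc_root="/proc"):
    """Return [(pid, cmdline)] for processes whose executable name contains 'python'."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                argv = f.read().split(b"\0")  # arguments are NUL-separated
        except OSError:
            continue  # process exited or is not readable
        exe = os.path.basename(argv[0].decode(errors="replace")) if argv[0] else ""
        if "python" in exe:
            cmdline = " ".join(a.decode(errors="replace") for a in argv if a)
            matches.append((int(entry), cmdline))
    return sorted(matches)


for pid, cmdline in find_python_pids():
    print(pid, cmdline)
```

The output includes the script itself, because it also runs under a Python interpreter. Pick the PID whose command line matches your application.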
Troubleshoot perf
“failed with EPERM” error
You might see a “failed with EPERM” error when you profile a process with perf.
To fix this, verify that kernel.perf_event_paranoid is set to -1.
No symbols found
If perf shows hex addresses instead of function names, the debug symbols are not installed.
To fix this, install the debug symbols for Python.
You can also use the --call-graph dwarf option for better stack traces.
Tips for successful profiling
- Start with py-spy. It’s easier to use than perf and usually sufficient for Python issues.
- Use low sampling rates in production. A rate of about 99 Hz to 100 Hz gives enough resolution for most applications.
- Record for at least 30 seconds. This gives representative samples.
- Profile during representative load. Idle or startup profiles aren’t useful.
- Copy profiles out immediately. Debug containers are ephemeral and are deleted after the profiling session.
- Don’t leave the DaemonSet running in production long-term. It uses privileged access, and leaving it running after the profiling session is a security risk.
- Use flamegraphs. They’re easier to understand than text output. Flamegraphs show the most time-consuming functions and their callers, making it easier to identify bottlenecks.
- Compare before and after. Profile before optimization to establish a baseline, then profile again afterward under the same load. This helps you understand the impact of your optimizations.
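The before-and-after comparison in the last tip can be sketched as a small calculation over per-function sample counts. The counts below are hypothetical and are not measurements from a real job, and how you extract such counts from a profile is left open. The function assumes both profiles contain at least one sample.

```python
def share_changes(before, after):
    """Compare two {function: sample_count} profiles.

    Returns {function: change in percentage points of total samples},
    ordered from the largest decrease to the largest increase.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    changes = {}
    for name in set(before) | set(after):
        pct_before = 100.0 * before.get(name, 0) / total_before
        pct_after = 100.0 * after.get(name, 0) / total_after
        changes[name] = round(pct_after - pct_before, 1)
    return dict(sorted(changes.items(), key=lambda item: item[1]))


# Hypothetical sample counts for a baseline and an optimized run.
baseline = {"parse_rows": 1800, "write_output": 900, "other": 300}
optimized = {"parse_rows": 600, "write_output": 900, "other": 300}
print(share_changes(baseline, optimized))
# {'parse_rows': -26.7, 'other': 6.7, 'write_output': 20.0}
```

Note that write_output gains 20 percentage points even though its sample count is unchanged, because the total shrank. Shares are relative, so check absolute sample counts or wall-clock time as well before concluding that a function got slower.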
Typical workflow
The following workflow shows a typical end-to-end profiling session for a Python application running in a Slurm job on SUNK. Use it as a reference for combining the steps in this tutorial.
- Deploy the DaemonSet to set the kernel parameters on all Nodes. See the example DaemonSet file in the Prerequisites section.
- Find the compute Pod running your Slurm job. From a login node, use squeue to identify the Slurm node, then start a profiling session.
- Inside the debug container, install py-spy, find your Python process, and check the top view.
- Record a flamegraph for analysis.
- From another terminal, copy the flamegraph to your local machine. This is important because the debug container is deleted after the profiling session.
- If needed, dive deeper with perf.
- Clean up the DaemonSet when done. Removing the DaemonSet reduces the security exposure of leaving privileged Pods running on the cluster.