Interact with Kubernetes
Use Kubernetes commands to manage SUNK
Because SUNK runs on Kubernetes, you manage it with Kubernetes commands. This page shows how to check Slurm logs and how to restart the Slurm Controller.
Accessing Logs
Because Slurm daemons run within Pods, their logs can be viewed using the kubectl logs command, optionally with the -f flag to follow the logs in real time. To access the logs of the Slurm Controller, which manages job submissions and scheduling, use this command:
$ kubectl logs -f -l app.kubernetes.io/name=slurmctld -c slurmctld
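When you want a bounded snapshot of the Controller logs rather than a live stream, the same selector can be combined with the standard --since and --tail flags of kubectl logs. The following is a minimal sketch, not part of SUNK itself: the function name controller_logs, the KUBECTL override variable, and the default values of 1h and 200 lines are assumptions made for illustration. It also assumes your current kubectl context already points at the namespace where SUNK runs.

```shell
# Sketch: print a bounded window of Slurm Controller logs instead of following them.
# KUBECTL is a hypothetical override; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

controller_logs() {
  local since="${1:-1h}"   # how far back to look, for example 30m or 2h (assumed default)
  local tail="${2:-200}"   # maximum number of lines to print (assumed default)
  "$KUBECTL" logs \
    -l app.kubernetes.io/name=slurmctld \
    -c slurmctld \
    --since="$since" \
    --tail="$tail"
}
```

For example, controller_logs 30m 50 prints at most the last 50 lines from the past 30 minutes. Setting --tail explicitly matters here because kubectl logs returns only the last 10 lines by default when a label selector is used.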
To see what's happening on a specific Slurm compute node, view the slurmd logs for its Pod. The name of a Slurm node matches the name of its corresponding Kubernetes Pod, so you can get the logs by substituting the node's name for <Pod name> in the following:
$ kubectl logs -f -c slurmd <Pod name>
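Because the Slurm node name doubles as the Pod name, a small wrapper can take the node name directly. This is a sketch under stated assumptions: node_logs and the KUBECTL override are hypothetical names, and the 100-line limit is an arbitrary default. Any extra arguments are passed through to kubectl logs, so standard flags such as -f or --previous (which shows the logs of the previous container instance after a restart) can be added.

```shell
# Sketch: fetch slurmd logs for one Slurm compute node, using the node name as the Pod name.
# KUBECTL is a hypothetical override; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

node_logs() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: node_logs <slurm-node-name> [extra kubectl logs flags]" >&2
    return 2
  fi
  shift
  # Remaining arguments (for example -f or --previous) go straight to kubectl logs.
  "$KUBECTL" logs -c slurmd --tail=100 "$@" "$node"
}
```

Calling node_logs <Pod name> --previous is useful when slurmd has restarted and you want to see what the earlier container logged before it exited.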
Restarting the Slurm Controller
If you need to restart the Slurm Controller, which can help with jobs that are stuck pending, first find the name of the Controller deployment:
$ kubectl get deployments -l app.kubernetes.io/component=controller
Now that you have the deployment name, use the following command to restart it:
$ kubectl rollout restart deployment <controller deployment name>
Make sure to replace <controller deployment name> with the name you found earlier.
You can confirm that the restart is happening by checking the status of the rollout:
$ kubectl rollout status deployment <controller deployment name>
Restarting the Controller doesn't cancel active jobs, so it's a safe operation to perform when troubleshooting, and it can resolve problems with jobs that are stuck pending.
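The three restart steps above can be combined into one function. This is a sketch rather than an official SUNK tool: restart_controller and the KUBECTL override are hypothetical names, and the 5m timeout is an assumed default. It relies on the standard -o name output of kubectl get, which prints resources in TYPE/NAME form (for example deployment.apps/<controller deployment name>) that kubectl rollout accepts directly. It refuses to continue unless the label selector matches exactly one deployment.

```shell
# Sketch: look up the Controller deployment, restart it, and wait for the rollout.
# KUBECTL is a hypothetical override that lets the function be tested against a stub.
KUBECTL="${KUBECTL:-kubectl}"

restart_controller() {
  local timeout="${1:-5m}"   # how long to wait for the rollout (assumed default)
  local deployments count

  # -o name prints one TYPE/NAME entry per line.
  deployments="$("$KUBECTL" get deployments \
    -l app.kubernetes.io/component=controller -o name)" || return 1

  # Count non-empty lines so that zero or several matches are rejected.
  count="$(printf '%s\n' "$deployments" | grep -c .)"
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one controller deployment, found $count" >&2
    return 1
  fi

  "$KUBECTL" rollout restart "$deployments" &&
    "$KUBECTL" rollout status "$deployments" --timeout="$timeout"
}
```

Running restart_controller 2m restarts the deployment and returns a non-zero status if the rollout has not completed within two minutes, which makes the function easy to use in a larger troubleshooting script.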