Interacting with SUNK Using Kubectl
Kubernetes is a powerful container orchestration platform that manages the underlying infrastructure for SUNK, including compute resources, networking, and storage. SUNK leverages Kubernetes to provide a scalable, flexible, and high-performance computing environment for running Slurm workloads.
When using SUNK, you can interact with Kubernetes using the kubectl command-line tool. This guide explains the key concepts, benefits, and best practices for using kubectl to interact with Kubernetes in the context of SUNK.
Key concepts
First, some terminology and concepts to understand the relationship between Kubernetes and Slurm in the context of SUNK:
Term | Description |
---|---|
Kubernetes cluster and Nodes | A Kubernetes cluster is a collection of Kubernetes Nodes, which are (in CKS) physical machines that run Kubernetes components and containerized applications. Kubernetes Nodes are capitalized as proper nouns. |
Kubernetes Pod | A Kubernetes Pod is the smallest deployable unit in Kubernetes, representing a single instance of a running process, such as a Slurm node. Multiple Pods can run on a single Kubernetes Node. |
Slurm cluster and Nodes | A Slurm cluster is a collection of Slurm nodes, where each node is a Kubernetes Pod running a slurmd container. Slurm nodes are lowercase. |
kubectl | kubectl is the command-line tool for interacting with Kubernetes clusters. It allows you to inspect cluster resources, create, delete, and update objects, and view logs and events. |
When you use a Slurm cluster deployed by SUNK, you're operating within a Kubernetes environment. This means the underlying infrastructure is managed by Kubernetes, and many aspects of SUNK's operation can be observed and controlled using kubectl, the command-line tool for Kubernetes.
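The relationship between Slurm nodes, Pods, and Kubernetes Nodes can be checked from the command line. The following is a minimal sketch rather than part of SUNK: the function name slurm_node_host is hypothetical, and it assumes your current kubectl context and namespace point at the Slurm cluster. Because a Slurm node shares its name with its Pod, the function asks Kubernetes which Node that Pod is scheduled on.

```shell
# Hypothetical helper (not provided by SUNK): print the Kubernetes Node
# that hosts the Pod backing a given Slurm node.
slurm_node_host() {
  if [ -z "$1" ]; then
    echo "usage: slurm_node_host <Slurm node name>" >&2
    return 1
  fi
  # .spec.nodeName is the Pod field that records the Node it is scheduled on.
  kubectl get pod "$1" -o jsonpath='{.spec.nodeName}'
}
```

Call it with the name of a Slurm node, for example one shown by sinfo. If the Pod lives in a different namespace, add the -n flag to the kubectl call.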
Benefits
Here are some reasons why you might want to interact with Kubernetes via kubectl:
- Visibility: You can use kubectl to see the status of the Kubernetes Pods where your Slurm jobs are running, providing insight into the underlying execution environment. Additionally, SUNK is deployed using Helm charts, which are managed by Kubernetes; you can use kubectl to inspect these deployments.
- Debugging: If you encounter issues, kubectl can help you inspect logs, events, and the state of the Pods, aiding in troubleshooting.
- Familiarity: If you already have experience with Kubernetes, using kubectl to observe SUNK offers a familiar way to interact with the system.
- Configuration: Many aspects of SUNK's configuration are managed as Kubernetes resources (such as ConfigMaps and Secrets), which you can interact with using kubectl.
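As a rough illustration of the visibility, debugging, and configuration points above, the sketch below groups three read-only kubectl commands into one function. The function name sunk_overview is hypothetical, and the commands assume that your current context and namespace are the ones where SUNK is deployed; nothing here is specific to SUNK itself.

```shell
# Hypothetical helper (not provided by SUNK): a read-only snapshot of the
# current namespace. Nothing runs until the function is called.
sunk_overview() {
  # Visibility: Pods, their status, and the Node each one runs on.
  kubectl get pods -o wide
  # Debugging: events in creation order, oldest first.
  kubectl get events --sort-by=.metadata.creationTimestamp
  # Configuration: ConfigMap names (contents are not printed).
  kubectl get configmaps
}
```

To look more closely at a single object, kubectl describe pod followed by the Pod name prints its state together with its recent events.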
Accessing Logs
Because Slurm daemons run within Pods, you can view their logs using the kubectl logs command, optionally with the -f flag to follow the logs in real time. To access the logs of the Slurm Controller, which manages job submissions and scheduling, use this command:
$ kubectl logs -f -l app.kubernetes.io/name=slurmctld -c slurmctld
To see what's happening on a specific Slurm compute node, view the slurmd logs for that Pod. The name of a Slurm node matches the name of its corresponding Kubernetes Pod, so you can get the logs by substituting the node's name for <Pod name> in the following:
$ kubectl logs -f -c slurmd <Pod name>
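When you only need recent output rather than a live stream, the same command can be wrapped in a small function. This is a sketch under assumptions: the function name slurmd_logs and the default of 100 lines are illustrative choices, not SUNK conventions. It uses the standard --tail flag of kubectl logs to limit how many lines are returned.

```shell
# Hypothetical helper (not provided by SUNK): show the last N lines of the
# slurmd container log for one Slurm node. N defaults to 100 (an arbitrary
# choice for this sketch).
slurmd_logs() {
  if [ -z "$1" ]; then
    echo "usage: slurmd_logs <Slurm node name> [lines]" >&2
    return 1
  fi
  # The Slurm node name doubles as the Pod name.
  kubectl logs --tail="${2:-100}" -c slurmd "$1"
}
```

If the slurmd container has restarted, adding the --previous flag to kubectl logs returns the output of the prior container instance instead.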
Restarting the Slurm Controller
If your jobs are stuck in a pending state, you may need to restart the Slurm Controller. To restart the Slurm Controller, first find the name of the Controller deployment:
$ kubectl get deployments -l app.kubernetes.io/component=controller
Now that you have the deployment name, use the following command to restart it:
$ kubectl rollout restart deployment <controller deployment name>
Replace <controller deployment name> with the name you found earlier.
You can confirm that the restart is happening by checking the status of the rollout:
$ kubectl rollout status deployment <controller deployment name>
Restarting the Controller won't cancel active jobs, but could fix problems involving jobs that are stuck pending. It's a safe operation to perform when troubleshooting.
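The lookup, restart, and status check can also be chained so that the deployment name does not have to be copied by hand. The following is one possible sketch, not an official SUNK procedure: the function name restart_slurm_controller and the 120s default timeout are assumptions, and the function deliberately stops if the label selector matches anything other than exactly one deployment.

```shell
# Hypothetical helper (not provided by SUNK): restart the Slurm Controller
# deployment and wait for the rollout. Optional argument: a timeout for the
# status check (defaults to 120s, an arbitrary choice for this sketch).
restart_slurm_controller() {
  timeout="${1:-120s}"
  # -o name prints one "<resource type>/<name>" entry per matching deployment.
  deploy="$(kubectl get deployments -l app.kubernetes.io/component=controller -o name)" || return 1
  # Unquoted on purpose: split the output into one word per match.
  set -- $deploy
  if [ "$#" -ne 1 ]; then
    echo "expected exactly one controller deployment, found $#" >&2
    return 1
  fi
  # Wait for the rollout only if the restart request itself succeeded.
  kubectl rollout restart "$1" && kubectl rollout status "$1" --timeout="$timeout"
}
```

Refusing to proceed on zero or multiple matches is a conservative design choice: it avoids restarting an unintended deployment if the label selector behaves differently in your cluster.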