SUNK incorporates Continuous Integration (CI) and GitOps principles to manage the deployment of SUNK and Slurm clusters. This guide outlines the common strategies used for the SUNK project.
ArgoCD for Application Management
SUNK uses ArgoCD, a GitOps continuous delivery tool, to manage Kubernetes applications declaratively. With ArgoCD, SUNK follows a streamlined deployment workflow that aligns with the GitOps model and automates deployment to maintain consistency across environments. Two main applications are defined within ArgoCD:
- SUNK Application: Manages the deployment of the SUNK cluster, including all necessary configurations and dependencies.
- Slurm Application: Handles the Slurm cluster deployment, ensuring that compute resources are properly managed and scheduled.
- Persistent Volume Claims (PVCs) for storage.
- Prolog/Epilog scripts for preparing and cleaning up compute nodes before and after job execution.
NodeSet Sync Customization
The ArgoCD configuration must be updated so that ArgoCD can properly sync NodeSet definitions. The resource customization feature allows ArgoCD to sync the NodeSet spec in the same manner as a PodSpec. The method of applying configuration changes varies depending on the cluster's ArgoCD installation.
Apply configuration changes with a ConfigMap patch
To apply configuration changes with a ConfigMap patch, edit the data section of the ConfigMap.
To open the argocd-cm ConfigMap for editing, use the following command:
The kubectl edit command opens the entire ConfigMap and sends the entire modified YAML back to the API server, replacing the existing ConfigMap with your modified version.
In the data section of the ConfigMap, add the following key-value pairs:
Apply configuration changes with kubectl
To apply configuration changes with kubectl, use the following command:
The kubectl patch command updates only the fields you specify and merges them with the existing resource.
The -p '...' flag specifies the patch content. In this example, we use the data, - field, and type parameters.
The "data":{...} parameter specifies the section of the ConfigMap to be modified. The kubectl patch command will only modify the specified section.
"- field: spec.template.spec\n" and "type: core/v1/PodSpec\n" are the specific key-value pairs to be added to the ConfigMap.
The --type=merge flag specifies the patch type as a JSON Merge Patch, which operates on the following logic:
- If a field exists in the patch, it replaces the existing field in the target object.
- If a field exists in the patch with a null value, it deletes the field from the target object.
- If a field does not exist in the patch, it remains unchanged in the target object.
- If you provide a list, it replaces the entire existing list with the one provided in the patch.
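The merge rules above can be sketched in a few lines of Python. This is an illustrative model of JSON Merge Patch semantics, not the code kubectl or the API server runs. The ConfigMap key assumes that NodeSet belongs to the sunk.coreweave.com API group and that ArgoCD runs in the argocd namespace. Both are assumptions to adjust for your cluster, and the existing ConfigMap content is hypothetical.

```python
import json
import shlex

# Assumption: NodeSet is served under the sunk.coreweave.com API group.
KEY = "resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet"
VALUE = "- field: spec.template.spec\n  type: core/v1/PodSpec\n"


def merge_patch(target, patch):
    """Return a copy of target with a JSON Merge Patch applied."""
    if not isinstance(patch, dict):
        # Scalars and lists replace the target value wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # a null value deletes the field
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical existing ConfigMap, for illustration only.
existing = {
    "metadata": {
        "name": "argocd-cm",
        "finalizers": ["example.com/a", "example.com/b"],
    },
    "data": {"url": "https://argocd.example.com", "stale.key": "old"},
}

# The patch described in this section: add one key under data.
doc_patch = {"data": {KEY: VALUE}}
merged = merge_patch(existing, doc_patch)

# A second, hypothetical patch showing null deletion and list replacement.
demo_patch = {
    "data": {"stale.key": None},
    "metadata": {"finalizers": ["example.com/c"]},
}
cleaned = merge_patch(merged, demo_patch)

# json.dumps escapes the newlines in VALUE so the payload stays on one line.
# Assumption: ArgoCD is installed in the "argocd" namespace.
command = [
    "kubectl", "patch", "configmap", "argocd-cm",
    "--namespace", "argocd",
    "--type=merge",
    "-p", json.dumps(doc_patch),
]
print(shlex.join(command))
```

Fields that are absent from the patch, such as url, are left untouched in the merged result, while the finalizers list is replaced as a whole rather than appended to.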
Configuration in git (GitOps)
This section shows an example of how you might keep a Git repository synced to ArgoCD using Helm and the app of apps pattern. Create a Git repository with the following contents:
SUNK
SUNK Helm chart
sunk/Chart.yaml
SUNK Values file
Use SUNK Values Reference to further customize this file. An example sunk/values.yaml is shown below.
sunk/values.yaml
Slurm
Slurm Helm chart
slurm/Chart.yaml
Slurm Values file
Use Slurm Values Reference to further customize this file. An example slurm/values.yaml is shown below.
slurm/values.yaml
Custom Configurations (Optional)
This section shows an example of how you might define custom Slurm deployment configurations and keep them synchronized with ArgoCD.
Slurm Controller config
You can use the following ConfigMap to customize the Slurm controller configuration.
slurm/templates/etc-slurmctld-configmap.yaml
Prolog and Epilog
Further configurations can be found in the Slurm Values Reference and Prolog and Epilog Scripts pages. The following is an example of a Prolog ConfigMap, slurm/templates/prolog-configmap.yaml:
slurm/templates/prolog-configmap.yaml
The following is an example of an Epilog ConfigMap, slurm/templates/epilog-configmap.yaml:
slurm/templates/epilog-configmap.yaml
An example script, slurm/scripts/epilog.d/test.sh, to be used with the above Epilog ConfigMap is shown below:
slurm/scripts/epilog.d/test.sh
ArgoCD App of Apps
This section shows an example of how you might define multiple ArgoCD Application resources to manage SUNK and Slurm using the app of apps pattern and GitOps principles.
SUNK App definition
The apps/sunk.yaml file describes where ArgoCD can find and synchronize the Helm manifests for SUNK. Replace the [REPO-URL] placeholder with your GitOps repository URL.
Follow the ArgoCD Specs for further customization.
apps/sunk.yaml
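As a rough illustration of the fields an ArgoCD Application usually carries, the sketch below builds one as a Python dictionary and prints it as JSON. It is not the SUNK-recommended manifest. The argocd namespace, the default project, the destination namespace sunk, and the HEAD revision are all assumptions, and build_application is a hypothetical helper.

```python
import json


def build_application(name, repo_url, chart_path, dest_namespace):
    """Build a minimal ArgoCD Application for a Helm chart stored in Git."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: ArgoCD itself runs in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",  # assumption: the default ArgoCD project
            "source": {
                "repoURL": repo_url,
                "path": chart_path,        # directory containing Chart.yaml
                "targetRevision": "HEAD",  # assumption: track the default branch
                "helm": {"valueFiles": ["values.yaml"]},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",  # in-cluster API
                "namespace": dest_namespace,
            },
        },
    }


# [REPO-URL] is the same placeholder used in this guide; the destination
# namespace "sunk" is hypothetical.
sunk_app = build_application("sunk", "[REPO-URL]", "sunk", "sunk")
print(json.dumps(sunk_app, indent=2))
```

The path field points at the sunk directory of the repository, so ArgoCD renders sunk/Chart.yaml with sunk/values.yaml. The same helper could describe the Slurm application by passing the slurm directory instead.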
Slurm App definition
The apps/slurm.yaml file describes where ArgoCD can find and synchronize the Helm manifests for Slurm. Replace the [REPO-URL] placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.
The spec.ignoreDifferences key contains recommended values to keep ArgoCD properly synchronized.
apps/slurm.yaml
App of Apps definition
The app-of-apps.yaml file describes where ArgoCD can find and synchronize the custom Helm charts defined for SUNK and Slurm in the sections above. Replace the [REPO-URL] placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.
app-of-apps.yaml
Applying to ArgoCD
After following the steps above, your GitOps repository should be structured as below:
Additional Notes
ArgoCD Impact on Slurm Jobs
ArgoCD Syncs are job-safe: syncing in ArgoCD does not affect running jobs in the cluster. Compute nodes are updated using the RollingUpdate strategy, and the maximum percentage of nodes unavailable during an update can be configured with compute.maxUnavailable in the chart values. See Slurm Values Reference for further details.
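To get a feel for what a percentage limit means in practice, the sketch below converts a percentage into a node count and groups nodes into update waves. It is a simplified model and not SUNK's implementation. It assumes the percentage is rounded up and that at least one node may be updated at a time. The actual rounding used for compute.maxUnavailable is not stated here, and real rollouts also wait for running jobs. Both function names and the node names are hypothetical.

```python
def max_unavailable_nodes(total_nodes, percent):
    """Convert a percentage into a node count (assumption: round up)."""
    if total_nodes <= 0:
        return 0
    # Integer ceiling of total_nodes * percent / 100, with a floor of one
    # node so that an update can always make progress (assumption).
    return max(1, (total_nodes * percent + 99) // 100)


def update_waves(nodes, percent):
    """Split nodes into consecutive waves no larger than the limit."""
    size = max_unavailable_nodes(len(nodes), percent)
    if size == 0:
        return []
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


# Hypothetical node names, for illustration only.
nodes = [f"node-{i}" for i in range(10)]
waves = update_waves(nodes, 25)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Under these assumptions, a lower percentage keeps more compute capacity available during a Sync at the cost of a longer rollout.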
Login Nodes Updates
Login nodes may contain user state that you do not want to delete during an update. For such cases, we recommend setting login.updateStrategy to OnDelete. With this strategy, you must manually delete the existing pod before the updated login node is created, which ensures that user state is not deleted during a Sync in ArgoCD. See Slurm Values Reference for further details.
Secret Job Lifecycle
On each Sync, the Slurm chart schedules two Kubernetes Jobs to create the secrets that the Slurm cluster requires to be operational. When installing or upgrading, any existing Jobs are replaced and new Job runs are initiated. If a Job succeeds, the Job object is deleted by an Argo hook and ArgoCD reports In Sync, indicating the Job is complete. If a Job fails, the Job object remains in Argo as Failed until the issue with the Job run is resolved or the next Sync occurs, which then follows this same process.