Manage Deployments with CI and GitOps
SUNK incorporates Continuous Integration (CI) and GitOps principles to manage the deployment of SUNK and Slurm clusters. This guide outlines the common strategies used for the SUNK project.
ArgoCD for Application Management
SUNK uses ArgoCD, a GitOps continuous delivery tool, to manage Kubernetes applications declaratively. ArgoCD provides a streamlined deployment workflow that aligns with the GitOps model, automating deployments to maintain consistency across environments.
Two main applications are defined within ArgoCD:
- SUNK Application: Manages the deployment of the SUNK cluster, including all necessary configurations and dependencies.
- Slurm Application: Handles the Slurm cluster deployment, ensuring that compute resources are properly managed and scheduled.
Additional supporting applications are created for Slurm to expand its capabilities and manage specific requirements, such as:
- Persistent Volume Claims (PVCs) for storage (a sketch follows this list).
- Prolog/Epilog scripts for preparing and cleaning up compute nodes before and after job execution.
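For instance, a supporting storage application might manage a PVC like the following minimal sketch. The claim name, namespace, storage class, and size are illustrative placeholders, not values prescribed by SUNK:

```yaml
# Hypothetical PVC managed by a supporting ArgoCD application.
# Name, namespace, storageClassName, and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-shared-data
  namespace: tenant-slurm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-storage
  resources:
    requests:
      storage: 1Ti
```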
NodeSet Sync Customization
ArgoCD's configuration must be updated to allow it to properly sync `NodeSet` definitions. The resource customization feature allows ArgoCD to sync the `NodeSet` spec in the same manner as a `PodSpec`. The method of applying configuration changes varies depending on the cluster's ArgoCD installation.
Apply configuration changes with a ConfigMap patch
To apply configuration changes with a ConfigMap patch, edit the `data` section of the ConfigMap. To open the `argocd-cm` ConfigMap for editing, use the following command:
```bash
$ kubectl edit configmap argocd-cm -n <namespace>
```
The `kubectl edit` command opens the entire ConfigMap and sends the entire modified YAML back to the API server, replacing the existing ConfigMap with your newly modified version.
In the `data` section of the ConfigMap, add the following key and value:
```yaml
resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet: |-
  - field: spec.template.spec
    type: core/v1/PodSpec
```
Apply configuration changes with kubectl
To apply configuration changes with `kubectl`, use the following command:
```bash
$ kubectl patch configmap argocd-cm -n <namespace> --type=merge \
  -p '{"data":{"resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet": "- field: spec.template.spec\n  type: core/v1/PodSpec\n"}}'
```
The `kubectl patch` command only updates the specific fields you have modified, and merges them with the existing resource.

The `-p '...'` flag specifies the patch content. In this example, it carries the `data` section containing the `- field` and `type` entries.

The `"data":{...}` parameter specifies the section of the ConfigMap to be modified; the `kubectl patch` command only modifies the specified section. `- field: spec.template.spec\n` and `  type: core/v1/PodSpec\n` together form the value of the key added to the ConfigMap.
The `--type=merge` flag specifies the patch type as a JSON Merge Patch, which operates on the following logic (a short demonstration follows the list):

- If a field exists in the patch, it replaces the existing field in the target object.
- If a field exists in the patch with a `null` value, it deletes the field from the target object.
- If a field does not exist in the patch, it remains unchanged in the target object.
- If you provide a list, it replaces the entire existing list with the one provided in the patch.
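To see these rules in action, the following hypothetical commands exercise a scratch ConfigMap; the name `merge-demo` and the `default` namespace are placeholders:

```bash
# Create a scratch ConfigMap with two keys (placeholder name and namespace).
kubectl create configmap merge-demo -n default \
  --from-literal=keep=original --from-literal=drop=original

# JSON Merge Patch: "keep" is replaced, "drop" is deleted via null,
# and any keys not mentioned in the patch are left unchanged.
kubectl patch configmap merge-demo -n default --type=merge \
  -p '{"data":{"keep":"replaced","drop":null}}'

# Verify: data now contains only keep=replaced.
kubectl get configmap merge-demo -n default -o yaml
```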
Configuration in git (GitOps)
This section shows an example of how you might keep a git repository synced to ArgoCD using Helm and the app-of-apps pattern.
Create a git repository with the following contents:
SUNK
SUNK Helm chart
An example `sunk/Chart.yaml` is shown below:

```yaml
apiVersion: v2
name: sunk-gitops
version: 0.1.0
dependencies:
  - name: sunk
    version: 5.x.x
    repository: http://helm.corp.ingress.ord1.coreweave.com/
```
SUNK Values file
Use the SUNK Values Reference to further customize this file. An example `sunk/values.yaml` is shown below. There should be a top-level `sunk` key in this file.
```yaml
sunk:
  operator:
    logLevel: debug
    resources:
      limits:
        cpu: 1
        memory: 200Mi
      requests:
        cpu: 1
        memory: 200Mi
    podMonitor:
      enabled: false
    replicas: 1
    leaderElection:
      enabled: true
  scheduler:
    podMonitor:
      enabled: false
  syncer:
    podMonitor:
      enabled: false
```
Slurm
Slurm Helm chart
An example `slurm/Chart.yaml` is shown below:

```yaml
apiVersion: v2
name: slurm-gitops
version: 0.1.0
dependencies:
  - name: slurm
    version: 5.x.x
    repository: http://helm.corp.ingress.ord1.coreweave.com/
```
Slurm Values file
Use the Slurm Values Reference to further customize this file. An example `slurm/values.yaml` is shown below. There should be a top-level `slurm` key in this file.
```yaml
slurm:
  accounting:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  controller:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  rest:
    enabled: false
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  login:
    priorityClassName: ""
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
  munge:
    priorityClassName: ""
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
  syncer:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
  scheduler:
    priorityClassName: ""
    enabled: true
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
  mysql:
    metrics:
      enabled: false
    primary:
      resources:
        requests:
          cpu: 100m
          memory: 500Mi
    initdbScriptsConfigMap: "{{ .Release.Name }}-mysql-initdb-scripts"
  compute:
    nodes:
      cpu-epyc:
        enabled: true
        replicas: 2 # Adjust to desired amount or scale manually after deploy
        definitions:
          - standard
        features:
          - test
        resources:
          requests:
            cpu: 500m
            memory: 400Mi
        priorityClassName: ""
        # This is a toleration for a test taint that can be applied to the
        # desired nodes to keep other workloads off
        tolerations:
          - key: sunk.coreweave.com/nodes
            operator: "Exists"
        # Affinity for a test label to filter nodes on; nodes need to be
        # labeled with this, or remove it
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: sunk.coreweave.com/nodes
                      operator: Exists
```
Custom Configurations (Optional)
This section shows an example of how you might define custom Slurm deployment configurations and keep them synchronized with ArgoCD.
Slurm Controller config
You can use the following ConfigMap to customize the Slurm controller configuration. To use this ConfigMap, add its name to the `slurm.slurmConfig.slurmCtld.etcConfigMap` key in the Slurm Values file. The following is an example of `slurm/templates/etc-slurmctld-configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-etc-slurmctld
data:
{{ (.Files.Glob "scripts/etc-slurmctld/*").AsConfig | indent 2 }}
```
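For reference, a hypothetical excerpt of `slurm/values.yaml` wiring this ConfigMap in might look like the following. Because Helm does not template `values.yaml`, use the literal name your release renders; here we assume a release named `slurm`:

```yaml
slurm:
  slurmConfig:
    slurmCtld:
      # Rendered name of the ConfigMap above, assuming a release named "slurm".
      etcConfigMap: "slurm-etc-slurmctld"
```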
Prolog and Epilog
Further configuration options can be found on the Slurm Values Reference and Prolog and Epilog Scripts pages. To use these ConfigMaps, add their names to the respective `slurm.slurmConfig.slurmd.prologConfigMap` or `slurm.slurmConfig.slurmd.epilogConfigMap` keys in the Slurm Values file.
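A hypothetical `slurm/values.yaml` excerpt referencing the ConfigMaps defined below might look like this, again assuming a release named `slurm`:

```yaml
slurm:
  slurmConfig:
    slurmd:
      # Rendered names of the Prolog and Epilog ConfigMaps defined below,
      # assuming a release named "slurm".
      prologConfigMap: "slurm-prolog"
      epilogConfigMap: "slurm-epilog"
```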
The following is an example of a Prolog ConfigMap, `slurm/templates/prolog-configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-prolog
data:
{{ (.Files.Glob "scripts/prolog.d/*.sh").AsConfig | indent 2 }}
```
The following is an example of an Epilog ConfigMap, `slurm/templates/epilog-configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-epilog
data:
{{ (.Files.Glob "scripts/epilog.d/*.sh").AsConfig | indent 2 }}
```
A simple example of an epilog script, `slurm/scripts/epilog.d/test.sh`, to be used with the above Epilog ConfigMap is shown below:
```bash
#!/usr/bin/env bash
set -e
echo "Epilog test executed"
```
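The repository layout shown later also includes a matching prolog script, `slurm/scripts/prolog.d/test.sh`, picked up by the Prolog ConfigMap above. A minimal sketch (the echoed message is illustrative):

```bash
#!/usr/bin/env bash
set -e
echo "Prolog test executed"
```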
ArgoCD App of Apps
This section shows an example of how you might define multiple ArgoCD `Application` resources to manage SUNK and Slurm using the app-of-apps pattern and GitOps principles.
SUNK App definition
The `apps/sunk.yaml` file describes where ArgoCD can find and synchronize the Helm manifests for SUNK. Replace the `<<REPO_URL>>` placeholder with your GitOps repository URL.
Follow the ArgoCD Specs for further customization.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk
spec:
  destination:
    namespace: sunk
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: sunk
    targetRevision: HEAD
    helm:
      valueFiles:
        - values.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
Slurm App definition
The `apps/slurm.yaml` file describes where ArgoCD can find and synchronize the Helm manifests for Slurm. Replace the `<<REPO_URL>>` placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

The `spec.ignoreDifferences` key contains recommended values to keep ArgoCD properly synchronized.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slurm
spec:
  destination:
    namespace: tenant-slurm
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: slurm
    targetRevision: HEAD
    helm:
      valueFiles:
        - values.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  ignoreDifferences:
    - group: sunk.coreweave.com
      kind: NodeSet
      namespace: tenant-slurm
      jqPathExpressions:
        - '.spec.template.spec.tolerations[] | select(.key == "node.coreweave.cloud/reservation-policy" or .key == "node.coreweave.cloud/reserved")'
```
App of Apps definition
The `app-of-apps.yaml` file describes where ArgoCD can find and synchronize the custom Helm charts defined for SUNK and Slurm in the sections above. Replace the `<<REPO_URL>>` placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk-app-of-apps
  namespace: argocd
spec:
  destination:
    namespace: 'argocd'
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: apps
    targetRevision: HEAD
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
Applying to ArgoCD
After following the steps above, your GitOps repository should be structured as shown below:
```
.
├── app-of-apps.yaml
├── apps
│   ├── slurm.yaml
│   └── sunk.yaml
├── slurm
│   ├── Chart.yaml
│   ├── scripts
│   │   ├── epilog.d
│   │   │   └── test.sh
│   │   └── prolog.d
│   │       └── test.sh
│   ├── templates
│   │   ├── epilog-configmap.yaml
│   │   ├── etc-slurmctld-configmap.yaml
│   │   └── prolog-configmap.yaml
│   └── values.yaml
└── sunk
    ├── Chart.yaml
    └── values.yaml
```
To apply all of the resources defined above, run the following command:
```bash
# pwd: root dir of GitOps repo
$ kubectl apply -f app-of-apps.yaml
```
You should now be able to keep SUNK and Slurm synchronized with ArgoCD following GitOps principles.
Additional Notes
ArgoCD Impact on Slurm Jobs
ArgoCD Syncs are job-safe; that is, syncing in Argo does not affect running jobs in the cluster. Compute nodes are updated using the `RollingUpdate` strategy, and the maximum percentage of nodes unavailable during an update can be configured with `compute.maxUnavailable` in the Chart Values. See the Slurm Values Reference for further details.
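As a sketch, a `slurm/values.yaml` excerpt tuning this setting might look like the following; the 25% figure is an illustrative choice, and the exact key location should be confirmed against the Slurm Values Reference:

```yaml
slurm:
  compute:
    # Illustrative value: at most a quarter of compute nodes are
    # rolled at once during a Sync.
    maxUnavailable: 25%
```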
Login Node Updates
The login nodes may contain user state that you may not want to delete during an update. We recommend setting the value of `login.updateStrategy` to `OnDelete` for cases like this. This strategy requires you to manually delete the existing pod before the updated login node is created, ensuring the user state is not deleted during a Sync in ArgoCD. See the Slurm Values Reference for further details.
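A hypothetical `slurm/values.yaml` excerpt applying this recommendation (confirm the exact key shape against the Slurm Values Reference):

```yaml
slurm:
  login:
    # Pods are only replaced after you delete them manually, preserving
    # user state on the login node across ArgoCD Syncs.
    updateStrategy: OnDelete
```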
Secret Job Lifecycle
On each Sync, the Slurm Chart schedules two Kubernetes Jobs to create the secrets required for the Slurm cluster to be operational. When installing or upgrading, any existing jobs are replaced and new job runs are initiated. If a job succeeds, the Job object is deleted by an Argo hook and ArgoCD reports `In Sync`, indicating the job is complete. If a job fails, the Job object remains in Argo as `Failed` until the issue with the job run is resolved or the next Sync occurs, which then follows this same process.
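If a secret job does fail, standard `kubectl` inspection applies. A minimal sketch, where the namespace and `<job-name>` are placeholders:

```bash
# List Jobs in the Slurm namespace to find the failed run.
kubectl get jobs -n tenant-slurm

# Inspect events and logs for the failing run; <job-name> is a placeholder.
kubectl describe job <job-name> -n tenant-slurm
kubectl logs job/<job-name> -n tenant-slurm
```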