
Manage Deployments with CI and GitOps

SUNK incorporates Continuous Integration (CI) and GitOps principles to manage the deployment of SUNK and Slurm clusters. This guide outlines the common strategies used for the SUNK project.

ArgoCD for Application Management

SUNK uses ArgoCD, a GitOps continuous delivery tool, to manage Kubernetes applications declaratively. This provides a streamlined deployment workflow that aligns with the GitOps model and automates the deployment process to maintain consistency across environments.

Two main applications are defined within ArgoCD:

  • SUNK Application: Manages the deployment of the SUNK cluster, including all necessary configurations and dependencies.
  • Slurm Application: Handles the Slurm cluster deployment, ensuring that compute resources are properly managed and scheduled.

Additional supporting applications are created for Slurm to expand its capabilities and manage specific requirements, such as:

  • Persistent Volume Claims (PVCs) for storage.
  • Prolog/Epilog scripts for preparing and cleaning up compute nodes before and after job execution.
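For example, a supporting application can render a PVC manifest alongside the Slurm chart. The following is only a minimal sketch; the file path, claim name, access mode, and size are illustrative and must be adapted to your storage environment:

slurm/templates/shared-data-pvc.yaml
# Hypothetical PVC managed by a supporting Slurm application (illustrative values)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-shared-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  # storageClassName: <your-storage-class>  # set to a storage class available in your cluster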

NodeSet Sync Customization

The ArgoCD configuration needs to be updated so that it can properly sync NodeSet definitions. The resource customization feature allows ArgoCD to sync the NodeSet spec in the same manner as a PodSpec. How you apply the configuration change depends on how ArgoCD is installed in the cluster.

Apply configuration changes with a ConfigMap patch

To apply configuration changes with a ConfigMap patch, edit the data section of the ConfigMap.

To open the argocd-cm ConfigMap for editing, use the following command:

kubectl edit configmap argocd-cm -n <namespace>
Note

The kubectl edit command opens the entire ConfigMap and sends the full modified YAML back to the API server, replacing the existing ConfigMap with your edited version.

In the data section of the ConfigMap, add the following key-value pairs:

resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet: |
  - field: spec.template.spec
    type: core/v1/PodSpec

Apply configuration changes with kubectl

To apply configuration changes with kubectl, use the following command:

kubectl patch configmap argocd-cm -n <namespace> --type=merge \
  -p '{"data":{"resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet": "- field: spec.template.spec\n  type: core/v1/PodSpec\n"}}'
Note

The kubectl patch command only updates the specific fields you provide and merges them with the existing resource.

The -p '...' flag specifies the patch content: a JSON object describing the fields to change.

The "data":{...} object identifies the section of the ConfigMap to be modified; kubectl patch only modifies that section.

The value "- field: spec.template.spec\n  type: core/v1/PodSpec\n" is the YAML snippet added under the resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet key.

The --type=merge flag specifies the patch type as a JSON Merge Patch, which operates on the following logic:

  • If a field exists in the patch, it replaces the existing field in the target object.
  • If a field exists in the patch with a null value, it deletes the field from the target object.
  • If a field does not exist in the patch, it remains unchanged in the target object.
  • If you provide a list, it replaces the entire existing list with the one provided in the patch.
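As an illustration of these rules, consider a hypothetical ConfigMap named example-cm whose data section contains keyA: "1" and keyB: "2". The following merge patch replaces keyA and deletes keyB; the ConfigMap name is made up for this example:

kubectl patch configmap example-cm -n <namespace> --type=merge \
  -p '{"data":{"keyA":"10","keyB":null}}'
# Resulting data section: keyA: "10" (keyB is removed)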

Configuration in git (GitOps)

This section shows an example of how you might keep a git repository synced to ArgoCD using Helm and the app of apps pattern.

Create a git repository with the following contents:

SUNK

SUNK Helm chart

sunk/Chart.yaml
apiVersion: v2
name: sunk-gitops
version: 0.1.0
dependencies:
  - name: sunk
    version: 5.x.x
    repository: http://helm.corp.ingress.ord1.coreweave.com/

SUNK Values file

Use SUNK Values Reference to further customize this file. An example sunk/values.yaml is shown below.

Important

There should be a top-level sunk key in this file.

sunk/values.yaml
sunk:
  operator:
    logLevel: debug
    resources:
      limits:
        cpu: 1
        memory: 200Mi
      requests:
        cpu: 1
        memory: 200Mi
    podMonitor:
      enabled: false
    replicas: 1
    leaderElection:
      enabled: true
  scheduler:
    podMonitor:
      enabled: false
  syncer:
    podMonitor:
      enabled: false

Slurm

Slurm Helm chart

slurm/Chart.yaml
apiVersion: v2
name: slurm-gitops
version: 0.1.0
dependencies:
  - name: slurm
    version: 5.x.x
    repository: http://helm.corp.ingress.ord1.coreweave.com/

Slurm Values file

Use Slurm Values Reference to further customize this file. An example slurm/values.yaml is shown below.

Important

There should be a top-level slurm key in this file.

slurm/values.yaml
slurm:
  accounting:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  controller:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  rest:
    enabled: false
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  login:
    priorityClassName: ""
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
  munge:
    priorityClassName: ""
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
  syncer:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
  scheduler:
    priorityClassName: ""
    enabled: true
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
  mysql:
    metrics:
      enabled: false
    primary:
      resources:
        requests:
          cpu: 100m
          memory: 500Mi
    initdbScriptsConfigMap: "{{ .Release.Name }}-mysql-initdb-scripts"
  compute:
    nodes:
      cpu-epyc:
        enabled: true
        replicas: 2 # Adjust to desired amount or scale manually after deploy
        definitions:
          - standard
        features:
          - test
        resources:
          requests:
            cpu: 500m
            memory: 400Mi
        priorityClassName: ""
        # This is a toleration for a test taint that can be applied to the desired
        # nodes to keep other workloads off
        tolerations:
          - key: sunk.coreweave.com/nodes
            operator: "Exists"
        # Affinity for a test label to filter nodes on; nodes need to be labeled
        # with this, or remove this block
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: sunk.coreweave.com/nodes
                      operator: Exists

Custom Configurations (Optional)

This section shows an example of how you might define custom Slurm deployment configurations and keep it synchronized with ArgoCD.

Slurm Controller config

You can use the following ConfigMap to customize the Slurm controller configurations.

Important

To use this ConfigMap, you need to add its name to the slurm.slurmConfig.slurmCtld.etcConfigMap key in the Slurm Values file.

slurm/templates/etc-slurmctld-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-etc-slurmctld
data:
{{ (.Files.Glob "scripts/etc-slurmctld/*").AsConfig | indent 2 }}
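Assuming the ConfigMap above, the Slurm values file might reference it as shown in this sketch. Whether the chart templates this value with the release name is an assumption here, mirroring the initdbScriptsConfigMap value earlier; if it does not, use the literal rendered ConfigMap name instead:

slurm/values.yaml (excerpt)
slurm:
  slurmConfig:
    slurmCtld:
      # Name of the ConfigMap rendered by etc-slurmctld-configmap.yaml
      etcConfigMap: "{{ .Release.Name }}-etc-slurmctld"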

Prolog and Epilog

Further configurations can be found in the Slurm Values Reference and Prolog and Epilog Scripts pages.

Important

To use these ConfigMaps, you need to add them to the respective slurm.slurmConfig.slurmd.prologConfigMap or slurm.slurmConfig.slurmd.epilogConfigMap keys in the Slurm Values file.

The following is an example of a Prolog ConfigMap, slurm/templates/prolog-configmap.yaml:

slurm/templates/prolog-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-prolog
data:
{{ (.Files.Glob "scripts/prolog.d/*.sh").AsConfig | indent 2 }}

The following is an example of an Epilog ConfigMap, slurm/templates/epilog-configmap.yaml:

slurm/templates/epilog-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-epilog
data:
{{ (.Files.Glob "scripts/epilog.d/*.sh").AsConfig | indent 2 }}
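Referencing both ConfigMaps from the values file might then look like the following sketch, under the same templating assumption as above:

slurm/values.yaml (excerpt)
slurm:
  slurmConfig:
    slurmd:
      # Names of the ConfigMaps rendered by the templates above
      prologConfigMap: "{{ .Release.Name }}-prolog"
      epilogConfigMap: "{{ .Release.Name }}-epilog"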

A simple example of an epilog script, slurm/scripts/epilog.d/test.sh, to be used with the above Epilog ConfigMap is shown below:

slurm/scripts/epilog.d/test.sh
#!/usr/bin/env bash
set -e
echo "Epilog test executed"

ArgoCD App of Apps

This section shows an example of how you might define multiple ArgoCD Application resources to manage SUNK and Slurm using app of apps pattern and GitOps principles.

SUNK App definition

The apps/sunk.yaml file describes where ArgoCD can find and synchronize the Helm manifests for SUNK. Replace the <<REPO_URL>> placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

apps/sunk.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk
spec:
  destination:
    namespace: sunk
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: sunk
    targetRevision: HEAD
    helm:
      valueFiles:
        - values.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Slurm App definition

The apps/slurm.yaml file describes where ArgoCD can find and synchronize the Helm manifests for Slurm. Replace the <<REPO_URL>> placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

Note

The spec.ignoreDifferences key contains recommended values to keep ArgoCD properly synchronized.

apps/slurm.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slurm
spec:
  destination:
    namespace: tenant-slurm
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: slurm
    targetRevision: HEAD
    helm:
      valueFiles:
        - values.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  ignoreDifferences:
    - group: sunk.coreweave.com
      kind: NodeSet
      namespace: tenant-slurm
      jqPathExpressions:
        - '.spec.template.spec.tolerations[] | select(.key == "node.coreweave.cloud/reservation-policy" or .key == "node.coreweave.cloud/reserved")'

App of Apps definition

The app-of-apps.yaml file describes where ArgoCD can find and synchronize the custom Helm charts defined for SUNK and Slurm in the sections above. Replace the <<REPO_URL>> placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk-app-of-apps
  namespace: argocd
spec:
  destination:
    namespace: 'argocd'
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: apps
    targetRevision: HEAD
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Applying to ArgoCD

After following the steps above, your GitOps repository should be structured as below:

.
├── app-of-apps.yaml
├── apps
│   ├── slurm.yaml
│   └── sunk.yaml
├── slurm
│   ├── Chart.yaml
│   ├── scripts
│   │   ├── epilog.d
│   │   │   └── test.sh
│   │   └── prolog.d
│   │       └── test.sh
│   ├── templates
│   │   ├── epilog-configmap.yaml
│   │   ├── etc-slurmctld-configmap.yaml
│   │   └── prolog-configmap.yaml
│   └── values.yaml
└── sunk
    ├── Chart.yaml
    └── values.yaml

To apply all of the resources defined above, run the following command:

# pwd: root dir of GitOps repo
kubectl apply -f app-of-apps.yaml

You should now be able to keep SUNK and Slurm synchronized with ArgoCD following GitOps principles.
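If you want to confirm that ArgoCD created the Application resources, you can list them directly; the argocd namespace matches the app-of-apps definition above:

kubectl get applications.argoproj.io -n argocd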

Additional Notes

ArgoCD Impact on Slurm Jobs

ArgoCD Syncs are job-safe: syncing in ArgoCD does not affect running jobs in the cluster. Compute nodes are updated using the RollingUpdate strategy, and the maximum percentage of nodes unavailable during an update can be configured with the compute.maxUnavailable key in the chart values. See Slurm Values Reference for further details.
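A minimal values sketch for this setting is shown below; the percentage is illustrative, and the exact key shape should be confirmed against the Slurm Values Reference:

slurm/values.yaml (excerpt)
slurm:
  compute:
    # Maximum percentage of compute nodes taken down at once during a rolling update
    maxUnavailable: "25%"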

Login Nodes Updates

The login nodes may contain user state that you do not want to delete during an update. For cases like this, we recommend setting login.updateStrategy to OnDelete. With this strategy, you must manually delete the existing pod before the updated login node is created, ensuring that user state is not deleted during an ArgoCD Sync. See Slurm Values Reference for further details.
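A minimal values sketch for this recommendation; the exact key shape should be confirmed against the Slurm Values Reference:

slurm/values.yaml (excerpt)
slurm:
  login:
    # Pods are only replaced when deleted manually, preserving user state between Syncs
    updateStrategy: OnDelete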

Secret Job Lifecycle

On each Sync, the Slurm chart schedules two Kubernetes Jobs to create the secrets required for the Slurm cluster to be operational. When installing or upgrading, any existing Jobs are replaced and new Job runs are initiated. If a Job succeeds, the Job object is deleted by an Argo hook and ArgoCD reports In Sync, indicating the Job is complete. If a Job fails, the Job object remains in ArgoCD as Failed until the issue with the Job run is resolved or the next Sync occurs, which then follows the same process.