
Manage Deployments with CI and GitOps

SUNK incorporates Continuous Integration (CI) and GitOps principles to manage the deployment of SUNK and Slurm clusters. This guide outlines the common strategies used for the SUNK project.

ArgoCD for Application Management

SUNK uses ArgoCD, a GitOps continuous delivery tool, to manage Kubernetes applications declaratively. This provides a streamlined deployment workflow that aligns with the GitOps model and automates the deployment process to maintain consistency across environments.

Two main applications are defined within ArgoCD:

  • SUNK Application: Manages the deployment of the SUNK cluster, including all necessary configurations and dependencies.
  • Slurm Application: Handles the Slurm cluster deployment, ensuring that compute resources are properly managed and scheduled.

Additional supporting applications are created for Slurm to expand its capabilities and manage specific requirements, such as:

  • Persistent Volume Claims (PVCs) for storage.
  • Prolog/Epilog scripts for preparing and cleaning up compute nodes before and after job execution.
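For example, a supporting application can render a PVC manifest alongside the Slurm chart. The following is only a minimal sketch; the file path, claim name, access mode, and size are illustrative and must be adapted to your storage environment:

slurm/templates/shared-data-pvc.yaml
# Hypothetical PVC managed by a supporting Slurm application (illustrative values)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-shared-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  # storageClassName: <your-storage-class>  # set to a storage class available in your cluster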

NodeSet Sync Customization

The ArgoCD configuration needs to be updated so that it can properly sync NodeSet definitions. The resource customization feature allows ArgoCD to sync the NodeSet spec in the same manner as a PodSpec. How you apply the configuration change depends on how ArgoCD is installed in the cluster.

Apply configuration changes with a ConfigMap patch

To apply configuration changes with a ConfigMap patch, edit the data section of the ConfigMap.

To open the argocd-cm ConfigMap for editing, use the following command:

kubectl edit configmap argocd-cm -n <namespace>
Note

The kubectl edit command opens the entire ConfigMap and sends the full modified YAML back to the API server, replacing the existing ConfigMap with your edited version.

In the data section of the ConfigMap, add the following key-value pairs:

resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet: |
  - field: spec.template.spec
    type: core/v1/PodSpec

Apply configuration changes with kubectl

To apply configuration changes with kubectl, use the following command:

kubectl patch configmap argocd-cm -n <namespace> --type=merge \
  -p '{"data":{"resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet": "- field: spec.template.spec\n  type: core/v1/PodSpec\n"}}'
Note

The kubectl patch command only updates the specific fields you provide and merges them with the existing resource.

The -p '...' flag specifies the patch content: a JSON object describing the fields to change.

The "data":{...} object identifies the section of the ConfigMap to be modified; kubectl patch only modifies that section.

The value "- field: spec.template.spec\n  type: core/v1/PodSpec\n" is the YAML snippet added under the resource.customizations.knownTypeFields.sunk.coreweave.com_NodeSet key.

The --type=merge flag specifies the patch type as a JSON Merge Patch, which operates on the following logic:

  • If a field exists in the patch, it replaces the existing field in the target object.
  • If a field exists in the patch with a null value, it deletes the field from the target object.
  • If a field does not exist in the patch, it remains unchanged in the target object.
  • If you provide a list, it replaces the entire existing list with the one provided in the patch.
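As an illustration of these rules, consider a hypothetical ConfigMap named example-cm whose data section contains keyA: "1" and keyB: "2". The following merge patch replaces keyA and deletes keyB; the ConfigMap name is made up for this example:

kubectl patch configmap example-cm -n <namespace> --type=merge \
  -p '{"data":{"keyA":"10","keyB":null}}'
# Resulting data section: keyA: "10" (keyB is removed)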

Configuration in git (GitOps)

This section shows an example of how you might keep a git repository synced to ArgoCD using Helm and the app of apps pattern.

Create a git repository with the following contents:

SUNK

SUNK Helm chart

sunk/Chart.yaml
apiVersion: v2
name: sunk-gitops
version: 0.1.0
dependencies:
  - name: sunk
    version: 5.x.x
    repository: http://helm.corp.ingress.ord1.coreweave.com/

SUNK Values file

Use SUNK Values Reference to further customize this file. An example sunk/values.yaml is shown below.

Important

There should be a top-level sunk key in this file.

sunk/values.yaml
sunk:
  operator:
    logLevel: debug
    resources:
      limits:
        cpu: 1
        memory: 200Mi
      requests:
        cpu: 1
        memory: 200Mi
    podMonitor:
      enabled: false
    replicas: 1
    leaderElection:
      enabled: true
  scheduler:
    podMonitor:
      enabled: false
  syncer:
    podMonitor:
      enabled: false

Slurm

Slurm Helm chart

slurm/Chart.yaml
apiVersion: v2
name: slurm-gitops
version: 0.1.0
dependencies:
  - name: slurm
    version: 5.x.x
    repository: http://helm.corp.ingress.ord1.coreweave.com/

Slurm Values file

Use Slurm Values Reference to further customize this file. An example slurm/values.yaml is shown below.

Important

There should be a top-level slurm key in this file.

slurm/values.yaml
slurm:
  accounting:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  controller:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  rest:
    enabled: false
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
  login:
    priorityClassName: ""
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
  munge:
    priorityClassName: ""
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
  syncer:
    priorityClassName: ""
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
  scheduler:
    priorityClassName: ""
    enabled: true
    resources:
      requests:
        cpu: 100m
        memory: 50Mi
  mysql:
    metrics:
      enabled: false
    primary:
      resources:
        requests:
          cpu: 100m
          memory: 500Mi
    initdbScriptsConfigMap: "{{ .Release.Name }}-mysql-initdb-scripts"
  compute:
    nodes:
      cpu-epyc:
        enabled: true
        replicas: 2 # Adjust to desired amount or scale manually after deploy
        definitions:
          - standard
        features:
          - test
        resources:
          requests:
            cpu: 500m
            memory: 400Mi
        priorityClassName: ""
        # This is a toleration for a test taint that can be applied to the desired
        # nodes to keep other workloads off
        tolerations:
          - key: sunk.coreweave.com/nodes
            operator: "Exists"
        # Affinity for a test label to filter nodes on; nodes need to be labeled
        # with this, or remove this block
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: sunk.coreweave.com/nodes
                      operator: Exists

Custom Configurations (Optional)

This section shows an example of how you might define custom Slurm deployment configurations and keep it synchronized with ArgoCD.

Slurm Controller config

You can use the following ConfigMap to customize the Slurm controller configurations.

Important

To use this ConfigMap, you need to add its name to the slurm.slurmConfig.slurmCtld.etcConfigMap key in the Slurm Values file.

slurm/templates/etc-slurmctld-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-etc-slurmctld
data:
{{ (.Files.Glob "scripts/etc-slurmctld/*").AsConfig | indent 2 }}
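Assuming the ConfigMap above, the Slurm values file might reference it as shown in this sketch. Whether the chart templates this value with the release name is an assumption here, mirroring the initdbScriptsConfigMap value earlier; if it does not, use the literal rendered ConfigMap name instead:

slurm/values.yaml (excerpt)
slurm:
  slurmConfig:
    slurmCtld:
      # Name of the ConfigMap rendered by etc-slurmctld-configmap.yaml
      etcConfigMap: "{{ .Release.Name }}-etc-slurmctld"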

Prolog and Epilog

Further configurations can be found in the Slurm Values Reference and Prolog and Epilog Scripts pages.

Important

To use these ConfigMaps, you need to add them to the respective slurm.slurmConfig.slurmd.prologConfigMap or slurm.slurmConfig.slurmd.epilogConfigMap keys in the Slurm Values file.

The following is an example of a Prolog ConfigMap, slurm/templates/prolog-configmap.yaml:

slurm/templates/prolog-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-prolog
data:
{{ (.Files.Glob "scripts/prolog.d/*.sh").AsConfig | indent 2 }}

The following is an example of an Epilog ConfigMap, slurm/templates/epilog-configmap.yaml:

slurm/templates/epilog-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-epilog
data:
{{ (.Files.Glob "scripts/epilog.d/*.sh").AsConfig | indent 2 }}
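Referencing both ConfigMaps from the values file might then look like the following sketch, under the same templating assumption as above:

slurm/values.yaml (excerpt)
slurm:
  slurmConfig:
    slurmd:
      # Names of the ConfigMaps rendered by the templates above
      prologConfigMap: "{{ .Release.Name }}-prolog"
      epilogConfigMap: "{{ .Release.Name }}-epilog"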

A simple example of an epilog script, slurm/scripts/epilog.d/test.sh, to be used with the above Epilog ConfigMap is shown below:

slurm/scripts/epilog.d/test.sh
#!/usr/bin/env bash
set -e
echo "Epilog test executed"

ArgoCD App of Apps

This section shows an example of how you might define multiple ArgoCD Application resources to manage SUNK and Slurm using app of apps pattern and GitOps principles.

SUNK App definition

The apps/sunk.yaml file describes where ArgoCD can find and synchronize the Helm manifests for SUNK. Replace the <<REPO_URL>> placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

apps/sunk.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk
spec:
  destination:
    namespace: sunk
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: sunk
    targetRevision: HEAD
    helm:
      valueFiles:
        - values.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Slurm App definition

The apps/slurm.yaml file describes where ArgoCD can find and synchronize the Helm manifests for Slurm. Replace the <<REPO_URL>> placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

Note

The spec.ignoreDifferences key contains recommended values to keep ArgoCD properly synchronized.

apps/slurm.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slurm
spec:
  destination:
    namespace: tenant-slurm
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: slurm
    targetRevision: HEAD
    helm:
      valueFiles:
        - values.yaml
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  ignoreDifferences:
    - group: sunk.coreweave.com
      kind: NodeSet
      namespace: tenant-slurm
      jqPathExpressions:
        - '.spec.template.spec.tolerations[] | select(.key == "node.coreweave.cloud/reservation-policy" or .key == "node.coreweave.cloud/reserved")'

App of Apps definition

The app-of-apps.yaml file describes where ArgoCD can find and synchronize the custom Helm charts defined for SUNK and Slurm in the sections above. Replace the <<REPO_URL>> placeholder with your GitOps repository URL. Follow the ArgoCD Specs for further customization.

app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sunk-app-of-apps
  namespace: argocd
spec:
  destination:
    namespace: 'argocd'
    server: https://kubernetes.default.svc
  source:
    repoURL: <<REPO_URL>>
    path: apps
    targetRevision: HEAD
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Applying to ArgoCD

After following the steps above, your GitOps repository should be structured as below:

.
├── app-of-apps.yaml
├── apps
│   ├── slurm.yaml
│   └── sunk.yaml
├── slurm
│   ├── Chart.yaml
│   ├── scripts
│   │   ├── epilog.d
│   │   │   └── test.sh
│   │   └── prolog.d
│   │       └── test.sh
│   ├── templates
│   │   ├── epilog-configmap.yaml
│   │   ├── etc-slurmctld-configmap.yaml
│   │   └── prolog-configmap.yaml
│   └── values.yaml
└── sunk
    ├── Chart.yaml
    └── values.yaml

To apply all of the resources defined above, run the following command:

# pwd: root dir of GitOps repo
kubectl apply -f app-of-apps.yaml

You should now be able to keep SUNK and Slurm synchronized with ArgoCD following GitOps principles.
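If you want to confirm that ArgoCD created the Application resources, you can list them directly; the argocd namespace matches the app-of-apps definition above:

kubectl get applications.argoproj.io -n argocd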

Additional Notes

ArgoCD Impact on Slurm Jobs

ArgoCD Syncs are job-safe: syncing in ArgoCD does not affect running jobs in the cluster. Compute nodes are updated using the RollingUpdate strategy, and the maximum percentage of nodes unavailable during an update can be configured with the compute.maxUnavailable key in the chart values. See Slurm Values Reference for further details.
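A minimal values sketch for this setting is shown below; the percentage is illustrative, and the exact key shape should be confirmed against the Slurm Values Reference:

slurm/values.yaml (excerpt)
slurm:
  compute:
    # Maximum percentage of compute nodes taken down at once during a rolling update
    maxUnavailable: "25%"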

Login Nodes Updates

The login nodes may contain user state that you do not want to delete during an update. For cases like this, we recommend setting login.updateStrategy to OnDelete. With this strategy, you must manually delete the existing pod before the updated login node is created, ensuring that user state is not deleted during an ArgoCD Sync. See Slurm Values Reference for further details.
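A minimal values sketch for this recommendation; the exact key shape should be confirmed against the Slurm Values Reference:

slurm/values.yaml (excerpt)
slurm:
  login:
    # Pods are only replaced when deleted manually, preserving user state between Syncs
    updateStrategy: OnDelete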

Secret Job Lifecycle

On each Sync, the Slurm chart schedules two Kubernetes Jobs to create the secrets required for the Slurm cluster to be operational. When installing or upgrading, any existing Jobs are replaced and new Job runs are initiated. If a Job succeeds, the Job object is deleted by an Argo hook and ArgoCD reports In Sync, indicating the Job is complete. If a Job fails, the Job object remains in ArgoCD as Failed until the issue with the Job run is resolved or the next Sync occurs, which then follows the same process.