Workflows on CoreWeave run on Argo Workflows, a great tool for orchestrating parallel execution of GPU and CPU jobs. It manages retries and parallelism for you, and allows you to submit workflows via the CLI, the REST API, and the Kubernetes API.
After installing kubectl and adding your CoreWeave Cloud access credentials, the following steps will deploy the Argo Server in your namespace.
Apply the resource argo-install.yaml found in this repository; this will install the Argo Workflows server into your namespace:
```
$ kubectl apply -f argo-install.yaml
serviceaccount/argo unchanged
serviceaccount/argo-server unchanged
role.rbac.authorization.k8s.io/argo-role unchanged
role.rbac.authorization.k8s.io/argo-server-role unchanged
rolebinding.rbac.authorization.k8s.io/argo-binding unchanged
rolebinding.rbac.authorization.k8s.io/argo-server-binding unchanged
configmap/workflow-controller-configmap unchanged
service/argo-server unchanged
deployment.apps/argo-server unchanged
deployment.apps/workflow-controller unchanged
```
Install the Argo CLI from the Argo releases page.
Submit an example workflow, gpu-say-workflow.yaml, found in this repository. The workflow takes a JSON array and, for each entry, spins up one Pod with one GPU allocated, all in parallel. Each Pod prints its nvidia-smi output along with the parameter entry assigned to it.
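For orientation, here is a minimal sketch of what such a fan-out workflow can look like. It is not the manifest from the repository; the template names, container image, and command are illustrative assumptions, so consult gpu-say-workflow.yaml itself for the real definition:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gpu-say
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: messages
        value: '["Argo", "Is", "Awesome"]'
  templates:
    - name: main
      steps:
        # withParam expands the JSON array into one parallel step per entry
        - - name: echo
            template: gpu-echo
            arguments:
              parameters:
                - name: message
                  value: "{{item}}"
            withParam: "{{workflow.parameters.messages}}"
    - name: gpu-echo
      inputs:
        parameters:
          - name: message
      container:
        # illustrative image; the repository manifest may pin a different one
        image: nvidia/cuda:10.2-runtime-ubuntu18.04
        command: ["sh", "-c"]
        args: ["nvidia-smi; echo Input was: {{inputs.parameters.message}}"]
        resources:
          limits:
            # one GPU per Pod, matching the behavior described above
            nvidia.com/gpu: 1
```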
```
$ argo submit --watch gpu-say-workflow.yaml -p messages='["Argo", "Is", "Awesome"]'
Name:                gpu-sayzfwxc
Namespace:           tenant-test
ServiceAccount:      default
Status:              Running
Created:             Mon Feb 10 19:31:17 -0500 (15 seconds ago)
Started:             Mon Feb 10 19:31:17 -0500 (15 seconds ago)
Duration:            15 seconds
Parameters:
  messages:          ["Argo", "Is", "Awesome"]

STEP                                  PODNAME                  DURATION  MESSAGE
 ● gpu-sayzfwxc (main)
 └-·-✔ echo(0:Argo)(0) (gpu-echo)     gpu-sayzfwxc-391007373   10s
   ├-● echo(1:Is)(0) (gpu-echo)       gpu-sayzfwxc-3501791705  15s
   └-✔ echo(2:Awesome)(0) (gpu-echo)  gpu-sayzfwxc-3986963301  12s
```
Get the log output from all parallel containers:
```
$ argo logs -w gpu-sayrbr6z
echo(0:Argo)(0): Tue Feb 11 00:25:30 2020
echo(0:Argo)(0): +-----------------------------------------------------------------------------+
echo(0:Argo)(0): | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
echo(0:Argo)(0): |-------------------------------+----------------------+----------------------+
echo(0:Argo)(0): | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
echo(0:Argo)(0): | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
echo(0:Argo)(0): |===============================+======================+======================|
echo(0:Argo)(0): |   0  NVIDIA Graphics...  Off  | 00000000:08:00.0 Off |                  N/A |
echo(0:Argo)(0): | 28%   51C    P5    16W / 180W |     18MiB /  8119MiB |      0%      Default |
echo(0:Argo)(0): +-------------------------------+----------------------+----------------------+
echo(0:Argo)(0):
echo(0:Argo)(0): +-----------------------------------------------------------------------------+
echo(0:Argo)(0): | Processes:                                                       GPU Memory |
echo(0:Argo)(0): |  GPU       PID   Type   Process name                             Usage      |
echo(0:Argo)(0): |=============================================================================|
echo(0:Argo)(0): +-----------------------------------------------------------------------------+
echo(0:Argo)(0): Input was: Argo
echo(1:Is)(0): Tue Feb 11 00:25:30 2020
echo(1:Is)(0): +-----------------------------------------------------------------------------+
echo(1:Is)(0): | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
echo(1:Is)(0): |-------------------------------+----------------------+----------------------+
...
```
Port-forward the Argo UI:

```
kubectl port-forward svc/argo-server 2746:2746
```
Open and explore the Argo UI at http://localhost:2746
We recommend the following retry strategy on your workflows / steps:
```yaml
retryStrategy:
  limit: 4
  retryPolicy: Always
  backoff:
    duration: "15s"
    factor: 2
```
We also recommend setting an activeDeadlineSeconds on each step, but not on the entire workflow. This allows a step to retry, but prevents it from taking an unreasonably long time to finish.
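Put together, the two recommendations might look like this on a single step template. This is a sketch only: the template name, container image, and the 300-second deadline are placeholder assumptions, not values from this repository:

```yaml
templates:
  - name: gpu-echo
    # deadline set on the step template, not on the workflow spec,
    # so each retry attempt gets its own time budget
    activeDeadlineSeconds: 300
    retryStrategy:
      limit: 4
      retryPolicy: Always
      backoff:
        duration: "15s"
        factor: 2
    container:
      image: nvidia/cuda:10.2-runtime-ubuntu18.04
      command: ["nvidia-smi"]
```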