
Introduction

Workflows on CoreWeave run on Argo Workflows, which is a great tool to orchestrate parallel execution of GPU and CPU jobs. It manages retries and parallelism for you, and allows you to submit workflows via CLI, Rest API and the Kubernetes API.
Argo Web UI

Getting Started

  • A new window will open to CoreWeave Apps with the list of available applications. Find and select the argo-workflows application.
  • In the upper right corner, select the latest version of the Helm chart and click DEPLOY.
  • The deployment form will prompt you for an application name. The remaining parameters have our suggested defaults; when ready, click DEPLOY at the bottom of the page.
The server authentication mode does not require credentials. We suggest using client mode instead for better security; for more information, visit https://argoproj.github.io/argo-workflows/argo-server-auth-mode
  • After a few minutes, the deployment will be ready. If you selected Expose UI via public Ingress, Argo Workflows will be accessible outside the cluster.
    Click the ingress link to open the Argo Workflows UI in a new window.
On first visit, you may encounter a TLS certificate error. It can take up to five minutes for the certificate to be issued; once it is, the error will disappear.
  • To run a sample workflow, click +SUBMIT NEW WORKFLOW, then Edit using workflow options. This displays the 'Argo says' example workflow. Click +CREATE to submit it; after a few minutes, on success, the workflow will turn green.
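The UI's sample is essentially a minimal hello-world Workflow. A sketch of such a manifest, assuming the argoproj/argosay image used by recent Argo releases (the exact default manifest shipped with your chart version may differ):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: argoproj/argosay:v2
      args: ["echo", "Argo says hello"] # Illustrative message

Submitting this manifest from the UI is equivalent to running argo submit against the same file with the CLI described below.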

Argo CLI

After installing kubectl and adding your CoreWeave Cloud access credentials, install the Argo CLI from https://github.com/argoproj/argo-workflows/releases
  1. Save the example workflow below into the file gpu-say-workflow.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gpu-say
spec:
  entrypoint: main
  activeDeadlineSeconds: 300 # Cancel operation if not finished in 5 minutes
  ttlSecondsAfterFinished: 86400 # Clean out old workflows after a day
  # Parameters can be passed/overridden via the argo CLI.
  # To override the printed message, run `argo submit` with the -p option:
  # $ argo submit examples/arguments-parameters.yaml -p messages='["CoreWeave", "Is", "Fun"]'
  arguments:
    parameters:
    - name: messages
      value: '["Argo", "Is", "Awesome"]'

  templates:
  - name: main
    steps:
    - - name: echo
        template: gpu-echo
        arguments:
          parameters:
          - name: message
            value: "{{item}}"
        withParam: "{{workflow.parameters.messages}}"

  - name: gpu-echo
    inputs:
      parameters:
      - name: message
    retryStrategy:
      limit: 1
    script:
      image: nvidia/cuda:11.4.1-runtime-ubuntu20.04
      command: [bash]
      source: |
        nvidia-smi
        echo "Input was: {{inputs.parameters.message}}"

      resources:
        requests:
          memory: 128Mi
          cpu: 500m # Half a core
        limits:
          nvidia.com/gpu: 1 # Allocate one GPU
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          # This will REQUIRE the Pod to be run on a system with a GPU with 8 or 16GB VRAM
          nodeSelectorTerms:
          - matchExpressions:
            - key: gpu.nvidia.com/vram
              operator: In
              values:
              - "8"
              - "16"
  2. Submit the workflow gpu-say-workflow.yaml. The workflow takes a JSON array and, for each entry, spins up one Pod with one GPU allocated, in parallel. The nvidia-smi output, as well as the parameter entry assigned to that Pod, is printed to the log.
    $ argo submit --watch gpu-say-workflow.yaml -p messages='["Argo", "Is", "Awesome"]'
    Name:                gpu-sayzfwxc
    Namespace:           tenant-test
    ServiceAccount:      default
    Status:              Running
    Created:             Mon Feb 10 19:31:17 -0500 (15 seconds ago)
    Started:             Mon Feb 10 19:31:17 -0500 (15 seconds ago)
    Duration:            15 seconds
    Parameters:
      messages: ["Argo", "Is", "Awesome"]

    STEP                                 PODNAME                  DURATION  MESSAGE
    ● gpu-sayzfwxc (main)
    └-·-✔ echo(0:Argo)(0) (gpu-echo)     gpu-sayzfwxc-391007373   10s
      ├-● echo(1:Is)(0) (gpu-echo)       gpu-sayzfwxc-3501791705  15s
      └-✔ echo(2:Awesome)(0) (gpu-echo)  gpu-sayzfwxc-3986963301  12s
  3. Get the log output from all parallel containers:
    $ argo logs -w gpu-sayrbr6z
    echo(0:Argo)(0): Tue Feb 11 00:25:30 2020
    echo(0:Argo)(0): +-----------------------------------------------------------------------------+
    echo(0:Argo)(0): | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
    echo(0:Argo)(0): |-------------------------------+----------------------+----------------------+
    echo(0:Argo)(0): | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    echo(0:Argo)(0): | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    echo(0:Argo)(0): |===============================+======================+======================|
    echo(0:Argo)(0): |   0  NVIDIA Graphics...  Off  | 00000000:08:00.0 Off |                  N/A |
    echo(0:Argo)(0): | 28%   51C    P5    16W / 180W |     18MiB /  8119MiB |      0%      Default |
    echo(0:Argo)(0): +-------------------------------+----------------------+----------------------+
    echo(0:Argo)(0):
    echo(0:Argo)(0): +-----------------------------------------------------------------------------+
    echo(0:Argo)(0): | Processes:                                                       GPU Memory |
    echo(0:Argo)(0): |  GPU       PID   Type   Process name                             Usage      |
    echo(0:Argo)(0): |=============================================================================|
    echo(0:Argo)(0): +-----------------------------------------------------------------------------+
    echo(0:Argo)(0): Input was: Argo
    echo(1:Is)(0): Tue Feb 11 00:25:30 2020
    echo(1:Is)(0): +-----------------------------------------------------------------------------+
    echo(1:Is)(0): | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
    echo(1:Is)(0): |-------------------------------+----------------------+----------------------+
    ...

Recommendations

We recommend the following retry strategy on your workflows and steps:
retryStrategy:
  limit: 2
  retryPolicy: Always
  backoff:
    duration: "15s"
    factor: 2
  affinity:
    nodeAntiAffinity: {}
We also recommend setting an activeDeadlineSeconds on each step, but not on the entire workflow. This allows a step to retry, but prevents it from taking an unreasonably long time to finish.
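A sketch of this recommendation in practice, moving the deadline from the workflow spec onto the individual step template (the image and the 300-second value are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadline-per-step-
spec:
  entrypoint: main
  # No workflow-level activeDeadlineSeconds: each retry gets a fresh budget.
  templates:
  - name: main
    activeDeadlineSeconds: 300 # Each attempt of this step is cancelled after 5 minutes
    retryStrategy:
      limit: 2
    container:
      image: nvidia/cuda:11.4.1-runtime-ubuntu20.04
      command: ["bash", "-c", "nvidia-smi"]

With the deadline on the template, a hung attempt is killed after five minutes and retried, while the workflow as a whole can run for as long as its retries need.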