Spark on Kubernetes

Run Spark workloads on CoreWeave using spark-submit, or interactively from JupyterLab.

Apache Spark is a popular multi-language engine designed for executing distributed workloads. It is often used for data processing tasks, such as preparing data for training machine learning models.

This guide demonstrates how to use Spark in cluster mode and interactively in client mode through a Jupyter notebook, with PySpark serving as the backend for the img2dataset tool. This tool is used to download and process large, publicly available image datasets. For an in-depth explanation of all the available features, see Spark's Running on Kubernetes documentation.

Follow the steps below to complete the demonstration.

Configure Kubernetes

Verify that your Kubernetes configuration is set up correctly. The full account creation details are found in Cloud Account and Access. Briefly, the steps are:

  1. Install kubectl.

  2. Create an API token, and download the Kubeconfig.

  3. Set your Kubernetes context to coreweave.

    $ kubectl config set-context coreweave

Verify your configuration by listing your Kubernetes Secrets. If the kubectl command returns one or more Secrets, your configuration is ready to go.

$ kubectl get secret

Download the source code repository

On your local workstation, download and extract the demonstration code into a working folder. The following command works on macOS, Linux, and Windows CMD, but not in Windows PowerShell.

curl -L https://github.com/coreweave/kubernetes-cloud/archive/refs/heads/master.tar.gz | tar xzf -

Alternatively, clone the kubernetes-cloud GitHub repository with your preferred method.

Then, change to the repository's spark directory.

Create the PVC

The PersistentVolumeClaim (PVC) is a shared 400Gi NVMe-backed volume used to store the downloaded datasets. Create it by applying the spark-pvc.yaml manifest, found in the spark directory.

kubectl apply -f spark-pvc.yaml

Deploy the Weights and Biases Secret

The img2dataset code will log metrics to Weights and Biases. To enable this, you must deploy a WandB auth token as a Kubernetes secret.

First, find your token by logging in to your Weights and Biases account and navigating to https://wandb.ai/authorize.

Then, encode your token with base64:

$ echo -n "example-wandb-token" | base64
ZXhhbXBsZS13YW5kYi10b2tlbg==

Next, copy the encoded token into wandb-secret.yaml. When complete, it should look similar to this:

apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5kYi10b2tlbg==
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque

Finally, apply the updated manifest to the cluster with kubectl:

kubectl apply -f wandb-secret.yaml

Create the ServiceAccount

Because Spark runs on Kubernetes, it needs permission to manage its resources. For instance, Spark uses Pods as executors, so it needs permission to create and delete them.

The required ServiceAccount, Role, and RoleBinding are defined in the spark-role.yaml manifest. Apply it to your namespace with kubectl:

kubectl apply -f spark-role.yaml

About the different modes

You can launch your Spark job in cluster mode, or in client mode.

  • Cluster mode requires Spark to be installed locally on your workstation. If you don't already have Apache Spark installed, follow the instructions on the download page to install the latest version pre-built for Apache Hadoop.

  • Client mode does not require a local Spark installation.

We'll explain both methods.

Launch in cluster mode

In cluster mode, the Spark driver is deployed as a Pod alongside the executors. You'll use spark-submit, located in $SPARK_HOME/bin, to launch the job. We've provided an example submit command in the project directory as example-spark-submit.sh.

Spark configuration

Spark jobs use many configuration parameters.

In the example script, the configuration is defined on the command line. However, there are two other ways you can set Spark configuration parameters.

  • You can use a Kubernetes Pod template to set default parameters. Any variables defined in the template can be overridden by the spark-submit command line.

  • You can set parameters in the Python script when creating the Spark context. Anything set here will override parameters in the Pod template or the command line, as shown in the sketch below.

The full parameter list is published at spark.apache.org.
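
As a sketch of the in-script approach, the resources that the example script requests on the command line could instead be set when the Spark session is created in Python. This is a minimal, illustrative example, not the configuration used by example-spark-submit.sh; the application name and values are placeholders.

from pyspark.sql import SparkSession

# Illustrative values only; the example script sets these on the
# spark-submit command line instead.
spark = (
    SparkSession.builder
    .appName("img2dataset-example")           # hypothetical application name
    .config("spark.executor.instances", "1")  # one executor Pod
    .config("spark.executor.cores", "16")     # cores per executor
    .config("spark.executor.memory", "64g")   # memory per executor
    .getOrCreate()
)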

The example script uses a template YAML file, cpu-pod-template.yaml, which contains specific Kubernetes configurations that Spark does not override. It also specifies the container images for the driver and executors, the ServiceAccount, and the Kubernetes namespace. The script deploys a total of two Pods, one driver and one executor, with each Pod allocated 16 CPU cores and 64GB of memory.

Run the job

To launch the job, run the example script:

./example-spark-submit.sh

The job downloads the MSCOCO dataset to the mscoco/ folder in the PVC you created. It takes approximately 10 minutes to run. When the job finishes, the executor Pods are deleted, but the driver Pod is kept so you can view its logs after the job is complete.

While the script runs, you can watch the job status in the terminal. You can see the logs for the driver and executor Pods with the kubectl logs command.

Launch in client mode

In client mode, the Spark driver runs on the machine from which the job is launched, and it must communicate with the executor Pods running on the cluster. Although there are multiple ways to achieve this, for this example we'll deploy a JupyterLab service on Kubernetes to serve as the driver. You do not need to install Spark locally in client mode.

Deploy JupyterLab

JupyterLab runs in a Pod created by a Kubernetes Deployment, which is connected to a headless Service that exposes ports for both JupyterLab and Spark. This setup is detailed in the Client Mode Networking section of Spark's Running on Kubernetes documentation.

The deployment and service configurations can be found in jupyter/jupyter-service.yaml. To deploy, execute the following command:

kubectl apply -f jupyter/jupyter-service.yaml

This setup does not expose the service over the open internet. Therefore, you need to forward two ports, one for JupyterLab and one for the Spark UI:

kubectl port-forward service/spark-jupyter 8888:8888 4040:4040

When the port forward is running, you can connect to JupyterLab at http://localhost:8888 and the Spark UI at http://localhost:4040.

Running the job

An example Jupyter notebook (jupyter/interactive-example.ipynb) is provided, which creates a Spark session in client mode using dynamic executors. This setup allows the Spark session to run concurrently with the Jupyter notebook, with executors spinning up as needed for jobs and shutting down once the jobs are completed. By default, the example notebook configures the Spark session to use a maximum of five executors. The remaining Spark configuration parameters are identical to those of the previous job.
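
The session setup in the notebook looks roughly like the sketch below. It is an illustration rather than a copy of the notebook: the container image and driver settings are assumptions, and the dynamic-allocation settings simply reflect the five-executor maximum described above.

from pyspark.sql import SparkSession

# Illustrative client-mode session; the image and driver host are
# assumptions, not values copied from the example notebook.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")  # in-cluster API server
    .appName("img2dataset-cc12m")                    # hypothetical application name
    .config("spark.kubernetes.container.image", "apache/spark-py")  # assumed executor image
    .config("spark.driver.host", "spark-jupyter")    # the headless Service deployed above
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "5")  # the five-executor cap
    .config("spark.executor.cores", "16")
    .config("spark.executor.memory", "64g")
    .getOrCreate()
)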

In this notebook, the main job uses img2dataset to download the Conceptual 12M dataset to the cc12m-jupyter/ folder within the PVC you created during setup.
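
Inside the notebook, the download itself is a single call to img2dataset using its PySpark distributor. The sketch below follows the img2dataset documentation rather than the notebook verbatim; the input list path, output path, and column names are placeholders, and the notebook's actual arguments may differ.

from img2dataset import download

# Illustrative call; paths and column names are placeholders.
download(
    url_list="cc12m.tsv",                    # CC12M URL/caption list (placeholder path)
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    image_size=256,
    output_folder="/mnt/pvc/cc12m-jupyter",  # folder on the mounted PVC (assumed mount path)
    output_format="webdataset",
    distributor="pyspark",                   # distribute the download across the Spark executors
    enable_wandb=True,                       # log metrics to Weights and Biases
)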

The img2dataset documentation states that the dataset was downloaded in approximately five hours using a single machine with 16 CPU cores and 64GB memory. The Weights and Biases report for that run can be found here.

Note

Many of the images in this run failed to download, as was also the case for the run shown in the img2dataset documentation. This is caused by timeouts for the CC12M dataset, not by a Spark error.

By using five executors with the same resource configuration, the notebook can download the dataset in about one hour, demonstrating the scalability and efficiency of batch processing with Spark.
