Spark on Kubernetes
Run Spark workloads on CoreWeave using spark-submit, or interactively from JupyterLab.
Apache Spark is a popular multi-language engine designed for executing distributed workloads. It is often used for data processing tasks, such as preparing data for training machine learning models.
This guide demonstrates how to use Spark in cluster mode and interactively in client mode through a Jupyter notebook, with PySpark serving as the backend for the img2dataset tool, which is used to download and process large, publicly available image datasets. For an in-depth explanation of all the available features, please visit Spark's Running on Kubernetes documentation.
Follow the steps below to complete the demonstration.
Configure Kubernetes
Verify that your Kubernetes configuration is set up correctly. The full account creation details are found in Cloud Account and Access. Briefly, the steps are:
- Install kubectl.
- Create an API token, and download the Kubeconfig.
- Install the new Kubeconfig, or merge it with your existing one (see the sketch below).
- Set your Kubernetes context to coreweave:
kubectl config set-context coreweave
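If you already have a Kubeconfig, one common way to merge the downloaded file into it is with the KUBECONFIG environment variable and kubectl config view --flatten. This is a minimal sketch, not part of the CoreWeave instructions; the filename coreweave-kubeconfig.yaml is a placeholder for wherever you saved the downloaded file:
# Back up the existing config, then merge the downloaded Kubeconfig into it
cp ~/.kube/config ~/.kube/config.bak
KUBECONFIG=~/.kube/config:coreweave-kubeconfig.yaml kubectl config view --flatten > /tmp/merged-kubeconfig
mv /tmp/merged-kubeconfig ~/.kube/config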
Verify your configuration by checking your Kubernetes secrets. If the kubectl command returns any security tokens, your configuration is ready to go.
kubectl get secret
Download the source code repository
In a working folder, download and extract the demonstration code to your local workstation. The following command works in macOS, Linux, and Windows CMD, but not Windows PowerShell.
curl -L https://github.com/coreweave/kubernetes-cloud/archive/refs/heads/master.tar.gz | tar xzf -
Alternatively, clone the kubernetes-cloud GitHub repository with your preferred method.
Then, change to the repository's spark directory.
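For example, assuming the tarball above extracted to a kubernetes-cloud-master folder (the default for a GitHub archive of the master branch):
cd kubernetes-cloud-master/spark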
Create the PVC
The PVC is a shared, 400Gi NVMe drive used to store the downloaded datasets. Create it by applying the spark-pvc.yaml manifest, found in the spark directory.
kubectl apply -f spark-pvc.yaml
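For orientation only, a PersistentVolumeClaim of this kind generally looks like the sketch below; the name and storage class here are placeholders, so use the values in the provided spark-pvc.yaml rather than this sketch:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-data              # placeholder; the provided manifest defines the real name
spec:
  accessModes:
    - ReadWriteMany             # shared by the driver and executor Pods
  storageClassName: shared-nvme # placeholder; use the storage class from spark-pvc.yaml
  resources:
    requests:
      storage: 400Gi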
Deploy the Weights and Biases Secret
The img2dataset code will log metrics to Weights and Biases. To enable this, you must deploy a WandB auth token as a Kubernetes secret.
First, find your token by logging in to your Weights and Biases account and navigating to https://wandb.ai/authorize.
Then, encode your token with base64:
echo -n "example-wandb-token" | base64
ZXhhbXBsZS13YW5kYi10b2tlbg==
Next, copy the token into wandb-secret.yaml. When complete, it should look similar to this:
apiVersion: v1
data:
  token: ZXhhbXBsZS13YW5kYi10b2tlbg==
kind: Secret
metadata:
  name: wandb-token-secret
type: Opaque
Finally, apply the updated manifest to the cluster with kubectl:
kubectl apply -f wandb-secret.yaml
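To confirm the Secret was created in your namespace, you can list it by name:
kubectl get secret wandb-token-secret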
Create the ServiceAccount
Because Spark runs on Kubernetes, it needs permission to manage its resources. For instance, Spark uses Pods as executors, so it needs permission to deploy and delete them.
The required ServiceAccount, Role, and RoleBinding are defined in the spark-role.yaml manifest. Apply it to your namespace with kubectl:
kubectl apply -f spark-role.yaml
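As a rough, illustrative sketch (not the contents of spark-role.yaml), an RBAC setup of this shape gives a ServiceAccount control over Pods and related resources in the namespace:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark                  # illustrative name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role             # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding     # illustrative name
subjects:
  - kind: ServiceAccount
    name: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io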
About the different modes
You can launch your Spark job in cluster mode, or in client mode.
- Cluster mode requires Spark installed locally on your workstation. If you don't already have Apache Spark installed, follow the instructions on the download page to install the latest version pre-built for Apache Hadoop.
- Client mode does not require Spark installed locally.
We'll explain both methods.
Launch in cluster mode
In cluster mode, the Spark driver is deployed as a Pod alongside the executors. You'll use spark-submit, located in $SPARK_HOME/bin, to launch the job. We've provided an example submit command in the project directory as example-spark-submit.sh.
Spark Configuration
Spark jobs use many configuration parameters.
In the example script, the configuration is defined on the command line. However, there are two other ways you can set Spark configuration parameters.
- You can use a Kubernetes Pod template to set default parameters. Any variables defined in the template can be overridden by the spark-submit command line.
- You can set parameters in the Python script when creating the Spark context (see the sketch below). Anything set here will override parameters in the Pod template or the command line.
The full parameter list is published at spark.apache.org.
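For example, setting parameters when creating the Spark session in Python looks roughly like this; the master URL, image, and ServiceAccount values are placeholders, not the values used by the example job:
from pyspark.sql import SparkSession

# Parameters set here take precedence over the Pod template and the spark-submit command line
spark = (
    SparkSession.builder
    .master("k8s://https://<kubernetes-api-server>:443")   # placeholder API server address
    .config("spark.executor.instances", "1")
    .config("spark.executor.cores", "16")
    .config("spark.executor.memory", "64g")
    .config("spark.kubernetes.container.image", "<spark-image>")                  # placeholder image
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")   # placeholder ServiceAccount
    .getOrCreate()
)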
The example script uses a template YAML file, cpu-pod-template.yaml, which contains specific Kubernetes configurations that Spark does not override. The script also specifies the Docker images for the driver and executors, the ServiceAccount, and the Kubernetes namespace. It deploys a total of two Pods, one driver and one executor, with each Pod allocated 16 cores and 64GB of memory.
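For reference, a submit command of this general shape ties these pieces together; the values below are placeholders, not the exact contents of example-spark-submit.sh:
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://<kubernetes-api-server>:443 \
  --deploy-mode cluster \
  --name img2dataset-example \
  --conf spark.executor.instances=1 \
  --conf spark.executor.cores=16 \
  --conf spark.executor.memory=64g \
  --conf spark.kubernetes.namespace=<your-namespace> \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.driver.podTemplateFile=cpu-pod-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=cpu-pod-template.yaml \
  <path-to-python-script>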
Run the job
To launch the job, run the example script:
./example-spark-submit.sh
The job downloads the MSCOCO dataset to the mscoco/ folder in the PVC you created. It should take approximately 10 minutes to run, and the compute resources are deleted when the job finishes. The driver Pod is not deleted, so you can view its logs after the job is complete.
While the script runs, you can watch the job status in the terminal. You can view the logs for the driver and executor Pods with the kubectl logs command.
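For example, once you know the driver Pod's name from kubectl get pods, you can follow its logs; the Pod name below is a placeholder:
# List the Pods created for the job, then follow the driver's logs
kubectl get pods
kubectl logs -f <driver-pod-name>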
Launch in client mode
In client mode, the Spark driver is the machine from which the jobs are launched, and the driver needs to communicate with the executor pods running on the cluster. Although there are multiple ways to achieve this, we'll deploy a JupyterLab service on Kubernetes to serve as the driver for this example. You do not need to install Spark locally in client mode.
Deploy JupyterLab
JupyterLab runs in a pod specified by a Kubernetes deployment, which is connected to a headless service that exposes ports for both JupyterLab and Spark. This setup is detailed in the Client Mode Networking section of Spark on Kubernetes.
The deployment and service configurations can be found in jupyter/jupyter-service.yaml. To deploy, execute the following command:
kubectl apply -f jupyter/jupyter-service.yaml
This setup does not expose the service over the open internet. Therefore, you need to create two port forwards, one for JupyterLab and one for the Spark UI:
kubectl port-forward service/spark-jupyter 8888:8888 4040:4040
When the port forward is running, you can connect to JupyterLab at http://localhost:8888 and the Spark UI at http://localhost:4040.
Running the Job
An example Jupyter notebook (jupyter/interactive-example.ipynb) is provided, which creates a Spark session in client mode using dynamic executors. This setup allows the Spark session to run concurrently with the Jupyter notebook, with executors spinning up as needed for jobs and shutting down once the jobs are completed. By default, the example notebook configures the Spark session to use a maximum of five executors. The remaining Spark configuration parameters are identical to those of the previous job.
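As a minimal sketch of how dynamic executor allocation is typically enabled for Spark on Kubernetes (the notebook's exact settings may differ):
from pyspark.sql import SparkSession

# Executors are created on demand and removed when idle, up to the configured maximum
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "5")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")   # needed on Kubernetes, which has no external shuffle service
    .getOrCreate()
)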
In this notebook, the main job uses img2dataset to download the Conceptual 12M dataset to the cc12m-jupyter/ folder within the PVC you created during setup.
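For reference, an img2dataset call that hands work to an existing Spark session generally looks like the sketch below; the paths and most parameter values are placeholders, not the notebook's exact settings:
from img2dataset import download

# With distributor="pyspark", img2dataset uses the active Spark session to parallelize downloads
download(
    url_list="<path-to-cc12m-url-list>",              # placeholder input path
    output_folder="<pvc-mount-path>/cc12m-jupyter",   # placeholder PVC mount point
    input_format="tsv",                               # placeholder; match the format of the URL list
    output_format="webdataset",
    enable_wandb=True,
    distributor="pyspark",
)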
The img2dataset documentation states that the dataset was downloaded in approximately five hours using a single machine with 16 CPU cores and 64GB of memory. The Weights and Biases report for that run can be found here.
Many of the images in this run failed to download, as they also did in the run shown in the img2dataset documentation. This is caused by timeouts for the CC12M dataset, not by a Spark error.
By using five executors with the same resource configuration, the notebook can download the dataset in about one hour, demonstrating the scalability and efficiency of batch processing with Spark.