
Nimbus

Proprietary control and data-plane software for CoreWeave Kubernetes Service Nodes

Nimbus is CoreWeave's advanced control and data-plane software, seamlessly integrated into the NVIDIA® BlueField® Data Processing Unit (DPU) that's affixed to each of our Nodes.

The DPU blends a suite of ARM-based CPU cores, specialized acceleration engines, and a robust network interface. This combination creates a powerful "infrastructure on a chip," acting as a sophisticated "computer-in-front-of-a-computer." The ARM processors operate in complete isolation from the host CPU, ensuring an added layer of security and efficiency.

Nimbus operates behind the scenes, but it is the component that fulfills the resource requests our customers make. For instance, when a customer creates a Virtual Private Cloud (VPC) Custom Resource within CKS, Nimbus responds by creating the appropriate network policy directly on the DPU.
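As a rough illustration of that flow, the sketch below turns a VPC-style resource into a list of policy rules that a DPU-side agent could apply. The resource shape and every name in it (`VpcSpec`, `PolicyRule`, `build_policy`) are hypothetical and are not the CKS or Nimbus API. The default-deny rule at the end is also an assumption made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class VpcSpec:
    """Hypothetical, simplified stand-in for a VPC custom resource spec."""
    name: str
    prefixes: tuple[str, ...]  # CIDR blocks that belong to the VPC


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical rule that a DPU-side agent could program."""
    vpc: str
    source: str
    destination: str
    action: str  # "allow" or "deny"


def build_policy(spec: VpcSpec) -> list[PolicyRule]:
    """Allow traffic between prefixes of the same VPC, then deny the rest."""
    # Normalize and validate each CIDR; ip_network raises ValueError on bad input.
    cidrs = [str(ip_network(p)) for p in spec.prefixes]
    rules = [
        PolicyRule(spec.name, src, dst, "allow")
        for src in cidrs
        for dst in cidrs
    ]
    # Assumed default for this sketch: anything not explicitly allowed is dropped.
    rules.append(PolicyRule(spec.name, "0.0.0.0/0", "0.0.0.0/0", "deny"))
    return rules
```

Calling `build_policy(VpcSpec("tenant-a", ("10.0.0.0/24", "10.0.1.0/24")))` yields four allow rules followed by one deny rule. A real implementation would program hardware flow tables on the DPU rather than return a Python list.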

Nimbus allows our stateless Nodes to deliver features traditionally exclusive to virtualized environments. Through its API-driven, extensible network programming capabilities, Nimbus orchestrates tenant isolation, VPC setups, public internet routing, and much more. This ensures our Nodes offer unparalleled scalability, flexibility, and isolation, all while forgoing the need for virtual machines.

Understanding Nimbus

Nimbus operates through a dual-component system: an Operator situated on the DPU, and a Controller within the CKS management cluster. This pair collaborates to oversee both the DPU and the Node's lifecycle, ensuring seamless operation and management.

Embedded directly on the DPUs, the Nimbus Operator is integral to the Kubernetes ecosystem at CoreWeave, treated with the same level of importance and management as Kubernetes Nodes themselves.

The Controller's role extends to managing Node responses to various fleet lifecycle events. These include firmware upgrades, adjustments in network configurations, changes in VPC memberships, and comprehensive hardware integrity assessments.

Key to Nimbus's functionality is its ability to offload critical operations such as storage virtualization (using NVMe-oF), firewalling, and encryption directly to the DPU. This strategic offloading ensures the host CPU remains free for computational tasks, enhancing overall system efficiency.

Nimbus's architecture creates a secure, isolated network environment for each Node. By managing VPC memberships and network interfaces, Nimbus allows Nodes to support multiple client VPCs simultaneously. Network routing and isolation are handled through a Type 5 (Layer 3) EVPN-VXLAN overlay network, shifting network intelligence from traditional Top of Rack (TOR) switches to the DPU. In this setup, TOR switches are repurposed as mere packet-forwarding devices, utilizing unnumbered BGP for the underlying network infrastructure.
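To make the isolation model concrete, the sketch below models one routing table (VRF) per VPC, each identified by its own Layer 3 VNI, with longest-prefix-match lookup. It is a toy model of how a Type 5 (IP prefix) EVPN overlay keeps tenants apart, not Nimbus's implementation. The `Vrf` class, the VNI values, and the VTEP addresses are all made up for the example.

```python
from ipaddress import ip_address, ip_network


class Vrf:
    """Hypothetical per-VPC routing table identified by an L3 VNI."""

    def __init__(self, vni: int) -> None:
        self.vni = vni
        self.routes = {}  # ip_network -> remote VTEP address (string)

    def add_prefix_route(self, prefix: str, vtep: str) -> None:
        # Models learning an IP prefix (EVPN Type 5) route for this VRF.
        self.routes[ip_network(prefix)] = vtep

    def lookup(self, destination: str):
        """Return (vni, vtep) for the longest matching prefix, or None."""
        addr = ip_address(destination)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.vni, self.routes[best]


# Two VPCs reuse the same address space without seeing each other's routes.
vrfs = {"vpc-a": Vrf(vni=10001), "vpc-b": Vrf(vni=10002)}
vrfs["vpc-a"].add_prefix_route("10.0.0.0/16", "192.0.2.11")
vrfs["vpc-a"].add_prefix_route("10.0.5.0/24", "192.0.2.12")
vrfs["vpc-b"].add_prefix_route("10.0.0.0/16", "192.0.2.21")
```

Looking up `10.0.5.7` in `vpc-a` selects the more specific `/24` route, while the same address in `vpc-b` resolves through that VPC's own `/16` with a different VNI. Because each lookup is confined to one VRF, overlapping prefixes in different VPCs never collide.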

Nimbus's role in CKS

Nimbus is a critical component of CoreWeave Kubernetes Service (CKS). During Day 0 and Day 1 operations, Nimbus is used to join the Node to CoreWeave's internal onboarding cluster.

For Day 2+ operations, Nimbus is used by other CKS components like the VPC Operator to maintain the desired state of the DPU and the Node's VPC membership.
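Maintaining desired state usually comes down to comparing what should exist with what does exist. The sketch below shows that comparison for VPC membership only. The function name and the "attach"/"detach" vocabulary are assumptions for illustration, not terms taken from Nimbus or the VPC Operator.

```python
def plan_membership_changes(desired: set[str], observed: set[str]) -> dict[str, list[str]]:
    """Compute which VPCs a Node should be attached to or detached from."""
    return {
        # In the desired state but not yet present on the Node.
        "attach": sorted(desired - observed),
        # Present on the Node but no longer desired.
        "detach": sorted(observed - desired),
    }
```

For a Node that should belong to `vpc-a` and `vpc-c` but currently belongs to `vpc-a` and `vpc-b`, the plan is to attach `vpc-c` and detach `vpc-b`. When both sets match, both lists are empty and there is nothing to do.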

Nimbus is responsible for ensuring that the Node:

  • has the correct network configuration
  • has joined the correct customer's cluster
  • is a member of the correct VPCs

Nimbus also verifies that the DPU is running the correct firmware and has the correct configuration to support the Node's workload.
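The checks described in this section can be pictured as a drift report between a desired and an observed snapshot of a Node. The following is a minimal sketch under that assumption. `NodeState`, its field names, and `find_drift` are hypothetical and do not reflect how Nimbus actually represents or verifies state.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of the properties listed above."""
    network_config: str  # e.g. a digest of the rendered configuration
    cluster: str         # customer cluster the Node has joined
    vpcs: frozenset      # names of the VPCs the Node belongs to
    dpu_firmware: str    # firmware version reported by the DPU


def find_drift(desired: NodeState, observed: NodeState) -> dict:
    """Map each mismatched field name to its (desired, observed) pair."""
    drift = {}
    for field in fields(NodeState):
        want = getattr(desired, field.name)
        have = getattr(observed, field.name)
        if want != have:
            drift[field.name] = (want, have)
    return drift
```

An empty result means the Node matches its desired state. Any other result names the properties that need corrective action, such as a firmware upgrade or a VPC membership change.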