This page explains how to use the cache-dropper sidecar to flush the Linux page cache between exclusive Slurm jobs on a SUNK compute node. It's intended for cluster administrators and job authors who need to mitigate memory fragmentation in CPU-intensive training workloads.
Dropping the page cache between exclusive jobs on a node can improve performance by reducing the memory fragmentation that slows memory access. Memory fragmentation can lead to Out of Memory (OOM) errors and slowdowns in CPU-intensive training jobs.
SUNK includes a sidecar container, cache-dropper, to handle page cache flushes. This container runs in privileged mode, which lets it drop the cache without requiring the main slurmd container to run as privileged. The sidecar checks for the presence of a specific trigger file and drops the page cache if that file appears. The cache-dropper sidecar writes to the drop_caches sysctl file to free both page cache and reclaimable slab objects.
Dropping the cache is a non-destructive operation, but it can incur additional CPU and I/O overhead as dropped objects are recreated.
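To see how much memory a drop could affect on a node, one option is to read the matching counters from /proc/meminfo before and after a drop. The snippet below is a minimal sketch: the helper name meminfo_kb is hypothetical, and it assumes the standard Linux Cached and SReclaimable fields.

```shell
# Print the value (in kB) of one /proc/meminfo field, for example Cached.
# The optional second argument overrides the file to read (hypothetical helper).
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

echo "page cache:       $(meminfo_kb Cached) kB"
echo "reclaimable slab: $(meminfo_kb SReclaimable) kB"
```

Running it once before and once after a cache drop gives a rough view of how much page cache and reclaimable slab memory was freed.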
Enable the cache-dropper sidecar in the Slurm chart
To enable the cache-dropper sidecar, set .compute.cacheDropper.enabled to true in the Slurm values.yaml file.
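As a sketch, the setting above corresponds to the following fragment of values.yaml. It assumes the key path nests exactly as the dotted name suggests; any other keys already present under compute stay unchanged.

```yaml
compute:
  cacheDropper:
    enabled: true   # adds the cache-dropper sidecar to compute pods
```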
When enabled, cache-dropper runs in every compute pod. It periodically checks whether the /run/enroot/drop_caches file exists and, if it does, performs the cache drop.
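The check-and-drop behavior can be pictured with a small shell sketch. This is not the actual cache-dropper implementation. The function name drop_if_triggered, the WATCH and POLL_INTERVAL variables, the 5-second interval, the sync call, and the removal of the trigger file after a drop are all assumptions made for illustration. Only the trigger path and the write to drop_caches come from this page. /proc/sys/vm/drop_caches is the standard kernel location of that sysctl file, and 3 is the kernel's documented value for freeing both page cache and reclaimable slab objects.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the actual cache-dropper implementation.
# Both paths can be overridden so the logic can be exercised without root.
TRIGGER_FILE="${TRIGGER_FILE:-/run/enroot/drop_caches}"           # trigger path from this page
DROP_CACHES_FILE="${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}"  # kernel sysctl file

# Drop caches once if the trigger file exists.
# Returns 0 if a drop was performed, non-zero otherwise.
drop_if_triggered() {
    [ -e "$TRIGGER_FILE" ] || return 1
    sync                          # assumption: flush dirty pages first so more cache is reclaimable
    echo 3 > "$DROP_CACHES_FILE"  # 3 = page cache + reclaimable slab objects
    rm -f "$TRIGGER_FILE"         # assumption: consume the trigger so one request means one drop
}

# Assumed polling loop, enabled only when WATCH=1; the interval is hypothetical.
if [ "${WATCH:-0}" = "1" ]; then
    while true; do
        drop_if_triggered || true
        sleep "${POLL_INTERVAL:-5}"
    done
fi
```

Writing to the real drop_caches file requires root privileges, which is consistent with the sidecar running in privileged mode while the main slurmd container does not.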
Drop the page cache with a Slurm job
After the sidecar is enabled, an individual Slurm job triggers a cache drop by creating the trigger file that the sidecar watches for. To use cache-dropper, add the following touch command to a Slurm job script: