
Manage memory with cache-dropper

Use cache-dropper to manage memory between exclusive Slurm jobs on a node

Dropping the page cache between exclusive jobs on a node can improve performance by reducing the slow memory access caused by memory fragmentation. Memory fragmentation can also lead to Out of Memory (OOM) errors and slowdowns in CPU-intensive training jobs.

SUNK includes a sidecar container, cache-dropper, to handle page cache flushes. This container runs in privileged mode, allowing it to drop the cache without requiring the main slurmd container to run as privileged. The sidecar watches for a specific trigger file and, when that file appears, writes to the drop_caches sysctl file to free both page cache and reclaimable slab objects.

Dropping the cache is a non-destructive operation, but may incur additional CPU and I/O overhead as dropped objects are recreated.

Enable the cache-dropper sidecar in the Slurm chart

To enable the cache-dropper sidecar, set .compute.cacheDropper.enabled to true in the Slurm values.yaml file.

When enabled, cache-dropper is present in every compute pod. It periodically checks whether the /run/enroot/drop_caches file exists and, if it does, performs the cache drop operation.
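To make the trigger-file mechanism concrete, the following is a minimal sketch of a single polling check. It is not the SUNK implementation: the function name check_and_drop, the removal of the trigger file after a drop, and the polling interval in the commented loop are all assumptions made for illustration. Only the trigger path and the write to drop_caches come from the description above. Writing to the real sysctl file requires root privileges, so both paths can be overridden.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a cache-dropper-style check; NOT the actual SUNK sidecar code.

# Both paths can be overridden so the logic can be tried without root.
TRIGGER_FILE="${TRIGGER_FILE:-/run/enroot/drop_caches}"   # trigger path named on this page
DROP_CACHES="${DROP_CACHES:-/proc/sys/vm/drop_caches}"    # kernel drop_caches sysctl file

# Perform one check. Returns 0 if a drop was requested and written, non-zero otherwise.
check_and_drop() {
    [ -e "$TRIGGER_FILE" ] || return 1
    sync                        # flush dirty pages first so more of the cache is reclaimable
    echo 3 > "$DROP_CACHES"     # 3 = free page cache and reclaimable slab objects
    rm -f "$TRIGGER_FILE"       # assumption: clear the trigger so each touch causes one drop
}

# Example polling loop (the 5-second interval is an assumption):
# while true; do check_and_drop; sleep 5; done
```

In this sketch, removing the trigger file after each drop is what would let a later touch request another drop. Whether the real sidecar behaves this way is not stated here.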

Drop the page cache with a Slurm job

To use cache-dropper, add the following touch command to a Slurm job script:

Example
$ touch /run/enroot/drop_caches

The cache is dropped each time the job script runs this command.
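As an illustration of where the touch command might sit in a complete job script, here is a minimal sketch. The job name, the request_cache_drop helper, and the commented workload line are hypothetical. The sketch also assumes that the trigger path is local to each compute pod, so a multi-node job needs to create the file on every node. Because a batch script runs only on the first allocated node, the helper uses srun to reach the others.

```shell
#!/bin/bash
#SBATCH --job-name=cache-drop-example   # hypothetical job name
#SBATCH --nodes=1                       # increase for multi-node jobs
#SBATCH --exclusive

# Trigger path named on this page; overridable for testing outside a compute pod.
TRIGGER_FILE="${TRIGGER_FILE:-/run/enroot/drop_caches}"

# Hypothetical helper: create the trigger file on every node in the allocation.
request_cache_drop() {
    # Fail early if the trigger directory is not present on this node.
    [ -d "$(dirname "$TRIGGER_FILE")" ] || return 1
    if [ "${SLURM_JOB_NUM_NODES:-1}" -gt 1 ]; then
        # The batch script runs only on the first node, so fan out with one task per node.
        srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 touch "$TRIGGER_FILE"
    else
        touch "$TRIGGER_FILE"
    fi
}

# Request the drop before the workload starts; warn but continue if it fails.
request_cache_drop || echo "warning: could not create $TRIGGER_FILE" >&2

# The actual workload would follow, for example:
# srun ./train.sh   # hypothetical workload script
```

Because the sidecar checks for the file periodically, the drop may happen shortly after the touch rather than immediately.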