Address nodes in the INVAL state
When checking the state of a Slurm node, you may encounter nodes in the INVAL state.
The INVAL state often indicates a mismatch between the slurm.conf file and the slurmd daemon running on the Compute node. This may occur for multiple reasons:
- Resource mismatch: The slurm.conf file might declare that a certain number of CPUs, memory, or GPUs are available, but when the slurmd on that node starts up, it reports a different number of these resources. For example, the slurm.conf declares RealMemory=128000 (128GB), but the node's slurmd reports only RealMemory=96000 (96GB) after checking its system. This could be due to memory being used by other processes on the node. The slurm.conf file may have been updated to reflect a change in the number of CPUs or memory on the node, but the slurmd daemon has not been restarted to reflect the change.
- Hardware feature mismatch: If the slurm.conf file specifies a hardware feature, such as Features=gpu, but the node's slurmd cannot detect that feature, the node will go into the INVAL state.
- Networking and Hostname issues: Problems with a node's hostname, IP address, or network connectivity can cause the slurmctld to receive invalid or incomplete registration information from slurmd, resulting in the INVAL state.
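Before investigating a cause, it helps to know which nodes are affected. The following is a minimal sketch, not an official tool: list_inval_nodes is a hypothetical helper that reads "name state" pairs on standard input and prints the names whose state contains "inval". It assumes the state string printed by sinfo includes that text for affected nodes, which may vary between Slurm versions.

```shell
# list_inval_nodes (hypothetical helper): read "<node> <state>" lines on stdin
# and print each node whose state contains "inval" (case-insensitive).
# sort -u removes duplicates, since a node in several partitions is listed
# once per partition.
list_inval_nodes() {
    awk 'tolower($2) ~ /inval/ { print $1 }' | sort -u
}

# Example usage on a host with the Slurm client tools installed:
#   sinfo -N -h -o '%N %T' | list_inval_nodes
```

Running sinfo -R is another option, since it lists the reason recorded for nodes that are down, drained, or failing.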
Troubleshoot an INVAL node
To troubleshoot an INVAL node, first examine the slurmctld logs on the controller node and the slurmd logs on the Compute node. These logs often contain explicit error messages detailing why the node's configuration is invalid.
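Log locations differ between installations, so one way to find them is to read them from the running configuration. The sketch below assumes that scontrol show config prints lines of the form "Key = value" and that logging goes to a file rather than to syslog or the journal. config_value is a hypothetical helper and node001 is a placeholder node name.

```shell
# config_value KEY (hypothetical helper): read "Key = value" lines on stdin
# and print the value of the first line whose key equals KEY.
config_value() {
    awk -F' *= *' -v k="$1" '$1 == k { print $2; exit }'
}

# Example usage on the controller node (node001 is a placeholder):
#   log=$(scontrol show config | config_value SlurmctldLogFile)
#   grep -i 'node001' "$log" | tail -n 20
# On the Compute node, use the SlurmdLogFile key instead.
```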
For a detailed report of the node's configuration, run the slurmd -C command on the Compute node.
$ slurmd -C
Compare this output with the node's entry in your slurm.conf file. If there are any discrepancies, update the slurm.conf file to match the values reported by slurmd -C.
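The comparison can be partly automated. The following sketch assumes that the first line of the slurmd -C output and the node's slurm.conf entry are both single lines of space-separated Key=Value pairs with the same key capitalization. get_field and compare_node_lines are hypothetical helpers, and the node name and file path in the usage comment are placeholders. Entries that use node ranges or line continuations would need extra handling.

```shell
# get_field KEY LINE (hypothetical helper): print the value of KEY in a
# space-separated list of Key=Value pairs.
get_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | awk -F= -v k="$1" '$1 == k { print $2; exit }'
}

# compare_node_lines DETECTED CONFIGURED (hypothetical helper): print one
# line per field whose value differs and return 1 if any field differs.
compare_node_lines() {
    rc=0
    for key in CPUs Sockets CoresPerSocket ThreadsPerCore RealMemory; do
        detected=$(get_field "$key" "$1")
        configured=$(get_field "$key" "$2")
        # Skip fields that the slurm.conf entry does not set explicitly.
        if [ -z "$configured" ]; then
            continue
        fi
        if [ "$detected" != "$configured" ]; then
            echo "MISMATCH $key: slurmd=$detected slurm.conf=$configured"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage on the Compute node (name and path are placeholders):
#   detected=$(slurmd -C | head -n 1)
#   configured=$(grep '^NodeName=node001 ' /etc/slurm/slurm.conf)
#   compare_node_lines "$detected" "$configured"
```

The helper reports any difference. Whether a particular difference is enough to put the node into the INVAL state is decided by Slurm, so treat the output as a list of candidates to check.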
After making changes, restart the slurmctld on the controller node and the slurmd daemon on the Compute node to allow them to re-register with the correct information.
For detailed instructions, see Restart the Slurm Controller and Restart the Slurm daemon.
You must restart the slurmctld and slurmd for changes made to the slurm.conf file to take effect.
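After both daemons have restarted, you can confirm that the node re-registered cleanly. This is a sketch under assumptions: it expects scontrol show node to print a State=... field, and it treats any state string containing "INVAL" as still invalid. node_state and is_inval are hypothetical helpers and node001 is a placeholder node name.

```shell
# node_state (hypothetical helper): read `scontrol show node` output on stdin
# and print the value of its State= field.
node_state() {
    tr ' ' '\n' | awk -F= '$1 == "State" { print $2; exit }'
}

# is_inval STATE (hypothetical helper): succeed if STATE contains "INVAL".
is_inval() {
    case "$1" in
        *INVAL*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (node001 is a placeholder):
#   state=$(scontrol show node node001 | node_state)
#   if is_inval "$state"; then echo "node001 is still invalid: $state"; fi
```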