
Address nodes in the INVAL state

When checking the state of a Slurm node, you may encounter nodes in the INVAL state.
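One way to list the affected nodes is to filter the per-node output of sinfo. The sketch below assumes the sinfo options -N (one line per node), -h (no header), and -o "%N %T" (node name and extended state). The helper name filter_inval is hypothetical, and because the exact state string may differ between Slurm versions, the match is case-insensitive.

```shell
#!/bin/sh
# Hypothetical helper: read "nodename state" lines on stdin and print the
# names of nodes whose state contains "inval" (case-insensitive).
filter_inval() {
    awk 'tolower($2) ~ /inval/ { print $1 }'
}

# Intended use on a cluster (assumes sinfo is on PATH):
#   sinfo -N -h -o "%N %T" | filter_inval | sort -u
```

The sort -u step is there because a node that belongs to several partitions appears once per partition in the -N listing.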

The INVAL state often indicates a mismatch between the node's definition in the slurm.conf file and what the slurmd daemon running on the Compute node reports when it registers with slurmctld. This may occur for multiple reasons:

  • Resource mismatch: The slurm.conf file declares a certain number of CPUs, amount of memory, or number of GPUs, but when the slurmd on that node starts up, it detects fewer of these resources. For example, the slurm.conf declares RealMemory=128000 (in megabytes, about 128 GB), but the node's slurmd reports only RealMemory=96000 (about 96 GB) after checking its system. Because RealMemory reflects the memory the operating system sees in total rather than the memory that is currently free, a gap like this usually points to a hardware or firmware cause, such as a failed memory module or memory reserved by the system, and not to memory used by other processes. A related case is that the hardware or the slurm.conf file was changed, but the daemons have not been restarted, so the registered and configured values no longer agree.
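  • Hardware feature mismatch: If the slurm.conf file declares hardware that the node's slurmd cannot find, such as Gres=gpu:4 on a node where fewer GPUs are detected, the node can go into the INVAL state. Plain Features= entries are administrator-defined labels and are generally not checked against the hardware.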
  • Networking and hostname issues: Problems with a node's hostname, IP address, or network connectivity can cause the slurmctld to receive invalid or incomplete registration information from slurmd, resulting in the INVAL state.

Troubleshoot an INVAL node

To troubleshoot an INVAL node, first examine the slurmctld logs on the controller node and the slurmd logs on the Compute node. These logs often contain explicit error messages detailing why the node's configuration is invalid.
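Log locations vary between installations, so it can help to ask Slurm where the logs are before searching them. The sketch below assumes that scontrol show config prints settings as "Key = value" lines and that the SlurmctldLogFile and SlurmdLogFile settings are present. The helper name config_value and the node name node01 are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show config` output on stdin and print
# the value of one setting, for example SlurmctldLogFile or SlurmdLogFile.
config_value() {
    awk -v key="$1" '$1 == key && $2 == "=" { print $3 }'
}

# Intended use on the controller node (node01 is a placeholder node name):
#   log=$(scontrol show config | config_value SlurmctldLogFile)
#   grep -i "node01" "$log" | tail -n 20
```

If the printed value is not a file path, for example because the daemon logs to syslog, search the system journal or syslog instead. Reading the log files may require elevated privileges.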

For a detailed report of the node's configuration, run the slurmd -C command on the Compute node.

Example
$ slurmd -C

Compare this output with the node's entry in your slurm.conf file. If there are any discrepancies, update the node's entry in the slurm.conf file to match the values that slurmd -C reports.
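The comparison can be sketched as a small shell helper that takes the NodeName line from slurm.conf and the first line printed by slurmd -C, and lists every numeric setting where the detected value is lower than the configured one. This is an illustration under stated assumptions, not a Slurm tool: the function name compare_node_lines and the sample values are hypothetical, keys are matched case-sensitively, and non-numeric values such as Gres=gpu:4 are skipped.

```shell
#!/bin/sh
# Hypothetical helper: compare numeric key=value pairs on two node lines.
#   $1 = NodeName line copied from slurm.conf
#   $2 = first line printed by `slurmd -C` on the Compute node
# Prints one line per key whose detected value is lower than configured.
compare_node_lines() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 {
            # Remember every configured key=value pair.
            for (i = 1; i <= NF; i++) { split($i, kv, "="); conf[kv[1]] = kv[2] }
        }
        NR == 2 {
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                k = kv[1]; v = kv[2]
                # Compare only keys present in both lines with numeric values.
                if ((k in conf) && conf[k] ~ /^[0-9]+$/ && v ~ /^[0-9]+$/ && v + 0 < conf[k] + 0)
                    printf "%s: configured=%s detected=%s\n", k, conf[k], v
            }
        }'
}

# Illustrative values only:
#   compare_node_lines \
#     "NodeName=node01 CPUs=16 RealMemory=128000 State=UNKNOWN" \
#     "NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=96000"
```

With the illustrative values in the comment, the helper reports only the RealMemory setting, because it is the only one where the detected value is below the configured value.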

After making changes, restart the slurmctld on the controller node and the slurmd daemon on the Compute node to allow them to re-register with the correct information.
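After the restart, it is worth confirming that the node has left the INVAL state. The sketch below assumes that scontrol show node prints a State= field among its key=value pairs. The helper name node_state and the node name node01 are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show node <name>` output on stdin and
# print the value of its State= field (first occurrence only).
node_state() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^State=/) { sub(/^State=/, "", $i); print $i; exit }
    }'
}

# Intended use (node01 is a placeholder node name):
#   scontrol show node node01 | node_state
```

If the node registers correctly but still shows as drained or down, running scontrol update NodeName=node01 State=RESUME is a common way to return it to service. Check the Reason field in the scontrol show node output first.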

For detailed instructions, see Restart the Slurm Controller and Restart the Slurm daemon.

Restart required

You must restart the slurmctld and slurmd for changes made to the slurm.conf file to take effect.