How to restart an elasticluster/SLURM cluster on ScienceCloud

Dear cloud users,

If you were running a SLURM elasticluster on ScienceCloud, here are the steps to recover:

  • Start the frontend first
    • if volume(s) were attached to it, re-attach and mount them as needed
    • restart the NFS service (see the example commands after this list)
  • Start compute nodes
    • Wait for the frontend to be up and running before you start the compute nodes (see the check commands after this list).  (This is mostly to avoid NIS/YP timeouts; NFS is fairly robust in handling server failures.)
  • If the SLURM service does not start on the compute nodes or the frontend, check:
    • /var/run/slurm and /var/run/slurm-llnl - they should exist and be writable by user 'slurm'; /var/run/slurm should be a symlink to /var/run/slurm-llnl (see the commands after this list)
    • Check /var/log/slurm/slurm*.log for failure information (see the log commands after this list)
    • if that does not help, re-run `elasticluster setup` (see the example after this list)
      • note that this step will also erase (some) manual customizations (e.g. additions to `slurm.conf` and other config files)
  • Important: stop whenever you encounter a problem or something does not match the steps described above
    • Allow S3IT to help by adding the S3IT elasticluster public SSH key to ~/.ssh/authorized_keys on your frontend (see the example after this list):
      • ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCmGFagrJOLwxU9sRKKE1P09W0LlVsAHYvWpAbeWSVDFBmfO2YZH/KyJyGZxvZ0fkZIHltPvYKrAoN7A8+QHS0GPOcWYA72tSVMGLFxWe18KrpjmqBnABJd/V9ViMCbUXBYGWk9k7tgeuFDMjguaNufzOx1QUN8wMaA6dPqG0QBip7vA45EGjQVBKTCsR7j4nnMbN9W/va3YL5gm9OR5W5uqeXSZUIQsi8j4fLZWmDVcUNkr8YQIYmBY4tgbQ3FayTlC+PmAFAIoBWCbM0emk6YBX77896ldQMq7KQ8tpys7evopECOWBDE2W9rWSxCAVeXNRHPJgHO7JtB4LeIdcxN gc3-user@elasticluster
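
For the frontend steps above, here is a minimal sketch of the commands involved. The device name (/dev/vdb1), the mount point (/data) and the NFS unit name are assumptions and may differ on your image; on RHEL/CentOS the NFS unit is usually 'nfs-server' instead of 'nfs-kernel-server':

    # check whether the volume is visible and whether it is already mounted
    lsblk
    # mount the volume if needed (device and mount point are examples)
    sudo mount /dev/vdb1 /data
    # restart the NFS server (unit name as on Debian/Ubuntu)
    sudo systemctl restart nfs-kernel-server
    # verify that the exports are being served again
    sudo exportfs -v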
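
Before starting the compute nodes, you can check from the frontend that NFS (and NIS, if your cluster uses it) are answering; a quick sketch, assuming a default setup:

    # list the file systems the frontend is exporting over NFS
    showmount -e localhost
    # check that a NIS/YP server is reachable (only if your cluster uses NIS)
    ypwhich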
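
To check and, if needed, recreate the SLURM run directories mentioned above:

    # both directories should exist and be writable by user 'slurm';
    # /var/run/slurm should be a symlink to /var/run/slurm-llnl
    ls -ld /var/run/slurm /var/run/slurm-llnl
    # recreate them if missing
    sudo mkdir -p /var/run/slurm-llnl
    sudo chown slurm:slurm /var/run/slurm-llnl
    sudo ln -sfn /var/run/slurm-llnl /var/run/slurm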
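
To inspect the SLURM logs and service status (unit names may vary with your SLURM packaging; slurmctld runs on the frontend, slurmd on the compute nodes):

    # show the most recent SLURM log entries
    sudo tail -n 50 /var/log/slurm/slurm*.log
    # check the service status
    sudo systemctl status slurmctld   # on the frontend
    sudo systemctl status slurmd      # on a compute node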
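
To re-run the elasticluster setup, execute the following on the machine where elasticluster is installed ('mycluster' is a placeholder for your cluster name):

    # re-apply the elasticluster playbooks to all nodes;
    # beware: this can erase manual changes such as edits to slurm.conf
    elasticluster setup mycluster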
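
To add the S3IT key, append the full 'ssh-rsa ...' line above to the authorized_keys file of the user S3IT should log in as; for example, on the frontend:

    # run as the user S3IT should use to log in; paste the full key line from above
    echo 'ssh-rsa AAAAB3Nza... gc3-user@elasticluster' >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys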

Do not hesitate to contact us in case the above procedure does not work.