Install Slurm

0 Overview

This guide walks through setting up Slurm on an OpenCHAMI cluster. It assumes you have already set up an OpenCHAMI cluster per Sections 1-2.6 of the OpenCHAMI Tutorial, and therefore assumes the rocky user on a Rocky Linux 9 system; substitute your normal user for rocky if you set up an OpenCHAMI cluster outside of the tutorial. The only other requirement is a webserver to serve a Slurm repo for the image builder to use, and this guide assumes Podman is present for that. This guide only walks through Slurm setup for a cluster with one head node and one compute node, but the setup is easily expanded to multiple compute nodes by updating the node list with ochami.

0.1 Prerequisites

Note

This guide assumes you have set up an OpenCHAMI cluster per Sections 1-2.6 in the OpenCHAMI tutorial.

0.2 Contents

1 Setup and Configure Slurm

Steps in this section occur on the head node created in the OpenCHAMI tutorial (or otherwise).

1.1 Setup Slurm Build/Installation as a Local Repository

Install version 0.5.18 of munge. Versions 0.5.0-0.5.17 have a significant security vulnerability, so it is important to use version 0.5.18 instead of the 0.5.13 package available through dnf for Rocky Linux 9. For more information, see: https://nvd.nist.gov/vuln/detail/CVE-2026-25506
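As a quick sanity check, a version comparison like the following sketch (using `sort -V`) can tell whether a given munge version predates the fix; the function name is our own, not part of any tooling:

```shell
# Sketch: succeed (exit 0) if the given munge version predates the 0.5.18 fix.
munge_is_vulnerable() {
    ver="$1"
    # sort -V orders version strings; if $ver sorts first and is not
    # 0.5.18 itself, it is older than the patched release.
    lowest=$(printf '%s\n%s\n' "$ver" 0.5.18 | sort -V | head -n1)
    [ "$lowest" = "$ver" ] && [ "$ver" != "0.5.18" ]
}

munge_is_vulnerable 0.5.13 && echo "0.5.13 is vulnerable"
# prints "0.5.13 is vulnerable"
```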

Change into the working directory (created in Section 1.1 of the Tutorial), so that any files that are created are put there.

cd /opt/workdir

Grab munge version 0.5.18 release tarball from GitHub:

curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.18.tar.xz -o munge-0.5.18.tar.xz

Install the RPM build tools, create a source RPM from the tarball, install its build dependencies, and build the binary packages:

sudo dnf install -y rpm-build rpmdevtools

rpmbuild -ts munge-0.5.18.tar.xz --define "_topdir /opt/workdir/rpmbuild"

sudo dnf builddep -y /opt/workdir/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm

rpmbuild -tb munge-0.5.18.tar.xz --define "_topdir /opt/workdir/rpmbuild"

Install rpms created by rpmbuild:

sudo rpm --install --verbose --force \
    rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm \
    rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm \
    rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm \
    rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm \
    rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm \
    rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm

Check that munge was installed correctly:

munge --version

The output should be:

munge-0.5.18 (2026-02-10)

Install the Slurm build prerequisites compatible with the Rocky 9 OS:

sudo dnf -y update && \
sudo dnf clean all && \
sudo dnf -y install epel-release && \
sudo dnf -y install dnf-plugins-core && \
sudo dnf config-manager --set-enabled devel && \
sudo dnf config-manager --set-enabled crb && \
sudo dnf groupinstall -y 'Development Tools' && \
sudo dnf install -y createrepo freeipmi freeipmi-devel dbus-devel gtk2-devel hdf5 hdf5-devel http-parser-devel \
               hwloc hwloc-devel jq json-c-devel libaec libconfuse libcurl-devel libevent-devel \
               libyaml libyaml-devel lua-devel lua-filesystem lua-json lua-lpeg lua-posix lua-term mariadb mariadb-devel \
               ncurses-devel numactl numactl-devel oniguruma openssl-devel pam-devel \
               perl-DBI perl-ExtUtils-MakeMaker perl-Switch pigz python3 python3-devel readline-devel \
               lsb_release rrdtool rrdtool-devel tcl tcl-devel ucx ucx-cma ucx-devel ucx-ib wget \
               lz4-devel s2n-tls-devel libjwt-devel librdkafka-devel && \
sudo dnf clean all

Create a build script for Slurm 24.05.5 and PMIX 4.2.9-1:

Note

This guide installs Slurm 24.05.5 and PMIX 4.2.9-1 to ensure compatibility. Other versions can be installed instead, but make sure to check version compatibility first.

Edit as normal user: /opt/workdir/build.sh

/opt/workdir/build.sh
#!/usr/bin/env bash
SLURMVERSION=${1:-24.05.5}
PMIXVERSION=${2:-4.2.9-1}
ELRELEASE=${3:-el9} # Rocky 9

subversions=( ${PMIXVERSION//-/ } )
pmixmajor=${subversions[0]}
export LC_ALL="C"
OSVERSION=$(lsb_release -r | gawk '{print $2}')
CDIR=$(pwd)
SDIR="slurm/$OSVERSION/$SLURMVERSION"
mkdir -p ${SDIR}
if [[ -e ${SDIR}/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm ]]; then
        echo "The RPM of PMIX version ${PMIXVERSION} is already available."
else
        cd slurm
        wget https://github.com/openpmix/openpmix/releases/download/v${pmixmajor}/pmix-${PMIXVERSION}.src.rpm || {
                echo "$? pmix-${PMIXVERSION}.src.rpm not downloaded"
                exit 1
        }
        rpmbuild --rebuild ./pmix-${PMIXVERSION}.src.rpm &> rpmbuild-pmix-${PMIXVERSION}.log || {
                echo "$? pmix-${PMIXVERSION}.src.rpm build failed, review rpmbuild-pmix-${PMIXVERSION}.log"
                exit 1
        }
        cd ${CDIR}
        mv /root/rpmbuild/RPMS/x86_64/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm ${SDIR}
        dnf -y install ${SDIR}/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm
fi
# Note: a glob inside [[ -e ... ]] is not expanded, so use compgen to test
# whether any matching RPMs already exist.
if compgen -G "${SDIR}/slurm-${SLURMVERSION}-*.rpm" > /dev/null; then
        echo "The RPMs of slurm ${SLURMVERSION} are already available."
else
        cd slurm
        wget https://download.schedmd.com/slurm/slurm-${SLURMVERSION}.tar.bz2 || wget http://www.schedmd.com/download/archive/slurm-${SLURMVERSION}.tar.bz2 || {
                echo "$? slurm-${SLURMVERSION}.tar.bz2 not downloaded"
                exit 1
        }
        rpmbuild -ta --with pmix --with lua --with pam --with mysql --with ucx --with slurmrestd slurm-${SLURMVERSION}.tar.bz2 &> rpmbuild-slurm-${SLURMVERSION}.log || {
                echo "$? slurm-${SLURMVERSION}.tar.bz2 build failed, review rpmbuild-slurm-${SLURMVERSION}.log"
                exit 1
        }
        grep 'configure: WARNING:' rpmbuild-slurm-${SLURMVERSION}.log
        cd ${CDIR}
        mv /root/rpmbuild/RPMS/x86_64/slurm*-${SLURMVERSION}-*.rpm ${SDIR}
fi

Adjust the permissions of the build script so that it is executable, and execute it with root privileges:

chmod 755 /opt/workdir/build.sh
sudo /opt/workdir/build.sh

Note

The following warnings are normal:

configure: WARNING: unable to locate libnvidia-ml.so and/or nvml.h
configure: WARNING: unable to locate librocm_smi64.so and/or rocm_smi.h
configure: WARNING: unable to locate libze_loader.so and/or ze_api.h
configure: WARNING: HPE Slingshot: unable to locate libcxi/libcxi.h
configure: WARNING: unable to build man page html files without man2html

Copy the Slurm packages to the desired location to create the local repository:

sudo mkdir -p /srv/repo/rocky/9/x86_64/
sudo cp -r /opt/workdir/slurm/9.7/24.05.5 /srv/repo/rocky/9/x86_64/slurm-24.05.5

Create the local repository (this will be used for installation and images later):

sudo createrepo /srv/repo/rocky/9/x86_64/slurm-24.05.5

The output should be:

Directory walk started
Directory walk done - 15 packages
Temporary output repo path: /srv/repo/rocky/9/x86_64/slurm-24.05.5/.repodata/
Preparing sqlite DBs
Pool started (with 5 workers)
Pool finished

1.2 Configure Slurm and Slurm Services

Create user and group 'slurm' with the specified UID/GID:

SLURMID=666
sudo groupadd -g $SLURMID slurm
sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm

Note

The following warning is expected and can be ignored: the 'slurm' user is a system service account, so it may have a UID below 1000.

useradd warning: slurm's uid 666 outside of the UID_MIN 1000 and UID_MAX 60000 range.

Update the UID and GID of the 'munge' user and group to 616, fix the directory ownership, create a munge key, and restart the munge service:

# Update UID and GID
sudo usermod -u 616 munge
sudo groupmod -g 616 munge

# Fix user and group ownership
sudo chown munge:munge /var/log/munge/
sudo chown munge:munge /var/lib/munge/
sudo chown munge:munge /etc/munge/

# Create munge key
sudo -u munge /usr/sbin/mungekey -v

# Start munge again
sudo systemctl enable --now munge

Install MariaDB:

sudo dnf -y install mariadb-server

Tune MariaDB with the Slurm-recommended options for the head node where MariaDB will be running:

cat <<EOF | sudo tee /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=5120M
innodb_log_file_size=512M
innodb_lock_wait_timeout=900
max_allowed_packet=16M
EOF

Note

We are assigning 5GB to the innodb_buffer_pool_size. The pool size should be 5-50% of the available memory of the head node and at least 4GB.
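To pick a value for a machine with a different amount of memory, a small calculation like this sketch (5% of MemTotal, floored at 4 GiB) stays within the guideline above:

```shell
# Sketch: size the buffer pool at ~5% of system RAM, but never below 4 GiB,
# per the 5-50%-of-memory / at-least-4GB guideline.
mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pool_mib=$(( mem_kib / 1024 * 5 / 100 ))
if [ "$pool_mib" -lt 4096 ]; then
    pool_mib=4096
fi
echo "innodb_buffer_pool_size=${pool_mib}M"
```

Raise the 5% factor toward 50% on a head node dedicated to the database.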

Enable and start the MariaDB service; since this is a single-node cluster, we aren't enabling High Availability:

sudo systemctl enable --now mariadb

Secure the MariaDB installation with a strong root password. Use pwgen to generate one, and make sure to store it securely; you will need it while configuring MariaDB and again when creating the database Slurm uses on the head node:

sudo dnf -y install pwgen
export SQL_PWORD="$(pwgen 20 1)"
echo "${SQL_PWORD}" # copy output for interactive prompts and so you can store it somewhere securely

sudo mysql_secure_installation

MariaDB setup/settings should be done as follows:

Enter current password for root (enter for none): # enter user password (e.g. "rocky" if following tutorial)

Switch to unix_socket authentication [Y/n] Y

Change the root password? [Y/n] Y
New password: # use the password from pwgen
Re-enter new password: # use the password from pwgen

Remove anonymous users? [Y/n] n

Disallow root login remotely? [Y/n] Y

Remove test database and access to it? [Y/n] n

Reload privilege tables now? [Y/n] Y

Create the database and grant access to localhost and the head node. When prompted with “Enter password:”, use the password you generated with pwgen in the step above:

cat <<EOF | mysql -u root -p
create database slurm_acct_db;
grant all on slurm_acct_db.* to slurm@'localhost' identified by "${SQL_PWORD}";
grant all on slurm_acct_db.* to slurm@'demo.openchami.cluster' identified by "${SQL_PWORD}";
grant all on slurm_acct_db.* to slurm@'demo' identified by "${SQL_PWORD}";
exit
EOF

Install a few more required dependencies:

sudo dnf -y install jq libconfuse numactl parallel perl-DBI perl-Switch

Set up the directory structure for the Slurm database and controller daemon services:

sudo mkdir -p /var/spool/slurmctld /var/log/slurm /run/slurm
sudo chown -R slurm. /var/spool/slurmctld /var/log/slurm /run/slurm
echo "d /run/slurm 0755 slurm slurm -" | sudo tee /usr/lib/tmpfiles.d/slurm.conf

1.3 Install Slurm and Setup Configuration Files

Add the Slurm repo created earlier so packages are installed from it (this ensures we get the correct package versions):

# Create local repo file
SLURMVERSION=24.05.5

cat <<EOF | sudo tee /etc/yum.repos.d/slurm-local.repo
[slurm-local]
name=Slurm ${SLURMVERSION} - Local
baseurl=file:///srv/repo/rocky/9/x86_64/slurm-${SLURMVERSION}
gpgcheck=0
enabled=1
countme=1
EOF

# Install from local repo file
sudo dnf -y install slurm slurm-contribs slurm-example-configs slurm-libpmi slurm-pam_slurm slurm-perlapi slurm-slurmctld slurm-slurmdbd pmix

Create configuration files by copying the example files, and then modify the directory and file ownership:

# Copy configuration files
sudo cp -p /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
sudo cp -p /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf

# Set directory and file ownership to slurm
sudo chown -R slurm. /etc/slurm/

Modify the slurmdbd config. You will need the pwgen-generated password from the MariaDB setup above for this section:

DBHOST=demo
DBPASSWORD="${SQL_PWORD}"
SLURMDBHOST1=demo

sudo sed -i "s|DbdAddr.*|DbdAddr=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|DbdHost.*|DbdHost=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurmdbd.conf

sudo sed -i "s|#StorageHost.*|StorageHost=${DBHOST}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|#StoragePort.*|StoragePort=3306|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|StoragePass.*|StoragePass=${DBPASSWORD}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|SlurmUser.*|SlurmUser=slurm|g" /etc/slurm/slurmdbd.conf

sudo sed -i "s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.conf
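The sed idiom used above (anchor on the key name, rewrite the whole line, dropping any leading `#`) can be seen on a throwaway mock file rather than the real slurmdbd.conf:

```shell
# Demonstration on a mock config file, not the real slurmdbd.conf.
tmp=$(mktemp)
printf '#StorageHost=localhost\nSlurmUser=root\n' > "$tmp"
# Same idiom as above: match the key (commented or not), rewrite the line.
sed -i 's|#StorageHost.*|StorageHost=demo|g; s|SlurmUser.*|SlurmUser=slurm|g' "$tmp"
result=$(cat "$tmp")
rm -f "$tmp"
echo "$result"
```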

The environment variable we set earlier to store the password for SQL should now be unset for security:

unset SQL_PWORD

Create the Slurm config file, which will be used by slurmctld. Note that you may need to update the NodeName info depending on the configuration of your compute node.

Note

If the head node is in a VM (see Head Node: Using Virtual Machine), the SlurmctldHost will be head instead of demo.

Edit the Slurm config file as root: /etc/slurm/slurm.conf

Add job container config file to Slurm config directory:

SLURMTMPDIR=/lscratch

cat <<EOF | sudo tee /etc/slurm/job_container.conf
# Job /tmp on a local volume mounted on ${SLURMTMPDIR}
# /dev/shm has special handling, and instead of a bind mount is always a fresh tmpfs filesystem.
BasePath=${SLURMTMPDIR}
AutoBasePath=true
Shared=true
EOF

Configure the hosts file with addresses for both the head node and the compute node:

1.4 Make a Local Slurm Repository and Serve it with Nginx

Create configuration file to mount into Nginx container:

Edit as normal user: /opt/workdir/nginx.conf

/opt/workdir/nginx.conf
user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /run/nginx.pid;


events {
    worker_connections  1024;
}


http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;

    server {
        # configuration of HTTP virtual server 
        location /slurm-24.05.5 {
            # configuration for processing URIs for local Slurm repo
            # serve static files from this path 
            # such that a request for /slurm-24.05.5/repodata/repomd.xml will be served /usr/share/nginx/html/slurm-24.05.5/repodata/repomd.xml
            root /usr/share/nginx/html;
        }
    }
}

Use Podman to run Nginx in a container that has the local Slurm repository and the Nginx configuration file mounted into it:

podman run --name serve-slurm \
    -v /opt/workdir/nginx.conf:/etc/nginx/nginx.conf \
    --mount type=bind,source=/srv/repo/rocky/9/x86_64/slurm-24.05.5,target=/usr/share/nginx/html/slurm-24.05.5,readonly \
    -p 8080:80 -d nginx

Check everything is working by grabbing the repodata file from inside the head node:

curl http://localhost:8080/slurm-24.05.5/repodata/repomd.xml

The output should be:

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1770960915</revision>
  <data type="primary">
    <checksum type="sha256">4670c00aed4cc64e542e8b76f4f59ec4dd333a2e02258ddab5b7604874915dff</checksum>
    <open-checksum type="sha256">04f66940b8479413f57cf15aa66d56624aede301f064356ee667ccf4594470ef</open-checksum>
    <location href="repodata/4670c00aed4cc64e542e8b76f4f59ec4dd333a2e02258ddab5b7604874915dff-primary.xml.gz"/>
    <timestamp>1770960914</timestamp>
    <size>5336</size>
    <open-size>33064</open-size>
  </data>
  <data type="filelists">
    <checksum type="sha256">11b43e8e70d418dbe78a8c064ca42e18d63397729ba0710323034597f681d0a4</checksum>
    <open-checksum type="sha256">1f2b8e754a2db5c26557ad2a7b9c8b6a210115a4263fb153bc0445dc8210b59c</open-checksum>
    <location href="repodata/11b43e8e70d418dbe78a8c064ca42e18d63397729ba0710323034597f681d0a4-filelists.xml.gz"/>
    <timestamp>1770960914</timestamp>
    <size>11154</size>
    <open-size>68224</open-size>
  </data>
  <data type="other">
    <checksum type="sha256">a7f25375920bf5d30d9de42a6f4aeaa8105b1150bc9bef1440e700369bcdcf53</checksum>
    <open-checksum type="sha256">da1da29e2d02a626986c3647032c175e0cb768d4d643c9020e2ccc343ced93e4</open-checksum>
    <location href="repodata/a7f25375920bf5d30d9de42a6f4aeaa8105b1150bc9bef1440e700369bcdcf53-other.xml.gz"/>
    <timestamp>1770960914</timestamp>
    <size>1229</size>
    <open-size>3354</open-size>
  </data>
  <data type="primary_db">
    <checksum type="sha256">34f7acb86f91ab845250ed939181b88acb0be454e9c42eb99cb871e3241f75e4</checksum>
    <open-checksum type="sha256">038901ed7c43b991becd931370b539c29ad5c7abffefc1ce6fc20cb8e1c1b7c7</open-checksum>
    <location href="repodata/34f7acb86f91ab845250ed939181b88acb0be454e9c42eb99cb871e3241f75e4-primary.sqlite.bz2"/>
    <timestamp>1770960915</timestamp>
    <size>12132</size>
    <open-size>131072</open-size>
    <database_version>10</database_version>
  </data>
  <data type="filelists_db">
    <checksum type="sha256">1912b17f136f28e892c9591e34abc1c8ef5b466df8eed7d6d1c5adadb200c6ad</checksum>
    <open-checksum type="sha256">9eb023458e4570a8c3d9407e24ee52a94befc93785e71b1f72a5d90f314762e2</open-checksum>
    <location href="repodata/1912b17f136f28e892c9591e34abc1c8ef5b466df8eed7d6d1c5adadb200c6ad-filelists.sqlite.bz2"/>
    <timestamp>1770960915</timestamp>
    <size>15917</size>
    <open-size>73728</open-size>
    <database_version>10</database_version>
  </data>
  <data type="other_db">
    <checksum type="sha256">2872ebc347c2e5fe166907ba8341dc10ef9d0419261fac253cb6bab0d1eb046f</checksum>
    <open-checksum type="sha256">5db7c12e76bde1a6b5739ad5c52481633d1dd87599e86ce4d84bae8fe4504db1</open-checksum>
    <location href="repodata/2872ebc347c2e5fe166907ba8341dc10ef9d0419261fac253cb6bab0d1eb046f-other.sqlite.bz2"/>
    <timestamp>1770960915</timestamp>
    <size>1940</size>
    <open-size>24576</open-size>
    <database_version>10</database_version>
  </data>
</repomd>

Create the compute Slurm image config file (uses the base image created in the tutorial as the parent layer):

Warning

When writing YAML, it’s important to be consistent with spacing. It is recommended to use spaces for all indentation instead of tabs.

When pasting, you may have to configure your editor to not apply indentation rules (:set paste in Vim, :set nopaste to switch back).
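A quick check like this sketch can flag tab characters in a YAML file before feeding it to the image builder (the file path and sample content are illustrative only):

```shell
# Sketch: flag tab characters in a YAML file's indentation.
tab=$(printf '\t')
printf 'options:\n\tname: bad-indent\n' > /tmp/check-tabs.yaml   # sample file
if grep -qn "$tab" /tmp/check-tabs.yaml; then
    result="tabs found"
else
    result="clean"
fi
echo "$result"
rm -f /tmp/check-tabs.yaml
```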

Edit as root: /etc/openchami/data/images/compute-slurm-rocky9.yaml

/etc/openchami/data/images/compute-slurm-rocky9.yaml
options:
  layer_type: base
  name: compute-slurm
  publish_tags:
    - 'rocky9'
  pkg_manager: dnf
  gpgcheck: False
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'

  publish_s3: 'http://demo.openchami.cluster:7070'
  s3_prefix: 'compute/slurm/'
  s3_bucket: 'boot-images'

repos:
  - alias: 'Epel9'
    url: 'https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/'
    gpg: 'https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9'
  - alias: 'Slurm'
    url: 'http://localhost:8080/slurm-24.05.5'

packages:
  - boxes
  - figlet
  - git
  - nfs-utils
  - tcpdump
  - traceroute
  - vim
  - curl
  - rpm-build
  - shadow-utils
  - pwgen
  - jq
  - libconfuse
  - numactl
  - parallel
  - perl-DBI
  - slurm-24.05.5
  - pmix-4.2.9
  - slurm-contribs-24.05.5
  - slurm-devel-24.05.5
  - slurm-example-configs-24.05.5
  - slurm-libpmi-24.05.5
  - slurm-pam_slurm-24.05.5
  - slurm-perlapi-24.05.5
  - slurm-sackd-24.05.5
  - slurm-slurmctld-24.05.5
  - slurm-slurmd-24.05.5
  - slurm-slurmdbd-24.05.5
  - slurm-slurmrestd-24.05.5
  - slurm-torque-24.05.5

cmds:
  - cmd: 'curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.18.tar.xz -o munge-0.5.18.tar.xz'
  - cmd: 'rpmbuild -ts munge-0.5.18.tar.xz'
  - cmd: 'dnf builddep -y /root/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm'
  - cmd: 'rpmbuild -tb munge-0.5.18.tar.xz'
  - cmd: 'cd /root/rpmbuild'
  - cmd: 'rpm --install --verbose --force /root/rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm'
  - cmd: 'dnf remove -y munge-libs-0.5.13-13.el9 munge-0.5.13-13.el9'

Run a Podman container to execute the image build command. The S3_ACCESS and S3_SECRET tokens are set in the tutorial.

podman run \
  --rm \
  --device /dev/fuse \
  --network host \
  -e S3_ACCESS=${ROOT_ACCESS_KEY} \
  -e S3_SECRET=${ROOT_SECRET_KEY} \
  -v /etc/openchami/data/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml \
  ghcr.io/openchami/image-build-el9:v0.1.2 \
  image-build \
    --config config.yaml \
    --log-level DEBUG

Note

If you have already aliased the image build command per the tutorial, you can instead run:

build-image /etc/openchami/data/images/compute-slurm-rocky9.yaml

Check that the images were built:

s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4- | grep slurm

The output should be:

1615M  s3://boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9
  84M  s3://boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.20.1.el9_7.x86_64.img
  14M  s3://boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64

1.5 Configure the Boot Script Service and Cloud-Init

Get a fresh access token for ochami:

export DEMO_ACCESS_TOKEN=$(sudo bash -lc 'gen_access_token')

Create payload for boot script service with URIs for slurm compute boot artefacts:

sudo mkdir -p /etc/openchami/data/boot/bss

URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | sed 's-s3://-http://172.16.0.254:7070/-' | xargs)
URI_IMG=$(echo "$URIS" | cut -d' ' -f1)
URI_INITRAMFS=$(echo "$URIS" | cut -d' ' -f2)
URI_KERNEL=$(echo "$URIS" | cut -d' ' -f3)
cat <<EOF | sudo tee /etc/openchami/data/boot/bss/boot-compute-slurm-rocky9.yaml
---
kernel: '${URI_KERNEL}'
initrd: '${URI_INITRAMFS}'
params: 'nomodeset ro root=live:${URI_IMG} ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init'
macs:
  - 52:54:00:be:ef:01
EOF
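The positional `cut` parsing above relies on s3cmd's sorted listing: the squashfs image path (`compute/...`) sorts before the `efi-images/.../initramfs` and `.../vmlinuz` entries. With mock URLs (hypothetical host and paths), the split looks like:

```shell
# Mock listing, already space-joined by xargs as in the real command.
URIS='http://head/boot-images/compute/slurm/img http://head/boot-images/efi-images/compute/slurm/initramfs.img http://head/boot-images/efi-images/compute/slurm/vmlinuz'
URI_IMG=$(echo "$URIS" | cut -d' ' -f1)        # squashfs image sorts first
URI_INITRAMFS=$(echo "$URIS" | cut -d' ' -f2)  # then initramfs
URI_KERNEL=$(echo "$URIS" | cut -d' ' -f3)     # then vmlinuz
echo "$URI_IMG"
echo "$URI_INITRAMFS"
echo "$URI_KERNEL"
```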

Set BSS parameters:

ochami bss boot params set -f yaml -d @/etc/openchami/data/boot/bss/boot-compute-slurm-rocky9.yaml

Check the BSS boot parameters were added:

ochami bss boot params get -F yaml

The output should be:

- cloud-init:
    meta-data: null
    phone-home:
        fqdn: ""
        hostname: ""
        instance_id: ""
        pub_key_dsa: ""
        pub_key_ecdsa: ""
        pub_key_rsa: ""
    user-data: null
  initrd: http://172.16.0.254:9000/boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.20.1.el9_7.x86_64.img
  kernel: http://172.16.0.254:9000/boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64
  macs:
    - 52:54:00:be:ef:01
  params: nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init

Create a new directory for the cloud-init configuration:

sudo mkdir -p /etc/openchami/data/cloud-init
cd /etc/openchami/data/cloud-init

Create a new SSH key on the head node, pressing Enter at all of the prompts:

ssh-keygen -t ed25519

The new key that was generated can be found in ~/.ssh/id_ed25519.pub. This key will need to be used in the cloud-init meta-data configured below.

Set up the cloud-init configuration by creating ci-defaults.yaml:

cat <<EOF | sudo tee /etc/openchami/data/cloud-init/ci-defaults.yaml
---
base-url: "http://172.16.0.254:8081/cloud-init"
cluster-name: "demo"
nid-length: 2
public-keys:
  - "$(cat ~/.ssh/id_ed25519.pub)"
short-name: "de"
EOF
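With short-name "de" and nid-length 2, node hostnames are derived by appending a zero-padded node ID to the short name (a sketch of the assumed OpenCHAMI naming scheme; verify against your cluster):

```shell
# Sketch (assumed naming): short-name + zero-padded NID of nid-length digits.
short_name=de
nid=1
node_host=$(printf '%s%02d' "$short_name" "$nid")   # nid-length=2 -> %02d
echo "$node_host"   # de01
```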

Then, set the cloud-init defaults using the ochami CLI:

ochami cloud-init defaults set -f yaml -d @/etc/openchami/data/cloud-init/ci-defaults.yaml

Verify that these values were set with:

ochami cloud-init defaults get -F json-pretty

The output should be:

{
  "base-url": "http://172.16.0.254:8081/cloud-init",
  "cluster-name": "demo",
  "nid-length": 2,
  "public-keys": [
    "<YOUR SSH KEY>"
  ],
  "short-name": "de"
}

Configure cloud-init for compute group:

Edit as root: /etc/openchami/data/cloud-init/ci-group-compute.yaml

/etc/openchami/data/cloud-init/ci-group-compute.yaml
- name: compute
  description: "compute config"
  file:
    encoding: plain
    content: |
      ## template: jinja
      #cloud-config
      merge_how:
      - name: list
        settings: [append]
      - name: dict
        settings: [no_replace, recurse_list]
      users:
        - name: root
          ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
      disable_root: false            

Now, set this configuration for the compute group:

ochami cloud-init group set -f yaml -d @/etc/openchami/data/cloud-init/ci-group-compute.yaml

Check that it got added with:

ochami cloud-init group get config compute

The cloud-config content created within the YAML above should be printed out:

## template: jinja
#cloud-config
merge_how:
- name: list
  settings: [append]
- name: dict
  settings: [no_replace, recurse_list]
users:
  - name: root
    ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
disable_root: false

ochami has basic per-group template rendering available that can be used to check that the Jinja2 renders properly for a node. Check it for the first compute node (x1000c0s0b0n0):

ochami cloud-init group render compute x1000c0s0b0n0

Note

This feature requires that impersonation is enabled with cloud-init. Check and make sure that the IMPERSONATION environment variable is set in /etc/openchami/configs/openchami.env.

The SSH key that was created above should appear in the config:

#cloud-config
merge_how:
- name: list
  settings: [append]
- name: dict
  settings: [no_replace, recurse_list]
users:
  - name: root
    ssh_authorized_keys: ['<SSH_KEY>']

1.6 Boot the Compute Node with the Slurm Compute Image

Boot the compute1 compute node VM from the compute Slurm image:

Note

If the head node is in a VM (see Head Node: Using Virtual Machine), make sure to run the virt-install command on the host!

Note

If you receive the following error:

ERROR Failed to open file '/usr/share/OVMF/OVMF_VARS.fd': No such file or directory

Repeat the command, but replace OVMF_VARS.fd with OVMF_VARS_4M.fd and replace OVMF_CODE.secboot.fd with OVMF_CODE_4M.secboot.fd.

If this still fails, list the files under /usr/share/OVMF to check their names, as some distros store them under variant names.

Watch it boot. First, it should PXE:

>>Start PXE over IPv4.
  Station IP address is 172.16.0.1

  Server IP address is 172.16.0.254
  NBP filename is ipxe-x86_64.efi
  NBP filesize is 1079296 Bytes
 Downloading NBP file...

  NBP file downloaded successfully.
BdsDxe: loading Boot0001 "UEFI PXEv4 (MAC:525400BEEF01)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(525400BEEF01,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
BdsDxe: starting Boot0001 "UEFI PXEv4 (MAC:525400BEEF01)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(525400BEEF01,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
iPXE initialising devices...
autoexec.ipxe... Not found (https://ipxe.org/2d12618e)



iPXE 1.21.1+ (ge9a2) -- Open Source Network Boot Firmware -- https://ipxe.org
Features: DNS HTTP HTTPS iSCSI TFTP VLAN SRP AoE EFI Menu

Then, we should see it get its boot script from TFTP, then BSS (the /boot/v1 URL), then download its kernel/initramfs and boot into Linux.

Configuring (net0 52:54:00:be:ef:01)...... ok
tftp://172.16.0.254:69/config.ipxe... ok
Booting from http://172.16.0.254:8081/boot/v1/bootscript?mac=52:54:00:be:ef:01
http://172.16.0.254:8081/boot/v1/bootscript... ok
http://172.16.0.254:7070/boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.24.1.el9_7.x86_64... ok
http://172.16.0.254:7070/boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.24.1.el9_7.x86_64.img... ok

During Linux boot, output should indicate that the SquashFS image gets downloaded and loaded.

[    2.169210] dracut-initqueue[545]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[    2.170532] dracut-initqueue[545]:                                  Dload  Upload   Total   Spent    Left  Speed
100 1356M  100 1356M    0     0  1037M      0  0:00:01  0:00:01 --:--:-- 1038M
[    3.627908] squashfs: version 4.0 (2009/01/31) Phillip Lougher

Once the PXE boot process is done, detach from the VM with ctrl+]. Log back into the virsh console if desired with virsh console compute1.

Tip

If the VM installation fails for any reason, it can be destroyed and undefined so that the install command can be run again.

  1. Shut down (“destroy”) the VM:
    sudo virsh destroy compute1
  2. Undefine the VM:
    sudo virsh undefine --nvram compute1
  3. Rerun the virt-install command above.

Alternatively, if you want to reboot the compute node VM with an updated image, do the following:

sudo virsh destroy compute1
sudo virsh start --console compute1

1.7 Configure and Start Slurm in the Compute Node

Login as root to the compute node, ignoring its host key:

Note

If using a VM head node, login from there. Else, login from host.

ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.16.0.1 

Check that the munge and slurm packages were installed from the correct sources (e.g. Slurm packages should come from the local Slurm repo rather than the distro repos):

dnf list installed | grep -e munge -e slurm

The output should be:

munge.x86_64                                   0.5.18-1.el9                     @System                                            
munge-debuginfo.x86_64                         0.5.18-1.el9                     @System                                            
munge-debugsource.x86_64                       0.5.18-1.el9                     @System                                            
munge-devel.x86_64                             0.5.18-1.el9                     @System                                            
munge-libs.x86_64                              0.5.18-1.el9                     @System                                            
munge-libs-debuginfo.x86_64                    0.5.18-1.el9                     @System                                            
pmix.x86_64                                    4.2.9-1.el9                      @8080_slurm-24.05.5                                
slurm.x86_64                                   24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-contribs.x86_64                          24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-devel.x86_64                             24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-example-configs.x86_64                   24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-libpmi.x86_64                            24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-pam_slurm.x86_64                         24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-perlapi.x86_64                           24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-sackd.x86_64                             24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-slurmctld.x86_64                         24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-slurmd.x86_64                            24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-slurmdbd.x86_64                          24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-slurmrestd.x86_64                        24.05.5-1.el9                    @8080_slurm-24.05.5                                
slurm-torque.x86_64                            24.05.5-1.el9                    @8080_slurm-24.05.5 

Note

If there is a version 0.5.13 of munge currently installed and present in the output from the above command, remove it to ensure that version 0.5.18 is used.

dnf remove -y munge-libs-0.5.13-<version> munge-0.5.13-<version>

Create a Slurm config file on the compute node that is identical to the head node's. Note that you may need to update the NodeName line depending on your compute node's hardware configuration:

Note

If the head node is in a VM (see Head Node: Using Virtual Machine), the SlurmctldHost will be head instead of demo.

Edit the Slurm config file as root: /etc/slurm/slurm.conf
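If you need a starting point rather than a verbatim copy from the head node, a minimal sketch of slurm.conf for this one-head, one-compute layout might look like the following. The hostnames (demo, de01) and the hardware values on the NodeName line are assumptions and must match your cluster:

```ini
# Minimal example slurm.conf -- values below are illustrative assumptions
ClusterName=demo
SlurmctldHost=demo
SlurmUser=slurm
AuthType=auth/munge
ProctrackType=proctrack/cgroup
JobContainerType=job_container/tmpfs
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# Hardware values must match the compute node (see slurmd's startup log)
NodeName=de01 CPUs=1 RealMemory=3892 State=UNKNOWN
PartitionName=main Nodes=de01 Default=YES MaxTime=INFINITE State=UP
```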

Configure the hosts file with addresses for both the head node and the compute node:
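For example, appending entries like these to /etc/hosts (the head node name and address here are assumptions; the compute node address matches the one used elsewhere in this guide):

```
# /etc/hosts additions (example; substitute your head node's real address)
172.16.0.254  demo
172.16.0.1    de01
```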

Create the Slurm user on the compute node:

SLURMID=666
groupadd -g $SLURMID slurm
useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm
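You can verify the user was created as expected (666 is the arbitrary SLURMID chosen above):

```shell
id slurm   # expect: uid=666(slurm) gid=666(slurm) groups=666(slurm)
```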

Update Slurm file and directory ownership:

chown -R slurm:slurm /etc/slurm/
chown -R slurm:slurm /var/lib/slurm

Note

Use find / -name "slurm" to make sure everything that needs its ownership changed is identified. Note that not all results need ownership modified, such as directories under /run/, /usr/, or /var/!

Create the directory /var/log/slurm, as it doesn't exist yet, and set its ownership to the slurm user:

mkdir /var/log/slurm
chown slurm:slurm /var/log/slurm

Create a job_container.conf file that matches the one on the head node:

SLURMTMPDIR=/lscratch

cat <<EOF | sudo tee /etc/slurm/job_container.conf
# Per-job /tmp directories on a local volume mounted at ${SLURMTMPDIR}
# /dev/shm has special handling, and instead of a bind mount is always a fresh tmpfs filesystem.
BasePath=${SLURMTMPDIR}
AutoBasePath=true
Shared=true
EOF

Update ownership of the job container config file:

chown slurm:slurm /etc/slurm/job_container.conf

On the compute node, the munge UID is 991 and the GID is 990; change both to 616 to match the head node's UID/GID:

usermod -u 616 munge
groupmod -g 616 munge

Note

If you get the following error: usermod: user munge is currently used by process <PID>

Kill the process with kill -15 <PID>, then repeat the above two commands.

Update munge file/directory ownership:

find / -mount -writable -type d -uid 991 -exec chown -R munge:munge \{\} \;

Copy the munge key from the head node to the compute node.

Inside the head node:

cd ~
sudo cp /etc/munge/munge.key ./
sudo chown "$(id -u):$(id -g)" munge.key
scp ./munge.key root@172.16.0.1:~/

Inside the compute node:

mv munge.key /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 600 /etc/munge/munge.key   # munged refuses keys readable by other users

Note

In the case of an error about “Offending ECDSA key in ~/.ssh/known_hosts:3”, remove the compute node from the known hosts file and try the ‘scp’ command again:

ssh-keygen -R 172.16.0.1

Alternatively, setup an ignore.conf file per Section 2.8.3 of the tutorial, to prevent this issue.

Continuing inside the compute node, setup and start the services for Slurm.

Enable and start munge service:

systemctl enable munge.service
systemctl start munge.service
systemctl status munge.service

The output should be:

● munge.service - MUNGE authentication service
     Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
     Active: active (running) since Wed 2026-02-04 00:55:55 UTC; 1 week 2 days ago
       Docs: man:munged(8)
   Main PID: 1451 (munged)
      Tasks: 4 (limit: 24335)
     Memory: 2.2M (peak: 2.5M)
        CPU: 4.710s
     CGroup: /system.slice/munge.service
             └─1451 /usr/sbin/munged

Feb 04 00:55:55 de01 systemd[1]: Started MUNGE authentication service.

Enable and start slurmd:

systemctl enable slurmd
systemctl start slurmd
systemctl status slurmd

The output should be:

● slurmd.service - Slurm node daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
     Active: active (running) since Fri 2026-02-13 05:59:32 UTC; 4s ago
   Main PID: 30727 (slurmd)
      Tasks: 1
     Memory: 1.3M (peak: 1.5M)
        CPU: 16ms
     CGroup: /system.slice/slurmd.service
             └─30727 /usr/sbin/slurmd --systemd

Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Stopped Slurm node daemon.
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: slurmd.service: Consumed 3.533s CPU time, 3.0M memory peak.
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Starting Slurm node daemon...
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPU frequency setting not configured for this node
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd version 24.05.5 started
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd started on Fri, 13 Feb 2026 05:59:32 +0000
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Started Slurm node daemon.
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3892 TmpDisk=778 Uptime=796812 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null) 

Disable the firewall and reset the nft ruleset in the compute node:

systemctl stop firewalld
systemctl disable firewalld

nft flush ruleset
nft list ruleset

Start Slurm service daemons in the head node:

sudo systemctl start slurmdbd
sudo systemctl start slurmctld

Restart Slurm service daemons in the compute node to ensure changes are applied:

systemctl restart slurmd

1.8 Test Munge and Slurm

Test munge on the head node:

# Try to munge and unmunge to access the compute node
munge -n | ssh root@172.16.0.1 unmunge

The output should be:

STATUS:           Success (0)
ENCODE_HOST:      ??? (192.168.200.2)
ENCODE_TIME:      2026-02-13 05:33:34 +0000 (1770960814)
DECODE_TIME:      2026-02-13 05:33:34 +0000 (1770960814)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              ??? (1000)
GID:              ??? (1000)
LENGTH:           0

Note

In the case of an error about “Offending ECDSA key in ~/.ssh/known_hosts:3”, remove the compute node from the known hosts file and try the munge command again:

ssh-keygen -R 172.16.0.1

Alternatively, setup an ignore.conf file per Section 2.8.3 of the tutorial, to prevent this issue.

Test that you can submit a job from the head node.

Check that the node is present and idle:

sinfo

The output should be:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      1   idle de01

Create a test user with a Slurm account:

sudo useradd -m -s /bin/bash testuser
sudo usermod -aG wheel testuser
sudo sacctmgr create user testuser defaultaccount=root
sudo su - testuser

Run a test job as the user 'testuser':

srun hostname

The output should be as follows (the chdir errors are expected when testuser's home directory does not exist on the compute node):

srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01
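You can also verify that batch submission works. A minimal sketch of a batch script, assuming the partition name main shown in the sinfo output above:

```shell
# Write a trivial batch script and submit it with sbatch
cat <<'EOF' > test.sbatch
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=main
#SBATCH --output=test-%j.out
srun hostname
EOF
sbatch test.sbatch
```

While the job runs, squeue shows it in the queue; afterwards the output file (test-<jobid>.out) should contain the compute node's hostname, de01.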

If something goes wrong and Slurm marks your compute node as down, return it to service with this command:

sudo scontrol update NodeName=de01 State=RESUME
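Before resuming, it can help to see why Slurm downed the node. These standard commands, run on the head node, show the recorded reason:

```shell
sinfo -R                       # list down/drained nodes with the reason Slurm recorded
sudo scontrol show node de01   # full node state, including the Reason= field
```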