October 17, 2024 in HPC, OpenCHAMI, Booting by Alex Lovell-Troy · 3 minutes
In previous posts, we covered how we set up OpenCHAMI and interacted with it via the API and CLI. Now, let’s dive into one of the most critical aspects of managing a large HPC cluster—booting nodes efficiently and reliably.
Like many HPC systems, the nodes in the Badger cluster are diskless. Each boot relies on loading a remote filesystem image into memory. The image is built to include everything needed for the node to operate, while any filesystem changes during runtime are saved to an overlayfs layer, which also runs in memory.
OpenCHAMI itself doesn’t include tooling to build, store, and serve system images. In keeping with our core principle of modularity, each site has its own preferred OS and image build pipeline. And, since OpenCHAMI doesn’t have custom software that must be installed in the system image, any Linux operating system should work. OpenCHAMI references existing kernels, ramdisks, and system images through URLs in boot parameters.
At LANL, we use Buildah and containers to create images and then share them through Quay. For automation, we use GitLab runners to trigger a new image build on each new commit to our Git repository.
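As a rough sketch of that build step (the registry path and tag below are placeholders, not our actual repositories), the runner boils down to a Buildah build followed by a push to Quay:

```bash
# Build the system image from a Containerfile in the current directory,
# then push the result to Quay. Registry path and tag are placeholders.
buildah build -t quay.io/example-org/compute-base:latest .
buildah push quay.io/example-org/compute-base:latest
```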
Here’s how a node’s boot process is configured in OpenCHAMI using ochami-cli:
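The exact ochami-cli invocation is specific to our environment, so treat the following as a hedged sketch: it writes out the boot configuration that ends up in BSS (the Boot Script Service), with a placeholder MAC address and placeholder image URLs:

```bash
# Sketch: boot configuration destined for BSS. The MAC address and
# image URLs are placeholders, not real LANL endpoints.
cat > bootparams.json <<'EOF'
{
  "macs": ["de:ad:be:ef:00:01"],
  "kernel": "https://boot.example.lanl.gov/compute/vmlinuz",
  "initrd": "https://boot.example.lanl.gov/compute/initramfs.img",
  "params": "root=live:https://boot.example.lanl.gov/compute/rootfs.squashfs rd.live.overlay.overlayfs=1"
}
EOF
```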
Our kernel command line has a few unique items:
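As a sketch (the image URL is again a placeholder), the interesting pieces look like this:

```bash
# Hypothetical kernel command line items for a diskless node; the
# rd.live.overlay.overlayfs=1 option is dracut's switch for keeping
# runtime changes in an in-memory overlayfs layer.
root=live:https://boot.example.lanl.gov/compute/rootfs.squashfs rd.live.overlay.overlayfs=1
```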
The `live` specification indicates that Linux will download the filesystem and make it an overlayfs layer for the newroot. To populate BSS with ochami-cli:
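The flags for ochami-cli vary across versions, so rather than guess at them, here is the equivalent raw call against the BSS API that the CLI wraps; the hostname, port, and access token are placeholders:

```bash
# Sketch: POST the boot parameters written above to BSS.
# Hostname, port, and ACCESS_TOKEN are placeholders.
curl -X POST https://demo.openchami.cluster:8443/boot/v1/bootparameters \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @bootparams.json
```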
And to view the new data:
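A hedged equivalent, reading the stored entries back out of BSS with the same placeholder hostname and token:

```bash
# Sketch: fetch the boot parameters BSS now has on record.
curl -s https://demo.openchami.cluster:8443/boot/v1/bootparameters \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" | jq .
```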
In this post, we explored how OpenCHAMI orchestrates the boot process for diskless HPC nodes, leveraging remote filesystem images and modular tools like Buildah for creating and managing system images. By maintaining flexibility in image creation and boot configurations, OpenCHAMI allows sites to use their preferred operating systems and infrastructure. With a focus on efficiency and scalability, the system simplifies booting large clusters by integrating seamlessly with existing tools and workflows. As we continue this series, we’ll dive deeper into deployment workflows and how OpenCHAMI can streamline HPC operations across a wide range of environments.
Stay tuned for the final part in our series!