Supercomputing 23
With a press release that coincided with the opening of the show floor at Supercomputing ‘23 in Denver, Colorado, the OpenCHAMI consortium was officially launched.
Read our official release at LANL.gov
OpenCHAMI Debut
Even before the announcement, the team at LANL spent the summer curating a minimal set of microservices to achieve our goal of booting a ten-node cluster. At the core, we chose SMD, the inventory and state management system from HPE’s CSM software stack, and its companion service for managing boot scripts, BSS (a minimal sketch of querying both services follows the list below).
- CSM Inventory and State Management github.com/Cray-HPE/hms-smd
- CSM Boot Script Generation and Cloud-Init provider github.com/Cray-HPE/hms-bss
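To give a feel for how these two services fit together, here is a minimal sketch of querying each one over HTTP. The paths are the services’ public APIs (`/hsm/v2/State/Components` for SMD, `/boot/v1/bootparameters` for BSS); the localhost addresses and port numbers are assumptions for a local test deployment, not a prescribed setup.

```go
// Sketch: list what SMD knows about the cluster and what BSS will
// serve at boot time. Host/port values are assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func get(url string) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s -> %s\n%s\n", url, resp.Status, body)
}

func main() {
	// Components (nodes, BMCs, ...) tracked by SMD's inventory.
	get("http://localhost:27779/hsm/v2/State/Components")

	// Boot parameters (kernel, initrd, args) registered with BSS.
	get("http://localhost:27778/boot/v1/bootparameters")
}
```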
The HPE services were built to be part of a large integrated system, which meant that some changes were needed in order to use them with OpenCHAMI. Where possible, we have tried to maintain reversibility for any changes to the HPE software. You can read more about the process in our post, Refactoring SMD.
As part of our refactor, we wanted to disable some of the subroutines in SMD that made it hard for us to deploy it in the cloud and on servers without direct access to the BMCs of our cluster. In particular, we wanted to decouple the hardware discovery system from the inventory system. After a few attempts to isolate and reuse the discovery code from SMD, we decided to pursue a greenfield approach to discovery, which led to our new discovery tool, Magellan. You can read more about that process in our post, Magellan: alternative redfish discovery.
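To make the idea concrete, here is a sketch of the kind of Redfish walk a discovery tool performs: query a BMC and enumerate its Systems collection. The endpoint path follows the DMTF Redfish standard; the BMC address and credentials are placeholders, and this is an illustration of the technique rather than Magellan’s actual code.

```go
// Sketch: enumerate the Systems a BMC exposes via Redfish.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
)

type collection struct {
	Members []struct {
		OdataID string `json:"@odata.id"`
	} `json:"Members"`
}

func main() {
	bmc := "https://10.0.0.100" // placeholder BMC address

	// BMCs commonly present self-signed certs; skip verification in this sketch.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, _ := http.NewRequest("GET", bmc+"/redfish/v1/Systems", nil)
	req.SetBasicAuth("root", "password") // placeholder credentials

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var systems collection
	if err := json.NewDecoder(resp.Body).Decode(&systems); err != nil {
		panic(err)
	}

	// Each member URI (e.g. /redfish/v1/Systems/1) can then be fetched
	// for serial numbers, MAC addresses, and other inventory details.
	for _, m := range systems.Members {
		fmt.Println("found system:", m.OdataID)
	}
}
```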
We also wanted to prove to ourselves that we could address some of the challenges we’d heard from operators who use SMD and BSS regularly. To identify promising areas, we conducted a few interviews based on the methodology outlined in The Phoenix Project. One item that came up as a large source of unplanned maintenance was the etcd database used by BSS to support booting. You can read more about how we added Postgres database support to BSS in our post, Switching from etcd to postgres.
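The data a boot script service keeps is simple and relational, which is what makes Postgres a natural fit. Below is a hypothetical sketch of that idea: one row of boot parameters per node, keyed by MAC address. The table name, columns, and connection string are illustrative assumptions, not the schema BSS actually adopted.

```go
// Sketch: storing per-node boot parameters in Postgres instead of etcd.
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // Postgres driver
)

const schema = `
CREATE TABLE IF NOT EXISTS boot_params (
    mac    TEXT PRIMARY KEY,  -- node identifier presented at boot time
    kernel TEXT NOT NULL,
    initrd TEXT NOT NULL,
    params TEXT NOT NULL      -- kernel command-line arguments
);`

func main() {
	// Placeholder connection string.
	db, err := sql.Open("postgres", "postgres://bss:secret@localhost/bss?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(schema); err != nil {
		panic(err)
	}

	// Register (or update) boot parameters for one node.
	_, err = db.Exec(
		`INSERT INTO boot_params (mac, kernel, initrd, params)
		 VALUES ($1, $2, $3, $4)
		 ON CONFLICT (mac) DO UPDATE SET
		   kernel = EXCLUDED.kernel, initrd = EXCLUDED.initrd, params = EXCLUDED.params`,
		"aa:bb:cc:dd:ee:ff", "vmlinuz-5.14", "initramfs-5.14.img", "console=ttyS0",
	)
	if err != nil {
		panic(err)
	}

	// Look up the kernel a node should boot.
	var kernel string
	if err := db.QueryRow(
		`SELECT kernel FROM boot_params WHERE mac = $1`, "aa:bb:cc:dd:ee:ff",
	).Scan(&kernel); err != nil {
		panic(err)
	}
	fmt.Println("kernel for node:", kernel)
}
```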