🔧 Engineering Principles of OpenCHAMI: Hard Lessons in HPC Management

Designing for Reliability, Simplicity, and the Sysadmin


📝 Introduction: Learning from Experience

Software isn’t just about writing code—it’s about making deliberate choices based on real-world needs. At OpenCHAMI, every design decision comes from lessons learned in HPC system management—some through careful planning, and others through painful experience.

The principles below aren’t just abstract ideals—they are practical rules that guide how we build OpenCHAMI. They reflect the trade-offs we’ve had to make and the priorities we’ve had to balance.


1️⃣ The Sysadmin is Responsible for Preventing Downtime. Serve the Sysadmin.

“If a system is down, it’s the sysadmin’s problem. Make it easier, not harder.”

HPC systems are mission-critical. When they go down, someone gets a phone call—and that person is usually the sysadmin.

That’s why OpenCHAMI is built with one core customer in mind: the person who has to keep the system running. Every design choice prioritizes reducing unplanned maintenance activities and ensuring sysadmins have full visibility and control over the system.

🚀 What This Means for OpenCHAMI:
✔️ System failures should be predictable, not mysterious.
✔️ Logs, metrics, and error messages should help, not confuse.
✔️ If an upgrade or feature adds risk, we rethink it.


2️⃣ Eliminate Complexity.

“The easiest way to manage complexity is to remove it.”

Complexity isn’t just an inconvenience—it’s a risk. Every added feature, configuration, or service introduces new failure modes.

We believe that the best system management tools are the ones you don’t notice because they just work. OpenCHAMI avoids unnecessary dependencies and favors simpler, more maintainable solutions.

🚀 What This Means for OpenCHAMI:
✔️ Avoid adding configuration settings that won’t be used.
✔️ Don’t build unnecessary integrations—let sites choose their own tools.
✔️ Prioritize “sane defaults” to reduce tuning complexity.
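To make the “sane defaults” idea concrete, here is a rough Go sketch. Everything in it is hypothetical (the struct, field names, and environment variables are illustrative, not OpenCHAMI’s actual configuration); the point is simply that the service starts with working values and the sysadmin overrides only what actually needs to change.

```go
package main

import (
	"fmt"
	"os"
)

// Config is a hypothetical service configuration. Every field has a
// working default, so an untouched install still yields a usable service.
type Config struct {
	ListenAddr   string // where the HTTP API listens
	LogLevel     string // "info" unless the admin asks for more detail
	CacheTTLSecs int    // how long discovery results are cached, in seconds
}

// DefaultConfig returns values that work out of the box.
func DefaultConfig() Config {
	return Config{
		ListenAddr:   ":8080",
		LogLevel:     "info",
		CacheTTLSecs: 300,
	}
}

// FromEnv overrides a default only when the admin explicitly sets a value.
func FromEnv() Config {
	cfg := DefaultConfig()
	if v := os.Getenv("SVC_LISTEN_ADDR"); v != "" {
		cfg.ListenAddr = v
	}
	if v := os.Getenv("SVC_LOG_LEVEL"); v != "" {
		cfg.LogLevel = v
	}
	return cfg
}

func main() {
	cfg := FromEnv()
	fmt.Printf("starting with %+v\n", cfg)
}
```

Fewer knobs means fewer ways to misconfigure a cluster during a 3 a.m. incident.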


3️⃣ Where Complexity is Useful to the Sysadmin, Make it Transparent.

“Sometimes complexity is unavoidable. But it should never be hidden.”

There are cases where complexity serves a purpose—for example:

  • Fine-tuning network configurations
  • Customizing job scheduling policies
  • Managing hardware variations

The key isn’t to eliminate this complexity, but to make it transparent: always show what’s happening under the hood.

🚀 What This Means for OpenCHAMI:
✔️ Logs and debugging tools should expose the real state of the system (sketched below).
✔️ Automation should be optional, not forced.
✔️ Documentation should explain how things work, not just how to use them.
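As a hedged illustration of what “transparent” automation can look like, here is a small Go sketch using the standard library’s log/slog package. The node name, URL, and log fields are made up for illustration; the point is that the tool records the exact operation it is about to perform, so a sysadmin can reproduce it by hand.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Structured JSON logs are easy for both humans and log aggregators to read.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Hypothetical example: a tool about to power-cycle a node.
	node := "nid000123"
	target := "https://bmc-nid000123.example/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"

	// Log what will happen and why, not just "rebooting node...".
	logger.Info("issuing power reset",
		"node", node,
		"method", "POST",
		"url", target,
		"reason", "requested by admin via CLI",
	)

	// ... the actual HTTP call would go here ...
}
```

A log line like this turns an opaque automation step into something the sysadmin can inspect, replay with their own tools, or decide to skip.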


4️⃣ The Best Code for OpenCHAMI is Written and Maintained by a Larger Community.

“We focus on what makes OpenCHAMI unique and rely on larger projects for everything else.”

A common mistake in infrastructure projects is reinventing solutions that already exist elsewhere. OpenCHAMI doesn’t need to create:

  • An API Gateway → There are mature, battle-tested options like Kong, Envoy, and Traefik.
  • A Certificate Authority → We rely on well-established PKI and ACME-based tooling instead of building our own.
  • A Custom Database or Storage Layer → OpenCHAMI works with standard databases rather than designing custom persistence models.

Instead of trying to own every problem, we focus only on what’s specific to OpenCHAMI and HPC:

  • Node lifecycle management
  • Cluster-wide configuration orchestration
  • HPC-specific security and authentication models

By delegating non-HPC-specific problems to broader open-source communities, OpenCHAMI remains lightweight, maintainable, and future-proof.

🚀 What This Means for OpenCHAMI:
✔️ Reuse mature external projects instead of reinventing infrastructure.
✔️ Stay modular—integrate easily with industry-standard tools.
✔️ Contribute upstream to projects we rely on instead of forking or duplicating efforts.


5️⃣ The Best Way to Fix Distributed Systems is Not to Have Them.

“Distributed systems fail in ways you can’t predict. Avoid them when possible.”

Clusters are inherently distributed, but that doesn’t mean every piece of software managing them has to be. OpenCHAMI avoids unnecessary distributed components because:

  • Synchronization failures are hard to debug.
  • Network dependencies introduce new failure points.
  • Single-node services are easier to scale than complex distributed ones.

🚀 What This Means for OpenCHAMI:
✔️ Prefer stateless, independent services over distributed coordination (sketched below).
✔️ Reduce reliance on external databases when possible.
✔️ If something must be distributed, keep it simple.
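To sketch what a stateless, independent service can look like in Go (the endpoint, data, and port are hypothetical, not an actual OpenCHAMI API): all data is loaded read-only at startup, every response depends only on the request, and any number of identical copies can run behind a load balancer with no coordination between them.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Read-only lookup table, loaded once at startup. In a real service this
	// might come from a file or an upstream source of truth.
	bootParams := map[string]string{
		"nid000001": "console=ttyS0 root=live:http://provisioner/image.squashfs",
		"nid000002": "console=ttyS0 root=live:http://provisioner/image.squashfs",
	}

	http.HandleFunc("/bootparams", func(w http.ResponseWriter, r *http.Request) {
		node := r.URL.Query().Get("node")
		params, ok := bootParams[node]
		if !ok {
			http.Error(w, "unknown node", http.StatusNotFound)
			return
		}
		// The response depends only on the request and the read-only table,
		// so any replica of this service gives the same answer.
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]string{"node": node, "params": params})
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because nothing is shared and nothing is mutated at runtime, there is no consensus protocol or synchronization bug to chase when a node fails to boot.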


6️⃣ Support HTTP REST APIs (And Why We Chose HTTP Over gRPC).

“We prioritize protocols that sysadmins can easily use and debug.”

At one point, we considered supporting gRPC and other binary protocols. However, we stuck with HTTP because:

  • Sysadmins can use curl, wget, or Postman to troubleshoot API requests.
  • Logging and monitoring tools for HTTP are widely available and well-understood.
  • Network firewalls, proxies, and load balancers all support HTTP by default.
  • Almost every programming language has robust HTTP libraries—without requiring a special SDK.
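To illustrate the “no special SDK” point, here is a hedged Go sketch of a client (the URL, endpoint, and token handling are hypothetical). The standard library is all it takes, and the exact same request can be reproduced from a shell with curl.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Equivalent to:
	//   curl -H "Authorization: Bearer $ACCESS_TOKEN" http://cluster-head:8080/nodes
	req, err := http.NewRequest("GET", "http://cluster-head:8080/nodes", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("ACCESS_TOKEN"))
	req.Header.Set("Accept", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n%s\n", resp.Status, body)
}
```

If the call misbehaves, the sysadmin can paste the same URL and headers into curl and see exactly what the service returns.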

🚀 What This Means for OpenCHAMI:
✔️ Stick with widely supported, human-friendly protocols.
✔️ Prioritize ease of debugging over theoretical efficiency.
✔️ Keep APIs scriptable and accessible to sysadmins first.


📌 Final Thoughts

OpenCHAMI’s engineering principles aren’t just guidelines—they are the foundation of every design decision we make. They help us prioritize the right trade-offs, ensure reliability, and most importantly, serve the sysadmins who keep HPC systems running.

🚀 Want to help shape the future of OpenCHAMI?
💬 Join the discussion.
🔧 Contribute on GitHub.
📖 Explore our docs.