GPU Server for HPC Cluster

How hard is it to build a server with four top-of-the-line GPUs for a high-performance computing cluster? Harder than you might think.

When I started building the SCRP cluster back in the summer of 2020, the GPU servers were supplied by Asrock Rack. Everything except the GPUs was preassembled. That is the sensible thing to do in normal times.

Fast forward to the summer of 2021, and times were not normal. The supply-chain disruption and semiconductor shortage were in full swing. Pretty much every name-brand server manufacturer quoted us months-long lead times, if they were willing to deal with us at all. To get everything in place for the new academic year, I built a series of servers with parts sourced from all over the world. It is actually not that hard to build servers (they are basically heavy-duty PCs with all sorts of specialized parts), unless you want a GPU server suitable for an HPC cluster.

So what is so special about GPU servers for an HPC cluster?

  • Most server cases have seven to eight PCIe slots, but I needed at least nine (four dual-slot GPUs plus a single-slot InfiniBand network card). There are maybe two manufacturers of such cases that you can find through retail channels.
  • High-end GPUs use a lot of power. A single RTX 3090 draws 350W, so four of them means 1400W. Add in the CPU and everything else, and you are looking at 1800W minimum (see the sketch after this list). A beefy power supply is definitely needed.
  • 1800W ATX power supplies do exist, you say. The problem is, almost no server uses an ATX power supply; they pretty much all use specialized CRPS power supplies, which give you two power supply modules in one small package. There are a lot of benefits to this, including redundancy and lower load per module. Guess how many 2000W CRPS power supplies you can find through retail channels? ZERO. There is simply too much demand for these from server manufacturers and too little from retail buyers. I was fortunate enough to have one specially ordered on my behalf by a retail supplier, but it took a while to arrive.
  • Once you have sorted out the parts, now comes assembly. Unless you have one of those highly specialized Supermicro 11-slot motherboards (I am not sure they even sell those in retail), your motherboard will span the width of seven PCIe slots. But you need nine! What do you do? Simple, you might think: all you need is a PCIe extension cable. Except one end of the cable has to fit under a GPU, and 99% of the cables you can buy cannot do that. I ended up having one custom-made. Yes, custom-made. It is the silver strip in the photo. Did I mention it was so fragile out of the factory that I ended up reinforcing it with hot glue myself?
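To put rough numbers on the slot and power math above, here is a minimal back-of-the-envelope sketch in Python. The 350W figure is the RTX 3090's rated board power; the CPU and miscellaneous wattages are assumptions picked for illustration, not measurements from this build.

```python
# Back-of-the-envelope slot count and power budget for a four-GPU build.
# CPU and miscellaneous wattages below are assumed values for illustration.

GPU_COUNT = 4
GPU_SLOT_WIDTH = 2     # each RTX 3090 is a dual-slot card
NIC_SLOT_WIDTH = 1     # single-slot InfiniBand card
GPU_POWER_W = 350      # RTX 3090 rated board power
CPU_POWER_W = 250      # assumed: typical server CPU TDP
MISC_POWER_W = 150     # assumed: motherboard, RAM, drives, fans

slots_needed = GPU_COUNT * GPU_SLOT_WIDTH + NIC_SLOT_WIDTH
load_w = GPU_COUNT * GPU_POWER_W + CPU_POWER_W + MISC_POWER_W

print(f"PCIe slots needed: {slots_needed}")  # 9, vs. the 7-8 most cases offer
print(f"Estimated load:    {load_w}W")       # 1800W, hence a 2000W-class PSU
```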

To conclude, if you think building your own PC is challenging, building a GPU server for an HPC cluster is probably three times as hard. Yet another reason why you should not maintain your own infrastructure.