Spent over an hour trying to figure out why some new GPUs were not working. The server in question is an ASRock Rack 2U4G-EPYC-2T, a specialized server that allows four GPUs to be installed in a relatively small case. Google was not helpful because, understandably, this is a niche product produced only in small quantities.
What did not work:
- Attaching four Ampere GPUs (i.e. RTX 3000 series) in their intended positions in the case.
- Attaching four Pascal GPUs (i.e. GTX 1000 series) in the intended positions.
- Attaching only one Ampere GPU at the rear of the case.
- Attaching four Ampere GPUs directly to the mainboard.
Took me a good hour to figure out that the issue was caused by the PCIe extender board. The three GPU positions at the front require the extender board, but the board only supports PCIe Gen 3. Normally, Gen 4 GPUs can negotiate with Gen 3 mainboards and fall back to PCIe Gen 3, but apparently they cannot do that through the extender board. Once the issue had been identified, the solution was actually very straightforward: manually setting the PCIe lanes to Gen 3 solves everything.
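For anyone chasing a similar problem, one way to see what speed each PCIe link actually negotiated is to read the `current_link_speed` and `max_link_speed` attributes that Linux exposes under `/sys/bus/pci/devices`. The script below is only a rough sketch under a few assumptions: the host runs Linux, the helper names (`pcie_gen`, `read_attr`, `link_report`) are made up for illustration, and a GPU that fails to enumerate at all will simply be missing from the output rather than show up with a bad speed.

```python
from pathlib import Path

# Per-lane transfer rate in GT/s mapped to the PCIe generation that uses it.
GT_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_gen(speed_text):
    """Turn a sysfs string such as '8.0 GT/s PCIe' into a generation number, or None."""
    try:
        rate = float(speed_text.split()[0])
    except (IndexError, ValueError):
        # Empty attribute or a value like 'Unknown'.
        return None
    return GT_TO_GEN.get(rate)


def read_attr(dev, name):
    """Read one sysfs attribute; return '' if it is missing or unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return ""


def link_report(root="/sys/bus/pci/devices"):
    """Return (address, current_gen, max_gen) for every device that reports a link speed."""
    root = Path(root)
    if not root.is_dir():
        return []
    rows = []
    for dev in sorted(root.iterdir()):
        cur = pcie_gen(read_attr(dev, "current_link_speed"))
        cap = pcie_gen(read_attr(dev, "max_link_speed"))
        if cur is None and cap is None:
            continue  # device exposes no usable link information
        rows.append((dev.name, cur, cap))
    return rows


if __name__ == "__main__":
    for addr, cur, cap in link_report():
        print(f"{addr}: running at Gen {cur}, device supports up to Gen {cap}")
```

A Gen 4 card that reports Gen 3 here after the manual setting is behaving as intended. Note that many GPUs drop to a lower link speed when idle, so a low number on an idle card is not necessarily a fault.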
Yet another reason why maintaining your own computing infrastructure is not for the faint-hearted.