I'm trying to fix a server that won't boot with all four of its CPUs installed. It is a Primergy RX600 R6. It will boot fine with CPUs 1 and 3 in their sockets, but if I populate sockets 2 and/or 4, the machine refuses to boot. It gives the error "PSU Pctrl Fail 31.12.2018 15:59:25" and switches itself back off.
I've actually got three machines that all show this same problem. No matter what I try, nothing will make them boot with all four CPUs in. I have swapped parts to the point that everything has been replaced: new PSUs, new motherboard, new CPUs and new memory risers. I thought maybe some of the DIMMs were bad, but after testing each module in the known-working two-CPU configuration, I'm less inclined to think there's anything wrong with them. All the RAM works in the two-CPU setup, but with four CPUs it gives that same PSU Pctrl error every time, even with the minimum of RAM installed.
To walk you through what happened: there was a power event on one of our RX600 R6 units. The board got fried by a failing PSU. We replaced the entire unit, swapped over the DIMMs and CPUs from the dead one, and that's when the failure to boot with all four CPUs started. We thought maybe the CPUs were damaged, so we put all new ones in. Still the same error. We systematically replaced every part that had formerly been in the damaged unit, to the point that the new machine had all new risers and CPUs. Still the same error. So we tried another machine. Still the same error.
At this point, I was of the opinion that one of the DIMMs was bad, probably damaged in the electrical event. With the machine in its working two-CPU configuration, I tested each DIMM individually, and it booted with no problem every time. So now I'm at a loss. There doesn't seem to be any option in the BIOS to enable the extra CPUs, and I can't even get into the BIOS with all of them in place. I know it's a long shot, but does anyone have any idea what I might have missed? It seems odd for an issue to be this persistent and to happen on so many different machines. The only common factor is the memory modules, which came from the machine that had the initial problem. Given that they all test as working, I'm not sure what else could be preventing the server from getting past POST.