AMD Ryzen 3000 series CPUs can't do Random on boot causing Boot Failure on newer Linux distributions

From LinuxReviews

The fun part is that very old AMD APUs have a very similar, well-known problem with their implementation of the CPU instruction RDRAND. systemd implemented a work-around for it back in May, but distributions do not have it since there have been no stable systemd releases since then. The result is that newer distributions, with the exception of Debian 10, will simply fail to boot on Ryzen 3000 series CPUs due to a bug in those CPUs which causes them to fail to produce random data when RDRAND is called early in the boot process.

The RDRAND CPU instruction on AMD Ryzen 3000 series CPUs is just plain broken. Most modern Linux distributions cannot boot because of it.

Linux distributions freeze on boot on Ryzen 3000 series CPUs because systemd versions 240 and newer bypass the kernel's urandom facility and call RDRAND directly early in the boot process. The reason for doing so is simple: the kernel does not have enough entropy at that point.
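The low-entropy situation can be observed from user space. The following is only a minimal sketch, assuming a Linux system with glibc 2.25 or newer, which provides getrandom() in <sys/random.h>. The helper name try_kernel_random is hypothetical and is not taken from systemd. With the GRND_NONBLOCK flag, getrandom() fails with EAGAIN instead of blocking when the kernel's random pool has not been initialized yet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper (not systemd code): ask the kernel for len random
 * bytes without blocking.
 * Returns  1 if buf was completely filled,
 *          0 if the kernel's pool is not initialized yet (EAGAIN),
 *         -1 on any other error or on a short read. */
int try_kernel_random(void *buf, size_t len) {
  assert(buf != NULL);

  ssize_t n = getrandom(buf, len, GRND_NONBLOCK);
  if (n < 0)
    return errno == EAGAIN ? 0 : -1;

  /* Large requests may be returned only partially. */
  return (size_t)n == len ? 1 : -1;
}
```

On a machine that has been running for a while this would normally return 1. A return value of 0 corresponds to the early-boot state in which a caller has to either wait or look for random data elsewhere.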

Systemd does the same thing when a machine returns from suspend to RAM. This revealed a problem with older AMD APUs back in May: systemd 239 would work fine and systemd 240 would not. The reason is that older AMD chips like the E1-1500 stop returning random data when RDRAND is called after a suspend to RAM. More disturbing, the E1-1500 will set the carry flag (CF) to 1, indicating that the returned value is random when it is not. Both AMD and Intel CPUs can set CF to 0 to indicate that no random data is available. Systemd does check this flag.


The bug in Ryzen 3000 series CPUs is different from the bug in earlier AMD chips. By earlier we do not mean Ryzen 1000 or 2000 series CPUs, which do not have this problem; we mean ancient APUs. Ryzen 3000 series CPUs always fail to produce random data when RDRAND is called, while the older APUs fail to do so only after returning from suspend to RAM. The two bugs do share one scandalous aspect: both chips happily claim to have produced random data by setting the carry flag (CF) to 1 when the hard truth is that they have not, in which case CF should be set to 0. Again: systemd does check the value of CF.

AMD CPUs do indicate that there is a problem, but in an unspecified way: they return ULONG_MAX (0xFFFFFFFFFFFFFFFF) instead of a random number. That behavior is not described in AMD's own hardware documentation.

Here's a small test-program you can try for yourself:

File: producerandom.c
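#include <stdio.h>

int main(void) {
  unsigned char success; /* carry flag (CF) after RDRAND: 1 = value claimed valid */
  unsigned long v;       /* value returned by RDRAND */

  /* Execute RDRAND, then copy CF into 'success' with SETC. */
  asm volatile("rdrand %0\n\t"
               "setc %1"
               : "=r"(v), "=qm"(success)
               :
               : "cc");

  printf("success: %u  value: %lx\n", (unsigned)success, v);
  return 0;
}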

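The above program works as expected on Intel CPUs and on Ryzen 1000 and 2000 series CPUs. On Ryzen 3000 series CPUs, and on old AMD APUs after they have been suspended to RAM, it reports success: 1 even though the value is not random (on a 64-bit build the value is ffffffffffffffff).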

What to do if you are affected by this problem

There are several options which will allow you to boot a Linux system on Ryzen 3000 series CPUs, all of which require some technical skills. You're out of luck if you lack those.

One easy solution is to downgrade systemd to version 239. Systemd will only call RDRAND directly in 240+. This may not be a good option if there is no such systemd version available for your distribution.

Another option is to put the SSD/HDD in another machine which does boot and compile a patched systemd or a git version there. There are two possible patches. One is to edit systemd's src/libsystemd/sd-id128/sd-id128.c, where r = genuine_random_bytes(&t, sizeof t, RANDOM_ALLOW_RDRAND); can be changed to r = genuine_random_bytes(&t, sizeof t, 0); which, as you can see, makes systemd not use RDRAND. A perhaps better approach is to check whether the value returned by RDRAND is 0 or ULONG_MAX and to treat it as a failure if so. This is the fix systemd 243 will include. A patch taking this approach can be found in systemd's git commit 543867907cf910c7970ad34afeca49a2c6aed99c.
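The idea of checking for 0 or ULONG_MAX can be sketched in a few lines of C. This is an illustration only and not the code from the systemd commit. The names rdrand_value_plausible and rdrand_checked and the retry count of 10 are assumptions made for this example. The sketch also assumes an x86 CPU that supports the RDRAND instruction at all; calling rdrand_checked on a CPU without it would raise an illegal-instruction fault.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: decide whether a value returned by RDRAND can be
 * trusted. 'cf' is the carry flag reported by the instruction. */
bool rdrand_value_plausible(unsigned long value, bool cf) {
  if (!cf)
    return false; /* the CPU itself reported "no random data" */
  if (value == 0 || value == ULONG_MAX)
    return false; /* outputs treated as bogus on the affected AMD chips */
  return true;
}

/* Hypothetical wrapper: stores a value in *out and returns true only if
 * RDRAND succeeded and passed the sanity check. The retry count is an
 * arbitrary choice for this sketch. */
bool rdrand_checked(unsigned long *out) {
  assert(out != NULL);
#if defined(__x86_64__) || defined(__i386__)
  for (int attempt = 0; attempt < 10; attempt++) {
    unsigned long v;
    unsigned char ok;
    __asm__ volatile("rdrand %0\n\t"
                     "setc %1"
                     : "=r"(v), "=qm"(ok)
                     :
                     : "cc");
    if (rdrand_value_plausible(v, ok != 0)) {
      *out = v;
      return true;
    }
  }
#endif
  return false; /* caller should fall back to another entropy source */
}
```

The trade-off is that a genuinely random 0 or ULONG_MAX would also be rejected. On a 64-bit machine that happens so rarely that rejecting those two values costs practically nothing.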

The newly released Debian 10 Buster includes the above-mentioned systemd patch and therefore does not have this problem.

Other distributions should take the Debian approach and ship a patched systemd version. A new systemd version will not be released for some time.

"We are working on a release, but distros should probably backport the patch in the meantime."

Lennart Poettering in response to us asking for a guesstimate on the next systemd 243 release on June 10th, 2019

Systemd is bad and it's Lennart Poettering's fault

This is a hardware bug in AMD CPUs. systemd is a piece of software. Software using the CPU instruction clzero or movbe should not have pages of code verifying that those instructions work as advertised. systemd freezing is simply a result of it being in charge very early in the boot process.

This is, in truth, AMD's fault. It is a hardware problem. The upcoming systemd 243 will, when released, work around this problem. It will not solve it, because software cannot fix what is a hardware problem.

Do keep in mind that user-space programs may also be affected since Ryzen 3000 series CPUs will always fail to produce random data when RDRAND is called. User-space software will typically use /dev/urandom but there could be some rare pieces of software which do call RDRAND directly. It is interesting to note that OpenSSL used to make frequent calls to RDRAND. This was changed years ago since someone with a buggy AMD APU happened to notice a problem using openssl after a system suspend.
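For comparison, this is roughly what the typical user-space path looks like. It is a minimal sketch, assuming a system that provides the /dev/urandom device; the helper name fill_from_urandom is hypothetical. A program that gets its random bytes this way never executes RDRAND itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: fill buf with len bytes read from the kernel's
 * /dev/urandom device. Returns true only if all len bytes were read. */
bool fill_from_urandom(void *buf, size_t len) {
  assert(buf != NULL);

  FILE *f = fopen("/dev/urandom", "rb");
  if (f == NULL)
    return false; /* device missing or not accessible */

  size_t got = fread(buf, 1, len, f);
  fclose(f);
  return got == len;
}
```

A caller would declare a buffer such as unsigned char key[32]; and use it only if fill_from_urandom(key, sizeof key) returns true.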

There does not appear to be a separate bug for this in systemd's issue tracker as of now, but there is some discussion under issue 11810, which was originally created to tackle the suspend problem on ancient AMD APUs like the E1-1500. Ubuntu is tracking it as a Ryzen 3000 specific issue as bug #1835809.

AMD does, of course, not have any issue tracker for hardware bugs in their CPUs.


published 2019-07-09, last edited 2019-07-30


Anonymous user #1

2 months ago
Score 0++
Can AMD fix it with a microcode update?
Anonymous user #1

2 months ago
Score 0++
They probably can make EFI/BIOS feed it some entropy with an update. How random the resulting initial output would be as a result is of course a good question. But that applies to every other CPU as well; getting actual truly random not reproducible data out of a CPU is kind of tricky.
Anonymous user #1

2 months ago
Score 0++
They wouldn't need to fix the entropy problem to solve this; only make sure the carry flag is cleared. This should be within the realm of a microcode update, but who knows.
Anonymous user #2

2 months ago
Score 1++
Wow. That's a major security threat if another software relies on flag spec.
Anonymous user #3

one month ago
Score 0++
Is this related with the last Agesa and Chipset drivers update from AMD?