RDRAND stops returning random values on older AMD CPUs after suspend
Systemd versions past 239 fail to start or restart any service after a suspend to RAM on the AX-XXXX and EX-XXXX series AMD APUs. The root cause is a hardware bug in these AMD CPUs, which will happily return
0xFFFFFFFFFFFFFFFF instead of a random number after they have been suspended.
All AMD APUs in CPU family 22 appear to be affected. Older chips like the A8-7600 in family 21 do not have RDRAND (or RDSEED for that matter) and family 23 (Ryzen) does not have this problem. You can run
grep family /proc/cpuinfo to see which CPU family you have.
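The family number that /proc/cpuinfo reports can also be derived from the CPUID leaf 1 signature. The following is a minimal sketch of that decoding. The function names are hypothetical, and the signature values in the usage note are illustrative inputs rather than a list of affected chips.

```c
#include <stdint.h>

/* Decode the CPU family from the CPUID leaf 1 EAX signature.
 * The extended family field (bits 27:20) is added only when the
 * base family field (bits 11:8) is 0xF. */
static unsigned cpu_family_from_signature(uint32_t eax) {
    unsigned family = (eax >> 8) & 0xF;

    if (family == 0xF)
        family += (eax >> 20) & 0xFF;
    return family;
}

/* Hypothetical helper: true for the family this article reports as affected. */
static int family_is_reported_affected(unsigned family) {
    return family == 22;
}
```

With this decoding, a signature of 0x00700F01 yields family 22 (0xF + 0x7), 0x00800F11 yields family 23 and 0x00600F12 yields family 21.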
The problem for systemd users with these APUs
is that it uses RDRAND, if present, in the sd_id128_randomize function in
src/libsystemd/sd-id128/sd-id128.c to generate UUIDs for services in versions 240 and onward. This fails after a suspend on affected AMD APUs since RDRAND is no longer returning random values. A workaround is to replace the line
r = genuine_random_bytes(&t, sizeof t, RANDOM_ALLOW_RDRAND); with
r = genuine_random_bytes(&t, sizeof t, 0); so RDRAND is no longer used. Another simple workaround is to avoid suspend to RAM and hibernate to disk instead.
Lennart Poettering has proposed that future versions of systemd perform an
if (*ret == 0 || *ret == ULONG_MAX) check to filter out invalid results. This, if added, would permanently solve the issue for affected users.
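A rough sketch of what such a filter could look like is shown here. It is not systemd's code: the hw_random_source type, the filtered_hw_random function, the fake sources and the choice of error codes are all assumptions made for illustration, with the hardware instruction replaced by a callback so the logic can be exercised on any machine.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the hardware instruction (hypothetical type): returns true
 * when a value was delivered (CF=1) and false when none was available. */
typedef bool (*hw_random_source)(unsigned long *out);

/* Returns 0 and stores a value in *ret on success, -EAGAIN when the source
 * reports no data, and -EIO when the value looks like a stuck-at result.
 * The error codes are an arbitrary choice for this sketch. */
static int filtered_hw_random(hw_random_source source, unsigned long *ret) {
    unsigned long v;

    assert(source != NULL);
    assert(ret != NULL);

    if (!source(&v))
        return -EAGAIN;

    /* The proposed check: all-zero and all-one words are rejected. */
    if (v == 0 || v == ULONG_MAX)
        return -EIO;

    *ret = v;
    return 0;
}

/* Fake sources used only to exercise the filter. */
static bool fake_stuck_source(unsigned long *out) { *out = ULONG_MAX; return true; }
static bool fake_good_source(unsigned long *out)  { *out = 0x1234ABCDUL; return true; }
static bool fake_empty_source(unsigned long *out) { (void) out; return false; }
```

A caller would fall back to another entropy source whenever the result is not 0. The filter also discards two legitimate values out of every possible word, which is a negligible bias in exchange for catching the stuck-at pattern.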
Whether systemd should blindly trust the hardware by using RDRAND instead of a properly seeded PRNG is a question worth asking.
The underlying problem
The failure of RDRAND to return random values is pretty bad and likely affects a lot more than systemd. In the systemd case it fails in a very noticeable and visible way. It would not be good if other critical pieces of software happily accept
0xFFFFFFFFFFFFFFFF as a random value and use it when RDRAND is called.
Technically, RDRAND is allowed to fail if no random value is available. All CPUs supporting this instruction, AMD or Intel, can set the Carry Flag (CF) to 0 to indicate that no random value is available. These problematic AMD APUs will happily report a CF of 1, indicating that RDRAND is working, and still return
0xFFFFFFFFFFFFFFFF. That's problematic.
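To make the CF handshake concrete, here is a small sketch that executes RDRAND once and reads the carry flag. It assumes GCC or Clang on x86-64 (inline assembly and the compiler's cpuid.h header); on other targets it simply reports that no value is available. The function names rdrand64_once and rdrand64_checked are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>
#endif

/* Executes RDRAND once. Returns true only if the CPU advertises RDRAND
 * and the instruction set CF=1; *out then holds the delivered value. */
static bool rdrand64_once(uint64_t *out) {
#if defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    uint64_t value;
    unsigned char carry;

    /* CPUID leaf 1, ECX bit 30 advertises RDRAND. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 30)))
        return false;

    /* SETC copies the carry flag into a byte right after RDRAND. */
    __asm__ volatile("rdrand %0\n\tsetc %1"
                     : "=r"(value), "=qm"(carry)
                     :
                     : "cc");
    if (!carry)
        return false;

    *out = value;
    return true;
#else
    (void) out;
    return false;
#endif
}

/* CF=1 is necessary but, per this article, not sufficient on the
 * affected APUs, so the all-ones word is rejected as well. */
static bool rdrand64_checked(uint64_t *out) {
    uint64_t v;

    if (!rdrand64_once(&v))
        return false;
    if (v == UINT64_MAX)
        return false;

    *out = v;
    return true;
}
```

Software that only looks at the carry flag, as rdrand64_once does, would accept the bogus value after a suspend on the affected chips. The extra comparison in rdrand64_checked is what catches it.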
The good news for OpenSSL users is that this hardware bug in AMD APUs has actually been known and documented in RedHat bug #85911 since 2014. OpenSSL's solution was to simply not use RDRAND at all, so the underlying problem has remained ignored and unsolved.
A kernel fix is needed
This is a hardware problem which should be worked around at the kernel level, either by making RDRAND unavailable on these machines or by checking whether it returns valid values after a suspend.
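The second option, a validity check after resume, could follow the idea sketched here. This is a userspace illustration and not kernel code: the rng_source type, the sample count and the function names are assumptions, and the instruction is again replaced by a callback. The check takes a handful of samples and treats the source as broken if any sample is unavailable or if all samples are the same word.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the hardware instruction (hypothetical type). */
typedef bool (*rng_source)(uint64_t *out);

enum { SANITY_SAMPLES = 8 };

/* Returns true if the source looks usable: every sample is delivered
 * and the samples are not all the same word. */
static bool rng_passes_resume_check(rng_source source) {
    uint64_t first = 0;
    bool saw_different = false;

    for (size_t i = 0; i < SANITY_SAMPLES; i++) {
        uint64_t v;

        if (!source(&v))
            return false;
        if (i == 0)
            first = v;
        else if (v != first)
            saw_different = true;
    }
    return saw_different;
}

/* Fake sources used only to exercise the check. */
static bool fake_stuck_after_resume(uint64_t *out) {
    *out = UINT64_MAX;
    return true;
}

static bool fake_changing_source(uint64_t *out) {
    static uint64_t counter = 0;

    *out = 0x9E3779B97F4A7C15ULL * ++counter;
    return true;
}
```

A real implementation would probably tolerate a transient CF=0 by retrying instead of failing outright. The sketch keeps the strict behavior for brevity.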
published 2019-05-07 - last edited 2019-07-30