Linux Kernel 5.5 Will Not Fix The Frequent Intel GPU Hangs In Recent Kernels
Linux users running machines with Intel integrated graphics have been struggling for quite some time with frequent system hangs and other problems caused by a buggy i915 kernel module for Intel iGPUs. 5.3 series kernels went from being completely useless to problematic as of 5.3.14, while 5.4 series kernels remain utterly broken. Several fixes attempting to solve some of the more common problems with Intel graphics chips have been merged into the Linux Kernel mainline git tree over the last few days. Problems with frequent hangs remain, and it looks like Linux Kernel 5.5 will be as problematic as previous kernels for those using Intel integrated graphics.
written by 林慧 (Wai Lin). published 2020-01-12 - last edited 2020-01-25
The freedesktop issue tracker for Intel graphics is riddled with complaints from an increasing number of frustrated users who experience GPU hangs and total system freezes due to bugs in the i915 kernel driver for Intel graphics chips. The woes began with kernel 5.1 and they have only gotten worse, not better, with each new 5.x series kernel release. One of the more scandalous bugs, where the i915 driver would drop writes to pages it did not own, was fixed in kernel 5.3.14, but other problems remain in 5.3 series kernels and new ones were introduced in 5.4 series kernels.
The bug reports are piling up:
- hsw iommu i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0 (446)
- GPU hangs and screen became full random colors for small period of time (962) (kernel 5.4.10)
- GPU hang on transition to idle (673) (5.3.13-5.4.10)
- GPU HANG: ecode 7:1:0xfffffffe, in Xwayland [2363, hang on rcs0 (755)] (5.3, 5.4)
- NULL pointer dereference in i915_active_acquire since Linux 5.4 (827)
There have been dozens and dozens of other bug reports filed. Most of those have been closed as duplicates of bug #673 with little to no effort to clarify whether they are actually related or just exhibit the same kinds of hangs and problems.
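Because so many reports are being folded into one bug, it can help to compare the hang signatures in your own logs against those in the reports listed above. The following is a minimal sketch, assuming the hang messages look like the ones quoted in the bug titles; the function name `parse_gpu_hangs` and the regular expression are illustrative assumptions, not part of any i915 tooling.

```python
import re

# Assumed message shape, taken from the bug titles quoted above:
#   "GPU HANG: ecode 7:0:0x00000000, hang on rcs0"
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\d+:[0-9a-f]+:0x[0-9a-f]+)"
    r"(?:.*?hang on (?P<engine>\w+))?",
    re.IGNORECASE,
)


def parse_gpu_hangs(log_text):
    """Return one dict per GPU HANG line found in the given log text."""
    hangs = []
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs.append(
                {
                    "ecode": match.group("ecode"),
                    # None when the line does not name an engine
                    "engine": match.group("engine"),
                }
            )
    return hangs


if __name__ == "__main__":
    sample = "i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0"
    print(parse_gpu_hangs(sample))
```

Two reports with different ecode values are not necessarily different bugs, and identical values do not prove a duplicate; the output is only a starting point for comparison.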
From Bad To Worse
Users of 5.4 series kernels report that it is getting worse, not better, with each minor version.
"I can confirm that it is now much worse with 5.4.10-arch1-1. It usually froze for me once every 2 days (I think with 5.4.8-arch1-1), but today it froze 5 times already (and it's 3pm where I am). Just browser, discord, spotify and arduino IDE."
"It does seem to have (subjectively) gotten worse with 5.4.10-arch1-1 somehow, even light loads (browser) trigger the condition now."
"Problem still exists on latest arch with plain 5.4.10-arch1-1 I experience a crash approx once per day."
"I used to have the same issue where I received Resetting rcs0 for hang on rcs0 warnings but it rarely happens recently. Instead, my laptop sometimes completely freezes and requires hard reboot, so no log can be recovered. This usually happens after I unplug the HDMI cable connected to a monitor. I don't know if this information is useful."
Some of the problems with the i915 kernel module are related to Intel's firmware which is causing issues with older kernels as far back as 4.19. Bugs like i915 8086:5917 subsystem 1028:0817 System suspend hang when i915/kbl_dmc_ver1_*.bin installed (933) describe issues where removing the firmware is the only known solution.
Which firmware is loaded is indicated in the kernel ring buffer (viewable with dmesg) by a message like:
[drm] Finished loading DMC firmware i915/bxt_dmc_ver1_07.bin (v1.7)
Initialized i915 1.6.0 20191101 for 0000:00:02.0 on minor 0
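Since the firmware version can change underneath an otherwise unchanged kernel, it may be worth recording it alongside the kernel version when reporting a hang. A minimal sketch, assuming the log message has exactly the shape shown above; `loaded_dmc_firmware` is a hypothetical helper name:

```python
import re

# Matches the "Finished loading DMC firmware" message quoted above.
DMC_LOADED_RE = re.compile(
    r"Finished loading DMC firmware (?P<path>\S+) \(v(?P<version>[\d.]+)\)"
)


def loaded_dmc_firmware(log_text):
    """Return (firmware_path, version) from kernel log text, or None."""
    match = DMC_LOADED_RE.search(log_text)
    if match is None:
        return None
    return match.group("path"), match.group("version")


if __name__ == "__main__":
    import subprocess

    # Reading the ring buffer may require root on some systems.
    log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout
    print(loaded_dmc_firmware(log))
```

The function returns `None` both when no firmware was loaded and when the message has already scrolled out of the ring buffer, so a `None` result alone does not prove the firmware is absent.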
The i915 module's use of binary blob firmware means that a kernel which has worked fine on a machine could suddenly have major issues if a system upgrade has updated the firmware files (provided by the
It is possible to avoid the Intel iGPU firmware by removing the i915 module's firmware folder. The i915 module will give a warning if the firmware is missing:
i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
i915 0000:00:02.0: Failed to load DMC firmware i915/bxt_dmc_ver1_07.bin. Disabling runtime power management.
i915 0000:00:02.0: DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
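To confirm that a machine really is running without the firmware, the warning above can be searched for in the kernel log. This is a rough sketch under the assumption that the message format matches the lines quoted above; `failed_firmware_loads` is an illustrative name.

```python
import errno
import re

# Matches the "Direct firmware load ... failed with error N" warning above.
FW_FAILED_RE = re.compile(
    r"Direct firmware load for (?P<path>\S+) failed with error (?P<code>-?\d+)"
)


def failed_firmware_loads(log_text):
    """Return a list of (firmware_path, error_code, error_name) tuples."""
    results = []
    for match in FW_FAILED_RE.finditer(log_text):
        code = int(match.group("code"))
        # The kernel reports negative errno values; -2 is ENOENT (no such file).
        name = errno.errorcode.get(-code, "unknown")
        results.append((match.group("path"), code, name))
    return results


if __name__ == "__main__":
    sample = (
        "i915 0000:00:02.0: Direct firmware load for "
        "i915/bxt_dmc_ver1_07.bin failed with error -2"
    )
    print(failed_firmware_loads(sample))
```

An error of -2 simply means the file was not found, which is what one would expect after the firmware folder has been removed.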
Intel's "FIRMWARE" page states that
"DMC provides additional graphics low-power idle states. It provides capability to save and restore display registers across these low-power states independently from the OS/Kernel."
Temperatures on older Intel chips are, oddly, lower without the firmware, which is supposed to provide "low-power idle states".
Skylake and newer processors require GuC/HuC firmware for certain video decoding features. Older chips do not.
Not loading the Intel firmware should not be seen as a solution to anything. It is a factor one should be aware of: A given kernel version will behave differently depending on the firmware version being loaded. Avoiding the firmware should only be seen as a temporary solution if it is known to avoid machine-specific problems.
Hangs are not the only problem with the last few major Linux kernel versions. High temperatures are also a problem.
- 5.3.11 regression: No RC6 on Kaby Lake (614)
Temperature problems are somewhat related to the bug HD Graphics 620: VAAPI performs poorly (956), which outlines how Intel iGPUs run at housefire temperatures during video playback.
Kernel 5.5 Will Solve Some Issues But Problems Remain
The Acer Swift SF113-31 has an Intel "Apollo Lake" Goldmont N4200 SoC with an iGPU using the i915 kernel module. The latest kernels do not provide a problem-free experience on that notebook computer.
Bug i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0 (446) was closed with git commit "drm/i915/gt: Do not restore invalid RS state". NULL pointer dereference in i915_active_acquire since Linux 5.4 (827) is fixed by commit "drm/i915: Hold reference to intel_frontbuffer as we track activity". Those are steps in the right direction.
However, Intel employees are very quick to mark any and all bugs related to GPU hangs as duplicates of GPU hang on transition to idle (673). A closer inspection of several supposedly duplicate bugs suggests that they are distinct cases. It is hard to say exactly how many different unsolved issues there still are with Intel's i915 kernel GPU driver. Bug 673 remains open.
5.5.0-rc5-Hyemii-00257-gac61145a725a - git as of January 11th - appears to be slightly better than 5.3 and 5.4 series kernels, but it is far from problem-free and random system freezes remain a problem. It is therefore almost guaranteed that the final 5.5 Linux kernel will have most of the same frustrating problems previous kernels have had. That's a bit sad; it is not like those who are stuck with an Intel-powered laptop can work around the problems with Intel's i915 kernel driver by sticking an AMD GPU in it. There simply isn't room.
Perhaps Linux Kernel 5.6 will work fine with Intel iGPUs. Perhaps not. It is impossible to guess before the merge window, which has yet to open, closes. The only thing we can say for sure is that Kernel 5.5 will NOT be great for Intel iGPU users.
Update: Kernel 5.5 rc7 appears to be problem-free on low-powered Intel chips with the kernel parameters