Linux Kernel 5.5 Will Not Fix The Frequent Intel GPU Hangs In Recent Kernels
Linux users running machines with Intel integrated graphics have been struggling for quite some time with frequent system hangs and other problems caused by a buggy i915 kernel module for Intel iGPUs. The 5.3 series kernels went from completely useless to merely problematic as of 5.3.14, while 5.4 series kernels remain utterly broken. Several fixes attempting to solve some of the more common problems with Intel graphics chips have been merged into the Linux kernel mainline git tree over the last few days. Problems with frequent hangs remain, and it looks like Linux Kernel 5.5 will be as problematic as previous kernels for those using Intel integrated graphics.
written by 林慧 (Wai Lin) 2020-01-12 - last edited 2020-12-05. © CC BY
Intel DG1 dedicated graphics card prototype launched at CES 2020.
The freedesktop issue tracker for Intel graphics is riddled with complaints from an increasing number of frustrated users who experience GPU hangs and total system freezes due to bugs in the i915 kernel driver for Intel graphics chips. The woes began with kernel 5.1 and they have only gotten worse, not better, with each new 5.x series kernel release. One of the more scandalous bugs, where the i915 driver would drop writes to pages it did not own, was fixed in kernel 5.3.14, but other problems remain in 5.3 series kernels and new ones were introduced in 5.4 series kernels.
The bug reports are piling up:
- hsw iommu i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0 (446)
- GPU hangs and screen became full random colors for small period of time (962) (kernel 5.4.10)
- GPU hang on transition to idle (673) (5.3.13-5.4.10)
- GPU HANG: ecode 7:1:0xfffffffe, in Xwayland [2363, hang on rcs0 (755)] (5.3, 5.4)
- NULL pointer dereference in i915_active_acquire since Linux 5.4 (827)
There have been dozens and dozens of other bug reports filed. Most of those have been closed as duplicates of bug #673 with little to no effort to clarify whether they are related or just exhibit the same kinds of hangs and problems.
The high number of problems reported on the issue tracker during the last few months may not reflect the real number of affected systems and users. The Intel GPU bug tracker is subject to the freedesktop organization's authoritarian and draconian baizuo ("white left") "Code of Conduct", which applies to posts on their bug tracker and everything else posted anywhere on the Internet. People with a basic understanding of right and wrong will naturally refuse to participate under those terms. The GitLab issue tracker freedesktop recently adopted requires JavaScript, which makes the site hard to use and ensures that searching with ctrl+f is impossible. This may turn another group of privacy-aware users away.
From Bad To Worse
Users of 5.4 series kernels report that it is getting worse, not better, with each minor version.
"I can confirm that it is now much worse with 5.4.10-arch1-1. It usually froze for me once every 2 days (I think with 5.4.8-arch1-1), but today it froze 5 times already (and it's 3pm where I am). Just browser, discord, spotify and arduino IDE."
"It does seem to have (subjectively) gotten worse with 5.4.10-arch1-1 somehow, even light loads (browser) trigger the condition now."
"Problem still exists on latest arch with plain 5.4.10-arch1-1 I experience a crash approx once per day."
"I used to have the same issue where I received Resetting rcs0 for hang on rcs0 warnings but it rarely happens recently. Instead, my laptop sometimes completely freezes and requires hard reboot, so no log can be recovered. This usually happens after I unplug the HDMI cable connected to a monitor. I don't know if this information is useful."
Firmware Woes
Some of the problems with the i915 kernel module are related to Intel's firmware which is causing issues with older kernels as far back as 4.19. Bugs like i915 8086:5917 subsystem 1028:0817 System suspend hang when i915/kbl_dmc_ver1_*.bin installed (933) describe issues where removing the firmware is the only known solution.
What firmware is loaded is indicated in the kernel ring buffer (viewable with dmesg) with a message like:
[drm] Finished loading DMC firmware i915/bxt_dmc_ver1_07.bin (v1.7)
Initialized i915 1.6.0 20191101 for 0000:00:02.0 on minor 0
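Picking this message out of a long ring buffer by eye is tedious, so it can be matched with a regular expression instead. The following is a minimal Python sketch, not an official tool: the function name find_dmc_firmware is made up for illustration, and it assumes the message has the same shape as the example above. Other kernel versions may word it differently.

```python
import re

# Matches the "Finished loading DMC firmware" message shown above.
# Assumption: the message format is the one in the example; it may
# differ between kernel versions.
DMC_LOADED = re.compile(
    r"Finished loading DMC firmware (?P<path>\S+) \(v(?P<version>[\d.]+)\)"
)


def find_dmc_firmware(log_text):
    """Return (path, version) of the first DMC firmware load message, or None."""
    match = DMC_LOADED.search(log_text)
    if match is None:
        return None
    return match.group("path"), match.group("version")


sample = "[drm] Finished loading DMC firmware i915/bxt_dmc_ver1_07.bin (v1.7)"
print(find_dmc_firmware(sample))
```

On a real machine the text passed to the function would be the output of dmesg rather than the sample string.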
The i915 module's use of binary blob firmware means that a kernel which has worked fine on a machine could suddenly have major issues if a system upgrade has updated the firmware files (provided by the linux-firmware package).
It is possible to avoid the Intel iGPU firmware by removing the i915 module's firmware folder /lib/firmware/i915/. The i915 module will give a warning if the firmware is missing:
i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
i915 0000:00:02.0: Failed to load DMC firmware i915/bxt_dmc_ver1_07.bin. Disabling runtime power management.
i915 0000:00:02.0: DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
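The "error -2" in that warning is a negative errno value. 2 is ENOENT, meaning the file was not found, which is what one would expect after removing the folder. As a rough illustration, the Python sketch below pulls the firmware path and error name out of such lines. The function name missing_firmware is hypothetical, and the pattern only assumes the wording of the warning shown above.

```python
import errno
import re

# Matches the "Direct firmware load ... failed" warning shown above.
# Assumption: the wording is the same as in the example message.
FW_FAILED = re.compile(
    r"Direct firmware load for (?P<path>\S+) failed with error (?P<code>-?\d+)"
)


def missing_firmware(log_text):
    """Return a list of (path, errno name) pairs for failed firmware loads."""
    results = []
    for match in FW_FAILED.finditer(log_text):
        # The kernel reports negative errno values; -2 corresponds to ENOENT.
        code = abs(int(match.group("code")))
        name = errno.errorcode.get(code, "unknown")
        results.append((match.group("path"), name))
    return results


sample = (
    "i915 0000:00:02.0: Direct firmware load for "
    "i915/bxt_dmc_ver1_07.bin failed with error -2"
)
print(missing_firmware(sample))
```

An empty result means no such warning was found in the text, which says nothing about whether the firmware actually works well on that machine.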
Intel's "FIRMWARE" page stated, as of January 11th, 2020:
"DMC provides additional graphics low-power idle states. It provides capability to save and restore display registers across these low-power states independently from the OS/Kernel."
Temperatures on older Intel chips are, oddly, lower without the firmware which is supposed to provide "low-power idle states".
Skylake and newer processors require GuC/HuC firmware for certain video decoding features. Older chips do not.
Not loading the Intel firmware should not be seen as a solution to anything. It is a factor one should be aware of: A given kernel version will behave differently depending on the firmware version being loaded. Avoiding the firmware should only be seen as a temporary solution if it is known to avoid machine-specific problems.
High Temperatures
Hangs are not the only problem with the last few major Linux kernel versions. High temperatures are also a problem.
Temperature problems are somewhat related to HD Graphics 620: VAAPI performs poorly (956) which outlines how Intel iGPUs run at housefire temperatures during video playback.
Kernel 5.5 Will Solve Some Issues But Problems Remain
The Acer Swift SF113-31 has an Intel "Apollo Lake" Goldmont N4200 SoC with an iGPU using the i915 kernel module. The latest kernels do not provide a problem-free experience on that notebook computer.
Bug i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0 (446) was closed with git commit "drm/i915/gt: Do not restore invalid RS state". NULL pointer dereference in i915_active_acquire since Linux 5.4 (827) is fixed by commit "drm/i915: Hold reference to intel_frontbuffer as we track activity". Those are steps in the right direction.
However, Intel employees are very quick to mark any and all bugs related to GPU hangs as duplicates of GPU hang on transition to idle (673). A closer inspection of several supposed duplicates suggests that they are distinct cases. It is hard to say exactly how many different unsolved issues there still are with Intel's i915 kernel GPU driver. Bug 673 remains open.
"[drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}": The current state of Intel's i915 kernel driver. What did they mean by this?
5.5.0-rc5-Hyemii-00257-gac61145a725a - git as of January 11th - appears to be slightly better than 5.3 and 5.4 series kernels, but it is far from problem-free and random system freezes remain a problem. It is therefore almost guaranteed that the final 5.5 Linux kernel will have most of the same frustrating problems previous kernels have had. That's a bit sad; it is not like those who are stuck with an Intel-powered laptop can work around the problems with Intel's i915 kernel driver by sticking an AMD GPU in it. There simply isn't room.
Perhaps Linux Kernel 5.6 will work fine with Intel iGPUs. Perhaps not. It is impossible to guess before the merge window, which has yet to open, closes. The only thing we can say for sure is that Kernel 5.5 will NOT be great for Intel iGPU users.
Update: Kernel 5.5 rc7 appears to be problem-free on low-powered Intel chips with the kernel parameters intel_idle.max_cstate=1 i915.enable_dc=0. You may not need both. Goldmont chips will only need i915.enable_dc=0, Bay Trail chips need intel_idle.max_cstate=1. Some chips need both. See Intel graphics for additional tips.
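After adding these parameters to the boot loader configuration it is worth confirming that the running kernel actually received them. The Python sketch below shows one way to check a kernel command line. The function names and the sample command line (including its BOOT_IMAGE and root values) are made up for illustration, and the parser is deliberately simple: it splits on whitespace and does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict of parameter -> value.

    Parameters without "=" (such as "quiet") map to None.
    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def workaround_status(cmdline):
    """Report whether each workaround parameter is set to the suggested value."""
    wanted = {"intel_idle.max_cstate": "1", "i915.enable_dc": "0"}
    params = parse_cmdline(cmdline)
    return {key: params.get(key) == value for key, value in wanted.items()}


# Hypothetical command line with only one of the two parameters set.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet i915.enable_dc=0"
print(workaround_status(sample))
```

On a running system the command line can be read from /proc/cmdline and passed to workaround_status in place of the sample string.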