Kernel 5.4.1 And 5.3.14 Are Released Making Linux Users With Intel iGPUs Finally Able To Use 5.3-Series Kernels

From LinuxReviews

The Linux kernel's i915 module for Intel iGPUs has been a mess for quite some time. Reverting all the way back to kernel 5.0.21 has been one solution for low-powered Intel Goldmont "Apollo Lake" SoCs like the Pentium N4200. Kernel 5.3.14 has a patch, also included in kernel 5.4.0, which brings 5.3.x series kernels a step closer to being usable on Intel iGPUs. It makes 5.3.14 usable, but 5.4 series kernels have other issues with Intel iGPUs. There are also some fixes for USB and for all the Intel CPU-bug mitigations in both 5.4.1 and 5.3.14.

Published 2019-11-30, last edited 2019-12-07

The Acer Swift SF113-31 has an Intel "Apollo Lake" Goldmont N4200 SoC with an iGPU using the i915 module. 5.3 series Linux kernels have so far been completely unusable on this machine thanks to the i915 module screwing around with memory used by the filesystem and other important kernel functions. Kernel 5.3.14 has a patch. Compiling it on this machine's weak 4 core 1.5 GHz (all-core load) Pentium N4200 takes ages.

A Message Nobody Wants To See, Ever

[   50.138567] WARNING: CPU: 1 PID: 1330 at fs/ext4/inode.c:3941 ext4_set_page_dirty+0x3e/0x50
[   50.138638] CPU: 1 PID: 1330 Comm: kworker/u8:4 Not tainted 5.3.8-Seohyun #1
[   50.138639] Hardware name: Acer Swift SF113-31/ASAHI_AP_S, BIOS V1.12 03/30/2018
[   50.138700] Workqueue: i915 __i915_gem_free_work [i915]
[   50.138704] RIP: 0010:ext4_set_page_dirty+0x3e/0x50
[   50.138706] Code: 48 8b 00 a8 01 75 16 48 8b 57 08 48 8d 42 ff 83 e2 01 48 0f 44 c7 48 8b 00 a8 08 74 0d 48 8b 07 f6 c4 20 74 0f e9 92 e7 f7 ff <0f> 0b 48 8b 07 f6 c4 20 75 f1 0f 0b e9 81 e7 f7 ff 90 0f 1f 44 00
[   50.138707] RSP: 0018:ffffc1e60137fd90 EFLAGS: 00010246
[   50.138709] RAX: 0017ffe000002016 RBX: ffff9e337236a200 RCX: 0000000000000000
[   50.138710] RDX: 0000000000000000 RSI: 0000000121400000 RDI: fffff3ecc498ea40
[   50.138711] RBP: fffff3ecc498ea40 R08: 0000000121400000 R09: 0000000000000000
[   50.138712] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000001263a9
[   50.138713] R13: ffff9e3322c11b00 R14: ffff9e33367f9ca0 R15: 0000000000000000
[   50.138714] FS:  0000000000000000(0000) GS:ffff9e337ba80000(0000) knlGS:0000000000000000
[   50.138715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.138716] CR2: 000055fb74151f10 CR3: 000000013c60a000 CR4: 00000000003406e0
[   50.138717] Call Trace:
[   50.138767]  i915_gem_userptr_put_pages+0x14b/0x1e0 [i915]
[   50.138812]  __i915_gem_object_put_pages+0x5b/0xa0 [i915]
[   50.138854]  __i915_gem_free_objects+0x124/0x230 [i915]
[   50.138898]  __i915_gem_free_work+0x64/0x90 [i915]
[   50.138902]  process_one_work+0x199/0x340
[   50.138905]  worker_thread+0x4e/0x3b0
[   50.138907]  kthread+0xfc/0x130
[   50.138910]  ? process_one_work+0x340/0x340
[   50.138912]  ? kthread_park+0x80/0x80
[   50.138915]  ret_from_fork+0x35/0x40
[   50.138919] ---[ end trace ca5ea2ec07e00336 ]---

Many GNU/Linux users with Intel iGPUs, including myself, have enjoyed the fine kernel message above when trying 5.3 series Linux kernels. A lot of GNU/Linux distributions have been pushing those 5.3-series kernels upon their users with sad and depressing results. The root cause was the i915 kernel module telling the kernel to drop writes to memory areas other kernel modules were writing to. That is a total scandal which leads to total system crashes and hangs, and potentially worse consequences like file system corruption. The i915 module is used by all Intel graphics; it is named i915 for historical reasons (they might as well rename it intel-gfx).

Kernel developer Chris Wilson had this to say about the above kernel message:

"Yikes. That shows that code was inherently more buggy than I thought, as it was causing us to drop writes to pages we didn't own (but thought we did).

The root cause of the warn and ext4 bug is the lack of lock_page around set_page_dirty in userptr_put_pages. We tried putting a lock there, but we recurse into userptr_put_pages from underneath locked pages..."

Linux 5.4-rc1 got a rather ugly band-aid patch for the i915 module's unacceptable behavior. That patch was included in the newly released Linux Kernel 5.3.14. Chris Wilson had this to say about it:

"set_page_dirty says:

"For pages with a mapping this should be done under the page lock for the benefit of asynchronous memory errors who prefer a consistent dirty state. This rule can be broken in some special cases, but should be better not to."

Under those rules, it is only safe for us to use the plain set_page_dirty calls for shmemfs/anonymous memory. Userptr may be used with real mappings and so needs to use the locked version (set_page_dirty_lock).

However, following a try_to_unmap() we may want to remove the userptr and so call put_pages(). However, try_to_unmap() acquires the page lock and so we must avoid recursively locking the pages ourselves -- which means that we cannot safely acquire the lock around set_page_dirty(). Since we can't be sure of the lock, we have to risk skip dirtying the page, or else risk calling set_page_dirty() without a lock and so risk fs corruption."

It sounds a lot like some memory leaks (no longer used memory not being freed) in the i915 module were accepted as a "solution" to the i915 module freeing random pieces of memory it didn't own. That's more like a band-aid than a solid and sustainable solution.
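
The trade-off Chris Wilson describes (take the page lock only if it is free, otherwise skip dirtying the page) can be illustrated with a small user-space analogy. This is only a sketch of the try-lock pattern, not the kernel patch itself: the `Page` class and the `put_page` function are hypothetical names invented for illustration, and a Python `threading.Lock` merely stands in for the kernel's page lock.

```python
import threading


class Page:
    """Toy stand-in for a kernel page: a lock plus a dirty flag (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = False


def put_page(page, was_written):
    """Mark the page dirty only if its lock can be taken without blocking.

    Returns True if the page was marked dirty, False if dirtying was skipped.
    """
    if not was_written:
        return False
    # Non-blocking acquire: the caller may already hold this lock
    # (the recursion case from the quote), so blocking here would deadlock.
    if page.lock.acquire(blocking=False):
        try:
            page.dirty = True
        finally:
            page.lock.release()
        return True
    # The lock is held elsewhere: skip dirtying rather than touch the
    # page without the lock. The pending write may be lost.
    return False


free_page = Page()
print(put_page(free_page, was_written=True))   # lock was available

held_page = Page()
held_page.lock.acquire()                       # simulates a caller that already holds the page lock
print(put_page(held_page, was_written=True))   # lock unavailable, dirtying skipped
held_page.lock.release()
```

The second call shows the cost of the band-aid: when the lock is already held, the page is never marked dirty, so the deadlock and the unlocked write are both avoided at the price of possibly dropping that write.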

The patch makes Linux 5.3-series kernels from 5.3.14 on work with Intel iGPUs without major issues. The same is not true for 5.4-series kernels...

Kernel 5.4.1+ Has Other Issues With i915

The i915 modules in the 5.3 series kernels prior to 5.3.14 are buggy piles of garbage, and the i915 module in the 5.1 and 5.2 series kernels has other issues which, when combined with 19.2.x versions of the Mesa graphics stack, result in random system freezes. Going all the way back to kernel 5.0.21, which works perfectly, is a good solution users of Intel iGPUs may want to consider.

The latest 5.4.0 and 5.4.1 kernels have some completely different issues with Intel iGPUs. Testing kernel 5.4.1 on a Pentium N4200 with an Intel iGPU using the i915 module gave the appearance of a good stable system for a short while. Then it locked up, but not completely. The machine appeared to be completely frozen, but the lucky ssh port (8888, configured by Port 8888 in /etc/ssh/sshd_config) remained alive and kicking. sshd's default port 22 is, of course, also lucky, but it is not as lucky. A close-up inspection of dmesg when logged in through ssh revealed the following very sad story:

Kernel 5.4.1: Sad and depressing dmesg message regarding the i915 kernel module for Intel graphics chips.

[ 3850.907971] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0                                               
[ 3850.907977] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.                           
[ 3850.907978] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel                               
[ 3850.907980] drm/i915 developers can then reassign to the right component if it's not a kernel issue.                      
[ 3850.907982] The GPU crash dump is required to analyze GPU hangs, so please always attach it.                              
[ 3850.907984] GPU crash dump saved to /sys/class/drm/card0/error                                                            
[ 3850.909010] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0                                                            
[ 3850.909851] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3850.910274] i915 0000:00:02.0: Resetting chip for hang on rcs0                                                            
[ 3850.912105] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3850.912881] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3856.923990] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4003.996726] Asynchronous wait on fence i915:xfwm4[2783]:1f95e timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915])

That leaves the old 5.0.21 kernel and the new 5.3.14 release as viable kernels for computers with Intel iGPUs.
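
Anyone who wants to check their own logs for the same symptoms could scan for the two telltale messages in the dmesg output shown earlier. The following is a minimal sketch: the regular expressions are derived only from the wording of that excerpt (other kernel versions may phrase these messages differently), and `summarize_i915_log` is a hypothetical helper name.

```python
import re

# Patterns taken from the wording of the dmesg excerpt in this article;
# they are assumptions, not a stable kernel interface.
HANG_RE = re.compile(r"GPU HANG: ecode (?P<ecode>\S+), hang on (?P<engine>\w+)")
RESET_FAIL_RE = re.compile(r"(?P<engine>\w+) reset request timed out")


def summarize_i915_log(lines):
    """Collect reported GPU hangs and count engine resets that timed out."""
    hangs = []
    failed_resets = 0
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs.append((match.group("ecode"), match.group("engine")))
        elif RESET_FAIL_RE.search(line):
            failed_resets += 1
    return {"hangs": hangs, "failed_resets": failed_resets}


sample = [
    "[ 3850.907971] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0",
    "[ 3850.909010] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "[ 3850.909851] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: "
    "{request: 00000001, RESET_CTL: 00000001}",
]
print(summarize_i915_log(sample))
```

The same function can be fed the lines of a saved dmesg or journal dump. A hang followed by one or more failed resets matches the lock-up pattern described here.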

Update: We have tested 5.3.14 and 5.3.15 with an Intel iGPU for quite some time (regular use) and they are fine. 5.0.21 is also fine; kernels in between have problems.

5.4.0, 5.4.1 and 5.4.2 have the problem shown above. Do not use 5.3-series kernels prior to 5.3.14 or 5.4-series kernels with Intel GPUs or you will have problems.
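
The version advice above can be written down as a small lookup. This is only a sketch that encodes the test results reported in this article and nothing more; `parse_version` and `i915_status` are hypothetical helper names, and any release the article does not mention is reported as "not covered" rather than guessed at.

```python
import re


def parse_version(release):
    """Extract (major, minor, patch) from a string such as '5.3.14' or '5.4.2-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def i915_status(release):
    """Classify a kernel release using only the results reported in this article."""
    version = parse_version(release)
    series = version[:2]
    if version == (5, 0, 21):
        return "reported fine"
    if series == (5, 3):
        # 5.3.14 is the first 5.3 release carrying the userptr patch.
        return "reported fine" if version >= (5, 3, 14) else "reported broken"
    if series in ((5, 1), (5, 2)):
        return "reported broken"
    if series == (5, 4) and version <= (5, 4, 2):
        # Only 5.4.0 through 5.4.2 were tested at the time of writing.
        return "reported broken"
    return "not covered"


for release in ("5.0.21", "5.3.8-Seohyun", "5.3.14", "5.4.2-arch1-1"):
    print(release, "->", i915_status(release))
```

On a running system the input could come from `platform.release()` in the Python standard library, which returns the same string as `uname -r`.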

More Intel Problems Solved

The rest of the change-log for kernel 5.3.14 as well as the change-log for 5.4.1 is mostly filled with USB-related fixes and smaller fixes for the ever-increasing number of workarounds for bugs in Intel's CPUs.

Both 5.3.14 and 5.4.1 work great on AMD Ryzen machines with "Polaris" GPUs. 5.3.14 appears to work fine on Intel systems with Intel iGPUs; 5.4.1 and other recent kernels don't.

The latest kernels can, like always, be acquired from kernel.org.



Anonymous user #1, 8 days ago:
Unfortunately, the same errors are also found in sources 5.4.2

Anonymous user #2, 6 days ago:

Hi, I didn't have the issues mentioned here on kernel 5.[0-3].X, no freeze whatsoever, but since the 5.4.2(-arch) upgrade I have had a few UI freezes. I didn't try to ssh into my host. My GPU is Intel UHD 620 (with i7-8565U CPU). I had the enable_guc=2 and enable_fbc=1 options set. I removed these options to see whether that helps (too early to tell).

Dec 08 10:34:31 xps13 kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Dec 08 10:34:31 xps13 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Dec 08 10:34:31 xps13 kernel: Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Dec 08 10:34:31 xps13 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Dec 08 10:34:31 xps13 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Dec 08 10:34:31 xps13 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Dec 08 10:34:31 xps13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Dec 08 10:34:31 xps13 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Dec 08 10:34:31 xps13 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Dec 08 10:34:31 xps13 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Dec 08 10:34:31 xps13 kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Dec 08 10:34:31 xps13 kernel: [drm] GuC communication enabled
Dec 08 10:34:31 xps13 kernel: i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
Dec 08 10:34:31 xps13 kernel: i915 0000:00:02.0: HuC firmware i915/kbl_huc_ver02_00_1810.bin version 2.0 authenticated:yes
Dec 08 10:34:34 xps13 kernel: Asynchronous wait on fence i915:gnome-shell[1911]:1b1032 timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915])
Dec 08 10:34:39 xps13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Dec 08 10:34:41 xps13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Dec 08 10:34:49 xps13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Dec 08 10:34:51 xps13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

Anonymous user #3, 4 days ago:

Ubuntu 19.10 kernel 5.4.2 without any special boot options here:

[mar dic 10 12:09:22 2019] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[mar dic 10 12:09:22 2019] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[mar dic 10 12:09:22 2019] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[mar dic 10 12:09:22 2019] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[mar dic 10 12:09:22 2019] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[mar dic 10 12:09:22 2019] GPU crash dump saved to /sys/class/drm/card0/error
[mar dic 10 12:09:22 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

A few lockups occur; going back to 5.3.15 for now :-(

Anonymous user #1, 4 days ago:
Hello, just hit by "rcs0 reset request timed out" as well. XPS 9360 and arch linux with linux-zen 5.4.2

Anonymous user #1, 3 days ago:

Still there at 5.3.15

cat card0.err
GPU HANG: ecode 9:0:0x00000000, hang on rcs0
Kernel: 5.3.15_1 x86_64
Time: 1576070496 s 445647 us
Boottime: 1006 s 765328 us
Uptime: 1003 s 858380 us
Epoch: 4295667008 jiffies (1000 HZ)
Capture: 4295673024 jiffies; 89939 ms ago, 6016 ms after epoch
Reset count: 0
Suspend count: 0
Platform: COFFEELAKE
Subplatform: 0x0
PCI ID: 0x3e92
PCI Revision: 0x00
PCI Subsystem: 1028:085a
IOMMU enabled?: 0

Anonymous user #1, 3 days ago:
I'm a diehard AMD dude – and AMD dudes are not affected!