Kernel 5.4.1 And 5.3.14 Are Released Making Linux Users With Intel iGPUs Finally Able To Use 5.3-Series Kernels

From LinuxReviews

The Linux kernel's i915 module for Intel iGPUs has been a mess for quite some time. Reverting all the way back to kernel 5.0.21 has been one solution for low-powered Intel Goldmont "Apollo Lake" SoCs like the Pentium N4200. Kernel 5.3.14 has a patch, also included in kernel 5.4.0, which brings 5.3.x series kernels a step closer to being usable on Intel iGPUs. It makes 5.3.14 usable, but 5.4 series kernels have other issues with Intel iGPUs. There are also some fixes for USB and for all the Intel CPU-bug mitigations in both 5.4.1 and 5.3.14.

Published 2019-11-30, last edited 2020-01-11

The Acer Swift SF113-31 has an Intel "Apollo Lake" Goldmont N4200 SoC with an iGPU using the i915 module. 5.3 series Linux kernels have so far been completely unusable on this machine thanks to the i915 module screwing around with memory used by the filesystem and other important kernel functions. Kernel 5.3.14 has a patch for this. Compiling it on this machine's weak 4 core 1.5 GHz (all-core load) Pentium N4200 takes ages.

A Message Nobody Wants To See, Ever

[   50.138567] WARNING: CPU: 1 PID: 1330 at fs/ext4/inode.c:3941 ext4_set_page_dirty+0x3e/0x50
[   50.138638] CPU: 1 PID: 1330 Comm: kworker/u8:4 Not tainted 5.3.8-Seohyun #1
[   50.138639] Hardware name: Acer Swift SF113-31/ASAHI_AP_S, BIOS V1.12 03/30/2018
[   50.138700] Workqueue: i915 __i915_gem_free_work [i915]
[   50.138704] RIP: 0010:ext4_set_page_dirty+0x3e/0x50
[   50.138706] Code: 48 8b 00 a8 01 75 16 48 8b 57 08 48 8d 42 ff 83 e2 01 48 0f 44 c7 48 8b 00 a8 08 74 0d 48 8b 07 f6 c4 20 74 0f e9 92 e7 f7 ff <0f> 0b 48 8b 07 f6 c4 20 75 f1 0f 0b e9 81 e7 f7 ff 90 0f 1f 44 00
[   50.138707] RSP: 0018:ffffc1e60137fd90 EFLAGS: 00010246
[   50.138709] RAX: 0017ffe000002016 RBX: ffff9e337236a200 RCX: 0000000000000000
[   50.138710] RDX: 0000000000000000 RSI: 0000000121400000 RDI: fffff3ecc498ea40
[   50.138711] RBP: fffff3ecc498ea40 R08: 0000000121400000 R09: 0000000000000000
[   50.138712] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000001263a9
[   50.138713] R13: ffff9e3322c11b00 R14: ffff9e33367f9ca0 R15: 0000000000000000
[   50.138714] FS:  0000000000000000(0000) GS:ffff9e337ba80000(0000) knlGS:0000000000000000
[   50.138715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.138716] CR2: 000055fb74151f10 CR3: 000000013c60a000 CR4: 00000000003406e0
[   50.138717] Call Trace:
[   50.138767]  i915_gem_userptr_put_pages+0x14b/0x1e0 [i915]
[   50.138812]  __i915_gem_object_put_pages+0x5b/0xa0 [i915]
[   50.138854]  __i915_gem_free_objects+0x124/0x230 [i915]
[   50.138898]  __i915_gem_free_work+0x64/0x90 [i915]
[   50.138902]  process_one_work+0x199/0x340
[   50.138905]  worker_thread+0x4e/0x3b0
[   50.138907]  kthread+0xfc/0x130
[   50.138910]  ? process_one_work+0x340/0x340
[   50.138912]  ? kthread_park+0x80/0x80
[   50.138915]  ret_from_fork+0x35/0x40
[   50.138919] ---[ end trace ca5ea2ec07e00336 ]---

Many GNU/Linux users with Intel iGPUs, including myself, have enjoyed the fine kernel message above when trying 5.3 series Linux kernels. A lot of GNU/Linux distributions have been pushing those 5.3-series kernels upon their users with sad and depressing results. The root cause was the i915 kernel module telling the kernel to drop writes to memory areas other kernel modules were writing to. That is a total scandal which leads to total system crashes and hangs and potentially worse consequences like file system corruption. The i915 module is used by all Intel graphics; it is named i915 for historical reasons (they might as well rename it intel-gfx).

Kernel developer Chris Wilson had this to say about the above kernel message:

"Yikes. That shows that code was inherently more buggy than I thought, as it was causing us to drop writes to pages we didn't own (but thought we did).

The root cause of the warn and ext4 bug is the lack of lock_page around set_page_dirty in userptr_put_pages. We tried putting a lock there, but we recurse into userptr_put_pages from underneath locked pages..."

Linux 5.4-rc1 got a rather ugly band-aid patch for the i915 module's unacceptable behavior. That patch was included in the newly released Linux kernel 5.3.14. Chris Wilson had this to say about it:

"set_page_dirty says:

"For pages with a mapping this should be done under the page lock for the benefit of asynchronous memory errors who prefer a consistent dirty state. This rule can be broken in some special cases, but should be better not to."

Under those rules, it is only safe for us to use the plain set_page_dirty calls for shmemfs/anonymous memory. Userptr may be used with real mappings and so needs to use the locked version (set_page_dirty_lock).

However, following a try_to_unmap() we may want to remove the userptr and so call put_pages(). However, try_to_unmap() acquires the page lock and so we must avoid recursively locking the pages ourselves -- which means that we cannot safely acquire the lock around set_page_dirty(). Since we can't be sure of the lock, we have to risk skip dirtying the page, or else risk calling set_page_dirty() without a lock and so risk fs corruption."

It sounds a lot like some memory leaks (no longer used memory not being freed) in the i915 module were accepted as a "solution" to the i915 module freeing random pieces of memory it didn't own. That's more like a band-aid than a solid and sustainable solution.
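
The trade-off in the quote above (skip dirtying the page when its lock cannot be taken, rather than risk a recursive lock) can be illustrated with a toy model in user-space Python. This is only a sketch of the general "try the lock or skip" pattern, not the kernel patch itself; the `Page` class and the `put_page` function are hypothetical names made up for this illustration.

```python
import threading


class Page:
    """Toy stand-in for a kernel page: a lock plus a dirty flag (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like a page lock
        self.dirty = False


def put_page(page):
    """Mark the page dirty only if its lock can be taken without blocking.

    Returns True if the page was dirtied, False if dirtying was skipped
    because the lock is already held (possibly by our own caller).
    """
    if page.lock.acquire(blocking=False):
        try:
            page.dirty = True
        finally:
            page.lock.release()
        return True
    return False


free_page = Page()
print(put_page(free_page), free_page.dirty)      # lock was free: page gets dirtied

held_page = Page()
with held_page.lock:                             # simulates a caller already holding the page lock
    print(put_page(held_page), held_page.dirty)  # dirtying is skipped, no deadlock
```

The second call shows the cost of the approach: the deadlock is avoided, but the page is never marked dirty, which is the risk the quote describes.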

The patch makes Linux 5.3-series kernels from 5.3.14 on work with Intel iGPUs without major issues. The same is not true for 5.4-series kernels...

Kernel 5.4.1+ Has Other Issues With i915

The i915 modules in the 5.3 series kernels prior to 5.3.14 are buggy piles of garbage, and the i915 modules in the 5.1 and 5.2 series kernels have other issues which, when combined with 19.2.x versions of the Mesa graphics stack, result in random system freezes. Going all the way back to kernel 5.0.21, which works perfectly, is a good solution users of Intel iGPUs may want to consider.

The latest 5.4.0 and 5.4.1 kernels have some completely different issues with Intel iGPUs. Testing kernel 5.4.1 on a Pentium N4200 with an Intel iGPU using the i915 module gave the appearance of a good, stable system for a short while. Then it locked up - but not completely. The machine appeared to be completely frozen, but the lucky ssh port (8888, configured by Port 8888 in /etc/ssh/sshd_config) remained alive and kicking. sshd's default port 22 is, of course, also lucky, but it is not as lucky. A close-up inspection of dmesg when logged in through ssh revealed the following very sad story:

Kernel 5.4.1: Sad and depressing dmesg message regarding the i915 kernel module for Intel graphics chips.

[ 3850.907971] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0                                               
[ 3850.907977] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.                           
[ 3850.907978] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel                               
[ 3850.907980] drm/i915 developers can then reassign to the right component if it's not a kernel issue.                      
[ 3850.907982] The GPU crash dump is required to analyze GPU hangs, so please always attach it.                              
[ 3850.907984] GPU crash dump saved to /sys/class/drm/card0/error                                                            
[ 3850.909010] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0                                                            
[ 3850.909851] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3850.910274] i915 0000:00:02.0: Resetting chip for hang on rcs0                                                            
[ 3850.912105] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3850.912881] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 3856.923990] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4003.996726] Asynchronous wait on fence i915:xfwm4[2783]:1f95e timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915])
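
When a machine is only reachable over ssh, it can help to pull the hang reports out of a long dmesg dump automatically. The following is a minimal sketch that assumes the "GPU HANG" line format shown in the log above; the `find_gpu_hangs` function and the `HANG_RE` pattern are hypothetical names made up for this example, and other kernel versions may word the message differently.

```python
import re

# Pattern for the "GPU HANG" line format seen in the log above (an assumption:
# other kernel versions may print this message differently).
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+i915\s+(?P<device>\S+):\s+GPU HANG:\s+"
    r"ecode\s+(?P<ecode>\S+),\s+hang on\s+(?P<engine>\w+)"
)


def find_gpu_hangs(dmesg_text):
    """Return a list of dicts, one per i915 GPU HANG line in dmesg_text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs.append(
                {
                    "time": float(match.group("time")),
                    "device": match.group("device"),
                    "ecode": match.group("ecode"),
                    "engine": match.group("engine"),
                }
            )
    return hangs


# Two lines copied from the log above; only the first is a hang report.
sample = (
    "[ 3850.907971] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0\n"
    "[ 3850.909010] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0\n"
)
for hang in find_gpu_hangs(sample):
    print(hang)
```

On a live system the text could come from saved `dmesg` output instead of the `sample` string. The crash dump itself is, according to the log, saved to /sys/class/drm/card0/error.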

That leaves the old 5.0.21 kernel and the new 5.3.14 release as viable kernels for computers with Intel iGPUs.

Update: We have tested 5.3.14 and 5.3.15 on a machine with an Intel iGPU for quite some time (regular use) and they are better but not anywhere near perfect or usable. Going back to 5.0.21 or updating to 5.5 when it is released are viable solutions.

5.4.0, 5.4.1 and 5.4.2 have the problem shown above. Do not use 5.3-series kernels or 5.4-series kernels with Intel GPUs or you will have problems.
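
The kernel versions discussed in this article can be summed up in a small lookup. This is only a sketch of the findings reported here for one Pentium N4200 machine, not an authoritative compatibility table; the `parse_version` and `i915_status` functions are hypothetical helpers made up for this example.

```python
import re


def parse_version(release):
    """Turn '5.3.14' or '5.3.8-Seohyun' into a tuple of ints like (5, 3, 14)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def i915_status(release):
    """Summarise this article's findings for a kernel release string."""
    version = parse_version(release)
    series = version[:2]
    if version == (5, 0, 21):
        return "reported to work well"
    if series in ((5, 1), (5, 2)):
        return "random freezes reported with Mesa 19.2.x"
    if series == (5, 3):
        if version < (5, 3, 14):
            return "ext4_set_page_dirty warnings, crashes and hangs"
        return "patched, better but still not trouble-free"
    if series == (5, 4) and version <= (5, 4, 2):
        return "GPU hangs on rcs0"
    return "not covered by this article"


for release in ("5.0.21", "5.3.8-Seohyun", "5.3.14", "5.4.1"):
    print(release, "->", i915_status(release))
```

On a running system, `platform.release()` from Python's standard library returns a release string that could be passed to `i915_status`.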

More Intel Problems Solved

The rest of the change-log for kernel 5.3.14, as well as the change-log for 5.4.1, is mostly filled with USB-related fixes and smaller fixes for the ever-increasing number of workarounds for bugs in Intel's CPUs.

Both 5.3.14 and 5.4.1 work great on AMD Ryzen machines with "Polaris" GPUs. 5.3.14 appears to work fine on Intel systems with Intel iGPUs; 5.4.1 and other recent kernels don't.

The latest kernels can, like always, be acquired from kernel.org.

