Linux 5.9-rc7 Is A Total Disaster On Machines With Intel Graphics

From LinuxReviews

The latest Linux 5.9 release candidate won't even let you start the X display server on machines with integrated Intel graphics. Running Linux on machines with integrated Intel graphics has been problematic since Linux 5.0, and all of those problems were still present in Linux 5.9-rc6. Linux 5.9-rc7 takes it one step further: it won't even let you get into a graphical environment without crashing the i915 kernel display driver for Intel GPUs. It is a complete and utter disaster for people using integrated Intel graphics.

written by 윤채경 (Yoon Chae-kyung)  2020-09-28 - last edited 2020-09-29. © CC BY

An Acer Swift notebook compiling a Linux kernel release candidate a few weeks ago, probably 5.9-rc5. It has 5.9-rc7 now. Taking a picture of this machine running X isn't even possible on that kernel.

Using Linux on machines with Intel integrated graphics, especially low-powered ones, has been very problematic since Linux 5.0 was released. Many laptops and notebooks will randomly or immediately hang without the ahci.mobile_lpm_policy=1 kernel boot parameter due to problems related to SATA controller power management. Many low-powered laptops and notebooks will also randomly hang if you do not set intel_idle.max_cstate=1 as a kernel boot parameter. Some Intel-powered machines will randomly hang without one or the other, and some, like the Acer Swift 1, will randomly hang unless both are set. These are not new problems. They have been around for a long time, and they were still present in Linux 5.9-rc6. Linux 5.9-rc7 has a completely new and far worse problem: you can't even start X or Wayland.
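Whether a running system was actually booted with these parameters can be checked by reading /proc/cmdline. The following is a minimal sketch in C and not anything from the kernel tree; the helper names cmdline_has_param and read_kernel_cmdline are made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if `param` appears in `cmdline` as a whole, space-separated
 * token. Hypothetical helper, written for this illustration only.
 */
bool cmdline_has_param(const char *cmdline, const char *param)
{
    size_t len = strlen(param);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, param)) != NULL) {
        /* The match must start at the beginning or right after a space... */
        bool starts_token = (p == cmdline) || p[-1] == ' ';
        /* ...and must end at a space, a newline or the end of the string. */
        bool ends_token = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts_token && ends_token)
            return true;
        p += 1; /* keep searching after this partial match */
    }
    return false;
}

/* Read the boot command line of the running kernel into `buf`. */
bool read_kernel_cmdline(char *buf, size_t size)
{
    FILE *f = fopen("/proc/cmdline", "r");
    bool ok;

    if (f == NULL)
        return false;
    ok = fgets(buf, (int)size, f) != NULL;
    fclose(f);
    return ok;
}
```

Calling read_kernel_cmdline() and then cmdline_has_param(buf, "intel_idle.max_cstate=1") tells you whether that workaround is active on the current boot. Matching whole tokens avoids mistaking a value such as max_cstate=10 for max_cstate=1.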

Linux 5.9-rc7 will boot, start X and hang on all the Intel-powered machines we have tested thus far. It does not just hang on one or two Intel-powered machines; Linux 5.9-rc7 is a complete and utter disaster on all of them. The only machines we have managed to start X on using Linux 5.9-rc7 are machines with AMD processors and AMD graphics cards. This is an Intel-only problem. Concretely, the crash is triggered through the i915 Linux kernel driver for Intel graphics.

Disabling the X display server on machines with Intel integrated graphics does allow them to boot just fine. Everything appears to be well and dandy. The kernel's ring buffer, which you can read by typing dmesg into a terminal, has nothing unusual to say. The only mentions of the i915 kernel driver in the Linux 5.9-rc7 ring buffer before X or anything else graphical is started are fairly normal:

[    0.845332] i915 0000:00:02.0: vgaarb: deactivate vga console
[    0.849219] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    0.866121] [drm] Initialized i915 1.6.0 20200715 for 0000:00:02.0 on minor 0
[    0.868613] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)

All is well until X, or anything else using graphics, is started.

A very sad story presented in the Linux kernel's ring buffer on Linux 5.9-rc7 after trying to start X on a machine with integrated Intel graphics.

The kernel ring buffer becomes far more aggressive once X is started on a machine, any machine, using Intel graphics. This is the kernel ring buffer story after X is started, and immediately freezes, on a Lenovo G50-80 laptop which works just fine with Linux 5.9-rc6 (as long as the ahci.mobile_lpm_policy=1 kernel parameter is used, anyway):

[ 2319.340363] BUG: kernel NULL pointer dereference, address: 0000000000000064
[ 2319.340367] #PF: supervisor write access in kernel mode
[ 2319.340368] #PF: error_code(0x0002) - not-present page
[ 2319.340370] PGD 0 P4D 0 
[ 2319.340372] Oops: 0002 [#1] SMP NOPTI
[ 2319.340375] CPU: 0 PID: 3877 Comm: kworker/u9:1 Tainted: G          I       5.9.0-rc7-01-Eunseo #1
[ 2319.340376] Hardware name: LENOVO 80E5/Lenovo G50-80, BIOS B0CN95WW 07/31/2015
[ 2319.340382] Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker
[ 2319.340386] RIP: 0010:__get_user_pages_remote+0xa1/0x2c0
[ 2319.340389] Code: df 01 00 00 83 7d 00 01 0f 85 d7 01 00 00 f7 c1 00 00 04 00 0f 84 42 01 00 00 65 48 8b 04 25 00 6d 01 00 48 8b 80 58 07 00 00 <c7> 40 64 01 00 00 00 65 48 8b 04 25 00 6d 01 00 48 c7 44 24 18 00
[ 2319.340390] RSP: 0018:ffffa4ef812dfde8 EFLAGS: 00010206
[ 2319.340392] RAX: 0000000000000000 RBX: 00007f02bf800000 RCX: 0000000000040001
[ 2319.340393] RDX: 00000000000007e9 RSI: 00007f02bf800000 RDI: ffff912d873a9800
[ 2319.340395] RBP: ffffa4ef812dfe5c R08: ffff912cc79bc000 R09: 0000000000000000
[ 2319.340396] R10: 0000000000023b98 R11: 0000000000000006 R12: ffff912d873a9800
[ 2319.340397] R13: ffff912cc79bc000 R14: ffff912d87ebed80 R15: 0000000000042003
[ 2319.340399] FS:  0000000000000000(0000) GS:ffff912d8ec00000(0000) knlGS:0000000000000000
[ 2319.340400] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2319.340401] CR2: 0000000000000064 CR3: 00000003878d2002 CR4: 00000000003706f0
[ 2319.340402] Call Trace:
[ 2319.340407]  __i915_gem_userptr_get_pages_worker+0xc8/0x260
[ 2319.340411]  process_one_work+0x186/0x2e0
[ 2319.340413]  worker_thread+0x4b/0x3a0
[ 2319.340415]  ? rescuer_thread+0x340/0x340
[ 2319.340418]  kthread+0x111/0x130
[ 2319.340421]  ? __kthread_bind_mask+0x60/0x60
[ 2319.340424]  ret_from_fork+0x1f/0x30
[ 2319.340426] Modules linked in: rfcomm xt_DSCP xt_length iptable_mangle nf_conntrack_irc nf_conntrack_sip iptable_raw xt_CT nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rt xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc ecdh_generic ecc at24 iTCO_wdt intel_rapl_msr mei_hdcp intel_pmc_bxt iTCO_vendor_support intel_rapl_common iwlmvm x86_pkg_temp_thermal snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic mac80211 intel_powerclamp ledtrig_audio coretemp libarc4 snd_hda_intel iwlwifi snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core pcspkr snd_seq i2c_i801 joydev i2c_smbus snd_seq_device cfg80211 snd_pcm snd_timer mei_me snd ideapad_laptop mei sparse_keymap soundcore lpc_ich wmi rfkill zram ip_tables dm_crypt rtsx_usb_sdmmc mmc_core crct10dif_pclmul
[ 2319.340453]  crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw r8169 rtsx_usb fuse
[ 2319.340458] CR2: 0000000000000064
[ 2319.340460] ---[ end trace a9ced42a65bde482 ]---
[ 2319.340463] RIP: 0010:__get_user_pages_remote+0xa1/0x2c0
[ 2319.340465] Code: df 01 00 00 83 7d 00 01 0f 85 d7 01 00 00 f7 c1 00 00 04 00 0f 84 42 01 00 00 65 48 8b 04 25 00 6d 01 00 48 8b 80 58 07 00 00 <c7> 40 64 01 00 00 00 65 48 8b 04 25 00 6d 01 00 48 c7 44 24 18 00
[ 2319.340466] RSP: 0018:ffffa4ef812dfde8 EFLAGS: 00010206
[ 2319.340467] RAX: 0000000000000000 RBX: 00007f02bf800000 RCX: 0000000000040001
[ 2319.340469] RDX: 00000000000007e9 RSI: 00007f02bf800000 RDI: ffff912d873a9800
[ 2319.340470] RBP: ffffa4ef812dfe5c R08: ffff912cc79bc000 R09: 0000000000000000
[ 2319.340471] R10: 0000000000023b98 R11: 0000000000000006 R12: ffff912d873a9800
[ 2319.340472] R13: ffff912cc79bc000 R14: ffff912d87ebed80 R15: 0000000000042003
[ 2319.340474] FS:  0000000000000000(0000) GS:ffff912d8ec00000(0000) knlGS:0000000000000000
[ 2319.340475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2319.340476] CR2: 0000000000000064 CR3: 00000003878d2002 CR4: 00000000003706f0

It is possible to ssh in and see what's going on when this happens. There is no hope of terminating or restarting the X process, but you can see the fine kernel ring buffer story outlined above.
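When logged in over ssh, the telling line can be picked out of the dmesg output programmatically. Below is a rough sketch in C that assumes the oops line has exactly the wording shown in the log above; the function name parse_null_deref_address is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * If `line` is a "NULL pointer dereference" oops line, store the faulting
 * address in `*addr` and return true. Hypothetical helper for illustration;
 * the marker text is copied from the log in this article.
 */
bool parse_null_deref_address(const char *line, unsigned long long *addr)
{
    static const char marker[] =
        "BUG: kernel NULL pointer dereference, address: ";
    const char *p = strstr(line, marker);

    if (p == NULL)
        return false;

    /* The address is printed as hexadecimal digits without a 0x prefix. */
    return sscanf(p + sizeof(marker) - 1, "%llx", addr) == 1;
}
```

A small faulting address like the one in this log usually means that a structure member was accessed through a NULL pointer, with the address being the member's offset within the structure.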

Linux kernel scientist Jason A. Donenfeld, who got to experience this very unfortunate regression relative to Linux 5.9-rc6 when he booted his machine with 5.9-rc7, has looked into the cause of this complete and utter disaster.

"Alright, the failing code seems to be in mm:

        if (flags & FOLL_PIN)
                atomic_set(&current->mm->has_pinned, 1);

Apparently you can't rely on current->mm being valid in this context; it's null here, hence the +0x64 for has_pinned's offset.

This was added by 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned"), which is new for rc7 indeed.

The crash goes away when changing that to:

        if ((flags & FOLL_PIN) && current->mm)
                atomic_set(&current->mm->has_pinned, 1);

But I haven't really evaluated whether or not that's racy or if I need to take locks to do such a thing."

Jason A. Donenfeld on the LKML, Monday September 28th, 2020
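The reasoning in that message can be reproduced in userspace. The oops above was raised in a kworker thread, and kernel threads have no user address space of their own, so current->mm is NULL there. Writing to a member that sits 0x64 bytes into a structure through a NULL pointer faults at address 0x64. The sketch below is only an illustration: struct fake_mm, struct fake_task, DEMO_FOLL_PIN and mark_pinned are invented stand-ins and not the kernel's definitions, and the padding is chosen so that has_pinned lands at offset 0x64.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value; an assumption, not taken from kernel headers. */
#define DEMO_FOLL_PIN 0x40000u

/* Stand-in for mm_struct: only the offset of has_pinned matters here. */
struct fake_mm {
    char other_fields[0x64]; /* padding so has_pinned sits at offset 0x64 */
    atomic_int has_pinned;
};

/* Stand-in for task_struct: mm is NULL for a kernel thread. */
struct fake_task {
    struct fake_mm *mm;
};

/* Offset a NULL-based write to has_pinned would fault at. */
size_t has_pinned_offset(void)
{
    return offsetof(struct fake_mm, has_pinned);
}

/*
 * Guarded variant in the spirit of the quoted change: only touch
 * has_pinned when the task actually has an mm. Returns true if the
 * flag was recorded.
 */
bool mark_pinned(struct fake_task *task, unsigned int flags)
{
    if ((flags & DEMO_FOLL_PIN) && task->mm != NULL) {
        atomic_store(&task->mm->has_pinned, 1);
        return true;
    }
    return false;
}
```

Without the task->mm check, calling mark_pinned() on a task whose mm is NULL would attempt the same kind of write that crashed the kernel. As the quoted message itself notes, whether such a check is sufficient in the kernel (locking, races) is a separate question that this sketch does not address.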

Linus Torvalds has been made aware of this very serious regression in Linux 5.9-rc7 and the relevant code change required to fix it. It is therefore quite possible that Linux 5.9 will be released after an rc8. Linus Torvalds has already pushed the final 5.9 release back one week due to other issues during the 5.9 release cycle.

It is possible that Linus Torvalds will manage to get a Linux 5.9 release that does not break every single machine using Intel integrated graphics out in two weeks. It is too early to rule the final 5.9 release out for Intel graphics users. Linux 5.9-rc7 is merely a release candidate, and release candidates will occasionally contain minor bugs. This particular bug isn't minor. Linus Torvalds has said that "WE DO NOT BREAK USERSPACE" time and time again over the last decade. It is very unfortunate that he greatly dishonored himself and the entire kernel community by releasing such a buggy utter disgrace of a kernel release candidate. Linux 5.9 and 5.10 need to be outright amazing for there to be any chance of him ever recovering from this blunder.

Chaekyung

15 months ago

There are some, primarily non-laptop, Intel machines that don't just randomly hang without one or more kernel parameters. Both my laptop and my notebook randomly hang without one or two. If you look at older articles like Linux Kernel 5.5 Will Not Fix The Frequent Intel GPU Hangs In Recent Kernels from January you'll see a very steady stream of comments from people who have issues & are searching for solutions.

The machine you see in the picture below this story's headline is one of many that hang without ahci.mobile_lpm_policy=1 _and_ intel_idle.max_cstate=1 on any 5.x kernel, including 5.9-rc6. I tried disabling them on that one to see if it's still a problem and it is. I didn't try with 5.9-rc7 since that one can't even load X.