Kernel hang of unknown cause

sdyqs · 2024-03-26 07:41

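Since upgrading to KDE 6, my laptop keeps freezing for no apparent reason. It starts with the network becoming unusable. Clicking the network manager icon in the system tray makes plasmashell hang, and every program that touches the network hangs and cannot be quit, whether it is a browser or a command-line tool such as curl. After a while, even commands like ls start hanging in any directory and cannot be interrupted with Ctrl+C. Running them under strace shows they are stuck in an ioctl system call. Later still, everything except the mouse cursor freezes. At that point the machine cannot be shut down: after switching to a tty, both poweroff and halt hang, so the only way out is the power button. dmesg offers three clues: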

First, shortly before the network fails, there are many downshift messages:

Generic FE-GE Realtek PHY r8169-0-301:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[ 1753.808368] r8169 0000:03:00.1 eth0: Link is Up - 100Mbps/Full (downshifted) - flow control rx/tx

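These messages repeat many times, and the KDE desktop shows the Ethernet connection flapping between connected and disconnected. After that the kernel reports a BUG: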

[101813.990261] BUG: unable to handle page fault for address: 000000000000115e
[101813.990269] #PF: supervisor read access in kernel mode
[101813.990272] #PF: error_code(0x0000) - not-present page
[101813.990275] PGD 0 P4D 0
[101813.990279] Oops: 0000 [#1] PREEMPT SMP NOPTI
[101813.990282] CPU: 10 PID: 15323 Comm: kworker/10:2 Tainted: P        W  OE      6.8.1-1-default #1 openSUSE Tumbleweed a408dede100ecd8172a7eae2d0778227ac69e46d

At this point the computer starts showing the symptoms described above. What follows is a flood of repeated workqueue lockup messages:

[101856.953486] BUG: workqueue lockup - pool cpus=10 node=0 flags=0x0 nice=0 stuck for 42s!
[101856.953513] Showing busy workqueues and worker pools:
[101856.953517] workqueue events: flags=0x0
[101856.953522]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
[101856.953529]     pending: delayed_vfree_work, kfree_rcu_monitor, kernfs_notify_workfn
[101856.953547] workqueue events_unbound: flags=0x2
[101856.953554]   pwq 32: cpus=0-15 flags=0x4 nice=0 active=2/512 refcnt=4
[101856.953560]   pwq 32: cpus=0-15 flags=0x4 nice=0 active=2/512 refcnt=3
[101856.953565]     in-flight: 19069:fsnotify_connector_destroy_workfn fsnotify_connector_destroy_workfn, 7379:fsnotify_mark_destroy_workfn fsnotify_mark_destroy_workfn BAR(20178)
[101856.953583] workqueue rcu_gp: flags=0x8
[101856.953588]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[101856.953592]     pending: process_srcu
[101856.953600] workqueue mm_percpu_wq: flags=0x8
[101856.953604]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=2/256 refcnt=4
[101856.953608]     pending: vmstat_update, lru_add_drain_per_cpu BAR(135)
[101856.953618] workqueue pm: flags=0x4
[101856.953623]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[101856.953626]     pending: pm_runtime_work
[101856.953632] workqueue cgroup_destroy: flags=0x0
[101856.953636]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=1/1 refcnt=2
[101856.953640]     in-flight: 21110:css_free_rwork_fn
[101856.953672] workqueue usb_hub_wq: flags=0x4
[101856.953677]   pwq 20: cpus=10 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[101856.953681]     pending: 2*hub_event [usbcore]
[101856.953726] workqueue gfx_low: flags=0xa0002
[101856.953731]   pwq 32: cpus=0-15 flags=0x4 nice=0 active=1/1 refcnt=19
[101856.953734]     pending: drm_sched_free_job_work [gpu_sched]
[101856.953743]     inactive: drm_sched_run_job_work [gpu_sched]
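
To make it quicker to see which of the three clues shows up in a given log, here is a rough sketch that counts the matching lines in a saved copy of the kernel log. It assumes the log was saved to a file beforehand (for example with `sudo dmesg > dmesg.txt`); the function name `summarize_dmesg` and the file name are only illustrative.

```shell
#!/usr/bin/env bash
# Sketch: count the three kinds of messages quoted above in a saved kernel log.
# Assumption: the log was saved to a plain-text file first; the function name
# summarize_dmesg is made up for this example.

summarize_dmesg() {
    local log="$1"
    # grep -c prints the number of matching lines (0 when nothing matches).
    printf 'downshift: %s\n' "$(grep -c 'Downshift occurred' "$log")"
    printf 'page_fault: %s\n' "$(grep -c 'BUG: unable to handle page fault' "$log")"
    printf 'workqueue_lockup: %s\n' "$(grep -c 'BUG: workqueue lockup' "$log")"
}
```

Calling `summarize_dmesg dmesg.txt` prints one count per line. The counts only tell you which messages are present, not what caused them.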

I have put the relevant excerpts here: Mozilla Community Pastebin/Vwe85ras (C)

My system information:

Operating System: openSUSE Tumbleweed 20240320
KDE Plasma Version: 6.0.2
KDE Frameworks Version: 6.0.0
Qt Version: 6.6.2
Kernel Version: 6.8.1-1-default (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 7 5800H with Radeon Graphics
Memory: 27.3 GiB of RAM

lilydjwg · 2024-03-26 10:52

I have seen this before: [SOLVED] Repeated kernel problems/freezes since 6.7.6 / Kernel & Hardware / Arch Linux Forums
I have heard the LTS kernel is not affected.

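sdyqs · 2024-03-26 12:39

I had a look, and the machine in that thread is also a Lenovo, and not a particularly new one.

By the way, my machine has not been cleaned or serviced in the two or three years since I bought it. Could that cause this kind of inexplicable failure?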

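lilydjwg · 2024-03-26 13:04

No. Several people in the Arch chat group have already run into this problem. The guess is that it is related to the nvidia driver and the kernel version.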

sdyqs · 2024-03-26 13:06

All right, maybe I can go back to the older kernel from before the upgrade for now. Thanks.

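sdyqs · 2024-03-28 13:06

I tinkered with it a bit and installed the LTS kernel with sudo zypper in kernel-longterm kernel-longterm-devel. Note that on the next boot you have to pick the kernel version in GRUB. If you want to use VirtualBox, you also need to install the virtualbox-host-source package and run sudo /sbin/vboxconfig.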

The LTS kernel I have installed at the moment is 6.6.22-1-longterm.
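
For reference, the steps above can be sketched as a small script. The commands and package names are the ones mentioned in this post; the `run` wrapper and the `DRY_RUN` variable are my own additions so that the script only prints the commands by default, and the function names are illustrative. I am assuming vboxconfig should be run after booting into the longterm kernel, since it builds the modules for the running kernel.

```shell
#!/usr/bin/env bash
# Sketch of the steps from this post. The run/DRY_RUN wrapper and the function
# names are assumptions added for illustration, not part of zypper or VirtualBox.

run() {
    # With DRY_RUN=1 (the default here) only print the command; otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

install_longterm_kernel() {
    # Installs the LTS kernel; it still has to be selected in GRUB on the next boot.
    run sudo zypper in kernel-longterm kernel-longterm-devel
}

rebuild_virtualbox_modules() {
    # Only needed for VirtualBox; assumed to run after rebooting into the LTS kernel.
    run sudo zypper in virtualbox-host-source
    run sudo /sbin/vboxconfig
}
```

Running `install_longterm_kernel` as is prints the command; `DRY_RUN=0 install_longterm_kernel` actually executes it. After rebooting and choosing the new entry in GRUB, `uname -r` shows which kernel is running.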