What could possibly be worse than an almost unbeatable boss in
a game, or a tough maze that consumes hours of gameplay with
little progress? How about a Linux kernel crash that makes you
lose all your game progress for no apparent reason and with no
feedback? Though rare, it is a real possibility, and an annoying
one for gamers, given that Linux is used more and more as a
platform for playing games.
Some technologies are available to collect logs and give feedback
to the user when such disastrous events happen, mostly related
to kernel crash handling mechanisms. The main ones available
are kdump and pstore, but there is still work to be done in
this area.
In this talk we're going to present the basics of kernel
crash handling: how a kernel panic might happen, how to
deal with it (with an overall discussion of the kdump and
pstore technologies), and the kdumpst tool, developed specifically
to deal with this situation on the Steam Deck (and generically on
Arch Linux). We're also going to discuss some missing
pieces and ideas to make it even less likely that gamers need to
complain that their device just hung for no reason!
FOSForums 2023
Aug 26 - Aug 27, 2023
Institute of Computing, State University of Campinas (Unicamp)
Campinas, São Paulo, Brazil
https://www.fosforums.org/
To crash or not to crash: if you do, at least recover fast!
1. To crash or not to crash:
If you do, at least recover fast!
Guilherme G. Piccoli
2023-08-26 / Fosforums
2. Bio
Always loved computers, since childhood
BSc Mathematics, MSc Computer Science (IC/Unicamp!)
IBM/LTC
Linux kernel, PPC64, kdump, PCI
Canonical
Sustaining eng., kernel, Ubuntu, kdump
Igalia
Cooperative, kdump, x86, btrfs
3. What's a kernel crash / panic?
The Linux kernel is regular software in the end
Written in C (and a bit of Rust)
Oops is not (necessarily) a panic
NULL pointers (concurrency / SMP), lockups
(soft/hard), sysrq
OOM (out-of-memory), invalid memory, BUG() call
4. Portrait of a kernel panic
amdgpu 0000:04:00.0: amdgpu: SMU is initialized successfully!
[drm] failed to load ucode VCN0_RAM(0x3A)
[drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0xF
amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 6 PID: 685 Comm: systemd-udevd Not tainted 6.1.2-valve1-1-neptune-6
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
RSP: 0018:ffffba3fc0f6fa90 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9373dca69b10 RCX: ffff9373d51db380
RDX: 0000000000000001 RSI: ffff9373d51db3a8 RDI: ffff9373dca69b00
CR2: 0000000000000090 CR3: 000000010bdf6000 CR4: 0000000000350ee0
Call Trace:
amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x33/0x3c0 [amdgpu]
driver_probe_device+0x1f/0x90
driver_register+0x8d/0xe0
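
A log like the one above can be triaged mechanically. The following is a minimal sketch, not a tool from the talk: the function name `summarize_oops` and the regular expression are illustrative assumptions, and the input is an excerpt of the log on this slide. It pulls out the faulting address, the RIP symbol, and the call trace frames.

```python
import re

# Excerpt copied from the log shown on this slide.
OOPS_LOG = """\
BUG: kernel NULL pointer dereference, address: 0000000000000090
#PF: supervisor write access in kernel mode
Oops: 0002 [#1] PREEMPT SMP NOPTI
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
Call Trace:
amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x33/0x3c0 [amdgpu]
driver_probe_device+0x1f/0x90
driver_register+0x8d/0xe0
"""

# Matches "symbol+0xoffset/0xsize", optionally followed by "[module]".
FRAME_RE = re.compile(
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def summarize_oops(log):
    """Hypothetical helper: extract the key fields of an oops report."""
    summary = {"fault_address": None, "rip": None, "call_trace": []}
    in_trace = False
    for raw in log.splitlines():
        line = raw.strip()
        if line.startswith("BUG:") and "address:" in line:
            addr = line.rsplit("address:", 1)[1].strip()
            summary["fault_address"] = int(addr, 16)
        elif line.startswith("RIP:"):
            match = FRAME_RE.search(line)
            if match:
                summary["rip"] = match.groupdict()
        elif line.startswith("Call Trace:"):
            in_trace = True
        elif in_trace:
            match = FRAME_RE.search(line)
            if match:
                summary["call_trace"].append(match.groupdict())
    return summary


if __name__ == "__main__":
    info = summarize_oops(OOPS_LOG)
    print(hex(info["fault_address"]), info["rip"]["symbol"])
    for frame in info["call_trace"]:
        print("  ", frame["symbol"], frame["module"] or "(built-in)")
```

Here the small fault address (0x90) next to a RIP inside `drm_sched_fini` suggests a field access through a NULL structure pointer, which matches the "NULL pointer dereference" line.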
5. And how does it happen?
Kernel detects a bad condition
Panic on oops (or live with the oops at your own risk)
Other settings like panic_on_oom, or on lockups, etc
Arch-dependent, "abrupt" condition usually
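
The settings mentioned above are exposed as sysctls under `/proc/sys`. As a rough sketch (the helper name `read_panic_sysctls` and the selection of knobs are assumptions made for illustration; some of these files are absent depending on kernel configuration), the current policy of a machine could be inspected like this:

```python
from pathlib import Path

# Panic-related sysctls, as paths relative to /proc/sys.
# Not every kernel exposes all of them (it depends on the config).
PANIC_SYSCTLS = [
    "kernel/panic",            # seconds to wait before rebooting after a panic
    "kernel/panic_on_oops",    # turn an oops into a panic
    "kernel/panic_on_warn",    # turn a WARN() into a panic
    "kernel/softlockup_panic",
    "kernel/hardlockup_panic",
    "kernel/hung_task_panic",
    "vm/panic_on_oom",
]


def read_panic_sysctls(root="/proc/sys"):
    """Hypothetical helper: return {sysctl name: value, or None if unavailable}."""
    values = {}
    for name in PANIC_SYSCTLS:
        key = name.replace("/", ".")
        try:
            values[key] = (Path(root) / name).read_text().strip()
        except OSError:
            values[key] = None  # not present or not readable on this system
    return values


if __name__ == "__main__":
    for key, value in read_panic_sysctls().items():
        print(f"{key} = {value if value is not None else '(unavailable)'}")
```

The `root` parameter exists only so the function can be pointed at a test directory instead of the real `/proc/sys`.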
6. Panic (over-simplified) code flow
local IRQ and preempt disable
dump stack
kdump?
YES: crash_kexec(), then arch code / kexec
NO: disable the other CPUs, then panic notifiers and kmsg_dump(), then arch code / reboot
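
The branching in this diagram can be written down as a toy model. This is only a sketch of the slide's simplified flow, not kernel code: `panic_flow` is a hypothetical name, and the real `panic()` has more cases (for example, the `crash_kexec_post_notifiers` parameter changes the ordering).

```python
def panic_flow(kdump_loaded):
    """Toy model of the over-simplified panic flow on this slide."""
    steps = ["local IRQ and preempt disable", "dump stack"]
    if kdump_loaded:
        # A crash kernel is loaded: jump to it as early as possible.
        steps += ["crash_kexec()", "arch code / kexec"]
    else:
        # No crash kernel: stop the other CPUs, run notifiers, dump the log.
        steps += [
            "disable the other CPUs",
            "panic notifiers and kmsg_dump()",
            "arch code / reboot",
        ]
    return steps


if __name__ == "__main__":
    for loaded in (True, False):
        print("kdump loaded:", loaded)
        for step in panic_flow(loaded):
            print("  ->", step)
```

The point of the model is that, in this simplified view, the kdump path skips the panic notifiers and `kmsg_dump()`, which is where pstore gets its data.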
8. Collecting (all the) data: kdump
kexec the crash kernel, using only special RAM area
Preserves the memory contents of old/broken
kernel
The new kernel collects and compresses the old
kernel memory (vmcore)
The memory dump can be inspected later
Quite standard for servers in the field
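
The "special RAM area" is reserved at boot with the `crashkernel=` kernel parameter. Below is a minimal sketch that parses only the simple `size[KMG][@offset[KMG]]` form from a command line string. The function name is an assumption, and the range syntax and the `,high` / `,low` variants are deliberately not handled.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Simple form only: size[KMG] with an optional @offset[KMG].
_SIMPLE_RE = re.compile(r"(\d+)([KMG]?)(?:@(\d+)([KMG]?))?", re.IGNORECASE)


def parse_crashkernel(cmdline):
    """Hypothetical helper: return (size_bytes, offset_bytes or None), or None.

    None is returned when the parameter is absent or uses a form this
    sketch does not handle.
    """
    for token in cmdline.split():
        if not token.startswith("crashkernel="):
            continue
        match = _SIMPLE_RE.fullmatch(token.split("=", 1)[1])
        if match is None:
            return None
        size, size_unit, offset, offset_unit = match.groups()
        size_bytes = int(size) * _UNITS[size_unit.upper()]
        if offset is None:
            return size_bytes, None
        return size_bytes, int(offset) * _UNITS[offset_unit.upper()]
    return None


if __name__ == "__main__":
    example = "root=/dev/nvme0n1p4 rw crashkernel=256M quiet"  # made-up cmdline
    print(parse_crashkernel(example))
```

On a live system the string would come from `/proc/cmdline`, and `/sys/kernel/kexec_crash_loaded` reads `1` once a crash kernel has actually been loaded into the reserved area.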
9. Advantages of kdump
(Almost) full copy of memory
Lots of data to debug
Post-mortem analysis, easy to share with others
Network (and other esoteric ways for) data saving
10. Challenges of kdump
Pre-reserved memory required (more and more lately)
Risks during:
Broken kernel shutdown
Crash kernel booting
The data collecting phase
PCI devices reset: a nightmare
No graphics on kdump, (potential) long boot delay
vmcore compatibility with new kernel features
11. Interrupt storm: a true story
PCI resets are arch-dependent (if any! PPC64 vs x86)
Real case: Intel NIC running under DPDK
Custom SW piece triggered (likely) a NIC FW bug -
interrupt storm!
kdump kernel unable to boot (stuck at APIC interrupt
enabling)
x86 early PCI infra to the rescue! [discussion]
12. Collecting (most) data: pstore
Lightweight mechanism - save the kernel log in a
persistent storage
Multiple backends: RAM, UEFI, ACPI ERST, block
device
Also many frontends: dmesg, console, ftrace(!)
More common in embedded devices (and
chromebooks!)
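
After the next boot, saved records show up as files in the pstore filesystem, usually mounted at `/sys/fs/pstore`. The sketch below groups them by frontend. It assumes file names of the form `<frontend>-<backend>-<id>` (such as `dmesg-ramoops-0`), with an optional `.enc.z` suffix, and the helper names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

_COMPRESSED_SUFFIX = ".enc.z"


def parse_record_name(name):
    """Split '<frontend>-<backend>-<id>' into its parts, or return None."""
    if name.endswith(_COMPRESSED_SUFFIX):
        name = name[: -len(_COMPRESSED_SUFFIX)]
    frontend, sep_front, rest = name.partition("-")
    backend, sep_back, record_id = rest.rpartition("-")
    if not (sep_front and sep_back and record_id.isdigit()):
        return None
    return frontend, backend, int(record_id)


def group_records(directory="/sys/fs/pstore"):
    """Return {frontend: [file names]} for the records found in directory."""
    groups = defaultdict(list)
    try:
        entries = sorted(p.name for p in Path(directory).iterdir())
    except OSError:
        return {}  # pstore not mounted, or not readable without privileges
    for name in entries:
        parsed = parse_record_name(name)
        if parsed is not None:
            groups[parsed[0]].append(name)
    return dict(groups)


if __name__ == "__main__":
    for frontend, names in group_records().items():
        print(frontend, names)
```

Reading the real directory typically requires root, which is why the sketch returns an empty result instead of failing when access is denied.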
13. Benefits of pstore
Very fast and (hopefully) transparent process
Not much memory required (ramoops)
NO memory reserved if UEFI backend is used
Less prone to failures (citation needed)
Doesn't require kexec support
Not only for crashes: console / pmsg / ftrace
14. Challenges of pstore
Cannot collect a full vmcore
dmesg presents "limited" information
(Kinda) circumvented with panic_print
For now, runs after panic notifiers - increased risk
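
`panic_print` is a bitmask (kernel parameter and `kernel.panic_print` sysctl) selecting extra information to print on panic. The sketch below decodes a mask using the bit meanings from the kernel admin guide as I understand them for kernels of this era. Treat the table as an assumption to check against the documentation of your kernel version; `decode_panic_print` is a hypothetical helper.

```python
# Bit meanings as described in the kernel's admin guide (kernel-parameters);
# verify against your kernel version, since bits have been added over time.
PANIC_PRINT_BITS = {
    0: "all tasks info",
    1: "system memory info",
    2: "timer info",
    3: "locks info (if CONFIG_LOCKDEP is on)",
    4: "ftrace buffer",
    5: "all printk messages in buffer",
    6: "all CPUs backtrace (if available in the arch)",
}


def decode_panic_print(mask):
    """Return the descriptions of the bits set in a panic_print mask."""
    return [
        description
        for bit, description in sorted(PANIC_PRINT_BITS.items())
        if mask & (1 << bit)
    ]


if __name__ == "__main__":
    for mask in (0x3, 0x7F):
        print(hex(mask), "->", decode_panic_print(mask))
```

For instance, a mask of `0x3` asks for task and memory information, which adds context to a pstore-collected dmesg at the cost of more work done in the panic path.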
16. Wait, a Linux game console? Oh yeah
Steam Deck, from Valve
CPU/APU AMD Zen 2 (custom), 4-cores/8-threads,
7" display
16 GB of RAM, 3 models of NVMe storage (64G,
256G, 512G)
17. Steam Deck
SteamOS 3: Arch-based distro with gamescope
(games) and KDE Plasma (desktop)
Sophisticated stack for games: Steam, Proton (Wine),
DXVK, VKD3D, etc
18. Presenting: kdumpst
kdumpst is an Arch Linux kdump and pstore tool
Available on AUR, supports GRUB and initcpio /
dracut
Default to pstore mode; currently only ramoops
backend (UEFI plans)
Simple / customizable: sysctls, crashkernel and log
compression
Present on SteamOS as Deck's crash collecting tool
20. More missing pieces
Graphical output during kdump
Is it possible? GPU interrupts
UEFI hinting - logo change idea
Reliable PCI reset for kdump kernels
Maintainers communication / awareness
attempt
21. Conclusion
Kernel crashes DO happen - need to be prepared
Multiple ways to react to them / trade-offs of data
collecting mechanisms
Still have lots of core things to improve
Linux gaming bumping kdump/pstore techs (unexpected)