DRBD kernel error on ganeti2031 led to kernel hang
Open, Medium, Public

Description

On ganeti2031 the kernel hung after a DRBD error. After a reboot it worked fine again. Nothing is immediately actionable, but I'm filing a task to have it on record in case it happens again (it could also have been triggered by a hardware error or similar):

Oct  9 09:46:21 ganeti2031 kernel: [3548250.074990] block drbd8: We did not send a P_BARRIER for 42748ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:46:59 ganeti2031 kernel: [3548288.985968] block drbd4: We did not send a P_BARRIER for 84640ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:01 ganeti2031 kernel: [3548291.033920] block drbd3: We did not send a P_BARRIER for 86228ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:01 ganeti2031 kernel: [3548291.033962] block drbd0: We did not send a P_BARRIER for 86248ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:04 ganeti2031 kernel: [3548293.081865] block drbd8: We did not send a P_BARRIER for 85756ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:42 ganeti2031 kernel: [3548331.996887] block drbd4: We did not send a P_BARRIER for 127652ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:45 ganeti2031 kernel: [3548334.040839] block drbd3: We did not send a P_BARRIER for 129236ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:45 ganeti2031 kernel: [3548334.040889] block drbd0: We did not send a P_BARRIER for 129256ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:47:47 ganeti2031 kernel: [3548336.088849] block drbd8: We did not send a P_BARRIER for 128764ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:48:25 ganeti2031 kernel: [3548374.999874] block drbd4: We did not send a P_BARRIER for 170656ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:48:28 ganeti2031 kernel: [3548377.047779] block drbd3: We did not send a P_BARRIER for 172244ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:48:28 ganeti2031 kernel: [3548377.047816] block drbd0: We did not send a P_BARRIER for 172264ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:48:30 ganeti2031 kernel: [3548379.095729] block drbd8: We did not send a P_BARRIER for 171772ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Oct  9 09:48:38 ganeti2031 kernel: [3548387.287573] INFO: task md2_raid5:562 blocked for more than 120 seconds.
Oct  9 09:48:38 ganeti2031 kernel: [3548387.294365]       Not tainted 5.10.0-25-amd64 #1 Debian 5.10.191-1
Oct  9 09:48:38 ganeti2031 kernel: [3548387.300728] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308737] task:md2_raid5       state:D stack:    0 pid:  562 ppid:     2 flags:0x00004000
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308741] Call Trace:
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308751]  __schedule+0x282/0x870
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308757]  schedule+0x46/0xb0
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308766]  raid5d+0x3e4/0x620 [raid456]
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308773]  ? add_wait_queue_exclusive+0x70/0x70
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308787]  md_thread+0xa8/0x160 [md_mod]
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308791]  ? add_wait_queue_exclusive+0x70/0x70
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308800]  ? md_write_inc+0x50/0x50 [md_mod]
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308811]  kthread+0x118/0x140
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308821]  ? __kthread_bind_mask+0x60/0x60
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308836]  ret_from_fork+0x1f/0x30
Oct  9 09:48:38 ganeti2031 kernel: [3548387.308869] INFO: task drbd_r_resource:4222 blocked for more than 120 seconds.
Oct  9 09:48:38 ganeti2031 kernel: [3548387.316269]       Not tainted 5.10.0-25-amd64 #1 Debian 5.10.191-1
Oct  9 09:48:38 ganeti2031 kernel: [3548387.322629] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  9 09:48:38 ganeti2031 kernel: [3548387.330638] task:drbd_r_resource state:D stack:    0 pid: 4222 ppid:     2 flags:0x00004004
Oct  9 09:48:38 ganeti2031 kernel: [3548387.330642] Call Trace:
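
For reading the P_BARRIER warnings above: the threshold in each message is ko-count (7) * timeout (60 * 0.1s) = 42 s, so the first warning on drbd8 (42748ms) fired almost as soon as the limit was crossed, and the later ones show the stall growing. A minimal Python sketch that extracts those numbers from such a line follows. The function name `parse_barrier_warning` and the returned dictionary keys are hypothetical and chosen only for illustration; the regular expression assumes the exact message wording seen in these logs.

```python
import re

# Matches the DRBD warning exactly as worded in the logs in this task.
BARRIER_RE = re.compile(
    r"block (?P<dev>drbd\d+): We did not send a P_BARRIER for (?P<waited>\d+)ms "
    r"> ko-count \((?P<ko>\d+)\) \* timeout \((?P<timeout>\d+) \* 0\.1s\)"
)


def parse_barrier_warning(line):
    """Return device, waited time and threshold (ms) for one warning line.

    Hypothetical helper: returns None if the line is not a P_BARRIER warning.
    """
    match = BARRIER_RE.search(line)
    if match is None:
        return None
    ko_count = int(match.group("ko"))
    timeout_tenths = int(match.group("timeout"))  # the log states units of 0.1s
    threshold_ms = ko_count * timeout_tenths * 100  # 0.1s = 100ms
    waited_ms = int(match.group("waited"))
    return {
        "device": match.group("dev"),
        "waited_ms": waited_ms,
        "threshold_ms": threshold_ms,
        "over_ms": waited_ms - threshold_ms,
    }


sample = (
    "Oct  9 09:46:21 ganeti2031 kernel: [3548250.074990] block drbd8: "
    "We did not send a P_BARRIER for 42748ms > ko-count (7) * timeout "
    "(60 * 0.1s); drbd kernel thread blocked?"
)
print(parse_barrier_warning(sample))
```

Running this over a whole log would show, per device, how far past the 42000 ms threshold each warning was, which makes it easier to see when the stall started.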

Event Timeline

(me too ubuntu-forum style reply)

This happened again on ganeti2028:

[Thu Jun 13 15:38:21 2024] INFO: task drbd_r_resource:1033579 blocked for more than 121 seconds.
[Thu Jun 13 15:38:21 2024]       Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[Thu Jun 13 15:38:21 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jun 13 15:38:21 2024] task:drbd_r_resource state:D stack:    0 pid:1033579 ppid:     2 flags:0x00004004
[Thu Jun 13 15:38:21 2024] Call Trace:
[Thu Jun 13 15:38:21 2024]  __schedule+0x282/0x870
[Thu Jun 13 15:38:21 2024]  schedule+0x46/0xb0
[Thu Jun 13 15:38:21 2024]  md_write_start+0x14e/0x230 [md_mod]
[Thu Jun 13 15:38:21 2024]  ? add_wait_queue_exclusive+0x70/0x70
[Thu Jun 13 15:38:21 2024]  raid5_make_request+0x85/0xb90 [raid456]
[Thu Jun 13 15:38:21 2024]  ? submit_bio_noacct+0x2c/0x420
[Thu Jun 13 15:38:21 2024]  ? add_wait_queue_exclusive+0x70/0x70
[Thu Jun 13 15:38:21 2024]  md_handle_request+0x11f/0x1b0 [md_mod]
[Thu Jun 13 15:38:21 2024]  md_submit_bio+0x8b/0x160 [md_mod]
[Thu Jun 13 15:38:21 2024]  submit_bio_noacct+0xf5/0x420
[Thu Jun 13 15:38:21 2024]  ? bio_add_page+0x62/0x90
[Thu Jun 13 15:38:21 2024]  drbd_submit_peer_request+0x18c/0x340 [drbd]
[Thu Jun 13 15:38:21 2024]  receive_Data+0x4bd/0x8c0 [drbd]
[Thu Jun 13 15:38:21 2024]  ? receive_RSDataReply+0x1f0/0x1f0 [drbd]
[Thu Jun 13 15:38:21 2024]  drbd_receiver+0x29f/0x306 [drbd]
[Thu Jun 13 15:38:21 2024]  drbd_thread_setup+0x65/0x140 [drbd]
[Thu Jun 13 15:38:21 2024]  ? drbd_destroy_connection+0x100/0x100 [drbd]
[Thu Jun 13 15:38:21 2024]  kthread+0x118/0x140
[Thu Jun 13 15:38:21 2024]  ? __kthread_bind_mask+0x60/0x60
[Thu Jun 13 15:38:21 2024]  ret_from_fork+0x1f/0x30
[Thu Jun 13 15:38:41 2024] block drbd5: We did not send a P_BARRIER for 214832ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
[Thu Jun 13 15:38:43 2024] block drbd0: We did not send a P_BARRIER for 214572ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
[Thu Jun 13 15:38:43 2024] block drbd7: We did not send a P_BARRIER for 215204ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
[Thu Jun 13 15:38:43 2024] block drbd8: We did not send a P_BARRIER for 215908ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?

Happened once more on ganeti2029 today. We're gradually moving nodes to Bookworm (the routed cluster and the magru cluster are already running it, and the next refreshes in codfw/eqiad will also be added with Bookworm right away), so hopefully the more recent kernel/DRBD addresses this bug.