Qblg002 is drained frequently

The node qblg002 is set to the drained state fairly often (every few months).

Tickets: 20220105-0193

Node status

$ scontrol show node qblg002
...
  Reason=Kill task failed [root@2022-01-05T16:45:45]

Fix with

$ scontrol update node=qblg002 state=undrain
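
Before undraining, it can help to confirm that the node was drained for the known reason and not for something new. The following is a minimal sketch, assuming the Reason= line has the format shown above; the function names drain_reason and undrain_if_kill_task_failed are hypothetical and not part of Slurm.

```shell
# Hypothetical helper: read `scontrol show node` output on stdin and print
# the drain reason without the trailing [user@timestamp] part.
drain_reason() {
    sed -n 's/^[[:space:]]*Reason=\(.*\) \[[^]]*]$/\1/p'
}

# Hypothetical wrapper: undrain the node only if the reason is the known one.
undrain_if_kill_task_failed() {
    node=$1
    reason=$(scontrol show node "$node" | drain_reason)
    if [ "$reason" = "Kill task failed" ]; then
        scontrol update node="$node" state=undrain
    else
        echo "not undraining $node: reason is '${reason:-none}'" >&2
        return 1
    fi
}
```

On a Slurm admin host this would be called as `undrain_if_kill_task_failed qblg002`. Any other drain reason is reported on stderr and left for manual inspection.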

The last process appears to be the problem; the kernel log shows namd3 blocked inside the NVIDIA driver while closing the device:

$ dmesg -T
[Wed Jan  5 15:13:26 2022] INFO: task namd3:83962 blocked for more than 120 seconds.
[Wed Jan  5 15:13:26 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jan  5 15:13:26 2022] namd3           D ffff93376973e2a0     0 83962      1 0x00000004
[Wed Jan  5 15:13:26 2022] Call Trace:
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7f1c9>] schedule+0x29/0x70
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7cb51>] schedule_timeout+0x221/0x2d0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0435b79>] ? sched_clock+0x9/0x10
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04dd425>] ? sched_clock_cpu+0x85/0xc0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7f57d>] wait_for_completion+0xfd/0x140
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04da0b0>] ? wake_up_state+0x20/0x20
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380a9d>] _raw_q_flush+0x6d/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380ac0>] ? _raw_q_flush+0x90/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380de9>] nv_kthread_q_flush+0x19/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837ee1b>] os_flush_work_queue+0x7b/0x80 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8c37a0e>] rm_disable_adapter+0x6e/0x110 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04cb1d2>] ? up+0x32/0x50
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837098e>] ? nv_shutdown_adapter+0x1e/0x140 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8370c0e>] ? nv_close_device+0x15e/0x1b0 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8370cd1>] ? nvidia_close_callback+0x71/0x150 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837313e>] ? nvidia_close+0xae/0x310 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc836e40f>] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffa064a9cc>] ? __fput+0xec/0x260
[Wed Jan  5 15:13:26 2022]  [<ffffffffa064ac2e>] ? ____fput+0xe/0x10
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04c1c0b>] ? task_work_run+0xbb/0xe0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa042cc65>] ? do_notify_resume+0xa5/0xc0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b8c23b>] ? int_signal+0x12/0x17
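
To see quickly which processes triggered hung-task warnings without reading the full call traces, the log can be filtered. This is a minimal sketch, assuming the "INFO: task NAME:PID blocked for more than" message format shown above; the function name hung_tasks is hypothetical.

```shell
# Hypothetical helper: read kernel log text on stdin and print "PID COMMAND"
# for every hung-task warning found.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\2 \1/p'
}

# Usage on the affected node (not executed here):
#   dmesg -T | hung_tasks | sort -u
# A reported PID can then be checked for state D (uninterruptible sleep):
#   ps -o pid,stat,comm -p PID
```

For the log excerpt above, this prints the PID 83962 together with the command name namd3.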

The dmesg log also contains many entries such as

[Fri Dec 10 14:47:36 2021] nvidia 0000:86:00.0: irq 205 for MSI/MSI-X

A web search for this message turns up, among others, https://bbs.archlinux.org/viewtopic.php?id=192447
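
To put a number on how many of these MSI/MSI-X entries are in the log, they can be counted. This is a minimal sketch, assuming the message format shown above; the function name count_msi_irq_lines is hypothetical.

```shell
# Hypothetical helper: read kernel log text on stdin and print the number of
# "irq N for MSI/MSI-X" lines. `|| true` keeps the exit status at 0 when
# nothing matches (grep -c still prints 0 in that case).
count_msi_irq_lines() {
    grep -c 'irq [0-9][0-9]* for MSI/MSI-X' || true
}

# Usage on the affected node (not executed here):
#   dmesg -T | count_msi_irq_lines
```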