Qblg002 is drained frequently
The node qblg002 is set to the drained state fairly often (every few months).
Tickets: 20220105-0193
Node status
$ scontrol show node qblg002
...
Reason=Kill task failed [root@2022-01-05T16:45:45]
Fix with
$ scontrol update node=qblg002 state=undrain
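The two steps above could be combined so that the node is only undrained when the drain reason really is "Kill task failed". The following is a minimal sketch, not an established procedure: the function names node_reason and undrain_if_kill_task_failed are hypothetical, and the parsing assumes the Reason= line has the same shape as in the output shown above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `scontrol show node` output on stdin and print
# the text of the Reason= field without the trailing [user@timestamp] part.
# Assumes the line looks like: Reason=<text> [<user>@<timestamp>]
node_reason() {
    sed -n 's/^.*Reason=\(.*\) \[[^]]*\]$/\1/p'
}

# Hypothetical helper: undrain the given node only if its drain reason is
# exactly "Kill task failed"; otherwise leave it alone and report why.
undrain_if_kill_task_failed() {
    local node="$1" reason
    reason="$(scontrol show node "$node" | node_reason)"
    if [ "$reason" = "Kill task failed" ]; then
        scontrol update node="$node" state=undrain
    else
        echo "not undraining $node: reason is '${reason:-none}'" >&2
        return 1
    fi
}
```

Usage on the cluster would be `undrain_if_kill_task_failed qblg002`. A node drained for any other reason is left untouched, so a different problem is not hidden by accident.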
Apparently the last process is causing problems:
$ dmesg -T
[Wed Jan 5 15:13:26 2022] INFO: task namd3:83962 blocked for more than 120 seconds.
[Wed Jan 5 15:13:26 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jan 5 15:13:26 2022] namd3 D ffff93376973e2a0 0 83962 1 0x00000004
[Wed Jan 5 15:13:26 2022] Call Trace:
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b7f1c9>] schedule+0x29/0x70
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b7cb51>] schedule_timeout+0x221/0x2d0
[Wed Jan 5 15:13:26 2022] [<ffffffffa0435b79>] ? sched_clock+0x9/0x10
[Wed Jan 5 15:13:26 2022] [<ffffffffa04dd425>] ? sched_clock_cpu+0x85/0xc0
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b7f57d>] wait_for_completion+0xfd/0x140
[Wed Jan 5 15:13:26 2022] [<ffffffffa04da0b0>] ? wake_up_state+0x20/0x20
[Wed Jan 5 15:13:26 2022] [<ffffffffc8380a9d>] _raw_q_flush+0x6d/0x90 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8380ac0>] ? _raw_q_flush+0x90/0x90 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8380de9>] nv_kthread_q_flush+0x19/0x90 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc837ee1b>] os_flush_work_queue+0x7b/0x80 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8c37a0e>] rm_disable_adapter+0x6e/0x110 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffa04cb1d2>] ? up+0x32/0x50
[Wed Jan 5 15:13:26 2022] [<ffffffffc837098e>] ? nv_shutdown_adapter+0x1e/0x140 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8370c0e>] ? nv_close_device+0x15e/0x1b0 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8370cd1>] ? nvidia_close_callback+0x71/0x150 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc837313e>] ? nvidia_close+0xae/0x310 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc836e40f>] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffa064a9cc>] ? __fput+0xec/0x260
[Wed Jan 5 15:13:26 2022] [<ffffffffa064ac2e>] ? ____fput+0xe/0x10
[Wed Jan 5 15:13:26 2022] [<ffffffffa04c1c0b>] ? task_work_run+0xbb/0xe0
[Wed Jan 5 15:13:26 2022] [<ffffffffa042cc65>] ? do_notify_resume+0xa5/0xc0
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b8c23b>] ? int_signal+0x12/0x17
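To find such reports quickly the next time the node is drained, the hung-task lines can be filtered out of the kernel log. The sketch below assumes the message format shown in the trace above. The function names hung_tasks and d_state_procs are hypothetical. The second function lists processes that are currently in state D (uninterruptible sleep), the state reported for namd3 in the trace.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `dmesg -T` output on stdin and print one
# "<task name> <pid>" line for each distinct hung-task report.
# Assumes the kernel message format
#   INFO: task <name>:<pid> blocked for more than <n> seconds.
hung_tasks() {
    sed -n 's/^.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*$/\1 \2/p' | sort -u
}

# Hypothetical helper: list processes whose state starts with D
# (uninterruptible sleep), skipping the ps header line.
d_state_procs() {
    ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
}
```

On the node this would be run as `dmesg -T | hung_tasks`, followed by `d_state_procs` to see whether the process is still stuck.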
The dmesg log also contains many entries like
[Fri Dec 10 14:47:36 2021] nvidia 0000:86:00.0: irq 205 for MSI/MSI-X
A Google search for this message turns up, among other things, https://bbs.archlinux.org/viewtopic.php?id=192447
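To put a number on "many", the MSI/MSI-X lines could be counted per PCI device. This is only a sketch that assumes the line format of the entry shown above. The function name msi_irq_counts is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read dmesg output on stdin and print
# "<pci address> <count>" for every nvidia device that logged
# "irq <n> for MSI/MSI-X" lines.
msi_irq_counts() {
    sed -n 's/^.*nvidia \([0-9a-fA-F:.]*\): irq [0-9][0-9]* for MSI\/MSI-X.*$/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

On the node this would be run as `dmesg -T | msi_irq_counts`.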