Difference between revisions of "Qblg002 is drained frequently"
From HPC users
Latest revision as of 09:52, 6 January 2022
The node qblg002 is set to the drained state fairly often (every few months).
Tickets: 20220105-0193
Node status
$ scontrol show node qblg002
...
Reason=Kill task failed [root@2022-01-05T16:45:45]
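To pull just the drain reason out of the node status, a small filter like the following could be used. This is a minimal sketch: `node_reason` is a hypothetical helper name, and it assumes the `Reason=` field sits on its own line, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print only the Reason= field
# from `scontrol show node <name>` output supplied on stdin.
node_reason() {
    # Assumption: Reason= runs to the end of its line, so drop everything
    # up to and including "Reason=" and print what is left.
    sed -n 's/^.*Reason=//p'
}

# Usage on a Slurm host (not executed here):
#   scontrol show node qblg002 | node_reason
```

For an overview of all nodes, `sinfo -R` lists drained and down nodes together with their reasons.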
Fix with
$ scontrol update node=qblg002 state=undrain
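The drain reason "Kill task failed" suggests that a process could not be terminated, so it may be worth checking the node for processes stuck in uninterruptible sleep (state D) before undraining it. The sketch below is an assumption about how such a check could look; `d_state_procs` is a hypothetical helper that filters `ps -eo pid,stat,comm` output.

```shell
#!/bin/sh
# Hypothetical helper: from `ps -eo pid,stat,comm` output on stdin,
# print "PID COMMAND" for processes whose state starts with D
# (uninterruptible sleep).
d_state_procs() {
    # Require a numeric first column so the header line is skipped.
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^D/ { print $1, $3 }'
}

# Usage on the affected node (not executed here):
#   ps -eo pid,stat,comm | d_state_procs
```

If this prints anything, the stuck process is still present and the node may drain again.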
Apparently the job's last process is causing the problem; it hangs in state D inside the nvidia driver while the device is being closed:
$ dmesg -T
[Wed Jan 5 15:13:26 2022] INFO: task namd3:83962 blocked for more than 120 seconds.
[Wed Jan 5 15:13:26 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jan 5 15:13:26 2022] namd3 D ffff93376973e2a0 0 83962 1 0x00000004
[Wed Jan 5 15:13:26 2022] Call Trace:
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b7f1c9>] schedule+0x29/0x70
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b7cb51>] schedule_timeout+0x221/0x2d0
[Wed Jan 5 15:13:26 2022] [<ffffffffa0435b79>] ? sched_clock+0x9/0x10
[Wed Jan 5 15:13:26 2022] [<ffffffffa04dd425>] ? sched_clock_cpu+0x85/0xc0
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b7f57d>] wait_for_completion+0xfd/0x140
[Wed Jan 5 15:13:26 2022] [<ffffffffa04da0b0>] ? wake_up_state+0x20/0x20
[Wed Jan 5 15:13:26 2022] [<ffffffffc8380a9d>] _raw_q_flush+0x6d/0x90 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8380ac0>] ? _raw_q_flush+0x90/0x90 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8380de9>] nv_kthread_q_flush+0x19/0x90 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc837ee1b>] os_flush_work_queue+0x7b/0x80 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8c37a0e>] rm_disable_adapter+0x6e/0x110 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffa04cb1d2>] ? up+0x32/0x50
[Wed Jan 5 15:13:26 2022] [<ffffffffc837098e>] ? nv_shutdown_adapter+0x1e/0x140 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8370c0e>] ? nv_close_device+0x15e/0x1b0 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc8370cd1>] ? nvidia_close_callback+0x71/0x150 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc837313e>] ? nvidia_close+0xae/0x310 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffc836e40f>] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[Wed Jan 5 15:13:26 2022] [<ffffffffa064a9cc>] ? __fput+0xec/0x260
[Wed Jan 5 15:13:26 2022] [<ffffffffa064ac2e>] ? ____fput+0xe/0x10
[Wed Jan 5 15:13:26 2022] [<ffffffffa04c1c0b>] ? task_work_run+0xbb/0xe0
[Wed Jan 5 15:13:26 2022] [<ffffffffa042cc65>] ? do_notify_resume+0xa5/0xc0
[Wed Jan 5 15:13:26 2022] [<ffffffffa0b8c23b>] ? int_signal+0x12/0x17
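To find out quickly which tasks the kernel reported as hung, the "blocked for more than" lines can be filtered out of the log. A minimal sketch, assuming the message format shown in the trace; `hung_tasks` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: from `dmesg -T` output on stdin, print the
# "name:pid" part of every hung-task warning, one unique entry per line.
hung_tasks() {
    sed -n 's/^.*INFO: task \(.*\) blocked for more than [0-9]* seconds.*$/\1/p' | sort -u
}

# Usage on the affected node (not executed here):
#   dmesg -T | hung_tasks
```

Applied to the trace for this incident, this would report `namd3:83962`.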
The dmesg log also contains many entries like
[Fri Dec 10 14:47:36 2021] nvidia 0000:86:00.0: irq 205 for MSI/MSI-X
A Google search for this message turns up, among others, https://bbs.archlinux.org/viewtopic.php?id=192447
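To get a feel for how many of these MSI/MSI-X entries each device produces, they can be counted per PCI address. This is a sketch under the assumption that all entries follow the format of the line above; `msi_counts` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: from `dmesg -T` output on stdin, count the
# "irq N for MSI/MSI-X" messages of the nvidia driver per PCI address.
# Output format: "<pci-address> <count>", one line per device.
msi_counts() {
    sed -n 's/^.*nvidia \([0-9a-f:.]*\): irq [0-9]* for MSI\/MSI-X.*$/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the affected node (not executed here):
#   dmesg -T | msi_counts
```

Comparing the timestamps of these entries with job start and end times may show whether they simply accompany each opening of the GPU device.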