Qblg002 is drained frequently

From HPC users
Latest revision as of 09:52, 6 January 2022

The node qblg002 is put into the drained state fairly often (every few months).

Tickets: 20220105-0193

Node status:

$ scontrol show node qblg002
...
  Reason=Kill task failed [root@2022-01-05T16:45:45]

To fix this, return the node to service with

$ scontrol update node=qblg002 state=undrain
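Slurm sets the reason "Kill task failed" when slurmd could not terminate a job's processes in time, so before undraining it may be worth checking whether a process is still stuck in uninterruptible sleep (state D). The following is only a minimal sketch: the function name dstate_procs is made up for illustration, and it assumes the input has the three-column layout produced by ps -eo pid,stat,comm.

```shell
# Hypothetical helper (not part of Slurm): reads `ps -eo pid,stat,comm`
# output on stdin and prints "PID COMMAND" for every process whose state
# starts with D (uninterruptible sleep).
dstate_procs() {
    # The header line is skipped automatically because "STAT" does not
    # start with D. Command names containing spaces are cut at the first
    # space in this simple sketch.
    awk '$2 ~ /^D/ { print $1, $3 }'
}
```

On the node itself this could be used as ps -eo pid,stat,comm | dstate_procs. If the list is empty, the undrain command above should be safe to run; if a process is still listed, the node will probably drain again or need a reboot.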

The last process of the job appears to be the cause. The kernel log shows a namd3 task blocked in the nvidia driver while closing the GPU device:

$ dmesg -T
[Wed Jan  5 15:13:26 2022] INFO: task namd3:83962 blocked for more than 120 seconds.
[Wed Jan  5 15:13:26 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jan  5 15:13:26 2022] namd3           D ffff93376973e2a0     0 83962      1 0x00000004
[Wed Jan  5 15:13:26 2022] Call Trace:
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7f1c9>] schedule+0x29/0x70
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7cb51>] schedule_timeout+0x221/0x2d0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0435b79>] ? sched_clock+0x9/0x10
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04dd425>] ? sched_clock_cpu+0x85/0xc0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7f57d>] wait_for_completion+0xfd/0x140
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04da0b0>] ? wake_up_state+0x20/0x20
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380a9d>] _raw_q_flush+0x6d/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380ac0>] ? _raw_q_flush+0x90/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380de9>] nv_kthread_q_flush+0x19/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837ee1b>] os_flush_work_queue+0x7b/0x80 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8c37a0e>] rm_disable_adapter+0x6e/0x110 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04cb1d2>] ? up+0x32/0x50
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837098e>] ? nv_shutdown_adapter+0x1e/0x140 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8370c0e>] ? nv_close_device+0x15e/0x1b0 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8370cd1>] ? nvidia_close_callback+0x71/0x150 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837313e>] ? nvidia_close+0xae/0x310 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc836e40f>] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffa064a9cc>] ? __fput+0xec/0x260
[Wed Jan  5 15:13:26 2022]  [<ffffffffa064ac2e>] ? ____fput+0xe/0x10
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04c1c0b>] ? task_work_run+0xbb/0xe0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa042cc65>] ? do_notify_resume+0xa5/0xc0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b8c23b>] ? int_signal+0x12/0x17
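To find such reports in a long kernel log without scrolling through the call traces, the "blocked for more than" lines can be filtered out. This is a rough sketch that relies on the message format shown above; the function name hung_tasks is an assumption for illustration only.

```shell
# Hypothetical helper: reads dmesg output on stdin and prints
# "TASK PID SECONDS" for every hung-task report found.
hung_tasks() {
    # Matches lines such as
    #   [..] INFO: task namd3:83962 blocked for more than 120 seconds.
    # The greedy task-name group also copes with names that contain colons.
    sed -n 's|^.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*|\1 \2 \3|p'
}
```

A possible invocation on the node is dmesg -T | hung_tasks. The PIDs it prints can then be compared with the processes of the job that triggered the drain.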

The dmesg log also contains many entries like

[Fri Dec 10 14:47:36 2021] nvidia 0000:86:00.0: irq 205 for MSI/MSI-X

A Google search for this message turns up, among other results, https://bbs.archlinux.org/viewtopic.php?id=192447
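To get a feel for how many of these MSI/MSI-X messages each GPU produces, they can be counted per PCI address. This is a sketch only: the function name msi_counts is invented for illustration, and the pattern assumes the exact message format of the example line.

```shell
# Hypothetical helper: reads dmesg output on stdin and prints
# "PCI_ADDRESS COUNT" for the nvidia "irq N for MSI/MSI-X" messages.
msi_counts() {
    # Extract the PCI address from each matching line, then count
    # how often each address occurs.
    sed -n 's|^\[[^]]*] *nvidia \([0-9a-f:.]*\): irq [0-9]* for MSI/MSI-X.*|\1|p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be run as dmesg -T | msi_counts. A count that keeps growing for a single device would suggest that this GPU is being opened and closed repeatedly.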