Qblg002 is drained frequently

From HPC users

Revision as of 10:34, 6 January 2022

The node qblg002 is put into the drained state fairly often (every few months).

Tickets: 20220105-0193

Node status:

$ scontrol show node qblg002
...
  Reason=Kill task failed [root@2022-01-05T16:45:45]
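
To pull only the drain reason out of that output, a small filter can help. The following is a minimal sketch; the function name drain_reason is hypothetical, and it assumes the Reason= field sits on a line of its own, as in the output above.

```shell
# Print the drain reason from `scontrol show node` output read on stdin.
# drain_reason is a hypothetical helper name, not part of Slurm.
drain_reason() {
    # Strip the leading whitespace and the "Reason=" key, print the rest.
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Usage on a host with Slurm:
#   scontrol show node qblg002 | drain_reason
```

On a live cluster, `sinfo -R` should give a similar overview of all down or drained nodes together with their reasons.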

Fix with:

$ scontrol update node=qblg002 state=undrain
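
Before undraining, it may be worth checking whether the stuck process is still present on the node. This is an assumption, not part of the original procedure: a task in uninterruptible sleep (state D) cannot be killed and may still hold the GPU. The sketch below filters ps output for such processes; d_state_procs is a hypothetical helper name.

```shell
# Print "<pid> <command>" for every process in state D (uninterruptible sleep).
# Expects lines of the form "<pid> <stat> <comm>" on stdin.
d_state_procs() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage on the node (separate -o options suppress the column headers):
#   ps -e -o pid= -o stat= -o comm= | d_state_procs
```

If this prints nothing, no process is currently stuck in state D.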

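Apparently the last process on the node is causing the problem. The kernel log shows namd3 stuck in uninterruptible sleep (state D) inside the nvidia driver while closing the device, which matches the "Kill task failed" reason: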

$ dmesg -T
[Wed Jan  5 15:13:26 2022] INFO: task namd3:83962 blocked for more than 120 seconds.
[Wed Jan  5 15:13:26 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jan  5 15:13:26 2022] namd3           D ffff93376973e2a0     0 83962      1 0x00000004
[Wed Jan  5 15:13:26 2022] Call Trace:
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7f1c9>] schedule+0x29/0x70
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7cb51>] schedule_timeout+0x221/0x2d0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0435b79>] ? sched_clock+0x9/0x10
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04dd425>] ? sched_clock_cpu+0x85/0xc0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b7f57d>] wait_for_completion+0xfd/0x140
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04da0b0>] ? wake_up_state+0x20/0x20
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380a9d>] _raw_q_flush+0x6d/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380ac0>] ? _raw_q_flush+0x90/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8380de9>] nv_kthread_q_flush+0x19/0x90 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837ee1b>] os_flush_work_queue+0x7b/0x80 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8c37a0e>] rm_disable_adapter+0x6e/0x110 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04cb1d2>] ? up+0x32/0x50
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837098e>] ? nv_shutdown_adapter+0x1e/0x140 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8370c0e>] ? nv_close_device+0x15e/0x1b0 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc8370cd1>] ? nvidia_close_callback+0x71/0x150 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc837313e>] ? nvidia_close+0xae/0x310 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffc836e40f>] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[Wed Jan  5 15:13:26 2022]  [<ffffffffa064a9cc>] ? __fput+0xec/0x260
[Wed Jan  5 15:13:26 2022]  [<ffffffffa064ac2e>] ? ____fput+0xe/0x10
[Wed Jan  5 15:13:26 2022]  [<ffffffffa04c1c0b>] ? task_work_run+0xbb/0xe0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa042cc65>] ? do_notify_resume+0xa5/0xc0
[Wed Jan  5 15:13:26 2022]  [<ffffffffa0b8c23b>] ? int_signal+0x12/0x17
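
To see quickly which tasks the kernel has reported as hung, the "blocked for more than" lines can be reduced to a command name and PID. This is a minimal sketch that assumes the message format shown above; hung_tasks is a hypothetical helper name.

```shell
# Print "<command> <pid>" for every hung-task warning read from stdin.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\1 \2/p'
}

# Usage on the node:
#   dmesg -T | hung_tasks
```

Lines that are not hung-task warnings, such as the call trace, are dropped.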