오늘은 PSOD가 발생할 수 있는 이유 중 PCPU의 Heartbeat Timeout으로 인해 발생하는 사례에 대해서 알아보도록 하겠습니다.
VMKernel의 경우, PCPU가 49초 동안 Heartbeat을 응답하지 않으면 49초 Timeout을 감지한 PCPU에서 49초 동안 응답이 없었던 PCPU로 IPI(Inter Processor Interrupt)를 통해 NMI를 수동으로 발생시킵니다.
vmkernel.log를 살펴보면 이러한 경우로 PSOD가 발생하는 경우, 다음과 같은 로그가 기록됩니다.
2023-03-02T10:33:30.154Z cpu60:2098379)@BlueScreen: NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0xf12662(0x420011c00000):0x80:0xf48] (Src 0x1, CPU60) |
이러한 경우가 발생할 수 있는 상황으로 2가지 경우를 생각해 볼 수 있습니다.
1) %rflags register에서 Interrupt Flag를 제거해서 Interrupt를 disable 한 경우,
2) Interrupt Flag는 설정되어 있지만, preemption이 disable 되어 있는 경우 : VMKernel에서는 중요 작업 수행 시에, preemption을 disable 해서 Bottom Half Handler가 실행되지 않도록 합니다.
따라서, Interrupt가 disable 되어 있거나, Preemption이 disable 되어 있는 경우에 PCPU가 Heartbeat을 못 받는 상황이 발생할 수 있습니다.
NMI를 받아 PSOD가 발생했던 PCPU에서 실행 중인 Thread의 Call Stack은 다음과 같습니다.
#0 Panic_WithBacktrace #1 0x0000420011cfbb55 in NMIHandleBtOrHaltRequest #2 NMIHandleIPISource #3 NMI_Interrupt #4 0x0000420011d53393 in IDTNMIWork #5 0x0000420011d546dd in Int2_NMI #6 0x0000420011d4e068 in gate_entry () #7 0x0000420012b12662 in bnxtnet_poll () #8 0x0000420011ea67bc in NetPoll_PollPkts #9 NetPollWorldCallback #10 0x0000420011fb33e2 in CpuSched_StartWorld #11 0x0000420011cc4b50 in ?? () #12 0x0000000000000000 in ?? () |
이러한 상황이 발생할 수 있는 경우를 위에서 2가지 설명드렸습니다.
본 케이스의 경우에는 NetPollWorldCallback 이라는 함수가 실행될 때 실제로 CPU Schedule의 Preemption을 Disable 하는 작업이 수행됩니다.
따라서, 위 Thread가 실행 중이었던 작업이 빠르게 처리가 완료되어야 했습니다.
하지만, 이 작업이 계속해서 지연되면서 PCPU의 Heartbeat 응답을 주기 위한 Timer Handler가 계속해서 동작하지 못했던 것입니다.
문제 발생 시점 이전의 vmkernel.log를 살펴보면 다음과 같이 physical nic관련 메시지가 반복적으로 기록되어 있었습니다.
2023-03-02T09:45:53.693Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (6500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:45:54.193Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (7000ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:45:54.696Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (7500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:45:55.198Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (8000ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:45:55.700Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (8500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:12.584Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (5000ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:13.084Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (5500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:25.231Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (7500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:25.733Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (8000ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:26.235Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (8500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:26.737Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (9000ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:27.239Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (9500ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:27.741Z cpu60:2098452)bnxtnet: cmd_cmpl_wait:1187: [vmnic9 : 0x4520cad5e000] Continue waiting (10000ms) for queue flush response in (fp 0) with poll = 0
2023-03-02T09:46:27.741Z cpu60:2098452)WARNING: bnxtnet: cmd_cmpl_wait:1193: [vmnic9 : 0x4520cad5e000] Timeout (10000ms) waiting for queue flush response in (fp 0) with poll = 0
2023-03-02T09:47:04.868Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6620: vmnic8 : scheduler(0x4310abb6bb90)/device(0x4307194228c0) 1/1 lock up [stopped=0]:
2023-03-02T09:47:04.868Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6627: detected at 27707001 while last xmit at 27701001 and 68296 bytes in flight [window 86460 bytes]
2023-03-02T09:47:04.868Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6632: and last enqueued/dequeued at 27701001/27701001 [stress 0]
2023-03-02T09:47:04.868Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6635: with 511 pkts inflight
2023-03-02T09:47:04.868Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6651: vmnic8: packets completion seems stuck, issuing reset
2023-03-02T09:47:49.870Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6620: vmnic8 : scheduler(0x4310abb6bb90)/device(0x4307194228c0) 1/0 lock up [stopped=0]:
2023-03-02T09:47:49.870Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6627: detected at 27752001 while last xmit at 27756003 and 68296 bytes in flight [window 86460 bytes]
2023-03-02T09:47:49.870Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6632: and last enqueued/dequeued at 27746002/27746002 [stress 0]
2023-03-02T09:47:49.870Z cpu60:2108535)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6635: with 511 pkts inflight
2023-03-02T09:54:55.963Z cpu60:2116994)NetSched: 725: vmnic9-5-tx: worldID = 2116994 exits
2023-03-02T09:54:55.964Z cpu60:2117027)NetSched: 725: vmnic9-37-tx: worldID = 2117027 exits
2023-03-02T09:54:56.907Z cpu60:2117297)NetSched: 725: vmnic8-55-tx: worldID = 2117297 exits
2023-03-02T09:59:12.894Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6620: vmnic9 : scheduler(0x4310aba2e910)/device(0x430719431080) 0/1 lock up [stopped=0]:
2023-03-02T09:59:12.894Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6627: detected at 28435001 while last xmit at 28429001 and 3640 bytes in flight [window 86460 bytes]
2023-03-02T09:59:12.894Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6632: and last enqueued/dequeued at 0/0 [stress 0]
2023-03-02T09:59:12.894Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6635: with 28 pkts inflight
2023-03-02T09:59:12.894Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6651: vmnic9: packets completion seems stuck, issuing reset
2023-03-02T09:59:55.679Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_rx_pkt:2773: [vmnic8 : 0x4520ca9e0000] rxq buff is NULL, for fp->id[0], rx_ring->prod = 0x0, rx_ring->agg = 0x0, cmpl ring cons = 0x0
2023-03-02T09:59:55.679Z cpu60:2098379)bnxtnet: bnxtnet_poll:3847: [vmnic8 : 0x4520ca9e0000] fp NULL or reset fid:0 rc:195887105
2023-03-02T09:59:55.682Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_rx_pkt:2773: [vmnic8 : 0x4520ca9e0000] rxq buff is NULL, for fp->id[0], rx_ring->prod = 0x0, rx_ring->agg = 0x0, cmpl ring cons = 0x0
2023-03-02T09:59:55.682Z cpu60:2098379)bnxtnet: bnxtnet_poll:3847: [vmnic8 : 0x4520ca9e0000] fp NULL or reset fid:0 rc:195887105
2023-03-02T10:00:15.903Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24158, 24172)
2023-03-02T10:00:36.202Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24218, 24232)
2023-03-02T10:00:56.445Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24278, 24292)
2023-03-02T10:01:16.655Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24338, 24352)
2023-03-02T10:01:36.956Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24398, 24412)
2023-03-02T10:01:57.180Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24458, 24472)
2023-03-02T10:02:17.406Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24518, 24532)
2023-03-02T10:02:37.713Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24578, 24592)
2023-03-02T10:02:57.971Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24638, 24652)
2023-03-02T10:03:18.185Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24698, 24712)
2023-03-02T10:03:58.841Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24854, 24868)
2023-03-02T10:04:19.143Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24914, 24928)
2023-03-02T10:04:39.401Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (24974, 24988)
2023-03-02T10:05:20.067Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (25130, 25144)
2023-03-02T10:05:20.269Z cpu60:2117721)NetSched: 725: vmnic8-15-tx: worldID = 2117721 exits
2023-03-02T10:05:20.270Z cpu60:2117749)NetSched: 725: vmnic8-43-tx: worldID = 2117749 exits
2023-03-02T10:05:20.271Z cpu60:2117753)NetSched: 725: vmnic8-47-tx: worldID = 2117753 exits
2023-03-02T10:05:20.271Z cpu60:2117756)NetSched: 725: vmnic8-50-tx: worldID = 2117756 exits
2023-03-02T10:05:20.271Z cpu60:2117759)NetSched: 725: vmnic8-53-tx: worldID = 2117759 exits
2023-03-02T10:05:20.271Z cpu60:2117762)NetSched: 725: vmnic8-56-tx: worldID = 2117762 exits
2023-03-02T10:05:20.271Z cpu60:2117765)NetSched: 725: vmnic8-59-tx: worldID = 2117765 exits
2023-03-02T10:05:20.271Z cpu60:2117769)NetSched: 725: vmnic8-63-tx: worldID = 2117769 exits
2023-03-02T10:05:20.271Z cpu60:2117774)NetSched: 725: vmnic8-68-tx: worldID = 2117774 exits
2023-03-02T10:05:40.352Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (25166, 25180)
2023-03-02T10:05:43.909Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6620: vmnic9 : scheduler(0x4310aba2e910)/device(0x430719431080) 0/1 lock up [stopped=0]:
2023-03-02T10:05:43.909Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6627: detected at 28826002 while last xmit at 28820296 and 8997 bytes in flight [window 86460 bytes]
2023-03-02T10:05:43.909Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6632: and last enqueued/dequeued at 0/0 [stress 0]
2023-03-02T10:05:43.909Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6635: with 69 pkts inflight
2023-03-02T10:05:43.909Z cpu60:2117142)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:6651: vmnic9: packets completion seems stuck, issuing reset
2023-03-02T10:06:00.667Z cpu60:2098379)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic8 : 0x4520ca9e0000] cmd sequence number not match! (25202, 25216)
2023-03-02T10:06:00.836Z cpu60:2117796)NetSched: 725: vmnic8-3-tx: worldID = 2117796 exits
2023-03-02T10:16:46.805Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27102, 27113)
2023-03-02T10:17:17.426Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27191, 27202)
2023-03-02T10:17:37.726Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27240, 27251)
2023-03-02T10:17:57.958Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27289, 27300)
2023-03-02T10:18:18.244Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27338, 27349)
2023-03-02T10:18:38.462Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27387, 27398)
2023-03-02T10:18:58.708Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27436, 27447)
2023-03-02T10:19:39.405Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27565, 27576)
2023-03-02T10:19:59.657Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27614, 27625)
2023-03-02T10:20:20.117Z cpu60:2098455)WARNING: bnxtnet: bnxtnet_process_cmd_cmpl:2933: [vmnic9 : 0x4520cad5e000] cmd sequence number not match! (27697, 27708)
2023-03-02T10:32:47.969Z cpu60:2098379)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0xf10151(0x420011c00000):0x4520ca9e0000:0xf48] (Src 0x1, CPU60)
2023-03-02T10:32:47.969Z cpu60:2098379)0x45392131bdd0:[0x420012b10150]bnxtnet_poll_work@com.broadcom.roce#1+0x135 stack: 0x88000026ebd70000
2023-03-02T10:32:47.969Z cpu60:2098379)0x45392131bf10:[0x420012b12614]bnxtnet_poll@com.broadcom.roce#1+0x75 stack: 0x8000000000000001
2023-03-02T10:32:47.969Z cpu60:2098379)0x45392131bf70:[0x420011ea67bb]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x37
2023-03-02T10:32:47.969Z cpu60:2098379)0x45392131bfe0:[0x420011fb33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2023-03-02T10:32:47.969Z cpu60:2098379)0x45392131c000:[0x420011cc4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2023-03-02T10:33:01.969Z cpu60:2098379)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0xf10151(0x420011c00000):0x4520ca9e0000:0xf48] (Src 0x1, CPU60)
2023-03-02T10:33:01.969Z cpu60:2098379)0x45392131bdd0:[0x420012b10150]bnxtnet_poll_work@com.broadcom.roce#1+0x135 stack: 0x88000026ebd70000
2023-03-02T10:33:01.969Z cpu60:2098379)0x45392131bf10:[0x420012b12614]bnxtnet_poll@com.broadcom.roce#1+0x75 stack: 0x8000000000000001
2023-03-02T10:33:01.969Z cpu60:2098379)0x45392131bf70:[0x420011ea67bb]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x37
2023-03-02T10:33:01.969Z cpu60:2098379)0x45392131bfe0:[0x420011fb33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2023-03-02T10:33:01.969Z cpu60:2098379)0x45392131c000:[0x420011cc4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2023-03-02T10:33:29.969Z cpu60:2098379)World: 3072: PRDA 0x42004f000000 ss 0x0 ds 0xf50 es 0xf50 fs 0xf50 gs 0xf50
2023-03-02T10:33:29.969Z cpu60:2098379)World: 3074: TR 0xf58 GDT 0x453880078000 (0xf77) IDT 0x420011d50000 (0xfff)
2023-03-02T10:33:29.969Z cpu60:2098379)World: 3075: CR0 0x8001003d CR3 0x4f0c7000 CR4 0x10216c
2023-03-02T10:33:30.023Z cpu60:2098379)Panic: 767: Saved backtrace: pcpu 60 Heartbeat NMI
2023-03-02T10:33:30.023Z cpu60:2098379)pcpu 60 Heartbeat NMI: 0x45392131bf10:[0x420012b12661]bnxtnet_poll@com.broadcom.roce#1+0xc2 stack: 0x8000000000000001, 0x0, 0x1, 0x4301b6c611e0, 0x1
2023-03-02T10:33:30.023Z cpu60:2098379)pcpu 60 Heartbeat NMI: 0x45392131bf70:[0x420011ea67bb]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x37, 0x420012b125a0, 0x420011ea67ae, 0x0, 0x0
2023-03-02T10:33:30.023Z cpu60:2098379)pcpu 60 Heartbeat NMI: 0x45392131bfe0:[0x420011fb33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0, 0x420011cc4b50, 0x0, 0x0, 0x0
2023-03-02T10:33:30.023Z cpu60:2098379)pcpu 60 Heartbeat NMI: 0x45392131c000:[0x420011cc4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
2023-03-02T10:33:30.056Z cpu60:2098379)VMware ESXi 7.0.3 [Releasebuild-20036589 x86_64]
|
즉, Network Packet 수신 처리를 담당하는 Thread에서 Physical NIC 문제로 인하여 Uplink Driver의 함수가 짧은 시간 내에 완료되지 못하였고 해당 기간 동안 PCPU Heartbeat 응답이 없는 것을 인지한 다른 PCPU에서 PSOD 발생을 위해 문제가 발생하고 있던 PCPU에 NMI를 전달한 것입니다.
'Compute' 카테고리의 다른 글
How vMotion Works (0) | 2023.04.16 |
---|---|
esxtop (0) | 2023.04.05 |
Necessary logs in case of system or vm hang (0) | 2023.03.16 |
Config Store Database and Schema List (0) | 2023.03.02 |
FDM Overview (0) | 2023.02.26 |