NSX의 Transport인 ESXi와 Edge Node에는 System Health를 Report하기 위해 NSX-SHA(System Health Agent) 서비스가 위치합니다.
해당 서비스는 각종 State 정보에 대한 변경이 있는 경우 NSX Manager 쪽으로 이를 Report하는 것으로 확인되는데, 간혹 실제로 문제가 없는 상황이지만 Alarm을 전달하는 경우가 있습니다.
Hypervisor Level의 이슈로 인해 Edge VM내의 NSX-SHA 서비스가 영향받고 결국 false-positive alarm까지 발생시키는 사례에 대해서 알아보겠습니다.
[Symptom]
eth0 NIC down 알람표시지만 edge 에서는 up상태
/var/log/syslog 확인시 down 없음. 2일동안 지속
[Troubleshooting Notes]
1. NSX 버전 확인
$ cat ./edge/nsx-agent-state |grep version "config_version": { "dp_config_version": 55, "dp_config_version_acked": 54 "version": "3.2.3.0.1.21787261" "config_version": "7", |
2. NSX Manager에서 Alarm Event 확인
>> 특정 node에서만 확인, 최초 발생 시점이 2024-03-11T09:57:50.733Z
$ grep "edge_nic_link_status" syslog* | grep eth0 | grep b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b syslog.1:2024-03-12T00:55:19.690Z ### NSX 69617 MONITORING [nsx@6876 alarmId="15a279b7-32c8-440a-939f-82b647f97e9a" alarmState="OPEN" comp="nsx-manager" entId="b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b" errorCode="MP701099" eventFeatureName="edge_health" eventSev="CRITICAL" eventState="On" eventType="edge_nic_link_status_down" level="FATAL" nodeId="b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b" subcomp="monitoring"] Edge node NIC eth0 link is down. syslog.1:2024-03-12T01:05:19.756Z ### NSX 69617 MONITORING [nsx@6876 alarmId="d93c358e-2025-4e2d-8b9f-189e8af8e23b" alarmState="OPEN" comp="nsx-manager" entId="b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b" errorCode="MP701099" eventFeatureName="edge_health" eventSev="CRITICAL" eventState="On" eventType="edge_nic_link_status_down" level="FATAL" nodeId="b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b" subcomp="monitoring"] Edge node NIC eth0 link is down. syslog.4:2024-03-11T09:57:50.733Z ### NSX 69617 MONITORING [nsx@6876 alarmId="343360a9-b265-4879-9d2d-a4e33a410754" alarmState="OPEN" comp="nsx-manager" entId="b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b" errorCode="MP701099" eventFeatureName="edge_health" eventSev="CRITICAL" eventState="On" eventType="edge_nic_link_status_down" level="FATAL" nodeId="b72ab06d-fceb-4dd8-87ba-6e9a9fc71f4b" subcomp="monitoring"] Edge node NIC eth0 link is down. |
3. Edge Node에서 Event 확인
>> eventState="On"은 있지만, eventState="Off"는 없음
$ grep "edge_nic_link_status_down" syslog* syslog.1:2024-03-11T09:57:45.704Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="CRITICAL" eventFeatureName="edge_health" eventType="edge_nic_link_status_down" eventSev="critical" eventState="On"] Edge node NIC eth0 link is down. |
4. 문제 시점 주변에 Edge의 /var/log/syslog에 다음과 같은 로그 기록
>> System Health Agent인 nsx-sha에 의한 "sudo cat /sys/class/net/eth0/operstate" 명령어가 Timeout req may hanging 메시지로 인해 실패한 것으로 확인
>> 또한, nsx-sha가 사용하는 NSX-RPC(Remote Procedure Call)도 Broken 로그가 기록되어 있음
$ grep "2024-03-11.*operstate" syslog* syslog.1:2024-03-11T09:48:23.148Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Req timeout, waiting for 36.31458378955722 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13663975.922738194} syslog.1:2024-03-11T09:49:23.534Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Timeout req may hanging, waiting for 96.7007926274091 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13663975.922738194} syslog.1:2024-03-11T09:55:25.735Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Timeout req may hanging, waiting for 458.90203033946455 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664338.149291515} syslog.1:2024-03-11T09:55:27.639Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Received resp for a timeout req, waiting for 460.8056724201888 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664398.510184744}, {'seq': 1518, 'type': 0, 'executor': 0, 'timestamp': 13663939.608435009, 'execute_time': 460.79502287879586, 'output': b'up\n', 'error': 'Request timeout when waiting for response'} syslog.1:2024-03-11T09:55:27.640Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING"] Failed to run command: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664398.510184744} with error Request timeout when waiting for response $ grep "2024-03-11.*Timeout req may hanging" syslog* syslog.1:2024-03-11T09:49:23.534Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Timeout req may hanging, waiting for 96.7007926274091 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13663975.922738194} syslog.1:2024-03-11T09:55:25.735Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Timeout req may hanging, waiting for 458.90203033946455 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664338.149291515} $ grep "RPC connection is broken" syslog* syslog:2024-03-12T04:23:21.150Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING"] RPC connection is broken syslog:2024-03-12T04:23:21.187Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] RPC connection is broken syslog:2024-03-12T04:23:26.160Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken syslog:2024-03-12T04:23:26.195Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken syslog:2024-03-12T05:14:12.319Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING"] RPC connection is broken syslog:2024-03-12T05:14:12.321Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] RPC connection is broken syslog:2024-03-12T05:14:17.322Z ### NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken syslog:2024-03-12T05:15:11.416Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] REPEATS: 3 repeats in 59 sec: RPC connection is broken syslog.1:2024-03-11T07:13:34.819Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] RPC connection is broken syslog.1:2024-03-11T07:13:39.826Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken syslog.1:2024-03-11T08:14:34.003Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] RPC connection is broken syslog.1:2024-03-11T08:14:39.009Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken syslog.1:2024-03-11T09:55:32.595Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] RPC connection is broken syslog.1:2024-03-11T09:55:37.601Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken syslog.1:2024-03-12T01:43:35.262Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] RPC connection is broken syslog.1:2024-03-12T01:43:40.273Z ### NSX 3769 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] REPEATS: 3 repeats in 5 sec: RPC connection is broken |
5. 문제 시점 주변에 Edge VM이 위치한 ESXi Host의 Log 확인
>> IO Delay 관련 오류가 기록되어 있음
>> 다른 VM들도 동일한 오류 확인, 즉 특정 VM만의 이슈는 아닌 것으로 파악
>> Datastore 관련 APD Timeout 오류 확인
※ APD 관련 내용은 아래 링크 참고
https://haewon83.tistory.com/126
$ grep "###.*took over" hostd.all 2024-03-11T08:07:01.964Z warning hostd[2101565] [Originator@6876 sub=IoTracker] In thread 2101605, fopen("/vmfs/volumes/6495860a-e4937141-d4a5-6cfe545200b4/###/###.vmx") took over 12 sec. … 2024-03-12T05:09:10.320Z warning hostd[2117588] [Originator@6876 sub=IoTracker] In thread 2101605, fopen("/vmfs/volumes/6495860a-e4937141-d4a5-6cfe545200b4/###/###.vmx") took over 49 sec. |
$ grep "APD_TIMEOUT" vmkernel.log 2024-03-11T07:08:14.317Z cpu29:2097742)LVM: 6029: Received APD EventType: APD_TIMEOUT (5) for device <naa.600a098038314552795d5575396e4e53:1> (gen 2) 2024-03-11T07:08:14.317Z cpu29:2097742)LVM: 5617: Handling APD EventType: APD_TIMEOUT (5) for device <naa.600a098038314552795d5575396e4e53:1> (unlocked, gen 2, cur apd state APD_START, cur dev state 1) 2024-03-11T07:08:14.317Z cpu29:2097742)HBX: 6736: APD EventType: APD_TIMEOUT (5) for vol '###' 2024-03-11T07:08:14.317Z cpu29:2097742)LVM: 6029: Received APD EventType: APD_TIMEOUT (5) for device <naa.600a0980383145522f24557632546962:1> (gen 2) 2024-03-11T07:08:14.317Z cpu29:2097742)LVM: 5617: Handling APD EventType: APD_TIMEOUT (5) for device <naa.600a0980383145522f24557632546962:1> (unlocked, gen 2, cur apd state APD_START, cur dev state 1) 2024-03-11T07:08:14.317Z cpu29:2097742)HBX: 6736: APD EventType: APD_TIMEOUT (5) for vol '###' |
6. 고객사에는 다음 Action Plan을 전달
1) SHA 서비스 재시작 : 재시작 동안 본 서비스에 의한 Metric/Status 정보 업데이트가 중단
# service nsx-sha restart
2) 1번 안으로 Alarm Resolve가 안될 경우 SWN-PS-3Z-PB-T0-BASE06-vmedge01 Edge Node 재부팅
[Conclusion]
1. 즉, 문제 시점에 nsx-sha가 실행했던 "sudo cat /sys/class/net/eth0/operstate" 명령어가 Timeout req may hanging 메시지로 인해 실패한 것이고 이는 VM의 Hang 현상과 관련이 있는 것으로 추정
2. nsx-sha 서비스 재시작 후 Alarm 해소
'Networking' 카테고리의 다른 글
[NSX] Statistics|Received Packets dropped|Cumulative metric has increased (0) | 2024.04.06 |
---|---|
Python and Powershell can be used for NSX Support Bundle (0) | 2024.04.02 |
Packets keep going through load balancer to downed member server (0) | 2024.03.12 |
"incomp" state entries are detected due to ARP resolution failure (0) | 2024.03.03 |
Dataplaned process cannot start due to lack of malloc_heap (0) | 2024.03.03 |