[NSX] BGP down with Hold_Timer_expired because NIC fp-eth1 queue ## TX hang detected
In an environment that uses Bare-metal Edge nodes, BGP intermittently went down after an upgrade to NSX version 4.1.2.3. This post shares what was found while investigating the issue.
[Troubleshooting Notes]
1. Check the NSX version
$ cat ./etc/nsx_issue
version: 4.1.2.3.0.23382424
node-type: nsx-edge
build-type: release
export-type: unrestricted
2. Check the BGP routing configuration
./edge/tier0_sr_routing_config
ROUTING CONFIGURATION:
======================
bgp_config:
-----------
...
neighbor:
NEIGHBOR #1
...
enable : True
enable_bfd : False
hold_down_timer : 3
ip_address : ###.###.###.1
keep_alive_timer : 1
max_hop_limit : 1
...
src_ip_address : ###.###.###.2
type : 2
NEIGHBOR #2
...
enable : True
enable_bfd : False
hold_down_timer : 3
ip_address : ###.###.###.3
keep_alive_timer : 1
max_hop_limit : 1
...
src_ip_address : ###.###.###.4
type : 2
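With keep_alive_timer set to 1, hold_down_timer set to 3, and BFD disabled, the session depends entirely on the BGP hold timer: if nothing arrives from the peer for about 3 seconds, the session is torn down. The following is a minimal Python sketch of that arithmetic, assuming both timers are in seconds. The function names and the outage durations are hypothetical and only illustrate why a short NIC reset can be enough to drop the session.

```python
def hold_timer_expires(outage_seconds: float, hold_time: float = 3.0) -> bool:
    """Return True if a receive gap of outage_seconds exhausts the hold timer.

    Simplified model: the hold timer restarts whenever a KEEPALIVE or UPDATE
    is received, so it expires once nothing has arrived for hold_time seconds.
    """
    return outage_seconds >= hold_time


def missed_keepalives(outage_seconds: float, keepalive: float = 1.0) -> int:
    """Approximate number of whole keepalive intervals inside the outage."""
    return int(outage_seconds // keepalive)


if __name__ == "__main__":
    # Hypothetical outage durations, not measured values from this incident.
    for outage in (0.5, 2.0, 3.0, 5.0):
        print(outage, missed_keepalives(outage), hold_timer_expires(outage))
```

The model ignores where the outage starts relative to the last received message, so the real boundary can differ by up to one keepalive interval.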
3. Check the FRR (BGP routing protocol) log
The log confirms that BGP went down at the time of the problem.
./var/log/frr.log
<DATE_TIME> BGP: ###.###.###.3 [FSM] Timer (holdtime timer expire)
<DATE_TIME> BGP: ###.###.###.3 [FSM] Hold_Timer_expired (Established->Clearing), fd 36
<DATE_TIME> BGP: ###.###.###.3 [FSM] Hold timer expire
<DATE_TIME> BGP: %NOTIFICATION: sent to neighbor ###.###.###.3 4/0 (Hold Timer Expired) 0 bytes
<DATE_TIME> BGP: %ADJCHANGE: neighbor ###.###.###.3(Unknown) in vrf default Down BGP Notification send
...
4. Check syslog
Immediately before the problem occurred, the fp-eth1 NIC was reset.
./var/log/syslog
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="WARN"] NIC fp-eth1 queue 10 TX hang detected
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="WARN"] NIC fp-eth1 reset successfully
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="phys-port" tname="dp-ipc56" level="WARN" eventId="vmwNSXPhysicalNicStatus"] {"event_state":0,"event_external_reason":"Physical port link down","event_src_comp_id":"<UUID>","event_sources":{"interface_name":"fp-eth1"}}
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="phys-port" tname="dp-ipc56" level="INFO"] Sent to nsxa the link status DOWN event of fp-eth1
<DATE_TIME> <HOSTNAME> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="dp-state" level="INFO"] Update dp state device fp-eth1 type 1 mac <MAC_ADDRESS> state up 0
<DATE_TIME> <HOSTNAME> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="dp-database" level="INFO"] Update device fp-eth1 state to DOWN
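When a support bundle contains a large syslog, it can help to pull out only the TX hang, reset, and link-down messages in the order they appear. The Python sketch below does that with regular expressions built from the message text shown above. The function name scan_syslog and the shortened sample lines are illustrative assumptions, not part of any NSX tooling.

```python
import re

# Patterns copied from the message text of the syslog lines in this post.
TX_HANG = re.compile(r"NIC (?P<nic>\S+) queue (?P<queue>\d+) TX hang detected")
NIC_RESET = re.compile(r"NIC (?P<nic>\S+) reset successfully")
LINK_DOWN = re.compile(r"Update device (?P<nic>\S+) state to DOWN")


def scan_syslog(lines):
    """Return (event, nic, queue) tuples in the order the lines appear."""
    events = []
    for line in lines:
        m = TX_HANG.search(line)
        if m:
            events.append(("tx_hang", m.group("nic"), int(m.group("queue"))))
            continue
        m = NIC_RESET.search(line)
        if m:
            events.append(("reset", m.group("nic"), None))
            continue
        m = LINK_DOWN.search(line)
        if m:
            events.append(("link_down", m.group("nic"), None))
    return events


if __name__ == "__main__":
    # Shortened sample lines; the leading fields are placeholders.
    sample = [
        '<DATE_TIME> <HOSTNAME> NSX ... NIC fp-eth1 queue 10 TX hang detected',
        '<DATE_TIME> <HOSTNAME> NSX ... NIC fp-eth1 reset successfully',
        '<DATE_TIME> <HOSTNAME> NSX ... Update device fp-eth1 state to DOWN',
    ]
    for event in scan_syslog(sample):
        print(event)
```

To scan a real file, pass an open file object instead of the sample list, since the function only iterates over lines.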
5. Check the interface to which the BGP IP address is bound
The uplink-294 interface, to which the IP address ###.###.###.4 is bound, has the same MAC address as fp-eth1. The BGP session to neighbor ###.###.###.3, which uses ###.###.###.4 as its source address, therefore runs over fp-eth1.
./edge/tier0_sr_ip_addr
74: uplink-294@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether <MAC_ADDRESS> brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id <ID> <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet ###.###.###.4/<SUBNET> brd ###.###.###.### scope global uplink-294
       valid_lft forever preferred_lft forever
./edge/logical-switch-vlan
...
{
  "ingress": { "ifname": "fp-eth1", "ifuid": 1 },
  "vlan": <VLAN_ID>,
  "mac": "<MAC_ADDRESS>",
  "egress": { "ifuuid": "<UUID>", "ifuid": ### }
...
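The MAC comparison above can also be done with a small script when there are many uplinks. The following Python sketch looks up the interface name and MAC address that carry a given IPv4 address in `ip addr` style output, then compares that MAC with one taken from the logical-switch-vlan file. It assumes the usual multi-line `ip addr` layout; the function names, the 192.0.2.x addresses, and the MAC values are hypothetical placeholders.

```python
import re

IFACE = re.compile(r"^\d+:\s+(?P<name>[^:@\s]+)")          # "74: uplink-294@if59:"
MAC = re.compile(r"link/ether\s+(?P<mac>[0-9a-fA-F:]{17})")
INET = re.compile(r"\binet\s+(?P<ip>\d+\.\d+\.\d+\.\d+)/")  # IPv4 only


def interface_for_ip(ip_addr_output, ip):
    """Return (ifname, mac) of the interface holding ip, or None if absent."""
    name = mac = None
    for line in ip_addr_output.splitlines():
        m = IFACE.match(line)
        if m:
            # A new interface block starts; forget the previous MAC.
            name, mac = m.group("name"), None
        m = MAC.search(line)
        if m:
            mac = m.group("mac")
        m = INET.search(line)
        if m and m.group("ip") == ip:
            return name, mac
    return None


def same_mac(a, b):
    """Compare two MAC addresses case-insensitively."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    sample = (
        "74: uplink-294@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000\n"
        "    link/ether 00:50:56:00:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 0\n"
        "    inet 192.0.2.4/24 brd 192.0.2.255 scope global uplink-294\n"
    )
    found = interface_for_ip(sample, "192.0.2.4")
    print(found)
    # Hypothetical MAC read from the fp-eth1 entry in logical-switch-vlan.
    print(same_mac(found[1], "00:50:56:00:00:01"))
```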
6. This symptom is a known issue; the KB article below describes the symptom and a workaround
To resolve the symptom in a bare-metal environment, the Tx Hung Detection feature that was added in NSX 4.1.1 needs to be disabled.
KB: NSX Bare Metal Edge NIC flapping
https://knowledge.broadcom.com/external/article?legacyId=97654
[Conclusion]
1. Upgrade to NSX 4.1.2.4, which fixes this issue:
https://docs.vmware.com/en/VMware-NSX/4.1.2.4/rn/vmware-nsx-4124-release-notes/index.html
Fixed Issue 3373801: Physical NIC in bare metal edge resets under heavy traffic.
The reset of the device results in the interface being inoperable for a short time. Traffic might be lost and BFD sessions will flap.
2. Alternatively, work around the issue by disabling the added Tx Hung Detection feature with the command described in KB 97654.