[NSX] BGP down with Hold_Timer_expired because NIC fp-eth1 queue ## TX hang detected

haewon83 2024. 7. 25. 08:55

In an environment using Bare-metal Edge nodes, BGP sessions intermittently went down after upgrading to NSX 4.1.2.3. This post shares what was found while investigating.

[Troubleshooting Notes]

1. Check the NSX version

$ cat ./etc/nsx_issue
version: 4.1.2.3.0.23382424
node-type: nsx-edge
build-type: release
export-type: unrestricted

2. Check the BGP routing configuration

./edge/tier0_sr_routing_config

    ROUTING CONFIGURATION:
    ======================

    bgp_config:
    -----------
...
    neighbor:
      NEIGHBOR #1
...
        enable                : True
        enable_bfd            : False
        hold_down_timer       : 3
        ip_address            : ###.###.###.1
        keep_alive_timer      : 1
        max_hop_limit         : 1
...
        src_ip_address        : ###.###.###.2
        type                  : 2

      NEIGHBOR #2
...
        enable                : True
        enable_bfd            : False
        hold_down_timer       : 3
        ip_address            : ###.###.###.3
        keep_alive_timer      : 1
        max_hop_limit         : 1
...
        src_ip_address        : ###.###.###.4
        type                  : 2
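
Both neighbors run with a 1-second keepalive and a 3-second hold-down timer and have BFD disabled, so a gap of about 3 seconds without any BGP message from the peer is enough to tear the session down. The following is a minimal sketch, not an NSX tool, that pulls those timers out of a dump shaped like the excerpt above. The function names and the 192.0.2.x addresses are illustrative assumptions; the real addresses are masked in this post.

```python
import re


def parse_neighbor_timers(config_text):
    """Map each neighbor IP to (keep_alive_timer, hold_down_timer) in seconds."""
    timers = {}
    current = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("NEIGHBOR #"):
            current = {}  # a new neighbor block begins
            continue
        match = re.match(r"(\w+)\s*:\s*(\S+)$", line)
        if not match:
            continue  # skip headings, separators and elided "..." lines
        current[match.group(1)] = match.group(2)
        if {"ip_address", "keep_alive_timer", "hold_down_timer"} <= current.keys():
            timers[current["ip_address"]] = (
                int(current["keep_alive_timer"]),
                int(current["hold_down_timer"]),
            )
    return timers


def hold_timer_expires(hold_down_timer, silence_seconds):
    """True if a gap with no BGP messages is long enough to reach the hold time."""
    return silence_seconds >= hold_down_timer


# Sample text modelled on the excerpt above; the addresses are placeholders.
SAMPLE = """
    neighbor:
      NEIGHBOR #1
        enable_bfd            : False
        hold_down_timer       : 3
        ip_address            : 192.0.2.1
        keep_alive_timer      : 1
      NEIGHBOR #2
        hold_down_timer       : 3
        ip_address            : 192.0.2.3
        keep_alive_timer      : 1
"""

for peer, (keepalive, hold) in parse_neighbor_timers(SAMPLE).items():
    print(f"{peer}: keepalive={keepalive}s hold={hold}s, "
          f"session drops after {hold}s without a message from the peer")
```

With timers this aggressive, even a brief NIC reset on the uplink can be long enough for the hold timer to expire.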

3. Check the FRR (BGP routing protocol) log

The log confirms that BGP actually went down at the time of the problem.

./var/log/frr.log

<DATE_TIME> BGP: ###.###.###.3 [FSM] Timer (holdtime timer expire)
<DATE_TIME> BGP: ###.###.###.3 [FSM] Hold_Timer_expired (Established->Clearing), fd 36
<DATE_TIME> BGP: ###.###.###.3 [FSM] Hold timer expire
<DATE_TIME> BGP: %NOTIFICATION: sent to neighbor ###.###.###.3 4/0 (Hold Timer Expired) 0 bytes
<DATE_TIME> BGP: %ADJCHANGE: neighbor ###.###.###.3(Unknown) in vrf default Down BGP Notification send
...
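
To find every hold-timer expiry in a larger frr.log, a small filter along these lines may help. It is a sketch that assumes the line layout shown above; the timestamp is treated as opaque text because its format is masked in this post, and the regular expression and function name are my own.

```python
import re
from collections import Counter

# Matches lines such as:
# <DATE_TIME> BGP: 192.0.2.3 [FSM] Hold_Timer_expired (Established->Clearing), fd 36
HOLD_EXPIRED = re.compile(
    r"^(?P<ts>.*?) BGP: (?P<peer>\S+) \[FSM\] Hold_Timer_expired "
    r"\((?P<old>\w+)->(?P<new>\w+)\)"
)


def find_hold_timer_expiries(lines):
    """Return one dict (ts, peer, old, new) per Hold_Timer_expired FSM event."""
    events = []
    for line in lines:
        match = HOLD_EXPIRED.match(line)
        if match:
            events.append(match.groupdict())
    return events


# Placeholder lines modelled on the excerpt above (peer address is illustrative).
SAMPLE_LOG = [
    "<DATE_TIME> BGP: 192.0.2.3 [FSM] Timer (holdtime timer expire)",
    "<DATE_TIME> BGP: 192.0.2.3 [FSM] Hold_Timer_expired (Established->Clearing), fd 36",
    "<DATE_TIME> BGP: 192.0.2.3 [FSM] Hold timer expire",
]

expiries = find_hold_timer_expiries(SAMPLE_LOG)
per_peer = Counter(event["peer"] for event in expiries)
for peer, count in per_peer.items():
    print(f"{peer}: {count} hold timer expiry event(s)")
```

To run it against a real file, pass an open file object, for example `find_hold_timer_expiries(open("./var/log/frr.log"))`.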

4. Check syslog

Immediately before the problem occurred, the fp-eth1 NIC was reset.

./var/log/syslog

<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="WARN"] NIC fp-eth1 queue 10 TX hang detected
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" tname="stats59" level="WARN"] NIC fp-eth1 reset successfully
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="phys-port" tname="dp-ipc56" level="WARN" eventId="vmwNSXPhysicalNicStatus"] {"event_state":0,"event_external_reason":"Physical port link down","event_src_comp_id":"<UUID>","event_sources":{"interface_name":"fp-eth1"}}
<DATE_TIME> <HOSTNAME> NSX <PID> FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="phys-port" tname="dp-ipc56" level="INFO"] Sent to nsxa the link status DOWN event of fp-eth1
<DATE_TIME> <HOSTNAME> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="dp-state" level="INFO"] Update dp state device fp-eth1 type 1 mac <MAC_ADDRESS> state up 0
<DATE_TIME> <HOSTNAME> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="dp-database" level="INFO"] Update device fp-eth1 state to DOWN
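
The two datapathd messages that matter here are "TX hang detected" and "reset successfully". The sketch below scans syslog lines for both and reports the NIC and queue involved. The patterns are derived only from the lines quoted above and may not cover other NSX versions; the function name is an illustrative assumption.

```python
import re

TX_HANG = re.compile(r"NIC (?P<nic>\S+) queue (?P<queue>\d+) TX hang detected")
NIC_RESET = re.compile(r"NIC (?P<nic>\S+) reset successfully")


def find_nic_events(lines):
    """Return (line_number, kind, nic, queue) tuples for TX hang and reset messages."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = TX_HANG.search(line)
        if match:
            events.append((number, "tx_hang", match["nic"], int(match["queue"])))
            continue
        match = NIC_RESET.search(line)
        if match:
            events.append((number, "reset", match["nic"], None))
    return events


# Shortened copies of the syslog lines above; header fields are trimmed for brevity.
SAMPLE_SYSLOG = [
    '[nsx@6876 comp="nsx-edge" subcomp="datapathd" level="WARN"] NIC fp-eth1 queue 10 TX hang detected',
    '[nsx@6876 comp="nsx-edge" subcomp="datapathd" level="WARN"] NIC fp-eth1 reset successfully',
    '[nsx@6876 comp="nsx-edge" subcomp="nsxa" level="INFO"] Update device fp-eth1 state to DOWN',
]

for number, kind, nic, queue in find_nic_events(SAMPLE_SYSLOG):
    detail = f" queue {queue}" if queue is not None else ""
    print(f"line {number}: {kind} on {nic}{detail}")
```

A TX hang followed directly by a reset of the same NIC is the pattern to look for just before each BGP hold-timer expiry.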

5. Check which interface the BGP IP address is bound to

The uplink-294 interface, to which the IP address ###.###.###.4 is bound, has the same MAC address as fp-eth1.

./edge/tier0_sr_ip_addr

    74: uplink-294@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
        link/ether <MAC_ADDRESS> brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 0 maxmtu 65535
        vlan protocol 802.1Q id <ID> <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
        inet ###.###.###.4/<SUBNET> brd ###.###.###.### scope global uplink-294
        valid_lft forever preferred_lft forever

./edge/logical-switch-vlan
...
        {
            "ingress": {
                "ifname": "fp-eth1",
                "ifuid": 1
            },
            "vlan": <VLAN_ID>,
            "mac": "<MAC_ADDRESS>",
            "egress": {
                "ifuuid": "<UUID>",
                "ifuid": ###
            }
...
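
The MAC comparison above can also be done mechanically. The following is a rough sketch under several assumptions: the `ip addr` style text looks like the tier0_sr_ip_addr excerpt, the logical-switch-vlan entries have already been loaded into a list of dicts shaped like the fragment shown (the surrounding JSON structure is elided in this post, so it is not parsed here), and all addresses and the VLAN ID are placeholders.

```python
import re


def mac_for_ip(ip_addr_text, ip):
    """Return (interface, mac) for the interface that carries `ip`, or None."""
    iface = mac = None
    for line in ip_addr_text.splitlines():
        header = re.match(r"\s*\d+:\s+([^@:\s]+)", line)
        if header:
            iface, mac = header.group(1), None  # a new interface block begins
            continue
        ether = re.search(r"link/ether\s+(\S+)", line)
        if ether:
            mac = ether.group(1).lower()
            continue
        inet = re.search(r"\binet\s+([0-9.]+)/", line)
        if inet and inet.group(1) == ip:
            return iface, mac
    return None


def physical_ports_for_mac(entries, mac):
    """Return ingress interface names of entries whose MAC equals `mac`."""
    return [
        entry["ingress"]["ifname"]
        for entry in entries
        if entry.get("mac", "").lower() == mac
        and "ifname" in entry.get("ingress", {})
    ]


# Placeholder data modelled on the excerpts above.
IP_ADDR_SAMPLE = """
    74: uplink-294@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
        link/ether 00:50:56:AA:BB:CC brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 192.0.2.4/24 brd 192.0.2.255 scope global uplink-294
"""
VLAN_ENTRIES = [
    {"ingress": {"ifname": "fp-eth1", "ifuid": 1}, "vlan": 100, "mac": "00:50:56:aa:bb:cc"},
]

iface, mac = mac_for_ip(IP_ADDR_SAMPLE, "192.0.2.4")
ports = physical_ports_for_mac(VLAN_ENTRIES, mac)
print(f"{iface} ({mac}) is backed by {ports}")
```

If the BGP source address resolves to the NIC that was reset, the NIC reset and the BGP drop are very likely the same event.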

6. This symptom is a known issue; the KB below describes the symptom and a workaround

To resolve the symptom in a bare-metal environment, the Tx Hung Detection feature added in NSX 4.1.1 needs to be disabled.

NSX Bare Metal Edge NIC flapping

https://knowledge.broadcom.com/external/article?legacyId=97654

[Conclusion]

1. This issue can be resolved by upgrading to 4.1.2.4, or

https://docs.vmware.com/en/VMware-NSX/4.1.2.4/rn/vmware-nsx-4124-release-notes/index.html
Fixed Issue 3373801: Physical NIC in bare metal edge resets under heavy traffic.
The reset of the device results in the interface being inoperable for a short time. Traffic might be lost and BFD sessions will flap.

2. worked around by disabling the added feature with the command described in KB97654.
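
As a quick triage aid, the version check from step 1 can be scripted. This sketch reads a `version:` line like the one in nsx_issue and flags builds from 4.1.1 (where, per this post, Tx Hung Detection was added) up to but not including 4.1.2.4 (where issue 3373801 is listed as fixed). That range is only my reading of this post and is an assumption; other release trains are not covered, so check KB97654 and the release notes for the authoritative list.

```python
def parse_nsx_version(issue_text):
    """Return the version from a 'version: a.b.c.d...' line as a tuple of ints."""
    for line in issue_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "version":
            return tuple(int(part) for part in value.strip().split("."))
    raise ValueError("no 'version:' line found")


def tx_hang_reset_suspected(version):
    """True for 4.1.1 <= version < 4.1.2.4 (assumed range, see the note above)."""
    return (4, 1, 1) <= version[:3] and version[:4] < (4, 1, 2, 4)


# Contents of ./etc/nsx_issue as shown in step 1.
NSX_ISSUE = """version: 4.1.2.3.0.23382424
node-type: nsx-edge
build-type: release
export-type: unrestricted
"""

version = parse_nsx_version(NSX_ISSUE)
print(version, "suspect" if tx_hang_reset_suspected(version) else "not in the assumed range")
```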