[NSX] Failed to bring up one of vNICs after vMotion of VM edge


The Edge comes in two form factors. One is deployment as a VM; the other is deployment on a physical (bare-metal) server.

Because the Edge is deployed as a VM, it can use the features the hypervisor provides, including vMotion.


This case looks at a network service outage that occurred while an Edge VM was being migrated with vMotion for hypervisor maintenance (an ESXi update).





On 2024-01-08 the Edge VM was migrated with vMotion twice. At the second vMotion, the BGP state of edge01 changed to DOWN.

Time                      Source   Destination  Result
2024-01-08T05:11:58.295Z  esxi002  esxi001      OK
2024-01-08T07:18:34.320Z  esxi001  esxi002      Failed


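In this environment, x.x.x.17 (remote) ↔ x.x.x.18 (local) is a single link.
In other words, there is only one BGP connection on this path, so when its BGP state changes to DOWN, the related routes are removed from the upstream switch's routing table and all north-south traffic is lost.
The upstream switch was confirmed to use VRFs, with a separate BGP peer configured for each VRF.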


[Troubleshooting Notes]

1. Check the BGP configuration in the Edge support bundle


    "bgp_config": {
        "ecmp": true,
        "enabled": true,
        "gr_mode": "HELPER_ONLY_MODE",
        "gr_restart_timer": 180,
        "gr_stale_timer": 600,
        "inter_sr_ibgp": false,
        "local_as": xxxxx,
        "multipath_relax": true,
        "neighbor": [
                "address_family": [
                        "allow_as_in": false,
                        "enabled": true,
                        "type": "IPv4_UNICAST"
                "bgp_neighbor_uuid_name": "00005000-0000-1002-0000-000000000004",
                "enable": true,
                "enable_bfd": false,
                "hold_down_timer": 180,
                "ip_address": {
                    "ipv4": "x.x.x.1" ### <-- !!
                "keep_alive_timer": 60,
                "max_hop_limit": 1,
                "name": {
                    "string": "00e7847a-ede1-482e-b27e-91e3264baddc"
                "remote_as": xxxxx,
                "src_ip_address": {
                    "ipv4": "x.x.x.2" ### <-- !!
                "type": 2
                "address_family": [
                        "allow_as_in": false,
                        "enabled": true,
                        "type": "IPv4_UNICAST"
                "bgp_neighbor_uuid_name": "00005000-0000-1004-0000-000000000004",
                "enable": true,
                "enable_bfd": false,
                "hold_down_timer": 180,
                "ip_address": {
                    "ipv4": "x.x.x.17" ### <-- !!
                "keep_alive_timer": 60,
                "max_hop_limit": 1,
                "name": {
                    "string": "e7b7a692-fb19-4ef0-a9fe-1352d15c878b"
                "remote_as": xxxxx ,
                "src_ip_address": {
                    "ipv4": "x.x.x.18" ### <-- !!
                "type": 2
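
To pull out only the fields that matter here (peer address, source address, timers, BFD), the configuration can be summarized with a short script. This is a minimal sketch, assuming the file in the support bundle is well-formed JSON that uses the same key names as the excerpt above. The SAMPLE document is a hand-made, reduced stand-in with masked placeholder addresses, not the real file, and summarize_neighbors is a hypothetical helper name.

```python
import json

# Hand-made, reduced stand-in for the bgp_config section.
# Assumption: the real file is well-formed JSON with these key names.
SAMPLE = """
{
  "bgp_config": {
    "enabled": true,
    "neighbor": [
      {
        "enable": true,
        "enable_bfd": false,
        "hold_down_timer": 180,
        "keep_alive_timer": 60,
        "ip_address": {"ipv4": "x.x.x.1"},
        "src_ip_address": {"ipv4": "x.x.x.2"}
      },
      {
        "enable": true,
        "enable_bfd": false,
        "hold_down_timer": 180,
        "keep_alive_timer": 60,
        "ip_address": {"ipv4": "x.x.x.17"},
        "src_ip_address": {"ipv4": "x.x.x.18"}
      }
    ]
  }
}
"""


def summarize_neighbors(doc):
    """Return one dict per BGP neighbor with the fields relevant to this case."""
    rows = []
    for n in doc["bgp_config"]["neighbor"]:
        rows.append({
            "remote": n["ip_address"]["ipv4"],
            "local": n["src_ip_address"]["ipv4"],
            "hold": n["hold_down_timer"],
            "keepalive": n["keep_alive_timer"],
            "bfd": n["enable_bfd"],
        })
    return rows


neighbors = summarize_neighbors(json.loads(SAMPLE))
for row in neighbors:
    print(row)
```

With BFD disabled on both neighbors, the hold timer is the only mechanism on the Edge side that detects a silent loss of the peer.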


2. Confirm that BGP actually had a problem at the time of the incident

grep "state=BGP" ./var/log/syslog*

syslog.28:2024-01-08T07:18:51.489Z edge01 NSX 6452 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP x.x.x.17, peer_uuid: e7b7a692-fb19-4ef0-a9fe-1352d15c878b in SR: 12739a6e-f6fe-45c0-a890-46145db40cab, state=BGP_DOWN
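
When many rotated syslog files match, it can help to reduce each alarm line to its timestamp, peer, and state. The following is a rough sketch that only assumes the line layout seen in the single sample above; ALARM_RE and parse_bgp_alarm are hypothetical names, and other NSX versions may format the message differently.

```python
import re

# Pattern derived from the one alarm line shown above (assumed layout).
ALARM_RE = re.compile(
    r"(?:[^:\s]+:)?(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z) (?P<host>\S+) .*"
    r"Alarm for BGP (?P<peer>[^,\s]+), peer_uuid: (?P<uuid>[0-9a-f-]+) "
    r"in SR: (?P<sr>[0-9a-f-]+), state=(?P<state>BGP_\w+)"
)


def parse_bgp_alarm(line):
    """Return the alarm fields as a dict, or None if the line is not an alarm."""
    m = ALARM_RE.search(line)
    return m.groupdict() if m else None


sample = (
    'syslog.28:2024-01-08T07:18:51.489Z edge01 NSX 6452 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="rcpm" '
    's2comp="routing-service-realization" level="INFO"] '
    'Alarm for BGP x.x.x.17, peer_uuid: e7b7a692-fb19-4ef0-a9fe-1352d15c878b '
    'in SR: 12739a6e-f6fe-45c0-a890-46145db40cab, state=BGP_DOWN'
)

alarm = parse_bgp_alarm(sample)
print(alarm["ts"], alarm["host"], alarm["peer"], alarm["state"])
```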


3. Check the more detailed routing protocol logs from the time of the incident

grep "2024/01/08 07:1.*x.x.x.17" ./var/log/frr/frr*

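The BGP log shows that the BGP connection failed at the time of the incident because the BGP hold timer expired.
For the hold timer to expire, the Edge must have received none of the BGP KEEPALIVE messages that the upstream switch and the Edge are supposed to exchange for the entire hold time.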

frr.log.3:2024/01/08 07:12:13.046363 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0
frr.log.3:2024/01/08 07:14:35.295044 BGP: x.x.x.17 rcvd UPDATE wlen 0 attrlen 47 alen 5
frr.log.3:2024/01/08 07:14:35.303484 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0
frr.log.3:2024/01/08 07:14:36.579628 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0
### <-- !! Time of the incident
frr.log.3:2024/01/08 07:18:51.235534 BGP: x.x.x.17 [FSM] Timer (holdtime timer expire)
frr.log.3:2024/01/08 07:18:51.235594 BGP: x.x.x.17 [FSM] Hold_Timer_expired (Established->Clearing), fd 30
frr.log.3:2024/01/08 07:18:51.235597 BGP: x.x.x.17 [FSM] Hold timer expire
frr.log.3:2024/01/08 07:18:51.235626 BGP: %NOTIFICATION: sent to neighbor x.x.x.17 4/0 (Hold Timer Expired) 0 bytes
frr.log.3:2024/01/08 07:18:51.235655 BGP: %ADJCHANGE: neighbor x.x.x.17(Unknown) in vrf default Down BGP Notification send
frr.log.3:2024/01/08 07:18:51.235661 BGP: x.x.x.17: peer keepalive being removed, acquiring lock
frr.log.3:2024/01/08 07:18:51.235664 BGP: x.x.x.17: peer keepalive removed
frr.log.3:2024/01/08 07:18:51.235742 BGP: x.x.x.17(0x150ff92c81d0): close file descriptor
frr.log.3:2024/01/08 07:18:51.235868 BGP: x.x.x.17 (0x150ff92c81d0 -1) went from Established to Clearing
frr.log.3:2024/01/08 07:18:51.235879 BGP: Peer x.x.x.17 fd -1 send BGP_DOWN message to BGP adapter
frr.log.3:2024/01/08 07:18:51.235903 BGP: BGP Adapter: Send BGP_DOWN for peer x.x.x.17 (vrf: default)
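
Because the hold timer restarts whenever a KEEPALIVE or UPDATE arrives from the peer, the expiry timestamp can be turned into an estimate of when the Edge last heard from x.x.x.17, which narrows the window to inspect in later packet captures. The sketch below is only an illustration: it assumes the hold time in effect equals the 180 seconds configured on the Edge, whereas BGP uses the smaller of the two peers' values, so the real figure may be lower. last_message_time is a hypothetical helper name.

```python
from datetime import datetime, timedelta

FRR_TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def last_message_time(expiry: str, hold_seconds: int) -> datetime:
    """Estimate when the hold timer was last restarted, given its expiry time."""
    expired_at = datetime.strptime(expiry, FRR_TIME_FORMAT)
    return expired_at - timedelta(seconds=hold_seconds)


# Expiry timestamp taken from the frr log above.
# Assumption: 180 s is the configured value, not the confirmed negotiated one.
estimate = last_message_time("2024/01/08 07:18:51.235534", 180)
print(estimate)
```

If the upstream switch advertises a shorter hold time, rerun the calculation with that value before drawing conclusions about the timing.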


4. From the BGP side, the cause of the log entries in step 3 cannot be determined at this point, so the following data must be collected if the problem is reproduced

1. Check the BGP state on the Edge
Log in to the Edge node with an admin session, move to the Tier-0 SR VRF, and check the output of the following command
> get bgp neighbor summary
2. Run ping tests in both directions using the BGP IP addresses assigned to the upstream switch and the Edge SR interface
3. Capture packets on the Edge SR interface
> get logical-routers
> get logical-router <Tier-0 SR UUID> interfaces
> start capture interface <uplink interface UUID> ### View live
> start capture interface <uplink interface UUID> file <filename.pcap> ### Save to a file
4. Capture packets on the uplink interface of the ESXi hosts where the Edge VM runs (both source and destination)
# esxcli network nic list
# pktcap-uw --uplink <uplink interface> --capture UplinkSndKernel,UplinkRcvKernel -o - | tcpdump-uw -r - -nne  ### View live
# pktcap-uw --uplink <uplink interface> --capture UplinkSndKernel,UplinkRcvKernel -o /tmp/vmnic#.pcap ### Save to a file
5. Capture packets on the upstream switch interface
The goal is to verify the BGP packets exchanged between the two ends, so please collect this capture as well.
6. Check the BGP debug log
Log in to the Edge node with an admin session, move to the Tier-0 SR VRF, then run
> set debug
> set routing debug bgp all
> get routing debug bgp
After the reproduction is complete, turn debug mode off with the following command
> clear routing debug bgp all
7. Check the Edge VM's BGP logs for the duration of the reproduction


5. Nothing more can be checked on the NSX side, so review the hypervisor logs from the vMotion


6. Check the Edge VM's MAC addresses

$ grep ethernet edge01.vmx

ethernet0.generatedAddress = "xx:xx:xx:xx:xx:5a" ### <-- !!
ethernet1.generatedAddress = "xx:xx:xx:xx:xx:07" ### <-- !!
ethernet2.generatedAddress = "xx:xx:xx:xx:xx:5e" ### <-- !!
ethernet3.generatedAddress = "xx:xx:xx:xx:xx:a9" ### <-- !!
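
The same mapping from vNIC to MAC address can be built programmatically so it can be compared against the hypervisor logs later. This is a small sketch, assuming the MAC addresses are auto-generated and therefore stored under generatedAddress as in the grep output above (statically assigned addresses live under a different key and are not handled). vnic_macs is a hypothetical helper name, and the "### <-- !!" annotations from the listing are left out of the sample.

```python
import re

# Matches e.g.: ethernet3.generatedAddress = "xx:xx:xx:xx:xx:a9"
# Does not match ethernetN.generatedAddressOffset lines.
VMX_MAC_RE = re.compile(r'^(ethernet\d+)\.generatedAddress\s*=\s*"([^"]+)"')


def vnic_macs(lines):
    """Map each vNIC name (ethernetN) to its generated MAC address."""
    macs = {}
    for line in lines:
        m = VMX_MAC_RE.match(line.strip())
        if m:
            macs[m.group(1)] = m.group(2).lower()
    return macs


vmx_lines = [
    'ethernet0.generatedAddress = "xx:xx:xx:xx:xx:5a"',
    'ethernet0.generatedAddressOffset = "0"',
    'ethernet1.generatedAddress = "xx:xx:xx:xx:xx:07"',
    'ethernet2.generatedAddress = "xx:xx:xx:xx:xx:5e"',
    'ethernet3.generatedAddress = "xx:xx:xx:xx:xx:a9"',
]

macs = vnic_macs(vmx_lines)
print(macs)
```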


7. Review the vMotion process on the hypervisor

7-1. First vMotion, after which there was no problem


2024-01-08T05:11:58.373Z info hostd[2101840] [Originator@6876 sub=Vcsvc.VMotion opID=lr488fve-41800-auto-w95-h5:70004509-68-01-ee-0fd5 user=vpxuser:VSPHERE.LOCAL\Administrator] InitiateDestination [54608463442007720], VM = '/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx'
2024-01-08T05:12:13.170Z info hostd[2101473] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx opID=lr488fve-41800-auto-w95-h5:70004509-68-01-ee-0fd5] VMotion cleanup completed
2024-01-08T05:11:58.539Z In(05) vmx - Log for VMware ESX pid=2102918 version=7.0.3 build=build-22348816 option=Release
2024-01-08T05:11:58.587Z In(05) vmx - MigrateSetState: Transitioning from state 0 to 8.
2024-01-08T05:11:58.669Z In(05) vmx - MigrateSetState: Transitioning from state 8 to 9.
2024-01-08T05:11:58.717Z In(05) vmx - MigrateSetState: Transitioning from state 9 to 10.
2024-01-08T05:12:12.858Z In(05) vmx - MigrateSetState: Transitioning from state 10 to 11.
2024-01-08T05:12:13.087Z In(05) vmx - MigrateSetState: Transitioning from state 11 to 12.
2024-01-08T05:12:13.104Z In(05) vcpu-0 - MigrateSetState: Transitioning from state 12 to 0.
2024-01-08T05:12:12.903Z cpu69:2103082)Net: 2184: connected edge01.eth3 eth3 to vDS, portID 0x800002e
2024-01-08T05:12:12.903Z cpu69:2103082)Net: 2184: connected edge01.eth2 eth2 to vDS, portID 0x800002f
2024-01-08T05:12:12.903Z cpu69:2103082)Net: 2184: connected edge01.eth1 eth1 to vDS, portID 0x8000030
2024-01-08T05:12:12.904Z cpu69:2103082)Net: 2184: connected edge01 eth0 to VM Network, portID 0xa000031
2024-01-08T05:12:12.908Z cpu69:2103082)NetPort: 1543: enabled port 0xa000031 with mac xx:xx:xx:xx:xx:5a
2024-01-08T05:12:12.977Z cpu69:2103082)NetPort: 1543: enabled port 0x8000030 with mac xx:xx:xx:xx:xx:07
2024-01-08T05:12:12.978Z cpu69:2103082)NetPort: 1543: enabled port 0x800002f with mac xx:xx:xx:xx:xx:5e
2024-01-08T05:12:12.979Z cpu69:2103082)NetPort: 1543: enabled port 0x800002e with mac xx:xx:xx:xx:xx:a9


7-2. Second vMotion, after which the problem occurred

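Unlike the first vMotion, the second vMotion shows no port-enable log entry for MAC xx:xx:xx:xx:xx:a9, which belongs to the Edge's Ethernet3.
At the point where that entry should have been logged, the message "p2m update: cannot reserve" was logged instead.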

2024-01-08T07:18:34.381Z info hostd[2101823] [Originator@6876 sub=Vcsvc.VMotion opID=lr488fve-60018-auto-1ab7-h5:70007169-e4-01-7b-f5a5 user=vpxuser:VSPHERE.LOCAL\Administrator] InitiateDestination [54608471037763493], VM = '/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx'
2024-01-08T07:18:45.997Z info hostd[2101708] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx opID=lr488fve-60018-auto-1ab7-h5:70007169-e4-01-7b-f5a5] VMotion cleanup completed
2024-01-08T07:18:34.538Z In(05) vmx - Log for VMware ESX pid=2106732 version=7.0.3 build=build-22348816 option=Release
2024-01-08T07:18:34.579Z In(05) vmx - MigrateSetState: Transitioning from state 0 to 8.
2024-01-08T07:18:34.656Z In(05) vmx - MigrateSetState: Transitioning from state 8 to 9.
2024-01-08T07:18:34.704Z In(05) vmx - MigrateSetState: Transitioning from state 9 to 10.
2024-01-08T07:18:45.693Z In(05) vmx - MigrateSetState: Transitioning from state 10 to 11.
2024-01-08T07:18:45.925Z In(05) vmx - MigrateSetState: Transitioning from state 11 to 12.
2024-01-08T07:18:45.939Z In(05) vcpu-0 - MigrateSetState: Transitioning from state 12 to 0.
2024-01-08T07:18:45.745Z In(05) vcpu-0 - VMXNET3 user: Ethernet0 Driver Info: version = 17170432 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2024-01-08T07:18:45.784Z In(05) vcpu-0 - VMXNET3 user: Ethernet1 Driver Info: version = 16850944 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2024-01-08T07:18:45.785Z In(05) vcpu-0 - VMXNET3 user: Ethernet2 Driver Info: version = 16850944 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2024-01-08T07:18:45.785Z In(05) vcpu-0 - VMXNET3 user: failed to activate 'Ethernet3', status: 0xbad0001

2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth3 eth3 to vDS, portID 0x8000046
2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth2 eth2 to vDS, portID 0x8000047
2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth1 eth1 to vDS, portID 0x8000048
2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01 eth0 to VM Network, portID 0xa000049
2024-01-08T07:18:45.745Z cpu88:2106895)NetPort: 1543: enabled port 0xa000049 with mac xx:xx:xx:xx:xx:5a
2024-01-08T07:18:45.784Z cpu88:2106895)NetPort: 1543: enabled port 0x8000048 with mac xx:xx:xx:xx:xx:07
2024-01-08T07:18:45.785Z cpu88:2106895)NetPort: 1543: enabled port 0x8000047 with mac xx:xx:xx:xx:xx:5e
2024-01-08T07:18:45.785Z cpu88:2106895)VmMemCow: 1772: p2m update: cannot reserve - cur 0 0 rsvd 1216 req 65 avail 1279 ### <-- !!
2024-01-08T07:18:45.785Z cpu88:2106895)Vmxnet3: 11097: Failed to map the tx data ring for tq 0
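
The comparison made by eye above (four MAC addresses expected, only three "enabled port" entries present) can be automated. The sketch below assumes the "NetPort: ... enabled port ... with mac ..." wording seen in these vmkernel log lines; the expected mapping is copied from step 6, and missing_vnics is a hypothetical helper name.

```python
import re

# Pattern based on the vmkernel log lines shown above (assumed wording).
ENABLED_RE = re.compile(r"NetPort: \d+: enabled port (0x[0-9a-f]+) with mac (\S+)")


def missing_vnics(expected, vmkernel_lines):
    """Return the vNICs whose MAC never appears in an 'enabled port' entry."""
    enabled = set()
    for line in vmkernel_lines:
        m = ENABLED_RE.search(line)
        if m:
            enabled.add(m.group(2).lower())
    return {nic: mac for nic, mac in expected.items() if mac.lower() not in enabled}


# vNIC-to-MAC mapping from the .vmx file (step 6).
expected = {
    "ethernet0": "xx:xx:xx:xx:xx:5a",
    "ethernet1": "xx:xx:xx:xx:xx:07",
    "ethernet2": "xx:xx:xx:xx:xx:5e",
    "ethernet3": "xx:xx:xx:xx:xx:a9",
}

# vmkernel lines from the second vMotion.
second_vmotion = [
    "2024-01-08T07:18:45.745Z cpu88:2106895)NetPort: 1543: enabled port 0xa000049 with mac xx:xx:xx:xx:xx:5a",
    "2024-01-08T07:18:45.784Z cpu88:2106895)NetPort: 1543: enabled port 0x8000048 with mac xx:xx:xx:xx:xx:07",
    "2024-01-08T07:18:45.785Z cpu88:2106895)NetPort: 1543: enabled port 0x8000047 with mac xx:xx:xx:xx:xx:5e",
    "2024-01-08T07:18:45.785Z cpu88:2106895)VmMemCow: 1772: p2m update: cannot reserve - cur 0 0 rsvd 1216 req 65 avail 1279",
    "2024-01-08T07:18:45.785Z cpu88:2106895)Vmxnet3: 11097: Failed to map the tx data ring for tq 0",
]

missing = missing_vnics(expected, second_vmotion)
print(missing)  # {'ethernet3': 'xx:xx:xx:xx:xx:a9'}
```

Running the same check against the log lines from the first vMotion should return an empty result, since all four ports were enabled there.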


8. Determine what Ethernet3, the vNIC with MAC address xx:xx:xx:xx:xx:a9, is used for


The network interface that carries the affected MAC address is bound to the IP address x.x.x.18/29.
This is the BGP local IP address identified when the BGP DOWN event was first investigated.

53: uplink-283@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 2 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet x.x.x.18/29 brd x.x.x.23 scope global uplink-283
   valid_lft forever preferred_lft forever
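
On an Edge with many interfaces, the lookup from MAC address to bound IP address can be scripted. This is a minimal sketch, assuming output in the same layout as the interface listing above (a numbered header line, then link/ether, then inet lines); addresses_for_mac is a hypothetical helper name.

```python
import re


def addresses_for_mac(ip_output, mac):
    """Return (interface, address) pairs for interfaces with the given MAC."""
    results = []
    current_if = None
    current_mac = None
    for line in ip_output.splitlines():
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            # New interface block, e.g. "53: uplink-283@if44: <...>"
            current_if = header.group(1)
            current_mac = None
            continue
        ether = re.search(r"link/ether\s+(\S+)", line)
        if ether:
            current_mac = ether.group(1).lower()
            continue
        inet = re.match(r"^\s+inet6?\s+(\S+)", line)
        if inet and current_mac == mac.lower():
            results.append((current_if, inet.group(1)))
    return results


sample = """\
53: uplink-283@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 2 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet x.x.x.18/29 brd x.x.x.23 scope global uplink-283
       valid_lft forever preferred_lft forever
"""

bound = addresses_for_mac(sample, "xx:xx:xx:xx:xx:a9")
print(bound)  # [('uplink-283', 'x.x.x.18/29')]
```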



1. During the vMotion, the destination host was connecting the four Ethernet adapters assigned to the Edge VM, and the port enable for one of them failed because the p2m buffer could not be reserved


2. To avoid this symptom, the p2m buffer must be increased from its default size to the maximum size



Configuring P2M Buffer size for virtual machines. (76387)
Note: NSX Edge VMs can be affected by this issue


