An Edge comes in two form factors. One is deployment as a VM; the other is deployment on a physical (bare-metal) server.
Because the Edge runs as a VM, the features the hypervisor provides, including vMotion, are available to it.
This case covers a network service outage that occurred while an Edge VM was being moved with vMotion for hypervisor maintenance (an ESXi update).
[Environment]
[Symptom]
On 2024-01-08 the Edge VM was moved with vMotion twice; at the second vMotion the BGP state of edge01 changed to DOWN.
Timestamp | Source | Destination | Result |
---|---|---|---|
2024-01-08T05:11:58.295Z | esxi002 | esxi001 | Normal |
2024-01-08T07:18:34.320Z | esxi001 | esxi002 | Abnormal |
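In this environment, x.x.x.17 (remote) ↔ x.x.x.18 (local) is a single connection.
In other words, there is only one BGP connection on this path, so when its BGP state changes to DOWN the related routes are removed from the upstream switch's routing table and all North-South traffic is cut off.
The upstream switch was confirmed to have its BGP peers configured separately, each through a VRF configuration.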
[Troubleshooting Notes]
1. Check the BGP configuration in the Edge Support Bundle
/edge/tier0_sr_routing_config
    ROUTING CONFIGURATION:
    ======================
    {
        "bgp_config": {
            "ecmp": true,
            "enabled": true,
            "gr_mode": "HELPER_ONLY_MODE",
            "gr_restart_timer": 180,
            "gr_stale_timer": 600,
            "inter_sr_ibgp": false,
            "local_as": xxxxx,
            "multipath_relax": true,
            "neighbor": [
                {
                    "address_family": [
                        {
                            "allow_as_in": false,
                            "enabled": true,
                            "type": "IPv4_UNICAST"
                        }
                    ],
                    "bgp_neighbor_uuid_name": "00005000-0000-1002-0000-000000000004",
                    "enable": true,
                    "enable_bfd": false,
                    "hold_down_timer": 180,
                    "ip_address": {
                        "ipv4": "x.x.x.1"    ### <-- !!
                    },
                    "keep_alive_timer": 60,
                    "max_hop_limit": 1,
                    "name": {
                        "string": "00e7847a-ede1-482e-b27e-91e3264baddc"
                    },
                    "remote_as": xxxxx,
                    "src_ip_address": {
                        "ipv4": "x.x.x.2"    ### <-- !!
                    },
                    "type": 2
                },
                {
                    "address_family": [
                        {
                            "allow_as_in": false,
                            "enabled": true,
                            "type": "IPv4_UNICAST"
                        }
                    ],
                    "bgp_neighbor_uuid_name": "00005000-0000-1004-0000-000000000004",
                    "enable": true,
                    "enable_bfd": false,
                    "hold_down_timer": 180,
                    "ip_address": {
                        "ipv4": "x.x.x.17"    ### <-- !!
                    },
                    "keep_alive_timer": 60,
                    "max_hop_limit": 1,
                    "name": {
                        "string": "e7b7a692-fb19-4ef0-a9fe-1352d15c878b"
                    },
                    "remote_as": xxxxx,
                    "src_ip_address": {
                        "ipv4": "x.x.x.18"    ### <-- !!
                    },
                    "type": 2
                }
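When a Tier-0 SR has several neighbors, it can help to reduce the dump to one row per peer. The following is a minimal Python sketch, not an NSX tool: `summarize_neighbors` is a hypothetical helper, and it assumes the `bgp_config` section has already been loaded into a dict (the dump above is masked and annotated, so it is not valid JSON as shown). The sample values are the ones from the dump.

```python
def summarize_neighbors(bgp_config):
    """Return one small dict per BGP neighbor found in a parsed bgp_config."""
    rows = []
    for neighbor in bgp_config.get("neighbor", []):
        rows.append({
            "peer": neighbor["ip_address"]["ipv4"],
            "source": neighbor["src_ip_address"]["ipv4"],
            "keepalive": neighbor["keep_alive_timer"],
            "hold": neighbor["hold_down_timer"],
            "bfd": neighbor["enable_bfd"],
        })
    return rows


# Sample input: only the fields used above, copied from the dump.
bgp_config = {
    "neighbor": [
        {
            "ip_address": {"ipv4": "x.x.x.1"},
            "src_ip_address": {"ipv4": "x.x.x.2"},
            "keep_alive_timer": 60,
            "hold_down_timer": 180,
            "enable_bfd": False,
        },
        {
            "ip_address": {"ipv4": "x.x.x.17"},
            "src_ip_address": {"ipv4": "x.x.x.18"},
            "keep_alive_timer": 60,
            "hold_down_timer": 180,
            "enable_bfd": False,
        },
    ]
}

for row in summarize_neighbors(bgp_config):
    print(row)
```

Each row pairs a peer address with the local source address, which is the pair used later to tie the BGP session to an Edge uplink interface.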
2. Check whether BGP actually had a problem at the time of the issue
grep "state=BGP" ./var/log/syslog*
    syslog.28:2024-01-08T07:18:51.489Z edge01 NSX 6452 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP x.x.x.17, peer_uuid: e7b7a692-fb19-4ef0-a9fe-1352d15c878b in SR: 12739a6e-f6fe-45c0-a890-46145db40cab, state=BGP_DOWN
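If the grep returns many alarm lines, the timestamp, peer, and state can be pulled out with a small script. This is a sketch in Python that assumes the alarm lines keep the exact wording shown above; `parse_bgp_alarm` is a hypothetical helper name, and the regular expression is derived only from this one sample line.

```python
import re

# Pattern derived from the single sample line above; other NSX versions may differ.
ALARM_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z) (?P<host>\S+) .*"
    r"Alarm for BGP (?P<peer>[^,\s]+), peer_uuid: (?P<uuid>\S+) "
    r"in SR: (?P<sr>[^,\s]+), state=(?P<state>\w+)"
)


def parse_bgp_alarm(line):
    """Return the alarm fields as a dict, or None if the line is not a BGP alarm."""
    match = ALARM_RE.search(line)
    return match.groupdict() if match else None


sample = (
    'syslog.28:2024-01-08T07:18:51.489Z edge01 NSX 6452 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="rcpm" '
    's2comp="routing-service-realization" level="INFO"] '
    'Alarm for BGP x.x.x.17, peer_uuid: e7b7a692-fb19-4ef0-a9fe-1352d15c878b '
    'in SR: 12739a6e-f6fe-45c0-a890-46145db40cab, state=BGP_DOWN'
)

print(parse_bgp_alarm(sample))
```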
3. Check the more detailed routing protocol logs around the time of the issue
grep "2024/01/08 07:1.*x.x.x.17" ./var/log/frr/frr*
The BGP log shows that, at the time of the issue, the BGP connection went down because the BGP Hold Timer expired.
For the Hold Timer to expire, the BGP Keepalive messages that the upstream switch and the Edge are supposed to exchange must have gone missing.
    frr.log.3:2024/01/08 07:12:13.046363 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0
    frr.log.3:2024/01/08 07:14:35.295044 BGP: x.x.x.17 rcvd UPDATE wlen 0 attrlen 47 alen 5
    frr.log.3:2024/01/08 07:14:35.303484 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0
    frr.log.3:2024/01/08 07:14:36.579628 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0
    ### <-- !! Time of the issue
    frr.log.3:2024/01/08 07:18:51.235534 BGP: x.x.x.17 [FSM] Timer (holdtime timer expire)
    frr.log.3:2024/01/08 07:18:51.235594 BGP: x.x.x.17 [FSM] Hold_Timer_expired (Established->Clearing), fd 30
    frr.log.3:2024/01/08 07:18:51.235597 BGP: x.x.x.17 [FSM] Hold timer expire
    frr.log.3:2024/01/08 07:18:51.235626 BGP: %NOTIFICATION: sent to neighbor x.x.x.17 4/0 (Hold Timer Expired) 0 bytes
    frr.log.3:2024/01/08 07:18:51.235655 BGP: %ADJCHANGE: neighbor x.x.x.17(Unknown) in vrf default Down BGP Notification send
    frr.log.3:2024/01/08 07:18:51.235661 BGP: x.x.x.17: peer keepalive being removed, acquiring lock
    frr.log.3:2024/01/08 07:18:51.235664 BGP: x.x.x.17: peer keepalive removed
    frr.log.3:2024/01/08 07:18:51.235742 BGP: x.x.x.17(0x150ff92c81d0): close file descriptor
    frr.log.3:2024/01/08 07:18:51.235868 BGP: x.x.x.17 (0x150ff92c81d0 -1) went from Established to Clearing
    frr.log.3:2024/01/08 07:18:51.235879 BGP: Peer x.x.x.17 fd -1 send BGP_DOWN message to BGP adapter
    frr.log.3:2024/01/08 07:18:51.235903 BGP: BGP Adapter: Send BGP_DOWN for peer x.x.x.17 (vrf: default)
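To put a number on the silence that preceded the expiry, the last logged message received from the peer can be compared with the `Hold_Timer_expired` event. The Python sketch below assumes the frr.log line layout shown above; `silence_before_hold_expiry` is a hypothetical helper. Keepalives may not be written to the log at this logging level, so the result is best read as an upper bound on the real silence, not as the negotiated hold time.

```python
import re
from datetime import datetime

# Timestamp and message layout taken from the frr.log excerpt above.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) BGP: (?P<msg>.*)$"
)


def silence_before_hold_expiry(lines, peer):
    """Return (last_rcvd, expiry, gap_seconds) for the first hold-timer expiry.

    last_rcvd is the last logged 'rcvd' message from the peer before the expiry.
    Returns None when no expiry for that peer is found.
    """
    last_rcvd = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match or peer not in match["msg"]:
            continue
        ts = datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S.%f")
        if " rcvd " in match["msg"]:
            last_rcvd = ts
        elif "Hold_Timer_expired" in match["msg"]:
            gap = (ts - last_rcvd).total_seconds() if last_rcvd else None
            return last_rcvd, ts, gap
    return None


sample_lines = [
    "frr.log.3:2024/01/08 07:14:36.579628 BGP: x.x.x.17 rcvd UPDATE wlen 5 attrlen 0 alen 0",
    "frr.log.3:2024/01/08 07:18:51.235534 BGP: x.x.x.17 [FSM] Timer (holdtime timer expire)",
    "frr.log.3:2024/01/08 07:18:51.235594 BGP: x.x.x.17 [FSM] Hold_Timer_expired (Established->Clearing), fd 30",
]

print(silence_before_hold_expiry(sample_lines, "x.x.x.17"))
```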
4. From the BGP side, the cause of the log entries in step 3 cannot be determined at this point, so the following data must be collected if the problem is reproduced
4-1. Check the BGP state on the Edge
    Log in to the Edge Node with an admin session, move to the Tier-0 SR VRF, and check the output of the following command
    > get bgp neighbor summary

4-2. Run a Ping test in both directions between the upstream switch and the Edge SR Interface, using the BGP IP Addresses assigned to them

4-3. Capture packets on the Edge SR Interface
    > get logical-routers
    > get logical-router <Tier-0 SR UUID> interfaces
    > start capture interface <uplink interface UUID>    ### live view
    or
    > start capture interface <uplink interface UUID> file <filename.pcap>    ### save to a file

4-4. Capture packets on the Uplink Interface of the ESXi Hosts where the VM Edge runs (both Source and Destination)
    # esxcli network nic list
    # pktcap-uw --uplink <uplink interface> --capture UplinkSndKernel,UplinkRcvKernel -o - | tcpdump-uw -r - -nne    ### live view
    or
    # pktcap-uw --uplink <uplink interface> --capture UplinkSndKernel,UplinkRcvKernel -o /tmp/vmnic#.pcap    ### save to a file

4-5. Capture packets on the upstream switch Interface
    This capture is requested so that the BGP packets exchanged between the two ends can be checked.

4-6. Check the BGP Debug Log
    Log in to the Edge Node with an admin session and move to the Tier-0 SR VRF, then run
    > set debug
    > set routing debug bgp all
    > get routing debug bgp
    After the problem has been reproduced, turn debug mode off with the following command
    > clear routing debug bgp all

4-7. Check the VM Edge BGP log for the period in which the problem was reproduced
    /var/log/frr/frr.log
5. Nothing more can be checked from the NSX side, so additionally check the hypervisor logs for the vMotion process
6. Check the MAC addresses of the Edge VM
    $ grep ethernet edge01.vmx
    ethernet0.generatedAddress = "xx:xx:xx:xx:xx:5a"    ### <-- !!
    ethernet1.generatedAddress = "xx:xx:xx:xx:xx:07"    ### <-- !!
    ethernet2.generatedAddress = "xx:xx:xx:xx:xx:5e"    ### <-- !!
    ethernet3.generatedAddress = "xx:xx:xx:xx:xx:a9"    ### <-- !!
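The same lookup can be scripted when several Edge VMs have to be checked. A minimal Python sketch follows; `vmx_macs` is a hypothetical helper, and it only reads `generatedAddress` entries like the ones above, so a vNIC with a manually assigned MAC would not appear in the result.

```python
import re

# Matches e.g.: ethernet3.generatedAddress = "xx:xx:xx:xx:xx:a9"
# It does not match ethernetN.generatedAddressOffset, because "=" must follow the key.
VMX_MAC_RE = re.compile(
    r'^(ethernet\d+)\.generatedAddress\s*=\s*"([^"]+)"', re.MULTILINE
)


def vmx_macs(vmx_text):
    """Map each vNIC name (ethernet0, ethernet1, ...) to its generated MAC address."""
    return dict(VMX_MAC_RE.findall(vmx_text))


sample_vmx = """\
ethernet0.generatedAddress = "xx:xx:xx:xx:xx:5a"
ethernet0.generatedAddressOffset = "0"
ethernet1.generatedAddress = "xx:xx:xx:xx:xx:07"
ethernet2.generatedAddress = "xx:xx:xx:xx:xx:5e"
ethernet3.generatedAddress = "xx:xx:xx:xx:xx:a9"
"""

for nic, mac in vmx_macs(sample_vmx).items():
    print(nic, mac)
```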
7. Check the vMotion process on the hypervisor
7-1. First vMotion, after which there was no problem
hostd.log
    2024-01-08T05:11:58.373Z info hostd[2101840] [Originator@6876 sub=Vcsvc.VMotion opID=lr488fve-41800-auto-w95-h5:70004509-68-01-ee-0fd5 user=vpxuser:VSPHERE.LOCAL\Administrator] InitiateDestination [54608463442007720], VM = '/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx'
    2024-01-08T05:12:13.170Z info hostd[2101473] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx opID=lr488fve-41800-auto-w95-h5:70004509-68-01-ee-0fd5] VMotion cleanup completed

vmware.log
    2024-01-08T05:11:58.539Z In(05) vmx - Log for VMware ESX pid=2102918 version=7.0.3 build=build-22348816 option=Release
    2024-01-08T05:11:58.587Z In(05) vmx - MigrateSetState: Transitioning from state 0 to 8.
    2024-01-08T05:11:58.669Z In(05) vmx - MigrateSetState: Transitioning from state 8 to 9.
    2024-01-08T05:11:58.717Z In(05) vmx - MigrateSetState: Transitioning from state 9 to 10.
    2024-01-08T05:12:12.858Z In(05) vmx - MigrateSetState: Transitioning from state 10 to 11.
    2024-01-08T05:12:13.087Z In(05) vmx - MigrateSetState: Transitioning from state 11 to 12.
    2024-01-08T05:12:13.104Z In(05) vcpu-0 - MigrateSetState: Transitioning from state 12 to 0.

vmkernel.log
    2024-01-08T05:12:12.903Z cpu69:2103082)Net: 2184: connected edge01.eth3 eth3 to vDS, portID 0x800002e
    2024-01-08T05:12:12.903Z cpu69:2103082)Net: 2184: connected edge01.eth2 eth2 to vDS, portID 0x800002f
    2024-01-08T05:12:12.903Z cpu69:2103082)Net: 2184: connected edge01.eth1 eth1 to vDS, portID 0x8000030
    2024-01-08T05:12:12.904Z cpu69:2103082)Net: 2184: connected edge01 eth0 to VM Network, portID 0xa000031
    2024-01-08T05:12:12.908Z cpu69:2103082)NetPort: 1543: enabled port 0xa000031 with mac xx:xx:xx:xx:xx:5a
    2024-01-08T05:12:12.977Z cpu69:2103082)NetPort: 1543: enabled port 0x8000030 with mac xx:xx:xx:xx:xx:07
    2024-01-08T05:12:12.978Z cpu69:2103082)NetPort: 1543: enabled port 0x800002f with mac xx:xx:xx:xx:xx:5e
    2024-01-08T05:12:12.979Z cpu69:2103082)NetPort: 1543: enabled port 0x800002e with mac xx:xx:xx:xx:xx:a9
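7-2. Second vMotion, after which the problem occurred
Unlike the first vMotion, the second one has no Port Enable log entry for MAC xx:xx:xx:xx:xx:a9, which belongs to Ethernet3 of the Edge.
At the point where that entry should have been written, the message "p2m update: cannot reserve" is logged instead.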
hostd.log
    2024-01-08T07:18:34.381Z info hostd[2101823] [Originator@6876 sub=Vcsvc.VMotion opID=lr488fve-60018-auto-1ab7-h5:70007169-e4-01-7b-f5a5 user=vpxuser:VSPHERE.LOCAL\Administrator] InitiateDestination [54608471037763493], VM = '/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx'
    2024-01-08T07:18:45.997Z info hostd[2101708] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/609be506-425ea730-9660-40a6b74b18a4/edge01/edge01.vmx opID=lr488fve-60018-auto-1ab7-h5:70007169-e4-01-7b-f5a5] VMotion cleanup completed

vmware.log
    2024-01-08T07:18:34.538Z In(05) vmx - Log for VMware ESX pid=2106732 version=7.0.3 build=build-22348816 option=Release
    2024-01-08T07:18:34.579Z In(05) vmx - MigrateSetState: Transitioning from state 0 to 8.
    2024-01-08T07:18:34.656Z In(05) vmx - MigrateSetState: Transitioning from state 8 to 9.
    2024-01-08T07:18:34.704Z In(05) vmx - MigrateSetState: Transitioning from state 9 to 10.
    2024-01-08T07:18:45.693Z In(05) vmx - MigrateSetState: Transitioning from state 10 to 11.
    2024-01-08T07:18:45.925Z In(05) vmx - MigrateSetState: Transitioning from state 11 to 12.
    2024-01-08T07:18:45.939Z In(05) vcpu-0 - MigrateSetState: Transitioning from state 12 to 0.

    2024-01-08T07:18:45.745Z In(05) vcpu-0 - VMXNET3 user: Ethernet0 Driver Info: version = 17170432 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
    2024-01-08T07:18:45.784Z In(05) vcpu-0 - VMXNET3 user: Ethernet1 Driver Info: version = 16850944 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
    2024-01-08T07:18:45.785Z In(05) vcpu-0 - VMXNET3 user: Ethernet2 Driver Info: version = 16850944 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
    2024-01-08T07:18:45.785Z In(05) vcpu-0 - VMXNET3 user: failed to activate 'Ethernet3', status: 0xbad0001

vmkernel.log
    2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth3 eth3 to vDS, portID 0x8000046
    2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth2 eth2 to vDS, portID 0x8000047
    2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth1 eth1 to vDS, portID 0x8000048
    2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01 eth0 to VM Network, portID 0xa000049
    2024-01-08T07:18:45.745Z cpu88:2106895)NetPort: 1543: enabled port 0xa000049 with mac xx:xx:xx:xx:xx:5a
    2024-01-08T07:18:45.784Z cpu88:2106895)NetPort: 1543: enabled port 0x8000048 with mac xx:xx:xx:xx:xx:07
    2024-01-08T07:18:45.785Z cpu88:2106895)NetPort: 1543: enabled port 0x8000047 with mac xx:xx:xx:xx:xx:5e
    2024-01-08T07:18:45.785Z cpu88:2106895)VmMemCow: 1772: p2m update: cannot reserve - cur 0 0 rsvd 1216 req 65 avail 1279    ### <-- !!
    2024-01-08T07:18:45.785Z cpu88:2106895)Vmxnet3: 11097: Failed to map the tx data ring for tq 0
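The comparison between the two vMotions comes down to one check: every port that is logged as "connected" should later be logged as "enabled port". A Python sketch of that check is below. `ports_never_enabled` is a hypothetical helper, and the regular expressions assume the vmkernel.log wording seen in this case; other ESXi builds may log these events differently.

```python
import re

# Wording taken from the vmkernel.log lines in this case.
CONNECTED_RE = re.compile(
    r"connected \S+ (?P<nic>eth\d+) to .*?, portID (?P<port>0x[0-9a-fA-F]+)"
)
ENABLED_RE = re.compile(r"enabled port (?P<port>0x[0-9a-fA-F]+) with mac (?P<mac>\S+)")


def ports_never_enabled(lines):
    """Return {portID: nic} for ports that were connected but never enabled."""
    connected = {}
    enabled = set()
    for line in lines:
        match = CONNECTED_RE.search(line)
        if match:
            connected[match["port"]] = match["nic"]
            continue
        match = ENABLED_RE.search(line)
        if match:
            enabled.add(match["port"])
    return {port: nic for port, nic in connected.items() if port not in enabled}


second_vmotion = [
    "2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth3 eth3 to vDS, portID 0x8000046",
    "2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth2 eth2 to vDS, portID 0x8000047",
    "2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01.eth1 eth1 to vDS, portID 0x8000048",
    "2024-01-08T07:18:45.742Z cpu88:2106895)Net: 2184: connected edge01 eth0 to VM Network, portID 0xa000049",
    "2024-01-08T07:18:45.745Z cpu88:2106895)NetPort: 1543: enabled port 0xa000049 with mac xx:xx:xx:xx:xx:5a",
    "2024-01-08T07:18:45.784Z cpu88:2106895)NetPort: 1543: enabled port 0x8000048 with mac xx:xx:xx:xx:xx:07",
    "2024-01-08T07:18:45.785Z cpu88:2106895)NetPort: 1543: enabled port 0x8000047 with mac xx:xx:xx:xx:xx:5e",
    "2024-01-08T07:18:45.785Z cpu88:2106895)VmMemCow: 1772: p2m update: cannot reserve - cur 0 0 rsvd 1216 req 65 avail 1279",
]

print(ports_never_enabled(second_vmotion))
```

For the second vMotion this leaves only the port that eth3 was connected to, which matches the missing Port Enable entry for xx:xx:xx:xx:xx:a9.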
8. Check what Ethernet3, the vNIC with MAC Address xx:xx:xx:xx:xx:a9, is used for
/edge/tier0_sr_ip_addr
The IP Address bound to the Network Interface with the affected MAC Address is x.x.x.18/29.
This is the BGP Local IP Address seen earlier for the session that went DOWN.
    53: uplink-283@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether xx:xx:xx:xx:xx:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 0 maxmtu 65535
        vlan protocol 802.1Q id 2 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
        inet x.x.x.18/29 brd x.x.x.23 scope global uplink-283
           valid_lft forever preferred_lft forever
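Matching a MAC address to its interface and IPv4 address can also be done mechanically. The following Python sketch assumes `ip addr`-style output like the entry above; `mac_to_ipv4` is a hypothetical helper and works whether the entry is on one line or several.

```python
import re

# Start of an interface entry, e.g. "53: uplink-283@if44: <BROADCAST,..."
STANZA_RE = re.compile(r"\d+: (?P<ifname>[^\s:@]+)(?:@\S+)?: <")


def mac_to_ipv4(ip_addr_text):
    """Map each MAC address to its interface name and list of IPv4 addresses."""
    result = {}
    starts = list(STANZA_RE.finditer(ip_addr_text))
    for index, start in enumerate(starts):
        # One stanza runs from this interface header to the next one.
        end = starts[index + 1].start() if index + 1 < len(starts) else len(ip_addr_text)
        stanza = ip_addr_text[start.start():end]
        mac = re.search(r"link/ether (\S+)", stanza)
        if mac:
            result[mac.group(1)] = {
                "ifname": start["ifname"],
                "ipv4": re.findall(r"\binet (\S+)", stanza),
            }
    return result


sample = (
    "53: uplink-283@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue "
    "state UP group default qlen 1000 link/ether xx:xx:xx:xx:xx:a9 brd ff:ff:ff:ff:ff:ff "
    "link-netnsid 0 promiscuity 0 minmtu 0 maxmtu 65535 vlan protocol 802.1Q id 2 "
    "<REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 "
    "inet x.x.x.18/29 brd x.x.x.23 scope global uplink-283 "
    "valid_lft forever preferred_lft forever"
)

print(mac_to_ipv4(sample))
```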
[Conclusion]
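1. During the vMotion, the Destination Host was connecting the four Ethernet adapters assigned to the Edge VM, and Port Enable failed for one of them (Ethernet3, the uplink that carries the BGP session) because of the p2m buffer.
2. To avoid the symptom, the p2m buffer, which was at its Default Size, needs to be increased to the maximum size.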
[References]
Configuring P2M Buffer size for virtual machines. (76387)
https://kb.vmware.com/s/article/76387
Note: NSX Edge VM's can be affected by this issue