Packets keep going through load balancer to downed member server

 

 

This post shares a case I worked on: when using the NSX Load Balancer, packets from a client kept arriving at a Member Server even after that server's status had changed to Down.

 

[Symptom]

The customer was using an HTTP Active Monitor. To test it, they changed the Web Server at a certain point so that it returned 404 instead of 200.

The Member Server that returned 404 was correctly marked Down, but the Web Server's access log showed that HTTP POST requests from the client kept arriving.

 

[Environment]

Client IP : xxx.xxx.xxx.94
VIP : xxx.xxx.xxx.59
Back-end Server#1 : xxx.xxx.xxx.59
Back-end Server#2 : xxx.xxx.xxx.60

 

[Troubleshooting Notes]

1. Check the NSX version

$ cat ./edge/nsx-agent-state |grep version
        "config_version": {
            "dp_config_version": 7200236,
            "dp_config_version_acked": 7200235
        "version": "3.2.1.1.0.20115712" >>>
        "config_version": "1",



2. Check the Load Balancer configuration
The Load Balancer is deployed inline (no Service Interface) and uses an HTTP-based Active Monitor to detect Member Server status.

./edge/lb-status
 
    {
        "lbs": [
            {
                "cpu_usage": "0",
                "display_name": "xxx-LB", >>>
                "enabled": true,

                "pools": [
     
                    {
                        "backup_disabled": "0",
                        "backup_down": "0",
                        "backup_graceful_disabled": "0",
                        "backup_unknown": "0",
                        "backup_unused": "0",
                        "backup_up": "0",
                        "display_name": "xxx.59:58080", >>>
                        "member_num": "2",
                        "members": [
                            {
                                "display_name": "xxx02", >>>
                                "ip": "xxx.xxx.xxx.60", >>>
                                "last_check_time": "1709168965700",
                                "last_state_change_time": "1709107134410",
                                "monitors": [
                                    {
                                        "display_name": "http-58080-lb-monitor-1",
                                        "id": "345c6221-3746-4a19-8c80-cea00319fea9",
                                        "last_check_time": "1709168965700",
                                        "last_state_change_time": "1709107134410",
                                        "status": "up",
                                        "type": "HTTP",
                                        "url": "/healthCheck" >>>
                                    }
                                ],
                                "port": "58080", >>>
                                "status": "up", >>>
                                "type": "primary"
                            },
                            {
                                "display_name": "xxx01", >>>
                                "ip": "xxx.xxx.xxx.59", >>>
                                "last_check_time": "1709168965700",
                                "last_state_change_time": "1709106954383",
                                "monitors": [
                                    {
                                        "display_name": "http-58080-lb-monitor-1",
                                        "id": "345c6221-3746-4a19-8c80-cea00319fea9",
                                        "last_check_time": "1709168965700",
                                        "last_state_change_time": "1709106954383",
                                        "status": "up",
                                        "type": "HTTP",
                                        "url": "/healthCheck" >>>
                                    }
                                ],
                                "port": "58080", >>>
                                "status": "up", >>>
                                "type": "primary"
                            }
                        ],
                        "primary_disabled": "0",
                        "primary_down": "0",
                        "primary_graceful_disabled": "0",
                        "primary_unknown": "0",
                        "primary_unused": "0",
                        "primary_up": "2",
                        "status": "up",
                        "type": "l4",
                        "uuid": "65eb7980-0834-4ead-987a-dc430bd6f4df",
                        "vss": {
                            "e206e5d3-d8d8-4437-be83-21966789eb86": "xxx.59:58080"
                        }
                    },
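When a pool has many members, reading the status out of this dump by eye gets tedious. The following is a minimal sketch that flattens the pool/member structure into one row per member. It assumes the file parses as JSON into an object with an "lbs" key shaped like the excerpt above; the function name member_states and the trimmed SAMPLE are illustrative, not part of any NSX tooling.

```python
def member_states(lb_status):
    """Flatten an lb-status style dict into (pool, member, ip, port, status) rows."""
    rows = []
    for lb in lb_status.get("lbs", []):
        for pool in lb.get("pools", []):
            for member in pool.get("members", []):
                rows.append((
                    pool.get("display_name"),
                    member.get("display_name"),
                    member.get("ip"),
                    member.get("port"),
                    member.get("status"),
                ))
    return rows

# Trimmed copy of the excerpt above; only the fields used here are kept.
SAMPLE = {
    "lbs": [{
        "display_name": "xxx-LB",
        "pools": [{
            "display_name": "xxx.59:58080",
            "members": [
                {"display_name": "xxx02", "ip": "xxx.xxx.xxx.60",
                 "port": "58080", "status": "up"},
                {"display_name": "xxx01", "ip": "xxx.xxx.xxx.59",
                 "port": "58080", "status": "up"},
            ],
        }],
    }]
}

rows = member_states(SAMPLE)
for row in rows:
    print(*row)
```

To run it against a real bundle, load the file with json.load and pass the result in, adjusting the top-level access if the file wraps this object in a list.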

./edge/lb-monitor
 
        {
            "id": {
                "left": 3772998482630429209,
                "right": 10124319148970999465
            },
            "display_name": "http-58080-lb-monitor-1",
            "type": "HTTP",
            "interval": 5, >>>
            "timeout": 5, >>>
            "rise_count": 3, >>>
            "fall_count": 3, >>>
            "monitor_port": "58080",
            "http_monitor": {
                "request_method": "HTTP_METHOD_GET", >>>
                "request_url": "/healthCheck", >>>
                "request_version": "HTTP_VERSION_1_1",
                "response_code": [
                    "200" >>>
                ]
            }
        },
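As a rough aid for reading these monitor values, the sketch below estimates how long a failing member takes to be marked Down. It is a simplified model and not NSX's documented algorithm: it assumes probes fire at a fixed interval and that a member goes Down after fall_count consecutive failures. The function name detection_window is hypothetical.

```python
def detection_window(interval, fall_count, timeout):
    """Return (best_case, worst_case) seconds from failure onset to Down."""
    # Best case: the first failing probe goes out right after the member
    # breaks and is rejected immediately, so only the gaps between the
    # fall_count probes are counted.
    best = (fall_count - 1) * interval
    # Worst case: the member breaks right after a probe, so a full interval
    # passes before the first failing probe, and the last probe fails only
    # when its timeout expires.
    worst = fall_count * interval + timeout
    return best, worst

best, worst = detection_window(interval=5, fall_count=3, timeout=5)
print(best, worst)
```

Under these assumptions the values above (interval 5, timeout 5, fall_count 3) give a window of roughly 10 to 20 seconds during which a broken member can still receive new connections.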

./edge/logical-routers
 
    {
        "uuid": "ef2d95b2-873b-4cb4-9a9c-723d97f81ae5", >>>
        "mp_router_id": "d0d011aa-c9a9-4f1f-bf00-9b37657fd02a",
        "name": "SR-xxx-T1-GW", >>>

        "ports": [
 
                {
                    "ifuuid": "7492753d-5b05-4931-9acc-2f1692efeb2d",
                    "ifuid": 331,
                    "type": "loopback", >>>
                    "ptype": "loopback",
                    "lrouter": "ef2d95b2-873b-4cb4-9a9c-723d97f81ae5",
                    "ipns": [  
                        "xxx.xxx.xxx.59/32", >>> VIP

        {
            "uuid": "d0d011aa-c9a9-4f1f-bf00-9b37657fd02a",
            "mp_router_id": "d0d011aa-c9a9-4f1f-bf00-9b37657fd02a",
            "name": "DR-xxx-T1-GW", >>>

                {
                    "ifuuid": "3d43b3ed-2a6a-428c-b3e5-246dbfc7580d",
                    "ifuid": 303,
                    "type": "lif",
                    "ptype": "downlink",

                    "overlay_vni": 65551,
                    "ipns": [
                        "xxx.xxx.xxx.1/24" >>>
                ],

 

3. Reproduce a test similar to the customer's in the lab
3-1. Under normal conditions, confirm the TCP 3-way handshake packets generated by the Active Monitor

 

3-2. Change the Active Monitor port from 80 to 8080 so that the health check fails on purpose

 

3-3. The Member Server status changes to Down

 

3-4. Even after the Member Server is Down, packets from the Active Monitor to port 8080 keep arriving

[root@localhost ~]# tcpdump -i ens192 port 8080
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
05:55:37.592062 IP 100.64.120.1.sslp > localhost.localdomain.webcache: Flags [S], seq 291446904, win 64240, options [mss 1460,sackOK,TS val 1693911096 ecr 0,nop,wscale 8], length 0
05:55:37.592126 IP localhost.localdomain.webcache > 100.64.120.1.sslp: Flags [R.], seq 0, ack 291446905, win 0, length 0
05:55:42.591438 IP 100.64.120.1.www-ldap-gw > localhost.localdomain.webcache: Flags [S], seq 666982271, win 64240, options [mss 1460,sackOK,TS val 1693916097 ecr 0,nop,wscale 8], length 0
05:55:42.591602 IP localhost.localdomain.webcache > 100.64.120.1.www-ldap-gw: Flags [R.], seq 0, ack 666982272, win 0, length 0
05:55:47.592814 IP 100.64.120.1.quicksuite > localhost.localdomain.webcache: Flags [S], seq 1548529182, win 64240, options [mss 1460,sackOK,TS val 1693921098 ecr 0,nop,wscale 8], length 0
05:55:47.592969 IP localhost.localdomain.webcache > 100.64.120.1.quicksuite: Flags [R.], seq 0, ack 1548529183, win 0, length 0
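The capture shows the monitor still probing on a fixed cadence while the member is Down, with each SYN answered by a reset ([R.]). A small sketch to confirm the cadence from the tcpdump timestamps follows. The lines in CAPTURE are shortened copies of the output above, and syn_intervals is a hypothetical helper name.

```python
from datetime import datetime

# Shortened copies of the tcpdump lines above (TCP options trimmed).
CAPTURE = [
    "05:55:37.592062 IP 100.64.120.1.sslp > localhost.localdomain.webcache: Flags [S], seq 291446904",
    "05:55:37.592126 IP localhost.localdomain.webcache > 100.64.120.1.sslp: Flags [R.], seq 0",
    "05:55:42.591438 IP 100.64.120.1.www-ldap-gw > localhost.localdomain.webcache: Flags [S], seq 666982271",
    "05:55:42.591602 IP localhost.localdomain.webcache > 100.64.120.1.www-ldap-gw: Flags [R.], seq 0",
    "05:55:47.592814 IP 100.64.120.1.quicksuite > localhost.localdomain.webcache: Flags [S], seq 1548529182",
    "05:55:47.592969 IP localhost.localdomain.webcache > 100.64.120.1.quicksuite: Flags [R.], seq 0",
]

def syn_intervals(lines):
    """Return the gaps in seconds between consecutive SYN packets."""
    syn_times = [
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in lines
        if "Flags [S]" in line  # pure SYN only; "[S.]" and "[R.]" do not match
    ]
    return [(b - a).total_seconds() for a, b in zip(syn_times, syn_times[1:])]

intervals = syn_intervals(CAPTURE)
print(intervals)
```

The gaps come out at about 5 seconds each, matching a monitor that keeps checking a Down member so it can bring it back Up once checks succeed again.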

 

4. The customer's actual issue was not the packets from the Active Monitor, but that HTTP POST packets from the client kept arriving even after the Member Server had transitioned to Down
Packet analysis on the Back-end Server
The Member Server status changed to Down after exactly 3 failures, as defined in the Active Monitor

9   2024-03-04 05:14:37.080901  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
>> A health check packet is sent every 5 seconds
358 2024-03-04 05:14:42.081178  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
 
4869    2024-03-04 05:16:22.097673  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
4871    2024-03-04 05:16:22.098767  xxx.xxx.xxx.59 100.64.32.1 HTTP    429 HTTP/1.1 200   (text/plain)
 
>> From the health check below, the server returns 404
5144    2024-03-04 05:16:27.097617  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
5156    2024-03-04 05:16:27.107894  xxx.xxx.xxx.59 100.64.32.1 HTTP    71  HTTP/1.1 404 
 
5409    2024-03-04 05:16:32.099453  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
5413    2024-03-04 05:16:32.100708  xxx.xxx.xxx.59 100.64.32.1 HTTP    71  HTTP/1.1 404 
 
5599    2024-03-04 05:16:37.101238  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
5603    2024-03-04 05:16:37.109112  xxx.xxx.xxx.59 100.64.32.1 HTTP    71  HTTP/1.1 404 
 
>> Health check packets continue to be sent even after the member has been recognized as Down
 
5831    2024-03-04 05:16:42.102944  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
5835    2024-03-04 05:16:42.104640  xxx.xxx.xxx.59 100.64.32.1 HTTP    71  HTTP/1.1 404 
 
6037    2024-03-04 05:16:47.104486  100.64.32.1 xxx.xxx.xxx.59 HTTP    120 GET /healthCheck HTTP/1.1
6041    2024-03-04 05:16:47.113108  xxx.xxx.xxx.59 100.64.32.1 HTTP    71  HTTP/1.1 404 
 
./var/log/syslog.1  

>> As defined in the Active Monitor, the member changes to down after exactly 3 failed checks
 
2024-03-04T05:16:37.930Z edge27.homeplusnet.co.kr NSX 13787 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="WARN"] [999a00d8-941f-42d7-b8d9-06f9248c9e7f] upstream LB65eb7980-0834-4ead-987a-dc430bd6f4df peer xxx.xxx.xxx.59:58080 change to down

2024-03-04T05:16:37.966Z edge27.homeplusnet.co.kr NSX 13787 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="WARN"] [999a00d8-941f-42d7-b8d9-06f9248c9e7f] HLCK: monitor 345c6221-3746-4a19-8c80-cea00319fea9 server: xxx.xxx.xxx.59:58080 change to down, code: 15)
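The timing in the capture and the syslog can be cross-checked by replaying the observed health check responses through a simple rise/fall counter. This is a sketch of the general counting technique using the monitor's configured values (rise_count 3, fall_count 3, expected code 200); it is not the NSX implementation, and replay is a hypothetical function name. Timestamps are truncated to the second.

```python
def replay(results, rise_count=3, fall_count=3, expected=(200,), initial="up"):
    """Replay (timestamp, status_code) pairs and return the state transitions."""
    state, fails, oks = initial, 0, 0
    transitions = []
    for ts, code in results:
        if code in expected:
            oks, fails = oks + 1, 0
            if state == "down" and oks >= rise_count:
                state = "up"
                transitions.append((ts, state))
        else:
            fails, oks = fails + 1, 0
            if state == "up" and fails >= fall_count:
                state = "down"
                transitions.append((ts, state))
    return transitions

# Response codes taken from the capture above.
OBSERVED = [
    ("05:16:22", 200),
    ("05:16:27", 404),
    ("05:16:32", 404),
    ("05:16:37", 404),
    ("05:16:42", 404),
    ("05:16:47", 404),
]

transitions = replay(OBSERVED)
print(transitions)
```

The single transition lands on the third consecutive 404 at 05:16:37, the same second as the "change to down" entries in the syslog.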

 

5. The Back-end Server's Web Server access log shows POST calls still arriving from the client IP xxx.xxx.xxx.94 after 05:16:37, the moment the Member Server changed to Down

 

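6. On investigation, the client packets that kept arriving after the Member Server went Down belonged to a flow that had completed its TCP 3-way handshake earlier and kept sending HTTP requests over that same connection

In other words, from TCP's point of view the source IP/destination IP and source port/destination port had not changed

Traffic on the same existing TCP stream (tcp.stream == 154) can keep arriving even after the Member Server goes Down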

tcp.stream == 154
3167    2024-03-04 05:15:47.582910    xxx.xxx.xxx.94    xxx.xxx.xxx.59    HTTP    990    POST /api/extern/online/get-point/card HTTP/1.1  (application/x-www-form-urlencoded)
...
9855    2024-03-04 05:17:29.883576    xxx.xxx.xxx.59    xxx.xxx.xxx.94    HTTP/JSON    59    HTTP/1.1 200  , JSON (application/json)
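To make the overlap explicit, the sketch below compares the two packet timestamps shown for tcp.stream 154 with the Down time from the syslog (05:16:37.930). It assumes the capture and the syslog share the same clock and time zone, which the matching health check times suggest but the source does not state. The helper name spans_event is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Packets shown for tcp.stream == 154 and the Down time from syslog.
stream_first_seen = datetime.strptime("2024-03-04 05:15:47.582910", FMT)
stream_last_seen = datetime.strptime("2024-03-04 05:17:29.883576", FMT)
member_down_at = datetime.strptime("2024-03-04 05:16:37.930000", FMT)

def spans_event(first_seen, last_seen, event_time):
    """True if a stream was already active before the event and still active after it."""
    return first_seen < event_time <= last_seen

still_active = spans_event(stream_first_seen, stream_last_seen, member_down_at)
seconds_after_down = (stream_last_seen - member_down_at).total_seconds()
print(still_active, round(seconds_after_down, 1))
```

The stream was already carrying requests before the member went Down and was still being answered about 52 seconds afterwards.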

 

7. The conclusion: when a Member Server goes Down, the LB does not propagate that state to the datapath, so existing connections can continue to reach the Member Server

By design, if the client had opened a new connection to the LB, traffic would not have been forwarded to the Down Member Server
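The behavior described above can be illustrated with a toy model: member selection happens once per new connection, and an established flow stays pinned to its member regardless of later health changes. This is only a sketch of the concept and not how NSX is implemented. The class TinyLB, the member names, and the client ports are all hypothetical.

```python
class TinyLB:
    """Toy L4 load balancer: health state only affects NEW connections."""

    def __init__(self, members):
        self.status = {m: "up" for m in members}
        self.flows = {}   # 5-tuple -> member chosen at connection setup
        self._next = 0    # round-robin counter

    def set_status(self, member, status):
        # Only the selection table changes; existing flow entries are untouched.
        self.status[member] = status

    def forward(self, flow):
        # Established flow: keep using the member chosen earlier.
        if flow in self.flows:
            return self.flows[flow]
        # New flow: choose among members that are currently up.
        up = [m for m, s in self.status.items() if s == "up"]
        if not up:
            raise RuntimeError("no member available")
        member = up[self._next % len(up)]
        self._next += 1
        self.flows[flow] = member
        return member

lb = TinyLB(["server1", "server2"])
old_flow = ("client", 40000, "vip", 58080, "tcp")   # opened before the failure
before = lb.forward(old_flow)

lb.set_status("server1", "down")
after = lb.forward(old_flow)                         # same 5-tuple, same member
new_flow = ("client", 40001, "vip", 58080, "tcp")   # opened after the failure
fresh = lb.forward(new_flow)
print(before, after, fresh)
```

In this model the long-lived connection keeps landing on server1 after it is marked down, while the new connection goes to server2. For clients that reuse HTTP keep-alive connections, this means traffic only moves away from a Down member once those connections are closed and reopened.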