본문 바로가기

Compute

vCenter failover happened due to time sync

NTP Server와의 시간 동기화 이슈로 인해서, vCenter의 HA가 동작하는 케이스를 소개합니다.

## vCenter 7.0

 

[문제 증상]

vCenter가 예기치 않게 Failover 되었으며, Failover 된 후로 Passive Node에서 정상적으로 서비스 기동되었습니다.

 

[분석 내용]

1. 우선, HA가 동작했기 때문에 vcha.log 부터 확인해야 합니다.

먼저, Failover 전 Passive Node였던 서버에서 vcha.log를 확인합니다.

로그를 보면 실제로 Master Node와의 연결에 문제가 생긴 것을 감지하고 이 후에 자기 자신을 Master로 선출하면서 작업이 완료되는 것을 확인할 수 있습니다.

2022-12-30T00:40:48.387Z info vcha[01854] [Originator@6876 sub=Election opID=SWI-1] [ClusterElection::ChangeState] Slave => Startup : Lost master

2022-12-30T00:42:43.505Z info vcha[01886] [Originator@6876 sub=Election] MasterShutdown

2022-12-30T00:43:14.507Z info vcha[01854] [Originator@6876 sub=Election opID=SWI-1] [ClusterElection::ChangeState] Candidate => Master : Master selected


2022-12-30T00:43:14.507Z info vcha[01864] [Originator@6876 sub=Cluster opID=SWI-41a7] Am now master

2022-12-30T00:43:15.508Z info vcha[01891] [Originator@6876 sub=Agent] Attempting to takeover as PRIMARY

 

그 다음으로는 Failover 전 Active Node였던 서버에서 vcha.log를 확인합니다.

로그를 확인해보면, vMon 프로세스가 Failover 작업을 시도한 것으로 확인됩니다.

2022-12-30T00:42:17.062Z info vcha[45553] [Originator@6876 sub=Agent] Triggered vMon initiated failover
2022-12-30T00:42:17.062Z info vcha[01700] [Originator@6876 sub=Agent] Processing event kVmonFailover
2022-12-30T00:42:17.062Z info vcha[01700] [Originator@6876 sub=Agent] vMon initiated failover

2022-12-30T00:43:13.924Z info vcha[01700] [Originator@6876 sub=Agent] Commit planned failover

2022-12-30T00:43:13.926Z info vcha[01700] [Originator@6876 sub=Cluster] Set Takeover Req
2022-12-30T00:43:13.926Z verbose vcha[01700] [Originator@6876 sub=Cluster] Setting Key = /vcha/cluster/TAKEOVER_REQ Value = planned

 

2. vMon이 Failover를 Trigger 했기 때문에 다음으로는 vMon 로그를 확인해야 합니다.

vMon 로그를 보면, Failover가 발생한 원인으로 pschealth 서비스에 문제가 있었기 때문인 것을 알 수 있습니다.

2022-12-30T00:42:17.062350+00:00 warning vmon  <pschealth> Health of service failed. Health data: 2022-12-30T00:42:17.022Z  Running command: ['/usr/lib/vmware-vmafd/bin/vmafd-cli', 'get-heartbeat-status']
2022-12-30T00:42:17.062613+00:00 warning vmon
2022-12-30T00:42:17.062751+00:00 notice vmon  <pschealth> Recover from service api health check failure. Fail count 0
2022-12-30T00:42:17.062790+00:00 notice vmon  SOCKET creating new socket, connecting to /storage/vmware-vmon/vchalistener
2022-12-30T00:42:17.062822+00:00 notice vmon  <pschealth> Initiated VCHA failover for service.
2022-12-30T00:42:17.062849+00:00 notice vmon  <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata pschealth,UNHEALTHY,HEALTHY,1
2022-12-30T00:42:17.062881+00:00 notice vmon  Sending HA Node Failover Request
2022-12-30T00:42:17.062992+00:00 notice vmon  SOCKET 700 (9)
2022-12-30T00:42:17.063027+00:00 notice vmon  recv detected client closed connection
2022-12-30T00:42:17.063059+00:00 err vmon  Error Callback fired with error Remote disconnected
2022-12-30T00:42:17.063235+00:00 err vmon  Sending failure reply for HA Node Failover request
2022-12-30T00:42:17.063366+00:00 notice vmon  VCHA failover call done. Success : 0

 

3. pschealth 서비스에 문제가 있던 것으로 확인되었기 때문에, 해당 시점에 pschealth 서비스 관련 내용을 확인해야 합니다. 아래 로그를 보시면, pschealth 서비스의 상태가 unhealthy 하다고 기재되어 있습니다.

해당 메시지 이전을 살펴보면, "Time has been changed" 로그가 있고 Timestamp를 보면, 00:42:04 -> 00:41:10 으로 과거 시간으로 돌아간 것을 알 수 있습니다.

2022-12-30T00:42:04.263381+00:00 LGCVCTR052V su[40944]: Successful su for vpostgres by root
2022-12-30T00:42:04.263477+00:00 LGCVCTR052V su[40944]: + ??? root:vpostgres
2022-12-30T00:42:04.263660+00:00 LGCVCTR052V su[40944]: pam_unix(su:session): session opened for user vpostgres by (uid=0)
2022-12-30T00:42:04.281971+00:00 LGCVCTR052V su[40944]: pam_unix(su:session): session closed for user vpostgres
 
2022-12-30T00:41:10.000274+00:00 LGCVCTR052V systemd[1]: Time has been changed
2022-12-30T00:41:10.000619+00:00 LGCVCTR052V lsassd[45900]: 0x7f8faa975700:Failed to sync system time [error code: 40075]
 
2022-12-30T00:41:12.322511+00:00 LGCVCTR052V su[41040]: Successful su for vpostgres by root
2022-12-30T00:41:12.322610+00:00 LGCVCTR052V su[41040]: + ??? root:vpostgres
2022-12-30T00:41:12.322643+00:00 LGCVCTR052V su[41040]: pam_unix(su:session): session opened for user vpostgres by (uid=0)
2022-12-30T00:41:12.340222+00:00 LGCVCTR052V su[41040]: pam_unix(su:session): session closed for user vpostgres
2022-12-30T00:41:32.384235+00:00 LGCVCTR052V su[41220]: Successful su for vpostgres by root
2022-12-30T00:41:32.384331+00:00 LGCVCTR052V su[41220]: + ??? root:vpostgres
 
2022-12-30T00:41:40.009425+00:00 LGCVCTR052V pschealthd: VMware Directory-Service, 0, 1671082672
2022-12-30T00:41:40.009684+00:00 LGCVCTR052V pschealthd: VMware Certificate-Service, 0, 1671082671
2022-12-30T00:41:40.009821+00:00 LGCVCTR052V pschealthd: Websso, 0, 1671082673
2022-12-30T00:41:40.009950+00:00 LGCVCTR052V pschealthd: Lookup service, 0, 1671082674
2022-12-30T00:41:40.010078+00:00 LGCVCTR052V pschealthd: Security Token Service, 0, 1671082674
2022-12-30T00:41:40.010202+00:00 LGCVCTR052V pschealthd: License Service, 0, 1671082674
2022-12-30T00:41:40.010322+00:00 LGCVCTR052V pschealthd: Detected PSC system is not healthy - Wait for atleast one minute before failing

 

4. vCenter의 경우에는 위와 같이 시간 동기화 이슈로 인해서 과거 시간으로 돌아가게 되면 문제가 발생할 수 있습니다.

이와 관련하여 아래 KB가 Release 되어 있습니다.

vCenter HA (VCHA) may perform repeated failover operations every 5 to 10 minutes (50113811)

https://kb.vmware.com/s/article/50113811

 

5. vCenter의 vmafd 프로세스가 생성하는 SSO Token의 경우 유효한 시간 값을 가져야 하기 때문에 NTP Server와의 시간 동기화가 매우 중요합니다.

 

6. 위와 같은 경우에 NTP Server에서 ntp 서비스 로그를 확인하여, 문제 발생 시점에 특별한 로그가 기록되지는 않았는지 확인이 필요합니다.