SAN Layout : https://www.pearsonitcertification.com/articles/article.aspx?p=1944878&seqNum=7
SCSI Protocol : https://www.cs.uml.edu/~bill/cs520/slides_07_scsisnia.pdf
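- SCSI is a client-server protocol
- The client is called the Initiator and generates the requests sent to the server
- A single Initiator generates requests on behalf of multiple Application Clients
- The server is called the Target; it receives and executes the Initiator's requests and returns the results to the Initiator
- The Target has one Task Manager and contains numbered Logical Units (LUs); that number is the LUN
- Besides the Task Manager, the Target has a Device Server, which processes the requests received from the Initiator and passes them to the specific LUN
- On the Target side there is a queue that holds the commands (Tasks) waiting to be executed on the Target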
- Task types
  - Simple : may be executed in any order
  - Ordered : must be executed in the order received
  - Head of queue : tells the Target to place the Task at the front of the queue
  - Auto Contingent Allegiance (ACA) : used when a previously executed command has entered an error condition
- SCSI protocol service types
  - Execute and Confirmation services
  - Data Transfer services
  - Command Phase : delivers the command and its parameters in a Command Descriptor Block (CDB)
  - Data Phase : transfers the data that belongs to the command
  - Status Phase : transfers status information describing the result of the command
- SCSI I/O operation types
  - I/O without data transfer
    - The Initiator sends a SCSI command to the Target, and the Target returns only a status
    - SCSI commands
      - Test Unit Ready
      - Start/Stop Unit
      - Rewind
  - I/O with data transfer
    - Information is exchanged in the Data Phase
    - Data In/Out transmits
    - The data may be sent in one transfer or spread over several Data Phases
    - SCSI commands
      - Read/Write
      - Inquiry
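- Command Descriptor Block (CDB)
  - The Initiator sends a SCSI command to the Target inside a CDB
  - The first byte is the Operation Code
  - In fixed-length CDBs the last byte is the Control byte
  - A CDB is 6, 10, 12, or 16 bytes long; variable-length CDBs are also possible
  - Standard SCSI command types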
- SCSI Status
  - Tells the Initiator whether the SCSI command it sent to the Target succeeded
  - Reports Busy or Not Ready
  - Reports error conditions
  - Also reports that the Target's task set is full
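The CDB layout described above (Operation Code first, Control byte last) can be made concrete with a small sketch. It packs a READ(10) CDB using the standard 10-byte layout; the function name `build_read10_cdb` and the example LBA and block count are illustrative assumptions, not part of the source notes.

```python
import struct

READ_10 = 0x28  # Operation Code for READ(10)


def build_read10_cdb(lba: int, blocks: int, control: int = 0x00) -> bytes:
    """Pack a 10-byte READ(10) CDB: opcode in byte 0, Control in byte 9."""
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("READ(10) carries a 32-bit logical block address")
    if not 0 <= blocks <= 0xFFFF:
        raise ValueError("READ(10) carries a 16-bit transfer length")
    # Big-endian fields: opcode, flags, LBA, group number, transfer length, control
    return struct.pack(">BBIBHB", READ_10, 0x00, lba, 0x00, blocks, control)


# Hypothetical request: read 8 blocks starting at LBA 0x1000
cdb = build_read10_cdb(lba=0x1000, blocks=8)
print(" ".join(f"{b:02x}" for b in cdb))  # 28 00 00 00 10 00 00 00 08 00
```

The same opcode (0x28) is what appears as `Cmd 0x28` in the HPP log line later in these notes.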
SCSI Sense Code : https://www.t10.org/lists/2sensekey.htm
- When a SCSI command completes with CHECK CONDITION status, check the SCSI Sense Key in the sense data
- Example of a SCSI sense code seen in an iSCSI packet : https://wiki.wireshark.org/uploads/__moin_import__/attachments/SampleCaptures/iscsi-tapel.gz
- See https://www.t10.org/lists/asc-alph.txt for the ASC/ASCQ list
    Frame 374: 98 bytes on wire (784 bits), 98 bytes captured (784 bits)
    Ethernet II, Src: Netgear_5b:9b:a2 (00:0f:b5:5b:9b:a2), Dst: VMware_f9:ef:be (00:0c:29:f9:ef:be)
    Internet Protocol Version 4, Src: 192.139.81.227, Dst: 192.168.1.208
    Transmission Control Protocol, Src Port: 3260, Dst Port: 36247, Seq: 2165, Ack: 1361, Len: 32
    [2 Reassembled TCP Segments (80 bytes): #373(48), #374(32)]
    iSCSI (SCSI Response)
        Opcode: SCSI Response (0x21)
        Response: Command completed at target (0x00)
        Status: Check Condition (0x02)
        TotalAHSLength: 0 (0x00)
        DataSegmentLength: 31 (0x0000001f)
        InitiatorTaskTag: 0x0000000f
        StatSN: 16 (0x00000010)
        ExpCmdSN: 16 (0x00000010)
        MaxCmdSN: 48 (0x00000030)
        ExpDataSN: 0x00000000
        BidiReadResidualCount: 0 (0x00000000)
        ResidualCount: 0 (0x00000000)
        Request in: 372
        Time from request: 0.075611000 seconds
        SenseLength: 29 (0x001d)
        Flags: 0x80
    SCSI: SNS Info
        [LUN: 0x0001]
        .111 0000 = SNS Error Type: Current Error (0x70)
        Valid: 112
        0... .... = Filemark: False
        .0.. .... = EOM: False
        ..0. .... = ILI: False
        .... 0101 = Sense Key: Illegal Request (0x5)
        Sense Info: 0x00000000
        Additional Sense Length: 21
        Command-Specific Information: 00000000
        Additional Sense Code+Qualifier: Invalid Field In Cdb (0x2400)
        Field Replaceable Unit Code: 0x00
        0... .... = SKSV: False
        .000 0000 0000 0000 0000 0000 = Sense Key Specific: 0x000000
- A SCSI sense code as it appears in a log consists of three parts
  - SCSI Device/Status : https://www.t10.org/lists/2status.htm
  - SCSI Sense Keys : https://www.t10.org/lists/2sensekey.htm
  - SCSI Additional Sense Data : https://www.t10.org/lists/asc-alph.txt
- For example, if the log contains H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0, then H:0x0 D:0x2 P:0x0 is the SCSI Device/Status, 0x3 is the SCSI Sense Key, and the trailing 0x11 0x0 is the SCSI Additional Sense Data
- An online decoder can render these values in readable form
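The Sense Key, ASC, and ASCQ sit at fixed offsets in fixed-format sense data, which is the format shown in the capture (response code 0x70). The following is a minimal sketch of pulling them out; `parse_fixed_sense` is a hypothetical helper, and the 29 sample bytes are rebuilt by hand from the fields in the capture rather than copied from the packet.

```python
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC", 0xA: "COPY ABORTED", 0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW", 0xE: "MISCOMPARE",
}


def parse_fixed_sense(data: bytes) -> dict:
    """Extract Sense Key, ASC and ASCQ from fixed-format sense data."""
    if len(data) < 14:
        raise ValueError("need at least 14 bytes to reach ASC/ASCQ")
    response_code = data[0] & 0x7F  # top bit of byte 0 is the Valid bit
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F  # Sense Key is the low nibble of byte 2
    return {
        "response_code": response_code,
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "additional_length": data[7],
        "asc": data[12],
        "ascq": data[13],
    }


# Sample rebuilt from the capture: 0x70, Sense Key 0x5, length 21, ASC/ASCQ 0x24/0x00
sample = (
    bytes([0x70, 0x00, 0x05])  # response code, obsolete byte, flags + Sense Key
    + bytes(4)                 # Sense Info
    + bytes([21])              # Additional Sense Length
    + bytes(4)                 # Command-Specific Information
    + bytes([0x24, 0x00])      # ASC, ASCQ
    + bytes(15)                # FRU code, Sense Key Specific, remaining bytes
)
parsed = parse_fixed_sense(sample)
print(parsed["sense_key_name"], f"{parsed['asc']:#04x}", f"{parsed['ascq']:#04x}")
```

The total of 29 bytes matches the `SenseLength: 29` field in the capture (8 header bytes plus an Additional Sense Length of 21).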
SCSI Errors
- In vmkernel.log, SCSI errors are reported by the following components
- nmp_ThrottleLogForDevice
- ScsiDeviceIO
- HppThrottleLogForDevice
SCSI Errors - vmkernel.log
    2023-03-07T22:26:20.681Z cpu34:2098047)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x8a (0x45d92ca46a40, 2138740) to dev "naa.624a9370e793c401a188430c00019385" on path "vmhba2:C0:T2:L245" Failed:
    2023-03-07T22:26:20.681Z cpu34:2098047)ScsiDeviceIO: 4277: Cmd(0x45d92ca46a40) 0x8a, CmdSN 0x8000006c from world 2138740 to dev "naa.624a9370e793c401a188430c00019385" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x29 0x3
    2022-09-15T00:29:50.351Z cpu76:2098395)WARNING: HPP: HppThrottleLogForDevice:1136: Cmd 0x28 (0x45dd2d622788, 0) to dev "naa.50000398d8235271" on path "vmhba3:C0:T12:L0" Failed:
- Examples
  - Message reported by ScsiDeviceIO
Reported by ScsiDeviceIO
    2023-03-07T22:26:20.681Z cpu34:2098047)ScsiDeviceIO: 4277: Cmd(0x45d92ca46a40) 0x8a, CmdSN 0x8000006c from world 2138740 to dev "naa.624a9370e793c401a188430c00019385" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x29 0x3

    Cmd 0x8a        ## SCSI command opcode (WRITE(16), 16-byte CDB)
    H:0x0           ## Host-side status / HBA
    D:0x2           ## Device-side status / SP (Storage Processor)
    P:0x0           ## Plug-in-side status / NMP
    0xb 0x29 0x3    ## Sense Key + Additional Sense Code (ASC) + Additional Sense Code Qualifier (ASCQ)
  - Message reported by NMP
Reported by NMP
    2023-03-07T22:26:20.681Z cpu34:2098047)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x8a (0x45d92ca46a40, 2138740) to dev "naa.624a9370e793c401a188430c00019385" on path "vmhba2:C0:T2:L245" Failed:

    vmhba2    ## HBA the SCSI command was sent through
    :C0       ## Channel
    :T2       ## Target
    :L245     ## LUN ID
  - Host-side problem
Reported by Host-side
    Cmd(0x412e449523c0) 0x2a, CmdSN 0x46f43 from world 32797 to dev "naa.600143801259bd790000500002e10000" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1

    H:0x5    ## Abort
  - Device-side problem
Reported by Device-side
    Cmd(0x412e40439fc0) 0x2a, CmdSN 0x800000ea from world 38844 to dev "naa.600143801259bd790000500001090000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x2 0x3a 0x1

    D:0x28    ## Task Set Full
  - Plugin-side problem
Reported by Plugin-side
    Cmd(0x4124448131c0) 0x2a, CmdSN 0x17 from world 19189 to dev "naa.60000970000292603381533030394636" failed H:0x0 D:0x2 P:0x8 Possible sense data: 0x2 0x4 0x3

    P:0x8    ## Backing pool for thin provisioned LUN is out of space
  - Valid sense data
Valid sense data
    Cmd(0x4125411e8f00) 0x9e, CmdSN 0x158614 from world 8272 to dev "naa.60000970000292601192533030354546" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0

    D:0x2 Valid sense data: 0x5    ## Illegal Request
    0x25 0x0                       ## Logical Unit Not Supported
  - Possible sense data
Possible sense data
    Cmd(0x412e40439fc0) 0x2a, CmdSN 0x800000ea from world 38844 to dev "naa.600143801259bd790000500001090000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x2 0x3a 0x1

    D:0x28 Possible sense data: 0x2    ## Not Ready
    0x3a 0x1                           ## Medium Not Present - Tray Closed
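The manual reading of H, D, P and the sense triple in the examples above can be sketched as a small decoder. The lookup tables are deliberately partial: they hold only the values named in these notes plus a few standard SCSI status codes. The triage order (non-zero H first, then P, then D) is an assumption chosen to match how the examples above are labelled, and `decode_status` is a hypothetical helper name.

```python
import re

# Partial tables: values named in these notes plus standard SCSI status codes.
HOST_STATUS = {0x0: "OK", 0x5: "ABORT"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
                 0x28: "TASK SET FULL"}
PLUGIN_STATUS = {0x0: "GOOD", 0x8: "THIN PROVISIONING BACKING POOL OUT OF SPACE"}
SENSE_KEYS = {0x2: "NOT READY", 0x3: "MEDIUM ERROR",
              0x5: "ILLEGAL REQUEST", 0xB: "ABORTED COMMAND"}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX} (Valid|Possible) sense data: {HEX} {HEX} {HEX}"
)


def decode_status(line):
    """Decode the H/D/P status and sense triple of a ScsiDeviceIO log line."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    host, device, plugin = (int(m.group(i), 16) for i in (1, 2, 3))
    key, asc, ascq = (int(m.group(i), 16) for i in (5, 6, 7))
    # Assumed triage order: non-zero H wins, then P, then D.
    if host:
        side = "host"
    elif plugin:
        side = "plugin"
    elif device:
        side = "device"
    else:
        side = "none"
    return {
        "side": side,
        "host": HOST_STATUS.get(host, hex(host)),
        "device": DEVICE_STATUS.get(device, hex(device)),
        "plugin": PLUGIN_STATUS.get(plugin, hex(plugin)),
        "sense_valid": m.group(4) == "Valid",
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": asc,
        "ascq": ascq,
    }


line = ('Cmd(0x4125411e8f00) 0x9e, CmdSN 0x158614 from world 8272 to dev '
        '"naa.60000970000292601192533030354546" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x5 0x25 0x0')
print(decode_status(line))
```

Codes missing from the tables fall back to their hex value, so unknown statuses stay visible instead of being silently mislabelled.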
SCSI Disk Performance
- Disk Latency
- https://kb.vmware.com/s/article/2007236
performance has deteriorated - vmkernel.log
    2023-02-09T11:27:25.558Z cpu11:66711)WARNING: ScsiDeviceIO: 1203: Device naa.600300570213ff3026a0f4ea0a156e16 performance has deteriorated. I/O latency increased from average value of 97 microseconds to 12551 microseconds.
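To judge how severe a "performance has deteriorated" warning is, it helps to extract the two latency figures and compare them. The following is a minimal sketch that assumes the message wording matches the sample above exactly; `parse_latency_warning` is a hypothetical helper name.

```python
import re

LATENCY_RE = re.compile(
    r"Device (?P<dev>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<now>\d+) microseconds"
)


def parse_latency_warning(line):
    """Return device, both latencies and their ratio, or None if no match."""
    m = LATENCY_RE.search(line)
    if m is None:
        return None
    avg_us = int(m.group("avg"))
    now_us = int(m.group("now"))
    return {
        "device": m.group("dev"),
        "avg_us": avg_us,
        "now_us": now_us,
        "now_ms": now_us / 1000,
        # Guard against a zero average so the ratio never divides by zero
        "ratio": now_us / avg_us if avg_us else None,
    }


sample = ("2023-02-09T11:27:25.558Z cpu11:66711)WARNING: ScsiDeviceIO: 1203: "
          "Device naa.600300570213ff3026a0f4ea0a156e16 performance has "
          "deteriorated. I/O latency increased from average value of "
          "97 microseconds to 12551 microseconds.")
info = parse_latency_warning(sample)
print(info["device"], info["now_ms"], "ms", round(info["ratio"], 1), "x average")
```

For the sample line this gives roughly a 129-fold increase, from 0.097 ms to 12.551 ms.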
[References]
Understanding the storage path failover sequence in VMware ESXi native multipathing (1027963)
https://ikb.vmware.com/s/article/1027963?lang=en_US
SCSI events that can trigger ESX server to fail a LUN over to another path (1003433)
https://ikb.vmware.com/s/article/1003433
Tips for analysis
- List the HBAs
  - # esxcli storage core adapter list
    HBA Name  Driver      Link State  UID                                   Capabilities         Description
    --------  ----------  ----------  ------------------------------------  -------------------  -----------
    vmhba0    lsi_mr3     link-n/a    sas.52cea7f077cd4c00                                       (0000:18:00.0) Broadcom PERC H730P Mini
    vmhba1    vmw_ahci    link-n/a    sata.vmhba1                                                (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
    vmhba2    vmw_ahci    link-n/a    sata.vmhba2                                                (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
    vmhba3    lpfc        link-down   fc.200000109bb46c09:100000109bb46c09  Second Level Lun ID  (0000:3b:00.0) Emulex Corporation Emulex LightPulse LPe32000 PCIe Fibre Channel Adapter
    vmhba4    lpfc        link-down   fc.200000109bb46c0a:100000109bb46c0a  Second Level Lun ID  (0000:3b:00.1) Emulex Corporation Emulex LightPulse LPe32000 PCIe Fibre Channel Adapter
    vmhba64   brcmnvmefc  link-down   fc.200000109bb46c09:100000109bb46c09                       (0000:3b:00.0) Emulex Corporation Emulex LightPulse LPe32000 PCIe Fibre Channel Adapter
    vmhba65   brcmnvmefc  link-down   fc.200000109bb46c0a:100000109bb46c0a                       (0000:3b:00.1) Emulex Corporation Emulex LightPulse LPe32000 PCIe Fibre Channel Adapter
    vmhba66   iscsi_vmk   online      iqn.1998-01.com.vmware:w2-tse-d12     Second Level Lun ID  iSCSI Software Adapter
- Check FC events
  - # esxcli storage san fc events get
  - # grep "Frame Dropped" vmkernel.log
    2014-08-18T20:17:51.421Z cpu25:16409)WARNING: vmklinux: vmklnx_iodm_event:988:vmhba3: Frame Dropped 36 times in 60s, SAN connection check required.
    2014-08-18T20:18:02.152Z cpu19:20266)WARNING: vmklinux: vmklnx_iodm_event:988:vmhba3: Frame Dropped 53 times in 60s, SAN connection check required.
    2014-08-18T20:18:12.275Z cpu30:16414)WARNING: vmklinux: vmklnx_iodm_event:988:vmhba3: Frame Dropped 66 times in 60s, SAN connection check required.
    2014-08-18T20:18:22.297Z cpu27:16411)WARNING: vmklinux: vmklnx_iodm_event:988:vmhba3: Frame Dropped 70 times in 60s, SAN connection check required.
- Check SCSI errors
  - In the output, look at H, D, and P first to tell host-side, device-side, and plugin-side problems apart
  - Adjust the vmhba numbers in the pattern to the adapters under investigation
  - # sed -n "s@.*naa\.[0-9a-f]\{16,32\}.*\(vmhba[0134]:C[0-9]:T[0-9][0-9]*\):L[0-9]\{1,3\}.*\(H:0x[0-9a-f]\{1,2\} D:0x[0-9a-f]\{1,2\} P:0x[0-9a-f]\{1,3\}.*0x[0-9a-f]\{1,2\} 0x[0-9a-f]\{1,2\} 0x[0-9a-f]\{1,2\}\).*@\1 \2@p" vmkernel.log | sort | uniq -c
  - # grep ScsiDeviceIO vmkernel.* | grep "Valid sense data" | awk '{print $13,$15,$16,$17,$18,$19,$20,$21,$22,$23}' | sort | uniq -c
- Once the HBA is identified, check the NMP log entries
  - # sed -n "s@.*NMP.*\(vmhba[35]:C[0-9]:T[0-9][0-9]*\).*@\1@p" vmkernel.log | sort | uniq -c
- Once the HBA is identified, check the Targets discovered through that HBA
  - # sed -n "s@.*\(vmhba5:C[0-9]:T[0-9][0-9]*\).*\(Target:\).*\(WWPN:.*\)@\1 \2 \3@p" esxcfg-mpath_-b.txt | sort -u
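Where the sed and awk one-liners become hard to maintain, the same `sort | uniq -c` style summary can be sketched in Python. This version counts ScsiDeviceIO failures per device, H/D/P status, and sense triple. The regular expression assumes the line format of the ScsiDeviceIO sample earlier in these notes, `count_failures` is a hypothetical helper name, and the sample line is repeated only to illustrate counting.

```python
import re
from collections import Counter

FAIL_RE = re.compile(
    r'ScsiDeviceIO.*to dev\s*"(?P<dev>[^"]+)" failed '
    r'(?P<status>H:0x[0-9a-f]+ D:0x[0-9a-f]+ P:0x[0-9a-f]+) '
    r'(?:Valid|Possible) sense data: '
    r'(?P<sense>0x[0-9a-f]+ 0x[0-9a-f]+ 0x[0-9a-f]+)'
)


def count_failures(lines):
    """Count ScsiDeviceIO failures grouped by (device, H/D/P, sense triple)."""
    counts = Counter()
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            counts[(m.group("dev"), m.group("status"), m.group("sense"))] += 1
    return counts


scsi_line = ('2023-03-07T22:26:20.681Z cpu34:2098047)ScsiDeviceIO: 4277: '
             'Cmd(0x45d92ca46a40) 0x8a, CmdSN 0x8000006c from world 2138740 '
             'to dev "naa.624a9370e793c401a188430c00019385" failed '
             'H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x29 0x3')
nmp_line = ('2023-03-07T22:26:20.681Z cpu34:2098047)NMP: '
            'nmp_ThrottleLogForDevice:3861: Cmd 0x8a (0x45d92ca46a40, 2138740) '
            'to dev "naa.624a9370e793c401a188430c00019385" on path '
            '"vmhba2:C0:T2:L245" Failed:')

# The ScsiDeviceIO line is repeated for illustration; the NMP line is skipped.
counts = count_failures([scsi_line, nmp_line, scsi_line])
for (dev, status, sense), n in counts.most_common():
    print(n, dev, status, sense)
```

To run it against a real log, pass an open file object instead of the list, since `count_failures` only needs an iterable of lines.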