[NUMA] Overview
This page summarizes findings verified against the ESXi NUMA Deep Dive document.
History
NUMA is a shared-memory architecture in which each CPU uses both its own local memory and remote memory attached to other CPUs.
A CPU performs best when it accesses its own local memory.
Accessing another CPU's remote memory increases latency and reduces bandwidth, so it carries a performance penalty.
As multiprocessor systems became necessary, the limited bandwidth of the existing bus-based systems became a problem.
Each added CPU reduces the bandwidth available per CPU.
Each added CPU also lengthens the bus, which increases latency.
This bus-based design is called UMA (Uniform Memory Access).
In UMA, the CPUs connect over the system bus to the northbridge, which contains the memory controller.
The I/O controller is also attached to the northbridge, so all I/O reaches the CPUs through the northbridge.
Because UMA scales poorly, NUMA (Non-Uniform Memory Access) was introduced to improve scalability.
Key changes introduced with NUMA
Non-Uniform Memory Access Organization
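NUMA moves away from a centralized memory pool and introduces topological properties.
Memory locations are classified by the length of the signal path between CPU and memory, which relieves the latency and bandwidth bottlenecks.
As in the figure below, from CPU0's point of view the memory attached to its own memory controller is local memory, while the memory attached to CPU1's memory controller and reached over the interconnect is remote memory.
When CPU0 accesses remote memory, the request must cross the interconnect and go through CPU1's memory controller, so memory access time increases.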
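To make the local versus remote penalty concrete, the following is a minimal sketch of a weighted-average cost model. The function name and the default remote cost of 1.5 are assumptions for illustration only (1.5 happens to be the relative cross-node figure that appears in the Coreinfo output later on this page); real latencies depend on the hardware and the workload.

```python
def average_access_cost(local_fraction: float,
                        local_cost: float = 1.0,
                        remote_cost: float = 1.5) -> float:
    """Return the average cost of one memory access, relative to a local access.

    local_fraction is the share of accesses served from local memory (0.0 to 1.0).
    remote_cost = 1.5 is an illustrative value, not a measured constant.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0.0 and 1.0")
    remote_fraction = 1.0 - local_fraction
    return local_fraction * local_cost + remote_fraction * remote_cost


# Show how the average cost grows as more accesses go to remote memory.
for fraction in (1.0, 0.75, 0.5, 0.0):
    cost = average_access_cost(fraction)
    print(f"{fraction:.0%} local accesses -> relative cost {cost:.3f}")
```

Under these assumed numbers, a workload with only half of its accesses served locally pays an average cost of 1.25 relative to a fully local one, which is why keeping accesses local matters.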
Point-to-Point Interconnect topology
Intel introduced the QuickPath Architecture with the Nehalem microarchitecture.
The QuickPath Architecture moved the memory controller into the CPU and introduced the QuickPath point-to-point Interconnect (QPI) as the data link between CPUs.
Scalable cache coherence solutions
Before Sandy Bridge, each core in a CPU was connected to the last-level cache (LLC) through a private path made up of more than a thousand wires.
This design scaled poorly, so the Sandy Bridge architecture decoupled the LLC from the cores and introduced a scalable on-die ring interconnect.
The ring lets every core access every LLC slice.
NUMA example
In the following figure there are two CPUs with 10 cores each.
Each CPU has four memory channels, and each channel supports up to three DIMMs.
Each channel is populated with a single 16 GB DIMM, so each CPU has 64 GB of memory available and the system has 128 GB installed in total.
As a result, each NUMA node consists of 10 cores and 64 GB of memory.
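The arithmetic behind this example can be written out as a short sketch. The class and field names are hypothetical and exist only to label the numbers from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostMemoryLayout:
    """Hypothetical description of the example host (names are illustrative)."""
    sockets: int
    cores_per_socket: int
    channels_per_socket: int
    dimms_per_channel: int  # DIMMs actually populated per channel
    dimm_size_gb: int

    def node_memory_gb(self) -> int:
        # Memory local to one CPU, i.e. one NUMA node.
        return self.channels_per_socket * self.dimms_per_channel * self.dimm_size_gb

    def total_memory_gb(self) -> int:
        return self.sockets * self.node_memory_gb()


# Values taken from the example: 2 CPUs, 10 cores each, 4 channels, one 16 GB DIMM per channel.
layout = HostMemoryLayout(sockets=2, cores_per_socket=10,
                          channels_per_socket=4, dimms_per_channel=1,
                          dimm_size_gb=16)
print(f"Per NUMA node: {layout.cores_per_socket} cores, {layout.node_memory_gb()} GB")
print(f"Whole system:  {layout.sockets * layout.cores_per_socket} cores, {layout.total_memory_gb()} GB")
```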
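Verifying NUMA usage
As described above, a CPU can access both its own local memory and the memory attached to another CPU's memory controller.
The CPU scheduler and the NUMA scheduler in the ESXi kernel manage memory allocation for VMs.
The NUMA scheduler aims to spread workloads as evenly as possible while maximizing local memory access.
To do so, it uses the VM's CPU and memory configuration together with the physical core count and memory configuration of the host.
When a VM has more vCPUs than a single CPU has cores, the VMkernel distributes the vCPUs evenly across the CPUs.
The example below comes from a host with two CPUs of 16 cores each, running a Windows VM configured with 32 vCPUs.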
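The even split described above can be sketched as follows. This is a simplified model, not the actual VMkernel algorithm: the function name is hypothetical, and the sketch ignores hyperthreading and any advanced NUMA settings.

```python
import math
from typing import List


def split_vcpus(vcpus: int, cores_per_socket: int, sockets: int) -> List[int]:
    """Return the number of vCPUs placed on each NUMA node.

    Assumption: the VM is spread over the fewest nodes that can hold it,
    and the vCPUs are divided as evenly as possible across those nodes.
    """
    if vcpus <= 0 or cores_per_socket <= 0 or sockets <= 0:
        raise ValueError("all arguments must be positive")
    nodes_needed = math.ceil(vcpus / cores_per_socket)
    if nodes_needed > sockets:
        raise ValueError("VM has more vCPUs than the host has physical cores")
    base, extra = divmod(vcpus, nodes_needed)
    # The first `extra` nodes receive one additional vCPU when the split is uneven.
    return [base + (1 if i < extra else 0) for i in range(nodes_needed)]


# The configuration from the example: 32 vCPUs on a host with 2 CPUs x 16 cores.
print(split_vcpus(32, cores_per_socket=16, sockets=2))
```

For the 32 vCPU VM this model yields 16 vCPUs per node, which matches the two 16-processor NUMA nodes reported by Coreinfo inside the guest.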
ESXi VMkernel - esxtop
3:55:40pm up 12 days 9:50, 1499 worlds, 12 VMs, 58 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM /MB: 261758 total: 3685 vmk,104487 other, 153585 free
VMKMEM/MB: 261372 managed: 3228 minfree, 15181 rsvd, 246191 ursvd, high state
NUMA /MB: 130684 (29583), 131072 (123618)
PSHARE/MB: 51 shared, 50 common: 1 saving
SWAP /MB: 0 curr, 0 rclmtgt: 0.00 r/s, 0.00 w/s
ZIP /MB: 0 zipped, 0 saved
MEMCTL/MB: 0 curr, 0 target, 45517 max
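In the esxtop memory panel, the NUMA /MB row lists one entry per NUMA node: the memory managed by ESXi on that node, followed in parentheses by the amount currently free. A minimal sketch for pulling those numbers out of a captured line is shown below; the helper name is hypothetical, and the sample string is the NUMA row from the output above.

```python
import re
from typing import List, Tuple

# The NUMA row copied from the esxtop output above.
NUMA_LINE = "NUMA /MB: 130684 (29583), 131072 (123618)"


def parse_numa_line(line: str) -> List[Tuple[int, int]]:
    """Return a list of (total_mb, free_mb) tuples, one per NUMA node."""
    # Only look at the part after the "NUMA /MB:" label.
    _, _, values = line.partition(":")
    pairs = re.findall(r"(\d+)\s*\(\s*(\d+)\s*\)", values)
    return [(int(total), int(free)) for total, free in pairs]


for node, (total_mb, free_mb) in enumerate(parse_numa_line(NUMA_LINE)):
    used_mb = total_mb - free_mb
    print(f"node {node}: total={total_mb} MB, free={free_mb} MB, "
          f"used={used_mb} MB ({free_mb / total_mb:.1%} free)")
```

A large difference in free memory between the nodes, as in this capture, is a quick hint about where the memory of the running VMs is currently placed.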
Windows Guest OS - coreinfo.exe
C:\Users\Administrator\Downloads\SysinternalsSuite>Coreinfo64.exe
Coreinfo v3.6 - Dump information on system CPU and memory topology
Copyright (C) 2008-2022 Mark Russinovich
Sysinternals - www.sysinternals.com
Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
Microcode signature: 05003302
HTT * Hyperthreading enabled
CET - Supports Control Flow Enforcement Technology
Kernel CET - Kernel-mode CET Enabled
User CET - User-mode CET Allowed
HYPERVISOR * Hypervisor is present
VMX - Supports Intel hardware-assisted virtualization
SVM - Supports AMD hardware-assisted virtualization
X64 * Supports 64-bit mode
SMX - Supports Intel trusted execution
SKINIT - Supports AMD SKINIT
SGX - Supports Intel SGX
NX * Supports no-execute page protection
SMEP * Supports Supervisor Mode Execution Prevention
SMAP * Supports Supervisor Mode Access Prevention
PAGE1GB * Supports 1 GB large pages
PAE * Supports > 32-bit physical addresses
PAT * Supports Page Attribute Table
PSE * Supports 4 MB pages
PSE36 * Supports > 32-bit address 4 MB pages
PGE * Supports global bit in page tables
SS * Supports bus snooping for cache operations
VME * Supports Virtual-8086 mode
RDWRFSGSBASE * Supports direct GS/FS base access
FPU * Implements i387 floating point instructions
MMX * Supports MMX instruction set
MMXEXT - Implements AMD MMX extensions
3DNOW - Supports 3DNow! instructions
3DNOWEXT - Supports 3DNow! extension instructions
SSE * Supports Streaming SIMD Extensions
SSE2 * Supports Streaming SIMD Extensions 2
SSE3 * Supports Streaming SIMD Extensions 3
SSSE3 * Supports Supplemental SIMD Extensions 3
SSE4a - Supports Streaming SIMDR Extensions 4a
SSE4.1 * Supports Streaming SIMD Extensions 4.1
SSE4.2 * Supports Streaming SIMD Extensions 4.2
AES * Supports AES extensions
AVX * Supports AVX instruction extensions
AVX2 * Supports AVX2 instruction extensions
AVX-512-F * Supports AVX-512 Foundation instructions
AVX-512-DQ * Supports AVX-512 double and quadword instructions
AVX-512-IFAMA - Supports AVX-512 integer Fused multiply-add instructions
AVX-512-PF - Supports AVX-512 prefetch instructions
AVX-512-ER - Supports AVX-512 exponential and reciprocal instructions
AVX-512-CD * Supports AVX-512 conflict detection instructions
AVX-512-BW * Supports AVX-512 byte and word instructions
AVX-512-VL * Supports AVX-512 vector length instructions
FMA * Supports FMA extensions using YMM state
MSR * Implements RDMSR/WRMSR instructions
MTRR * Supports Memory Type Range Registers
XSAVE * Supports XSAVE/XRSTOR instructions
OSXSAVE * Supports XSETBV/XGETBV instructions
RDRAND * Supports RDRAND instruction
RDSEED * Supports RDSEED instruction
CMOV * Supports CMOVcc instruction
CLFSH * Supports CLFLUSH instruction
CX8 * Supports compare and exchange 8-byte instructions
CX16 * Supports CMPXCHG16B instruction
BMI1 * Supports bit manipulation extensions 1
BMI2 * Supports bit manipulation extensions 2
ADX * Supports ADCX/ADOX instructions
DCA - Supports prefetch from memory-mapped device
F16C * Supports half-precision instruction
FXSR * Supports FXSAVE/FXSTOR instructions
FFXSR - Supports optimized FXSAVE/FSRSTOR instruction
MONITOR - Supports MONITOR and MWAIT instructions
MOVBE * Supports MOVBE instruction
ERMSB * Supports Enhanced REP MOVSB/STOSB
PCLMULDQ * Supports PCLMULDQ instruction
POPCNT * Supports POPCNT instruction
LZCNT * Supports LZCNT instruction
SEP * Supports fast system call instructions
LAHF-SAHF * Supports LAHF/SAHF instructions in 64-bit mode
HLE - Supports Hardware Lock Elision instructions
RTM - Supports Restricted Transactional Memory instructions
DE * Supports I/O breakpoints including CR4.DE
DTES64 - Can write history of 64-bit branch addresses
DS - Implements memory-resident debug buffer
DS-CPL - Supports Debug Store feature with CPL
PCID * Supports PCIDs and settable CR4.PCIDE
INVPCID * Supports INVPCID instruction
PDCM - Supports Performance Capabilities MSR
RDTSCP * Supports RDTSCP instruction
TSC * Supports RDTSC instruction
TSC-DEADLINE * Local APIC supports one-shot deadline timer
TSC-INVARIANT * TSC runs at constant rate
xTPR - Supports disabling task priority messages
EIST - Supports Enhanced Intel Speedstep
ACPI - Implements MSR for power management
TM - Implements thermal monitor circuitry
TM2 - Implements Thermal Monitor 2 control
APIC * Implements software-accessible local APIC
x2APIC * Supports x2APIC
CNXT-ID - L1 data cache mode adaptive or BIOS
MCE * Supports Machine Check, INT18 and CR4.MCE
MCA * Implements Machine Check Architecture
PBE - Supports use of FERR#/PBE# pin
PSN - Implements 96-bit processor serial number
PREFETCHW * Supports PREFETCHW instruction
Maximum implemented CPUID leaves: 00000016 (Basic), 80000008 (Extended).
Maximum implemented address width: 48 bits (virtual), 45 bits (physical).
Processor signature: 00050657
Logical to Physical Processor Map:
*------------------------------- Physical Processor 0
-*------------------------------ Physical Processor 1
--*----------------------------- Physical Processor 2
---*---------------------------- Physical Processor 3
----*--------------------------- Physical Processor 4
-----*-------------------------- Physical Processor 5
------*------------------------- Physical Processor 6
-------*------------------------ Physical Processor 7
--------*----------------------- Physical Processor 8
---------*---------------------- Physical Processor 9
----------*--------------------- Physical Processor 10
-----------*-------------------- Physical Processor 11
------------*------------------- Physical Processor 12
-------------*------------------ Physical Processor 13
--------------*----------------- Physical Processor 14
---------------*---------------- Physical Processor 15
----------------*--------------- Physical Processor 16
-----------------*-------------- Physical Processor 17
------------------*------------- Physical Processor 18
-------------------*------------ Physical Processor 19
--------------------*----------- Physical Processor 20
---------------------*---------- Physical Processor 21
----------------------*--------- Physical Processor 22
-----------------------*-------- Physical Processor 23
------------------------*------- Physical Processor 24
-------------------------*------ Physical Processor 25
--------------------------*----- Physical Processor 26
---------------------------*---- Physical Processor 27
----------------------------*--- Physical Processor 28
-----------------------------*-- Physical Processor 29
------------------------------*- Physical Processor 30
-------------------------------* Physical Processor 31
Logical Processor to Socket Map:
**------------------------------ Socket 0
--**---------------------------- Socket 1
----**-------------------------- Socket 2
------**------------------------ Socket 3
--------**---------------------- Socket 4
----------**-------------------- Socket 5
------------**------------------ Socket 6
--------------**---------------- Socket 7
----------------**-------------- Socket 8
------------------**------------ Socket 9
--------------------**---------- Socket 10
----------------------**-------- Socket 11
------------------------**------ Socket 12
--------------------------**---- Socket 13
----------------------------**-- Socket 14
------------------------------** Socket 15
Logical Processor to NUMA Node Map:
****************---------------- NUMA Node 0
----------------**************** NUMA Node 1
Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01
00: 1.0 1.5
01: 1.5 1.9
Logical Processor to Cache Map:
*------------------------------- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*------------------------------- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*------------------------------- Unified Cache 0, Level 2, 1 MB, Assoc 16, LineSize 64
**------------------------------ Unified Cache 1, Level 3, 22 MB, Assoc 11, LineSize 64
-*------------------------------ Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*------------------------------ Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*------------------------------ Unified Cache 2, Level 2, 1 MB, Assoc 16, LineSize 64
--*----------------------------- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
--*----------------------------- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
--*----------------------------- Unified Cache 3, Level 2, 1 MB, Assoc 16, LineSize 64
--**---------------------------- Unified Cache 4, Level 3, 22 MB, Assoc 11, LineSize 64
---*---------------------------- Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
---*---------------------------- Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
---*---------------------------- Unified Cache 5, Level 2, 1 MB, Assoc 16, LineSize 64
----*--------------------------- Data Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
----*--------------------------- Instruction Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
----*--------------------------- Unified Cache 6, Level 2, 1 MB, Assoc 16, LineSize 64
----**-------------------------- Unified Cache 7, Level 3, 22 MB, Assoc 11, LineSize 64
-----*-------------------------- Data Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
-----*-------------------------- Instruction Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
-----*-------------------------- Unified Cache 8, Level 2, 1 MB, Assoc 16, LineSize 64
------*------------------------- Data Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------*------------------------- Instruction Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------*------------------------- Unified Cache 9, Level 2, 1 MB, Assoc 16, LineSize 64
------**------------------------ Unified Cache 10, Level 3, 22 MB, Assoc 11, LineSize 64
-------*------------------------ Data Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
-------*------------------------ Instruction Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
-------*------------------------ Unified Cache 11, Level 2, 1 MB, Assoc 16, LineSize 64
--------*----------------------- Data Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
--------*----------------------- Instruction Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
--------*----------------------- Unified Cache 12, Level 2, 1 MB, Assoc 16, LineSize 64
--------**---------------------- Unified Cache 13, Level 3, 22 MB, Assoc 11, LineSize 64
---------*---------------------- Data Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
---------*---------------------- Instruction Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
---------*---------------------- Unified Cache 14, Level 2, 1 MB, Assoc 16, LineSize 64
----------*--------------------- Data Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
----------*--------------------- Instruction Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
----------*--------------------- Unified Cache 15, Level 2, 1 MB, Assoc 16, LineSize 64
----------**-------------------- Unified Cache 16, Level 3, 22 MB, Assoc 11, LineSize 64
-----------*-------------------- Data Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
-----------*-------------------- Instruction Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
-----------*-------------------- Unified Cache 17, Level 2, 1 MB, Assoc 16, LineSize 64
------------*------------------- Data Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------*------------------- Instruction Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------*------------------- Unified Cache 18, Level 2, 1 MB, Assoc 16, LineSize 64
------------**------------------ Unified Cache 19, Level 3, 22 MB, Assoc 11, LineSize 64
-------------*------------------ Data Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
-------------*------------------ Instruction Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
-------------*------------------ Unified Cache 20, Level 2, 1 MB, Assoc 16, LineSize 64
--------------*----------------- Data Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
--------------*----------------- Instruction Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
--------------*----------------- Unified Cache 21, Level 2, 1 MB, Assoc 16, LineSize 64
--------------**---------------- Unified Cache 22, Level 3, 22 MB, Assoc 11, LineSize 64
---------------*---------------- Data Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
---------------*---------------- Instruction Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
---------------*---------------- Unified Cache 23, Level 2, 1 MB, Assoc 16, LineSize 64
----------------*--------------- Data Cache 16, Level 1, 32 KB, Assoc 8, LineSize 64
----------------*--------------- Instruction Cache 16, Level 1, 32 KB, Assoc 8, LineSize 64
----------------*--------------- Unified Cache 24, Level 2, 1 MB, Assoc 16, LineSize 64
----------------**-------------- Unified Cache 25, Level 3, 22 MB, Assoc 11, LineSize 64
-----------------*-------------- Data Cache 17, Level 1, 32 KB, Assoc 8, LineSize 64
-----------------*-------------- Instruction Cache 17, Level 1, 32 KB, Assoc 8, LineSize 64
-----------------*-------------- Unified Cache 26, Level 2, 1 MB, Assoc 16, LineSize 64
------------------*------------- Data Cache 18, Level 1, 32 KB, Assoc 8, LineSize 64
------------------*------------- Instruction Cache 18, Level 1, 32 KB, Assoc 8, LineSize 64
------------------*------------- Unified Cache 27, Level 2, 1 MB, Assoc 16, LineSize 64
------------------**------------ Unified Cache 28, Level 3, 22 MB, Assoc 11, LineSize 64
-------------------*------------ Data Cache 19, Level 1, 32 KB, Assoc 8, LineSize 64
-------------------*------------ Instruction Cache 19, Level 1, 32 KB, Assoc 8, LineSize 64
-------------------*------------ Unified Cache 29, Level 2, 1 MB, Assoc 16, LineSize 64
--------------------*----------- Data Cache 20, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------*----------- Instruction Cache 20, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------*----------- Unified Cache 30, Level 2, 1 MB, Assoc 16, LineSize 64
--------------------**---------- Unified Cache 31, Level 3, 22 MB, Assoc 11, LineSize 64
---------------------*---------- Data Cache 21, Level 1, 32 KB, Assoc 8, LineSize 64
---------------------*---------- Instruction Cache 21, Level 1, 32 KB, Assoc 8, LineSize 64
---------------------*---------- Unified Cache 32, Level 2, 1 MB, Assoc 16, LineSize 64
----------------------*--------- Data Cache 22, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------*--------- Instruction Cache 22, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------*--------- Unified Cache 33, Level 2, 1 MB, Assoc 16, LineSize 64
----------------------**-------- Unified Cache 34, Level 3, 22 MB, Assoc 11, LineSize 64
-----------------------*-------- Data Cache 23, Level 1, 32 KB, Assoc 8, LineSize 64
-----------------------*-------- Instruction Cache 23, Level 1, 32 KB, Assoc 8, LineSize 64
-----------------------*-------- Unified Cache 35, Level 2, 1 MB, Assoc 16, LineSize 64
------------------------*------- Data Cache 24, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------*------- Instruction Cache 24, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------*------- Unified Cache 36, Level 2, 1 MB, Assoc 16, LineSize 64
------------------------**------ Unified Cache 37, Level 3, 22 MB, Assoc 11, LineSize 64
-------------------------*------ Data Cache 25, Level 1, 32 KB, Assoc 8, LineSize 64
-------------------------*------ Instruction Cache 25, Level 1, 32 KB, Assoc 8, LineSize 64
-------------------------*------ Unified Cache 38, Level 2, 1 MB, Assoc 16, LineSize 64
--------------------------*----- Data Cache 26, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------*----- Instruction Cache 26, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------*----- Unified Cache 39, Level 2, 1 MB, Assoc 16, LineSize 64
--------------------------**---- Unified Cache 40, Level 3, 22 MB, Assoc 11, LineSize 64
---------------------------*---- Data Cache 27, Level 1, 32 KB, Assoc 8, LineSize 64
---------------------------*---- Instruction Cache 27, Level 1, 32 KB, Assoc 8, LineSize 64
---------------------------*---- Unified Cache 41, Level 2, 1 MB, Assoc 16, LineSize 64
----------------------------*--- Data Cache 28, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------*--- Instruction Cache 28, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------*--- Unified Cache 42, Level 2, 1 MB, Assoc 16, LineSize 64
----------------------------**-- Unified Cache 43, Level 3, 22 MB, Assoc 11, LineSize 64
-----------------------------*-- Data Cache 29, Level 1, 32 KB, Assoc 8, LineSize 64
-----------------------------*-- Instruction Cache 29, Level 1, 32 KB, Assoc 8, LineSize 64
-----------------------------*-- Unified Cache 44, Level 2, 1 MB, Assoc 16, LineSize 64
------------------------------*- Data Cache 30, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------*- Instruction Cache 30, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------*- Unified Cache 45, Level 2, 1 MB, Assoc 16, LineSize 64
------------------------------** Unified Cache 46, Level 3, 22 MB, Assoc 11, LineSize 64
-------------------------------* Data Cache 31, Level 1, 32 KB, Assoc 8, LineSize 64
-------------------------------* Instruction Cache 31, Level 1, 32 KB, Assoc 8, LineSize 64
-------------------------------* Unified Cache 47, Level 2, 1 MB, Assoc 16, LineSize 64
Logical Processor to Group Map:
******************************** Group 0
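The part of this output that matters for NUMA is the "Logical Processor to NUMA Node Map": each row is a mask with one character per logical processor, where `*` marks membership. A minimal sketch for turning such rows into processor lists follows; the helper name is hypothetical, and the sample text is the two NUMA rows from the Coreinfo output above.

```python
from typing import Dict, List

# The two NUMA rows copied from the Coreinfo output above.
NUMA_NODE_MAP = """\
****************---------------- NUMA Node 0
----------------**************** NUMA Node 1
"""


def parse_mask_map(text: str) -> Dict[str, List[int]]:
    """Map each label (e.g. 'NUMA Node 0') to the logical processors it contains."""
    result: Dict[str, List[int]] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # mask first, then the rest of the line as label
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        mask, label = parts
        result[label.strip()] = [i for i, ch in enumerate(mask) if ch == "*"]
    return result


for label, cpus in parse_mask_map(NUMA_NODE_MAP).items():
    print(f"{label}: {len(cpus)} logical processors ({cpus[0]}-{cpus[-1]})")
```

For this guest the masks show 16 logical processors in each of the two NUMA nodes, so the 32 vCPUs are presented to Windows as two evenly sized nodes. The same helper also works on the socket and cache rows, since they use the same mask format.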