Many VMware ESXi administrators have run into the Purple Screen of Death. The most annoying thing about this problem is that you start to distrust your own infrastructure: you keep worrying that the same failure will repeat itself on another server.
What is PSOD?
PSOD stands for Purple Screen of Diagnostics, often called the Purple Screen of Death by analogy with the better-known Blue Screen of Death in Microsoft Windows.
This is a diagnostic screen that VMware ESXi displays when the kernel hits a fatal error from which it cannot safely recover or after which it cannot continue to run.
It shows the state of memory at the time of the failure, along with other information that helps pin down the cause: ESXi version and build, exception type, register dump, backtrace, server uptime, error messages, and kernel dump information (the dump is a file created after the error that contains additional diagnostic data).
This screen is displayed on the server console. To see it, you need to either be in the datacenter and connect a monitor, or connect remotely using out-of-band server management (iLO, iDRAC, IMM, etc., depending on your vendor).
Why does PSOD appear?
PSOD - . , ESXi UNIX, UNIX. ESXi (vmkernel) , , , . : ESXi , , «» , , « » , !
PSOD:
1. , RAM CPU. «MCE» «NMI».
«MCE» — , . , , .
«NMI» — , , . NMI HW, , ESXi 5.0 , PSOD. . MCE, , NMI, , .
2.
· ESXi SW (see KB2105711)
· (see KB2136430)
· : , , (see KB2034111, KB2150280)
· + (see KB2105522)
3. ; , (see KB2146526, KB2148123)
PSOD?
, , , . . HA, . , «» , , .
, , , , , VSAN, PSOD vSAN.
?
1. .
, - . (IMM, iLO, iDRAC, …), , , . .
2. VMware.
, VMware, . (RCA).
3. ESXi.
, . , RCA, . , , DRS, , PSOD .
4. coredump
- coredump. Coredump, vmkernel-zdump, , , , , . PSOD, 1, , coredump.
:
b. .dump
c. .dump vCenter — netdump
Coredump , PSOD , . ESXi SCP, (, Notepad ++). , , . VMware , vmkernel, :
5. .
. , , - , . , :
Exception Type 0 #DE: Divide Error
Exception Type 1 #DB: Debug Exception
Exception Type 2 NMI: Non-Maskable Interrupt
Exception Type 3 #BP: Breakpoint Exception
Exception Type 4 #OF: Overflow (INTO instruction)
Exception Type 5 #BR: Bounds check (BOUND instruction)
Exception Type 6 #UD: Invalid Opcode
Exception Type 7 #NM: Coprocessor not available
Exception Type 8 #DF: Double Fault
Exception Type 10 #TS: Invalid TSS
Exception Type 11 #NP: Segment Not Present
Exception Type 12 #SS: Stack Segment Fault
Exception Type 13 #GP: General Protection Fault
Exception Type 14 #PF: Page Fault
Exception Type 16 #MF: Coprocessor error
Exception Type 17 #AC: Alignment Check
Exception Type 18 #MC: Machine Check Exception
Exception Type 19 #XF: SIMD Floating-Point Exception
Exception Type 20-31: Reserved
Exception Type 32-255: User-defined (clock scheduler)
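The list above maps naturally onto a small lookup table. The following is a minimal sketch, not part of any VMware tooling: the names `EXCEPTION_TYPES` and `describe_exception` are hypothetical, and the entries are copied from the list as given (vectors 9 and 15 are absent from it, so they are reported as not listed).

```python
# Hypothetical helper: vector numbers and names copied from the list above.
EXCEPTION_TYPES = {
    0: ("#DE", "Divide Error"),
    1: ("#DB", "Debug Exception"),
    2: ("NMI", "Non-Maskable Interrupt"),
    3: ("#BP", "Breakpoint Exception"),
    4: ("#OF", "Overflow (INTO instruction)"),
    5: ("#BR", "Bounds check (BOUND instruction)"),
    6: ("#UD", "Invalid Opcode"),
    7: ("#NM", "Coprocessor not available"),
    8: ("#DF", "Double Fault"),
    10: ("#TS", "Invalid TSS"),
    11: ("#NP", "Segment Not Present"),
    12: ("#SS", "Stack Segment Fault"),
    13: ("#GP", "General Protection Fault"),
    14: ("#PF", "Page Fault"),
    16: ("#MF", "Coprocessor error"),
    17: ("#AC", "Alignment Check"),
    18: ("#MC", "Machine Check Exception"),
    19: ("#XF", "SIMD Floating-Point Exception"),
}


def describe_exception(vector: int) -> str:
    """Return a readable label for an exception vector in the range 0-255."""
    if not 0 <= vector <= 255:
        raise ValueError(f"vector out of range: {vector}")
    if vector in EXCEPTION_TYPES:
        mnemonic, name = EXCEPTION_TYPES[vector]
        return f"{mnemonic}: {name}"
    if 20 <= vector <= 31:
        return "Reserved"
    if vector >= 32:
        return "User-defined (clock scheduler)"
    # Vectors 9 and 15 do not appear in the list above.
    return "Not listed"


if __name__ == "__main__":
    for v in (13, 14, 18, 25, 40):
        print(v, describe_exception(v))
```

A lookup like this is handy when the number is all you have written down from the console, for example from a photo of the screen.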
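For more information on these exceptions, see the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1, and the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.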
VMware. PSOD:
LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed
Panic requested by one or more 3rd party NMI handlers
COS Error: Oops
Lost Heartbeat
ASSERT bora/vmkernel/main/pframe_int.h:527
NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83
Spin count exceeded (iplLock) - possible deadlock
PCPU 1 locked up. Failed to ack TLB invalidate
#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e
Machine Check Exception: Unable to continue
Hardware (Machine) Error
PCPU: 1 hardware errors seen since boot (1 corrected by hardware)
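Two of the messages above (the `#GP` and `#PF` lines) carry structured data: the exception mnemonic, the vector number, the world ID and name, and the faulting address. As a rough sketch, they can be pulled apart with a regular expression. The pattern below is written only against those two sample lines and is an assumption; other ESXi builds may word the line differently. `PsodException` and `parse_exception_line` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PsodException:
    mnemonic: str    # e.g. "#GP"
    vector: int      # e.g. 13
    world_id: int    # e.g. 4130
    world_name: str  # e.g. "helper13-0"
    address: int     # faulting address as an integer


# Matches both "Exception(13)" and "Exception type 14" spellings.
_PATTERN = re.compile(
    r"(?P<mnemonic>#[A-Z]{2}) Exception\s*"
    r"(?:\((?P<paren>\d+)\)|type (?P<typed>\d+))"
    r" in world (?P<world_id>\d+):(?P<world_name>\S+)"
    r" @ (?P<addr>0x[0-9a-fA-F]+)"
)


def parse_exception_line(line: str) -> Optional[PsodException]:
    """Parse a PSOD exception headline; return None if it does not match."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    vector = m.group("paren") or m.group("typed")
    return PsodException(
        mnemonic=m.group("mnemonic"),
        vector=int(vector),
        world_id=int(m.group("world_id")),
        world_name=m.group("world_name"),
        address=int(m.group("addr"), 16),
    )


if __name__ == "__main__":
    print(parse_exception_line(
        "#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e"
    ))
```

Messages without this shape, such as `Lost Heartbeat`, simply return `None`, so the function can be run over every line typed in from the screen.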
6.
, , , , - , , PSOD. , , , .
, , (, VMware Log Insight SolarWinds LEM ), , , .
:
|
| |
| /var/log/syslog.log | . |
VMkernel | /var/log/vmkernel.log | , ESXi. , PSOD, , . |
ESXi | /var/log/hostd.log | , ESXi . |
VMkernel | /var/log/vmkwarning.log | , . , (Heap WorkHeap). |
vCenter | /var/log/vpxa.log | , vCenter, , vCenter PSOD. |
shell | /var/log/shell.log | , PSOD . |
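When going through the logs in the table above, it helps to look only at the minutes leading up to the crash. The sketch below assumes the log files have already been copied from the host into a local directory, and that each log line starts with an ISO 8601 UTC timestamp ending in `Z`. Both are assumptions, so check them against your own files. `scan_log_dir` and the other names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Iterable, Iterator, List, Optional, Tuple

# File names taken from the table above.
LOG_NAMES = [
    "syslog.log", "vmkernel.log", "hostd.log",
    "vmkwarning.log", "vpxa.log", "shell.log",
]

# Assumed line prefix: 2016-05-12T10:23:45.123Z (fraction optional).
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def parse_timestamp(line: str) -> Optional[datetime]:
    """Return the UTC timestamp at the start of a line, or None."""
    m = _TS.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    return ts.replace(tzinfo=timezone.utc)


def lines_in_window(
    lines: Iterable[str], crash_time: datetime, minutes: int = 30
) -> Iterator[Tuple[datetime, str]]:
    """Yield (timestamp, text) for lines within `minutes` before the crash."""
    start = crash_time - timedelta(minutes=minutes)
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None and start <= ts <= crash_time:
            yield ts, line.rstrip("\n")


def scan_log_dir(
    log_dir: str, crash_time: datetime, minutes: int = 30
) -> List[Tuple[datetime, str, str]]:
    """Collect matching lines from every known log, sorted by time."""
    hits: List[Tuple[datetime, str, str]] = []
    for name in LOG_NAMES:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that were not copied
        with path.open(errors="replace") as fh:
            for ts, text in lines_in_window(fh, crash_time, minutes):
                hits.append((ts, name, text))
    return sorted(hits)
```

Calling `scan_log_dir("./esxi-logs", datetime(2016, 5, 12, 10, 30, tzinfo=timezone.utc))` would return one merged, time-ordered list across all the files, which makes it easier to see what several components logged in the same minute. The directory name and date here are placeholders.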