How to Defeat VMware's Purple Screen of Death?

Many VMware ESXi administrators have run into the Purple Screen of Death. The most frustrating part is that it undermines trust in your own infrastructure: you keep wondering whether the same failure will repeat on another server.

What is PSOD?

PSOD stands for Purple Screen of Diagnostics, more often called the Purple Screen of Death by analogy with the better-known Blue Screen of Death in Microsoft Windows.

It is a diagnostic screen that VMware ESXi displays when the kernel hits a fatal error from which it cannot safely recover or after which it cannot continue to run.

It shows the state of memory at the time of the failure, along with information that is important for finding the cause: ESXi version and build, exception type, register dump, backtrace, server uptime, error messages, and information about the kernel dump (a file created after the error that contains additional diagnostic data).

The screen is displayed on the server console. To see it, you either need to be in the datacenter and connect a monitor, or connect remotely through out-of-band server management (iLO, iDRAC, IMM, and so on, depending on your vendor).

Picture 1

Why does PSOD appear?

PSOD -   . , ESXi UNIX, UNIX. ESXi (vmkernel) , , , . : ESXi , , «» , , « » , !

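Common causes of PSOD:

1. Hardware failures, mostly related to RAM or CPU. They usually result in an «MCE» or «NMI» error.

«MCE» is a Machine Check Exception, an error that the CPU raises when it detects a hardware problem.

«NMI» is a Non-Maskable Interrupt, a hardware interrupt that the processor cannot ignore. Because an NMI signals a serious HW problem, ESXi 5.0 and later respond to it with a PSOD.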

2.

· ESXi SW (see KB2105711)

· (see KB2136430)

· (see KB2034111, KB2150280)

· (see KB2105522)

· (see KB2012125, KB2127997)

3. (see KB2146526, KB2148123)

PSOD?

,   , , .     . HA, . , «» , , .

, , , , , VSAN, PSOD vSAN.

?

1. .

, -   . (IMM, iLO, iDRAC, …), , , . .

Figure 2

2. VMware.

, VMware, . (RCA).

3. ESXi.

, . , RCA, . , , DRS, , PSOD .

4. coredump

coredump. Coredump, vmkernel-zdump, , , , , . PSOD, 1, , coredump.

:

.  

b.  .dump  

c.  .dump   vCenter — netdump

Coredump ,   PSOD , . ESXi SCP, (, Notepad ++). , , . VMware , vmkernel, :

Figure 3

5. .

.  , , - , .  , : 

Exception Type 0 #DE: Divide Error

Exception Type 1 #DB: Debug Exception

Exception Type 2 NMI: Non-Maskable Interrupt

Exception Type 3 #BP: Breakpoint Exception

Exception Type 4 #OF: Overflow (INTO instruction)

Exception Type 5 #BR: Bounds check (BOUND instruction)

Exception Type 6 #UD: Invalid Opcode

Exception Type 7 #NM: Coprocessor not available

Exception Type 8 #DF: Double Fault

Exception Type 10 #TS: Invalid TSS

Exception Type 11 #NP: Segment Not Present

Exception Type 12 #SS: Stack Segment Fault

Exception Type 13 #GP: General Protection Fault

Exception Type 14 #PF: Page Fault

Exception Type 16 #MF: Coprocessor error

Exception Type 17 #AC: Alignment Check

Exception Type 18 #MC: Machine Check Exception

Exception Type 19 #XF: SIMD Floating-Point Exception

Exception Type 20-31: Reserved

Exception Type 32-255: User-defined (clock scheduler)

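For details, see the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, and the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide.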
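The exception table above can be turned into a small lookup helper. The following is a minimal sketch in Python, not a VMware tool: the function names `describe_vector` and `decode_psod_line` are hypothetical, and the regular expression only assumes the two message shapes that appear in this article (`Exception(13)` and `Exception type 14`).

```python
import re

# x86 exception vectors exactly as listed in the table above.
EXCEPTION_TYPES = {
    0: ("#DE", "Divide Error"),
    1: ("#DB", "Debug Exception"),
    2: ("NMI", "Non-Maskable Interrupt"),
    3: ("#BP", "Breakpoint Exception"),
    4: ("#OF", "Overflow (INTO instruction)"),
    5: ("#BR", "Bounds check (BOUND instruction)"),
    6: ("#UD", "Invalid Opcode"),
    7: ("#NM", "Coprocessor not available"),
    8: ("#DF", "Double Fault"),
    10: ("#TS", "Invalid TSS"),
    11: ("#NP", "Segment Not Present"),
    12: ("#SS", "Stack Segment Fault"),
    13: ("#GP", "General Protection Fault"),
    14: ("#PF", "Page Fault"),
    16: ("#MF", "Coprocessor error"),
    17: ("#AC", "Alignment Check"),
    18: ("#MC", "Machine Check Exception"),
    19: ("#XF", "SIMD Floating-Point Exception"),
}

# Assumption: the vector number follows the word "Exception", either in
# parentheses ("Exception(13)") or after "type" ("Exception type 14").
_EXC_RE = re.compile(r"Exception\s*(?:type\s*)?\(?(\d+)\)?", re.IGNORECASE)


def describe_vector(vector):
    """Return a short description for an exception vector number."""
    if vector in EXCEPTION_TYPES:
        mnemonic, name = EXCEPTION_TYPES[vector]
        return f"{mnemonic}: {name}"
    if 20 <= vector <= 31:
        return "Reserved"
    if 32 <= vector <= 255:
        return "User-defined"
    return "Unknown"  # vectors not present in the table above


def decode_psod_line(line):
    """Return (vector, description) for a PSOD message, or None if no vector is found."""
    match = _EXC_RE.search(line)
    if match is None:
        return None
    vector = int(match.group(1))
    return vector, describe_vector(vector)
```

For example, `decode_psod_line("#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e")` yields the vector 14 together with its Page Fault description, while a message without a vector number, such as `Lost Heartbeat`, yields `None`.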

VMware. PSOD:

LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed

NMI (1014767)

Panic requested by one or more 3rd party NMI handlers

COS Error: Oops

«» (1006802)

Lost Heartbeat

« » (1009525)

ASSERT bora/vmkernel/main/pframe_int.h:527

ASSERT NOT_IMPLEMENTED (1019956)

NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83

ASSERT NOT_IMPLEMENTED (1019956)

Spin count exceeded (iplLock) — possible deadlock

« » (1020105)

PCPU 1 locked up. Failed to ack TLB invalidate

TLB, (1020214)

#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303

13 14 (1020181)

#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e

Machine Check Exception: Unable to continue

Hardware (Machine) Error

(MCE) (1005184)

Hardware (Machine) Error

PCPU: 1 hardware errors seen since boot (1 corrected by hardware)
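The list above can be used as a quick lookup from a PSOD message to the knowledge base article number shown next to it. The sketch below is a hedged illustration in Python: the pairing of messages to numbers is inferred from which lines sit next to each other in the list, the patterns are simplified substrings of the sample messages, and `kb_for_message` is a hypothetical helper name.

```python
import re

# (pattern, article number) pairs read off the list above.
# Order matters: the first matching pattern wins.
KB_PATTERNS = [
    (re.compile(r"NMI"), "1014767"),
    (re.compile(r"COS Error: Oops"), "1006802"),
    (re.compile(r"Lost Heartbeat"), "1009525"),
    (re.compile(r"ASSERT|NOT_IMPLEMENTED"), "1019956"),
    (re.compile(r"Spin count exceeded"), "1020105"),
    (re.compile(r"Failed to ack TLB invalidate"), "1020214"),
    # Exception 13 (#GP) and Exception 14 (#PF) share one article.
    (re.compile(r"Exception\s*(?:type\s*)?\(?1[34]\b\)?"), "1020181"),
    (re.compile(r"Machine Check Exception|Hardware \(Machine\) Error"), "1005184"),
]


def kb_for_message(message):
    """Return the article number for a PSOD message, or None if nothing matches."""
    for pattern, kb_number in KB_PATTERNS:
        if pattern.search(message):
            return kb_number
    return None
```

A message that matches none of the patterns returns `None`, which is a signal to fall back to searching the knowledge base by the exact text on the screen.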

6.

, , , , - , , PSOD.  , , , .

, ,   (,  VMware Log Insight SolarWinds LEM ), , ,     .

:

/var/log/syslog.log

.

VMkernel

/var/log/vmkernel.log

, ESXi.  , PSOD, , .

ESXi

/var/log/hostd.log

, ESXi .

VMkernel

/var/log/vmkwarning.log

, .  , (Heap WorkHeap).

vCenter

/var/log/vpxa.log

, vCenter, , vCenter PSOD.

shell

/var/log/shell.log

, PSOD .
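Once the log files listed above have been copied off the host, it helps to narrow them down to the minutes before the crash. The following Python sketch shows one way to do that. It rests on two assumptions that are not stated in this article: that each log line begins with a UTC timestamp such as `2017-03-01T10:15:30.123Z`, and that a ten-minute window is a sensible default. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Log files named in the list above, relative to the copied /var/log directory.
LOG_FILES = ["syslog.log", "vmkernel.log", "hostd.log",
             "vmkwarning.log", "vpxa.log", "shell.log"]


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none.

    Assumption: lines start like '2017-03-01T10:15:30.123Z ...'.
    Only the first 19 characters (up to whole seconds) are parsed.
    """
    token = line.split(" ", 1)[0]
    try:
        return datetime.strptime(token[:19], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None


def lines_before_crash(log_dir, crash_time, minutes=10):
    """Collect, per log file, the lines logged in the window before crash_time."""
    start = crash_time - timedelta(minutes=minutes)
    result = {}
    for name in LOG_FILES:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that were not collected
        hits = []
        with path.open(errors="replace") as handle:
            for line in handle:
                stamp = parse_timestamp(line)
                if stamp is not None and start <= stamp <= crash_time:
                    hits.append(line.rstrip("\n"))
        result[name] = hits
    return result
```

Calling `lines_before_crash("/tmp/esxi-logs", datetime(2017, 3, 1, 10, 15))` would return a dictionary keyed by log file name, where `/tmp/esxi-logs` stands for whatever directory the logs were copied to. Lines without a leading timestamp, such as continuation lines, are skipped in this sketch.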



