My BSOD Odyssey

Trouble

Noob Whisperer
After four days of frustration, a very sore forehead, and much more gray hair (though probably less hair overall, from pulling it out in angry disappointment), I've managed to resolve a very aggravating, intermittent, and quite random Blue Screen issue on my computer.

First, my system, or at least the primary components involved:
Motherboard = Gigabyte GA-870-UD3 (Rev 2.1)
CPU = AMD Phenom II x6 Black Edition Thuban 1090T 3.2 GHz AM3 125W
Memory = G. Skill Ripjaws Series DDR3 1333 (PC3 10666) 16 GB (4 x 4 GB sticks)

Second, the problem:
Code:
Probably caused by : ntkrnlmp.exe ( nt!KeWaitForMultipleObjects+611 )
Followup: MachineOwner
---------
5: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the bugcheck
Arg2: fffff80003092248, Address of the instruction which caused the bugcheck
Arg3: fffff8800a5737f0, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero
Code:
Probably caused by : memory_corruption ( nt!MiFindNodeOrParent+0 )
Followup: MachineOwner
---------
4: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 0000000000000028, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, bitfield :
 bit 0 : value 0 = read operation, 1 = write operation
 bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff80003111628, address which referenced memory
Code:
Probably caused by : memory_corruption ( nt!MiDeleteSystemPagableVm+2d7 )
Followup: MachineOwner
---------
4: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c000001d, Exception code that caused the bugcheck
Arg2: fffff8000307c467, Address of the instruction which caused the bugcheck
Arg3: fffff8800715b890, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero
Code:
Probably caused by : ntkrnlmp.exe ( nt! ?? ::FNODOBFM::`string'+330bc )
Followup: MachineOwner
---------
5: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
MEMORY_MANAGEMENT (1a)
    # Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 0000000000000403, The subtype of the bugcheck.
Arg2: fffff680000a6b30
Arg3: a42000021283c867
Arg4: f7fff680000a6b30
Code:
Probably caused by : hardware ( Wdf01000!FxRequestBase::CompleteSubmitted+177 )
Followup: MachineOwner
---------
1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 00000000000000f4, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000001, bitfield :
 bit 0 : value 0 = read operation, 1 = write operation
 bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff800030e7784, address which referenced memory
These five BSODs were produced very recently and rather consistently while doing some very resource-intensive work: converting some very old VHS tapes to digital format, converting to HD, and encoding for DVD. Before that I had only had a couple of BSODs, and they had been very random (a month or so apart) and without any apparent reason (in one instance the machine was completely idle).
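
For anyone with a pile of dumps to sort through, here is a minimal sketch of tallying the bugcheck names and codes from pasted `!analyze -v` output. The function name `tally_bugchecks` and the regular expression are illustrative assumptions, not part of WinDbg or any tool mentioned in this post; the sample text is just the five header lines from the dumps above.

```python
import re
from collections import Counter

# Matches a bugcheck header line such as "SYSTEM_SERVICE_EXCEPTION (3b)".
# Optional [B]...[/B] forum markup around the name is tolerated.
BUGCHECK_RE = re.compile(
    r"^(?:\[B\])?([A-Z][A-Z0-9_]+)(?:\[/B\])? \(([0-9a-fA-F]+)\)$"
)

def tally_bugchecks(text):
    """Count how often each (name, code) bugcheck pair appears in pasted output."""
    counts = Counter()
    for line in text.splitlines():
        match = BUGCHECK_RE.match(line.strip())
        if match:
            name, code = match.groups()
            counts[(name, int(code, 16))] += 1
    return counts

# Header lines copied from the five dumps in this post.
sample = """\
SYSTEM_SERVICE_EXCEPTION (3b)
IRQL_NOT_LESS_OR_EQUAL (a)
SYSTEM_SERVICE_EXCEPTION (3b)
MEMORY_MANAGEMENT (1a)
IRQL_NOT_LESS_OR_EQUAL (a)
"""

for (name, code), n in tally_bugchecks(sample).most_common():
    print(f"{name} (0x{code:X}): {n}")
```

Three different bugcheck types across five crashes, with no single driver named twice, is the kind of scatter that points away from one bad driver and toward something underneath, such as memory.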

So having this information and some familiarity with troubleshooting BSODs, I'm thinking "Memory". From watching cybercore's fine work in our BSOD forum, I also know that these very same or similar Blue Screens can be produced by bad drivers, overclocking, overheating, improperly set memory frequencies and timings, etc. But I was already aware of these issues and had eliminated them earlier: I monitored temperatures (heat has never been a problem), updated all drivers to the most current versions available, even resorted to the latest Beta BIOS available for my board, and I have never overclocked this PC.

So, arming myself with Prime95 and MemTest86, I quickly found that when Torture Testing with Prime95's Blend test, which "tests some of everything" (CPU and lots of RAM), I would start getting errors on several workers after about the 6th test. But if I chose to run the Small FFTs or In-place large FFTs, where RAM is not tested as much, the tests would run well without errors.
Also, when I ran MemTest86 with all four sticks of RAM in place, it would not get more than 20% of the way through a single pass without producing errors. So now I'm pretty sure I have some bad memory.

Well, properly testing four 4 GB sticks of RAM individually (multiple passes (10), multiple slots (all)) is a bit daunting and very time consuming and, as it turns out, no help, since every stick passed flawlessly without a single error in every slot. Still, it had to be done for my peace of mind. So now what?

Well, I think, it must be a combination of sticks and slots, based on some type of problem with how the slots and banks interact. So I set about testing every conceivable combination of sticks and slots and found that they all passed without any errors being produced. So now what?
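
To give a feel for how big "every conceivable combination" gets, here is a rough sketch that enumerates the possible layouts. The stick labels, the slot numbering, and the `layouts` helper are made up for illustration, and it treats every stick-to-slot placement as a distinct test, which may be more exhaustive than what was actually run.

```python
from itertools import combinations, permutations

STICKS = ["A", "B", "C", "D"]   # hypothetical labels for the four modules
SLOTS = [1, 2, 3, 4]            # hypothetical slot numbering on the board

def layouts(num_sticks):
    """Yield every way to place num_sticks distinct sticks into distinct slots."""
    for slots in combinations(SLOTS, num_sticks):
        for sticks in permutations(STICKS, num_sticks):
            # Map each chosen slot to the stick installed in it.
            yield dict(zip(slots, sticks))

for n in range(1, 5):
    total = sum(1 for _ in layouts(n))
    print(f"{n} stick(s): {total} layouts")
```

With multiple MemTest86 passes per layout, even a fraction of that list eats days, which is about how long this took.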

Well, I think, it must have something to do with bringing the second bank into the equation when the first bank is populated correctly. So I tested this hypothesis with a third stick in one of the second bank's slots (alternating which bank is fully populated and which single slot of the other bank is used). Again, all tests passed flawlessly. So now what?

Well, I think, there must have been a single stick that was poorly seated during the original four-stick test. So I installed all four sticks again, making sure that all four were firmly seated, fired up MemTest86 and BAM, 19% into the first test, errors all over the place. So now what?

Well, I think, I have a computer that will run flawlessly with 12 GB (any three sticks) of RAM; that's not too bad, is it? Do I really need 16 GB of RAM? Probably not. But I just can't let an issue like this go unresolved. So what now?

Off to the inter-web, that vast compendium of all information about everything. As you may well guess, mine is not an easy problem to get pinpoint-accurate results for from any given search engine, so after much searching of the search results I finally came upon the thread linked at the end of this post. Both resolutions included in that article work and resolved my problem. I chose dumbing down the memory from 1333 to 1066 and tightening the timings to 7-7-7-20 as the long-term solution, since increasing voltages (minor though the increases are) seemed like the riskier solution over the long haul (years), and dropping the frequency and adjusting the timings doesn't seem to have produced any noticeable slowing of the computer. I know technically it has; it's just not noticeable. I now have a computer that is stable and can run Prime95 for hours and hours, as well as 10 passes of MemTest86 with all four sticks of RAM in place, without error.
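
A back-of-the-envelope sketch of why the slowdown is hard to notice. The 1066 setting with CAS 7 comes from this post; the CAS 9 figure for 1333 is an assumption (a common stock rating for DDR3-1333 kits, not something stated here), and both helper functions are illustrative only.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """CAS latency in nanoseconds: cycles divided by the I/O clock.

    DDR transfers twice per clock, so the clock in MHz is half the data rate.
    """
    clock_mhz = data_rate_mts / 2
    return cas_cycles / clock_mhz * 1000

def peak_bandwidth_mbs(data_rate_mts, bus_bytes=8):
    """Theoretical peak bandwidth per channel in MB/s for a 64-bit (8-byte) bus."""
    return data_rate_mts * bus_bytes

settings = [
    ("DDR3-1333 CL9 (assumed stock timing)", 1333, 9),
    ("DDR3-1066 CL7 (the setting used here)", 1066, 7),
]

for label, rate, cl in settings:
    print(f"{label}: {cas_latency_ns(rate, cl):.2f} ns CAS latency, "
          f"{peak_bandwidth_mbs(rate)} MB/s peak per channel")
```

Under those assumptions the CAS latency in real time comes out roughly the same (about 13.5 ns versus about 13.1 ns), while theoretical peak bandwidth drops by about 20%. Since few everyday workloads are limited by peak memory bandwidth, that would fit with the change not being noticeable.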

Since this was such a giant pain for me to resolve, I thought I would post this information in the hope that it may help someone else with the same or a similar issue. After additional research, it seems that the problem (which I suspected to begin with was unique to AMD boards and processors) can happen with various boards, Intel-based systems as well, whenever all memory slots are populated with DDR3 1333 (or above) memory.
Hope this helps.
http://forum.giga-byte.co.uk/index.php/topic,4606.0.html
Randy.