Windows 10 Win 10 BSOD, RMA CPU?

Francis V

New Member
NOTE: Sorry for the long post, trying to be thorough.

Just for reference, these are the specs of the machine. Everything is running stock, it's a work machine:
OS: Win 10 Pro
CPU: i7-6700K
Motherboard: ASUS Z170-A LGA 1151
RAM: 2x G.SKILL Ripjaws V Series 16GB 288-Pin DDR4
Using onboard GPU
Combination of SSDs (EVOs) and HDDs (WD Blue and Gold)

So this computer started BSODing yesterday/the night before at 11pm (per the event log), after having been untouched since I left at 6pm. It BSOD'd frequently in the morning when I arrived (every 30 seconds to 5 minutes) and worsened to the point where it could barely get to the logon screen.

For most of my morning troubleshooting I assumed it was due to a Win 10 update and a driver conflict. The faulting module in the memory dump was intelppm, and the minidumps showed a 124 error (WHEA_UNCORRECTABLE_ERROR (124)), so I figured the chipset and/or CPU drivers were to blame.

However, this is what I've done so far, so now I'm looking for confirmation if I should try an RMA for the CPU (using onboard gpu, so no dedicated gfx card to test):
  1. The PC has 2x sticks of RAM, so I tried each individually, no change.
  2. I disconnected all hard disks except the primary, no change.
  3. I replaced the main disk with another disk with a Win 10 install on it, no change, same BSOD.
  4. I put in a fresh SSD and loaded the Win 10 installer from a new Win 10 Pro install disc. It BSOD'd TWICE loading the CD interface (Win 10 not even installed yet). I didn't even know you could BSOD at this point...
  5. The third try it did load the CD interface and got about halfway through the install of Win 10 until it BSOD'd.
  6. I replaced the PSU with a known-good one, no change.
  7. I managed to find another spare machine with a 6700 (non-K) chip, and put that in my machine. Problem gone. I can now boot back into the normal Win 10 install, as well as the secondary test install disk I put in, completely stable for over 18 hours now. It basically went from 30 or so BSODs an hour to nothing with this switch.
Up until that last step I was still suspecting that maybe both sticks of RAM failed or that the problem was with the mobo. I have never, in all my years in IT, had a CPU fail during normal use, so I'm still suspicious about my findings.

However, with the above tests I'm pretty sure I (mostly) ruled out the RAM, ruled out the hard drives, ruled out the OS, ruled out the PSU. Even though it looks completely certain that it's the CPU, I'm just finding that hard to believe. And if it is the CPU, has anyone experienced something like this? The PC has had a rock solid uptime of about 10 months, treated very well in a clean environment, CPU always ran cool, max temp it has reached is 61C.
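The elimination reasoning in the steps above can be written down as a small sketch. It assumes a single faulty part, and the component labels are hypothetical shorthand for the parts named in the post. Which RAM stick and PSU were fitted in each later step is not stated, so those entries are assumptions made only to keep the example complete.

```python
# Each trial: (set of parts installed, did the machine blue-screen?).
# Labels are hypothetical; RAM/PSU choices after step 1 are assumed, not stated in the post.
TRIALS = [
    ({"cpu_6700k", "mobo", "ram_a", "psu_orig", "disk_main"}, True),   # step 1, first stick
    ({"cpu_6700k", "mobo", "ram_b", "psu_orig", "disk_main"}, True),   # step 1, second stick
    ({"cpu_6700k", "mobo", "ram_a", "psu_orig", "disk_alt"}, True),    # step 3, other Win 10 disk
    ({"cpu_6700k", "mobo", "ram_a", "psu_orig", "ssd_fresh"}, True),   # steps 4-5, installer
    ({"cpu_6700k", "mobo", "ram_a", "psu_spare", "disk_main"}, True),  # step 6, known-good PSU
    ({"cpu_6700", "mobo", "ram_a", "psu_orig", "disk_main"}, False),   # step 7, CPU swapped
]


def suspects(trials):
    """Parts present in every crashing trial and absent from every stable one."""
    crashing = [parts for parts, crashed in trials if crashed]
    stable = [parts for parts, crashed in trials if not crashed]
    if not crashing:
        return set()
    common = set.intersection(*crashing)   # present in all crashing setups
    cleared = set().union(*stable)         # seen working in a stable setup
    return common - cleared


print(suspects(TRIALS))
```

The motherboard survives the intersection of the crashing trials, but step 7 clears it because the same board ran stable with a different CPU. That leaves only the 6700K, under the single-fault assumption.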

Anyway, any thoughts would be appreciated. I don't know how to go about testing the CPU for the purpose of an RMA process without an OS, so any clues on that would be great too!

Included are the Sysinfo file for specs and one of the minidumps in case that confirms anything (though to me the 124 error is probably way too generic).

The CPU I suspect is a 6700K, so the current 6700 that's working is a bit of a step down, but no biggie for my normal workload. Also I haven't bothered putting the second stick of RAM back in, so the specs only show 1 stick at 16 GB.
 

Attachments

  • E13 Files.zip
Yup! Seen lots of CPU failures over the years. I used to build them from scratch back in my foundry days. I also used to burn them up intentionally using a variety of methods, including overclocking and stress-testing with tools such as HeavyLoad. When teaching A+ certification courses at my local Junior College (ROP), we would intentionally install dead or failing CPUs from a drawerful of them we kept in the hardware labs, then teach the students how to identify a faulty CPU using software tools and such. Many of them had never seen a dead CPU or a dead mobo before the class. In the midterm and final labs, I would also bend the pins on the CPUs or even break them off, and of course that would produce some very wonky symptoms. Usually the "A" students would be able to figure it out in a 3 hr. lab test, but not always.

You can use tools such as Speccy, SiSoftware Sandra, CPU-Z, and HWMonitor to look at your CPU. There are also lots of tools available on the Ultimate Boot CD, which is free here: UBCD.com

I haven't done much of that with the more modern i3-i7 CPUs, but I'm sure the testing process hasn't changed that much. Intel also has some CPU testing diagnostics of its own (older ones are on the UBCD disc), so I recommend you check their website: intel.com

My friend who worked at Intel until last year mentioned something about 3 yrs. warranty on their 6th and 7th gen chips. If older than that, you'll probably have to pay something for the out of warranty RMA replacement and 1-way shipping back to Intel.

Best,
<<<<BIGBEARJEDI>>>>
 
Hey BBJ,

Problem with your suggestion of monitoring the CPU with those programs is that I can't put the CPU in a system to run anything, as it BSODs at logon. I was able to get into the BIOS and look at the basic specs, and the voltages and temp all looked fine. Once I get some more thermal paste I can throw the 6700K back in here and see if I can get it booted long enough to get a snapshot from HWMonitor and see if that tells us anything.

The chip is roughly 10 months old, so assuming it's actually bad that would still be within warranty.

And a bit off topic: in terms of physical damage I've seen problem CPUs as well, from being overclocked to death, condensation (pin rot), broken pins, etc.

Just nothing like this, since it never got hot, was never overclocked, and ran stably for 10 months until yesterday when it had problems out of nowhere (seemingly anyway). I can only suspect a manufacturing defect, not damage that happened after the fact.
 
Yesterday it didn't matter what I did, none of the advanced start up options allowed any sort of time in Windows (it even BSOD'd after a system refresh). I can't test it right now since the CPU is out and I'm out of thermal compound, but I'll be able to remedy that over lunch.
 
OK, then it is probably hardware. WHEA BSODs can be caused by a bad on-die CPU cache.
 
Hi Francis,
I'm just about to debug your dump file and will post back shortly.

Basically Bugcheck 124 means a hardware error occurred and usually this Bugcheck is linked with overclocking as well as overheating.

However

This bugcheck can also be caused by a myriad of things, and I've even seen an out-of-date copy of Chrome turn out to be the culprit. Basically we need to run a few tests in addition to those already tried.

Gimme an hour and I'll post back with more information.
 
Hi Kemical,

Great thanks!

The CPU's never been overclocked and always seemed to be running cool (in the 40 to 45C range during normal work), so neither of those should have anything to do with it, unless the PSU was feeding it too much power.

I'll wait on your response, maybe by then I'll have had a chance to put the CPU in question into the other PC to see what it does there. If it's a CPU hardware issue it should have the same problems there.
 
Code:
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 124, {0, ffffe0017b0e0028, b2000000, 14}

Probably caused by : GenuineIntel

Followup:     MachineOwner

Code:
PRIMARY_PROBLEM_CLASS:  0x124_GenuineIntel_PROCESSOR_TLB

A translation lookaside buffer (TLB) is a memory cache that stores recent translations of virtual memory to physical addresses for faster retrieval. This link will explain it in more detail than I could:
http://www.dauniv.ac.in/downloads/CArch_PPTs/CompArchCh10L05TranslLookAheadBuff.pdf
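The four bugcheck parameters above can be decoded by hand. As far as I know, for bug check 0x124 with a first parameter of 0 (machine check exception), parameters 3 and 4 are the high and low 32 bits of the failing bank's MCi_STATUS register. The sketch below is a minimal decoder under that assumption, using the status-flag bit positions and the TLB error-code pattern (0000 0000 0001 TTLL) from Intel's machine-check architecture documentation. The function and dictionary names are made up for illustration.

```python
# Flag bits in the upper part of IA32_MCi_STATUS (bit positions per the Intel SDM).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

# Sub-fields of a TLB-class MCA error code: 0000 0000 0001 TTLL
TLB_TYPE = {0b00: "instruction", 0b01: "data", 0b10: "generic"}
TLB_LEVEL = {0b00: "level 0", 0b01: "level 1", 0b10: "level 2", 0b11: "generic"}


def decode_mci_status(high32, low32):
    """Decode bugcheck 0x124 parameters 3 and 4 (hypothetical helper)."""
    status = (high32 << 32) | low32
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    mca_code = status & 0xFFFF  # architectural MCA error code, bits 15:0
    result = {"flags": flags, "mca_code": mca_code, "kind": "other"}
    if (mca_code & 0xFFF0) == 0x0010:  # matches the TLB error pattern
        result["kind"] = "TLB"
        result["tlb_type"] = TLB_TYPE.get((mca_code >> 2) & 0b11, "reserved")
        result["tlb_level"] = TLB_LEVEL[mca_code & 0b11]
    return result


# Parameters 3 and 4 from the dump: BugCheck 124, {0, ffffe0017b0e0028, b2000000, 14}
info = decode_mci_status(0xB2000000, 0x14)
print(info["kind"], info.get("tlb_type"), info.get("tlb_level"))
print([name for name, is_set in info["flags"].items() if is_set])
```

Read this way, the dump's values give an uncorrected (UC) data TLB error with processor context corrupt (PCC) set, which is consistent with the 0x124_GenuineIntel_PROCESSOR_TLB classification the debugger reports.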

This can be caused by corrupt software, but after reading your first post in detail I see you have pretty much tried everything to discount the software (OS).

The only thing I can't seem to find is results for the Intel® Processor Diagnostic Tool?
Purpose
The purpose of the Intel® Processor Diagnostic Tool is to verify the functionality of an Intel® Microprocessor. The diagnostic checks for brand identification, verifies the processor operating frequency, tests specific processor features, and performs a stress test on the processor.

Other than running the above, try clearing the BIOS settings by either removing the battery or holding down the little black button. Reboot and load 'optimised settings' before making any further changes.

I'll wait on your response, maybe by then I'll have had a chance to put the CPU in question into the other PC to see what it does there. If it's a CPU hardware issue it should have the same problems there.
Agreed and good idea.
 
Diagram from a book. TLB is inside the CPU die
CPU.PNG
 
It's a good read but definitely not for everyone. "The Art of Memory Forensics" by Michael Hale Ligh
 
Yeah, doesn't seem like something for pleasure reading...

As far as to what's going on right now, the CPU is now cozy in a stock Dell Optiplex 7040 and BSODs on startup shortly before it would reach the Windows log on screen (MACHINE_CHECK_EXCEPTION). This machine seems to be running Win 7, which I don't think should matter though. The BIOS detected the new CPU and reset its settings (original CPU is a 6700, now a 6700K, so should work just fine).

I can't get into Windows, so it's not as easy to get any logs.

Considering this is 100% separate hardware now and a different OS altogether, and it's still blue screening, I'm going to stick with my dead-CPU theory and try for an RMA with Intel.

I can't even run the Intel tool because I can't get into any OS with this CPU now, so that's not something I'll be able to get you. :|

Any other thoughts before I go the RMA route?
 
Given that it does more or less the same thing in two different machines, I'd go for the RMA toot sweet. Good luck on a speedy process.
 
I'm in agreement with kemical and neem on this; sounds to me as if the CPU is defective. I'd RMA it as advised. If you're as OCD as some of us, it might be worth trying to boot from an Ubuntu LiveCD or USB stick to see if Linux will boot. That completely takes Windows software out of the picture, and you can even disconnect your boot drive (C: drive) and run Linux from the disc or USB stick in RAM. [THIS STEP IS OPTIONAL, BUT REALLY NARROWS THINGS DOWN IF THE HDD IS OUT OF THE PICTURE!] If Ubuntu also fails to load, or gives weird load errors, then clearly the CPU chip is bad, as I believe you took the necessary steps to make sure the mobo is OK. You can do this in a few hours or maybe half a day. It's pretty easy, and the website and instructions for Ubuntu are here: Download Ubuntu Desktop | Download | Ubuntu

Ubuntu is a terrific diagnostic tool, and it's gotten me out of some real jams at customer sites where they want to just throw their computer away because some other tech or friend told them the mobo or CPU chip is gone. If Ubuntu boots, not so!! Usually it's just a bad hard drive, RAM stick, or corrupted Windows. Replace the bad parts, reload Windows, and 99.9% of the time it's fixed! I carry an Ubuntu USB stick on my keychain wherever I go, even on vacation! It's so handy to be able to determine in just a couple of minutes whether it's a mobo/CPU problem or a faulty drive/RAM stick or Windows issue.

Best,
<<<BBJ>>>
 