Windows 10 1809 Install + Threadripper 2 + x399 + Intel 9260 = BSOD Perfect Disaster

In the fall of 2018 I built an epic PC for gaming and machine learning. It was based on the latest technology and included a number of high-end parts, including but not limited to an AMD Threadripper 2950X, an ASRock X399 Taichi motherboard, 2x NVIDIA GeForce RTX 2080 Ti in SLI, and a Samsung 970 Evo M.2 SSD (full parts list @ pcpartpicker). When I built the machine, I loaded it with Windows 10 (I think version 1803) and everything pretty much worked great. I was even able to hold a 3DMark score in the top 50 or 20 for a few days, until the serious overclockers took over with their custom voltage regulators and LN2. Now, I’m not even in the top 100.

19th for a few days 😉 Now, I’m not even in the top 100.

So, a few weeks ago I wanted to refresh the system, clear out any possible malware, and start fresh. I didn’t remember which version of Windows I had, or what was on which USB key, and before I knew it I had overwritten my old USB installer key with a new one for Windows 10 1809. Little did I know that this is where my problems and headaches would begin.

When I went to install a fresh copy of Windows 10 1809, I started to run into a BSOD (blue screen of death) with a BUGCODE_NDIS_DRIVER stop code.


BSOD (blue screen of death) with a Stop code: BUGCODE_NDIS_DRIVER

Then the system would restart and produce a new “Install Windows” error in a dialog box.  This error reads:


The computer restarted unexpectedly or encountered an
unexpected error. Windows installation cannot proceed.  To install
Windows, click “OK” to restart the computer, and then restart the
installation.

I must have tried 4 or 5 times to just start over, then I tried to locate a different version of the OS. The best I could find (from Microsoft) was an ISO of 19H1, but this was much worse, as I didn’t even get the dialog.

TL;DR: Eventually I found a workaround that didn’t really work (and broke Windows Update) before discovering the actual solution (on an MSI forum): install the correct, updated Intel 9260 drivers.

The first workaround I tried was on Windows 10 version 1809, where I used a registry hack I found to continue the installation process, which seemed to work. I got the OS booted and everything seemed fine at first, but after I started installing software I noticed that Windows Update was stuck and wouldn’t load any more updates, some of which I knew were critical to the work I wanted to do. The machine also seemed clunky. Don’t get me wrong, it was still blazing fast on 3DMark, but little things like window and mouse jitter seemed odd for such a fast new machine.

After a week of waiting for it to fix itself, I started to investigate the problem further. I tried a few fixes that ended up not working and tried to “upgrade” Windows. I tried 19H1, and even tried to “upgrade” to 1809, both of which failed, BSOD’ed, and then restored me to a somewhat working OS without Windows updates.

I continued to try various things, none of which worked, and many of which worried me that I was making my Windows 10 Install more unstable.

It was time to get help.

I started with the Windows 10 Troubleshooter, but it reported that it couldn’t find anything wrong.  Then I went on to use the “Get help” link on the Windows Update panel. This led me down the path of what seemed like several solutions, but again anytime I got to the critical point of manually installing the critical update pack, it would fail, or stall.

I scheduled a call with Microsoft Tech Support and a level 1 tech answered the phone.  She remoted into my machine, and walked me through a few things but soon realized that none of those things were going to help and then she recommended that we try “upgrading” or “re-installing”.  I told her that is how I got here and that wasn’t going to help. She then scheduled me for a Level 2 tech call 3-4 days later; and since my machine was mostly working I patiently waited.

8:00am, … 8:15am: missed call from Bellevue, WA. It must have been the Level 2 tech trying to call me while I was on my phone (I still have no idea why it didn’t ring). Eventually, by 8:20 or 8:30am, we had finally connected. My 2 year old daughter was now awake and playing in the background.

At this point the Level 2 tech had remoted into my machine (after I gave permission, of course) and we were trying more advanced stuff. She was clicking and typing confidently through my OS, shutting down services, trying to restart them, and trying to get Windows Update working again. After several attempts at using her ninja skills to force manual updates and/or get the Windows Update service working, she said we needed to try to reinstall. I informed her that I had tried this before and showed her what I did to get the OS working in the first place, but nevertheless she persisted. We tried the “upgrade/install” process like she insisted, and again it kicked us back to the saved restore point. She couldn’t see the BSOD, but I told her what I saw. After a few variants of this, she insisted we try a fresh install.

At this point I was still full of hope that she would resolve my issue, and since all of my data is on a 2nd HD, I disconnected that SATA drive from the machine and followed her lead. She made a custom install USB and we secure wiped the SSD and did a fresh install. Again the BSOD, and the error message! I was devastated but still hopeful, though from here on out there was no restore point. I was eagerly awaiting the Level 2 tech to tell me what to do next, but then I realized that she had given up too. I was so mad: I had started the call with a somewhat working PC and now I had a machine that didn’t even have an OS on it. Hours of setup time wasted again!

She talked me off the ledge, and scheduled me for an in-store visit (yes, the Microsoft Store, in the mall). I was a little offended, to say the least, because she was implying, but never said, that there was an “issue between the chair and the keyboard” (an old tech saying for when they think the operator is in error). She kept blaming my hardware too, scapegoating hardware that was working just an hour ago. I insisted it was a software / OS issue, but she held her ground and scheduled a follow-up call with me.

I was pissed, because now I would most likely have to visit an MS store with an even less experienced tech.

I carried my 30-40 lb “copper” brick through the mall and arrived at the store counter. They wanted to check in my box and tried to tell me that I’d have to leave my PC with them for at least 3-5 days. I was shocked and annoyed; the Level 2 tech didn’t warn me of this at all. I tried to find out what they were going to do to resolve the issue, and after about an hour of debating with them I realized that they were basically going to try the same stuff, monkey with the BIOS randomly, and then blame it on the hardware (which had been working all along).

We exchanged emails and parted ways. They told me that, based on the images I had sent them, if they found out anything they would send it along.

Later that night I did get an email from the Microsoft Store.

Hello Justin,

Here is communication I received from our team member in Redmond that he suggested. If you have any questions please let us know.

“I would advise the customer look at this thread: https://www.reddit.com/r/Amd/comments/721s6z/threadripperasrock_x399_taichi_windows_10_needs/

It appears that IOMMU (Input–output memory management unit) needs to be disabled in the North Bridge EFI settings in the UEFI as it can cause issues with the installation. If issues persist I would point the customer to the motherboard manufacturer for support as each of his parts in his custom build have their own unique OEM warranty. “

I followed the suggestion and tried a few combinations of changing the IOMMU and virtualization settings in the BIOS.  I had no luck, and in fact changing some of these settings resulted in me having to reset my CMOS settings to default. When they did work, the BSOD and the error message would inevitably appear again.

This got me thinking to search the ASRock forums and support pages. It was at this point that I noticed that there was a new BIOS, v3.50, which is odd because I had been checking for one for weeks with the ASRock APP Shop, which just a day earlier didn’t report the new BIOS. I downloaded the new v3.50 BIOS from 1/9/2019 and flashed the MB. (I don’t think the new BIOS helped, as the ultimate solution wasn’t in the change log.) I then tried the MS Store advice again and ran into the same problems.

After again scouring the web for new solutions, I happened upon an MSI x399 thread that offered new hope ( HELP. Can’t install Win10 Keeping getting blue screen and “STOPCODE BUGCODE NDIS DRIVER” ). It was at this point that I realized that the problem wasn’t unique to my Taichi board, but affected other MBs as well.

I found a few posts mirroring the concern over the IOMMU, pointing at the NVMe M.2 SSD interface or the 10+ core count in the Threadripper line, or just blaming Windows 10 v1809 as the problem. Some people even recommended a few sites to download 1709 and 1703 and claimed that they were able to get around the issue that way.

I don’t think you should try any of these solutions. IOMMU support is actually critically important to Threadripper because it has a NUMA (non-uniform memory access) architecture, where blocks of RAM are accessible through different CPUs. I think disabling it can cause huge issues. Even if you could install the OS on a SATA disk, why would you want to when you have an M.2 SSD? Cutting down your core count might be possible in the MSI BIOS, but it didn’t work on my ASRock X399 Taichi, and who knows, maybe Windows (if it did install) would think you have two different PCs trying to use that license once you change the core count. Finally, why use an older version of the OS if you can get 1809 working?

I eventually found this post, MSI MEG X399 Creation, Blue Screen during Win10 Install (and a fix!), which points out that this is an Intel 9260 driver issue with Windows 10 1809.

Normally this error might not affect the ASRock Taichi x399, as it comes with an Intel 3168 Wireless-AC module, however, I replaced mine with something faster and with BT 5.0 ( the Intel Wireless-AC 9260 ). The default 3168 Taichi wireless module is actually pretty slow, so, you can see why I would replace it with the 9260.

Since all of the MSI MEG x399 boards come with this module, this explains why I wasn’t able to find a solution when I was limiting my scope to ASRock MBs which come with the 3168 instead.

TL;DR: Once I downloaded the driver, I copied the unzipped folder to the root directory of a 2nd USB stick (though you might be able to keep it on the Windows install USB). When you get to the point in the install where you’re selecting your install location (disk / partition / drive), you have the options to format, etc., and to load a driver. That is when you load the 9260 driver; if you wait, you’ll get a BSOD and a corrupted install, and have to start over. (I installed v20.120.1, Win10 64-bit only, 3/6/2019.)

Windows threw up a dialog about not finding any valid drivers. Close this; you’ll have a few windows to close before you can browse for your driver. Locate the directory of your 9260 drivers and install. After this it was smooth sailing for me, and the machine felt smoother and more responsive than it did before all of this started. It didn’t post any faster 3DMark scores, however, so don’t expect any miracles.
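If you want to double-check the driver folder on that 2nd USB stick before you boot into Setup, here is a minimal Python sketch you could run on another PC. It rests on one assumption: that the "load driver" browse step is looking for .inf files, so an unzipped driver folder with no .inf files in it probably won't be picked up. The function name and the demo file names are made up for illustration; they are not part of Intel's package.

```python
import tempfile
from pathlib import Path


def find_inf_files(driver_dir):
    """Return the .inf files under driver_dir as sorted, relative POSIX paths."""
    root = Path(driver_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        # Match the extension case-insensitively (.inf, .INF, ...).
        if p.is_file() and p.suffix.lower() == ".inf"
    )


if __name__ == "__main__":
    # Demo on a throwaway folder; the folder and file names are hypothetical.
    with tempfile.TemporaryDirectory() as tmp:
        fake = Path(tmp) / "Win10"
        fake.mkdir()
        (fake / "example_wifi.inf").write_text("")
        (fake / "readme.txt").write_text("")
        found = find_inf_files(tmp)
        print(found if found else "No .inf files found; did you unzip the download?")
```

To use it for real, swap the temporary folder for the path of your USB stick and check that at least one .inf file shows up in the list.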

Summary

While the Microsoft staff was friendly, I found it incredibly frustrating that ultimately they couldn’t help me.

This is a huge failing on their part, especially since Mac OS and Linux are free by comparison. The hardware was supported and functioning, so you’d expect a certain level of support, especially when you buy the Retail Full Version of the OS for $200.

It’s a 2nd major failing of the MS support scripts that they so willingly blame the hardware. All of the hardware is from major brands that state compatibility with Windows 10 64-bit. It was a 3rd failing of MS that the 1809 installer fails so brutally over such a minor issue; even if earlier versions of the Intel 9260 driver make the installer BSOD, there are plenty of mitigation strategies MS could employ. It was a 4th fail on MS’s part that the 19H1 release still does not correct this, when it’s only a driver issue and Intel has had a new working driver for 6 months. Their 5th fail was that none of their staff, nor their KB (Knowledge Base), was aware of the issue, even though it affected ALL MSI X399 Creation boards. The Level 2 tech (the highest MS has) didn’t even try this approach before giving up.

In the process of trying to solve this problem, I saw a lot of reported issues with the 1809 and 1803 installers. On top of that, Windows 10 has had a number of issues with Threadripper since its release. However, this is to be expected with new platforms, and AMD is probably as much to blame as MS on this one. AMD needs to better support its driver teams and get these issues worked out as early as possible; this is one thing that NVIDIA and Intel still do better than AMD. With all that said, somehow Intel’s latest driver didn’t make it into the installer.

Now that my PC is working with 1809, and it’s not that bad to reinstall, I’m a happy customer, but I wish this whole process had been a lot easier. I’m hoping that anyone reading this article can help apply pressure to MS, Intel, and AMD to make this better, because they are not going to do this on their own, especially if 19H1 is any indication.

Crossposted to medium