InformB

New Member
Joined
Mar 27, 2023
Messages
11
I've been using Windows Server 2012R2 Standard for almost a decade now. I have apps running on the server that have created about 50,000 local users. These are local users created using WMIC scripts and equivalent APIs. The purpose is to allow for easy integration with IIS authentication: the app creates a local user account, and IIS uses the built-in Windows authentication to authenticate the user.
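In essence, each account boils down to three commands: create the user, add it to a custom group, and remove it from the standard Users group. A minimal sketch of that shape in Python (just building the command lines; the `AppUsers` group name and the username/password parameters are hypothetical placeholders, and `net user` stands in for whatever WMIC/API route your scripts actually use):

```python
def build_account_commands(username, password, group="AppUsers"):
    """Build the per-account commands described above: create the
    local user, add it to a custom group, and remove it from the
    standard Users group (network logon only, no local login).
    'AppUsers' is a hypothetical group name; adapt as needed."""
    return [
        f"net user {username} {password} /add",
        f"net localgroup {group} {username} /add",
        f"net localgroup Users {username} /delete",
    ]
```

The generated lines would then be fed to cmd.exe on the server, one account at a time, 50,000 times over.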

It's been running great on 2012R2, but I recently decided to upgrade to the 2019 Datacenter edition, so I created a new 2019 server and used the same scripts to recreate the 50,000 users on it.

However, when I reboot the 2019 server, it just hangs for hours at the spinning circle and takes up 100% CPU. It never gets to the login screen. I've never had this issue with 2012R2; to test it, I spun up a new 2012R2 server and recreated the same 50,000 local users with the scripts, and it took a few minutes to reboot, but that was it.

Is there something different about 2019 Datacenter vs. 2012R2 Standard that causes the 2019 server to hang on boot when there are 50,000 local users? Is it trying to process all the users before presenting the login screen? (I've set it to log in automatically to an admin account, but it never seems to get there.) I'm at my wits' end now, and I'm open to suggestions on how to debug this further or which settings I should look at. I've checked the group security and logon policies, and they're the same on the 2012 and 2019 servers. What am I overlooking?

They're both running on AWS with 1 GB RAM. Thanks in advance, and please don't hesitate to throw out whatever ideas you may have. I cannot change the architecture at this time, and the easiest way to integrate IIS with authentication is to create a local user account. The accounts don't need local login privileges, only network logon (for IIS), so they are added to a custom group on the Windows server and removed from the standard Users group.
 
Solution
It's possible that the issue is related to the number of local users on the 2019 server. While there is no hard limit on the number of local users that can be created on a Windows server, very large numbers of accounts can cause performance issues.
You may want to try some troubleshooting steps to determine the cause of the issue:
1. Check the event logs on the 2019 server to see if there are any errors or warnings that might indicate what is causing the hang.
2. Try booting the server in safe mode to see if it can successfully boot without all of the installed services and drivers. If it does boot successfully, you can begin to narrow down the cause by selectively enabling services and drivers until the issue returns.
3. Check the system...
Looking at the Procmon logs for the 2019 server more carefully, I noticed that it's not that LSASS is taking a long time to parse the 50K users from the SAM: it appears to be stuck in a loop!! LSASS keeps reading all the SAM entries over and over again, in a loop, for 12 hours! Here's an excerpt from the logs showing where it ends one loop and then starts another. It's the exact same loop every time. I can't see the end of the loop because Procmon timed out after a few hours of logging, but for those few hours it reads the last user, then reads the NTDS service keys, and then restarts the SAM user loop again. Can anyone make any sense of this?
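One way to confirm the loop programmatically: export the Procmon trace to CSV and count how often a single marker key recurs. A rough sketch, assuming Procmon's default CSV column names ("Process Name", "Path"); the marker should be a registry path that appears exactly once per enumeration pass, e.g. the first user's SAM key:

```python
import csv

def count_sam_passes(csv_path, marker_path):
    """Count how many times lsass.exe touches the given marker key
    in a Procmon CSV export. If the marker occurs once per SAM
    enumeration, this equals the number of full passes."""
    passes = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("Process Name") == "lsass.exe" and marker_path in row.get("Path", ""):
                passes += 1
    return passes
```

A count much greater than 1 over a single boot would back up the "stuck in a loop" reading rather than a single slow enumeration.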

 
Got it. Let's get straight to troubleshooting your issue without any fluff.

Observations:

  • Fresh Windows Server 2019 image fails to boot after running scripts to add 50,000 users.
  • Windows Server 2012R2 handles the exact same process without problems.
  • Hangs at boot with 100% CPU usage and no usable logs or Safe Mode access.

Likely Causes:

  1. User Profile Service Overload:
    • Windows Server 2019 might be hitting a performance bottleneck or exhaustion of system resources (memory, CPU, or storage IOPS) due to handling such a high number of user accounts.
  2. Registry/Log Size Issue:
    • Creating this many users may inflate the SAM (Security Accounts Manager) database or expand the Windows registry beyond manageable sizes, causing 2019 to hang on boot.
  3. Service or Driver Incompatibility:
    • Differences in how services or drivers are designed in 2019 compared to 2012R2 might mean one is choking on tasks the other handles better.
  4. File System or Access Rights:
    • Permissions or file system lockouts could be interfering with required system services during startup.
  5. CPU Scheduling:
    • The OS may be burning CPU cycles trying to handle all user sessions during startup. Resource allocation differences between 2019 and 2012R2 may be at play.

Recommended Steps:

1. Try a Minimal Boot via Command Line:

  • Boot to recovery using installation media.
  • Open a Command Prompt and reset non-essential services:
    Code:
    sc config UserManager start= disabled
    sc config LanmanServer start= disabled
  • Restart and see if it gets past the circle spinner.

2. Analyze the SAM/Registry:

  • Boot into recovery mode and mount the drive to analyze:
    • Check C:\Windows\System32\Config\SAM and C:\Windows\System32\Config\System for unusual file sizes. A bloated SAM database can crash the system.
    • Consider restoring these hives from a backup, or inspecting them offline with a registry editor, to isolate the problem. (Note that NTDSUtil manages the Active Directory database, not the local SAM, so it doesn't apply here.)
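If the drive is mounted offline, the hive size check is easy to script. A small sketch (pure stdlib; pass the mounted Config directory, e.g. the offline volume's Windows\System32\Config path):

```python
import os

def hive_sizes(config_dir):
    """Report on-disk sizes (bytes) of the registry hives most likely
    to be bloated by mass account creation. Assumes the drive is
    mounted offline, e.g. from a recovery environment."""
    sizes = {}
    for hive in ("SAM", "SYSTEM", "SECURITY", "SOFTWARE"):
        path = os.path.join(config_dir, hive)
        if os.path.exists(path):
            sizes[hive] = os.path.getsize(path)
    return sizes
```

Comparing the 2019 hive sizes against the working 2012R2 box with the same 50,000 accounts would show whether bloat is really the difference, or whether both SAMs are similar and the problem lies in how 2019 processes them.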

3. Isolate User Script Issues:

  • If everything runs fine BEFORE users are added:
    • Test adding 10,000 users instead of 50,000 to confirm the exact tipping point.
    • Validate the script for potential issues with account properties (e.g., overly long names, restrictive permissions).
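Rather than stepping by fixed amounts, a binary search over the account count finds the tipping point in roughly log2(50,000) ≈ 16 provisioning runs. A sketch of the search logic only; `boots_ok` is a hypothetical callback you would implement by provisioning that many accounts and rebooting:

```python
def find_tipping_point(boots_ok, lo=0, hi=50_000):
    """Binary-search the largest account count that still boots
    normally. `boots_ok(n)` is a hypothetical callback: provision n
    accounts on a fresh server and return True if it boots. Assumes
    the failure is monotone in n (more accounts never helps)."""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias up so the loop terminates
        if boots_ok(mid):
            lo = mid   # mid accounts still boot; look higher
        else:
            hi = mid - 1  # mid accounts hang; look lower
    return lo
```

If there is a sharp tipping point, that number is a strong clue (e.g. a size threshold); if failures are gradual or intermittent, it points more toward a timeout or scheduling issue.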

4. Analyze Boot Process via Debugging Tools:

  • Attach debug tools (e.g., WinDbg) if possible:
    • Hook the machine to another system to trace what process or thread is consuming resources during boot.

5. Check for Known Server 2019 Limitations:

  • Look into known scalability or configuration differences between 2012R2 and 2019 related to user accounts or file system limits.

6. Leverage Disk Cleanup & Reset:

  • Clear temporary files, logs, or cache before rebooting.

This seems like either a scalability issue or a specific driver/service problem. Your next step depends on whether isolating components helps identify a breaking point. Let me know what results or clues you uncover, and I’ll assist further.