50,000 users in 2019 hangs on boot but works on 2012R2

InformB

New Member
I've been using Windows Server 2012R2 Standard for almost a decade now. I have apps running on the server which have created about 50,000 local users. These local users are created using WMIC scripts and equivalent APIs. The purpose is to allow easy integration with IIS authentication: the app creates a local user account and IIS uses the built-in Windows authentication to authenticate the user.

It's been running great on 2012R2, but I recently decided to upgrade to 2019 Datacenter edition, so I created a new 2019 server and used the same scripts to recreate the 50,000 users on it.

However, when I reboot the 2019 server it just hangs for hours at the spinning circle and uses 100% CPU. It never gets to the login screen. I've never had this issue with 2012R2. To test it, I spun up a new 2012R2 server and recreated the same 50,000 local users with the scripts; it took a few minutes to reboot, but that was it.

Is there something different about 2019 Server Datacenter vs 2012R2 Standard which causes the 2019 server to hang on boot when there are 50,000 local users? Is it trying to process all the users before presenting the login? (I've set it to log in automatically to an admin account, but it never seems to get there.) I'm at my wits' end and open to suggestions on how to debug this further or which settings to look at. I've checked the group security and logon policies, and they're the same on the 2012 and 2019 servers. What am I overlooking?

They're both running on AWS with 1GB RAM. Thanks in advance and please don't hesitate to throw out whatever ideas you may have. I cannot change the architecture at this time and the easiest way to integrate IIS with authentication is to create a local user account. They don't need to have local login privileges only network login (for IIS), so these accounts are added to a custom Group on the windows server and removed from the standard Users group.
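For readers trying to reproduce the setup, here is a minimal sketch of the provisioning steps described above: create the account, add it to the custom group, remove it from Users. It uses the built-in `net user` and `net localgroup` commands rather than the poster's WMIC scripts, which are not shown in the thread. The group name `IISAuthUsers` and the sample user name and password are hypothetical placeholders. The code only builds the command lines and does not run them.

```python
# Sketch only: builds the command lines, it does not execute anything.
# GROUP_NAME is a hypothetical placeholder; the thread never names the custom group.
GROUP_NAME = "IISAuthUsers"


def provisioning_commands(username, password, group=GROUP_NAME):
    """Return the three commands for one network-only local account."""
    return [
        # Create the local account (it lands in the built-in Users group by default).
        ["net", "user", username, password, "/add"],
        # Add it to the custom group used for IIS authentication.
        ["net", "localgroup", group, username, "/add"],
        # Remove it from the built-in Users group.
        ["net", "localgroup", "Users", username, "/delete"],
    ]


# Hypothetical sample account, printed for inspection.
for cmd in provisioning_commands("app_user_00001", "Example-Passw0rd!"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` from an elevated prompt on a test machine. Note that a password passed on the command line is visible to other processes while the command runs.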
 
It's possible that the issue is related to the number of local users on the 2019 server. While there is no hard limit on the number of local users that can be created on a Windows server, excessive numbers of users can cause performance issues.

You may want to try some troubleshooting steps to determine the cause of the issue:

1. Check the event logs on the 2019 server to see if there are any errors or warnings that might indicate what is causing the hang.

2. Try booting the server in safe mode to see if it can successfully boot without all of the installed services and drivers. If it does boot successfully, you can begin to narrow down the cause by selectively enabling services and drivers until the issue returns.

3. Check the system resources (CPU, memory, disk usage) during the boot process to see if there are any bottlenecks that might be causing the hang.

4. Consider temporarily disabling some of the startup programs to see if any of them might be causing the issue.

5. Run the System File Checker (SFC) tool to check for any system file corruption that might be contributing to the issue.

It's also worth noting that running a Windows server on 1GB of RAM is quite low and may be contributing to the performance issues. It might be worth considering increasing the RAM allocation for the server to see if that helps.
 
I would very much like to avoid a ChatGPT response, as it's absolutely useless. My first comment said it does NOT boot, which means I can't check logs and I can't get to the login screen. I've checked the boot log, and the last thing it shows is a driver loading successfully. I can't even boot into safe mode, as it just hangs.

I start with a clean 2019 image, run the scripts to create the users, and it works great. The moment I reboot, it just keeps spinning on the circle with 100% CPU utilization. When I do the exact same process with 2012R2, it works fine.
 
There must be some kind of processing going on that did not exist in Server 2012 that would prohibit that many accounts. Are you able to see the PID or descriptor of the process taking up all the CPU? And yeah, the bot really can't solve this kind of problem without a real investigation of whatever is going on with the server. There are some more advanced security features in Server 2019 that could be getting triggered when this many local accounts are created.
 
@Mike thanks for the suggestion.

Just for kicks, I tried a new AWS 2019 server with 4 and 8GB RAM and more processors: same result. The scripts run and add 40K users, and everything works fine afterwards: the machine is responsive once the scripts are done (they take about 30 minutes to complete) and CPU utilization is under 1%. When I restart the machine, I end up with the spinning dots. (On AWS I can't see them because RDP isn't up or isn't responding, but on a local VM I can.) So the size of the RAM/hardware doesn't appear to be the issue here.

I can't see any PID since it's still booting (the spinning dots). I tried using safe mode and it's the exact same issue, ends up with the spinning dots or a blank screen (depending on what version of safe mode I pick).

Can you think of anything else? Is there a way to export all the configuration/settings of a Windows server? That way I can export the 2012R2 settings and diff them against the 2019 ones to see what is different.
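One possible way to do such a comparison for registry settings is to export the same key on both servers with `reg export` and diff the resulting `.reg` files. The sketch below is a simplified parser, not a complete one: it assumes value names do not contain the sequence `"=`, and the value names `ExampleA`, `ExampleB` and `ExampleC` are made up for illustration.

```python
def parse_reg_export(text):
    """Map (key path, value name) -> raw value text from a .reg export."""
    values = {}
    key = None
    pending = ""  # holds a value that continues on the next line (trailing backslash)
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):
            pending = line[:-1]
            continue
        if line.startswith("[") and line.endswith("]"):
            key = line[1:-1]
        elif key is not None and line.startswith("@="):
            values[(key, "@")] = line[2:]  # the key's default value
        elif key is not None and line.startswith('"'):
            end = line.find('"=', 1)  # simplification: first '"=' ends the name
            if end != -1:
                values[(key, line[1:end])] = line[end + 2:]
    return values


def diff_exports(old_text, new_text):
    """Report values present on only one side, or different on both."""
    a, b = parse_reg_export(old_text), parse_reg_export(new_text)
    return {
        "only_in_old": sorted(a.keys() - b.keys()),
        "only_in_new": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


# Hypothetical exports; the value names and data are invented.
OLD = r'''Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa]
"ExampleA"=dword:00000001
"ExampleB"=hex:01,02,\
  03,04
'''
NEW = r'''Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa]
"ExampleA"=dword:00000002
"ExampleB"=hex:01,02,03,04
"ExampleC"="new"
'''

for section, items in diff_exports(OLD, NEW).items():
    print(section, items)
```

When reading real exports from disk, open them with `encoding="utf-16"`, since `reg export` writes version 5.00 files as UTF-16.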
 
I have one more insight. I created a new 2019 server, added the 50K new accounts, but then just left the machine without rebooting. I logged out of the RDP session, waited a few hours and tried to RDP back in, and now it hangs on "Please wait for Local Session Manager" and just times out; the console shows CPU usage spiking to 100%. So it seems like something with Local Session Manager is causing the issue. I did the same with a 2022 server and got the same result (so it's not specific to 2019). Any ideas?
 
Have you tested just making one or two users by hand and restarting?
 
> Have you tested just making one or two users by hand and restarting?

Thanks, yes I've done that. I've tried testing with increasing numbers; under 100 it's fine (scripted or by hand). When I run it up to about 3,000 it really slows down and takes 15 minutes to start. So as the numbers keep increasing it slows down more, but I don't see that with my older 2012R2 instance, which takes a consistent 3-4 minutes to boot and log in to the configured automatic user login, even at 50K users.
 
> I have one more insight. I created a new 2019 server, added the 50K new accounts, but then just left the machine without rebooting. I logged out of the RDP session, waited a few hours and tried to RDP back in, and now it hangs on "Please wait for Local Session Manager" and just times out; the console shows CPU usage spiking to 100%. So it seems like something with Local Session Manager is causing the issue. I did the same with a 2022 server and got the same result (so it's not specific to 2019). Any ideas?

For this specifically, the actual service might be timing out on start-up. There is a registry edit to try to correct this:

 
> Thanks, yes I've done that. I've tried testing with increasing numbers; under 100 it's fine (scripted or by hand). When I run it up to about 3,000 it really slows down and takes 15 minutes to start. So as the numbers keep increasing it slows down more, but I don't see that with my older 2012R2 instance, which takes a consistent 3-4 minutes to boot and log in to the configured automatic user login, even at 50K users.

One more possibility here:

 
> One more possibility here:

Thanks, I tried this and it didn't work :( but it has given me a few leads to chase up by comparing the registry settings of the 2012R2 and 2019 servers. I'll also try the other link you sent me. This is very helpful; please send me any other ideas you may think of. I suspect it's something to do with LSASS or SAM. When it did eventually boot up (after 12 hours), I looked at the system logs and there was a 3-hour gap between the kernel boot and the next system message, and that next message said something about the SAM being unavailable.
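As a rough back-of-the-envelope check on the figures reported so far (about 15 minutes to start at roughly 3,000 accounts, about 12 hours at 50,000), the sketch below computes the time per account and the exponent you would get if boot time grew like a power of the account count. Treat it as a hint only: the two measurements may come from different machines, and the 15 minutes includes normal boot time.

```python
import math

# Figures as reported in the thread; both are approximate.
small_users, small_minutes = 3_000, 15
large_users, large_minutes = 50_000, 12 * 60

# Average boot time spent per account, in seconds.
per_user_small = small_minutes * 60 / small_users
per_user_large = large_minutes * 60 / large_users

# If boot time ~ users**k, two data points determine k.
exponent = math.log(large_minutes / small_minutes) / math.log(large_users / small_users)

print(f"{per_user_small:.3f} s per account at {small_users} accounts")
print(f"{per_user_large:.3f} s per account at {large_users} accounts")
print(f"implied growth exponent: {exponent:.2f}")
```

With these inputs the per-account cost rises from 0.3 s to about 0.86 s and the implied exponent is roughly 1.4, which suggests the slowdown is worse than linear in the number of accounts. Two data points cannot establish the actual growth law.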
 
I don't think the SAM database was meant to handle that many local accounts (that's what LDAP or a sql server is meant for). I would suspect the entire SAM db is being loaded or profiles are being pre-fetched.
 
Any ideas on what I can do? Can I move the users to LDAP on the same machine and then use that for integrated Windows authentication via IIS? That's pretty much the only requirement (authenticate via IIS and be able to create the user via a batch/command prompt command).
 
I'm almost convinced it's a SAM issue. In the System event log of a server without any users, there are 3 Directory-Services-SAM events that all complete within about 1 second. On the server with the 50K users, those same 3 events are 12 hours apart: the first one right after Wininit and the next one 12 hours later (when login takes place).

So the first one after Wininit:
> Remote calls to the SAM database are being restricted using the default security descriptor: O:SYG:SYD:(A;;RC;;;BA).
> For more information please see Network access - Restrict clients allowed to make remote calls to SAM.

The second one after 12 hours:
> The domain is configured with the following minimum password length-related settings.
> MinimumPasswordLength: 0
> MinimumPasswordLengthAudit: -1
> For more information see Minimum Password Length auditing and enforcement on certain versions of Windows - Microsoft Support.

And the third one immediately thereafter:
> The security account manager is now logging periodic summary events for remote clients that call legacy password change or set RPC methods.
> For more information please see KB5004605: Update adds AES encryption protections to the MS-SAMR protocol for CVE-2021-33757 - Microsoft Support.

Any thoughts on how to get SAM to speed up or not load all profiles at boot up?
 
I found the culprit: it's LSASS (found using ProcMon). It iterates through every user account under the registry key HKLM\SAM\SAM\Domains\Account\Users\Names, and for every user it makes 4 registry calls:



2:59:15.2752917 PM lsass.exe 680 RegEnumKey HKLM\SAM\SAM\Domains\Account\Users\Names SUCCESS Index: 8,897, Name: XXXX
2:59:15.2753102 PM lsass.exe 680 RegOpenKey HKLM\SAM\SAM\Domains\Account\Users\Names\XXXX SUCCESS Desired Access: Read
2:59:15.2753356 PM lsass.exe 680 RegQueryValue HKLM\SAM\SAM\Domains\Account\Users\Names\XXXX\(Default) SUCCESS Type: <Unknown: 51424>
2:59:15.2753501 PM lsass.exe 680 RegCloseKey HKLM\SAM\SAM\Domains\Account\Users\Names\XXXX SUCCESS


So the question is how do I stop LSASS from doing this? For some reason 2012R2 isn't doing it but 2019 and 2022 servers are doing this. Is there a registry setting to control this?
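To compare the 2012R2 and 2019 behaviour side by side, a ProcMon capture from each machine can be saved as CSV and summarized. The sketch below assumes the export uses ProcMon's default columns ("Time of Day", "Process Name", "PID", "Operation", "Path", "Result", "Detail"). The sample rows are the four trace lines from the post above, rewritten as CSV.

```python
import csv
import io
from collections import Counter

SAM_NAMES = r"HKLM\SAM\SAM\Domains\Account\Users\Names"


def summarize(csv_text, process="lsass.exe", prefix=SAM_NAMES):
    """Count registry operations by one process under a key, plus accounts touched."""
    ops = Counter()
    accounts = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = row["Path"]
        if row["Process Name"].lower() != process or not path.startswith(prefix):
            continue
        ops[row["Operation"]] += 1
        # The first path component below ...\Names is the account name.
        rest = path[len(prefix):].lstrip("\\")
        if rest:
            accounts.add(rest.split("\\")[0])
    return ops, accounts


# The four trace lines from the thread, in CSV form.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"2:59:15.2752917 PM","lsass.exe","680","RegEnumKey","HKLM\SAM\SAM\Domains\Account\Users\Names","SUCCESS","Index: 8,897, Name: XXXX"
"2:59:15.2753102 PM","lsass.exe","680","RegOpenKey","HKLM\SAM\SAM\Domains\Account\Users\Names\XXXX","SUCCESS","Desired Access: Read"
"2:59:15.2753356 PM","lsass.exe","680","RegQueryValue","HKLM\SAM\SAM\Domains\Account\Users\Names\XXXX\(Default)","SUCCESS","Type: <Unknown: 51424>"
"2:59:15.2753501 PM","lsass.exe","680","RegCloseKey","HKLM\SAM\SAM\Domains\Account\Users\Names\XXXX","SUCCESS",""
'''

ops, accounts = summarize(SAMPLE)
print(dict(ops))
print(f"{sum(ops.values())} calls across {len(accounts)} account(s)")
```

Running the same summary on a boot capture from each OS version would show whether 2012R2 skips this enumeration entirely or simply completes it faster.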
 
Sorry, I was testing something. I apologize for the notification blast you may have just received from it. I'm not sure there is a solution yet.
 
There is no tweaking of LSASS with regards to SAM performance that I am aware of. I would strongly recommend using LDAP or a SQL based authentication system as these are intended for large user bases.
 
I apologize for the confusion earlier. To address your concern about LSASS iterating through every user account in the registry, there isn't a specific registry setting available to control this behavior.

However, given that this behavior is not observed in Windows Server 2012 R2 and there's no need for LSA (Local Security Authority) to load every local user account, you could:

1. Open a support case with Microsoft to have their input on this behavior and perhaps gain insight into a potential fix or workaround.

2. You might also consider adjusting your architecture to use an AD domain with a domain controller or an LDAP server (for example, OpenLDAP) to handle the large number of user accounts. These services can better manage large numbers of users and their authentication needs; it will likely be a better solution for improved performance and control.
 
Apologies if my previous response didn't directly address your concern. To answer your question about stopping LSASS from iterating through every user account in the registry, you might want to look into implementing a Group Policy setting or a custom script that controls the LSASS behavior.

However, there may not be a simple registry setting to stop LSASS from iterating through every user account. A better approach might be to consider restructuring your system to handle a large number of users more effectively. As mentioned earlier, using LDAP or implementing Active Directory could help manage these users more efficiently, and allow for better scalability.

Please let me know if you'd like further assistance or have any additional questions.
 