Hey everybody. I’ve been wanting to write this blog series for a long time, because it tackles one of the greatest misconceptions about Microsoft Remote Desktop Services: namely, that it can’t scale to tens of thousands of users, so if you’re an organization with 10000-plus employees, or you’re an ISV delivering your app to thousands of customers over RDS, you’ll have to switch to a much more expensive solution like Citrix or AVD to scale to that level.
This couldn’t be further from the truth. You CAN scale up your RDS environment so that it can handle 5000, 10000, 20000, or even more users. I know this is true, because I’ve consulted with companies whose RDS user bases exceed 20000 users! They all experienced some growing pains and choke points when scaling up, so the purpose of this blog series is to educate you in advance, so you’ll be ready when your RDS deployment grows larger over time. And, as a bonus, many of my tips are useful in RDS deployments of ALL sizes, and will directly improve the responsiveness of your connection brokers and the overall stability of your RDS environment. Watch my new RDPHard video below, and read on to find out exactly how to scale up!
But First, Some Background on RDS Scalability, Since the Overhaul of the Connection Broker in Windows Server 2016
Let’s first go back in time a bit, to when Windows Server 2016 was first introduced. Around the time of its launch, Scott Manchester, who’s now the product group leader for Microsoft 365 and AVD, tasked the RDS team with optimizing the performance of the RDS connection broker. His team at the time built a tool for stress testing the connection broker against login storms, where up to 1000 users sign in at the same time. Scott claimed that a Windows Server 2016 broker could handle login storms of up to 1000 user sign-ins in a very short period, and that they were testing their connection broker with up to 10000 users (watch my video above – it has an excerpt of the Microsoft Mechanics video where he made these claims).
In the real world, though, beyond Scott’s stress testing script, the connection broker doesn’t fare as well dealing with large user bases when running with its default settings. Why?
RDS Scaling Choke Point 1 – Connection Broker Dependencies on Microsoft SQL Server and Stored Procedures
The first reason relates to the connection broker’s dependency on Microsoft SQL Server. RDS connection brokers use SQL Server to maintain an inventory of how users are load balanced across the session hosts in RDS collections, and which host each user is assigned to in case of a disconnection and reconnection. Small deployments with a single connection broker use an internal version of SQL, which is basically SQL Express, running on the same VM as the connection broker. Larger deployments, in an RDS High Availability configuration, have a dedicated SQL Server cluster which all brokers consult for information.
Anyway, Microsoft architected RDS so that the brokers execute a bunch of stored procedures on the SQL Server to get the information they need when a user first connects or reconnects to the environment. Unfortunately, as some of my larger customers have discovered, when connection brokers start getting lots of login requests during morning or after-lunch login storms, tons of these stored procedures get executed at the same time. Those connection storms often cause deadlocks on the SQL Server, which can result in your brokers going off into lala land, your users being unable to connect, connections taking a really long time to complete, etc. FYI – you can observe SQL responsiveness, stored procedure problems, and DB response times live using our Remote Desktop Commander Suite solution – this blog article discusses how that works in depth.
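If you want a quick, tool-free way to spot that kind of blocking while a login storm is underway, querying SQL Server’s sys.dm_exec_requests DMV will show which sessions are stuck waiting behind other sessions. Here’s a minimal PowerShell sketch using Invoke-Sqlcmd from the SqlServer module; the server instance and database names are placeholders you’d swap for your own broker database details.

```powershell
# Requires the SqlServer PowerShell module (Install-Module SqlServer)
# Placeholder names below - substitute your broker DB server and database
$sqlInstance = 'YOUR-SQL-SERVER\INSTANCE'
$brokerDb    = 'YourBrokerDatabase'

$query = @"
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;   -- only show sessions blocked by another session
"@

Invoke-Sqlcmd -ServerInstance $sqlInstance -Database $brokerDb -Query $query |
    Format-Table -AutoSize
```

Run it a few times during a login storm – if the same blocking_session_id keeps showing up, you’re looking at exactly the contention described above.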
Scaling Strategy – Increase the Cost Threshold for Parallelism on SQL Server
So, the question becomes, how can we tune SQL so it doesn’t run home to mama when the connection brokers are calling thousands of stored procedures at the same time? The answer is: increase the Cost Threshold for Parallelism default value on your SQL Server from 5 to 25. Now, I’m not going to get into the underlying mechanics of how and why this helps – that’s better left to hardcore SQL nerds like Brent Ozar, and you can read his blog article if you want the deep dive. Adjusting the default value of 5 up to 25 was the magic change I saw Microsoft support recommend during a consulting engagement, but your environment may need a smaller or larger number. As with all things, TEST THIS FIRST before rolling it out into production, and be prepared to revert the value if performance worsens instead of improves. But in every environment I’ve consulted in, moving this value up from 5 to 25 has helped rather than hurt scalability.
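If you’d rather script the change than click through SSMS, here’s a minimal sketch of the equivalent sp_configure calls, wrapped in PowerShell with Invoke-Sqlcmd. The instance name is a placeholder, and 25 is just the starting point discussed above – tune it per the testing caveat.

```powershell
# Requires the SqlServer PowerShell module; run against the SQL instance hosting the broker DB
$sqlInstance = 'YOUR-SQL-SERVER\INSTANCE'   # placeholder

$tsql = @"
EXEC sp_configure 'show advanced options', 1;             -- expose advanced server settings
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;   -- raise from the default of 5
RECONFIGURE;
"@

Invoke-Sqlcmd -ServerInstance $sqlInstance -Query $tsql
```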
By the way, if you’re running RDS in Azure, and you have your HA connection brokers communicating with Azure SQL, you can experiment with increasing the MAXDOP value, which stands for Max Degree of Parallelism. At a minimum, take a look at this value, because if your deployment is older, your Azure SQL database may have a default MAXDOP of 1, which means queries execute serially on a single thread. Newer Azure SQL deployments have a default MAXDOP value of 8.
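Checking and changing MAXDOP on an Azure SQL database is a one-liner each, since it’s a database-scoped configuration. A minimal sketch, again via Invoke-Sqlcmd, with placeholder server, database, and credential values:

```powershell
# Placeholders - substitute your Azure SQL logical server, database, and credentials
$azServer = 'yourserver.database.windows.net'
$azDb     = 'YourBrokerDatabase'
$cred     = Get-Credential   # SQL auth login with rights to alter the database

# Check the current MAXDOP setting for the database
Invoke-Sqlcmd -ServerInstance $azServer -Database $azDb -Credential $cred -Query @"
SELECT name, value FROM sys.database_scoped_configurations WHERE name = 'MAXDOP';
"@

# Raise it to the newer default of 8 (adjust only after testing)
Invoke-Sqlcmd -ServerInstance $azServer -Database $azDb -Credential $cred -Query @"
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 8;
"@
```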
RDS Scaling Choke Point 2 – WMI Handle Exhaustion on the Connection Brokers
Let’s now move on and talk about another major choke point for scalability with the connection broker. That choke point revolves around WMI handle count limits.
Lots of folks with RDS deployments don’t know this, but WMI is the glue that holds many of the RDS infrastructure roles together. Various infra roles like the broker and RDWeb issue lots of WMI queries back and forth while operating, and the larger the user base of the farm, the more WMI queries get issued. Since the connection brokers are the “brains” of the farm, so to speak, they handle the majority of those WMI queries, and Windows imposes a hard quota on WMI handles per provider host, where 4096 is the default.
If a server VM running a connection broker goes over that 4096 limit, the WMI provider service will crash, and you can end up with a cascading failure on the connection broker. Alas, adding more HA connection brokers to your deployment is often not the fix. This is because in many cases, WMI queries still get issued to a single connection broker in the group that acts as the primary for those WMI requests, even though any connection broker in the group can field incoming RDP connection requests passed on from a load balancer.
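One rough way to keep an eye on this is to watch the handle counts of the WMI provider host processes (wmiprvse.exe) on your brokers, since the quota is enforced per provider host. A minimal sketch, with the caveat that the process handle count shown here is only a proxy for the WMI quota accounting:

```powershell
# Show handle counts for each WMI provider host process on this broker
Get-Process -Name wmiprvse -ErrorAction SilentlyContinue |
    Select-Object Id, Handles, StartTime |
    Sort-Object Handles -Descending |
    Format-Table -AutoSize
```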
You’ll know if this is happening, because you’ll see Event ID 5612 from WMI in the Application Event Log, indicating that the WMI provider service crashed.
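A quick way to check a broker for those events is a Get-WinEvent query against the Application log; note that Get-WinEvent reports an error when no matching events exist, hence the -ErrorAction SilentlyContinue in this sketch.

```powershell
# Look for recent WMI quota/crash events (Event ID 5612) in the Application log on this broker
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 5612 } -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, ProviderName, Message |
    Format-List
```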
Scaling Strategy – Use the WBEMTest.exe Utility to Increase the WMI Handle Count From 4096 to 8192
Fortunately, you can increase the default WMI handle count on your brokers from 4096 to 8192 to combat WMI handle exhaustion, which will let them serve a greater number of concurrent users without problems. To do this, launch the little-known WBEMTest.exe utility on each of your brokers and adjust the HandlesPerHost property for WMI.
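If you’d rather script it than click through WBEMTest, the same HandlesPerHost value lives on the __ProviderHostQuotaConfiguration class in the root WMI namespace, so a CIM cmdlet sketch like the one below should accomplish the same thing. Treat this as an assumed scripted equivalent rather than the documented procedure, and verify the result in WBEMTest afterward.

```powershell
# Scripted alternative to WBEMTest.exe: raise the WMI HandlesPerHost quota on this broker.
# Run elevated on each connection broker; restart the broker afterward per the step below.
$quota = Get-CimInstance -Namespace root -ClassName __ProviderHostQuotaConfiguration
$quota.HandlesPerHost = 8192   # up from the 4096 default
Set-CimInstance -InputObject $quota

# Verify the new value took
(Get-CimInstance -Namespace root -ClassName __ProviderHostQuotaConfiguration).HandlesPerHost
```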
Once this is done on each of your brokers, restart them during a maintenance window, and your brokers will be able to scale to more users.
RDS Connection Broker Scaling Conclusions
After tweaking both the Cost Threshold for Parallelism on SQL and the WMI handle count limits in your high availability RDS environment, your brokers should be able to handle a substantially larger volume of users during login storm bursts, as well as higher sustained peak concurrent user counts. Stay tuned for my next blog posts in this series, where I will talk about properly load balancing gateway servers, the concept of clustering multiple smaller RDS deployments together instead of attempting to build a single large deployment, and how to effectively manage larger RDS environments.