This article covers a recent issue we found at one of our customers: periodic high latency causing Citrix session hangs and generally sluggish performance on a Nutanix AHV cluster. It must be said before I start that this problem is likely to affect other tiered storage solutions too, but it has a particularly big effect on Nutanix, especially on NX-1000 series nodes.
Configuration background.
First, some background information on the customer's config:
Nutanix cluster – 9 x NX nodes (one cluster): 6 x 1000 series and 3 x 3000 series, with a 20% SSD tier. Running AHV on AOS version 5.1.1.3.
Citrix XenApp 7.12 using MCS on 35 x Windows Server 2016 machines, with WEM version 4.2. The XenApp machines have 8 cores and 24 GB RAM each. No MCS cache to disk, since this is not compatible with AHV.
Performance issues.
Performance of this cluster has historically been very good; Citrix generally performs very well on Nutanix AHV. However, we began to notice issues with login speeds. Initially these were as low as 8 seconds, which is very impressive, but they did start to creep up.
Occasionally a user would experience a login duration of more than two minutes. Our Prevensys® monitoring solution clearly showed that these slow logins occurred during periods of high Nutanix latency; see the screenshot below:
In the top-right panel, you can clearly see this user has a 2.5-minute login duration at 8:50am, which matches the latency seen in the lower-left panel. Once the period of latency ended (just after 9am), the user logged out and back in, and the login duration was back down to 25 seconds.
Analysis.
A performance data collection was gathered from one of the Controller VMs during a high-latency period and submitted to Nutanix support. Support came back to us and reported a high number of writes to the SSD tier once that tier was 75% used.
Curator (the Nutanix service responsible for housekeeping) then called on the Stargate service (responsible for the heavy lifting) to flush old data from the SSD tier down to the spinning disk tier. The substantial number of writes hitting the cluster was already pushing SSD utilisation on the 1000 series nodes (which have only one SSD each) up to 85%, and Stargate's flushing pushed this to 100%, which caused the period of high latency.
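If you want to see how close each node's SSD tier is sitting to that 75% down-migration threshold, you can poll the Prism REST API yourself. The sketch below is a minimal example against the v2 disks endpoint; the Prism address, credentials, tier name and usage_stats field names are assumptions based on our environment, so check the API explorer on your own Prism instance before relying on them.

```python
# Minimal sketch: report SSD-tier usage per disk via the Prism v2 REST API.
# Endpoint path, tier name and stat field names are assumptions - verify them
# in the API explorer on your own Prism instance for your AOS build.
import requests

PRISM = "https://prism.example.local:9440"   # hypothetical Prism Element address
AUTH = ("admin", "secret")                   # use a read-only account in practice

resp = requests.get(
    f"{PRISM}/PrismGateway/services/rest/v2.0/disks/",
    auth=AUTH,
    verify=False,          # lab only; use proper certificates in production
    timeout=30,
)
resp.raise_for_status()

for disk in resp.json().get("entities", []):
    # Tier name "SSD-SATA" is an assumption; adjust for your hardware.
    if disk.get("storage_tier_name") != "SSD-SATA":
        continue
    stats = disk.get("usage_stats", {})
    used = int(stats.get("storage.usage_bytes", 0))
    capacity = int(stats.get("storage.capacity_bytes", 0))
    pct = 100.0 * used / capacity if capacity else 0.0
    print(f"{disk.get('id', 'disk')}: SSD {pct:.1f}% used "
          f"({used / 2**30:.0f} GiB of {capacity / 2**30:.0f} GiB)")
```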
We then began to investigate where these writes were coming from. The IO profile suggested paging of some kind, yet none of the VMs had a memory constraint. A quick look at the XenApp hosts revealed that paging was indeed taking place, even though over 50% of memory was available; this didn't make sense! We then started to suspect WEM memory optimisation.
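A quick way to confirm that a XenApp host is paging is to sample the standard Windows memory counters. The snippet below is a rough sketch that shells out to typeperf (which ships with Windows Server 2016) from Python; the sample interval and count are arbitrary choices, not anything prescribed by Citrix or Nutanix.

```python
# Rough sketch: sample Windows paging counters on a XenApp host via typeperf.
# Run locally on the host; typeperf ships with Windows Server 2016.
import subprocess

COUNTERS = [
    r"\Memory\Pages/sec",                 # hard page fault rate (reads + writes)
    r"\Memory\Available MBytes",          # plenty free here plus heavy paging = suspicious
    r"\Paging File(_Total)\% Usage",      # how much of the page file is in use
]

# -si 5 : sample every 5 seconds, -sc 12 : take 12 samples (one minute total)
result = subprocess.run(
    ["typeperf", *COUNTERS, "-si", "5", "-sc", "12"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```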
The fix.
The WEM system optimization configuration was conservative; see the screenshot below:
Not excessive by any means. We decided to turn this off and forced an update to see if we had any improvement, and whoosh... immediate impact. All paging stopped, along with a 75% decrease in load on the SSD tier. You can see this in the graph below; we turned off Working Set Optimization at 11:10.
Summary.
We would recommend keeping WEM memory working set optimization off in most environments, unless you are memory constrained or you have a poorly written, memory-hungry application to contend with. If you are in that situation, you may want to keep a very close eye on your storage IOPS and latency after enabling working set optimization.
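For that kind of watch, the same Prism v2 REST API can be polled for cluster-wide controller IOPS and latency. The sketch below is illustrative only; the Prism address and credentials are placeholders, and the stat keys (controller_num_iops, controller_avg_io_latency_usecs) are our assumption of the names returned by the cluster endpoint, so verify them against your own AOS version first.

```python
# Minimal sketch: poll cluster-wide controller IOPS and latency from Prism.
# Stat key names are assumptions - confirm them in the v2 API explorer first.
import time
import requests

PRISM = "https://prism.example.local:9440"   # hypothetical Prism Element address
AUTH = ("admin", "secret")

for _ in range(20):                          # 20 samples, 30 seconds apart
    resp = requests.get(
        f"{PRISM}/PrismGateway/services/rest/v2.0/cluster/",
        auth=AUTH,
        verify=False,        # lab only
        timeout=30,
    )
    resp.raise_for_status()
    stats = resp.json().get("stats", {})
    iops = int(stats.get("controller_num_iops", -1))
    latency_us = int(stats.get("controller_avg_io_latency_usecs", -1))
    print(f"IOPS: {iops}, avg IO latency: {latency_us / 1000:.1f} ms")
    time.sleep(30)
```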
Please share this with your peers and colleagues.