On Tuesday at VMworld I attended the session “Designing vSphere Platforms for Maximum Tier 1 Application Performance #BCA2817” by Mark Achtemichuk and Reza Taheri. I have summarized the key takeaways from this session below:
vSphere version
When you have the option, use at least vSphere 4.1 because of its major performance gains over earlier versions.
BIOS/Hardware
At the hardware (CPU) level, pick hardware that supports hardware-assisted CPU virtualization (Intel VT-x and AMD-V), although this is quite common these days. Another important CPU feature is hardware-assisted MMU virtualization (Intel EPT and AMD RVI), which eliminates the need for ESX to virtualize the MMU in software. Also make sure the hardware-assisted virtualization features (VT-x, AMD-V, EPT, RVI) are enabled in the BIOS.
For Intel Xeon 5500-series processors and newer, turn Hyper-Threading on; it will increase performance.
When running virtual machines with 8 or more vCPUs, make sure the underlying hardware is a quad-socket system.
Memory
Swapping kills performance. When using swap to SSD, make sure you have enough of it before even thinking about memory performance.
Make full CPU and memory reservations for tier-1 applications, and let the less important applications/VMs compete for the remaining resources.
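To make the reservation idea concrete, here is a minimal Python sketch of the bookkeeping: tier-1 VMs reserve their entire configured CPU and memory, and whatever is left is the pool the other VMs compete for. The `Host`/`VM` classes, the `remaining_pool` helper, and all numbers are hypothetical, and the sketch ignores hypervisor and per-VM overhead that real admission control also accounts for.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cpu_mhz: int  # total host CPU capacity
    mem_mb: int   # total host memory


@dataclass
class VM:
    name: str
    cpu_mhz: int  # configured CPU capacity (vCPUs x core clock)
    mem_mb: int   # configured memory
    tier1: bool   # tier-1 VMs get a full reservation


def remaining_pool(host: Host, vms: list[VM]) -> tuple[int, int]:
    """Return (cpu_mhz, mem_mb) left for non-tier-1 VMs after full tier-1 reservations.

    Hypothetical helper: overhead memory and CPU are ignored.
    """
    reserved_cpu = sum(vm.cpu_mhz for vm in vms if vm.tier1)
    reserved_mem = sum(vm.mem_mb for vm in vms if vm.tier1)
    if reserved_cpu > host.cpu_mhz or reserved_mem > host.mem_mb:
        raise ValueError("tier-1 reservations exceed host capacity")
    return host.cpu_mhz - reserved_cpu, host.mem_mb - reserved_mem


# Hypothetical 2-socket, 8-core, 2.6 GHz host with 256 GB of RAM.
host = Host(cpu_mhz=2 * 8 * 2600, mem_mb=256 * 1024)
vms = [
    VM("sql01", cpu_mhz=8 * 2600, mem_mb=64 * 1024, tier1=True),
    VM("web01", cpu_mhz=2 * 2600, mem_mb=8 * 1024, tier1=False),
]
print(remaining_pool(host, vms))  # (20800, 196608)
```

Only `sql01` is subtracted; `web01` and any other non-critical VMs share the remaining 20,800 MHz and 192 GB.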
Networking
Uplink teaming policy (load based): choose the policy that fits your network; the choice dictates how much network bandwidth will be available.
Network I/O Control (NetIOC) should be turned on by default (when licensing allows it).
When using the network to connect to storage (IP storage), consider separating it from other network traffic (this applies to both physical and virtual networking). Consider separate VLANs or even separate switches.
Tier-1 application VMs tend to grow more than other VMs, so large VMs are not uncommon. Keep in mind that vMotion of a large VM requires more bandwidth, so 10 Gbit networking is not a luxury when running “monster” VMs.
Disable interrupt coalescing for latency-sensitive applications. This option squeezes the last percent out of your host, so the gain is very small and the use case very limited (VoIP, financial transactions, etc.).
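To put rough numbers on the bandwidth point, the following Python sketch estimates a lower bound for copying a VM's memory during vMotion: memory size divided by link speed. This is a back-of-the-envelope model, not how vMotion actually schedules transfers; it assumes the whole link is available and ignores protocol overhead and the re-sending of pages the guest dirties during the copy, so real migrations take longer. The function name and the 64 GB example VM are hypothetical.

```python
def min_copy_seconds(vm_mem_gb: float, link_gbit: float) -> float:
    """Lower bound (seconds) to copy vm_mem_gb gigabytes over a link_gbit link.

    Uses decimal units (1 GB = 8 Gbit) and assumes the full link is available;
    overhead and dirty-page retransmission are deliberately ignored.
    """
    gigabits = vm_mem_gb * 8
    return gigabits / link_gbit


for link in (1, 10):
    print(f"64 GB VM over {link} Gbit: >= {min_copy_seconds(64, link):.0f} s")
```

Even under these optimistic assumptions, a 64 GB VM needs at least 512 s on 1 Gbit versus about 51 s on 10 Gbit.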
Storage
Storage may well be the most important part when it comes to performance. Mark Achtemichuk (the presenter) stated that in 99% of the performance problems he encountered, the root cause had something to do with storage.
Follow the storage vendor’s guidelines regarding multipathing.
Storage I/O Control (SIOC) should be turned on by default (if licensing allows it).
Make sure your VMs’ guest OS partitions are aligned with the underlying storage (partition offset). Regarding VM alignment, take a look at this post by Nicholas Weaver, who built a tool to align your VMs, UberAlign:
http://nickapedia.com/2011/11/03/straighten-up-with-a-new-uber-tool-presenting-uberalign/
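The offset check itself is simple arithmetic, sketched below in Python: a partition is aligned when its starting byte offset is a multiple of the storage boundary. The 64 KiB boundary is an assumption for illustration (use your array vendor's recommended chunk or stripe size), and `is_aligned` is a hypothetical helper, not part of UberAlign. Sector 63 is the legacy MBR default used by older guests such as Windows Server 2003; newer guests default to a 1 MiB offset (sector 2048).

```python
SECTOR_BYTES = 512  # classic 512-byte sectors


def is_aligned(start_sector: int, boundary_bytes: int = 64 * 1024) -> bool:
    """True if a partition starting at start_sector is aligned to boundary_bytes.

    Hypothetical helper; the default 64 KiB boundary is an assumption.
    """
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


print(is_aligned(63))    # 32,256-byte offset (legacy default) -> False
print(is_aligned(2048))  # 1 MiB offset -> True
```

A misaligned partition means some guest I/Os straddle two array chunks, so the array has to do extra work for them.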
How do you check storage performance? Use the esxtop latency counters; in the diagram below the relevant ones are emphasized: “K” and “D”:
K = kernel average latency (KAVG/cmd in esxtop): the time an I/O takes to pass through the hypervisor, i.e. the storage overhead of your host. Typical values are tens of microseconds, so a high K average points to an ESX host problem, most likely contention, probably running out of CPU. Investigate when the K value reaches 1 ms.
D = device average latency (DAVG/cmd in esxtop): the time an I/O takes to leave the ESX host, arrive at the storage array, and return an acknowledgement. A high D average means a transport or array problem. Investigate when the D value reaches 10-15 ms; a D average above this threshold might indicate that your storage system is too slow.
These counters are available in several tools: vCenter (you might need to raise your statistics level), esxtop, and vCenter Operations.
These values can help put an end to the fight between the storage and network guys blaming each other for bad performance.
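A minimal Python sketch of how the thresholds above could be applied to a pair of esxtop readings in milliseconds. K and D correspond to the esxtop KAVG/cmd and DAVG/cmd columns; the `diagnose` function and its messages are hypothetical, and the 10 ms cut-off for D is the low end of the presenter's 10-15 ms range.

```python
K_THRESHOLD_MS = 1.0   # kernel latency: investigate from 1 ms
D_THRESHOLD_MS = 10.0  # device latency: low end of the 10-15 ms range


def diagnose(kavg_ms: float, davg_ms: float) -> list[str]:
    """Map K (KAVG) and D (DAVG) readings to where to look first."""
    findings = []
    if kavg_ms >= K_THRESHOLD_MS:
        findings.append("host: kernel latency high (contention, CPU?)")
    if davg_ms >= D_THRESHOLD_MS:
        findings.append("device: transport or array latency high")
    return findings if findings else ["ok"]


print(diagnose(0.02, 4.0))   # ['ok']
print(diagnose(0.03, 18.0))  # ['device: transport or array latency high']
print(diagnose(2.5, 6.0))    # ['host: kernel latency high (contention, CPU?)']
```

Keeping the two checks independent mirrors the point of the counters: a high K points at the host, a high D points past it, and both can be true at once.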
Guest OS
Use VMXNET3 as the default virtual network adapter for almost any workload.
Use the paravirtualized SCSI driver (PVSCSI). PVSCSI performance is the same as that of the LSI/SAS drivers, but it has less CPU impact on the host. Larger VMs in particular, such as SQL databases, will benefit.
Keep an eye on OS paging inside the VM: guest paging turns a memory problem into a potential storage problem by moving memory into a page file. Size your VMs correctly!
Make sure your application is correctly sized and configured.
Take time to evaluate vCenter Operations. It gives you better insight into performance over time because it learns what a “normal” state looks like. It will also keep you from staring at esxtop for hours.
Size your most important/critical applications (20%) based on peak usage. Size the other, less important applications (80%) based on their average usage, without taking peaks into account.
Some of the Q&As at the end of the session:
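The 20/80 sizing rule can be sketched in Python as follows, assuming you have a series of utilization samples per application (for example, CPU MHz taken from vCenter statistics). The function name and sample values are hypothetical, and real sizing would usually add some headroom, which this sketch leaves out.

```python
def size_requirement(samples: list[float], critical: bool) -> float:
    """Size critical apps on their peak and everything else on the average.

    Hypothetical helper; no headroom is added.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return max(samples) if critical else sum(samples) / len(samples)


cpu_mhz = [800, 1200, 950, 3100, 1000]
print(size_requirement(cpu_mhz, critical=True))   # 3100
print(size_requirement(cpu_mhz, critical=False))  # 1410.0
```

The same workload is sized more than twice as large when it is treated as critical, which is why this treatment is reserved for the top 20%.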
Q Is it possible/advisable to mix 1-vCPU and 24-vCPU VMs on a single host, or should I dedicate hosts to the different VMs?
A Mixed is the keyword here; the platform is designed to do just that!
Q Interrupt coalescing, do you have to tune that, or is there only an on or off switch?
A In 99% of all cases it’s simply “on” or “off”; the product is designed to get as much as possible out of the box.
Q Do you recommend turning vNUMA on?
A It’s on by default, but only used when the application is NUMA-aware.
Q How come TPS (transparent page sharing) does not work together with large pages?
A It’s not that it doesn’t work; it works in a smart way and is only used when it is necessary.
Q What about green datacenters? How do you deal with that, given that people say power management has a negative impact on performance?
A It’s all about choices: do you want performance, or do you want to be green?