Storage I/O bottlenecks can have a big impact on virtual environments and can wreak havoc on the performance of the virtual machines within them. The guest operating systems and applications running inside VMs are constantly reading from and writing to their virtual disks, and anything that delays this can slow a VM to a crawl.
Of all the resources a host manages, traditional storage devices are typically the slowest because they rely on mechanical spinning hard disks. In addition, shared storage arrays are commonly used in virtual environments because many features require shared storage, and as a result there is a longer path to get to storage resources. Storage I/O must leave the host through an I/O adapter and traverse a storage or traditional network to get to a storage array. This long data path creates several potential choke points, or bottlenecks, that can reduce the capacity and speed of storage I/O.
Bottlenecks dictate the speed limit of your storage I/O. For example, you may have a very fast storage array, but if the path to that storage array has a bottleneck you are not going to be able to take advantage of the array's speed. On the flip side, you may have a fast connection to your storage array, but if the array is not optimally configured it can become a bottleneck as well. The result is that a bottleneck becomes a funnel that limits the speed of data between your hosts and your storage arrays.
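The funnel effect can be illustrated with a small sketch: the slowest stage in the data path sets the limit for the whole path. The stage names and throughput figures here are hypothetical values chosen only for illustration, not measurements.

```python
# Hypothetical maximum throughput of each stage in the data path, in MB/s.
# These numbers are illustrative assumptions, not real measurements.
path_mb_per_s = {
    "host I/O adapter": 800,
    "storage network": 400,
    "storage array": 1600,
}

def find_bottleneck(stages):
    """Return (stage name, capacity) for the slowest stage in the path."""
    name = min(stages, key=stages.get)
    return name, stages[name]

stage, limit = find_bottleneck(path_mb_per_s)
print(f"Effective limit: {limit} MB/s, set by the {stage}")
```

In this example the array could deliver 1600 MB/s, but the host never sees more than the 400 MB/s the network allows.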
Measuring Storage Array Performance
There are two storage statistics that are good indicators of how a storage array is performing; they are IOPS and latency. Let’s first take a look at IOPS.
IOPS stands for I/O Operations Per Second and it is a common measurement of the performance of a storage device. An I/O operation occurs for every read or write to a disk, so on a busy host there can be thousands of IOPS occurring at any given moment. IOPS can show us how much disk activity is occurring individually on each VM or the combined total for a host datastore. IOPS is an important measurement because every storage device has a limited number of IOPS that it can support. While a number of factors determine how many IOPS a storage device can handle, a rough estimate comes from simple math: take the number of IOPS a single drive can support, which depends mainly on its rotational speed, and multiply it by the number of drives in the RAID group. The higher the rotational speed of a drive, the more IOPS it can handle. A typical 15,000 rpm drive is capable of supporting around 175-210 IOPS; a typical 7,200 rpm drive is only capable of supporting around 75-100 IOPS. So a RAID group consisting of six 15K drives would be capable of around 1,050 IOPS (175 x 6). SSD drives, which are becoming increasingly popular, are not bound by mechanical components and are capable of over 5,000 IOPS.
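The arithmetic above can be sketched in a few lines. This is a rough estimate only: it uses the low end of the per-drive ranges quoted above and ignores RAID penalties, caching and other factors. The function and dictionary names are made up for this example.

```python
# Low-end per-drive IOPS figures quoted in the text above.
PER_DRIVE_IOPS = {
    "15k": 175,   # 15,000 rpm drive
    "7.2k": 75,   # 7,200 rpm drive
}

def raid_group_iops(drive_type, drive_count):
    """Rough raw IOPS estimate: per-drive IOPS times number of drives."""
    return PER_DRIVE_IOPS[drive_type] * drive_count

print(raid_group_iops("15k", 6))   # 1050
print(raid_group_iops("7.2k", 6))  # 450
```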
RAID levels also play a factor in IOPS, as there is a RAID penalty to factor in for additional disk writes that decreases the number of IOPS available. The greater the level of RAID protection, the higher this penalty is, as I/O has to be written to more disks. If the IOPS statistics on your hosts are high, it can indicate that the amount of I/O that is occurring might be greater than the storage device can handle. Re-arranging your workloads so they are balanced evenly across multiple datastores can help eliminate any IOPS hot spots that may be occurring on individual datastores. Sometimes re-architecting your storage configuration by putting more drives in RAID groups can help ensure that the number of IOPS that your VMs are generating does not exceed the number that your storage device is capable of.
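As a rough illustration of the RAID penalty, the sketch below uses commonly cited rule-of-thumb write penalties (the number of back-end disk I/Os generated by each write). These penalty values and the 50% write mix are assumptions for the example, not figures from this article, and real arrays with caching will behave differently.

```python
# Commonly cited rule-of-thumb write penalties: back-end I/Os per write.
# These are assumptions for illustration; actual arrays vary.
WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}

def usable_iops(raw_iops, write_fraction, raid_level):
    """Estimate front-end IOPS once the RAID write penalty is applied.

    Each read costs one back-end I/O; each write costs `penalty` of them.
    """
    penalty = WRITE_PENALTY[raid_level]
    cost_per_io = (1 - write_fraction) + write_fraction * penalty
    return raw_iops / cost_per_io

# Six 15K drives (1,050 raw IOPS) with a hypothetical 50% write workload.
print(round(usable_iops(1050, 0.5, "RAID10")))  # 700
print(round(usable_iops(1050, 0.5, "RAID5")))   # 420
```

The same six drives deliver noticeably fewer usable IOPS under RAID 5 than under RAID 10 when the workload is write-heavy, which is why the RAID level matters when sizing a RAID group.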
Where IOPS is focused on how much disk activity is occurring, latency is focused on how long it takes for a host to read or write data to a storage device. Disk latency is the time it takes for a disk sector to be positioned under the drive head so it can be either read from or written to. Anytime a VM makes a read or write to its virtual disk, that request must follow a long path from the guest OS to the physical storage device. Along that path, bottlenecks can occur at different points as the data goes from the guest OS, to a virtual SCSI adapter, through the VMkernel, to a physical I/O adapter and then across a storage network to get to the destination storage device. The total amount of time it takes I/O to make this trip is referred to as total guest latency and is measured in milliseconds. There are several different latency statistics that combine to make up total guest latency, and they can help pinpoint the part of the storage subsystem in which bottlenecks are occurring. The figure below illustrates the path that data takes to get from the VM to the storage device and shows the different latency statistics that form total guest latency.
Kernel latency – the average amount of time the VMkernel spends processing each SCSI command. This value should be as close to zero as possible and less than 1ms.
Queue latency – the average amount of time each SCSI command spends in the VMkernel queue. This value should also be as close to zero as possible and less than 1ms.
Device latency – the average amount of time it takes the physical device to complete a SCSI command. This is frequently the cause of high latency; depending on the storage device type, this value should be between 0 and 10ms.
All of the latency statistics are further split into two sub-statistics for read and write, so you can see exactly which operation is experiencing latency.
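The guideline values above lend themselves to a simple check. The sketch below flags any statistic that falls outside the thresholds given in the list; the sample readings and the function name are hypothetical, and in practice the values would come from whatever performance monitoring tool you use.

```python
def flag_high_latency(sample_ms):
    """Return the names of latency statistics outside the guideline values.

    sample_ms maps "kernel", "queue" and "device" to latency in milliseconds.
    """
    flagged = []
    if sample_ms["kernel"] >= 1.0:    # should be less than 1ms
        flagged.append("kernel")
    if sample_ms["queue"] >= 1.0:     # should be less than 1ms
        flagged.append("queue")
    if sample_ms["device"] > 10.0:    # should be between 0 and 10ms
        flagged.append("device")
    return flagged

# Hypothetical readings, in milliseconds.
sample = {"kernel": 0.2, "queue": 0.1, "device": 24.0}
print(flag_high_latency(sample))  # ['device']
```

Running the same check separately on the read and write sub-statistics would show which operation is responsible.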