This is part 2 of my vSphere Performance Data series. Part 1 described the project and my goals for it.
This post will be my thoughts on retrieving performance data from our vSphere environment. As I described in part 1, our environment consists of 100+ hosts and 4000+ VMs, hosted in 3 different vCenters in the same SSO domain (Enhanced Linked Mode). All hosts and vCenters are at version 6. We are in the process of upgrading vCenter to 6.5, as I've talked about in a previous post.
Currently we are using Turbonomic in our environment. Its main purpose is to balance the load in our environment, and to do that it retrieves performance data from the vCenters. Since it retrieves and presents this data, we use it as our source of perf data in our in-house portals and dashboards. I've blogged about this on the Turbonomic community site previously, where I talk about how I use their API in our dashboards.
Turbonomic stores its “real-time” data in 10-minute intervals, and the trend data is stored hourly with averages and peaks.
One of the goals of this project is to get closer to real-time so we don't lose the spikes that can be difficult to spot at wider intervals.
With this we have the following sources for retrieving data:
- The individual ESXi hosts
- The vCenter servers
- The individual VMs
Depending on the retrieval method the intervals can be quite different, with Turbonomic at 10 minutes and the ESXi hosts at 5 seconds.
When retrieving data we want to get as close to real-time as possible, while putting as little extra load on the environment as possible, all balanced with a nice-to-have vs need-to-have mentality. As I mentioned in part 1, the number of records stored in the performance DB can be heavily impacted by the number of different metrics retrieved. This will also have to be considered along with the intervals.
Tool(s) for retrieval
In part 1 I mentioned PowerCLI as the tool for retrieving the data. It might be that one of the other vSphere APIs/SDKs would be more efficient, but I have very little experience with those. I tried to use the .NET SDK earlier but gave up. With the new REST APIs in 6.5 this might change. We have a LAB vCenter on 6.5, and I will test the API and see how it compares to using PowerCLI.
So the big challenge will be to retrieve data from 4000 VMs at intervals below 10 minutes, which is our current interval in Turbonomic.
vCenter as source – Get-Stat
Previously I have done some tests with PowerCLI and the Get-Stat cmdlet. I fetched the 20-second intervals for the last 5 minutes and calculated the average and the max value. This was done for around 10 different metrics. With some tweaking I got down to ~1 second per VM. With 4000 VMs this would be 4000 seconds to get the data I want. To get this done within the 5-minute (300-second) interval I would need 4000 / 300 = 13.3 parallel queries. I would round up to 14 and do batches of 300 VMs to have some headroom.
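The per-batch work I'm describing could be sketched roughly like this in PowerCLI. This is a minimal sketch, not my final script: the metric names are an example subset, the batch size of 300 is the one from the calculation above, and it assumes an existing Connect-VIServer session.

```powershell
# Sketch: pull ~5 minutes of 20-second realtime samples for one batch of VMs
# and aggregate average and peak per VM/metric.
# Assumptions: an active Connect-VIServer session; metric names are examples.
$metrics = 'cpu.usage.average', 'mem.usage.average', 'net.usage.average'

# One batch of ~300 VMs (in the real setup, 14 such batches run in parallel)
$vms = Get-VM | Select-Object -First 300

# -Realtime gives 20s samples; 15 samples covers the last ~5 minutes
$stats = Get-Stat -Entity $vms -Stat $metrics -Realtime -MaxSamples 15

$stats |
    Group-Object -Property { $_.Entity.Name }, MetricId |
    ForEach-Object {
        $measure = $_.Group | Measure-Object -Property Value -Average -Maximum
        [pscustomobject]@{
            VM        = $_.Group[0].Entity.Name
            Metric    = $_.Group[0].MetricId
            Average   = [math]::Round($measure.Average, 2)
            Peak      = $measure.Maximum
            Timestamp = Get-Date
        }
    }
```

The parallelism itself (14 concurrent batches) would sit around this, for instance via runspace pools, which is where most of the tuning effort would go.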
What I’m worried about with this approach is the extra load it would put on the vCenter servers. With that I started exploring pulling data directly from the hosts.
The hosts as source – Get-ESXTop
I knew about the Get-ESXTop PowerCLI cmdlet, and I used LucD’s posts on his exploration of that cmdlet as my starting point.
Luc has, as always, done a great job and has explained how to use the cmdlet much better than I ever could, so please check his posts for further details on the cmdlet:
I quickly found that extracting metrics from the hosts was quite fast; I could get all metrics from a host in under a second. The catch (there's always a catch) is that you get a LOT of stuff, and it doesn't necessarily correspond to the metrics you are used to from the vCenter performance graphs. The host approach seems to give me the ability to extract data quickly and with minimal load on the environment, but it puts extra load on the data processing to get what we need. For instance, the different metrics are keyed by an ID as a pointer to the VM. To map the ID to a VM name I could do a complex grouping of the extracted values, or I could do as LucD does and use the Get-EsxCli cmdlet to get a list of names and IDs for filtering and grouping the ESXTOP data. I'm not entirely sure I would be able to make this data processing as efficient as it needs to be in order to stay inside the 5-second intervals needed to use the ESXTOP data.
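The ID-to-name mapping idea could look something like the sketch below. The host name is a placeholder, and the counter name (`SchedGroup`) and its field names are from memory of LucD's posts; they vary between ESXi builds, so treat them as assumptions rather than a working recipe.

```powershell
# Sketch: build a cartel-ID -> VM name lookup via Get-EsxCli, then use it
# to label ESXTOP scheduler-group rows. Host name is hypothetical; the
# 'SchedGroup' counter and its properties are assumptions from LucD's posts.
$esx    = Get-VMHost -Name 'esx01.example.local'
$esxcli = Get-EsxCli -VMHost $esx -V2

# 'esxcli vm process list' returns one entry per running VM
$vmLookup = @{}
foreach ($proc in $esxcli.vm.process.list.Invoke()) {
    $vmLookup[[string]$proc.VMXCartelID] = $proc.DisplayName
}

# Pull a counter and keep only the rows that belong to a VM group,
# e.g. groups named 'vm.<cartelID>'
$groups = Get-EsxTop -Server $esx -CounterName SchedGroup
foreach ($g in $groups) {
    if ($g.GroupName -match '^vm\.(\d+)$' -and $vmLookup.ContainsKey($Matches[1])) {
        [pscustomobject]@{
            VM    = $vmLookup[$Matches[1]]
            Group = $g.GroupName
        }
    }
}
```

Even with the lookup in place, the filtering and grouping over the full counter output is the part I doubt can be made fast enough for a 5-second cycle.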
After spending several hours struggling to map the output from Get-ESXTop to the “normal” vCenter perf graphs and trying to understand how to interpret the data, I gave up. In addition to the need to translate the metrics extracted from ESXTOP, some of the metrics you see in vCenter can be cumulative or delta values, meaning you have to compare the extracted value to the previous reading. This would in turn give me a challenge, as I would have to run 2 extracts to get 2 trailing 5-second readings. And I would still have to make the data processing as efficient as possible.
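For the cumulative counters, the two-extract approach would boil down to something like this sketch. The counter name and field names here are placeholders to illustrate the delta calculation, not the actual ESXTOP schema.

```powershell
# Sketch: cumulative counters only make sense as a delta between two reads,
# so take two snapshots ~5 seconds apart and diff them.
# 'VCPU', 'VCPUID' and 'UsedTimeInUsec' are placeholder names, not the
# verified ESXTOP schema.
$first = Get-EsxTop -Server $esx -CounterName VCPU
Start-Sleep -Seconds 5
$second = Get-EsxTop -Server $esx -CounterName VCPU

foreach ($row in $second) {
    $prev = $first | Where-Object { $_.VCPUID -eq $row.VCPUID }
    if ($prev) {
        [pscustomobject]@{
            VCPUID    = $row.VCPUID
            # usage accumulated during the ~5-second window
            UsedDelta = $row.UsedTimeInUsec - $prev.UsedTimeInUsec
        }
    }
}
```

Doubling every extract like this is exactly the overhead that pushed me away from the host-based approach.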
Even with LucD’s explanations and his efforts with wrappers and modules, I ultimately decided to revert to using Get-Stat against vCenter. I have had some discussions with a couple of colleagues, and we all agreed that this project will be a work in progress: we should keep it simple at first to get some results, and then improve as we find pain points and things that could be done better.
With that I dug out my notes and the scripts I worked on when I looked at Get-Stat previously, and started fresh. The next post in the series will be about the Get-Stat approach.