vSphere Performance data – Part 7 – More data

This is Part 7 of my series on vSphere Performance data.

Part 1 discusses the project, Part 2 explores how to retrieve data, and Part 3 covers using Get-Stat for the retrieval. Part 4 talked about InfluxDB, the database used to store the retrieved data. Part 5 showed how data is written to the database, and Part 6 was about creating dashboards to show off the data. This post adds even more data to the project!

One thing I’ve learned from this project is that when you gather data you are always on the lookout for other data sources to add! Until now we’ve seen how to retrieve metrics from all VMs in our environment at 20-second intervals. But what about the ESXi hosts running those VMs? And what about the storage arrays they use?

Well, of course we need to add those to the mix!

ESXi hosts

Adding data from the ESXi hosts was quite easy. Since the Get-Stat cmdlet we used to get stats from the VMs also works on hosts, it was just a matter of checking the available metrics and deciding which ones we wanted.

As we saw when exploring stats for the VMs, only a subset of metrics is available when the -Realtime parameter is omitted.

With the -Realtime parameter we get a lot more.
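As a quick sketch of the difference, Get-StatType lists the available counters for an entity (this assumes an existing Connect-VIServer session, and the host name is hypothetical):

```powershell
# Assumes an existing Connect-VIServer session; the host name is hypothetical
$esx = Get-VMHost -Name 'esxhost01'

# Metrics available from the rolled-up historical intervals
Get-StatType -Entity $esx

# The 20-second realtime set is considerably larger
Get-StatType -Entity $esx -Realtime
```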

We started out with the following metrics:

  • CPU Total capacity and utilization
  • CPU Usage MHz and Percentage
  • CPU Ready
  • CPU Latency
  • MEM Capacity and Consumed
  • Net Receive and Transmit
  • Storage Adapter Read and Write

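A minimal sketch of what the host polling call might look like. The counter names below are my assumptions for the metrics listed above — verify them against your own hosts with Get-StatType before relying on them:

```powershell
# Counter names are assumptions -- check them with Get-StatType on your hosts
$metrics = @(
    'cpu.totalmhz.average'          # CPU total capacity
    'cpu.usagemhz.average'          # CPU usage in MHz
    'cpu.usage.average'             # CPU usage in percent
    'cpu.ready.summation'           # CPU ready
    'cpu.latency.average'           # CPU latency
    'mem.consumed.average'          # memory consumed
    'net.received.average'          # network receive rate
    'net.transmitted.average'       # network transmit rate
    'storageadapter.read.average'   # storage adapter read rate
    'storageadapter.write.average'  # storage adapter write rate
)

# Pull the last 5 minutes of realtime (20-second) samples for every host
$stats = Get-VMHost | Get-Stat -Stat $metrics -Realtime -MaxSamples 15
```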
Then I used the same script already built for the VM polling, modified it a bit, and started polling data.

And by polling host data into the same database as the VM data, I found the need to add a “type” metadata tag. Things like CPU Ready, Latency and Usage have the same metric names on hosts and VMs and would be treated as the same metric in my graphs, but of course I want the ability to filter on the actual type. We could have chosen to name the measurements differently, but we wanted to keep them consistent.
As I’ve explained before, adding metadata tags to the measurements is really easy. Just note that updating old points isn’t necessarily that easy, so be aware of this when you filter previous data on those tags.
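To illustrate, this is roughly what an InfluxDB line-protocol point looks like with the extra tag added. The measurement and tag names here are illustrative, not the exact ones from my poller script:

```powershell
# Builds an InfluxDB line-protocol point: measurement,tags field timestamp
# Measurement and tag names are illustrative, not the poller's actual names
function New-InfluxPoint {
    param($Measurement, $Name, $Type, $Value, $Timestamp)
    "$Measurement,name=$Name,type=$Type value=$Value $Timestamp"
}

New-InfluxPoint -Measurement 'cpu_ready' -Name 'esxhost01' -Type 'host' `
                -Value 1.25 -Timestamp 1500000000000000000
# -> cpu_ready,name=esxhost01,type=host value=1.25 1500000000000000000
```

With the same measurement name for hosts and VMs, a dashboard query can then filter on `type = 'host'` or `type = 'vm'` as needed.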

The full host poller script is up on GitHub.

Storage

We have a few HPE 3Par storage arrays in the environment. These SANs expose an API that can be used to poll data. I won’t go into any details on how to use the API in this post, but we are already using it to pull performance data into a traditional SQL database and present it on some dashboards.

As I stated earlier, we want to add more data sources to this project so we can get insights across technologies and components in the environment. With that in mind, I could easily modify my existing SAN performance polling script and create a hashtable of 3Par measurements.

We have chosen to push things like IO Read/Write, KB Read/Write, Read and Write latency and queue lengths to InfluxDB. These stats are polled at a 5-minute interval. We have also added the Capacity / Used space metrics, which gives us those numbers in the same dashboards. Although these don’t change that rapidly, they give us some great graphs for capacity planning across the environment.
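As an illustration of the hashtable approach, it could look something like this — the keys are placeholders, not the actual field names returned by the 3Par API:

```powershell
# Maps SAN stats to Influx measurement names; the keys are placeholders,
# not the actual field names from the 3Par API response
$sanMetrics = @{
    IoRead       = 'san_io_read'
    IoWrite      = 'san_io_write'
    KbRead       = 'san_kb_read'
    KbWrite      = 'san_kb_write'
    LatencyRead  = 'san_latency_read'
    LatencyWrite = 'san_latency_write'
    QueueLength  = 'san_queue_length'
}
```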


New Dashboards

So with new data we could create new Dashboards 🙂

Some examples of new panels created with this data:

Host cluster performance


SAN Performance


Compare Cluster write and SAN latency


SAN Capacity

With the ability to get an overview of performance statistics on all infrastructure components across the environment we can quickly investigate and understand what is causing bottlenecks and failures.
One example of this is shown in the Cluster write / SAN latency comparison. After a spike on the storage adapter, we see a small spike in SAN latency. We could now add storage traffic from the VMs as well and see the actual VM(s) causing the spikes.

We are already discussing and investigating other sources to pull performance data from, so I suspect we’ll see a lot more data going forward.


Rudi

Working with Cloud Infrastructure @ Intility, Oslo Norway. Mainly focused on automating Datacenter stuff. All posts are personal
