This is Part 7 of my series on vSphere Performance data.
Part 1 discussed the project, Part 2 explored how to retrieve data, and Part 3 covered using Get-Stat for the retrieval. Part 4 looked at the database used to store the retrieved data, InfluxDB. Part 5 showed how data is written to the database, and Part 6 was about creating dashboards to show off the data. This post adds even more data to the project!
One thing I’ve learned from this project is that once you start gathering data, you are always on the lookout for other data sources to add! So far we’ve seen how to retrieve metrics from all VMs in our environment at 20-second intervals. But what about the ESXi hosts running those VMs? And what about the storage arrays they use?
Well, of course we need to add those to the mix!
ESXi hosts
Adding data from the ESXi hosts was quite easy. Since the Get-Stat cmdlet we used to get the stats from VMs also works on hosts, it was only a matter of checking the available metrics and deciding which ones we wanted.
As we saw when exploring stats for the VMs, only a subset of the metrics is available when omitting the -Realtime parameter:
PS C:\> Get-StatType -Entity esx001
cpu.usage.average
cpu.usagemhz.average
cpu.ready.summation
mem.usage.average
mem.swapinRate.average
mem.swapoutRate.average
mem.vmmemctl.average
mem.consumed.average
mem.overhead.average
disk.usage.average
disk.maxTotalLatency.latest
disk.numberReadAveraged.average
disk.numberWriteAveraged.average
net.usage.average
sys.uptime.latest
datastore.numberReadAveraged.average
datastore.numberWriteAveraged.average
datastore.totalReadLatency.average
datastore.totalWriteLatency.average
datastore.datastoreIops.average
datastore.sizeNormalizedDatastoreLatency.average
datastore.siocActiveTimePercentage.average
clusterServices.cpufairness.latest
clusterServices.memfairness.latest
datastore.datastoreReadIops.latest
datastore.datastoreWriteIops.latest
datastore.datastoreReadOIO.latest
datastore.datastoreWriteOIO.latest
datastore.datastoreVMObservedLatency.latest
disk.deviceLatency.average
disk.maxQueueDepth.average
datastore.datastoreMaxQueueDepth.latest
With the -Realtime parameter we get a lot more:
PS C:\> Get-StatType -Entity esx001 -Realtime
cpu.coreUtilization.average
cpu.costop.summation
cpu.demand.average
cpu.idle.summation
cpu.latency.average
cpu.readiness.average
cpu.ready.summation
cpu.reservedCapacity.average
cpu.swapwait.summation
cpu.totalCapacity.average
cpu.usage.average
cpu.usagemhz.average
cpu.used.summation
cpu.utilization.average
cpu.wait.summation
datastore.datastoreIops.average
datastore.datastoreMaxQueueDepth.latest
datastore.datastoreNormalReadLatency.latest
datastore.datastoreNormalWriteLatency.latest
datastore.datastoreReadBytes.latest
datastore.datastoreReadIops.latest
datastore.datastoreReadLoadMetric.latest
datastore.datastoreReadOIO.latest
datastore.datastoreVMObservedLatency.latest
datastore.datastoreWriteBytes.latest
datastore.datastoreWriteIops.latest
datastore.datastoreWriteLoadMetric.latest
datastore.datastoreWriteOIO.latest
datastore.maxTotalLatency.latest
datastore.numberReadAveraged.average
datastore.numberWriteAveraged.average
datastore.read.average
datastore.siocActiveTimePercentage.average
datastore.sizeNormalizedDatastoreLatency.average
datastore.totalReadLatency.average
datastore.totalWriteLatency.average
datastore.write.average
disk.busResets.summation
disk.commands.summation
disk.commandsAborted.summation
disk.commandsAveraged.average
disk.deviceLatency.average
disk.deviceReadLatency.average
disk.deviceWriteLatency.average
disk.kernelLatency.average
disk.kernelReadLatency.average
disk.kernelWriteLatency.average
disk.maxQueueDepth.average
disk.maxTotalLatency.latest
disk.numberRead.summation
disk.numberReadAveraged.average
disk.numberWrite.summation
disk.numberWriteAveraged.average
disk.queueLatency.average
disk.queueReadLatency.average
disk.queueWriteLatency.average
disk.read.average
disk.totalLatency.average
disk.totalReadLatency.average
disk.totalWriteLatency.average
disk.usage.average
disk.write.average
hbr.hbrNetRx.average
hbr.hbrNetTx.average
hbr.hbrNumVms.average
mem.active.average
mem.activewrite.average
mem.compressed.average
mem.compressionRate.average
mem.consumed.average
mem.decompressionRate.average
mem.granted.average
mem.heap.average
mem.heapfree.average
mem.latency.average
mem.llSwapIn.average
mem.llSwapInRate.average
mem.llSwapOut.average
mem.llSwapOutRate.average
mem.llSwapUsed.average
mem.lowfreethreshold.average
mem.overhead.average
mem.reservedCapacity.average
mem.shared.average
mem.sharedcommon.average
mem.state.latest
mem.swapin.average
mem.swapinRate.average
mem.swapout.average
mem.swapoutRate.average
mem.swapused.average
mem.sysUsage.average
mem.totalCapacity.average
mem.unreserved.average
mem.usage.average
mem.vmfs.pbc.capMissRatio.latest
mem.vmfs.pbc.overhead.latest
mem.vmfs.pbc.size.latest
mem.vmfs.pbc.sizeMax.latest
mem.vmfs.pbc.workingSet.latest
mem.vmfs.pbc.workingSetMax.latest
mem.vmmemctl.average
mem.zero.average
net.broadcastRx.summation
net.broadcastTx.summation
net.bytesRx.average
net.bytesTx.average
net.droppedRx.summation
net.droppedTx.summation
net.errorsRx.summation
net.errorsTx.summation
net.multicastRx.summation
net.multicastTx.summation
net.packetsRx.summation
net.packetsTx.summation
net.received.average
net.transmitted.average
net.unknownProtos.summation
net.usage.average
power.energy.summation
power.power.average
power.powerCap.average
rescpu.actav1.latest
rescpu.actav15.latest
rescpu.actav5.latest
rescpu.actpk1.latest
rescpu.actpk15.latest
rescpu.actpk5.latest
rescpu.maxLimited1.latest
rescpu.maxLimited15.latest
rescpu.maxLimited5.latest
rescpu.runav1.latest
rescpu.runav15.latest
rescpu.runav5.latest
rescpu.runpk1.latest
rescpu.runpk15.latest
rescpu.runpk5.latest
rescpu.sampleCount.latest
rescpu.samplePeriod.latest
storageAdapter.commandsAveraged.average
storageAdapter.maxTotalLatency.latest
storageAdapter.numberReadAveraged.average
storageAdapter.numberWriteAveraged.average
storageAdapter.read.average
storageAdapter.totalReadLatency.average
storageAdapter.totalWriteLatency.average
storageAdapter.write.average
storagePath.commandsAveraged.average
storagePath.maxTotalLatency.latest
storagePath.numberReadAveraged.average
storagePath.numberWriteAveraged.average
storagePath.read.average
storagePath.totalReadLatency.average
storagePath.totalWriteLatency.average
storagePath.write.average
sys.resourceCpuAct1.latest
sys.resourceCpuAct5.latest
sys.resourceCpuAllocMax.latest
sys.resourceCpuAllocMin.latest
sys.resourceCpuAllocShares.latest
sys.resourceCpuMaxLimited1.latest
sys.resourceCpuMaxLimited5.latest
sys.resourceCpuRun1.latest
sys.resourceCpuRun5.latest
sys.resourceCpuUsage.average
sys.resourceFdUsage.latest
sys.resourceMemAllocMax.latest
sys.resourceMemAllocMin.latest
sys.resourceMemAllocShares.latest
sys.resourceMemConsumed.latest
sys.resourceMemCow.latest
sys.resourceMemMapped.latest
sys.resourceMemOverhead.latest
sys.resourceMemShared.latest
sys.resourceMemSwapped.latest
sys.resourceMemTouched.latest
sys.resourceMemZero.latest
sys.uptime.latest
vflashModule.numActiveVMDKs.latest
We started out with the following metrics:
- CPU Total capacity and utilization
- CPU Usage MHz and Percentage
- CPU Ready
- CPU Latency
- MEM Capacity and Consumed
- Net Receive and Transmit
- Storage Adapter Read and Write
Then I used the same script already built for the VM polling, modified it a bit, and started polling data.
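To give an idea of what that looks like, here is a minimal sketch of the host polling. The metric names map to the bullets above, but the loop, sample count and variable names are my own illustration and not the exact script from GitHub:

# Minimal host polling sketch, assuming PowerCLI and an existing Connect-VIServer session
$hostMetrics = @(
    "cpu.totalCapacity.average",
    "cpu.utilization.average",
    "cpu.usagemhz.average",
    "cpu.usage.average",
    "cpu.ready.summation",
    "cpu.latency.average",
    "mem.totalCapacity.average",
    "mem.consumed.average",
    "net.received.average",
    "net.transmitted.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average"
)

foreach ($vmhost in Get-VMHost) {
    # Realtime stats come in 20-second samples; 15 samples cover the last 5 minutes
    $stats = Get-Stat -Entity $vmhost -Stat $hostMetrics -Realtime -MaxSamples 15
    # ...transform $stats to Influx line protocol and post it to the database...
}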
And by polling host data into the same database as the VM data, I found the need for adding the “type” metadata tag. Metrics like CPU Ready, Latency and Usage have the same metric names on hosts and VMs and would be treated as the same metric in my graphs, but of course I want the ability to filter on the actual type. We could have chosen to name the measurements differently, but we wanted to keep them consistent.
As I’ve explained before, adding metadata tags to the measurements is really easy. Just note that updating old points isn’t necessarily that easy, so be aware of this when you filter previous data on those tags.
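As an illustration, two hypothetical line protocol points show how the type tag keeps a VM sample and a host sample apart while the measurement name stays the same (the measurement, tag and field names here are examples, not the exact schema used in the project):

cpu_ready,type=VM,name=vm001 value=125 1508328000000000000
cpu_ready,type=ESXiHost,name=esx001 value=480 1508328000000000000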
The full host poller script is up on GitHub.
Storage
We have a few HPE 3Par storage arrays in the environment. These SANs have an API that can be used to poll data. I won’t go into any details on how to use the API in this post, but we are already using it to pull performance data into a traditional SQL database and present it on some dashboards.
As I stated earlier, we want to add more data sources to this project so we can get insights across technologies and components in the environment. With that in mind, I could easily modify my existing script that polls SAN performance stats and build a hashtable of 3Par measurements.
We have chosen to push things like IO read/write, KB read/write, read and write latency, and queue lengths to InfluxDB. These stats are polled at a 5-minute interval. We have also added the capacity / used space metrics, which gives us those numbers in the same dashboards. Although they don’t change that rapidly, they give us some great graphs for capacity planning across the environment.
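A rough sketch of that idea follows. The measurement name, tag values, field names, sample numbers and the Influx endpoint are all placeholders, and the actual 3Par API calls that would produce the values are left out:

# Hypothetical 3Par-to-Influx sketch; the values would come from the 3Par API
$sanStats = @{
    "io_read"       = 1250
    "io_write"      = 890
    "kb_read"       = 48500
    "kb_write"      = 32100
    "read_latency"  = 1.8
    "write_latency" = 2.4
    "queue_length"  = 3
}

# Nanosecond epoch timestamp as expected by InfluxDB
$ts = [long]([DateTimeOffset]::UtcNow.ToUnixTimeMilliseconds()) * 1000000

# Build one line protocol entry per stat and post them to Influx in a single batch
$lines = $sanStats.GetEnumerator() | ForEach-Object {
    "san_perf,array=3par01,type=SAN $($_.Key)=$($_.Value) $ts"
}
Invoke-RestMethod -Method Post -Uri "http://influxserver:8086/write?db=performance" -Body ($lines -join "`n")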
New Dashboards
So with new data we could create new Dashboards 🙂
Some examples of new panels created with this data




With the ability to get an overview of performance statistics on all infrastructure components across the environment, we can quickly investigate and understand what is causing bottlenecks and failures.
One example of this is shown in the Cluster Write / SAN latency panels. After a spike on the storage adapters we see a small spike in SAN latency. We could now add storage traffic from the VMs as well and see the actual VM(s) causing the spikes.
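For a quick ad-hoc check outside Grafana, the same correlation could be pulled straight from the Influx HTTP query API. The measurement and tag names below are assumptions in line with the examples earlier in this post:

# Hypothetical ad-hoc queries comparing host storage adapter writes with SAN write latency
$hostWrites = "SELECT mean(`"value`") FROM `"storageadapter_write`" WHERE `"type`" = 'ESXiHost' AND time > now() - 1h GROUP BY time(1m)"
$sanLatency = "SELECT mean(`"write_latency`") FROM `"san_perf`" WHERE `"type`" = 'SAN' AND time > now() - 1h GROUP BY time(5m)"
foreach ($q in $hostWrites, $sanLatency) {
    Invoke-RestMethod -Uri "http://influxserver:8086/query?db=performance&q=$([uri]::EscapeDataString($q))"
}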
We are already discussing and investigating other sources to pull performance data from, so I suspect we’ll see a lot more data going forward.