vSphere Performance data – Part 5 – The script

This is Part 5 of my series on vSphere Performance data.

Part 1 discussed the project, Part 2 explored how to retrieve data, Part 3 covered using Get-Stat for the retrieval, and Part 4 looked at the database used to store the retrieved data, InfluxDB. This one will do some actual work: retrieve data from vCenter and write it to the database.

The last post showed how to write some data to the InfluxDB performance database through its API. As PowerShell is good at interacting with APIs, that is what I will use for writing the data.

In Part 3 of this series I explored the Get-Stat cmdlet and started working on a script that would extract the metrics I wanted.

Continuing that script I looked at how to write the data through the InfluxDB API.

I realized in the previous post that I shouldn't post the metrics one by one for each of my 4000 VMs, but rather bulk insert them, which the API supports (and recommends for performance reasons).

After testing a couple of different methods of pushing bulk data, I found that since the InfluxDB API uses its "line protocol", where each point is a single line separated by newlines, I could collect my points in an array and post them all to the API in one request.

Metadata tags

I also gave some thought to what metadata I needed. As I discussed in the last post, it's easy to add new metadata tags or stop using existing ones, but if you rely on them for filtering you will run into obstacles.

After some trial and error I ended up with (for now anyways) the following tags:

  • VM Name
  • VM Id
  • Host Name
  • Host Id
  • Cluster Name
  • Cluster Id
  • vCenter Name
  • vCenter Id
  • CompanyCode
  • StatInterval
  • Unit
  • Type (the Type tag was added after a while; I'll explain why in a later post)

Continuing my CPU Ready example from an earlier post, one point could look like:

cpu_ready,type=vm,vm=VM2012,vmid=VirtualMachine-vm-123,companycode=ZZ,host=lab-esx-001,hostid=HostSystem-host-123,cluster=Cluster1,clusterid=ClusterComputeResource-domain-c123,platform=vcenter-001,platformid=vCenter213,unit=percent,statinterval=20 value=3.5 1500142436000000000

A couple of remarks on the metadata: I could omit the Id tags, but thinking a bit ahead, it might be useful to have them as keys when integrating with other sources/apps. The CompanyCode is just a two- or three-letter combination identifying which company a VM belongs to.

So, as I mentioned above, I would put these lines in an array and then post the entire array to the InfluxDB API.

Arrays

Arrays in PowerShell are quite easy to use.

First you create it:

PS C:\> $tbl = @()

Then, add stuff to it:

PS C:\> $tbl
PS C:\> $tbl += "hi"
PS C:\> $tbl += "i"
PS C:\> $tbl += "contain"
PS C:\> $tbl += "stuff"
PS C:\> $tbl
hi
i
contain
stuff
PS C:\> $tbl.Count
4

Ok, that was easy. Now let's fetch some stats.

Retrieve the Stats!

I'm not going to detail every step of my script, but I'll briefly mention some of them. The full script is available over on GitHub.

The script takes a couple of parameters: the vCenter to connect to, the cluster (which is actually optional), and optionally the number of VMs to process and the number to skip. Based on the size of the cluster I have to split the VMs across multiple script jobs, so those last two parameters decide which set of VMs each job processes.

This is done with the -First and -Skip parameters in the Select-Object cmdlet:

if($vmcount -gt 0){
    $vms = $vms | Sort-Object name | Select-Object -First $vmcount -Skip $skip
}

Another parameter is $samples, which decides how many records to retrieve with the Get-Stat cmdlet. It defaults to 15, which corresponds to 5 minutes of 20-second interval metrics (3 samples per minute x 5 minutes).
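
For reference, a minimal sketch of what the parameter block could look like. The parameter names here are assumptions based on the variables used further down, not necessarily the exact ones in the full script:

param(
    [Parameter(Mandatory=$true)]
    [string]$vcenter,           #vCenter to connect to
    [string]$cluster,           #Optional cluster to limit the VM list
    [int]$vmcount = 0,          #Number of VMs to process in this job (0 = all)
    [int]$skip = 0,             #Number of VMs to skip, used to split a cluster across jobs
    [int]$samples = 15,         #Realtime samples to fetch, 15 = 5 minutes of 20 second stats
    [string]$dbserver,          #InfluxDB server name or IP
    [int]$dbserverPort = 8086   #InfluxDB HTTP API port
)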

After building the list of VMs to process I create the array and define the metrics I want to retrieve:

#Table to store data
$tbl = @()

#The different metrics to fetch
$metrics = "cpu.ready.summation","cpu.latency.average","cpu.usagemhz.average","cpu.usage.average","mem.active.average","mem.usage.average","net.received.average","net.transmitted.average","disk.maxtotallatency.latest","disk.read.average","disk.write.average","disk.numberReadAveraged.average","disk.numberWriteAveraged.average"

Then I'll traverse all the VMs with a foreach.

For each VM I first create some variables for the metadata, and then I fetch the stats:

$stats = Get-Stat -Entity $vm -Realtime -MaxSamples $samples -Stat $metrics
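
The metadata variables used as tags further down ($vname, $vid, $hname and so on) are created at the top of the loop. A rough sketch of how they could be populated from the PowerCLI objects (my assumption of the shape, not the exact code from the script):

#Metadata for the current VM, used as tags in the line protocol
$vname        = $vm.Name
$vid          = $vm.Id
$hname        = $vm.VMHost.Name
$hid          = $vm.VMHost.Id
$cname        = $vm.VMHost.Parent.Name
$cid          = $vm.VMHost.Parent.Id
$statinterval = 20                         #Realtime stats come in 20 second intervals
$companycode  = $vm.Name.Substring(0,2)    #Placeholder; in reality this comes from our naming convention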

After getting all the stats for that VM, I loop over them with another foreach.

First I check which instance the stat is for. I discussed multiple instances per metric in Part 3; for now I'm only interested in the aggregated one:

$instance = $stat.Instance

#Metrics will often have values for several instances per entity. Normally they will also have an aggregated instance. We're only interested in that one for now
if($instance -and $instance -ne ""){
    continue
}

Then I create the variables for the stat: the unit (which is also a metadata tag), the actual value, and the timestamp. I created a quick function for calculating the correct Unix nanosecond timestamp, which is referenced at the start of the script.

$unit = $stat.Unit
$value = $stat.Value
$statTimestamp = Get-DBTimestamp $stat.Timestamp

if($unit -eq "%"){
    $unit="perc"
}
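
The Get-DBTimestamp function itself isn't shown here, but a minimal sketch of such a helper could look like this (an assumption on my part, not necessarily the implementation used in the script):

function Get-DBTimestamp {
    param($timestamp = (Get-Date))
    #InfluxDB expects nanosecond precision Unix timestamps by default
    if ($timestamp -is [string]) { $timestamp = [datetime]$timestamp }
    return ([System.DateTimeOffset]$timestamp).ToUnixTimeMilliseconds() * 1000000
}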

Now it's time to do a switch on the MetricId of the stat. This is done to get the correct name for the measurement, because we decided to use different names for the metrics/measurements than the ones vCenter uses. One of the reasons is to align metric names across technologies. If you want, you can just keep the vCenter names and skip this part. For CPU Ready I also calculate the percentage, since it comes as a millisecond summation from vCenter:

switch ($stat.MetricId) {
    "cpu.ready.summation" { $measurement = "cpu_ready";$value = $(($Value / $cpuRdyInt)/$vproc) }
    "cpu.latency.average" {$measurement = "cpu_latency" }
    "cpu.usagemhz.average" {$measurement = "cpu_usagemhz" }
    "cpu.usage.average" {$measurement = "cpu_usage" }
    "mem.active.average" {$measurement = "mem_usagekb" }
    "mem.usage.average" {$measurement = "mem_usage" }
    "net.received.average"  {$measurement = "net_through_receive"}
    "net.transmitted.average"  {$measurement = "net_through_transmit"}
    "disk.maxtotallatency.latest" {$measurement = "storage_latency"}
    "disk.read.average" {$measurement = "disk_through_read"}
    "disk.write.average" {$measurement = "disk_through_write"}
    Default { $measurement = $null }
}
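
The $cpuRdyInt and $vproc variables used in the CPU Ready case are set earlier in the VM loop. The idea is the usual VMware formula, ready % = summation (ms) / (interval (s) x 1000) x 100, averaged over the number of vCPUs. A sketch of what that setup might look like (the exact assignments are my assumption):

#CPU Ready conversion factors, set before the stats loop
$cpuRdyInt = $statinterval * 10    #20 second interval -> 200, turns the ms summation into a percentage
$vproc     = $vm.NumCpu            #Average the ready percentage across the VM's vCPUs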

Finally it's time to add the stat to the array:

if($measurement -ne $null){
    $tbl += "$measurement,type=vm,vm=$vname,vmid=$vid,companycode=$companycode,host=$hname,hostid=$hid,cluster=$cname,clusterid=$cid,platform=$vcenter,platformid=$vcid,unit=$unit,statinterval=$statinterval value=$Value $stattimestamp"
}

After traversing all the stats I do an extra pass to calculate the read and write IOPS values, because these metrics have no aggregated instance and have to be grouped per timestamp and summed. I haven't found a good way to do this inside my foreach yet, but it's on my to-do list.

$stats | Where-Object {$_.MetricId -eq "disk.numberReadAveraged.average"} | Group-Object Timestamp | ForEach-Object {$tbl += "disk_iops_read,type=vm,vm=$vname,vmid=$vid,companycode=$companycode,host=$hname,hostid=$hid,cluster=$cname,clusterid=$cid,platform=$vcenter,platformid=$vcid,unit=iops,statinterval=$statinterval value=$(($_.Group | Measure-Object -Property Value -Sum).Sum) $(Get-DBTimestamp $_.Name)"}
$stats | Where-Object {$_.MetricId -eq "disk.numberWriteAveraged.average"} | Group-Object Timestamp | ForEach-Object {$tbl += "disk_iops_write,type=vm,vm=$vname,vmid=$vid,companycode=$companycode,host=$hname,hostid=$hid,cluster=$cname,clusterid=$cid,platform=$vcenter,platformid=$vcid,unit=iops,statinterval=$statinterval value=$(($_.Group | Measure-Object -Property Value -Sum).Sum) $(Get-DBTimestamp $_.Name)"}

When I've traversed all the stats for all the VMs, I disconnect from vCenter and I'm ready to post the data.

Post data to the API

First I build the URI, which consists of the name or IP of the InfluxDB server and the port number. I get those from parameters to the script itself. The database name is hardcoded in my script for now, but could also easily be passed as a parameter. The post URI needs to be in the form http://influxserver:port/write?db=dbname

#Build URI for the API call
$baseUri = "http://" + $dbserver + ":" + $dbserverPort + "/"
$dbname = "performance"
$postUri = $baseUri + "write?db=" + $dbname

Now I'm good to go and can post my data. All I have to do is join the entries in the array with newlines and pass the result as the body:

#Write data to the API
Invoke-RestMethod -Method Post -Uri $postUri -Body ($tbl -join "`n")

That's all there is to it!

I decided to do some measuring of the jobs themselves as well, so I created another measurement for that; this kind of data is also a perfect fit for a time-series database like InfluxDB.

This measurement is all about the actual script job: which cluster/vCenter it ran against (the target), the name of the server that ran the job, how many VMs it processed, and of course how many seconds the job took.
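
The $start and $runtimespan variables used below are captured at the beginning and the end of the run, something along these lines (a sketch, not the exact script code):

#At the very start of the script
$start = Get-Date

#...connect, fetch stats, build and post the data...

#Right before writing the polling stats
$runtimespan = New-TimeSpan -Start $start -End (Get-Date)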

#Build qry to write stats about the run
$pollStatQry = "pollingstat,poller=$($env:COMPUTERNAME),unit=s,type=vmpoll,target=$($targetname) runtime=$($runtimespan.TotalSeconds),entitycount=$($vms.Count) $(Get-DBTimestamp -timestamp $start)"

#Write data about the run
Invoke-RestMethod -Method Post -Uri $postUri -Body $pollStatQry

Currently there is no error handling on the API calls, which is something I need to look at. That's also a risk when posting all the metrics for a run in one go: if the post fails, the data is lost. For now things are working as they should, so I'll add it to my to-do list.
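
As a starting point, basic error handling could be as simple as wrapping the post in a try/catch and dumping the failed batch to disk for a later retry. This is just a sketch of one possible approach, not something that is in the script today:

try {
    Invoke-RestMethod -Method Post -Uri $postUri -Body ($tbl -join "`n") -ErrorAction Stop
}
catch {
    #Keep the batch so the data isn't lost if InfluxDB is unavailable (path is just an example)
    $tbl -join "`n" | Out-File -FilePath "C:\temp\influx_failed_$(Get-Date -Format yyyyMMddHHmmss).txt"
    Write-Warning "Failed to post data to InfluxDB: $_"
}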

As mentioned, the full script is available over on GitHub.

The script jobs

One thing I haven't discussed yet is how I run the script against my 4000 VMs. This was a big concern of mine. Either way there will be some extra load on vCenter. This could be a separate post (this one is pretty long already...), but I think it's key to understanding how the script was built so here goes...

I had a couple of options. I could use PowerShell jobs or workflows, which would be able to run parallel queries against vCenter. I have previously done some work with those and found them somewhat complex and not that flexible. They also introduce separate scopes inside the script, so I would have to connect to vCenter many times anyway. Therefore I did my initial proof of concept with multiple scheduled tasks running the script against different vCenters and/or clusters, which are passed as parameters.
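
To give an idea, registering one of these scheduled tasks could look something like this (script path, names and parameter values are made up for illustration):

#Register a polling job for a specific vCenter/cluster/VM range
$action  = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -File C:\Scripts\vmpoll.ps1 -vcenter vcenter-001 -cluster Cluster1 -vmcount 180 -skip 0"
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5)
Register-ScheduledTask -TaskName "vmpoll-cluster1-001" -Action $action -Trigger $trigger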

That turned out to work fine; I couldn't see much extra load on vCenter. On a side note, we have been migrating to the 6.5 VCSA during this period, and our vCenters are behaving much better than before, so the timing for this project has been perfect :-)

After a short period of verifying that the extra stress on the environment was acceptable, I gradually added more script tasks. I've also played around with the number of VMs in each job and monitored the runtime accordingly.

What we have running now is at most 180 VMs per job, averaging 3.2-3.5 minutes per run, which is slightly over my estimate of 1 second per VM. I have found that the Windows 2016 servers running these jobs take more time loading the PowerCLI modules than the 2012 servers I used in my initial testing, so the actual time per VM is lower than 1 second. I'll have to investigate that when I find time for it.

I could probably raise the number of VMs per job a bit and get away with fewer script jobs. It's a balance though: while the foreach VM/stat loop runs it stresses the vCPUs quite a lot, so more scripts per server might slow all of them down, whereas fewer scripts need more time to finish.

Running this as scheduled script jobs isn't perfect. I now have 4 different servers running these scripts, and both the servers and all the script jobs have to be managed. For the time being the number of VMs per job is passed as a parameter, and if I want to adjust it I might have to adjust multiple jobs. I might go back and have another look at PowerShell jobs to see if that would be more efficient both performance-wise and management-wise.

Further ahead I have some other ideas that would be really interesting to check out. I would think this kind of compute job could be a perfect thing to containerize...? That will definitely be a different blog post.
