vSphere Performance data – Monitoring VMware vSAN performance

In my blog series on building a solution for monitoring vSphere performance we already have scripts for pulling VM and host performance data. I made some changes to those recently, mainly adding more metrics, for instance for VDI hosts.

This post is about how we added our VSAN environments to the performance monitoring. This became a great deal easier once the Get-VsanStat cmdlet arrived in recent versions of PowerCLI.

We will build with the same components as before: a PowerCLI script pulling data and pushing it to an InfluxDB time-series database, and finally some Grafana dashboards for visualizing it.

This post won’t go into much detail about the different VSAN metrics. There are lots of articles out there that do that.

When you explore VSAN metrics through PowerCLI you’ll find that they don’t necessarily correspond to what you can get from the graphs in the vSphere client UI. William Lam has written a great blog post about it which I encourage you to read. There’s also a GitHub repo with the mappings and sample scripts.

In short, there are two cmdlets we will use: Get-VsanStat and Get-VsanSpaceUsage.

The metrics

In VSAN there are a lot of metrics available on many levels. You’ll have the Cluster, VirtualDisk, VMHost, VirtualMachine, VsanDisk, VsanDiskGroup, VsanIscsiLun and VsanIscsiTarget entities, which all have their own metrics. In our solution we decided to start out with only the Cluster and DiskGroup levels.

What you’ll also notice when dealing with these metrics is that there are different metric types. For instance, the Cluster entity has both Backend and VM Consumption metric types. Again, for more details and reference head over to Mr Lam’s article referenced above.

The metrics we found interesting and wanted to pull were:

Cluster

  • VMConsumption.ReadThroughput
  • VMConsumption.AverageReadLatency
  • VMConsumption.WriteThroughput
  • VMConsumption.AverageWriteLatency
  • VMConsumption.Congestion
  • VMConsumption.OutstandingIO
  • VMConsumption.ReadIops
  • VMConsumption.WriteIops
  • Backend.ResyncReadLatency
  • Backend.ReadThroughput
  • Backend.AverageReadLatency
  • Backend.WriteThroughput
  • Backend.AverageWriteLatency
  • Backend.Congestion
  • Backend.OutstandingIO
  • Backend.RecoveryWriteIops
  • Backend.RecoveryWriteThroughput
  • Backend.RecoveryWriteAverageLatency

Diskgroup

  • Performance.ReadCacheWriteIops
  • Performance.WriteBufferReadIops
  • Performance.ReadCacheReadIops
  • Performance.WriteBufferWriteIops
  • Performance.ReadThroughput
  • Performance.WriteThroughput
  • Performance.AverageReadLatency
  • Performance.AverageWriteLatency
  • Performance.ReadCacheReadLatency
  • Performance.ReadCacheHitRate
  • Performance.WriteBufferFreePercentage
  • Performance.WriteBufferWriteLatency
  • Performance.Capacity
  • Performance.UsedCapacity

For the Cluster we also pull capacity through the Get-VsanSpaceUsage cmdlet.

So, off to building out the scripts.

The scripts

We’ll build different scripts for Cluster and Diskgroup metrics respectively.

In the case of the Cluster script we decided to think of the VSAN cluster in the same way as one of our Storage Arrays where we are already pulling metrics. This means that we already have some measurements in the InfluxDB which we can reuse.

The scripts are built like the scripts pulling the normal metrics: we take some input parameters like the vCenter, Cluster etc. We’ve kept the concept of Targets, as this is used for measuring the run time of the different scripts. The VSAN targets get a “VSAN_” prefix to the target name.
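As a rough sketch of the target naming, assuming the target falls back to the cluster name (or the vCenter name when no cluster is given) and gets the prefix only when it was derived this way. The function name Get-VsanTargetName is made up for this illustration, and the final space replacement is an extra safeguard I've added so the name is safe as an InfluxDB tag value.

```powershell
# Hypothetical helper, not part of the published scripts.
function Get-VsanTargetName {
    param(
        [string]$Targetname,
        [string]$Cluster,
        [string]$VCenter
    )
    if ([string]::IsNullOrEmpty($Targetname)) {
        # Fall back to the cluster name, or the vCenter name when no cluster is given
        $Targetname = if ($Cluster) { $Cluster } else { $VCenter }
        $Targetname = 'VSAN_' + $Targetname
    }
    # Assumption: spaces are replaced so the name can be used as a tag value
    return $Targetname.Replace(' ', '_')
}
```

Calling it with only -Cluster and -VCenter derives the name from the cluster, while an explicit -Targetname is passed through without the prefix.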

When exploring the retrieval of metrics through PowerCLI I found that if you specify the metric(s) you want to retrieve, you won’t always get results. And if you do get results, you might get an error together with them.

We decided to use the * wildcard and retrieve all metrics to get results and no errors. Of course we might get some metrics we don’t need, but we can live with that.

The metrics will be retrieved based on the entity we are working with.
For clusters we will work with the cluster entity directly. Notice that we also pull the Disk usage for the cluster.

The Get-VsanStat cmdlet accepts both a -StartTime and an -EndTime parameter. We’ll only use -StartTime, as -EndTime by default corresponds to “now”. Note that we pull only the last 5 minutes of metrics.
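A minimal sketch of the cluster retrieval could look something like this. It assumes PowerCLI is loaded and a Connect-VIServer session already exists. The wrapper function Get-VsanClusterData and its parameter names are made up for the example; only the Get-VsanStat and Get-VsanSpaceUsage calls themselves reflect what the scripts do.

```powershell
# Hypothetical wrapper; requires VMware PowerCLI and a connected vCenter session.
function Get-VsanClusterData {
    param(
        [Parameter(Mandatory = $true)]$ClusterObj,
        [int]$Minutes = 5
    )
    # Only -StartTime is given, so the end of the window defaults to "now"
    $startTime = (Get-Date).AddMinutes(-$Minutes)

    [pscustomobject]@{
        # The * wildcard pulls every metric for the entity
        Stats = Get-VsanStat -Entity $ClusterObj -Name '*' -StartTime $startTime
        # Capacity/disk usage for the same cluster
        Space = Get-VsanSpaceUsage -Cluster $ClusterObj
    }
}
```

With a cluster object from Get-Cluster you would call it as Get-VsanClusterData -ClusterObj $clusterObj and then process the Stats and Space properties separately.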

For the Diskgroups we will first pull the Diskgroups for the given cluster, and then traverse them and retrieve metrics for each Diskgroup. Notice that we replace whitespace with an underscore, as the InfluxDB line protocol treats an unescaped space as the separator between the tags and the fields.
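The whitespace handling can be sketched roughly as follows. The helper name ConvertTo-InfluxTagValue and the disk group names are made up, and the stand-in objects take the place of what Get-VsanDiskGroup would return in the real script.

```powershell
# Hypothetical helper: make a name safe to use as an InfluxDB tag value
function ConvertTo-InfluxTagValue {
    param([Parameter(Mandatory = $true)][string]$Value)
    # Replace every run of whitespace with a single underscore
    return ($Value -replace '\s+', '_')
}

# Stand-in objects; in the real script these would come from Get-VsanDiskGroup
$diskGroups = @(
    [pscustomobject]@{ Name = 'Disk group 01' },
    [pscustomobject]@{ Name = 'Disk group 02' }
)

foreach ($dg in $diskGroups) {
    $dgTag = ConvertTo-InfluxTagValue -Value $dg.Name
    # The metrics for $dg would be retrieved and processed here
    Write-Output $dgTag
}
```

Keeping the sanitized name in its own variable means the original object can still be used for looking things up in vCenter.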

So, with lots of metrics to process we will use the same logic for processing the metrics and building the array that later will be posted to the InfluxDB API.

One thing to notice is that some of the VSAN metrics use microseconds as a unit while vSphere normally uses milliseconds. We do a conversion on these so the metrics use the same unit in our database.
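The conversion itself is just a division by 1000. A small sketch, with a function name made up for the example:

```powershell
# Hypothetical helper: convert a value in microseconds to milliseconds
function Convert-MicrosecondsToMilliseconds {
    param([Parameter(Mandatory = $true)][double]$Value)
    return $Value / 1000
}

# Example: a latency reported as 1500 microseconds is 1.5 milliseconds
Convert-MicrosecondsToMilliseconds -Value 1500
```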

For each of the metrics we will, as we do in the vSphere stat scripts, use a switch statement on the metric name to give it our own measurement name, and to specify a different unit if needed. As mentioned, we will reuse some of the measurement names that we already have for our other storage arrays. For instance, the Frontend (VMConsumption) ReadThroughput will get the kB_read measurement name.
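A trimmed-down sketch of such a switch might look like the following. Only the kB_read name for VMConsumption.ReadThroughput comes from our setup as described here. The other measurement names, the function name, and the assumption that the latency value arrives in microseconds are placeholders for illustration.

```powershell
# Hypothetical helper: map a VSAN metric name to a measurement name, value and unit
function Get-VsanMeasurement {
    param(
        [Parameter(Mandatory = $true)][string]$MetricName,
        [Parameter(Mandatory = $true)][double]$Value
    )
    $unit = $null
    switch ($MetricName) {
        'VMConsumption.ReadThroughput'  { $measurement = 'kB_read' }
        'VMConsumption.WriteThroughput' { $measurement = 'kB_write' }   # placeholder name
        'VMConsumption.AverageReadLatency' {
            $measurement = 'latency_read'                               # placeholder name
            # Assumption: this metric is reported in microseconds
            $Value = $Value / 1000
            $unit = 'ms'
        }
        # Metrics we don't care about are skipped
        default { return $null }
    }
    return [pscustomobject]@{ Measurement = $measurement; Value = $Value; Unit = $unit }
}
```

Because everything is retrieved with the wildcard, the default branch is what filters out the metrics we don't want to store.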

Finally, after deciding the measurement name and potentially calculating a value and changing the unit, we will add the metric in the correct format to the output array. For more information about this format and how it is used check out the blog series mentioned above.
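To give an idea of the format, here is a small sketch that builds one line of the Influx line protocol (measurement, tags, fields, and a timestamp in nanoseconds). The function names and the tag names in the example are made up, and the timestamp helper uses DateTimeOffset rather than whatever the real scripts use.

```powershell
# Hypothetical helper: nanosecond Unix timestamp for the given time
function Get-InfluxTimestamp {
    param([datetime]$Timestamp = (Get-Date))
    $seconds = [DateTimeOffset]::new($Timestamp.ToUniversalTime()).ToUnixTimeSeconds()
    return [long]$seconds * [long]1000000000
}

# Hypothetical helper: build "measurement,tag=val,... field=val,... timestamp"
function New-InfluxLine {
    param(
        [Parameter(Mandatory = $true)][string]$Measurement,
        [Parameter(Mandatory = $true)][System.Collections.IDictionary]$Tags,
        [Parameter(Mandatory = $true)][System.Collections.IDictionary]$Fields,
        [Parameter(Mandatory = $true)][long]$Timestamp
    )
    # String expansion is used (not -f) so decimals keep a dot regardless of locale
    $tagSet = ($Tags.GetEnumerator() | ForEach-Object { "$($_.Key)=$($_.Value)" }) -join ','
    $fieldSet = ($Fields.GetEnumerator() | ForEach-Object { "$($_.Key)=$($_.Value)" }) -join ','
    return "$Measurement,$tagSet $fieldSet $Timestamp"
}

# Collect the lines in an array that is later joined and posted
$newtbl = @()
$newtbl += New-InfluxLine -Measurement 'kB_read' `
    -Tags ([ordered]@{ type = 'vsan'; san = 'Cluster01' }) `
    -Fields ([ordered]@{ value = 12.5 }) `
    -Timestamp (Get-InfluxTimestamp)
```

Ordered dictionaries keep the tags and fields in the order they were written, which makes the output easier to read when troubleshooting.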

For the cluster script we do also track the capacity/disk usage and add that to the array. Notice that in this case we have only one measurement, vsan_diskusage, and then we add the different values as fields.
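As an illustration of one measurement carrying several fields, something along these lines could be used. The function name, the tag names and the field names are placeholders, and the CapacityGB and FreeSpaceGB property names on the space usage object are an assumption you should verify against the output of Get-VsanSpaceUsage in your environment.

```powershell
# Hypothetical helper: one vsan_diskusage line with the values as separate fields
function New-VsanDiskUsageLine {
    param(
        [Parameter(Mandatory = $true)]$Space,
        [Parameter(Mandatory = $true)][string]$ClusterTag,
        [Parameter(Mandatory = $true)][long]$Timestamp
    )
    # Assumption: the object exposes CapacityGB and FreeSpaceGB
    $used = $Space.CapacityGB - $Space.FreeSpaceGB
    $fields = "capacity=$($Space.CapacityGB),free=$($Space.FreeSpaceGB),used=$used"
    return "vsan_diskusage,san=$ClusterTag,unit=GB $fields $Timestamp"
}

# Stand-in for the object returned by Get-VsanSpaceUsage
$space = [pscustomobject]@{ CapacityGB = 1000; FreeSpaceGB = 400 }
New-VsanDiskUsageLine -Space $space -ClusterTag 'Cluster01' -Timestamp 10000000000
```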

After processing all metrics and the disk usage (for the cluster) we will use Invoke-RestMethod to post the array to the InfluxDB API. Check out my post on doing this for the vSphere metrics for more details on that.
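A minimal sketch of the posting step, assuming an InfluxDB 1.x /write endpoint. The function names, the 30 second timeout and the simple warning on failure are choices made for this example and not necessarily how the published scripts handle it.

```powershell
# Hypothetical helper: build the write URI for the InfluxDB HTTP API
function Get-InfluxWriteUri {
    param(
        [string]$DbServer = 'localhost',
        [int]$DbServerPort = 8086,
        [Parameter(Mandatory = $true)][string]$Database
    )
    return "http://${DbServer}:${DbServerPort}/write?db=$Database"
}

# Hypothetical helper: post all lines in one request, separated by newlines
function Send-InfluxLines {
    param(
        [Parameter(Mandatory = $true)][string]$Uri,
        [string[]]$Lines
    )
    if (-not $Lines -or $Lines.Count -eq 0) { return }   # nothing to post
    try {
        Invoke-RestMethod -Method Post -Uri $Uri -Body ($Lines -join "`n") -TimeoutSec 30 -ErrorAction Stop
    }
    catch {
        # A 400 reply usually means one of the lines is malformed
        Write-Warning "Posting to InfluxDB failed: $($_.Exception.Message)"
    }
}
```

You would then call Send-InfluxLines -Uri (Get-InfluxWriteUri -Database 'statsdemo') -Lines $newtbl, where statsdemo stands in for your own database name.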

The scripts are run as scheduled tasks every 5 minutes. The full scripts are found over on GitHub.

The Dashboards

So, now we pull the VSAN metrics from vCenter and put them in our InfluxDB database. Next we will use Grafana to build some dashboards for VSAN. Check out my previous post on Grafana dashboards for details on how you build dashboards and the different components. Note that all dashboards have been built on Grafana v4 and the row based grid. In v5 you can get more creative with a more fluid design.

We have built four different dashboards for our VSAN environment: three for the cluster entity and one for Diskgroups. The three cluster dashboards are split into an Overview dashboard, one for Backend metrics and one for VMConsumption/Frontend metrics.

First, our Overview dashboard has some single stats at the top and then some graphs with combined Front- and Backend metrics. Notice that this dashboard (and the others) has links to the other VSAN dashboards for easy navigation.

The Front- and Backend dashboards have some additional metrics respectively. Notice that all Cluster dashboards have a VSAN dropdown menu at the top where you can focus on specific clusters.

Cluster Backend

 

Cluster Frontend

 

The Diskgroup dashboard obviously focuses on Diskgroups, and this dashboard has both Diskgroup and Host as available variables in addition to the VSAN cluster. The two extra variables adjust according to the chosen cluster.

Diskgroup

The dashboards are also available on GitHub.

Summary

Hopefully this post has shown how you can easily build out an existing Influx/Grafana solution with additional metrics, or if you are new to Influx/Grafana it has shown how you can start out monitoring your VSAN environment with open-source tools.

Thanks for reading, Happy monitoring!

 

Rudi

Working with Cloud Infrastructure @ Intility, Oslo Norway. Mainly focused on automating Datacenter stuff. All posts are personal

27 thoughts to “vSphere Performance data – Monitoring VMware vSAN performance”

  1. Of all the PS scripts and Grafana dashboards I have found, these look the best and are exactly what I need. However, I am having trouble with your script at the Invoke-RestMethod call, and I am getting this error:

    Invoke-RestMethod : The remote server returned an error: (400) Bad Request.
    At C:\Scripts\vSpherePerfData-master\vsan_diskgroup_performance_poller.ps1:187 char:1
    + Invoke-RestMethod -Method Post -Uri $postUri -Body ($newtbl -join "`n")
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand

    Invoke-RestMethod : {"error":"unable to parse 'pollingstat,poller=ESTEMPSPCLI,unit=s,type=vsanpoll,target=VSAN_Virtual SAN Cluster runtime=25.4821059,entitycount=0 1531501769000000000': invalid field format"}
    At C:\Scripts\vSpherePerfData-master\vsan_diskgroup_performance_poller.ps1:193 char:1
    + Invoke-RestMethod -Method Post -Uri $postUri -Body $pollStatQry
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand

    Any Thoughts or suggestions?

    Thank you

    1. Hi Chris
      Thanks for reading and thanks for your comment.
      I’ve come across the 400 Bad Request reply from the Influx API many times and usually there’s an error in the data you’re sending.

      In your example I believe both API calls result in an error.
      The error message from the last API call, the one writing to the pollingstat measurement, is: {"error":"unable to parse 'pollingstat,poller=ESTEMPSPCLI,unit=s,type=vsanpoll,target=VSAN_Virtual SAN Cluster runtime=25.4821059,entitycount=0 1531501769000000000': invalid field format"}. It seems you have whitespace in your "target" tag, which is not supported.
      The Influx line protocol expects that what follows a whitespace is a field. You need to add some logic to the script that replaces whitespace in your target name (i.e. your cluster name) with a different character, for example an underscore.
      You could do this by adding the following after the assignment of the $target variable (that would be after line 71 in my script available on GitHub):
      $target = $target.replace(" ","_")

      Likewise I would guess that the first API call (with the actual metric data) fails for the same reason, but this time because the cluster tag contains whitespace. Because the $cluster variable is used to retrieve objects from the vCenter, I would preferably have a separate clustername variable where you handle the whitespace, and then change the "$newTbl += .." line to use this variable instead. Alternatively, you could change the cluster variable right before you iterate through the stats, which would be on line 112 in my GitHub script: $cluster = $cluster.replace(" ","_").

      Hope this helps. Please don’t hesitate to reach out if you need any more assistance!

      Regards,
      Rudi

      1. Thank you so much for the reply. I believe I made the correct changes:

        Here:
        #Set targetname if omitted as a script parameter
        if($targetname -eq $null -or $targetname -eq ""){
            if($cluster){
                $targetname = $cluster
            }
            else{
                $targetname = $vcenter
            }
            $Targetname = "VSAN_" + $Targetname
            $Targetname = $Targetname.replace(" ","_")
        }
        #Connect to vCenter

        And

        Here:
        $metricsVsan = "*"
        $cluster = $cluster.replace(" ","_")
        foreach($dg in $diskGroups){

        This is the Output i get:

        Scope    ProxyPolicy    DefaultVIServerMode InvalidCertificateAction DisplayDeprecationWarnings WebOperationTimeoutSeconds
        -----    -----------    ------------------- ------------------------ -------------------------- --------------------------
        Session  UseSystemProxy Multiple            Ignore                   True                       300
        User                    Multiple
        AllUsers                                    Ignore
        1.4173859
        1.3209158
        1.3253615
        1.3062864
        1.3041084
        1.2973426
        1.3249351
        1.2925056
        1.4653921
        1.3024112
        Run took 14.9694984 seconds
        Invoke-RestMethod : The operation has timed out
        At C:\Scripts\vSpherePerfData-master\vsan_diskgroup_performance_poller.ps1:189 char:1
        + Invoke-RestMethod -Method Post -Uri $postUri -Body ($newtbl -join "`n")
        + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo : NotSpecified: (:) [Invoke-RestMethod], WebException
        + FullyQualifiedErrorId : System.Net.WebException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand

        Do you think maybe it’s an issue with my InfluxDB at this point?

        Thanks again for your assistance!! I am really excited to get these working!

        1. Looks like it times out writing to the API? I was looking for commands to test writing remotely to the API and have not had much luck. Any suggestions?

          1. Hi Chris
            It does indeed look like it times out when writing to the database, but from your first comment it seemed that you had a connection to it. Is it running on the same machine?
            Are you able to write to it manually from the same machine running the script? You could create a simple line for a test measurement, e.g.
            Invoke-RestMethod -Uri $postUri -Method Post -Body "apitest,exampletag=test value=10"

            Another possibility is to run the script manually and post only one line just to test. Instead of doing -Body ($newtbl -join "`n") you could do -Body $newtbl[0]

  2. RUDI,

    I couldn’t reply to your post so I had to start a new one. The database is not on the same server I am running this script on. I am running the script on a 2012R2 server and connecting to an Ubuntu 16.04 server running Influx/Grafana.

    1. Ok, Chris.
      Did you try the Invoke-RestMethod manually? Please also verify that the InfluxDB service is running and that there is no firewall on the Ubuntu server blocking the API call.

      1. RUDI,

        I ran this: Invoke-RestMethod -Uri $postUri -Method Post -Body "apitest,exampletag=test value=10" and got the timed out error.

        I ran this from the Ubuntu Server: curl -sL -I localhost:8086/ping
        This was the result:
        HTTP/1.1 204 No Content
        Content-Type: application/json
        Request-Id: 056cb698-8862-11e8-8c31-000000000000
        X-Influxdb-Build: OSS
        X-Influxdb-Version: 1.6.0
        X-Request-Id: 056cb698-8862-11e8-8c31-000000000000
        Date: Sun, 15 Jul 2018 19:05:21 GMT

        The firewall status is Inactive. I apologize for all the questions, I feel like I’m so close to getting this to work.

          1. UGH! I am still having an issue. I think it’s fixed and then I can’t run it without an error. Here is my whole script, do you mind taking a look? This is with the changes you suggested:

            <#
            .SYNOPSIS
            Script for pulling performance metrics from vCenter VSAN diskgroups and writing to an Influx database
            .DESCRIPTION
            The script will pull performance metrics for VSAN Diskgroups from vCenter and writes to an Influx
            timeseries database.
            .NOTES
            Author: Rudi Martinsen / Intility AS
            Created: 02/04-2018
            Version 0.1.1
            Revised: 06/04-2018
            Changelog:
            0.1.1 — Cleaned unused variables (aa362)
            0.1.0 — Fork from Host poller (aa362)
            .LINK
            http://www.rudimartinsen.com/2018/04/06/vsphere-performance-data-monitoring-vmware-vsan-performance/
            .PARAMETER VCenter
            The vCenter to connect to
            .PARAMETER Cluster
            The Cluster to get Hosts from. If omitted all Hosts in the vCenter will be fetched
            .PARAMETER Targetname
            Optional name of the target for use as a Tag in the Influx record
            .PARAMETER DBServer
            IP Address or hostname of the Influx Database server
            .PARAMETER DBServerPort
            TCP port for the DB server, Defaults to 8086 which is the default Influx port
            .PARAMETER LogFile
            Path to the logfile to write to
            #>
            param(
                #[Parameter(Mandatory=$true)]
                $VCenter = "vdi-vcenterb01.domain.com",
                #[Parameter(Mandatory=$true)]
                $Cluster = "Virtual SAN Cluster",
                $Targetname,
                $Dbserver = "localhost",
                $DbserverPort = 8086,
                $LogFile
            )
            #Function to get the correct timestamp format for Influx
            function Get-DBTimestamp($timestamp = (get-date)){
                if($timestamp -is [system.string]){
                    $timestamp = [datetime]::ParseExact($timestamp,'dd.MM.yyyy HH:mm:ss',$null)
                }
                return $([long][double]::Parse((get-date $($timestamp).ToUniversalTime() -UFormat %s)) * 1000 * 1000 * 1000)
            }

            if(!$LogFile){
                $scriptDir = Split-Path -Path $MyInvocation.MyCommand.Definition -Parent
                $LogFile = "$scriptDir\log\vcpoll_vsan_diskgroup.log"
            }
            $start = Get-Date

            #Import PowerCLI
            Import-Module VMware.VimAutomation.Core
            Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -ParticipateInCeip:$false -Scope Session -Confirm:$false

            #Vstatinterval is based on the realtime performance metrics gathered from vCenter which is 20 seconds
            $statInterval = 300

            #Set targetname if omitted as a script parameter
            if($targetname -eq $null -or $targetname -eq ""){
                if($cluster){
                    $targetname = $cluster
                }
                else{
                    $targetname = $vcenter
                }
                $Targetname = "VSAN_" + $Targetname
                $Targetname = $Targetname.replace(" ","_")
            }

            #Connect to vCenter
            try {
                $vc_conn = Connect-VIServer $vcenter -ErrorAction Stop -ErrorVariable vcerr
                $vcid = $vc_conn.InstanceUuid
            }
            catch {
                Write-Output "$(Get-Date) : Couldn't connect to vCenter $vCenter. Script was started at $start" | Out-File $LogFile -Append
                Write-Output "$(Get-Date) : Error message was: $($vcerr.message)" | Out-File $LogFile -Append
                break
            }

            if($cluster){
                $clusterObj = Get-Cluster $cluster -ErrorAction Stop -ErrorVariable clustErr #| Get-VMHost -Server $vcenter | Where-Object {$_.ConnectionState -ne "NotResponding"}
            }
            else{
                Write-Output "$(Get-Date) : Couldn't get cluster" | Out-File $LogFile -Append
                Write-Output "$(Get-Date) : Error message was: $($clustErr.message)" | Out-File $LogFile -Append
                break
            }

            if(!$clusterObj){
                Write-Output "$(Get-Date) : Cluster not found. Exiting..." | Out-File $LogFile -Append
                break
            }

            $diskGroups = Get-VsanDiskGroup -Cluster $cluster

            #Table to store data
            $newtbl = @()

            #The different metrics to fetch. Because some errors with specifying the metrics we're using the wildcard instead..
            #$metricsVsan = "Performance.ReadCacheWriteIops","Performance.WriteBufferReadIops","Performance.ReadCacheReadIops","Performance.WriteBufferWriteIops","Performance.ReadThroughput","Performance.WriteThroughput","Performance.ReadCacheReadLatency","Performance.AverageReadLatency","Performance.AverageWriteLatency","Backend.ReadThroughput","Backend.AverageReadLatency","Backend.WriteThroughput","Backend.AverageWriteLatency","Backend.Congestion","Backend.OutstandingIO","Backend.RecoveryWriteIops","Backend.RecoveryWriteThroughput","Backend.RecoveryWriteAverageLatency"
            $metricsVsan = "*"

            foreach($dg in $diskGroups){

            }

            #Disconnect from vCenter
            Disconnect-VIServer $vcenter -Confirm:$false

            #Calculate runtime
            $stop = get-date
            $runTimespan = New-TimeSpan -Start $start -End $stop

            Write-Output "Run $run took $($runTimespan.TotalSeconds) seconds"

            #Build URI for the API call
            $baseUri = "http://" + $dbserver + ":" + $dbserverPort + "/"
            $dbname = "statsdemo"
            $postUri = $baseUri + "write?db=" + $dbname

            ####TODO: Test API access / error handling

            #Write data to the API
            Invoke-RestMethod -Method Post -Uri $postUri -Body ($newtbl -join "`n")

            #Build qry to write stats about the run
            $pollStatQry = "pollingstat,poller=$($env:COMPUTERNAME),unit=s,type=vsanpoll,target=$($targetname) runtime=$($runtimespan.TotalSeconds),entitycount=$($vmhosts.Count) $(Get-DBTimestamp -timestamp $start)"

            #Write data about the run
            Invoke-RestMethod -Method Post -Uri $postUri -Body $pollStatQry

        1. RUDI,

          OK, I got it to run, but not against the remote server; I may just have to rebuild that. I started over with the script and installed InfluxDB locally on the Windows server. I was able to run the script against the local InfluxDB 🙂 I plan to use the rest of your scripts from GitHub.

    1. Hi Chris
      So the 400 error is normally caused by something wrong in the data that is posted. It could be just one of the lines, while the others get posted. Do you have any rows/points in Influx?

      Can you add this before you post and show me a screen shot of the output:
      Write-Output ($newtbl -join "`n")

        1. Ah, so the cluster name is also used for the «SAN» tag, so either you need to remove that or do the same as we did with the $cluster variable (the replace stuff).

          1. Ok so i did this under

            #Build variables for "metadata"
            I changed this: $san = $clusterObj.Name to
            This:
            $san = $clusterObj.Name.Replace(" ","_")

            And this is the Write-Output ($newtbl -join "`n")

            http://txt.do/dzzhz

      1. Invoke-RestMethod -Uri $postUri -Method Post -Body "apitest,exampletag=test value=10" does work.

        On a side note, i have the vsan_cluster_performance_poller.ps working.

        I am going to start over on the diskgroup_performance_poller.ps1 and do one change at a time.

  3. Ok i have some interesting info for you: If i run the vsan_cluster_performance_poller.ps1, it will fail with this:

    PS C:\Scripts\Working VSAN Scripts> C:\Scripts\Working VSAN Scripts\vsan_cluster_performance_poller – A.ps1

    Scope    ProxyPolicy    DefaultVIServerMode InvalidCertificateAction DisplayDeprecationWarnings WebOperationTimeoutSeconds
    -----    -----------    ------------------- ------------------------ -------------------------- --------------------------
    Session  UseSystemProxy Multiple            Ignore                   True                       300
    User                    Multiple
    AllUsers                                    Ignore
    You cannot call a method on a null-valued expression.
    At C:\Scripts\Working VSAN Scripts\vsan_cluster_performance_poller – A.ps1:111 char:1
    + $stats = Get-VsanStat -Entity $clusterObj -Name $metricsVsan -StartTime $lapStar …
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvokeMethodOnNull

    Run took 13.9598983 seconds

    Then i run diskgroup_performance_poller.ps1 and that one will still fail as usual, however if i then run:
    vsan_cluster_performance_poller.ps1
    in the same PS terminal then it will be successful and i will get data in Grafana.

    I created two scheduled tasks and they show as run successfully in the Task Scheduler, which of course they do since it doesn’t register that there is an error, but I get no info in Grafana.

    This only works when I use Windows PowerShell ISE.

  4. Hi Chris
    I would guess that in the first case you didn’t have the $cluster variable set, or the script couldn’t find the specified cluster, hence you have a null value when querying for stats.
    After the diskgroup script has run, the cluster variable is set, and when you run the script again in the same console it will succeed.

    Try to run the scripts in separate terminals, but be sure that you run them with the proper input parameters if they’re not set in the script.

    I would also suggest that you try to run the diskgroup script with $newtbl[0] in the Invoke-RestMethod call to check if you can post something at all.

    As I said before, the 400 Bad Request is normally because there is something wrong in the values you send: either a whitespace or another illegal character for Influx input.
    It might be an idea to output $newtbl to a file so you can check the input.

    1. Thanks for the reply, i did finally get vsan_cluster_performance to work on its own.

      #Get the stats
      $stats = Get-VsanStat -Entity $clusterObj -Name $metricsVsan -StartTime $lapStart.AddMinutes(-5)
      $space = Get-VsanSpaceUsage -Cluster $cluster

      The above line did not have a variable of $lapstart so i created a $lapstart variable and placed it with the $start variable. I guess i could have just changed $lapstart to start??

      $start = Get-Date
      $lapStart = Get-Date

      #Import PowerCLI

  5. Yes. And those are used to calculate the amount of time the script uses and that is also pushed to the InfluxDB in the final Invoke-Restmethod. You won’t need that just to see the performance metrics

    Great that you finally got things to work!
