Posts

Leverage Windows Analytics for Modern Ops – Part 1

, ,

Windows Analytics

Often, I have come across clients asking to stabilize their environment.  As I have gone from client to client, environment to environment I have noticed a common theme.  What is stability?  How do we define or measure stability?  Often stability is defined by ticket numbers, perspective or which user is the loudest.  None of these measures are actual truths of data.  They are multiple data sources that can be important; However, are not an absolute source of truth.  The next question is usually well how do we measure systemic issues accurately, the answer of course is data, but which data?  In this article we are going to review leveraging, corelating data with Windows Analytics, utilize Power BI to manipulate the data into a dashboard.  At this end of this article you will be able to rely on Modern Ops to drive system stability.

Leverage Windows Analytics for Modern Ops

Windows Analytics uses the Windows Telemetry process which is native in Windows 10 collects various data on the endpoint and uploads the data to Azure Storage for your enterprise Tenant.  There are several useful tables, below are examples of the tables and what we are primarily looking at in each table.

  • DHDriverRelability
    • DriverName: The name of Driver that led to the crash
    • Version: The Version of the Driver
  • DHOSReliabiltiy
    • Manufacturer: Device Manufacturer
    • Model Data: Device Model – System Model from BIOS or MSInfo32
    • Computer: Computer Name
    • KernelModeCrashFreePercentforIndustry: Percent of devices in other organizations that have similar device OS version, build, manufacture, model, etc that have not crashed in last two weeks
  • DHOSCrashData
    • Computer – Computer name which Crashed
    • KernelModeCrashFailureID: Hash of the failure
    • KernelModeCrashBugCheckCode: The Code of the Blue Screen
    • KernelModeCrashCount: Number of times the device crashed with the Failure ID.

Some of this data looks redundant, but key notes from my observations:

  • DriverKernelModeCrashCount for DHDriverReliability seems to be the amount of errors/crashes from the event log but not necessarily blue screens.
  • KernelModeCrashCount for DHOSCrashData seems to be a blue screen crash which includes failed driver, the Crash code & the hash for the failure id of the blue screen

The reason for this theory and this is important is because the amount of volume that is generated by driver crashes/errors in the event log is significantly larger than the volume of blue screens.

  1. For Example, video driver igdkmd64 will show 159 Crash counts from DHOSCrashdata in a 7 day period where the same query using the DHDriverRelability table will show 352 counts

With that being said, lets do some crash hunting. We will start with drivers and using a very simple table view.

DHDriverReliability

We will be focusing on Crash counts, so we will add a simple line to this table

DHDriverReliability | where DriverKernelModeCrashCount >= 1

Leverage Windows Analytics for Modern Ops

Hmmm it looks like igdkmd64.sys is having an issue. A quick google search tells me this is a video driver. Lets further investigate this by clicking the Chart Item > Selcect Pie > Sort by Driver Name > Driver Kernel Crash Mode Count

Leverage Windows Analytics for Modern Ops

Wow, it looks this environment is plagued with video driver issues.

What could be causing this? With using another table we can dig a little deeper

Querying the following (below) will show the actual blue screen bug check code and driver name that appears on the blue screen. The reason why this is important is because a blue screen such as indirectkmd.sys is related to video crashing which could be a domino effect potentially on igkmd64.sys getting flagged

DHOSCrashData | where KernelModeCrashCount >= 1

 

I will repeat the same steps above regarding the pie chart.

Leverage Windows Analytics for Modern Ops

 

The out come shows I need to get to collecting some logs and analyzing the dump files to see what the faulting module is for indirectkmd.sys crashing. With these simple steps you essentially are using data to drive systemic issues to be resolved.

 

Here are some simple queries and business use cases:

1. DHOSCrashData | summarize number=count() by KernelModeCrashFailureId, DriverName, DriverVersion, KernelModeCrashBugCheckCode | sort by number desc | render table

A. Used for tracking total number of blue screen crashes, this gives a good holistic view summarizing crash counts by BSOD

2. DHOSReliability | summarize MyOrgPercentCrashFreeDevices = avg(iff(KernelModeCrashCount >= 1 and KernelModeCrashCount <= 10000, 0, 1)), CommercialAvgPercentCrashFreeDevices = avg(KernelModeCrashFreePercentForIndustry), NumberDevices = dcount(ComputerID) by Manufacturer
| sort by NumberDevices desc | render table

A. This is by far one of my favorites, this shows the stability on devices by looking at the total amount of devices that have not crashed and compares it to devices at other enterprises that are similar models, OS build, and OS patched level. Key business case is this can disprove blaming the “image” by having the ability to state the commercial average is 96% and we are operating at 94%, where as if you are operating at 82% with a commercial average of 96% that would indicate you have some issues that need to be resolved.

3. DHOSReliability| where Model contains "insertmodel"| render table | sort by NumberDevices desc | summarize MyOrgPercentCrashFreeDevices = avg(iff(KernelModeCrashCount >= 1 and KernelModeCrashCount <= 10000, 0, 1)), CommercialAvgPercentCrashFreeDevices = avg(KernelModeCrashFreePercentForIndustry), NumberDevices = dcount(ComputerID) by Manufacturer, Model

A. Break down of Query 2 by specifying a Model

4. DHOSReliability | summarize MyOrgCrashRate = avg(KernelModeCrashCount > 0), CommercialAvgCrashRate = 1-avg(KernelModeCrashFreePercentForIndustry), NumberDevices = dcount(ComputerID) by OSVersion| sort by NumberDevices desc | render table

A. Org Crash Rates compared to commercial crash rates based on the Windows 10 Build

5. DHOSCrashData| summarize sum(KernelModeCrashCount > 0) by TimeGenerated | order by TimeGenerated asc | where sum_ >=50 | where DriverName contains "insertdrivername"

A. Sum of crash counts for a specific driver name by timeline ie 11 days. This is good for trending data.

6. DHDriverReliability | where TimeGenerated > ago(28)| join kind= leftouter (   | where Manufacturer == "Lenovo"  DHOSReliability | project-away ComputerID, DriverPercentCrashFreeDevicesForIndustry, HardwareType | where DriverKernelModeCrashCount >= 1 | project-away OSRevisionNumber, ComputerID, ConfigMgrClientID, KernelModeCrashCount) on Computer

A. Same as Query 5 except not filtering a specific driver and shows the last 28 days of crashes for a particular hardware manufacturer.

The beauty of using windows analytics is the simple nature of the language and the power you can get from hunting down systemic issues and resolving them in your environment. In part 2 of this topic, I will review adding this data into power bi for a self-service dashboard based on some of the above queries. I will post this sometime mid-November. Check back for additional queries and data visualizations

 

Chad Arvay , Chris Buck

 

ALSO CHECK: Bitlocker SSD Vulnerability