HinvChangeLog – The 600gb DB Backlog
Recently I was investigating why a Configuration Manager database was consuming what appeared to be an excessive amount of space. This environment was new to me, and I knew there was a lot of custom hardware inventory going on. But my gut said even a lot of custom inventory could not be leading to this large of a database. Typically I run a few queries to gather information about the database. These are just queries sourced off the internet; I did not write them, but I do keep them handy!
-- Database file size
SELECT [TYPE] = A.TYPE_DESC
      ,[FILE_Name] = A.name
      ,[FILEGROUP_NAME] = fg.name
      ,[File_Location] = A.PHYSICAL_NAME
      ,[FILESIZE_MB] = CONVERT(DECIMAL(10,2), A.SIZE/128.0)
      ,[USEDSPACE_MB] = CONVERT(DECIMAL(10,2), A.SIZE/128.0 - ((SIZE/128.0) - CAST(FILEPROPERTY(A.NAME, 'SPACEUSED') AS INT)/128.0))
      ,[FREESPACE_MB] = CONVERT(DECIMAL(10,2), A.SIZE/128.0 - CAST(FILEPROPERTY(A.NAME, 'SPACEUSED') AS INT)/128.0)
      ,[FREESPACE_%] = CONVERT(DECIMAL(10,2), ((A.SIZE/128.0 - CAST(FILEPROPERTY(A.NAME, 'SPACEUSED') AS INT)/128.0) / (A.SIZE/128.0)) * 100)
      ,[AutoGrow] = 'By ' + CASE is_percent_growth
                                 WHEN 0 THEN CAST(growth/128 AS VARCHAR(10)) + ' MB -'
                                 WHEN 1 THEN CAST(growth AS VARCHAR(10)) + '% -'
                                 ELSE '' END
                  + CASE max_size
                         WHEN 0 THEN 'DISABLED'
                         WHEN -1 THEN ' Unrestricted'
                         ELSE ' Restricted to ' + CAST(max_size/(128*1024) AS VARCHAR(10)) + ' GB' END
                  + CASE is_percent_growth
                         WHEN 1 THEN ' [autogrowth by percent, BAD setting!]'
                         ELSE '' END
FROM sys.database_files A
LEFT JOIN sys.filegroups fg ON A.data_space_id = fg.data_space_id
ORDER BY A.TYPE DESC, A.NAME;
With this query I was able to see that the database was split into 10 files totaling approximately 970gb of used space. That’s quite a bit of data! This environment does have a fair number of clients in it, but I don’t think it is enough to justify this amount of data.
So if I don’t quite think the database should be this big, let’s identify WHAT is causing this!
WHAT Is Causing This!
Typically the majority of the data in an SCCM database is Hardware Inventory. It can be useful to determine which parts of Hardware Inventory are taking up the most space in your database. Another query that I did not write can help get this information for us.
-- Table size sorted by total space in MB
SELECT t.name AS TableName,
       i.name AS indexName,
       SUM(p.rows) AS RowCounts,
       (SUM(a.total_pages) * 8) / 1024 AS TotalSpaceMB,
       (SUM(a.used_pages) * 8) / 1024 AS UsedSpaceMB,
       (SUM(a.data_pages) * 8) / 1024 AS DataSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.name NOT LIKE 'dt%'
  AND i.object_id > 255
  AND i.index_id <= 1
GROUP BY t.name, i.object_id, i.index_id, i.name
ORDER BY 6 DESC
Depending on what all is enabled in your environment, I’d typically expect to see INSTALLED_EXECUTABLE_DATA, INSTALLED_SOFTWARE_DATA, or maybe RecentlyUsedApps_DATA among the largest tables, to mention a few potentially large sources of data. You could also expect HinvChangeLog to be up there on the list. What I did not expect was for the HinvChangeLog table to be almost 200gb.
Ok, I may have found some of the problem then! But, 970-200 = STILL A BIG DATABASE.
STILL A BIG DATABASE
Other things consume space in a database, not just your tables. Those wonderful indexes that are usually created to improve query efficiency also consume space. Alright, let’s query for the size of those with a query I didn’t write.
-- Index size sorted by total space in KB
SELECT OBJECT_NAME(i.OBJECT_ID) AS TableName,
       i.name AS IndexName,
       i.index_id AS IndexID,
       8 * SUM(a.used_pages) AS 'IndexSize(KB)'
FROM sys.indexes AS i
JOIN sys.partitions AS p ON p.OBJECT_ID = i.OBJECT_ID AND p.index_id = i.index_id
JOIN sys.allocation_units AS a ON a.container_id = p.partition_id
WHERE i.is_primary_key = 0 -- fix for size discrepancy
GROUP BY i.OBJECT_ID, i.index_id, i.name
ORDER BY 4 DESC
And the results are in!
This environment’s indexes for the HinvChangeLog table were massive. This echoed the size of the table itself. Each index was about 100-120gb. We are starting to get somewhere now! 970-200-110-110=APPROACHING REASONABLE.
Just out of curiosity I figured why not check how many rows this giant table has. This could give me some insight into the problem, and a metric to look back at as well.
SELECT SUM(row_count)
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('HinvChangeLog')
  AND index_id IN (0,1)
Phew. Only 2.75 billion rows. That’ll clear up in no time.
Surely Something Should Be Cleaning This Up…
Right? This should be getting maintained by something. These records should be getting deleted. Especially if they are aged.
Alright, it should be getting maintained. The trusty product group has this covered. Suppose I should make sure that thing is running successfully. The schedule is set for every Saturday at 12:00 AM, and it should retain 30 days of data. Is 30 days of data really a 200gb table and 200gb+ in indexes?!?! Let’s check the status of these maintenance tasks. The status of the last run can be found in the SQLTaskStatus table, as well as in the status messages. Let’s query the SQLTaskStatus table and make sure everything is working as intended.
SELECT * FROM SQLTaskStatus
Oh… alright. Well, I see that the ‘Delete Aged Inventory History’ task has a ‘CompletionStatus’ of 1, and all the other tasks have a 0. Based on the number, I’m going to guess that 1 means it failed. Maybe it just failed once? We can clean up a bit of aged inventory and call it a day! Let’s see just how old this data is in the HinvChangeLog.
SELECT TOP 1 TimeKey FROM HinvChangeLog ORDER BY TimeKey ASC
Under good conditions I would expect this to return a date-time value that is no more than 37 days old, give or take some hours. The maintenance task is set to retain 30 days worth of data and it runs every 7 days. Honestly I don’t know what I was anticipating… but I can tell you what I found!
The data in the HinvChangeLog dated back as far as November 2018. Keeping in mind that I was doing this cleanup in late July of 2019, I determined that I had quite a bit more than 37 days worth of data.
Cleanup! Cleanup! But how?
My first thought at this point was to check the logs. Specifically, I checked smsdbmon.log on the site server. This log did show the failure happening, but that was about as much information as it showed. It appeared to be running the maintenance task for an hour and then forcing it to stop. Now, this task cleans up various other tables as well. All the other bits of cleanup were succeeding, but the HinvChangeLog cleanup specifically was failing.
In order to get a lot more information on what is really going on with this cleanup task I cranked up the logging a bit. For this I jumped over to a lab environment.
A few registry settings under HKLM\SOFTWARE\Microsoft\SMS\Tracing control this, and they are universal to all SCCM components. The two Archive related settings are awesome to have enabled and they have minimal overhead. You can simply specify a path in ArchivePath, and flip ArchiveEnabled over to 1. Now you will start retaining 10 timestamped log files, which will greatly assist any troubleshooting efforts.
The other setting, SqlEnabled, is not something you will leave on. This generates a ton of logging. It is pretty common to hear about referencing the smsprov.log to get SQL query information for actions performed in the console. This registry setting turns that all the way up to 11. Every component will start logging the SQL queries for actions behind the scenes. As you can imagine that gets a bit noisy. Remember to turn it off!
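As a rough sketch, these values can be flipped with reg.exe. The Tracing key path here assumes a default install, and the archive path is just an example; adjust both for your environment:

```shell
:: Keep timestamped copies of each log (low overhead, safe to leave on)
reg add "HKLM\SOFTWARE\Microsoft\SMS\Tracing" /v ArchiveEnabled /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\SMS\Tracing" /v ArchivePath /t REG_SZ /d "D:\LogArchive" /f

:: Log the SQL behind component actions -- very noisy, turn this back off when done!
reg add "HKLM\SOFTWARE\Microsoft\SMS\Tracing" /v SqlEnabled /t REG_DWORD /d 1 /f
```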
Good to note: these settings require a restart of the SMS_EXECUTIVE service before some logs start reflecting the changes.
To dig in even more, two additional registry values add further logging beyond the SQL commands. These settings are per component. I’m in the SMS_DATABASE_NOTIFICATION_MONITOR key, which is the component responsible for the maintenance task I am troubleshooting. I’m flipping DebugLogging to 1, and LoggingLevel to 0. With this done you can either restart SMS_EXECUTIVE, or restart the specific component whose logging settings you’re editing.
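Sketched out with reg.exe, the per-component values live under the component’s subkey of the Tracing key (path assumes a default install; swap in whichever component you’re troubleshooting):

```shell
reg add "HKLM\SOFTWARE\Microsoft\SMS\Tracing\SMS_DATABASE_NOTIFICATION_MONITOR" /v DebugLogging /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\SMS\Tracing\SMS_DATABASE_NOTIFICATION_MONITOR" /v LoggingLevel /t REG_DWORD /d 0 /f
```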
The next time the maintenance task runs you’ll notice the log file is markedly more verbose. I chose to force the cycle to run by setting the RunNow value for the task in SQLTaskStatus to 1. This forces the task to run within the next 30 minutes.
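Something like the following does the trick (the TaskName value here is an assumption; check SELECT * FROM SQLTaskStatus for the exact task name in your site database):

```sql
UPDATE SQLTaskStatus
SET RunNow = 1
WHERE TaskName = 'Delete Aged Inventory History'
```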
Ignore the succeeded part… that isn’t happening in production. The gold here is the name of the stored procedure in use: spRemoveOldHinvChangeLog. This is the information I can use to start cleaning things up. The non-verbose logs in production showed that the task was failing, and now the verbose logs tell me what is being used to perform the cleanup.
So, I have a rough idea of how old the data is in my HinvChangeLog. I’m going to start chipping away at this table a month at a time using the built-in stored procedure designed to do so. It has certainly been doing its job, just timing out a bit too soon.
Timing Out A Bit Too Soon
It turns out this task was timing out a lot too soon. I started running the cleanup as below.
DISCLAIMER: THIS IS SOMETHING THAT SHOULD LIKELY BE DONE UNDER THE GUIDANCE OF MICROSOFT SUPPORT. I TAKE NO RESPONSIBILITY FOR THE USE OF THESE SQL COMMANDS IN YOUR ENVIRONMENT. PERFORM EXTENSIVE TESTING AND UNDERSTAND THE REPERCUSSIONS.
DECLARE @oldestDate datetime = '10/01/2018 12:00:00'
EXEC spRemoveOldHinvChangeLog @oldestDate
This initial run did not take too long; there were not a lot of data remnants this old. But I kept moving the date up, '11/01/2018 12:00:00', '12…', and so on. These subsequent runs took a VERY long time, some as long as 24 hours. Typically I would kick off the cleanup after hours, or on weekends. The stored procedure returns the number of rows deleted upon completion, and each run was removing hundreds of millions of rows. Keep in mind this table started at about 2.75 billion rows.
So, I ran this and ran this until I reached…
DECLARE @oldestDate datetime = DATEADD(day, -30, GETDATE())
EXEC spRemoveOldHinvChangeLog @oldestDate
Which is effectively what the maintenance task is doing with the setting of 30 days. I’ve reached my goal! Only problem with all this is that I have only cleaned up the mess. What caused this mess!?!
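If you’d rather not babysit each pass, the month-at-a-time approach could be sketched as a loop. Same disclaimer as above applies, and each iteration can still run for many hours on a table this size:

```sql
-- Walk the cutoff forward one month per pass until only 30 days remain
DECLARE @oldestDate datetime = '10/01/2018 12:00:00'
DECLARE @retentionFloor datetime = DATEADD(day, -30, GETDATE())

WHILE @oldestDate < @retentionFloor
BEGIN
    EXEC spRemoveOldHinvChangeLog @oldestDate
    SET @oldestDate = DATEADD(month, 1, @oldestDate)
END

-- Final pass at exactly 30 days of retention
EXEC spRemoveOldHinvChangeLog @retentionFloor
```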
What Caused This Mess!?!
The HinvChangeLog table is now down to about 350 million rows. This is great! The cleanup has removed almost 2.5 billion rows. But something has to be generating these records at a very high volume! The table is down to a slightly more manageable size. Let’s summarize the data.
Within HinvChangeLog we have an important column, GroupKey. This key lets us join in v_GroupMap to provide a nice readable display name for the inventory class. While we are at it, InvClassName in that view is also the name of the SQL view for the class. Worth noting: this query may take a few minutes to run if you have a lot of data. You could add a WHERE filter with some date constraints, which I’ll do later. I did write this query, by the way! Short and sweet. It returns the count of records per HINV class in the HinvChangeLog table.
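Roughly, a query like this does the job (the join follows the description above: GroupKey links HinvChangeLog to v_GroupMap, which supplies DisplayName and InvClassName):

```sql
SELECT gm.DisplayName,
       gm.InvClassName,
       COUNT(*) AS RecordCount
FROM HinvChangeLog hcl
JOIN v_GroupMap gm ON hcl.GroupKey = gm.GroupKey
GROUP BY gm.DisplayName, gm.InvClassName
ORDER BY COUNT(*) DESC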
Running this returned a handful of inventory classes that far exceeded the rest. Among them was ‘Shortcut File’, which I typically do not recommend inventorying. This was a quick and easy decision to disable; there was no useful action happening with the shortcut inventory data. We would see this class drop out of the table within 30 days, and stop generating excessive inventory immediately.
What was more interesting, though, were the custom inventory classes that were also topping the list for row count. Remember all that custom inventory I mentioned way back at the beginning of all this? It turns out that many of these custom hardware inventory classes were coming from custom Configuration Items that ran a script to dump data into WMI. Awesome way to generate custom hardware inventory! The problem? There was a ‘ScriptLastRan’ property in these WMI classes, and it was being inventoried as well! The end result was every machine generating new unique records for all of these hardware inventory classes on every HINV collection cycle.
So we determined that we could stop inventorying this ScriptLastRan property on many of the inventory classes. It was generally not useful information, as there is a TimeKey that indicates when hardware inventory was collected, which gives a rough idea of the age of the collected data. With these properties unchecked, the volume of hardware inventory data dropped drastically. I waited about 8 days to let the machines receive the updated machine policy and start reporting up their spiffy, lean hardware inventory. With that time passed, I ran my summary query again, this time filtered to the past 7 days, so I could compare it against my numbers from earlier in the month.
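With the date constraint added, that looks something like this (same v_GroupMap join as the earlier summary; TimeKey is the timestamp column in HinvChangeLog):

```sql
SELECT gm.DisplayName,
       gm.InvClassName,
       COUNT(*) AS RecordCount
FROM HinvChangeLog hcl
JOIN v_GroupMap gm ON hcl.GroupKey = gm.GroupKey
WHERE hcl.TimeKey >= DATEADD(day, -7, GETDATE())
GROUP BY gm.DisplayName, gm.InvClassName
ORDER BY COUNT(*) DESC
```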
The results are in and they look great! Some examples of reduction in inventory volume are below.
- Machine Certificate Inventory
  - Previously – 33,167,320
  - Now – 1,926,918
- Application Error Inventory
  - Previously – 16,114,254
  - Now – 2,130,760
Those are some good looking results! Based on just these two classes alone, the environment is generating about 45 million fewer HinvChangeLog records per week, or 180 million fewer per month.
Outside of row count, the database is now staying around 350gb instead of 970gb!!
I think we found both the source of and the solution for this backlog! This long, drawn-out troubleshooting post might’ve been a bit much, but hey, I hope it helps someone out!