Troubleshooting and Upgrading AD FS Farms

WaaS – Overview of WaaS the Wolverine Way

I remember the first attempt I made at upgrading a group of devices at work.  It was the Autumn of 2017 and I was trying to upgrade a group of roughly 20 Build 1511 devices since support was running out on that build.  I had been to the Midwest Management Summit (MMS) earlier that Spring and believed I had a pretty good handle on how it should be done.  I had attended a number of sessions and listened to the merits of using an In-Place Upgrade (IPU) task sequence as opposed to Servicing Plans.  I created my IPU sequence and submitted the deployment to our Change Management process.  I was so confident that I didn’t balk at the decision to schedule the deployment during the week when I was going to be in Florida attending Microsoft Ignite.  The evening of the deployment arrived…

… and it was a disaster.  While everyone was attending the evening’s party I was back at my hotel trying frantically to figure out what had gone wrong.

The failure of the deployment that night rekindled and fed the fires of those who wanted to use the Long-Term Servicing Branch (LTSB), now referred to as the Long-Term Servicing Channel SKU.  I spent the next several weeks fighting that battle as everything Windows 10 related was placed on hold.

So, what happened?

Arrogance.  Short-sightedness.  Failure to have a complete plan.  The problem was that the folks running these older builds were IT staff, our earliest adopters of Windows 10.  They had installed it on their laptops, and when the time came to upgrade them, the devices were either offline or connected remotely over VPN.

The failure to upgrade 20-some laptops ended up derailing our entire Windows 10 project.

[30-Dec-2019 Update]

Niall asked about some additional details on what exactly happened.  It’s an excellent question, so here goes.

VPN Connected Laptops

We had a few issues, the biggest of which involved users who connected over VPN during the deployment, so the IPU was attempted over that connection.  Of the 20-some laptops, if I remember correctly, more than half of them had this problem.  Partway through the task sequence, when the upgrade rebooted, the VPN connection was lost.  We were not pre-caching content but were instead downloading content on demand, so all of the content required to perform the tasks post-reboot was unavailable.  On top of that, we had not disabled the status message retry, so every remaining step retried against a management point it could no longer reach.  The end result was that the upgrade of the OS completed, but the remainder of the task sequence failed, and it took a LONG time for it to fail.  This left the users with laptops that, while running the newer build of Windows 10, had a number of issues with software that didn’t work.

While we had other issues that prevented the upgrade from running, most notably users just not powering on their laptops during the deployment, the VPN problem was the biggest catastrophe.  Since the devices were left in a broken state, that was the failure that everyone remembered.  The end users were very angry, and rightly so.  The fallout was so bad that to this day, two years on, we are still not allowed to run the IPU over a VPN connection.

What could we have done differently?

There are two things that, could I do it all over again, I would change.  The first would be to pre-cache all of the content in advance (if I’m going to run it over a VPN connection at all): get the content locally and set the task sequence variable SMSTSDisableStatusRetry to prevent the excruciatingly long retry attempts for every step once the VPN connection is dropped.  The second would be to put a higher priority on communicating the IPU deployment to the end users.  I think that there might have been a single, simple email announcement.  In hindsight a targeted communication campaign should have happened.  We knew who the users were, so it would have been very simple to send them repeated messages or, even better since it was a small group, contact them in person and explain what was going to happen and when.
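For the record, disabling the retry behavior is a one-liner once you are inside the running task sequence; a minimal sketch (the variable name is the real one, everything around it is illustrative):

```powershell
# Connect to the running task sequence environment
# (this COM object only exists while a task sequence is executing)
$tsenv = New-Object -ComObject Microsoft.SMS.TSEnvironment

# Disable status message retries so steps fail fast when the
# management point becomes unreachable (e.g. a dropped VPN connection)
$tsenv.Value('SMSTSDisableStatusRetry') = 'true'
```

The same variable can also be set directly on the task sequence or collection, which avoids needing a script step at all.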

Had we just done those two things I strongly believe that we would be much further along on our WaaS journey.

MMS – Spring, 2018

The following Spring, I again attended MMS.  I had been able to convince leadership to continue using Current Branch and arrived at the Mall of America with hopes to find a better way of doing the upgrades.

I was not disappointed.  It was 16-May-2018 and I walked into a session called “Windows as a Service in the Enterprise Part 1” run by Mike Terrill, Gary Blok and Keith Garner.  It was this session where the proverbial clouds parted.  I found a new passion and I realized what I had to do.

The underlying message of that session, and the WaaS process designed by them, was simple really.  Design a process to catch problematic devices before you hit them with the upgrade.  Find and address problems in advance of the IPU so that you have the best odds of succeeding.  The IPU sequence is simply one of the last steps in a device’s journey through WaaS.

I then spent the next 18 months designing a WaaS process that would meet the needs of the hospital and various medical centers that make up the environment in Michigan Medicine.

What did I come up with?

After several versions, what I designed is a four-stage process.  I’m a racing fan (Formula 1, WEC, Formula E), so I’ll liken it to a race weekend.

Needs and Requirements

As a health system we have a very diverse working environment.  We have the main hospital with 24×7 services, Emergency Department, and operating rooms.  We have remote clinics and specialty centers spread all over the state of Michigan.  We also have flight crews at facilities for our Survival Flight helicopters and jets.  Then of course, we have business staff and the always difficult IT staff.  Oh, and we also have the Medical School to contend with as well.

To accommodate all of this we have a number of use cases that we work around.  They will not, of course, match up with every other industry, but hopefully some of the concepts will cross over and be useful.  For example, it was decided to provide a mechanism for users to defer the execution of the upgrade at the mandatory run time.  This gets back to the special circumstances of a hospital environment.  If a doctor is with a patient when the IPU is set to run, they are allowed to defer it so that it does not negatively affect patient care.  Another example would be categories of machines on which we do not want to display any notification to the end user.  These could be public-facing devices, or devices that are power managed.

I’ll go into all of the details in later posts, for now let’s start with a general overview.

The Four Phases

Like I said, I’m a racing fan, so I’ll be using some race weekend analogies.  My design is also very collection heavy, with a good number of collections in place solely for reporting purposes.  For example, each application that has failed to pass compatibility testing will have its own collection.  This is used in an SSRS report that the desktop teams can use to help explain why any given device is not upgrading.  The report will show that it is a member of a collection such as “Application Exclusion – Bad App v 1.0”.
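The heart of a report like that is just a collection membership lookup.  A rough sketch of the underlying query (the site database views and their columns are the standard ConfigMgr ones; the server, database, device name, and collection naming filter are my assumptions):

```powershell
# Illustrative: list every WaaS-related collection a given device belongs to
$query = @"
SELECT col.Name
FROM   v_FullCollectionMembership fcm
JOIN   v_Collection col ON col.CollectionID = fcm.CollectionID
WHERE  fcm.Name = 'PC12345'
  AND  (col.Name LIKE 'WaaS%' OR col.Name LIKE 'Application Exclusion%')
"@

# Hypothetical site server and database names
Invoke-Sqlcmd -ServerInstance 'CMSQL01' -Database 'CM_ABC' -Query $query
```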

Phase 0 – Reference

I liken this to the scrutineering done at the very start of a race weekend.  There is no racing going on, but the cars are being inspected to ensure that they meet the rules and regulations.

In my design this “Phase 0” is intended to filter out the devices that we have no intention of ever upgrading.  It is during this phase that we catch devices running on unsupported hardware, or running Windows 7, or those that are already running on the current production build (or newer).

Phase 1 – Pre-Assessment

Here we get into qualifying for the race.  We start this phase with a pool of devices that we intend to upgrade.  These are the devices that passed Phase 0.  To qualify for the race, they will have to pass a series of tests.

During this phase we catch devices that:

  • Have never submitted hardware inventory
  • Have not submitted hardware inventory within the last 14 calendar days
  • Are inactive in Active Directory
  • Have insufficient free disk space
  • Have insufficient memory

We query the data already collected by Configuration Manager as part of hardware inventory.  This portion is all behind the scenes.  Devices that pass these tests move on to the next phase…
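As a sketch of how one of these behind-the-scenes checks can be expressed, here is the disk space test written as a collection query rule (the collection name, rule name, and 20 GB threshold are my assumptions; the inventory view and cmdlet are the standard ConfigMgr ones):

```powershell
# Illustrative: catch devices with under ~20 GB free on C:
# (FreeSpace in the hardware inventory class is reported in MB)
$wql = @"
select SMS_R_System.ResourceId, SMS_R_System.Name
from SMS_R_System
inner join SMS_G_System_LOGICAL_DISK
    on SMS_G_System_LOGICAL_DISK.ResourceID = SMS_R_System.ResourceId
where SMS_G_System_LOGICAL_DISK.DeviceID = 'C:'
  and SMS_G_System_LOGICAL_DISK.FreeSpace < 20480
"@

Add-CMDeviceCollectionQueryMembershipRule -CollectionName 'WaaS_1809_Phase1-LowDisk' `
    -RuleName 'Insufficient free disk space' -QueryExpression $wql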

Phase 2 – Compatibility Scan

On a race weekend this would be the warmup right before the race itself.  Here we are going to reach out and touch the devices for the first time.  We’ll run a Compatibility Scan sequence and leverage the built-in compatibility scan (/Compat ScanOnly) feature of Windows 10 Setup in an attempt to identify any issues that Microsoft has cataloged.  If everything passes here, then we are off to the races.
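A rough sketch of the core of such a sequence: launch Windows Setup in scan-only mode and inspect the exit code.  The /Compat ScanOnly switch and the 0xC1900210 “no compatibility issues found” result code are documented Windows Setup behavior; the media path is hypothetical:

```powershell
# Hypothetical path to pre-staged Windows 10 upgrade media
$setup = 'C:\Cache\W10-1809\setup.exe'

# Run Setup in compatibility-scan-only mode; no changes are made to the OS
$proc = Start-Process -FilePath $setup `
    -ArgumentList '/Auto Upgrade /Quiet /NoReboot /Compat ScanOnly' `
    -Wait -PassThru

# Format the exit code as hex; C1900210 means the scan found no issues
$result = '{0:X8}' -f $proc.ExitCode
if ($result -eq 'C1900210') {
    Write-Output 'Compatibility scan passed'
} else {
    Write-Output "Compatibility scan flagged issues: 0x$result"
}
```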

Phase 3 – IPUs

This is the big race and the ultimate goal of the process.  If a device has passed all prior phases, then it should offer us the best chance of a successful upgrade.

From here devices are randomly distributed across a number of collections which control when they will receive the upgrade, along with the user notification that begins 14 days prior to the deadline.  The notifications are handled using a combination of the PowerShell App Deploy Toolkit and Martin Bengtsson’s Windows 10 Toast Notification Script, plus some custom PowerShell and fancy collection variables.
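A minimal sketch of the randomized distribution, assuming hypothetical ring and source collection names (the cmdlets are the standard ConfigMgr ones):

```powershell
# Hypothetical ring collections that stagger the IPU deadline
$rings = 'WaaS_1809_Phase3-IPU_Ring1',
         'WaaS_1809_Phase3-IPU_Ring2',
         'WaaS_1809_Phase3-IPU_Ring3'

# Devices that cleared the compatibility scan (collection name is illustrative)
$devices = Get-CMCollectionMember -CollectionName 'WaaS_1809_Phase2-Passed'

foreach ($device in $devices) {
    # Pick a ring at random so the upgrade load (and helpdesk calls) spread out
    $target = $rings | Get-Random
    Add-CMDeviceCollectionDirectMembershipRule -CollectionName $target `
        -ResourceId $device.ResourceID
}
```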


We have two types of exclusions.  Either one of them will result in a black flag (disqualification) of the device at any point in the process.

  1. Application exclusions
  2. “Ad hoc” exclusions

If a device finds itself in either one of the above exclusion types it will be withdrawn from the WaaS process, regardless of how far along it may have been.  It may be scheduled to run the upgrade tonight, but these exclusions will pull it right out of that IPU collection.

Application Exclusions

There will be an individual collection for every application that has failed to pass the application compatibility testing for the intended Windows 10 build.  Again, this is primarily for reporting purposes and allows the desktop teams to get an idea as to why a device isn’t being upgraded.

Ad Hoc Exclusions

These are the special use cases for our environment.  As with the application exclusions, there are individual collections for each exclusion group.  For example, we have a collection of all of the devices in the Emergency Department, one of all the devices in the Operating Rooms and another for all of Labor and Delivery.  Executing the upgrade on these machines is something that will be done manually and requires that the desktop teams work around the schedules of these various clinical groups.

Reporting Sample

Here is an example of the report that I keep referring to.

In this example, the provided computer name is a member of all of the above WaaS related collections.

The important ones to note are:

  • The device was one of those identified as being “critical” for the Pathology department.  These are excluded from the overall process and will not fall into the randomized scheduling.
  • Since it was identified as being “critical” and excluded from the overall process, it falls into a collection that offers the IPU for manual triggering.
  • WaaS_Scheduled_1809_Phase3-IPU_01-27-2020 – 031500

We are also offering custom scheduling of small pockets of devices.  Pathology is a good example as they worked with their desktop team to pick a date and time that the IPU could be performed without impacting the department’s workload.  In this case, on 27-Jan-2020 at 3:15 am.


In upcoming posts I’ll go into detail on how these phases are structured, how the collections interact as well as how the user notifications are handled and finally the structure of the Compatibility Scan and In-Place Upgrade sequences.


Mike Marable

I am the OSD lead and a senior engineer for the Configuration Manager client group at Michigan Medicine (formerly the University of Michigan Health System). We manage 40,000 systems throughout the health system and medical school with Configuration Manager. I have been doing OS deployments for nearly 25 years, 14 of which have been with Michigan Medicine. I’ve led the engineering efforts of moving our OS deployment from a custom solution to Configuration Manager, as well as our Windows XP to Windows 7 and Windows 7 to Windows 10 migrations. My passion over the last 2+ years has been Windows as a Service.
