Back to blog

Technical Writing

Automating Windows Patch Management with Ansible AWX

How a regulated analytics platform moved from weekend manual patching to supervised AWX workflows with stronger reliability and audit evidence.

April 17, 2026
SREAutomationCompliance

Monthly patching had become a weekend tax.

The environment supported a large enterprise analytics platform: Windows servers, domain controllers, SQL servers, RDS hosts, a handful of Linux machines, and the full application stack around them. The old process depended on SolarWinds Patch Manager, WSUS approvals, and a lot of manual remediation.

Once a month, multiple engineers spent four to five hours on a Saturday pushing patches forward, chasing failures, rebooting machines, validating services, and dealing with the occasional system that got stuck badly enough to require rollback and re-apply work.

The toolchain technically worked, but the operating model did not. Reporting was weak, audit evidence lagged, downtime windows were tight, and Microsoft patching was flaky enough that human attention became the safety net.

That is a bad place for a platform team to live.

Rebuilding The Automation Platform

The replacement was Ansible AWX running on AKS, backed by Azure PostgreSQL Flexible Server, storage accounts for retained artifacts, and Key Vault for secrets.

I inherited a non-functional AWX attempt from a previous team and started over. That meant building the infrastructure, Kubernetes deployment, authentication model, teams, roles, inventories, playbooks, and initial operating patterns from the ground up.

The first goal was patching. The larger goal was a general automation platform that could safely manage a sprawling hybrid environment.

Windows connectivity started with WinRM and CredSSP and later moved toward PowerShell Remoting. Inventories were built automatically from Azure Resource Manager and vSphere, with AWS added later. Because the environment spanned more than 60 Azure subscriptions and six datacenters that were not cleanly interconnected, networking became one of the hardest parts of the project.

We solved that with Ansible Receptor and in-network Linux hosts acting as remote executors. Later, hop nodes reduced direct network exposure and gave the system a better scaling model for remote execution.

The Patching Workflow

The patching workflow became fully automated.

AWX ensured patch pre-download, stopped application services, installed approved patches from WSUS, rebooted machines, waited for hosts to return, validated services, and restarted the application stack. Workflow jobs prioritized the most critical systems first, including domain controllers and SQL machines, before moving through application servers, RDS hosts, and ancillary systems.

Guardrails mattered. We split SQL cluster nodes and domain controllers so redundant systems were not patched in unsafe combinations. We patched development before production. We controlled concurrency so failures did not fan out faster than the team could understand them.

Most failures were handled by automatic retries and attempted rollback. That covered roughly 98 percent of the fleet. A second job run usually handled the remaining systems. Only rare cases required direct engineer intervention.

The initial scope was 987 servers. The platform later expanded by roughly 800 more servers from on-premises colocation environments, bringing the managed fleet to about 1,700 systems.

Evidence As A Side Effect

One of the quiet wins was auditability.

Instead of patch evidence being reconstructed after the fact, AWX produced execution history as part of the work. Job events streamed into the central SIEM, initially SecureWorks, and patch evidence was retained in blob storage.

That changed the shape of compliance work. The team could show what ran, where it ran, what succeeded, what failed, and what required follow-up. The patching process became easier to operate and easier to audit at the same time.

We also used Ansible roles to baseline systems against security standards. Those roles managed TLS and cipher settings, PowerShell script security, audit policies, registry changes, local configuration drift, and installation of security tooling. On RDS systems, the automation also helped manage profile sprawl by archiving old profiles after 90 days.

The result was not just faster patching. It was a more consistent security posture across a large server estate.

From Weekend Work To Supervised Operations

Once the workflows were stable, execution moved to an offshore team. Instead of multiple engineers spending Saturday on patching, the offshore team could run supervised AWX workflows while the platform team retained ownership of the automation, guardrails, and escalation path.

Over time, the platform became useful beyond patching. Application teams received limited access to maintenance and task playbooks. The NOC could handle ad hoc software installs. Engineers used AWX for evidence collection, rapid configuration checks, service operations, and other repeatable work that did not need to become another weekend ritual.

Access was shaped through team and role boundaries. The platform team could run any job. Application teams had inventory visibility and job access appropriate to their systems. That kept the automation useful without turning it into an uncontrolled execution surface.

What Changed

The biggest change was trust.

There was understandable skepticism at first. A previous attempt had failed, and engineers knew how painful bad patch automation could be. But the results changed the conversation quickly. The system reduced weekend toil, improved reliability, tightened audit evidence, and created a repeatable way to conform large numbers of systems with a small team.

This is the part of automation work that matters most to me: it is not just about saving time.

Good automation changes the operational posture. It makes work more consistent. It gives teams better evidence. It reduces the number of humans needed for repetitive execution while making the remaining human review more valuable.

Patch automation is reliability work. In a regulated environment, it is also compliance work. When the system is designed well, those are not competing goals. They reinforce each other.