Automated Windows IIS Provisioning with EC2 Image Builder and Systems Manager

Ziff Davis’s engineering team and AWS partnered to replace an ad hoc, error-prone Windows server provisioning process with an automated, repeatable pipeline built on EC2 Image Builder and AWS Systems Manager — delivering consistent IIS hosts, simplified patching, and faster recovery while preserving legacy COM-dependent applications. The collaboration turned undocumented, manually assembled servers into versioned AMIs with scheduled, tag-driven deployments and post-launch automation, materially reducing configuration drift and operational toil while improving disaster-recovery and scale-out options for production workloads.

Background

Managing Windows-based web servers at scale remains one of the tougher operational problems for organizations that run legacy applications on IIS, especially when those applications rely on older platform features such as COM components, custom ODBC drivers, and machine- or user-scoped certificates. Ziff Davis operated several brands providing telephony services and hosted legacy .NET/IIS applications that required stable, identical host configurations across development, QA, and production — a requirement they could not meet with manual server builds and spotty documentation.
The chosen approach was pragmatic: build a common base image that contains shared OS-level roles and drivers, then create platform-specific AMIs from that base. Use an automated image pipeline that rebuilds and validates images on a schedule, and use Systems Manager automation documents plus State Manager associations to consistently launch, configure, and tag instances from the latest validated AMI. This marries the lifecycle management capabilities of EC2 Image Builder with Systems Manager’s orchestration and State Manager scheduling capabilities for recurring deployments and maintenance. The Image Builder service is designed precisely for this use case: automating image creation, testing, distribution, and lifecycle of AMIs.

Why this matters: the operational problem set​

The pain points Ziff Davis faced are typical of many enterprises that operate Windows server fleets:
  • Inconsistent environments across dev/QA/prod that produce misleading test results and brittle deployments.
  • Little or no documentation of server creation steps, making exact reproduction difficult.
  • Manual patching and ad hoc imaging that consume engineering time and introduce human error.
  • Legacy dependencies (COM components, ODBC drivers, certificates) that complicate automation and require special handling.
Those challenges drive the need for idempotent, documented, and automated image and instance pipelines — the exact scenario EC2 Image Builder and Systems Manager are built to address. The Image Builder service centralizes image pipelines and automated testing, while Systems Manager provides runbook-style automations and the ability to schedule and enforce configuration at scale.

Overview of the automated solution​

At a high level, the solution combines two AWS services into a reproducible workflow:
  • EC2 Image Builder — author image recipes and pipelines that install Windows Server roles, IIS features, drivers, and common components; run tests; and publish AMIs into one or more regions. This provides versioned AMIs that serve as the canonical server images for all environments.
  • AWS Systems Manager — use Automation documents (SSM Automation runbooks) to dynamically find the latest AMI, launch instances (via the aws:runInstances action), and run post-launch configuration steps such as domain join, static IP assignments, certificate installation, and IIS configuration. Use State Manager to schedule and run those Automation documents on a cadence (for example, weekly or cron-driven) to refresh instances or create new ones as needed. Systems Manager Automation supports a rich set of actions (including aws:runInstances and aws:executeScript) that allow the entire workflow to be expressed as a reusable runbook.
Key solution features that enabled Ziff Davis to reach their goals:
  • Dynamic AMI discovery (query DescribeImages + tag/name filters, sort by CreationDate) so automation always uses the latest validated AMI rather than a hard-coded AMI ID.
  • Tag-driven lifecycle and progress tracking so instances advertise their configuration state and failure conditions for easy troubleshooting.
  • Post-launch configuration via Systems Manager steps (domain join, IIS configuration, SSL installation), with logs streamed to CloudWatch for triage.
  • Scheduled automation associations using State Manager cron expressions so instance creation or refresh happens predictably. State Manager supports cron expressions and rate expressions suitable for weekly or more complex schedules.

Step-by-step: how the pipeline is built and runs​

1. Define requirements and standardize components​

Before automating, the team inventoried required Windows roles, IIS features, COM/ODBC dependencies, and third-party installers. The inventory drove the Image Builder recipe decisions and helped identify which items were safe to bake into an AMI and which should be applied at first-boot.
  • Bake into the AMI:
      • OS-level hotfixes and approved Windows Server updates
      • Core Windows Server roles and features used by all platforms
      • NVMe/ENA drivers and the SSM Agent to ensure reliable post-boot telemetry and commands
  • Defer to post-launch configuration:
      • SSL certificates (to avoid NT AUTHORITY\SYSTEM scoping issues during image builds)
      • Machine-specific licensing or secrets
      • Environment-specific network settings and static IPs
The choice to defer certificate installation to post-launch reflects a practical reality when imaging Windows: certificates and certain user-context artifacts can cause permission or sysprep-related issues if installed under the image-build account context. That operational caveat is a frequent theme in Windows image automation and should be tested carefully for each environment. (Flag: while this pattern is widely recommended in practice, confirm exact certificate handling needs for your applications and test in a nonproduction pipeline.)

2. Create a common base image pipeline with EC2 Image Builder​

Use EC2 Image Builder to author an image pipeline:
  • Create an image recipe composed of:
      • AWS-managed components for Windows Server features and baseline hardening.
      • Custom components for company-specific configuration: ODBC drivers, registry tweaks, additional installers.
  • Configure build and test stages:
      • Launch a build instance, run components, then run automated validations (SMT/PowerShell tests) to ensure services come up.
  • Configure distribution:
      • Share the created AMI to target accounts, regions, or via RAM as required.
  • Schedule recurring builds or trigger builds when patch updates are required.
Image Builder also tracks lineage and has test integrations so images are only published if they pass the configured validation steps. This avoids accidental deployment of broken AMIs.
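The pipeline settings described above (recipe, tests, distribution, schedule) can be sketched as a request for the Image Builder CreateImagePipeline API. This is a minimal illustration, not the team's actual configuration: the helper name build_pipeline_request, the ARNs, and the schedule value are all hypothetical, and the schedule expression syntax should be checked against the Image Builder documentation before use.

```python
def build_pipeline_request(name, recipe_arn, infra_config_arn, dist_config_arn,
                           schedule_expression, test_timeout_minutes=90):
    """Return keyword arguments shaped for imagebuilder.create_image_pipeline.

    Every value passed in is supplied by the caller; nothing here is taken
    from the original Ziff Davis pipeline.
    """
    return {
        "name": name,
        "imageRecipeArn": recipe_arn,
        "infrastructureConfigurationArn": infra_config_arn,
        "distributionConfigurationArn": dist_config_arn,
        # Run the configured tests after each build.
        "imageTestsConfiguration": {
            "imageTestsEnabled": True,
            "timeoutMinutes": test_timeout_minutes,
        },
        # Rebuild on schedule, but only when a dependency has been updated.
        "schedule": {
            "scheduleExpression": schedule_expression,
            "pipelineExecutionStartCondition":
                "EXPRESSION_MATCH_AND_DEPENDENCY_UPDATES_AVAILABLE",
        },
        "status": "ENABLED",
    }


def create_pipeline(request):
    """Submit the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without it
    return boto3.client("imagebuilder").create_image_pipeline(**request)
```

Keeping the request builder separate from the API call makes the pipeline definition easy to review and unit test in source control before anything is created in an account.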

3. Dynamically select and launch the latest AMI with Systems Manager Automation​

Rather than hard-coding AMI IDs, the automation runbook finds the latest AMI at runtime:
  • Use the DescribeImages API (via aws:executeAwsApi or an aws:executeScript step) with filters for owners, name patterns, or tags to collect candidate AMIs, then sort by CreationDate and pick the most recent available AMI. The EC2 describe-images API supports tag-based filters and returns CreationDate in the result payload.
  • Use the aws:runInstances Automation action to launch the instance from the selected AMI. The runbook can immediately tag the instance so downstream steps and monitoring can track lifecycle progress (for example, tags like Provisioning=InProgress, Provisioning=Failed, or Role=IIS-Web).
  • Example Automation actions used: aws:runInstances (launch), aws:runCommand (post-launch commands), aws:executeScript (Python/PowerShell logic), aws:createTags. Systems Manager’s Automation action set explicitly includes these actions and supports passing outputs between steps.
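As a rough sketch of the selection logic, the following Python could serve as the body of an aws:executeScript step. The tag key Role, the input name events["Role"], and the function names are assumptions made for illustration; the original runbook is not shown in the source. The single describe_images call also ignores pagination, which a production script would need to handle.

```python
from datetime import datetime


def parse_creation_date(value):
    # EC2 reports CreationDate as an ISO 8601 string, e.g. "2024-05-01T12:30:00.000Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def select_latest_ami(images):
    """Pick the most recently created image that is in the 'available' state."""
    candidates = [img for img in images if img.get("State") == "available"]
    if not candidates:
        raise ValueError("no available AMI matched the filters")
    return max(candidates, key=lambda img: parse_creation_date(img["CreationDate"]))


def script_handler(events, context):
    """Handler shape used by aws:executeScript; needs boto3 at run time."""
    import boto3  # available in the Automation script runtime
    ec2 = boto3.client("ec2")
    response = ec2.describe_images(
        Owners=["self"],
        # Hypothetical tag convention: AMIs are tagged with the role they serve.
        Filters=[{"Name": "tag:Role", "Values": [events["Role"]]}],
    )
    return {"ImageId": select_latest_ami(response["Images"])["ImageId"]}
```

The returned ImageId can then be passed as an output to the aws:runInstances step, so no AMI ID is ever hard-coded in the runbook.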

4. Post-launch configuration using Systems Manager​

After launch, the runbook performs:
  • Domain join via the SSM agent (or via AWS Directory Service integration).
  • Network configuration: apply environment-specific static IPs where needed.
  • Certificate installation and binding to IIS site SSL bindings (deferred from image build to avoid NT AUTHORITY\SYSTEM permission pitfalls).
  • IIS application configuration: create sites, virtual directories, app pools, and import of any environment-specific configuration files.
  • Staggered processing and automated retries: schedule tasks with waits (aws:sleep) and tag on failure to prevent automated re-processing without human review.
All steps log to CloudWatch (Automation supports streaming outputs), and the automation can set tags such as ProvisioningStatus=Ready or ProvisioningStatus=Failed for operational visibility.
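The tag-on-failure behavior can be illustrated with a small, self-contained sketch. It is not the runbook itself: run_provisioning, the FailedStep and FailureReason tag keys, and the set_tag callback are hypothetical names, and in a real runbook the callback would wrap an aws:createTags step or an EC2 CreateTags call.

```python
def run_provisioning(steps, set_tag):
    """Run (name, callable) steps in order, stopping and tagging on first failure.

    set_tag(key, value) is supplied by the caller, so the same logic can be
    exercised against a plain dict in tests or against EC2 tags in production.
    """
    set_tag("ProvisioningStatus", "InProgress")
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Record where and why provisioning stopped so a human can review
            # the instance before any automated re-processing happens.
            set_tag("ProvisioningStatus", "Failed")
            set_tag("FailedStep", name)
            # EC2 tag values are limited to 256 characters, so truncate.
            set_tag("FailureReason", str(exc)[:255])
            return False
    set_tag("ProvisioningStatus", "Ready")
    return True
```

Stopping at the first failed step, rather than continuing, keeps the instance in a known partial state and makes the FailedStep tag an accurate pointer into the CloudWatch logs.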

5. Schedule and enforce with State Manager​

Make instance creation and configuration repeatable by scheduling the Automation document through State Manager:
  • Create an association that references the Automation document.
  • Use State Manager cron expressions (for example, cron(0 0 ? * SUN *) for weekly runs at midnight UTC on Sundays) to make the pipeline predictable.
  • State Manager supports advanced cron expressions (nth-day-of-month, last-day-of-month, offsets) for flexible maintenance windows and cadence.
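A minimal sketch of the association request follows, assuming a six-field cron schedule. The helper names, the document name, and the parameter keys are hypothetical. The shape check only counts fields and does not validate their values, and targeting options such as Targets and AutomationTargetParameterName are left out for brevity.

```python
def looks_like_six_field_cron(expression):
    """Shallow check: cron(<six whitespace-separated fields>)."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        return False
    return len(expression[5:-1].split()) == 6


def build_association_request(document_name, association_name, schedule, parameters):
    """Return keyword arguments shaped for ssm.create_association."""
    if not looks_like_six_field_cron(schedule):
        raise ValueError(f"unexpected schedule expression: {schedule!r}")
    return {
        "Name": document_name,              # the Automation document to run
        "AssociationName": association_name,
        "ScheduleExpression": schedule,
        # The API expects each parameter value as a list of strings.
        "Parameters": {key: [str(value)] for key, value in parameters.items()},
    }


def create_association(request):
    """Submit the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without it
    return boto3.client("ssm").create_association(**request)
```

A cheap field-count check like this catches a common failure mode, where asterisks in a cron expression are lost when the expression is copied through a formatting layer.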

Operational benefits observed​

Ziff Davis reported tangible operational outcomes after deploying this integrated approach:
  • Consistent environments — dev/QA/prod servers were created from the same latest validated AMIs, reducing environment-specific bugs and making QA results reliable.
  • Faster patching — scheduled image rebuilds with pre-applied patches reduced per-server patch cycles and eliminated a large share of manual patching work.
  • Improved disaster recovery — standardized AMIs and automation made it possible to recreate servers quickly in alternative AZs or regions.
  • Better observability and faster troubleshooting — CloudWatch logs and tagging made it clear where failures happened in the automation pipeline.
  • Cost and time savings — automation reduced manual effort, shortened provisioning times, and minimized human-introduced configuration drift.
Those outcomes align with the core promises of an image-based, automation-first server provisioning model. The migration and validation advice from related operational guides stresses the same checks (boot mode, driver compatibility, NVMe/ENA presence, and preflight testing) that Ziff Davis baked into their approach.

Strengths of the approach​

  • Repeatability and auditable image lineage — EC2 Image Builder keeps recipes, components, and build history; teams can roll back to a prior AMI if a new pipeline introduces regressions. This enforces an auditable path from recipe -> image -> instance.
  • Policy enforcement and sharing — Image Builder integrates with Organizations and RAM, enabling accounts to restrict launches to approved AMIs only.
  • Native AWS automation primitives — Systems Manager Automation runbooks are powerful, supporting aws:runInstances, aws:executeScript, aws:runCommand, and stateful orchestration semantics, plus robust scheduling via State Manager.
  • Operational hygiene baked in — tagging conventions, CloudWatch-based logs per automation step, and scheduled rebuilds help make the environment maintainable and less dependent on tribal knowledge.

Real risks, limitations, and things to watch​

While this design solves many operational problems, engineers must pay attention to real risks:
  • Windows-specific quirks and sysprep — Windows image builds require careful sysprep handling. Image Builder performs sysprep under certain conditions and some provisioning actions (especially user-context actions like certificate installation or user-profile provisioning) can fail or produce unexpected state if attempted during the image build. The pragmatic mitigation is to perform machine-specific steps during post-launch configuration. This behavior is widely reported and requires testing for each pipeline. (Caution: treat certificate-in-image claims as environment-dependent and validate in a staging pipeline.)
  • Driver and boot-mode mismatch — For older Windows Server versions or specialized images, verify boot-mode (UEFI vs Legacy BIOS) and NVMe/ENA drivers before migration or when selecting target instance families. Running bcdedit or querying the firmware type is a recommended preflight check for migrations. If you choose the wrong boot mode or miss required drivers, instances will fail to boot. Operational migration checklists emphasize running these checks early.
  • AMI discovery fragility — While dynamic AMI selection via DescribeImages is powerful, it depends on reliable tagging and naming conventions and pagination-aware CLI/script logic. Be careful with paginated API responses and make sure the selection logic sorts by CreationDate (or a semantic version tag) to avoid selecting a stale AMI. The EC2 DescribeImages API supports tag filters but does not guarantee ordering, so sort client-side by CreationDate when necessary.
  • Secrets and certificates handling — Avoid baking sensitive material into AMIs. Use secure post-launch injection methods (SSM Parameter Store with encryption, AWS Secrets Manager, or an integration with a corporate PKI) to install certificates or machine secrets at first boot. Certificates can be tricky when installed under different accounts (NT AUTHORITY\SYSTEM vs a service account), so test the binding and key permissions in a staging runbook. (Flagged: this is a best-practice recommendation and needs local verification.)
  • IAM and permission scoping — Automation runbooks that call EC2 and other APIs must run with a carefully scoped assume role (the Automation assume role) that grants only the permissions needed. Misconfigured roles lead to failures or over-broad privileges, so follow least-privilege principles and monitor automation role activity.
  • Automation limits and concurrency — Systems Manager Automation has account-level quotas (concurrent workflow limits, step timeouts). Plan concurrency and error-handling logic into runbooks to prevent a flood of simultaneous image launches from overwhelming limits.
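On the AMI discovery point above, pagination is the detail most often missed. The sketch below shows one way to gather every page before sorting. It assumes the NextToken convention that DescribeImages uses; collect_all_pages and make_image_fetcher are hypothetical helper names, and the EC2 client is passed in by the caller.

```python
def collect_all_pages(fetch_page, result_key):
    """Call fetch_page(token) until no NextToken is returned; merge the results."""
    items, token = [], None
    while True:
        page = fetch_page(token)
        items.extend(page.get(result_key, []))
        token = page.get("NextToken")
        if not token:
            return items


def make_image_fetcher(ec2_client, filters, page_size=100):
    """Build a fetch_page function around an EC2 client's describe_images call."""
    def fetch_page(token):
        kwargs = {"Owners": ["self"], "Filters": filters, "MaxResults": page_size}
        if token:
            # Only send NextToken when continuing from a previous page.
            kwargs["NextToken"] = token
        return ec2_client.describe_images(**kwargs)
    return fetch_page
```

With a boto3 EC2 client in hand, collect_all_pages(make_image_fetcher(ec2, filters), "Images") would return the full candidate list, which can then be sorted client-side by CreationDate.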

Troubleshooting and runbook guidance​

When automation fails, the following troubleshooting checklist accelerates recovery:
  • Inspect CloudWatch logs for the specific Automation execution — Image Builder and Automation expose step-level logs.
  • Check tags applied to the instance; a ProvisioningStatus=Failed tag should include an error code or step name.
  • Validate SSM Agent presence and connectivity on the instance; Image Builder builds depend on SSM checks during test phases.
  • If an instance fails to boot, validate boot mode (bcdedit output) and verify required drivers (NVMe/ENA/Xen PV) are present.
For migration scenarios, use the community and AWS tooling recommendations: VM Import Checker or the MGN Toolkit, which identify common migration blockers (disk counts, BitLocker, free root space). Include these checks in any pre-migration automation.
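The boot-mode check can be automated by parsing saved bcdedit output. This sketch relies on the common convention that the Windows Boot Loader path ends in winload.efi on UEFI systems and winload.exe on legacy BIOS systems. The function name detect_boot_mode is hypothetical, and the parser assumes English-language bcdedit output.

```python
def detect_boot_mode(bcdedit_output):
    """Infer the boot mode from the loader path in bcdedit text output."""
    for line in bcdedit_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "path":
            path = parts[1].strip().lower()
            if path.endswith("winload.efi"):
                return "UEFI"
            if path.endswith("winload.exe"):
                return "Legacy BIOS"
    # No recognizable loader path: report it rather than guess.
    return "Unknown"
```

Returning "Unknown" instead of a default lets a preflight runbook fail loudly when the output format is not what the parser expects.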

Practical recommendations and best practices (operational checklist)​

  • Version your Image Builder recipes and components in source control and tag builds with a semantic version and build metadata.
  • Always include automated tests in your image pipeline and fail distribution for non-passing images.
  • Use tags to mark AMIs with their pipeline, version, and build-date so Systems Manager can reliably find the right AMI.
  • Use IAM assume roles for Automation that are narrowly scoped; audit their usage regularly.
  • Defer machine- or environment-specific secrets and certificates to first-boot injection via SSM Parameter Store or Secrets Manager, and establish a secure, auditable mechanism for the certificate installation step.
  • Validate Windows-specific steps (sysprep, driver installs, registry tweaks) in an isolated staging pipeline; test boots across targeted instance families and regions (Nitro vs Xen differences can be critical).
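Where AMIs carry version tags as recommended above, selection can compare versions numerically instead of relying on creation time. The sketch below assumes hypothetical tag keys Pipeline and Version and a plain dotted-integer version format. Pre-release or build-metadata suffixes are not handled, and images with unparseable versions are skipped.

```python
def tags_to_dict(image):
    """Flatten the EC2 tag list ([{'Key': ..., 'Value': ...}]) into a dict."""
    return {tag["Key"]: tag["Value"] for tag in image.get("Tags", [])}


def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


def select_by_version(images, pipeline, version_key="Version"):
    """Return the image from the given pipeline with the highest version tag."""
    best = None
    for image in images:
        tags = tags_to_dict(image)
        if tags.get("Pipeline") != pipeline or version_key not in tags:
            continue
        try:
            version = parse_version(tags[version_key])
        except ValueError:
            continue  # skip images whose version tag is not dotted integers
        if best is None or version > best[0]:
            best = (version, image)
    return best[1] if best else None
```

Comparing tuples of integers avoids the classic string-sorting mistake in which "1.9.3" ranks above "1.10.0".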

Critical analysis: what worked well, and what remained risky​

What worked well:
  • The combination of EC2 Image Builder and Systems Manager created a reproducible, auditable pipeline that removed a large volume of manual work and tangles of undocumented steps.
  • Scheduling and stateful automation reduced configuration drift and made patch management more straightforward.
  • Tagging and CloudWatch logs substantially improved operational visibility and troubleshooting speed.
What remained risky:
  • Windows image creation carries inherent complexity (sysprep behavior, certificate scopes, boot-mode/driver requirements) that can produce subtle failures if pipelines are not exhaustively tested.
  • Dynamic AMI discovery is powerful but brittle without disciplined tagging, naming, and pagination-aware selection logic.
  • Organizational buy-in is required: automation changes process and responsibility — early engagement with application owners is essential to avoid surprises (for example, when an app expects a machine identity or hardware-bound license).

Conclusion​

Ziff Davis’s work with AWS demonstrates a pragmatic, repeatable path for organizations that must both support legacy Windows/IIS workloads and adopt modern cloud operational practices. By using EC2 Image Builder for reproducible image builds and AWS Systems Manager (Automation + State Manager) for dynamic AMI selection, launch orchestration, and post-launch configuration, the team removed manual steps, eliminated much of the configuration drift, and established a documented, audited pipeline for consistent server creation and recovery.
This architecture is not a silver bullet — Windows-specific behaviors (sysprep, certificate handling, driver/boot-mode compatibility) and the operational discipline required to manage AMI naming and automation roles must be addressed deliberately. However, when those caveats are handled through staged testing, conservative separation of image vs. runtime responsibilities, and careful IAM scoping, the result is a robust, scalable server provisioning model that reduces downtime, simplifies DR, and provides a clear migration path toward modernized cloud architectures. For teams wrestling with legacy Windows fleets, the Image Builder + Systems Manager pattern is a pragmatic way to get predictable, auditable servers while buying time to modernize applications incrementally.

(Operational references informing this piece include AWS documentation for EC2 Image Builder and Systems Manager as well as community and migration checklists that emphasize driver, boot-mode, and preflight validation for Windows workloads. Examples of migration and validation guidance that align with the approach outlined are reflected in migration runbooks and community guidance on preflight checks and imaging caveats.)

Source: Amazon Web Services Automating server creation with EC2 Image Builder and AWS Systems Manager: A collaboration between AWS and Ziff Davis | Amazon Web Services
 
