Replacing static SSH keys: a 90-day plan

Most teams know the static SSH key situation is bad before they try to fix it. The keys are everywhere: in authorized_keys on every host, on engineer laptops, in CI runners, in old ~/.ssh/ directories nobody has touched in two years. The leaver problem is real, the audit story is fragile, and you cannot answer the question "who can SSH to production right now" without running a script across the fleet and hoping it returns truthful data.

The fix is well known: short-lived SSH certificates from a CA. Engineers authenticate to the CA via SSO, the CA hands back a four-hour cert, the hosts trust the CA, and the cert expires on its own. The problem is the migration. You cannot flip the switch on a 200-host fleet on a Tuesday and expect to keep your job. This article is the ninety-day plan that gets you from one to the other without an outage, without an engineer revolt, and with a rollback at every step.

The plan is calibrated for a fleet of 50 to 500 hosts and 10 to 75 engineers. Smaller fleets compress the timeline; larger fleets stretch it, but the phase order does not change. This is the version that has worked for Linux Identity customers, and the version that has worked for the teams who built it before there was a product.

Phase 0 (week 0): pre-flight

Before week one, you need three artefacts. None of them require code changes. Skipping any of them is the most common reason these rollouts stall in week six.

An exec sponsor and a written one-pager. The one-pager says what you are doing (replacing static SSH keys with short-lived certs), why (audit, leaver risk, SOC 2 CC6.1), the rollback story (CA runs in non-enforcing mode for four weeks), and the success metric (zero static keys in authorized_keys on production hosts, ninety days after start). The sponsor is whoever owns the "we got a finding" conversation with the auditor. Usually CTO or VP Eng.
A maintenance window policy. The rollout itself does not require any host downtime. The config changes are additive until the cutover, and even the cutover is a service reload, not a restart. But you need a written escalation path for the day something goes wrong: who do you wake up, what is the rollback command, what is the comms channel.
Buy-in from the loudest engineer. Every team has the person who is most opinionated about SSH. They have a custom ~/.ssh/config with eighteen Host blocks and they are going to be the first to complain when something breaks. Pull them in early, walk them through the design, and have them help you write the engineer-facing docs. If they sign off, the rest of the team will follow.

Phase 1 (weeks 1–2): inventory

You cannot remove static keys you do not know about. The first two weeks are entirely about discovery. The output is a single spreadsheet (or one database table) with one row per (host, authorized public key) pair.

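#!/bin/bash
# inventory-keys.sh: run as root on every host, output normalized CSV
# Columns: hostname, user, key_type, fingerprint, comment, source_file
# Covers /root and /home/*; extend the loop if accounts live elsewhere.

host=$(hostname)

# Quote one CSV field, doubling embedded double quotes (comments may contain commas).
q='"'
csv() { printf '"%s"' "${1//$q/$q$q}"; }

for user_home in /root /home/*; do
  user=$(basename "$user_home")
  ak="$user_home/.ssh/authorized_keys"
  ak2="$user_home/.ssh/authorized_keys2"
  for f in "$ak" "$ak2"; do
    [ -f "$f" ] || continue
    # "|| [ -n ... ]" keeps a final line that has no trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
      # skip comments and blank (or whitespace-only) lines
      [[ "$line" =~ ^[[:space:]]*(#|$) ]] && continue
      fp=$(printf '%s\n' "$line" | ssh-keygen -lf - 2>/dev/null | awk '{print $2}')
      # type and comment assume the entry has no leading options (command=, from=, ...)
      type=$(printf '%s\n' "$line" | awk '{print $1}')
      comment=$(printf '%s\n' "$line" | awk '{$1=""; $2=""; print substr($0,3)}')
      printf '%s,%s,%s,%s,%s,%s\n' "$(csv "$host")" "$(csv "$user")" \
        "$(csv "$type")" "$(csv "$fp")" "$(csv "$comment")" "$(csv "$f")"
    done < "$f"
  done
done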

Run that across your fleet with whatever you have (Ansible ad-hoc, SSM, a one-off Fabric script). Aggregate the output into one CSV. The first surprise is the row count: a 200-host fleet often has 3,000 to 8,000 authorized-key entries, most of them duplicates. The second surprise is the keys you cannot identify. The comment field says ec2-user@bastion from 2021 and the engineer who created it left the company eighteen months ago. Those are the keys you are doing this for.

Cluster the keys by fingerprint and triage each unique key into one of four buckets:

Current engineer: still employed, still uses SSH. Will get a cert in phase 3.
Service account / CI: a non-human caller. Needs a short-lived cert pulled by the workload, not by a human. Track separately.
Former engineer / unknown: remove at end of phase 1. This is the win.
Break-glass: a small set of keys for the outage scenario where the CA is down. Keep these, but move them to a separate authorized_keys file under tight access control.
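The clustering step can be sketched in a few lines of awk. This is a minimal sketch, assuming the aggregated CSV keeps the six columns from the inventory script's header, with hostname in column 1 and fingerprint in column 4. The name cluster_keys is hypothetical. It reads only the first four columns, so commas in the comment field do not matter.

```shell
# cluster_keys (hypothetical helper): read the aggregated inventory CSV on
# stdin and print one line per unique fingerprint:
#   <fingerprint> <distinct_hosts> <total_entries>
# sorted so the most widely deployed keys come first.
cluster_keys() {
  awk -F, '
    {
      host = $1; fp = $4
      gsub(/"/, "", host); gsub(/"/, "", fp)   # tolerate quoted fields
      if (fp == "") next                        # unparseable entry, triage by hand
      entries[fp]++
      if (!((fp, host) in seen)) { seen[fp, host] = 1; hosts[fp]++ }
    }
    END { for (fp in entries) print fp, hosts[fp], entries[fp] }
  ' | sort -k2,2nr -k1,1
}

# Usage: cluster_keys < fleet-inventory.csv > unique-keys.txt
```

A key that appears on many hosts and maps to nobody you can name is the first candidate for the "former engineer / unknown" bucket.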

The first cleanup pass, removing the "former engineer / unknown" bucket, is where you produce your first audit artefact. The diff alone is usually enough to satisfy a finding about access removal (CC6.2 or CC6.3 in SOC 2 terms). Do it before you write a single line of CA code.

Rollback for phase 1

The inventory script is read-only; nothing to roll back. The cleanup removes lines from authorized_keys; commit each host's old authorized_keys file to a private git repo before editing, so reverting a host is cp old-authorized_keys /home/user/.ssh/authorized_keys. Test the revert on one host on day one. Do not trust a rollback you have not exercised.
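The backup-then-edit step can be wrapped so the backup is never skipped. This is a minimal sketch, not a definitive tool: remove_key and the backup directory layout are assumptions, and committing the backup directory to the private git repo is left to you. It matches on the base64 key blob (the second field of a plain entry), which you can read straight from the inventory.

```shell
# remove_key (hypothetical helper): back up an authorized_keys file, then drop
# every entry that has a field exactly equal to the given base64 key blob.
# Usage: remove_key <absolute_path_to_authorized_keys> <key_blob> <backup_dir>
remove_key() {
  local file="$1" blob="$2" backup_dir="$3"
  local backup="$backup_dir$file"   # mirror the absolute path under backup_dir
  local tmp

  mkdir -p "$(dirname "$backup")" || return 1
  # Keep the first backup: it is the pre-cleanup state you revert to.
  [ -e "$backup" ] || cp -p "$file" "$backup" || return 1

  tmp=$(mktemp) || return 1
  awk -v blob="$blob" '
    { for (i = 1; i <= NF; i++) if ($i == blob) next; print }
  ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }

  # Overwrite in place so the file keeps its owner and mode.
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Revert is the plain copy, for example:
#   cp -p "$backup_dir/home/alice/.ssh/authorized_keys" /home/alice/.ssh/authorized_keys
```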

Phase 2 (weeks 3–4): stand up the CA

By now you have a clean inventory and a written rollback. Time to stand up the CA. The technical details are in the OpenSSH CA pillar; the rollout question is where the CA lives.

Two patterns work. Pick one based on your team size and security posture.

Self-hosted CA on a hardened VM. A single VM in your VPC runs the signing service. The CA private key lives in your cloud provider's KMS (AWS KMS, GCP Cloud KMS, Azure Key Vault) with a strict access policy. Pros: full control, no third-party dependency. Cons: you own the uptime, the KMS policy, the audit log shipping, and the SSO integration.
Hosted CA (Linux Identity, Smallstep, Teleport, etc.). The vendor runs the signing infrastructure; you bring SSO and host enrollment. Pros: a quarter of the operations work. Cons: a third party in the path of every SSH login, and a vendor bill.

For a Series A startup with no dedicated security engineer, hosted wins on time-to-value. For a regulated workload (HIPAA, FedRAMP, certain financial services) self-hosted may be required. Either way, week three is when you stand it up.

The end of week four ships two deliverables. First, the CA can sign a user cert end-to-end: an engineer authenticates with SSO, gets a cert back in ~/.ssh/, and can SSH to a single test host. Second, the documentation a new engineer needs to onboard. That includes the one-line install command, the SSO flow they will see, and the troubleshooting matrix for the three errors they will hit (clock skew, expired cert, principal mismatch).
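Two of the three errors in that troubleshooting matrix (clock skew and expired cert) can be diagnosed from the cert's validity window. The sketch below assumes the "Valid: from ... to ..." line that ssh-keygen -L prints for a certificate; cert_status is a hypothetical helper name and the cert path in the usage line is an assumption. It does not cover principal mismatch, and it treats the less common "before" and "after" validity forms as unrecognised.

```shell
# cert_status (hypothetical helper): read `ssh-keygen -L` output on stdin and
# classify the validity window against "now" (optional first argument).
# Timestamps of the form YYYY-MM-DDTHH:MM:SS compare correctly as strings.
# Usage: ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub | cert_status
cert_status() {
  local now="${1:-$(date +%Y-%m-%dT%H:%M:%S)}"
  awk -v now="$now" '
    $1 == "Valid:" && $2 == "from" {
      seen = 1
      if ((now "") < ($3 ""))      print "not-yet-valid"   # usually laptop clock skew
      else if ((now "") > ($5 "")) print "expired"         # re-authenticate to the CA
      else                         print "valid"
    }
    $1 == "Valid:" && $2 == "forever" { seen = 1; print "valid" }
    END { if (!seen) print "no-validity-line" }
  '
}
```

If the helper says "valid" and the host still rejects the login, the principal list is the next thing to check.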

Rollback for phase 2

The test host trusts the CA via a TrustedUserCAKeys line in sshd_config. Rolling back means removing that one line and running systemctl reload sshd. The existing authorized_keys entries on the host still work the whole time; nothing changes for engineers who have not yet enrolled. Test the revert on the test host on day one of phase 2.
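The revert is small enough to script and rehearse. A minimal sketch, assuming the directive is spelled exactly TrustedUserCAKeys in the file (sshd itself is case-insensitive about keywords, this sed pattern is not); disable_ca_trust is a hypothetical helper name.

```shell
# disable_ca_trust (hypothetical helper): comment out TrustedUserCAKeys in the
# given sshd config file. Commenting instead of deleting keeps re-enabling easy.
# Usage: disable_ca_trust /etc/ssh/sshd_config && sshd -t && systemctl reload sshd
disable_ca_trust() {
  local conf="$1" tmp
  tmp=$(mktemp) || return 1
  sed -E 's/^([[:space:]]*TrustedUserCAKeys[[:space:]])/# \1/' "$conf" > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$conf" && rm -f "$tmp"   # in-place overwrite keeps owner and mode
}
```

Running it twice is harmless: a line that is already commented out no longer matches the pattern.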

Phase 3 (weeks 5–7): pilot expansion

Phase 3 is the friendliest phase. You roll the CA-trust config out to a small pilot subset of the fleet — typically 10 to 20 percent of hosts, chosen to be representative — and you onboard a pilot group of engineers (5 to 10 people, ideally including the loudest engineer from phase 0).

The critical design point: every host in the pilot accepts both short-lived certs and the existing static keys. Nothing is removed yet. The new auth path is purely additive. Engineers in the pilot use the cert path; everyone else continues with static keys; if anything breaks, the cert engineers fall back to their old key for the day.

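# sshd_config additions on pilot hosts: additive only
TrustedUserCAKeys /etc/ssh/ca_users.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
# The RevokedKeys file must exist and be readable before you reload (an empty
# file is acceptable): sshd refuses all public key authentication if it cannot read it.
RevokedKeys /etc/ssh/revoked_keys.krl

# DO NOT touch AuthorizedKeysFile yet. It is shown commented out, i.e. left at
# its default, so static keys still work in parallel.
# AuthorizedKeysFile .ssh/authorized_keys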

Watch four metrics during the pilot:

Time-to-first-SSH for a new engineer. Goal: under five minutes from running the install command.
Daily cert sign volume. Should approximate engineer count × ~2 (morning sign-in plus one re-auth).
Failed SSH attempts on pilot hosts. Distinguish between "cert expired, engineer needs to re-auth" (normal) and "cert valid, host rejected anyway" (config bug, fix immediately).
The complaint volume in the engineering Slack channel. If there are no complaints, the pilot is not running. If there is a steady trickle, you are calibrated.

A week into the pilot, write down every paper-cut: confusing error messages, missing docs, an SSO redirect that loops on Safari, the engineer whose laptop clock skewed by 90 seconds and got a cryptic cert-not-yet-valid error. Fix the top three. Do not move to phase 4 until the support burden is under one ticket per engineer per week.

Rollback for phase 3

Same as phase 2: remove the TrustedUserCAKeys line, reload sshd. Because static keys never stopped working on pilot hosts, an engineer never gets locked out; the worst case is a few minutes of pure-static-key operation while you debug. Time the rollback on day one of phase 3 and confirm it lands across the pilot in under fifteen minutes.

Phase 4 (weeks 8–10): fleet rollout

With the pilot stable, push the CA-trust config to the rest of the fleet. The mechanism depends on your config management:

Ansible / Chef / Puppet: add the snippet, run a canary, watch for one cycle, then full fleet.
Linux Identity agent: the agent applies the CA-trust config and verifies sshd -t succeeds before reloading; failures bail out and report.
Manual SSH (do not do this at fleet scale, but if you must): wrap the change in a script that runs sshd -t, then systemctl reload sshd, then a verification SSH from a known-good source.
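Whichever mechanism you use, the safe ordering is the same: write the change, validate, and reload only on success. A minimal sketch under stated assumptions: apply_ca_trust is a hypothetical helper, SSHD_TEST_CMD and SSHD_RELOAD_CMD are illustrative overrides (not sshd features), and the drop-in path in the usage line assumes your sshd_config includes /etc/ssh/sshd_config.d/*.conf. The verification SSH from a known-good source is left out.

```shell
# apply_ca_trust (hypothetical helper): install a config snippet, validate the
# resulting sshd config, and reload only if validation passes.
# Usage: apply_ca_trust ca-trust.conf /etc/ssh/sshd_config.d/50-ca-trust.conf
apply_ca_trust() {
  local src="$1" dest="$2"
  local test_cmd="${SSHD_TEST_CMD:-sshd -t}"
  local reload_cmd="${SSHD_RELOAD_CMD:-systemctl reload sshd}"
  local had_old=0

  if [ -f "$dest" ]; then
    cp -p "$dest" "$dest.bak" || return 1
    had_old=1
  fi
  cp "$src" "$dest" || return 1

  # Intentionally unquoted so "sshd -t" splits into command and argument.
  if ! $test_cmd; then
    # Config does not parse: restore the previous state and skip the reload.
    if [ "$had_old" -eq 1 ]; then mv "$dest.bak" "$dest"; else rm -f "$dest"; fi
    echo "sshd config test failed; change reverted, sshd not reloaded" >&2
    return 1
  fi
  $reload_cmd
}
```

The point of the design is that a bad snippet never reaches a running sshd: the failure path leaves the host exactly as it was.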

Push the engineer-side enrollment in parallel. By the end of week 9, every engineer should have a cert-issuing flow they can use. Track enrollment on a dashboard: percentage of engineers who used a cert in the last seven days. Talk to the holdouts directly. The one or two engineers who refuse to enrol are almost always the ones who already have a personal workaround that bypasses the new path; chase those down.
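If you do not have a dashboard yet, the sshd logs already carry the enrollment signal. The sketch below assumes OpenSSH's usual "Accepted publickey" log line, where a certificate login reports a key type ending in -CERT (for example ED25519-CERT) after "ssh2:"; verify that against your own logs before relying on it. cert_adoption is a hypothetical helper name, and note that it reports the local account name, which may be shared by several engineers.

```shell
# cert_adoption (hypothetical helper): read sshd log lines on stdin and print
# "<user> cert=<n> static=<n>" for every account with an accepted login.
# Usage (systemd hosts; the unit may be named ssh instead of sshd):
#   journalctl -u sshd --since "7 days ago" | cert_adoption
cert_adoption() {
  awk '
    /Accepted publickey for / {
      user = ""; ktype = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Accepted" && $(i + 1) == "publickey") user = $(i + 3)
        if ($i == "ssh2:") ktype = $(i + 1)
      }
      if (user == "") next
      users[user] = 1
      if (ktype ~ /-CERT$/) cert[user]++; else plain[user]++
    }
    END {
      for (u in users) printf "%s cert=%d static=%d\n", u, cert[u], plain[u]
    }
  ' | sort
}
```

Any account with cert=0 after week 9 is a holdout or a service account that still needs its own cert path.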

Two things to watch closely in this phase:

Edge hosts you forgot. The bastion you set up for a specific customer in 2022, the developer's personal devbox, the network appliance with a Linux underneath. Every one of these has the old auth flow. The inventory from phase 1 should have caught them; if not, week 8 is when they surface.
CI and service accounts. The non-human callers need certs too. The flow is different: a CI job authenticates to the CA via OIDC (GitHub Actions OIDC, GCP workload identity, IAM Roles Anywhere) and gets a cert scoped to the job. Build this path in parallel with the human rollout, not after. Otherwise CI keeps using a long-lived key past cutover, which kills the audit story.

Rollback for phase 4

The whole fleet now trusts the CA, but static keys still work. Rollback is the same operation as phase 3, executed across the fleet via your config management or the agent. Test a full-fleet rollback on a non-production environment in week 8; do not skip this.

Phase 5 (weeks 11–12): cutover

The last two weeks. With every host trusting the CA and every engineer enrolled, you flip the switch: stop accepting static keys.

The mechanism is a single small change on each host. There are three options, in increasing order of aggressiveness:

Empty the authorized_keys file on each host. Soft cutover: sshd still reads the file but finds it empty. Reversible per host. Operationally simplest. Pick this one.
Set AuthorizedKeysFile none: tells sshd to skip the file lookup entirely. Slightly more visible in the config. Reversible.
Remove the ~/.ssh/ directory: aggressive, hard to roll back, do not do it. Keep the directories so an engineer who needs the file for git operations still has it.
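The soft cutover can be sketched as a guarded truncate. This is an illustration under assumptions: soft_cutover is a hypothetical helper, and the backup path is whatever copy you keep for the per-host revert. The guard only checks that a non-empty backup exists, not that it is current.

```shell
# soft_cutover (hypothetical helper): empty one authorized_keys file, but only
# if a backup copy already exists, so the per-host revert stays possible.
# Usage: soft_cutover <authorized_keys_file> <backup_copy>
soft_cutover() {
  local file="$1" backup="$2"
  if [ -s "$file" ] && [ ! -s "$backup" ]; then
    echo "refusing to empty $file: no backup at $backup" >&2
    return 1
  fi
  : > "$file"   # truncate in place; owner, mode, and the ~/.ssh/ directory are untouched
}
```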

Do the cutover in waves over week 11. Start with the canary group (10 percent of fleet), watch for 24 hours, then non-production (50 percent), watch for 24 hours, then the remaining production. End of week 11: zero static keys on any host.

Week 12 is verification and audit-artefact production. Re-run the phase 1 inventory script. The output should be empty; the break-glass keys live in their own file outside the home directories, so they do not appear. Save the empty CSV, signed and dated; that is the audit artefact. The same script becomes a recurring control: run it weekly, page someone if it ever returns rows again.
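The recurring control needs only a pass/fail wrapper around the inventory output. A minimal sketch: assert_no_static_keys is a hypothetical helper, and how a failure becomes a page depends on the alerting you already run.

```shell
# assert_no_static_keys (hypothetical helper): read the inventory CSV on stdin
# and fail if it contains any rows. Wire the non-zero exit into your alerting.
# Usage: ./inventory-keys.sh | assert_no_static_keys
assert_no_static_keys() {
  local rows
  rows=$(awk 'NF { n++ } END { print n + 0 }')   # count non-blank lines
  if [ "$rows" -gt 0 ]; then
    echo "static key control failed: $rows authorized_keys entries found" >&2
    return 1
  fi
  echo "static key control passed: 0 entries"
}
```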

Rollback for phase 5

The hardest phase to roll back, by design. If the CA itself is down, the break-glass keys (preserved in phase 1) let designated SREs in. If the cutover causes a per-host problem, you restore the host's authorized_keys from the git repo created in phase 1. Test that restore on a non-production host on the first day of phase 5. The whole rollback should be under fifteen minutes per host.

Day 91: what you have

By day 91, three things have changed.

First, the audit story. The answer to "who can SSH to production right now" is "everyone whose SSO is active and whose IdP group includes prod-ssh". Removing access is one click in the IdP and propagates to the entire fleet within the next cert renewal cycle (default four hours). The auditor sees a list of authorized principals — twelve engineers, by name — instead of a Google Sheet with three hundred fingerprints of mysterious provenance.

Second, the leaver problem is solved. When an engineer leaves, you disable their SSO account. Within four hours their last cert expires and they cannot issue a new one. No host-by-host cleanup. No worrying about the laptop they kept. No fingerprints to track down in authorized_keys files across a 300-host fleet.

Third, you have a foundation for everything else. JIT sudo (covered in the JIT sudo pillar) lands on top of the same SSO and audit pipeline. Host certificates ride on the same CA. SOC 2 CC6.1 and CC6.6 evidence drops out of the cert issuance log without writing a single script.

What goes wrong, and what to do about it

Three categories of failure show up across these rollouts:

The pilot is too small. You pilot with three engineers on three hosts, everything works, and at week 8 you discover that fifteen of your hosts have something weird (a custom PAM stack, a different SSH port, a chroot jail). Pick the pilot subset deliberately to include the weird hosts, not just the friendly ones.

Phase 4 stalls on CI. Engineers enrol, hosts trust the CA, and then the team that owns CI says "we'll handle our keys next quarter". By the time next quarter arrives the rollout has lost momentum and a long-lived CI key is still in production. Build the CI cert path in week 8, not week 13. Use the same OIDC pattern your IdP already supports; the cost is one CI job change, not a re-architecture.

The CA outage scenario is untested. The CA goes down at 2 AM on a Saturday and the on-call cannot SSH in, because the hosts accept only short-lived certs and the on-call's last cert has expired. The fix is the break-glass keys from phase 1: a small number of static keys, on a small number of hosts (jump boxes), in a separate file (AuthorizedKeysFile /etc/ssh/break_glass_keys), with their use logged and reviewed monthly. Test the break-glass path quarterly; it is the only thing standing between you and a full outage if the CA dies.

Adapting the plan

Ninety days is a default, not a constant. Smaller fleets and tighter teams compress it; 30 days is feasible for a 25-host startup with a single platform engineer. Larger fleets stretch it; 180 days is reasonable for 1,500 hosts, especially if there are multiple business units or compliance environments to coordinate. The phase order does not change. What changes is dwell time inside each phase, because each phase is gated on a real-world signal (clean inventory, working pilot, low support volume, stable fleet-wide CA trust) and not on a calendar date.

The temptation to compress phase 3 (pilot) is the most common rollout-killer. The pilot is uncomfortable: it feels slow, you are running two paths in parallel, and you are tempted to push to cutover so you can "finish". Resist this. The pilot is the only phase that surfaces the genuinely surprising failures — the engineer whose laptop has a 90-second clock skew, the host whose sshd was patched to a custom build by an SRE in 2019, the SSO redirect that loops on a specific browser version. Find those in the pilot, not in production cutover.

Where to go from here

The companion piece on OpenSSH CA in production covers the configuration details you will ship in phases 2 through 4. The Series A overview frames why this matters before you have a dedicated security engineer. The SOC 2 pillar maps the artefacts produced by this rollout to the specific Trust Services Criteria your auditor will reference. And the JIT sudo pillar covers the next layer up, which is what most teams ship in the quarter after this one.

If you are starting this rollout and want a partner that handles the CA, the agent, the KRL distribution, and the audit-log shipping out of the box, Linux Identity is free under ten hosts and five users. It is purpose-built around exactly this migration. The CA design choices documented above are the choices the product makes by default.