SSH key management for Series A startups

There is a specific moment, somewhere between a 5-person engineering team and a 15-person one, when static SSH keys go from minor housekeeping to a real liability. Three triggers usually surface it: a SOC 2 readiness check, the offboarding of someone who had production access, or the arrival of a third-party consultant who needs root on a handful of hosts for two weeks. Each of those events asks the same question. Who currently holds a private key that can log in to production, and how do we revoke it before close of business?

If the honest answer involves grepping authorized_keys files across an Ansible inventory and hoping the inventory is complete, this article is for you.

How key sprawl actually accumulates

Most startup fleets do not begin with a deliberate SSH policy. They begin with one founder, one bastion, and a personal key pair pasted into /root/.ssh/authorized_keys on the first prod box. The second engineer joins and adds their key. The Terraform module that bootstraps EC2 instances grows a user_data stanza that drops both keys onto every new host. A few months later, the on-call rotation needs a shared service account, so someone generates a 4096-bit RSA key called deploy@prod and copies it into Vault.

By month nine, the picture is more interesting. CI uses one set of keys to push artefacts to a release host. A monitoring vendor wants read-only SSH to a forwarder VM, so a second key is created. The new junior SRE generates their own ed25519 key and adds it. The contractor who wrote the database migration scripts had a key for six weeks and may or may not still have it on their laptop. Nobody has revoked it because nobody is sure which hosts it lands on.

Each of those choices was reasonable in isolation. Together they create the inventory problem. The fleet has grown to roughly 40 hosts, three IAM-style roles (root, deploy, app), and somewhere between 12 and 24 distinct key pairs. Of those keys, the team can confidently account for maybe two-thirds. The remaining third is what makes the SOC 2 conversation uncomfortable.

The exact moment static keys become a liability

Three moments matter more than the others.

The SOC 2 readiness scan. CC6.1 (logical access controls) and CC6.6 (boundary protections) require that access to systems is restricted to authorised users and is removed when no longer needed. An auditor who is paying attention will not accept "we removed their GitHub access" as evidence that an ex-employee can no longer SSH into a production database replica. They will ask for the list of authorised principals on that host, the date each was added, and the revocation evidence for everyone who left. Static authorized_keys files struggle on every one of those axes.

The offboarding. When an engineer with prod access leaves, the right answer is "their access ended at 17:00 today." The default answer is "we removed their key from the Ansible repo and re-ran the playbook this morning." The two are not the same. Anyone holding a copy of the private key still has it. Hosts that were down during the playbook run still trust it. The on-call who pulled an emergency cherry-pick on Tuesday never re-ran site.yml on those hosts, so the key on the laptop now sitting in storage still works there.

The consultant. A third-party engineer needs root on six hosts for the next ten business days. With static keys, the only way to give them access is to mint a key pair and append it. The only way to remove their access at the end of the engagement is to remove the key and trust the configuration management run. The path of least resistance is to leave it. That is how 40 percent of fleets end up with at least one trusted key whose owner has not worked there in a year.

The shape of a fix: SSH CA + SSO + per-host approval

The pattern that works at this size is an SSH Certificate Authority, an identity provider, and a short certificate lifetime. RFC 4253 defines the SSH transport layer; OpenSSH 5.4 (2010) added its own certificate format on top. With OpenSSH 8.4 or later on both ends, you can issue a certificate that names one or more principals, is accepted only by hosts whose policy lists one of those principals, and expires in four hours.

The flow has three parts.

An engineer signs in to your IdP (Okta, Google Workspace, Microsoft Entra). The OIDC ID token names the user and their groups.
A CA service verifies the token signature, checks group membership against the access policy, and signs a short-lived certificate. The CA private key never leaves a hardware-backed key store.
The engineer runs ssh user@host. OpenSSH presents the certificate. The host's sshd validates it against the trusted CA public key it received at provisioning. No authorized_keys entry is consulted.
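
The CA's decision in the second step can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: the POLICY mapping, the claim names (email, groups), and the file paths are assumptions made for the example, and verification of the token signature is presumed to have happened already. The ssh-keygen flags it assembles (-s, -I, -n, -V) are the standard OpenSSH certificate-signing options.

```python
from typing import Optional

# Hypothetical access policy: IdP group -> principals that group may receive.
POLICY = {
    "sre": ["sre@example.com"],
    "oncall": ["oncall@example.com"],
}


def principals_for(claims: dict) -> list:
    """Collect the principals granted by the user's IdP groups, without duplicates."""
    granted = []
    for group in claims.get("groups", []):
        for principal in POLICY.get(group, []):
            if principal not in granted:
                granted.append(principal)
    return granted


def signing_command(claims: dict, user_pubkey_path: str,
                    ca_key_path: str = "/etc/ssh-ca/ca_users",  # assumed path
                    ttl: str = "+4h") -> Optional[list]:
    """Return the ssh-keygen argv that would sign the key, or None if policy denies."""
    principals = principals_for(claims)
    if not principals:
        return None  # no matching group: issue nothing
    return [
        "ssh-keygen",
        "-s", ca_key_path,            # CA key used to sign
        "-I", claims["email"],        # key identity, recorded in sshd logs
        "-n", ",".join(principals),   # principals embedded in the certificate
        "-V", ttl,                    # validity: from now until now + ttl
        user_pubkey_path,
    ]
```

A user in no mapped group gets None back and no certificate. With a hardware-backed CA key the signing call would go through the key store rather than a file path, but the policy decision keeps the same shape.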

The mental model: keys become things you generate per session, not artefacts you store. Revocation becomes a property of the calendar (the cert expired), not an operations task (run a playbook). Authorisation lives in the IdP, which is already the source of truth for offboarding.

What you actually configure

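On the host side, two files change. sshd_config points at the CA public key, and a per-account file under /etc/ssh/auth_principals/ lists which certificate principals may log in as that account on this host. If you set RevokedKeys as well, make sure the file exists and is readable (an empty file is fine); sshd refuses public key authentication for all users when it cannot read it.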

# /etc/ssh/sshd_config
TrustedUserCAKeys /etc/ssh/ca_users.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
RevokedKeys /etc/ssh/revoked_keys

# /etc/ssh/auth_principals/ubuntu
sre@example.com
oncall@example.com
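
To make the host-side check concrete, here is a rough model of how the principals in a presented certificate are compared with the per-account file. It is a simplification for illustration only: real sshd also verifies the CA signature, the validity window, and the revocation list, and the function names here are hypothetical.

```python
def parse_principals_file(text: str) -> set:
    """Return the principals listed in an AuthorizedPrincipalsFile-style text."""
    allowed = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Options may precede the principal; the principal is the last field.
        allowed.add(line.split()[-1])
    return allowed


def cert_accepted(cert_principals: list, principals_file_text: str) -> bool:
    """True if at least one certificate principal appears in the file."""
    return bool(set(cert_principals) & parse_principals_file(principals_file_text))


ubuntu_file = "sre@example.com\noncall@example.com\n"
print(cert_accepted(["sre@example.com"], ubuntu_file))         # True
print(cert_accepted(["contractor@example.com"], ubuntu_file))  # False
```

The consequence worth noticing is that a certificate carrying no listed principal is rejected even though the CA signature is valid, which is what makes the per-host file the access policy.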

On the client side, the certificate arrives in ~/.ssh/ alongside the private key, and OpenSSH picks it up automatically when the filenames match (id_ed25519 + id_ed25519-cert.pub). The engineer runs ssh user@host as before.

On the CA side you have one decision to make that matters more than the others: certificate TTL. Four hours is the common default. It is short enough that a stolen cert is rarely worth stealing, and long enough that nobody is re-authenticating mid-deploy. The minimum useful value is probably one hour; below that the user experience suffers without much marginal security benefit because the IdP session is already cached.
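
The trade-off can be put in numbers. The sketch below computes how long an already-issued certificate stays usable after the IdP account is deprovisioned, assuming nobody adds it to the revocation list; the function name and timestamps are illustrative.

```python
from datetime import datetime, timedelta


def residual_access(issued_at: datetime, deprovisioned_at: datetime,
                    ttl: timedelta) -> timedelta:
    """Time a certificate remains valid after the account is deprovisioned."""
    expires_at = issued_at + ttl
    remaining = expires_at - deprovisioned_at
    return max(remaining, timedelta(0))  # already expired -> zero


# Worst case: a cert issued one minute before a 17:00 offboarding, 4-hour TTL.
issued = datetime(2024, 5, 1, 16, 59)
offboarded = datetime(2024, 5, 1, 17, 0)
print(residual_access(issued, offboarded, timedelta(hours=4)))  # 3:59:00
```

The worst case is always just under one TTL, so halving the TTL halves the maximum exposure at the cost of more frequent logins.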

Static SSH keys vs SSH CA: a side by side

Dimension	Static keys	SSH CA
Time to revoke an ex-employee	Minutes to hours (run config mgmt; hope inventory is complete)	Bounded by the cert TTL (IdP deprovision halts new cert issuance at once; any existing cert expires within the TTL, or sooner if it is added to RevokedKeys)
Audit evidence for SOC 2 CC6.1	authorized_keys file snapshots, change history in Git, manual reconciliation	Cert issuance log keyed on IdP identity, signed by the CA
Key custody	Many laptops, sometimes vaults, sometimes CI	One CA private key in a hardware KMS; no long-lived per-user secrets on disk
Adding a host	Push key list via config management	Place CA public key once; bake into the AMI / cloud-init
Adding a user	Generate key, append to authorized_keys, re-run playbook	Add to IdP group; first cert issued on next login
Cost at 50 hosts, 15 engineers	Free, plus the cost of the SOC 2 finding	Free under the Linux Identity Starter plan up to 10 hosts and 5 users; see pricing beyond that

One concrete migration

Consider a 12-person infrastructure team running 38 EC2 instances on AWS, two GCP projects, and a single self-hosted GitLab. They had four named environments (dev, staging, prod, sandbox), four Ansible roles, and roughly 16 keys in active rotation. The migration took 11 weeks.

Weeks 1 to 2 were inventory. They wrote a one-page Ansible task that ran cat /home/*/.ssh/authorized_keys /root/.ssh/authorized_keys on every host, hashed each key with SHA-256, and posted the result back to a Google Sheet. The sheet had 38 rows of hosts and 16 columns of keys. Nobody recognised the names of two of those keys, so they flagged them UNKNOWN_OWNER and treated them as untrusted for the rest of the migration.
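
The hashing step of such an inventory can be sketched as follows. This is not the team's actual script; it is a minimal example that computes the same SHA256 fingerprint ssh-keygen -l prints (the SHA-256 of the decoded key blob, base64-encoded without padding) so that the same key is recognised across hosts regardless of its comment or options. The function names and the simplified line parsing are assumptions.

```python
import base64
import hashlib
from typing import Optional

# Key type names begin with one of these (ssh-ed25519, ecdsa-sha2-nistp256, sk-...).
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-sha2-", "sk-")


def fingerprint(line: str) -> Optional[str]:
    """Return the SHA256 fingerprint of one authorized_keys line, or None."""
    fields = line.split()
    for i, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            try:
                blob = base64.b64decode(fields[i + 1], validate=True)
            except ValueError:
                continue  # not a key blob; keep scanning past any options
            digest = hashlib.sha256(blob).digest()
            return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
    return None  # comment, blank, or unparseable line


def hosts_by_key(files_by_host: dict) -> dict:
    """Map each fingerprint to the set of hosts whose authorized_keys trust it."""
    result = {}
    for host, text in files_by_host.items():
        for line in text.splitlines():
            fp = fingerprint(line)
            if fp is not None:
                result.setdefault(fp, set()).add(host)
    return result
```

Keying the sheet on the fingerprint rather than the comment field matters because comments are free text: the same key can appear as alice@laptop on one host and as deploy on another.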

Weeks 3 to 6 were a pilot. They picked five non-prod hosts and three engineers. They turned on the CA, added the CA public key to TrustedUserCAKeys, and configured AuthorizedPrincipalsFile. authorized_keys stayed in place as a fallback. The three engineers used the new flow for a week. Two real issues surfaced: ssh -A agent forwarding behaved differently when the agent held certificates, and Ansible (which still used static keys) needed an exception path. Both were resolvable.

Weeks 7 to 10 were expansion. They moved all non-prod hosts onto the CA. They added a CI service account that received certs via OIDC workload identity rather than a static key. They deprecated 9 of the 16 static keys.

Week 11 was the cutover. They removed authorized_keys from prod hosts and kept a break-glass static key in a sealed envelope in the office safe; nobody has reached for it. The two unknown keys were retired with no incident, which strongly suggests they had not been in use. The SOC 2 auditor in October accepted the cert issuance log as evidence for CC6.1 access reviews.

Common objections and what they are worth

"We already use Vault for SSH." HashiCorp Vault's SSH secrets engine has two modes. The OTP mode is dead-ended (no new features since 2019). The signed-certificates mode is exactly the SSH CA pattern in this article, just self-hosted with Vault as the CA. If you are running it and it works, great. If you are not yet running it, the question is whether you want to run a Vault cluster and pay for SSO integration via Enterprise, or use a managed service. Both produce the same OpenSSH-format certificate at the host.

"Our hosts cannot reach the internet." Air-gapped hosts do not need to reach the internet for cert validation. The CA public key is placed on the host once, at provisioning. Certificate validation is local: sshd verifies the signature against the public key it already holds. The internet-touching part is the engineer's laptop, which fetches a cert from the CA over HTTPS. The host never makes an outbound call.

"Engineers will hate it." The most common worry, and the one most consistently wrong. Engineers spent the last decade learning to live with ephemeral cloud credentials from aws sso login. Short-lived SSH certs that arrive transparently behind ssh user@host feel like the same pattern. The behaviour change is one initial linuxid login per session. If you make that command instant (cached IdP session, browser already open), nobody notices.

What this does not solve

An SSH CA is not an everything-bagel. It does not, by itself, audit privileged commands once a session is open; that is a sudo and PAM problem. It does not enforce least privilege on which hosts an engineer can reach; you still need an access policy (typically a list of principals per host or per host class). And it does not eliminate the need for a break-glass path when the IdP is down. You will want a sealed, manually-signed long-lived cert on physical media for the day Okta has a regional outage.

What it does solve is the inventory problem. The set of currently-authorised humans becomes the set of currently-active IdP accounts in the relevant groups. Revocation becomes the same operation as deactivating any other login. The SOC 2 auditor gets a single log instead of seven config-management snapshots. The contractor leaves and the access is gone within four hours without anyone having to remember.

Where to go next

If you are convinced of the direction and want the configuration details, read the OpenSSH CA production guide. If you need a week-by-week rollout schedule with rollback procedures, the 90-day plan maps onto a typical Series A team. If you are doing this because of an audit, the SOC 2 article spells out exactly which controls this pattern satisfies.

Linux Identity ships the IdP-to-CA flow as a managed service with a five-minute install. The Starter tier is free up to 10 hosts and 5 users; see pricing or the product overview.