Harden SSH Without Locking Yourself Out of Production

Key-only auth, Fail2ban tuning, and an access protocol that stops one wrong login from banning your own deploy box.

Derrick S. K. Siawor · June 1, 2026 · 8 min read

Aisle of black server racks with red and blue cabling in a data center — Photo · Brett Sayles / Pexels

SSH is the front door to every server you run, and the internet knows it. Within minutes of a fresh box getting a public IP, automated bots are throwing username and password combinations at port 22, working through dictionaries, never stopping. It is the same relentless automated probing your login endpoint faces from credential-stuffing botnets, aimed at the OS instead of the app. Hardening SSH is not optional. But the hardening itself carries a risk that catches even experienced engineers: a configuration mistake or a too-aggressive ban rule can lock you out of your own server, and if SSH is your only way in, you are now staring at a provider's recovery console trying to undo it.

To harden SSH without locking yourself out, disable root login and password authentication so bots have nothing to guess, authenticate with keys only, and configure Fail2ban with an allowlist for your own IPs plus a sane ban window so an aggressive rule never bans you. Always keep a second verified way in, such as the provider's recovery console, before you tighten anything.

The work, then, is two things at once. Shut the door on the bots, and never shut it on yourself. Here is how to do both.

Key-only authentication is the single biggest win

The two most impactful changes you can make are disabling root login over SSH and turning off password authentication entirely. Together they remove the attack vector that nearly every brute-force bot relies on, because a bot guessing passwords against a server that does not accept passwords is wasting its time completely.

Key-based authentication replaces the password with a cryptographic key pair. The private key lives on your machine and never leaves it; the public key sits on the server. Authentication proves you hold the private key without ever transmitting a secret that can be guessed. A 256-bit key is not in any dictionary and cannot be brute-forced in any timeframe that matters, which is the whole point.

In sshd_config, the relevant directives are:

PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
PermitRootLogin no

PasswordAuthentication no is the headline. KbdInteractiveAuthentication no closes the side door: with PAM enabled, the keyboard-interactive path can let a password prompt back in even when you thought you disabled it. ChallengeResponseAuthentication is the older name for the same setting (a deprecated alias in recent OpenSSH releases), listed so the same file also covers older versions. PermitRootLogin no means an attacker who somehow obtained root's key still cannot log in directly as root; they have to come in as an unprivileged user and escalate, which is one more barrier and one more audit trail.
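One way to confirm from the client side that no password path is left open is to read the method list the server sends back when it refuses a login. The sketch below assumes the usual OpenSSH client denial line, "user@host: Permission denied (publickey,password)."; the function names and the host in the usage comment are hypothetical. The probe is itself a failed login, so run it from an address that is not at risk of being banned.

```shell
# Extract the comma-separated method list from an OpenSSH denial line
# read on stdin, e.g. "deploy@host: Permission denied (publickey,password)."
methods_from_denial() {
    sed -n 's/.*Permission denied (\([^)]*\)).*/\1/p'
}

# Succeed (exit 0) if the server still offers a password-style method.
password_path_open() {
    methods_from_denial | tr ',' '\n' | grep -Eqx 'password|keyboard-interactive'
}

# Usage (hypothetical host). Asking for the "none" method makes the server
# answer with the methods it would accept:
#   ssh -o PreferredAuthentications=none -o BatchMode=yes \
#       deploy@server.example.com 2>&1 | password_path_open \
#       && echo "password path still open"
```

If the list comes back as publickey only, the directives above are in effect for that user.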

The order of operations that keeps you in

This is where people lock themselves out, and the cause is always the same: disabling password authentication before confirming the key actually works. The sequence matters.

Safe SSH hardening order of operations: verify key login in a second session before disabling password auth

1. Generate your key pair and copy the public key to the server's authorized_keys while password auth is still enabled.
2. Open a second SSH session and log in using the key. Do not close your first session. This is the rule that saves you: keep a known-good session open the entire time you are making changes, because as long as that session lives, you can always undo a mistake.
3. Only after you have confirmed a fresh key-based login works in the second session do you edit sshd_config to set PasswordAuthentication no.
4. Restart the daemon with systemctl restart sshd, then open a third session to verify key login still works with the new config.
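The steps above can be sketched as a handful of shell functions. This is a checklist under assumptions, not a turnkey script: the key path and target are hypothetical, the run wrapper only prints each command unless DRY_RUN=0 is set, and editing sshd_config stays a manual step between the second and third function. sshd -t is added before the restart because it checks the config for syntax errors first.

```shell
# Hypothetical values; replace with your own.
KEY=${KEY:-$HOME/.ssh/id_ed25519_prod}
TARGET=${TARGET:-deploy@server.example.com}

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: create the key pair and install the public key
# while password auth still works.
install_key() {
    run ssh-keygen -t ed25519 -f "$KEY"
    run ssh-copy-id -i "$KEY.pub" "$TARGET"
}

# Step 2: from a second terminal, prove the key works with
# password fallback ruled out on the client side.
verify_key_login() {
    run ssh -i "$KEY" -o IdentitiesOnly=yes -o PreferredAuthentications=publickey "$TARGET" true
}

# Step 4: after editing sshd_config by hand, validate it, then restart.
validate_and_restart() {
    run ssh -t "$TARGET" 'sudo sshd -t && sudo systemctl restart sshd'
}
```

Calling verify_key_login once more after the restart covers the third-session check. Running the functions one at a time, with the original session still open, keeps the order of operations intact.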

A few traps live in this sequence. The directive has to be actually uncommented and active, not just present commented-out. The daemon has to be restarted, because editing the file changes nothing until sshd re-reads it. And modern systems often have config in include directories like /etc/ssh/sshd_config.d/. sshd keeps the first value it reads for each keyword, and the Include line usually sits at the top of the main file, so a drop-in can override your main config and silently re-enable what you just disabled. Check those includes; a setting in your main file that a drop-in overrides is a setting that is not in effect.
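A quick way to spot both traps is to list every active, uncommented occurrence of a directive across the drop-ins and the main file. The helper below is a rough sketch: active_settings is a hypothetical name, it assumes the whitespace-separated "Keyword value" form, and it ignores Match blocks. The authoritative answer is still the effective config that sshd itself reports, for example sudo sshd -T | grep -i passwordauthentication.

```shell
# Print each active (uncommented) setting of a directive, file by file,
# in the order the files are given on the command line.
active_settings() {
    directive=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue
        awk -v d="$directive" -v f="$f" '
            /^[[:space:]]*#/ { next }                      # skip comments
            tolower($1) == tolower(d) { print f ": " $1 " " $2 }
        ' "$f"
    done
}

# Usage: pass the drop-ins first, since they are usually read first.
#   active_settings PasswordAuthentication \
#       /etc/ssh/sshd_config.d/*.conf /etc/ssh/sshd_config
```

If the first line printed says yes and a later one says no, the yes is most likely the one in effect.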

If something does go wrong, the second open session is your fix. If you closed it, the provider's web-based recovery console is your fallback to edit the config back. The difference between a thirty-second correction and a stressful recovery session is whether you kept that session open. The same keep-a-rollback-path-open instinct is what makes a deploy script that rolls itself back when health checks fail safe to run unattended.

Fail2ban, tuned so it bans bots and not you

Key-only auth stops the bots from succeeding, but it does not stop them from trying, and the noise in your auth logs is constant. Fail2ban watches those logs and temporarily bans IP addresses that fail to authenticate too many times, which cuts the noise and adds a second layer in case password auth ever gets re-enabled by accident. Turning that flood of failed-login lines into something you can actually act on is its own discipline, covered in turning noisy server logs into alerts you actually trust.

Configure it in a .local file, never by editing the shipped jail.conf. The shipped config gets overwritten on package updates, so any change you make there evaporates the next time the system updates. Create /etc/fail2ban/jail.local and put your settings there:

[sshd]
enabled  = true
port     = ssh
filter   = sshd
logpath  = /var/log/auth.log
maxretry = 3
findtime = 600
bantime  = 3600
ignoreip = 127.0.0.1/8 ::1 <your-static-ip>

That reads as: ban an IP that fails three times within ten minutes (findtime = 600), and keep it banned for an hour (bantime = 3600). Reasonable defaults that frustrate a brute-force bot without much collateral.

The line that prevents the self-inflicted lockout is ignoreip. Fail2ban does not know your laptop from an attacker's botnet, so if you fat-finger a login a few times from your own office, it will happily ban you with everyone else. The ignoreip list tells it which addresses to never ban. Always include the loopback addresses, and add the static IP of your management machine, office network, VPN exit, or jump host. Forgetting to add your own IP here is the single most common way administrators lock themselves out with Fail2ban, and the deeper version of that story, how one wrong SSH user locks your whole server out, is worth reading before you tune these numbers. If your address is dynamic, lean on a jump host with a static IP and allowlist that instead.
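Before trusting the allowlist, it is worth checking that the address you are actually connecting from is on it. The sketch below reads ignoreip entries straight out of a jail file; the function names are hypothetical, it only handles single-line ignoreip values, and it does exact string matching, so it will not recognise an IP that is covered by a CIDR range. On a running system, fail2ban-client get sshd ignoreip should show what the jail has actually loaded.

```shell
# Print each ignoreip entry found in a jail file, one per line.
ignoreip_entries() {
    awk -F= '/^[[:space:]]*ignoreip[[:space:]]*=/ { print $2 }' "$1" \
        | tr -s ' \t' '\n\n' \
        | sed '/^$/d'
}

# Succeed (exit 0) if the given address appears verbatim in ignoreip.
ip_is_allowlisted() {
    ignoreip_entries "$1" | grep -Fxq -- "$2"
}

# Usage, from inside an SSH session on the server. SSH_CONNECTION starts
# with the client address, so this checks the IP you are logged in from:
#   ip_is_allowlisted /etc/fail2ban/jail.local "${SSH_CONNECTION%% *}" \
#       || echo "your current IP is NOT in ignoreip"
```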

This is precisely the kind of detail that separates a server someone set up once from one that is actually run. We carry it through every box under management as part of disciplined server administration, because a Fail2ban rule that can ban your own deploy pipeline is worse than no rule at all. If you also terminate TLS at this box, the same care applies to serving unlimited subdomains from one Cloudflare origin certificate.

The access protocol that survives a team

Hardening one server is a config exercise. Keeping a fleet hardened, with multiple people and automated deploys reaching in, is an operational one, and it needs a protocol that does not depend on anyone's memory.

The principle that prevents the worst incidents: never retry a failed SSH login with different credentials. One wrong attempt is a typo. Several wrong attempts in a few minutes is exactly the pattern Fail2ban is built to ban, and once your IP is banned you cannot lift the ban from that address. Someone has to run fail2ban-client set sshd unbanip <your-ip> from the provider's console or from a session on an allowlisted IP, or you wait out the bantime. So the rule for any automation and any human is: if a login fails, stop, verify the exact user, key, host, and port against your single source of truth, and only then try again. Do not iterate through guesses. That discipline is what keeps a deploy box from banning itself in the middle of a release.

Keep credentials in one authoritative place, not in scattered shell history and half-remembered notes. A connection harness that reads the right key and user for each host, with the passphrase handled once, removes the guesswork that leads to the failed attempts that trigger the ban. The goal is that no human and no script ever has to guess how to reach a server, because guessing is what gets you banned. The same single-source-of-truth discipline is what makes scripts the source of truth so you never fix production by hand, and it is exactly what keeps an agent you hand your deploy pipeline to from banning the box mid-release.
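A connection harness does not have to be elaborate. The sketch below assumes a hypothetical plain-text inventory with one host per line (name, user, address, port, key file) and builds the exact ssh command for a named host. IdentitiesOnly=yes makes the client offer only that one key, so an agent full of unrelated keys does not turn into a string of failed attempts, and BatchMode=yes stops ssh from falling back to interactive prompts; that assumes the passphrase is already loaded into ssh-agent.

```shell
# Hypothetical inventory file, one host per line:
#   name user address port keyfile
INVENTORY=${INVENTORY:-$HOME/.ssh/inventory.txt}

# Print the ssh command for a named host; exit non-zero if it is unknown.
ssh_command_for() {
    awk -v name="$1" '
        /^[[:space:]]*(#|$)/ { next }          # skip comments and blank lines
        $1 == name {
            printf "ssh -i %s -p %s -o IdentitiesOnly=yes -o BatchMode=yes %s@%s\n", $5, $4, $2, $3
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$INVENTORY"
}

# Usage: an unknown name fails loudly instead of inviting a guess.
#   ssh_command_for web1 || echo "web1 is not in the inventory; do not guess"
```

Printing the command rather than running it keeps the sketch easy to review; a real harness would run it once and stop on failure.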

What hardened actually looks like

A properly hardened SSH setup is quiet. Password authentication is off, so the relentless dictionary attacks fail instantly and never get a foothold. Root login is off, so even a leaked key cannot walk straight in as root. Fail2ban thins out the persistent offenders and bans the noise, while ignoreip makes sure it never bans the people and pipelines that are supposed to be there. And the access protocol means a wrong key never cascades into a self-inflicted lockout.

The bots will keep knocking. They always do. The difference is that on a hardened box, the knocking is just background noise in a log file instead of a real risk, and the one risk that remains, locking yourself out, is the one you have specifically engineered around. Keep the second session open, allowlist your own IP, and never retry blind. Those three habits are most of the gap between SSH that protects you and SSH that traps you.

