Highly available ssh tunnels

Posted on February 9, 2022 by Logan McGrath

Estimated reading time: 6m 42s

In my previous post, Reasons why my website is offline, I complained about systemd giving up when it fails to maintain ssh tunnels. In this post, I complain about systemd a bit more and how I gave up and stopped using it for managing my ssh tunnels.

Recall from my last post:

My website is hosted from my closet computer, an old PC tower that sits in my bedroom closet
I have no static IP address
I maintain a remote ssh tunnel from a bastion server hosted on a linode to my closet computer
My host names’ DNS records point to the IP address of the bastion server
The bastion server acts as a proxy into the ssh tunnels to the closet computer
The closet computer handles the traffic it receives

This is how it looks:

         +-BASTION-SERVER--------------\
         | <=proxy=> | <===remote tunnel===> +-CLOSET-COMPUTER---\
SSH ---> | *:22----> | localhost:10022 | --> | *:22 SSH          |
HTTP --> | *:80----> | localhost:10080 | --> | *:80 goto 443 duh |
HTTPS -> | *:443---> | localhost:10443 | --> | *:443 my website  |
         `-----------+-----------------+     `-------------------+

Failure, `systemd`, and duct tape

When my home internet connection is down, the remote ssh tunnels into the closet computer fail to reopen. I have been using systemd on the closet computer to manage these tunnels, and systemd will eventually stop trying to open the tunnels if the network remains inaccessible. This retry-then-give-up behavior safeguards against situations such as the thundering herd problem, and this Stack Overflow question highlights the specific levers used to configure systemd’s retry behavior. To paraphrase, systemd tries to recover within a fixed number of tries before it gives up. Given that there is a retry limit, I can’t reasonably extend the interval between tries so that my webserver recovers from a multi-hour outage without my personal intervention. Moreover, I have to be physically next to the computer to recover it when this happens.
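As a rough illustration of those levers, a unit for one tunnel might look like the sketch below. The description, the ExecStart command, and every number are assumptions chosen for illustration rather than a record of the real unit; only the directive names come from systemd itself.

```ini
# Hypothetical unit file, e.g. /etc/systemd/system/ssh-tunnel.service
[Unit]
Description=Remote ssh tunnels to the bastion server (illustrative)
After=network-online.target
Wants=network-online.target
# Allow at most 5 start attempts within any 300 second window.
# Once the limit is hit the unit enters a failed state and is not retried.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
User=autossh
# -N keeps the connection open for forwarding without running a command.
ExecStart=/usr/bin/ssh -N bastion.thisfieldwas.green
# Restart whenever ssh exits, waiting 30 seconds between attempts.
Restart=always
RestartSec=30
```

With values like these, five quick failures in a row exhaust the start limit within a few minutes of an outage beginning, after which the tunnels stay down until someone restarts the unit by hand.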

I can’t make any guarantees about my internet connection’s uptime, and I’m not at risk of unleashing a thundering herd against my own infrastructure. systemd therefore does not make sense as a solution for keeping my ssh tunnels open.

As a better solution, one both held together by duct tape and recommended by Vlad, I have instead leveraged a simple cronjob:

*/5 * * * * ssh closet.thisfieldwas.green true || ssh -fN bastion.thisfieldwas.green

I also added the following ssh configuration for the autossh user that opens the tunnels:

Host bastion.thisfieldwas.green
    Port 22
    RemoteForward 10022 localhost:22
    RemoteForward 10080 localhost:80
    RemoteForward 10443 localhost:443

Host closet.thisfieldwas.green
    ProxyCommand ssh jump_user@bastion.thisfieldwas.green -W 127.0.0.1:10022

This simple setup tests every five minutes whether the ssh tunnel to the closet computer is open by ssh’ing in through its external hostname via ssh closet.thisfieldwas.green true. There is an account named jump_user that is allowed to ssh into the tunnels open on the bastion server, and all ssh requests to the closet computer proxy through this user account. If an ssh connection cannot be made, then the remote tunnels are opened via ssh -fN bastion.thisfieldwas.green. This setup works when I reboot the closet computer, if I unplug my router, or if my internet connection drops.
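The same check-then-reopen logic can be sketched as a small script, which leaves room for a few defensive ssh options. This is a minimal sketch, not the cronjob as deployed: the function names and the choice of options are assumptions, though BatchMode, ConnectTimeout, and ExitOnForwardFailure are standard OpenSSH client options.

```shell
#!/bin/sh
# Hypothetical wrapper around the cronjob's one-liner.
# Host names are the ones used in the ssh configuration above.
CLOSET_HOST="closet.thisfieldwas.green"
BASTION_HOST="bastion.thisfieldwas.green"

# Succeeds only if a round trip through the bastion into the closet works.
tunnel_is_up() {
    # BatchMode=yes makes ssh fail instead of prompting for input under cron.
    # ConnectTimeout=10 caps the connection attempt at ten seconds.
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$CLOSET_HOST" true
}

# Opens the remote forwards and drops into the background.
open_tunnel() {
    # ExitOnForwardFailure=yes makes ssh exit non-zero when a RemoteForward
    # cannot be bound, rather than lingering without a working tunnel.
    ssh -fN -o BatchMode=yes -o ExitOnForwardFailure=yes "$BASTION_HOST"
}

# Same shape as the cronjob: reopen the tunnels only when the check fails.
ensure_tunnel() {
    tunnel_is_up || open_tunnel
}
```

To use it from cron, append a call to ensure_tunnel as the last line of the script and point the crontab entry at the script instead of the one-liner.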

I don’t have to worry about my husband Corey accidentally tripping over the power cable because I know that he will plug it back in. When the closet computer has powered back on, the tunnels open shortly after, and I’m offline for about five minutes for a small hiccup. For a self-hosted webserver, I think this is acceptable.

@TODO

A better way to recover from bad configuration

I’ve managed to disable my ssh tunnels three times now by changing host keys, the autossh user’s keys, or the jump_user’s allowed keys. I don’t have a quick solve for these instances beyond manually stepping through and fixing each ssh key error as it comes up, and I would much prefer a solution that’s no more manual than running a single command.

Better uptime, but from the closet

While using a cronjob to keep ssh tunnels open is a step towards higher availability for my website, I still have some distance to cover before my maximum downtime is one minute or less per incident. For instance, last Sunday, Feb 6, 2022 at 12pm PST, a magnitude 3.1 earthquake struck with its epicenter barely a mile from my home, and it felt like a large truck had collided with the house! That earthquake could have cut my internet connection for multiple days.

I am also a bad Californian and underprepared for a worse earthquake, but that’s another matter.

I could simply host my website from the bastion server, but physically it’s located in Fremont, CA, where it could still be knocked offline by a Bay Area earthquake. And sure, linode has other locations, but simply hosting from anywhere that isn’t my closet would deprive me of a good yak shave.

As an exercise, I want to duct tape my own high availability solution together while still serving my website from my closet computer as the primary node. It is fun to tell people that my professional site is coming out of the closet, after all.

Please share with me your thoughts and feedback!