Homelab Failure Domains
Everything is in shambles.
One machine is dead and probably not coming back. Services are randomly scattered amongst the survivors. My Kubernetes project is kinda-sorta paused.
I'm unnerved and want to get it all sorted out, but I need to do some thinking out loud first.
My infrastructure currently consists of:
- Omicron, a Kubernetes cluster that is just barely functioning. It is currently running this site, VMSave, and basically nothing else. Consists of hypnotoad, crushinator, and a control plane VM running on a Wyse 5070 running Proxmox.
- Nimbus, a Kubernetes cluster that is not functioning at all. I tried building out a GitOps-driven cluster as my second attempt and everything was going swimmingly until nibbler, a historically unreliable piece of hardware that, when it works, has more cores and memory than the rest of my infra combined, fell over again in yet another inexplicable way.
- Lrrr, a box with an Intel N150 and 32GB of memory running a VM on top of Proxmox that is hosting almost everything that was previously on the two Kubernetes clusters.
This jumbled state of affairs is basically due to a series of impulsive hardware purchases and "oh that's neat, let's do that" infrastructure changes.
Let's talk about failure domains.
I think of a failure domain as a set of risks and mitigation strategies as applied to a particular instance of a service.
The canonical example in the software-as-a-service world is "production", i.e. the instance of the service that the customers touch. The one that makes the money. The primary risk is the money going away if the service goes down.
A SaaS shop may have a staging environment, where changes get tested before they hit production. The main risk in staging is inconveniencing your coworkers, and the consequences of that to the company are much smaller.
Each developer then hopefully has one or more of their own environments in which to actually make the software. These are practically risk-free to the company as a whole, only inconveniencing one developer if something goes awry.
Overcomplicated home infrastructure doesn't map neatly into the same failure domains as a SaaS business, of course, but they still exist.
When I think about the users of the services in my home I imagine a sort of abstract "household delight" score. Points accrue implicitly when things are running fine and people are able to use the things I'm trying to provide. Points get deducted when they notice things aren't working or when they see me stomping around grumbling about full hard drives and boot errors.
By that logic I have three different failure domains (actually four but we'll get to that):
- Critical production: The absence of service would be immediately noticed and commented upon, often affecting the comfort of the occupants of the house. Examples: network, DNS, Home Assistant and friends, IoT coordinators.
- Production: The absence of service would be noticed eventually, but even an extended outage wouldn't cause hardship. Examples: Jellyfin, Sonarr and friends, paperless-ngx.
- Lab: I'm the only one affected by things breaking in the lab. A playground for testing and fucking around.
The fourth failure domain that doesn't neatly map into the above is production services for external users. VMSave and this site are the big ones but there are a few smaller things too.
When I'm brutally honest with myself I have to recognize that the biggest common source of failure in every domain is me. Trying things, adding hardware, replacing software, messing around, testing in production.
Often my partner will remark, "I don't understand how things just fail!" Usually they don't just fail on their own. Failure is an immediate or delayed result of me changing something without considering the impact.
So. What to do.
Obviously first I need to delineate the lab from everything else. Separate hardware for sure, maybe even hide it all behind another router and subnet.
For production, one plan would be to just put everything critical and production on the one Docker VM and let it be. The machine isn't struggling overall, but Jellyfin isn't super great because the N150 doesn't quite have the oomph necessary to transcode some of the stuff we have in real time.
Another plan would be to split them onto two machines running Docker VMs. This would reduce the churn on critical production and reduce the chances of a change messing things up.
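If I went the two-VM route, the critical box would stay deliberately boring. Something like the compose sketch below, assuming a Home Assistant / MQTT / Zigbee sort of stack (the services, images, and paths are placeholders, not my actual setup): only household-critical services live on it, everything restarts on its own, and nothing experimental ever gets deployed there.

```yaml
# Hypothetical docker-compose.yml for the "critical production" VM.
# Images, volumes, and device paths are placeholders; the point is the
# boundary: only services the household would immediately miss go here.
services:
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    restart: unless-stopped
    network_mode: host            # simplest way to let HA see devices on the LAN
    volumes:
      - ./homeassistant:/config
  mosquitto:
    image: eclipse-mosquitto:2
    restart: unless-stopped
    ports:
      - "1883:1883"
    volumes:
      - ./mosquitto/config:/mosquitto/config
      - ./mosquitto/data:/mosquitto/data
  zigbee2mqtt:
    image: koenkk/zigbee2mqtt:latest
    restart: unless-stopped
    depends_on:
      - mosquitto
    devices:
      - /dev/ttyUSB0:/dev/ttyUSB0  # placeholder path for the Zigbee coordinator stick
    volumes:
      - ./zigbee2mqtt:/app/data
```

The Jellyfin/Sonarr/paperless-ngx tier would get its own compose file on the other VM, so a botched upgrade over there can never take the lights out over here.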
Yet another plan would be to spin up a separate Kubernetes cluster for each, moving right along the overcomplicated continuum.
The thing is, Kubernetes makes sense to me now that I've worked with it in anger a little. I really think it fits my use case, and the problems with Nimbus come down to nibbler being flaky and k8s trying to self-heal without enough resources available.
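Which suggests the fix is less about abandoning Kubernetes and more about being honest with the scheduler about capacity. A minimal sketch, with a made-up workload (name, image, and sizes are all placeholders): if every pod declares resource requests, then when a flaky node like nibbler drops out, the displaced pods that don't fit elsewhere just sit Pending instead of overcommitting the surviving nodes into a death spiral.

```yaml
# Hypothetical Deployment snippet; app name, image, and sizes are placeholders.
# Explicit requests let the scheduler refuse to cram displaced pods onto nodes
# that can't actually hold them, so one flaky node failing doesn't starve the rest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 512Mi
```

The other half of that strategy would be tainting the unreliable node (something like `kubectl taint nodes nibbler reliability=flaky:NoSchedule`) so only workloads that explicitly tolerate it ever land there.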
I don't know what to do about external production. My intent was to have it at home out of principle (or maybe out of spite) but it would probably be better to have it in an isolated cloud environment.
The one Docker VM is working ok, but it's mixing failure domains which makes me uncomfortable. For now, things are how they are and I can't let myself worry about it too much.
Links in the footer if you have comments or ideas. I'd love to hear them.