Which Kubernetes Distribution Should You Choose? Lessons From My Failure.

August 17, 2023 • Matthew Duong • Kubernetes;Self Hosting • 4 min read

Background

Last weekend, my home Kubernetes cluster blew up. I was running a 6-node cluster for some extensive workloads for coderone (an AI game tournament) and headbot (generative AI avatars) when things suddenly went south. Here's my journey from the collapse to the rebuild, and what I learned in the process.

Gif showing Coderone Ai Game Programming Tournament

The Downfall: Anomalies in my MicroK8s Cluster

I had been running MicroK8s for the past three years, initially drawn to it because it is offered as an optional add-on during Ubuntu installation. Then, all of a sudden, I noticed these symptoms:

Pods and workloads all became stuck in an unknown or error state.
Random network errors cropped up.
Basic kubectl commands took 30-50 seconds to complete, and many timed out even though I was debugging on the local network.

Further digging revealed that these issues were symptoms of corrupted control plane data storage. Attempts to rebuild the cluster by removing and re-adding nodes surfaced three further symptoms:

Nodes never showed up in the cluster even though the microk8s join command succeeded.
I was unable to add new nodes because the existing configuration kept reporting that the node already existed in the cluster, yet it never appeared in the cluster status.
In other cases the join succeeded but the node never became Ready.
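
The two join symptoms (a node that is never listed, and a node that is listed but never becomes Ready) can be told apart from the JSON that `kubectl get nodes -o json` returns. The following is a minimal sketch, not the procedure I used during the outage; the function names, node names, and sample data are hypothetical, and it only assumes the standard node list shape where each item carries a `Ready` condition under `status.conditions`.

```python
def not_ready_nodes(node_list):
    """Return (name, status) pairs for nodes whose Ready condition is not "True"."""
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # A node that has not reported any Ready condition is treated as "Missing".
        ready = next((c["status"] for c in conditions if c["type"] == "Ready"), "Missing")
        if ready != "True":
            problems.append((name, ready))
    return problems


def missing_nodes(expected_names, node_list):
    """Return expected node names that the API server does not list at all."""
    listed = {node["metadata"]["name"] for node in node_list.get("items", [])}
    return sorted(set(expected_names) - listed)


# Hypothetical sample, trimmed to the fields used above.
SAMPLE = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

print(not_ready_nodes(SAMPLE))
print(missing_nodes(["node-a", "node-b", "node-c"], SAMPLE))
```

Against a real cluster, the `SAMPLE` dictionary would be replaced by `json.loads(...)` of the kubectl output. When the control plane itself is timing out, as it was for me, the kubectl call may never return, so this kind of check is only useful while the API server still answers.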

An image showing the MicroK8s logo by Canonical

Digging Deeper: Kubernetes Distributed Storage

Upon further investigation, I found that MicroK8s uses dqlite for storing cluster configuration. Though lightweight, dqlite isn't as mature as etcd, the data store that most production-grade, high-availability distributions use.

Dqlite: The Lightweight Contender

What is Dqlite?

Dqlite stands for "Distributed SQLite," and it aims to extend the well-known SQLite database engine to a clustered environment. SQLite itself is an embedded SQL database, highly praised for its reliability and efficiency in a wide range of applications, from mobile devices to web servers.

Why Dqlite?

Simplicity: One of the most appealing aspects of dqlite is its simplicity. It's easier to install and get up and running than many other distributed databases.

Resource Efficiency: Dqlite is lightweight and has lower CPU and memory requirements compared to etcd, making it a popular choice for smaller clusters or edge computing.

Limitations

Maturity: Dqlite is not as mature as etcd. This can translate to less community support and fewer features geared toward high availability and data consistency.

Limited Use-Cases: Given its lighter weight, dqlite may not be the best option for more extensive, production-grade clusters requiring high availability and fault tolerance.

Etcd: The Production-Grade Choice

What is Etcd?

Etcd is a distributed key-value store used as Kubernetes’ backing store for all cluster data. Originally developed by CoreOS, it is now a Cloud Native Computing Foundation (CNCF) project.

An image showing the etcd logo by CoreOS

Why Etcd?

High Availability: Etcd is built with high availability in mind. It can tolerate machine failures and network partitions, and it automatically elects a new leader if the current one fails.
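
How many failures a cluster can tolerate follows from majority quorum: etcd uses the Raft consensus algorithm, which needs more than half of the members to agree before a write is committed. A small sketch of that arithmetic (the function names are mine, chosen for illustration):

```python
def quorum(members):
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members):
    """Members that can be lost while a majority is still available."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

This is why control planes are usually sized at three or five members: a two-member cluster tolerates no failures at all, and a fourth member adds no tolerance over three.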

ACID Compliant: Etcd offers ACID properties, ensuring data consistency across the cluster. This is crucial for applications that require transactions to be processed reliably.

Limitations

Resource Intensity: Etcd clusters need fast storage to run well (an SSD is typically recommended). There are explicit warnings against running etcd on the SD card of a Raspberry Pi.
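
The reason storage speed matters is that etcd has to flush its write-ahead log to disk before acknowledging a write, so slow fsync calls translate directly into a slow control plane. As a rough sanity check only, the sketch below times small write-plus-fsync cycles in a directory and reports the 99th percentile. The function names, write count, and payload size are my own assumptions; the etcd documentation's guidance, as I understand it, is that the 99th percentile of WAL fsync duration should stay below roughly 10 ms, and a dedicated benchmarking tool such as fio gives more trustworthy numbers than this.

```python
import math
import os
import tempfile
import time


def fsync_latencies_ms(directory, writes=200, payload_bytes=2048):
    """Time `writes` small write+fsync cycles on a temp file in `directory`."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        payload = b"\0" * payload_bytes
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage, as a WAL append would
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    return ordered[index]


if __name__ == "__main__":
    # Point this at the disk that would hold the etcd data directory.
    samples = fsync_latencies_ms(tempfile.gettempdir())
    print(f"p99 fsync latency: {percentile(samples, 99):.2f} ms")
```

Note that a temp directory backed by RAM (tmpfs) will report unrealistically low latencies, so the directory passed in should live on the actual target disk.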

Complexity: The setup can be complex and might require a better understanding of its operational aspects, including cluster configuration, data backup, and regular maintenance.

Exploring Alternatives: K3s and RKE2

After pulling the plug on MicroK8s, I explored two other distributions: K3s and RKE2.

K3s: The Good and The Bad

K3s seemed like a good fit at first, but my efforts to set it up in high-availability mode were not successful. By default, K3s uses an embedded SQLite database for single-server setups and can switch to embedded etcd for high-availability setups. I was easily able to get a single node running, but I was unable to set up the high-availability configuration (for three nodes). So I moved on to RKE2.

An image showing the K3S logo by Rancher

The RKE2 Experience

Out of the box RKE2 promised:

FIPS 140-2 Compliance: Designed with a focus on security, RKE2 meets the requirements for US government projects.

Air-Gap Support: It can run in environments without direct internet access, offering flexibility for secure or isolated deployments.

Windows Node Support: Unlike many distributions that focus solely on Linux nodes, RKE2 extends its support to Windows nodes.

Best of all, the setup worked the first time around: it took me about 5 minutes to set up and join 6 nodes in the cluster, with 3 master nodes and 3 worker nodes.
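
For readers who want a picture of what joining nodes involves, here is a hedged sketch of the config file each node reads. It assumes the documented RKE2 conventions as I understand them: the file lives at /etc/rancher/rke2/config.yaml, joining nodes point `server` at an existing server on the supervisor port 9345, and all nodes share a `token`. The helper name, IP address, hostname, and token value are hypothetical placeholders, not values from my cluster.

```python
import json

SUPERVISOR_PORT = 9345  # port that joining RKE2 nodes register against


def render_rke2_config(token, first_server=None, tls_sans=()):
    """Render config.yaml text for an RKE2 node.

    Leave `first_server` unset for the very first server; set it to the
    address of an existing server for every additional server or agent.
    """
    lines = []
    if first_server is not None:
        lines.append(f"server: https://{first_server}:{SUPERVISOR_PORT}")
    # json.dumps yields a double-quoted string, which is also valid YAML.
    lines.append(f"token: {json.dumps(token)}")
    if tls_sans:
        lines.append("tls-san:")
        lines.extend(f"  - {json.dumps(name)}" for name in tls_sans)
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
print(render_rke2_config("example-shared-token", tls_sans=["k8s.example.internal"]))
print(render_rke2_config("example-shared-token", first_server="192.0.2.10"))
```

The same joining config serves both roles; whether a machine becomes a master or a worker depends on whether the rke2-server or the rke2-agent service is started on it.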

An image showing the RKE2 logo by Rancher

Learnings and Takeaways

The main takeaway from this experience is that each flavour of Kubernetes serves a different purpose. If you only intend to run a single node and want to experiment, then I would highly recommend MicroK8s.

My take on MicroK8s

Out of the box, it is highly configurable, with many free add-ons that are usually quite finicky to install by hand:

MetalLB to replace the cloud load balancers you get with EKS, AKS, and GKE
NVIDIA GPU Operator to run your AI workloads
NGINX Ingress to have your site exposed to the public internet immediately

However, the shortcomings become apparent almost immediately when you want to scale up and run in high availability. Your network quickly becomes choked and your cluster may just suddenly break.

An image of buff obama 'swolebama' generated by headbot ai

My take on K3s

I’ve had quite some success with K3s where hardware is limited. I made a tutorial on running your own blog from a Raspberry Pi. It promises to be the solution for IoT and is also supposed to run on beefier hardware in high-availability mode. However, I could not get that configuration working out of the box.

My take on RKE2

Up to this point, RKE2 has been running smoothly for me, easily handling the workloads that brought down my MicroK8s setup. The installation process was even simpler than with MicroK8s. However, it does require more initial configuration for features that come pre-configured in MicroK8s.
