Service Breakdown in my Kubernetes Cluster: Steps, Solution, Learning

Last year, I set up a Kubernetes cluster in the cloud. It hosts my blog and two apps, Lighthouse and ApiBlaze. Right in the middle of my holiday, I received a notification email that ApiBlaze was down, and on Friday my blog and Lighthouse became unavailable as well. This article discusses how I approached the problem, found a solution, and what I learned.

This article originally appeared at my blog.

Upptime Monitoring

I monitor my services with Upptime. A feature of it that I became aware of only recently is its deep integration with GitHub. On the day my services broke down, I not only received a notification email, but a GitHub issue was also created automatically. These issues allow you to communicate publicly about an outage and give the users of your service a single point of information.

So at the end of my holiday week, I acknowledged the issues, and started to investigate the root causes.

Identify Current Status

To interact with the Kubernetes cluster, I work directly with the kubectl command line tool. I wanted to check the status of the pods with kubectl get pods --all-namespaces and to see the events happening in the cluster with kubectl get events --all-namespaces. However, I could not connect to the cluster from my local machine: an error message (that I did not record) stated that the certificate used to connect to the cluster was no longer valid. Instead, I logged into the master node and ran the kubectl commands there; they connect directly to the locally running Kubernetes API and therefore could be executed. I saw that all pods of my main applications were down.
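
For completeness, these are the two status commands; executed on the master node, they talk to the local API server and do not depend on the broken client certificate of my local machine:

# Run on the master node, where kubectl talks to the local Kubernetes API
kubectl get pods --all-namespaces
kubectl get events --all-namespaces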

But why?

Identify Problems

  • The master node was quickly running out of disk space: about 5 GB of system log files were written in the course of 24 hours, all of them error messages from the Kubernetes system!
  • The master node's error messages indicated two problems: it could not communicate with the worker nodes to resolve Kubernetes resources, and it could not schedule workloads on its own machine or on the worker nodes.
  • The master node hosted an instance of my blog, which it should not do; all application hosting should happen on the worker nodes.
  • The worker nodes had the status not ready, which means they were not reporting their status to the master node.
  • The worker nodes' Kubernetes log files showed several error messages, among them again the "certificate is invalid" message. Apparently, because of an internal certificate problem, the nodes could not communicate with each other (see the command sketch below).
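
This is roughly how I gathered these findings; the commands are a sketch of what I ran on the nodes, not a verbatim record:

# On the master node: node status and disk usage
kubectl get nodes
df -h /
du -sh /var/log/* | sort -h | tail

# On a worker node: check the K3S agent logs for certificate errors
journalctl -u k3s-agent --since "1 hour ago" | grep -i certificate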

The following points summarize my understanding of the problem:

  • For some reason, Kubernetes internal node communication was disrupted
  • All nodes produced massive amounts of error messages, consuming disk space
  • The master node started to schedule workloads on itself

Now, how do I resolve these issues? And what are my priorities?

Re-Enable Node Communication: 1st attempt

My first attempt to get the worker nodes running normally again was to restart the Kubernetes service. Running systemctl restart k3s-agent on each worker node did not resolve the problem. The communication problems persisted, and I could see a very clear error message: Error: 'x509: certificate has expired or is not yet valid'
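
For reference, this was the restart-and-check loop on each worker node; the log command is one way to surface the error, not necessarily the exact invocation I used:

systemctl restart k3s-agent
journalctl -u k3s-agent -n 50 --no-pager
# => Error: 'x509: certificate has expired or is not yet valid'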

Searching for this error in the context of Kubernetes and K3S, I found GitHub issues, blog posts, and Stack Overflow posts. Since these articles and posts were often tied to a very specific application, it was hard to generalize their specific problems into a solution for my particular one. Eventually (I really don't have a better word for it) I found a hint to delete the central CoreDNS certificate of the Kubernetes control plane, and then to restart the CoreDNS pod so that a new certificate is generated. I followed this approach, and then …
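
For the record, in command form the hint amounts to roughly the following sketch. The secret name below is a placeholder, since the exact name differs per cluster; the pod label k8s-app=kube-dns is the standard label of the CoreDNS pods:

# Find and delete the CoreDNS-related secret so it gets re-created with a fresh certificate
kubectl -n kube-system get secrets | grep -i coredns
kubectl -n kube-system delete secret <coredns-certificate-secret>
# Delete the CoreDNS pod; its deployment re-creates it and picks up the new certificate
kubectl -n kube-system delete pod -l k8s-app=kube-dns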

Interception: No Disk Space Available

While following this approach, I ran into the next problem: the master node had run out of disk space. This is a crucial learning: while trying to make the cluster operational again, I failed to recognize the more immediate issue of the disk filling up. I ignored a tactical problem while working on the strategic challenge.
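
I did not record the exact commands, but reclaiming the space boils down to something like this sketch, assuming the error messages ended up in the systemd journal and in /var/log/syslog:

# Shrink the systemd journal to reclaim space taken by the flood of error messages
journalctl --vacuum-size=200M
# Truncate the oversized system log file as well
truncate -s 0 /var/log/syslog
# Verify that the root partition has free space again
df -h /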

OK, the master node is operational again; let's resolve the communication problem.

Re-Enable Node Communication: 2nd attempt

On the worker nodes, I issued systemctl restart k3s-agent again. And very soon, the nodes were reported as ready again!

Cleanup

Now I just needed to scale up the deployments. But a crucial thing was missing: my private Docker registry that hosts the Docker images. For my private applications, I had decided not to use a persistent volume with the registry. The registry pod was gone, and so were all of its images. Therefore, I needed to re-build all images, upload them, and then scale the deployments.
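
In essence, this meant one round of the following per application. The image name and tag match the Lighthouse deployment shown below; the deployment name is assumed to equal the app label:

docker build -t docker.admantium.com/lighthouse-web:0.3.1 .
docker push docker.admantium.com/lighthouse-web:0.3.1
# Scale the deployment back up once its image is available again
kubectl scale deployment lighthouse-web --replicas=1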

During this step, I also modified each application's deployment spec to prevent it from being scheduled on the master node again. This can be done with a nodeAffinity configuration in the deployment spec, as shown here:

apiVersion: apps/v1
kind: Deployment
metadata:
  # ...
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: lighthouse-web
  template:
    metadata:
      labels:
        # must match the selector above, otherwise the deployment is rejected
        app: lighthouse-web
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - k3s-node1
                      - k3s-node2
      containers:
        - name: lighthouse-web
          image: docker.admantium.com/lighthouse-web:0.3.1
          # ...
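
Applying the updated spec with kubectl apply -f ensures that the pods are only scheduled on the worker nodes k3s-node1 and k3s-node2.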

After re-building all images and pushing them to the registry, my blog, Lighthouse, and ApiBlaze were finally reachable again.

Conclusion

This incident taught me several things. Kubernetes-internal node communication depends on valid certificates, and an expired certificate can take down the whole cluster. A flood of error messages can fill up the disk and become a second, more immediate problem. Application workloads should be pinned to the worker nodes so that the master node stays responsive. And a private registry without a persistent volume means that all images have to be re-built once the registry pod is lost. With the certificate renewed, the disk space freed, and the deployments constrained via nodeAffinity, all services are up again.