Renewing certificates in a Kubernetes cluster
Monday 06, June 2022

If your Kubernetes cluster has suddenly stopped working, it might be because of expired certificates. I spent almost the entire Sunday afternoon troubleshooting this only because of extremely poor error reporting by Kubernetes services (kubelet) and useless documentation found online.

First, the symptoms

The three VMs hosting the Kubernetes cluster started up as usual and to my horror running kubectl gave me:

The connection to the server x.x.x.:6443 was refused - did you specify the right host or port?

Nice, right?

Initial thoughts

This machine had not been touched for 5 days. Everything was working fine until I shut it down 5 days earlier, so what could be the problem? Going by the error message, I started by troubleshooting kubectl, then moved on to the API server and finally to the kubelet service. I noticed that the kubelet service was failing with no information hinting at why. I read numerous posts, each claiming to fix the problem. The version of Kubernetes they targeted was quite out of date: 1.14, whereas I have 1.21.1. Many of the commands mentioned in these posts were the same or slight variations of each other, like this command (link):

$ sudo kubeadm alpha kubeconfig user --org system:nodes --client-name system:node:$(hostname)
Command "user" is deprecated, please use the same command under "kubeadm kubeconfig"
required flag(s) "config" not set
To see the stack trace of this error execute with --v=5 or higher

What worked

What worked is a mishmash of steps from multiple blog posts. Some of the steps may even be redundant. However, I am listing these steps exactly as I used them, hoping they will work for you. Note: do not try this in production, and it would be a great idea to take a snapshot of your master node before trying these commands.

Before jumping into copy-pasting the commands

Before jumping right in, let's at least be sure the problem is expired certificates.

$ sudo kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

admin.conf                 May 22, 2022 19:16 UTC                                  no
apiserver                  May 22, 2022 19:16 UTC          ca                      no
apiserver-etcd-client      May 22, 2022 19:16 UTC          etcd-ca                 no
apiserver-kubelet-client   May 22, 2022 19:16 UTC          ca                      no
controller-manager.conf    May 22, 2022 19:16 UTC                                  no
etcd-healthcheck-client    May 22, 2022 19:16 UTC          etcd-ca                 no
etcd-peer                  May 22, 2022 19:16 UTC          etcd-ca                 no
etcd-server                May 22, 2022 19:16 UTC          etcd-ca                 no
front-proxy-client         May 22, 2022 19:16 UTC          front-proxy-ca          no
scheduler.conf             May 22, 2022 19:16 UTC                                  no

ca                      May 20, 2031 19:16 UTC   8y              no
etcd-ca                 May 20, 2031 19:16 UTC   8y              no
front-proxy-ca          May 20, 2031 19:16 UTC   8y              no

As we can see, all the certificates except the three CAs expired on May 22, 2022, so it's quite likely that this is what is causing the problem.
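When the list is long, the expired entries can be picked out mechanically. The following is a minimal sketch, not one of the steps used in this post: list_expired_certs is a hypothetical helper name, and it assumes the expiry column keeps the "May 22, 2022 19:16 UTC" shape shown above.

```shell
#!/bin/sh
# list_expired_certs is a hypothetical helper, not part of kubeadm.
# It reads `kubeadm certs check-expiration` output on stdin and prints the
# name of every entry whose expiry date is not later than "now".
# Optional argument: "now" as a UTC stamp YYYYMMDDHHMM (default: current time).
list_expired_certs() {
    now="${1:-$(date -u +%Y%m%d%H%M)}"
    while read -r name month day year time zone rest; do
        # Only rows shaped like "NAME  May 22, 2022 19:16 UTC ..." are considered.
        [ "$zone" = "UTC" ] || continue
        case "$month" in
            Jan) m=01 ;; Feb) m=02 ;; Mar) m=03 ;; Apr) m=04 ;;
            May) m=05 ;; Jun) m=06 ;; Jul) m=07 ;; Aug) m=08 ;;
            Sep) m=09 ;; Oct) m=10 ;; Nov) m=11 ;; Dec) m=12 ;;
            *) continue ;;
        esac
        day="${day%,}"    # drop the trailing comma after the day
        if [ "${#day}" -eq 1 ]; then day="0$day"; fi
        hhmm=$(printf '%s' "$time" | tr -d ':')
        # YYYYMMDDHHMM stamps sort chronologically, so a numeric compare works.
        if [ "$year$m$day$hhmm" -le "$now" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

It would be used as: sudo kubeadm certs check-expiration | list_expired_certs. Informational lines and the CA rows that are still valid are skipped.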

Backing up the old certs and configs

Many of the articles included this step, so here it is:

$ mkdir -p $HOME/k8s-old-certs/pki
$ sudo /bin/cp -p /etc/kubernetes/pki/*.* $HOME/k8s-old-certs/pki
$ sudo /bin/cp -p /etc/kubernetes/*.conf $HOME/k8s-old-certs
$ mkdir -p $HOME/k8s-old-certs/.kube
$ sudo /bin/cp -p ~/.kube/config $HOME/k8s-old-certs/.kube/.
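One thing to be aware of: the *.* glob above only matches files directly inside /etc/kubernetes/pki, so the etcd certificates that kubeadm keeps in the pki/etcd subdirectory are not part of that backup. A minimal sketch of a recursive copy follows; backup_tree is a hypothetical helper name, and the destination path in the usage note is only an example.

```shell
#!/bin/sh
# backup_tree is a hypothetical helper: it copies a whole directory tree,
# preserving permissions and timestamps.
backup_tree() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # "src/." copies the directory's contents, including subdirectories
    # such as pki/etcd, which a "*.*" glob does not descend into.
    cp -Rp "$src/." "$dest/"
}
```

From a root shell, backup_tree /etc/kubernetes /root/k8s-old-certs/etc-kubernetes would keep the configs, the pki directory and pki/etcd together in one place.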

Renewing the certificates

$ sudo kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed 

The certificate used by kubelet

You'll find four files in /var/lib/kubelet/pki/. One of them is kubelet.crt. This file has also expired, as we can see if we check it with openssl:

$ sudo cat /var/lib/kubelet/pki/kubelet.crt | openssl x509 -noout -enddate
notAfter=May 22 18:16:16 2022 GMT

I tried to read about this file, but there seems to be a lot of confusion about what it is used for. In fact, there is disagreement as to whether this file is even needed at all.

Deleting old certificates

Stopping kubelet was not mentioned in any of the articles, but I did it anyway:

$ sudo systemctl stop kubelet
$ sudo rm /etc/kubernetes/kubelet.conf
$ sudo ls -l /var/lib/kubelet/pki
-rw------- 1 root root 2830 May 22  2021 kubelet-client-2021-05-22-19-16-17.pem
lrwxrwxrwx 1 root root   59 May 22  2021 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2021-05-22-19-16-17.pem
-rw-r--r-- 1 root root 2266 May 22  2021 kubelet.crt
-rw------- 1 root root 1675 May 22  2021 kubelet.key

$ sudo rm /var/lib/kubelet/pki/kubelet-client-2021-05-22-19-16-17.pem
$ sudo rm /var/lib/kubelet/pki/kubelet-client-current.pem    
$ sudo rm /var/lib/kubelet/pki/kubelet.crt
$ sudo rm /var/lib/kubelet/pki/kubelet.key

One blog post mentioned that simply restarting all Kubernetes services after deleting the above files should fix the problem. Needless to say, that did not work.

The silver bullet

This specific command regenerated the kubelet's kubeconfig file (/etc/kubernetes/kubelet.conf, deleted in the previous step). The only article which mentions this step is this one.

$ sudo kubeadm init phase kubeconfig kubelet
$ sudo systemctl start kubelet

After this step, the kubelet service at least starts up. However, on trying to run any kubectl command, we see:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
error: You must be logged in to the server (the server has asked for the client to provide credentials)    

Now it looks like something is preventing the client (kubectl) from authenticating to the API server. A slight improvement.

Updating the client data configs

This part of the answer came from here. We need to copy two strings from /etc/kubernetes/admin.conf to ~/.kube/config: replace the values of "client-certificate-data" and "client-key-data" in ~/.kube/config with the values present in /etc/kubernetes/admin.conf. Let's now run kubectl for the last time:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
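The two values are long base64 strings, so copying them by hand is error-prone. Below is a minimal sketch of doing the same replacement with sed. copy_kubeconfig_field is a hypothetical helper name, and the sketch assumes each key appears exactly once in both files (a single-user kubeconfig); with several users in ~/.kube/config it would overwrite all of them.

```shell
#!/bin/sh
# copy_kubeconfig_field is a hypothetical helper: it copies the value of one
# "key: value" line from one kubeconfig file into another.
# Assumption: the key occurs exactly once in each file.
copy_kubeconfig_field() {
    key="$1"
    src="$2"
    dest="$3"
    # Extract the value after "key:" from the source file.
    value=$(sed -n "s|^[[:space:]]*$key:[[:space:]]*||p" "$src" | head -n 1)
    if [ -z "$value" ]; then
        echo "no $key found in $src" >&2
        return 1
    fi
    tmp=$(mktemp) || return 1
    # base64 never contains "|" or "&", so "|" is a safe sed delimiter here.
    # Writing back with cat keeps the destination file's owner and mode.
    sed "s|^\([[:space:]]*$key:\).*|\1 $value|" "$dest" > "$tmp" &&
        cat "$tmp" > "$dest"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It would be called once per field, for example copy_kubeconfig_field client-certificate-data /etc/kubernetes/admin.conf ~/.kube/config, from a shell that can read admin.conf. If ~/.kube/config only ever held this one cluster, copying admin.conf over it whole (the way kubeadm init suggests after setup) achieves the same result.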

What about the minions?

Running kubectl get nodes will result in something like this:

$ kubectl get nodes
NAME         STATUS     ROLES                  AGE    VERSION
k8s-master   Ready      control-plane,master   378d   v1.21.1
k8s-node1    Ready      <none>                 378d   v1.21.1
k8s-node2    NotReady   <none>                 378d   v1.21.1
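With more than a handful of nodes, the ones that still need attention can be filtered out of this output. A minimal sketch, where not_ready_nodes is a hypothetical helper name and the column layout is assumed to match the output above:

```shell
#!/bin/sh
# not_ready_nodes is a hypothetical helper: it reads `kubectl get nodes`
# output on stdin and prints the name of every node whose STATUS column
# is not "Ready" (a cordoned "Ready,SchedulingDisabled" node counts as ready).
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" && $2 !~ /^Ready,/ { print $1 }'
}
```

It would be used as: kubectl get nodes | not_ready_nodes.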

We got the master node working with renewed certificates, but not the worker nodes. Again, this information was not readily available, and it took multiple links, posts and some imagination to get to the solution. After trying a lot of stuff, I finally asked myself 'why not just rejoin the cluster?' and that worked. First, run the following command on the master node to get the join command:

$ sudo kubeadm token create --print-join-command
kubeadm join --token scc32h.8ycsccrktmjyaruv --discovery-token-ca-cert-hash sha256:4b80f208d7ad0937f0ca3e1b2bc5f02e9b4a7e04d76847e9ce33731fa1ab1224

SSH into each worker node and run the following commands:

$ sudo kubeadm reset
$ sudo kubeadm join --token scc32h.8ycsccrktmjyaruv --discovery-token-ca-cert-hash sha256:4b80f208d7ad0937f0ca3e1b2bc5f02e9b4a7e04d76847e9ce33731fa1ab1224

Replace the join command with the one generated for your system. Running kubectl get nodes now prints:

$ kubectl get nodes
NAME         STATUS   ROLES                  AGE    VERSION
k8s-master   Ready    control-plane,master   378d   v1.21.1
k8s-node1    Ready    <none>                 378d   v1.21.1
k8s-node2    Ready    <none>                 378d   v1.21.1

Yaay, phew ... I can now start working on what I wanted to 5 hours back!



