If your Kubernetes cluster has suddenly stopped working, it might be because of expired certificates. I spent almost an entire Sunday afternoon troubleshooting this, largely because of the extremely poor error reporting by the Kubernetes services (kubelet in particular) and the unhelpful documentation I found online.
First, the symptoms
The three VMs hosting the Kubernetes cluster started up as usual, and to my horror running kubectl gave me:
The connection to the server x.x.x.:6443 was refused - did you specify the right host or port
Nice, right?
Initial thoughts
Now, this machine had not been touched for five days; everything was working fine when I shut it down, so what could be the problem? I started troubleshooting with kubectl, moved on to the api-server, and finally to the kubelet service. The kubelet service was failing with no information hinting at why. I read numerous posts, each claiming to fix the problem, but the Kubernetes version they targeted was quite out of date: 1.14, while I am running 1.21.1. Many of the commands mentioned across the blog posts were the same, or slight variations of each other, like this one (link):
$ sudo kubeadm alpha kubeconfig user --org system:nodes --client-name system:node:$(hostname)
Command "user" is deprecated, please use the same command under "kubeadm kubeconfig"
required flag(s) "config" not set
To see the stack trace of this error execute with --v=5 or higher
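For anyone retracing these steps, the kubelet failure itself surfaces through the usual systemd tooling (nothing specific to this setup):

$ sudo systemctl status kubelet
$ sudo journalctl -u kubelet --no-pager | tail -n 50

In my case this only showed the service failing and restarting, with nothing pointing at certificates.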
What worked
What worked is a mishmash of steps from multiple blog posts, and some of them may even be redundant. However, I am listing the steps exactly as I used them, hoping they will work for you too. Note: do not try this in production, and it would be a great idea to take a snapshot of your master node before running these commands.
Before jumping into copy-pasting the commands
Before copy-pasting anything, let's at least be sure the problem really is expired certificates.
$ sudo kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 May 22, 2022 19:16 UTC                                           no
apiserver                  May 22, 2022 19:16 UTC                   ca                      no
apiserver-etcd-client      May 22, 2022 19:16 UTC                   etcd-ca                 no
apiserver-kubelet-client   May 22, 2022 19:16 UTC                   ca                      no
controller-manager.conf    May 22, 2022 19:16 UTC                                           no
etcd-healthcheck-client    May 22, 2022 19:16 UTC                   etcd-ca                 no
etcd-peer                  May 22, 2022 19:16 UTC                   etcd-ca                 no
etcd-server                May 22, 2022 19:16 UTC                   etcd-ca                 no
front-proxy-client         May 22, 2022 19:16 UTC                   front-proxy-ca          no
scheduler.conf             May 22, 2022 19:16 UTC                                           no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      May 20, 2031 19:16 UTC   8y              no
etcd-ca                 May 20, 2031 19:16 UTC   8y              no
front-proxy-ca          May 20, 2031 19:16 UTC   8y              no
As we can see, all the certificates expired on May 22, so it is quite likely that this is what is causing the problem.
Backing up the old certs and configs
Many of the articles had this step, so here it is:
$ mkdir -p $HOME/k8s-old-certs/pki
$ sudo /bin/cp -p /etc/kubernetes/pki/*.* $HOME/k8s-old-certs/pki
$ sudo /bin/cp -p /etc/kubernetes/*.conf $HOME/k8s-old-certs
$ mkdir -p $HOME/k8s-old-certs/.kube
$ sudo /bin/cp -p ~/.kube/config $HOME/k8s-old-certs/.kube/.
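If anything goes wrong later, the backup can be restored by simply reversing the copies (a sketch based on the paths above, not a step I ended up needing):

$ sudo /bin/cp -p $HOME/k8s-old-certs/pki/*.* /etc/kubernetes/pki
$ sudo /bin/cp -p $HOME/k8s-old-certs/*.conf /etc/kubernetes
$ /bin/cp -p $HOME/k8s-old-certs/.kube/config ~/.kube/config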
Renewing the certificates
$ sudo kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
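At this point it is worth re-running the check from earlier to confirm the new expiry dates before going any further:

$ sudo kubeadm certs check-expiration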
The certificate used by kubelet
You'll find four files in /var/lib/kubelet/pki/. One of them is kubelet.crt. This file has also expired, as a check with openssl shows:
$ sudo cat /var/lib/kubelet/pki/kubelet.crt | openssl x509 -noout -enddate
notAfter=May 22 18:16:16 2022 GMT
I tried to read up on this file, but there seems to be a lot of confusion about what it is actually used for; in fact, there is disagreement as to whether it is even needed at all.
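For completeness, the client certificate the kubelet presents to the API server sits in the same directory and can be checked the same way (my own addition, not a step from the posts):

$ sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem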
Deleting old certificates
Stopping the kubelet was not mentioned in any of the articles, but I did it anyway:
$ sudo systemctl stop kubelet
$ sudo rm /etc/kubernetes/kubelet.conf
$ sudo ls /var/lib/kubelet/pki
-rw------- 1 root root 2830 May 22 2021 kubelet-client-2021-05-22-19-16-17.pem
lrwxrwxrwx 1 root root   59 May 22 2021 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2021-05-22-19-16-17.pem
-rw-r--r-- 1 root root 2266 May 22 2021 kubelet.crt
-rw------- 1 root root 1675 May 22 2021 kubelet.key
$ sudo rm /var/lib/kubelet/pki/kubelet-client-2021-05-22-19-16-17.pem
$ sudo rm /var/lib/kubelet/pki/kubelet-client-current.pem
$ sudo rm /var/lib/kubelet/pki/kubelet.crt
$ sudo rm /var/lib/kubelet/pki/kubelet.key
One blog post claimed that, after deleting the above files, simply restarting all the Kubernetes services would fix the problem. Needless to say, it did not.
The silver bullet
This specific command regenerated the kubelet's kubeconfig file. The only article which mentions this step is this one.
$ sudo kubeadm init phase kubeconfig kubelet
$ sudo systemctl start kubelet
After this step the kubelet service at least starts up. However, trying to run any kubectl command still fails:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
error: You must be logged in to the server (the server has asked for the client to provide credentials)
Now it looks like something is preventing the client (kubectl) from authenticating to the API server - a slight improvement.
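A quick way to confirm the suspicion (my own check, assuming the default kubeconfig location) is to decode the client certificate embedded in ~/.kube/config and look at its expiry date:

$ grep client-certificate-data ~/.kube/config | awk '{print $2}' | base64 -d | openssl x509 -noout -enddate

If it still shows the old May 22 date, kubectl is presenting the expired certificate even though the server-side certificates have been renewed - which is exactly what the next step fixes.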
Updating the client data configs
This part of the answer came from here. We need to copy two values from /etc/kubernetes/admin.conf into ~/.kube/config: replace the values of "client-certificate-data" and "client-key-data" in ~/.kube/config with the ones present in /etc/kubernetes/admin.conf.
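To pull the two values out without hunting through the file by hand, something like this does the trick (assuming the default paths):

$ sudo grep client-certificate-data /etc/kubernetes/admin.conf
$ sudo grep client-key-data /etc/kubernetes/admin.conf

Paste each printed value over the corresponding field in ~/.kube/config. (If there is nothing custom in your ~/.kube/config, copying /etc/kubernetes/admin.conf over it wholesale should also work.) With the new values in place, let's run kubectl one last time: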
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
What about the minions?
Running get nodes results in something like this:
$ kubectl get nodes
NAME         STATUS     ROLES                  AGE    VERSION
k8s-master   Ready      control-plane,master   378d   v1.21.1
k8s-node1    Ready      <none>                 378d   v1.21.1
k8s-node2    NotReady   <none>                 378d   v1.21.1
We got the master node working with renewed certificates, but not the worker nodes. Again, this information was not readily available, and it took multiple links, posts, and some imagination to arrive at the solution. After trying a lot of things, I finally asked myself 'why not just rejoin the cluster?', and that worked. First, run the following command on the master node to get the join command:
$ sudo kubeadm token create --print-join-command
kubeadm join 192.168.220.30:6443 --token scc32h.8ycsccrktmjyaruv --discovery-token-ca-cert-hash sha256:4b80f208d7ad0937f0ca3e1b2bc5f02e9b4a7e04d76847e9ce33731fa1ab1224
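Keep in mind that the generated token is short-lived (24 hours by default), so create it right before rejoining; existing tokens can be listed with:

$ sudo kubeadm token list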
SSH into each worker node and run the following commands:
$ sudo kubeadm reset
$ sudo kubeadm join 192.168.220.30:6443 --token scc32h.8ycsccrktmjyaruv --discovery-token-ca-cert-hash sha256:4b80f208d7ad0937f0ca3e1b2bc5f02e9b4a7e04d76847e9ce33731fa1ab1224
Replace the join command with the one generated for your system. Running get nodes now prints:
$ kubectl get nodes
NAME         STATUS   ROLES                  AGE    VERSION
k8s-master   Ready    control-plane,master   378d   v1.21.1
k8s-node1    Ready    <none>                 378d   v1.21.1
k8s-node2    Ready    <none>                 378d   v1.21.1
Yaay, phew ... I can now start working on what I wanted to 5 hours back!