To start with, I would like to explain what self-healing means in Kubernetes. Self-healing is the ability of Kubernetes to recover from a service or node failure automatically. In this article, we will consider the benefit of using replication for your microservices and see how a Kubernetes cluster can automatically recover from a service failure.
Prerequisite
One of the great features of Kubernetes is the ability to replicate pods and their underlying containers across the cluster. So, before we try out self-healing, please make sure replication is set up. Here is a simple example of a deployment file that deploys an nginx container with a replication factor of 3:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15.4
        ports:
        - containerPort: 80
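One detail worth checking in a manifest like this is that the labels under selector.matchLabels also appear in the pod template's labels, because the Deployment uses that selector to find the pods it owns. Below is a minimal Python sketch of that check, with the manifest rewritten as a plain dict so no YAML library is needed. The helper name selector_matches_template is hypothetical and not part of any Kubernetes tool.

```python
# The manifest from the article as a plain Python dict (no YAML parser needed).
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "nginx-deployment-example", "labels": {"app": "nginx"}},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "nginx"}},
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {
                "containers": [
                    {
                        "name": "nginx",
                        "image": "nginx:1.15.4",
                        "ports": [{"containerPort": 80}],
                    }
                ]
            },
        },
    },
}


def selector_matches_template(manifest):
    """Return True if every matchLabels entry appears in the pod template labels.

    Hypothetical helper for illustration only.
    """
    spec = manifest["spec"]
    wanted = spec["selector"]["matchLabels"]
    labels = spec["template"]["metadata"]["labels"]
    return all(labels.get(key) == value for key, value in wanted.items())


print(selector_matches_template(deployment))  # prints True
```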
All right, so let’s create the deployment:
kubectl create -f deployment.yaml
Now, let’s check whether our nginx-deployment-example deployment was created:
kubectl get deployments -n default
You should see your nginx-deployment-example deployment in the default namespace. To see more details about its pods, run the following command:
kubectl get pods -n default
We will see our 3 nginx-deployment-example pods:
NAME                                      READY   STATUS    RESTARTS   AGE
nginx-deployment-example-f4cd8584-f494x   1/1     Running   0          94s
nginx-deployment-example-f4cd8584-qvkbg   1/1     Running   0          94s
nginx-deployment-example-f4cd8584-z2bzb   1/1     Running   0          94s
Self-healing
Kubernetes ensures that the actual state of the cluster and the desired state of the cluster are always in sync. This is made possible through continuous monitoring within the Kubernetes cluster. Whenever the state of the cluster drifts from what has been defined, the various components of Kubernetes work to bring it back to its defined state. This automated recovery is often referred to as self-healing.
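The idea of comparing desired and actual state can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not how the Kubernetes controllers are actually implemented: the function reconcile and the pod names it generates are made up for illustration.

```python
import itertools

# Counter used only to give simulated pods unique names.
_suffix = itertools.count(1)


def reconcile(desired_replicas, running_pods):
    """Return a pod list whose length matches the desired replica count."""
    pods = list(running_pods)
    # Too few pods: create replacements until the count matches.
    while len(pods) < desired_replicas:
        pods.append(f"pod-{next(_suffix)}")
    # Too many pods: drop the surplus (a no-op when the count already fits).
    del pods[desired_replicas:]
    return pods


pods = reconcile(3, [])      # initial rollout: three pods are created
pods.remove(pods[0])         # simulate deleting one pod
pods = reconcile(3, pods)    # the next pass replaces the missing pod
print(len(pods))             # prints 3
```

A real controller runs this kind of comparison continuously, which is why a deleted pod is replaced without any manual step.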
So, let’s copy the name of one of the pods listed in the prerequisite section and see what happens when we delete it:
kubectl delete pod nginx-deployment-example-f4cd8584-f494x
And after a few seconds, we see that our pod was deleted:
pod "nginx-deployment-example-f4cd8584-f494x" deleted
Let’s go ahead and list the pods one more time:
kubectl get pods -n default
NAME                                      READY   STATUS    RESTARTS   AGE
nginx-deployment-example-f4cd8584-qvkbg   1/1     Running   0          109s
nginx-deployment-example-f4cd8584-sgfqq   1/1     Running   0          5s
nginx-deployment-example-f4cd8584-z2bzb   1/1     Running   0          109s
And we see that the pod nginx-deployment-example-f4cd8584-sgfqq was automatically created to replace our deleted pod nginx-deployment-example-f4cd8584-f494x. The reason is that the nginx deployment is set to have 3 replicas. So, even though one of them was deleted, our Kubernetes cluster works to make sure that the actual state matches the desired state.
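To spot the replacement without comparing names by eye, the two pod listings can be diffed. The following Python sketch uses the listings shown in this article as input; the helper pod_names is a hypothetical name introduced here.

```python
# Pod listing before the delete, as shown in the prerequisite section.
BEFORE = """\
NAME                                      READY   STATUS    RESTARTS   AGE
nginx-deployment-example-f4cd8584-f494x   1/1     Running   0          94s
nginx-deployment-example-f4cd8584-qvkbg   1/1     Running   0          94s
nginx-deployment-example-f4cd8584-z2bzb   1/1     Running   0          94s
"""

# Pod listing after the delete.
AFTER = """\
NAME                                      READY   STATUS    RESTARTS   AGE
nginx-deployment-example-f4cd8584-qvkbg   1/1     Running   0          109s
nginx-deployment-example-f4cd8584-sgfqq   1/1     Running   0          5s
nginx-deployment-example-f4cd8584-z2bzb   1/1     Running   0          109s
"""


def pod_names(listing):
    """Extract the NAME column from `kubectl get pods` text output."""
    rows = listing.strip().splitlines()[1:]  # skip the header row
    return {row.split()[0] for row in rows if row.strip()}


deleted = pod_names(BEFORE) - pod_names(AFTER)
created = pod_names(AFTER) - pod_names(BEFORE)
print(sorted(deleted))  # ['nginx-deployment-example-f4cd8584-f494x']
print(sorted(created))  # ['nginx-deployment-example-f4cd8584-sgfqq']
```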
So, now let’s consider the case when there is an actual node failure in your cluster. First, let’s check our nodes:
kubectl get nodes
You will see your Master and Worker nodes. Now, let’s figure out which server our pods are running on. To do that, we have to describe each pod:
kubectl describe pod nginx-deployment-example-f4cd8584-qvkbg
Under Events, you can see which server the pod was assigned to. If we scroll up, the Node field shows the same server. Once you have identified all the servers that the pods are running on, pick one of them and simulate a node failure by shutting it down.
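Describing every pod by hand gets tedious with more replicas. As an alternative sketch, the JSON printed by kubectl get pods -o json carries each pod's node in spec.nodeName, and a short Python function can group pods by node. The sample input below is invented for illustration: the pod names come from this article, but the node names worker-1 and worker-2 are assumptions, as is the helper name pods_by_node.

```python
import json
from collections import defaultdict

# Hypothetical, trimmed-down stand-in for `kubectl get pods -o json` output.
# The node names are made up; only the fields used below are included.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "nginx-deployment-example-f4cd8584-qvkbg"},
         "spec": {"nodeName": "worker-1"}},
        {"metadata": {"name": "nginx-deployment-example-f4cd8584-sgfqq"},
         "spec": {"nodeName": "worker-2"}},
        {"metadata": {"name": "nginx-deployment-example-f4cd8584-z2bzb"},
         "spec": {"nodeName": "worker-2"}},
    ]
})


def pods_by_node(raw_json):
    """Group pod names by the node they are scheduled on."""
    grouped = defaultdict(list)
    for item in json.loads(raw_json)["items"]:
        # Pods that have not been scheduled yet have no nodeName.
        node = item["spec"].get("nodeName", "<unscheduled>")
        grouped[node].append(item["metadata"]["name"])
    return dict(grouped)


for node, names in sorted(pods_by_node(SAMPLE).items()):
    print(node, len(names))
```

With the sample data this prints one line per node with its pod count, which makes it easy to pick a node that actually hosts a replica before shutting it down.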
Once the node has been shut down, let us head back to the Master and check on the status of the nodes:
kubectl get nodes
We see that the cluster knows that one node is down. Let’s also list our pods:
kubectl get pods -n default
You see that the status of the pod that was running on the "failed" node is Unknown, but another pod has taken its place. Let’s go ahead and list the deployments:
kubectl get deployments -n default
And as expected, we have 3 pods available, which is now in sync with our desired number of pods.
Now, let’s try to describe the Unknown pod:
kubectl describe pod <pod_in_unknown_state>
You see that the Status of the Unknown pod is Terminating, the Termination Grace Period is 30s (the default), and the Reason is NodeLost. You will also see a message specifying that the node which was running our pod is unresponsive.
Now, let’s start our "failed" node and wait until it successfully rejoins the cluster.
All right, once the "failed" node is up and running, let’s go ahead and check the status of our deployment:
kubectl get deployments -n default
We will see that our pod count is in sync. So, let’s list our pods out:
kubectl get pods -n default
We will see that our Kubernetes cluster has finally terminated the old pod, and we are left with our desired count of 3 pods.
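The node-failure sequence above can be replayed as a small Python simulation. It is a simplified model of the behavior described in this article, not of Kubernetes internals; the node names worker-1 and worker-2, the pod names, and the reconcile helper are all hypothetical.

```python
def reconcile(desired, pods, healthy_nodes):
    """Add Running pods on healthy nodes until `desired` pods are Running."""
    running = [p for p in pods if p["status"] == "Running"]
    # range() is empty when enough pods are already running.
    for i in range(desired - len(running)):
        node = healthy_nodes[i % len(healthy_nodes)]
        pods.append({"name": f"replacement-{i}", "node": node, "status": "Running"})
    return pods


# Three replicas spread over two hypothetical worker nodes.
pods = [
    {"name": f"nginx-{i}", "node": node, "status": "Running"}
    for i, node in enumerate(["worker-1", "worker-2", "worker-2"])
]

# Step 1: worker-1 is shut down, so its pod can no longer report status.
for pod in pods:
    if pod["node"] == "worker-1":
        pod["status"] = "Unknown"

# Step 2: a replacement is created on the remaining healthy node.
pods = reconcile(3, pods, ["worker-2"])
statuses_during = sorted(p["status"] for p in pods)

# Step 3: worker-1 rejoins and the stale pod is finally removed.
pods = [p for p in pods if p["status"] != "Unknown"]

print(statuses_during)  # ['Running', 'Running', 'Running', 'Unknown']
print(len(pods))        # 3
```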
Conclusion
As we have seen, Kubernetes takes a self-healing approach to infrastructure that reduces the criticality of failures, making fire drills less common. Kubernetes heals itself when there is a discrepancy and works to keep the cluster in line with the declared state. In other words, Kubernetes kicks in and fixes a deviation when one is detected. For example, if a pod goes down, a new one will be deployed to match the desired state. Alternatively, you can also use a cloud container hosting platform.