Kubernetes. Replication and self-healing

March 18, 2023

To start with, I would like to explain what self-healing means in terms of Kubernetes. Self-healing is the ability of Kubernetes to recover from a service or node failure automatically. In this article, we will consider the benefit of using replication for your microservices and how a Kubernetes cluster can automatically recover from a service failure.

Prerequisite

One of the great features of Kubernetes is the ability to replicate pods and their underlying containers across the cluster. So, before we look at self-healing, please make sure you have replication set up. Here is a simple example of a deployment file that deploys an nginx container with 3 replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15.4
        ports:
        - containerPort: 80

All right, so let’s save the file as deployment.yaml and create the deployment:

kubectl create -f deployment.yaml
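
If you script this step, it can help to wait until all replicas are rolled out before moving on. The sketch below wraps kubectl rollout status in a small shell function. The function name wait_for_rollout and the 60s default timeout are assumptions made for illustration and are not part of the original walkthrough.

```shell
# Hypothetical helper: block until the Deployment's rollout completes
# or the timeout expires (kubectl then exits with a non-zero status).
wait_for_rollout() {
  local name="$1"               # Deployment name
  local namespace="${2:-default}"
  local timeout="${3:-60s}"     # assumed default, adjust as needed
  kubectl rollout status "deployment/${name}" -n "${namespace}" --timeout="${timeout}"
}
```

For example, wait_for_rollout nginx-deployment-example should return once the three replicas are available, or fail after the timeout.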

Now, let’s check whether our nginx-deployment-example was created:

kubectl get deployments -n default

You should see your nginx-deployment-example deployment in the default namespace. To see more details about its pods, run the following command:

kubectl get pods -n default

We will see our 3 nginx-deployment-example pods:

NAME                                     READY   STATUS    RESTARTS   AGE
nginx-deployment-example-f4cd8584-f494x   1/1     Running   0          94s
nginx-deployment-example-f4cd8584-qvkbg   1/1     Running   0          94s
nginx-deployment-example-f4cd8584-z2bzb   1/1     Running   0          94s
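
Rather than counting rows by eye, you could count the running pods with a short function. This is a minimal sketch that assumes the pods carry the app=nginx label from the deployment file above and that STATUS is the third column of the kubectl get pods output; the name count_running_pods is hypothetical.

```shell
# Hypothetical helper: count pods matching a label selector whose
# STATUS column (third field of `kubectl get pods`) is "Running".
count_running_pods() {
  local selector="${1:-app=nginx}"
  local namespace="${2:-default}"
  kubectl get pods -n "${namespace}" -l "${selector}" --no-headers \
    | awk '$3 == "Running" { n++ } END { print n + 0 }'
}
```

With the three pods listed above, count_running_pods would be expected to print the replica count from the deployment file.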

Self-healing

Kubernetes continuously works to keep the actual state of the cluster in sync with the desired state. This is made possible through continuous monitoring within the Kubernetes cluster. Whenever the state of the cluster drifts from what has been defined, the various components of Kubernetes work to bring it back to its defined state. This automated recovery is often referred to as self-healing.
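
The idea of comparing desired and actual state can be illustrated with a toy function. This is only a simplified model written for this article, not the code Kubernetes runs; the name reconcile and its output strings are made up for the example.

```shell
# Toy model only: print the action a controller would take, given the
# desired replica count ($1) and the actual replica count ($2).
reconcile() {
  local desired="$1"
  local actual="$2"
  if [ "${actual}" -lt "${desired}" ]; then
    echo "create $((desired - actual))"   # too few pods: start the missing ones
  elif [ "${actual}" -gt "${desired}" ]; then
    echo "delete $((actual - desired))"   # too many pods: remove the surplus
  else
    echo "in sync"                        # nothing to do
  fi
}
```

Calling reconcile 3 2 prints "create 1", which mirrors what happens when one of three replicas disappears.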
So, let’s copy the name of one of the pods listed in the prerequisite and see what happens when we delete it:

kubectl delete pod nginx-deployment-example-f4cd8584-f494x

And after a few seconds, we see that our pod was deleted:

pod "nginx-deployment-example-f4cd8584-f494x" deleted

Let’s go ahead and list the pods one more time:

kubectl get pods -n default
NAME                                     READY   STATUS    RESTARTS   AGE
nginx-deployment-example-f4cd8584-qvkbg   1/1     Running   0          109s
nginx-deployment-example-f4cd8584-sgfqq   1/1     Running   0          5s
nginx-deployment-example-f4cd8584-z2bzb   1/1     Running   0          109s

We see that the pod nginx-deployment-example-f4cd8584-sgfqq was automatically created to replace our deleted pod nginx-deployment-example-f4cd8584-f494x. The reason is that the nginx deployment is set to have 3 replicas. So, even though one of them was deleted, our Kubernetes cluster works to bring the actual state back to the desired state.
Now let’s consider the case when there is an actual node failure in your cluster. First, let’s check our nodes:

kubectl get nodes

You will see your master and worker nodes. Now, let’s figure out which server our pods are running on. To do that, we describe each pod:

kubectl describe pod nginx-deployment-example-f4cd8584-qvkbg

Under Events, you can see which server the pod was assigned to. If you scroll up, the Node field shows the same server. Once you’ve identified all the servers that the pods are running on, pick one of them and simulate a node failure by shutting it down.
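
As a shortcut to describing every pod, the pod-to-node mapping can be listed in one go. The sketch below uses the custom-columns output format of kubectl get; the function name pods_with_nodes and the default app=nginx selector are assumptions for illustration.

```shell
# Hypothetical helper: print "<pod> <node>" for every pod that matches
# a label selector, using the pod's .spec.nodeName field.
pods_with_nodes() {
  local selector="${1:-app=nginx}"
  local namespace="${2:-default}"
  kubectl get pods -n "${namespace}" -l "${selector}" --no-headers \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
}
```

Running pods_with_nodes before shutting anything down gives you a record of which server hosted which pod.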
Once the node has been shut down, let us head back to the master and check on the status of the nodes:

kubectl get nodes
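
On a larger cluster it may be easier to print only the nodes that are not ready. This is a rough sketch that assumes the STATUS value is the second column of the kubectl get nodes output; the function name not_ready_nodes is hypothetical.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# does not start with "Ready" (for example "NotReady").
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

After the shutdown, not_ready_nodes would be expected to print the name of the server you turned off.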

We see that the cluster knows that one node is down. Let’s also list our pods:

kubectl get pods -n default

You will see that the state of the pod that was running on the "failed" node is Unknown, and that another pod has taken its place. Let’s go ahead and list the deployments:

kubectl get deployments -n default

As expected, we have 3 pods available, which matches our desired number of pods.
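
The same comparison can be made in a script by reading the desired and ready replica counts from the deployment object. This is a minimal sketch; the function name deployment_in_sync is an assumption, and it treats a missing readyReplicas field as zero.

```shell
# Hypothetical helper: succeed (exit 0) when the number of ready
# replicas equals the desired replica count of a Deployment.
deployment_in_sync() {
  local name="$1"
  local namespace="${2:-default}"
  local desired ready
  desired="$(kubectl get deployment "${name}" -n "${namespace}" -o jsonpath='{.spec.replicas}')"
  ready="$(kubectl get deployment "${name}" -n "${namespace}" -o jsonpath='{.status.readyReplicas}')"
  # readyReplicas is omitted from the status when no replica is ready
  [ "${desired}" = "${ready:-0}" ]
}
```

For example: deployment_in_sync nginx-deployment-example && echo "in sync".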
Now, let’s try to describe the Unknown pod:

kubectl describe pod <pod_in_unknown_state>

You will see that the Status of the Unknown pod is Terminating, that the Termination Grace Period is 30s by default, and that the reason is NodeLost. You will also see a message stating that the node which was running our pod is unresponsive.
Now, let’s start our "failed" node and wait until it successfully rejoins the cluster.
All right, once the "failed" node is up and running, let’s go ahead and check the status of our deployment:

kubectl get deployments -n default

We will see that our pod count is in sync. So, let’s list our pods out:

kubectl get pods -n default

We will see that our Kubernetes cluster has finally terminated the old pod, and we are left with our desired count of 3 pods.

Conclusion

As we have seen, Kubernetes takes a self-healing approach to infrastructure that reduces the criticality of failures, making fire drills less common. When the actual state drifts from the declared state, Kubernetes detects the discrepancy and works to bring the cluster back in line. For example, if a pod goes down, a new one is deployed to match the desired state. Alternatively, you can also use a cloud container hosting platform.
