Kubernetes workshop: Troubleshooting
Events
The first source of information when something goes wrong is the event stream. Note that you may want to sort them by creation time
kubectl get events -n foo --sort-by=.metadata.creationTimestamp
...
20m   Normal   Created             pod/web-85575f4476-5pbqv   Created container nginx
20m   Normal   Started             pod/web-85575f4476-5pbqv   Started container nginx
20m   Normal   SuccessfulDelete    replicaset/web-987f6cf9    Deleted pod: web-987f6cf9-mzsxd
20m   Normal   ScalingReplicaSet   deployment/web             Scaled down replica set web-987f6cf9 to 0
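To focus on problems only, the same listing can be narrowed down to warnings with a field selector (a small sketch, reusing the foo namespace from above):
kubectl get events -n foo --field-selector type=Warning --sort-by=.metadata.creationTimestamp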
Logs
Having a look at a pod's logs is just a matter of running
kubectl logs -f --tail=7 -c mysql wordpress-mysql-6c597b98bd-4mbbd
2021-06-24 08:27:38 1 [Note] - '::' resolves to '::';
2021-06-24 08:27:38 1 [Note] Server socket created on IP: '::'.
2021-06-24 08:27:38 1 [Warning] Insecure configuration for --pid-file: Location '/var/run/mysqld' in the path is accessible to all OS users. Consider choosing a different directory.
2021-06-24 08:27:38 1 [Warning] 'proxies_priv' entry '@ root@wordpress-mysql-6c597b98bd-4mbbd' ignored in --skip-name-resolve mode.
2021-06-24 08:27:38 1 [Note] Event Scheduler: Loaded 0 events
2021-06-24 08:27:38 1 [Note] mysqld: ready for connections.
Version: '5.6.51'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)
Alternatively, you can use a tool made to display logs from multiple pods: stern. A better way to explore logs is to ship them to a central location using a tool such as Loki or the well-known EFK stack.
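As a quick sketch of what stern brings, the following tails the last lines of every pod whose name matches a query (the pod query and namespace here are illustrative):
stern -n foo web --tail 5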
Health checks
Kubernetes' self-healing system is mostly based on health checks. There are different types of health checks (please have a look at the official documentation).
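For reference, a container can declare liveness, readiness, and startup probes; the snippet below is a minimal sketch of the three side by side (the values are illustrative, not recommendations):
containers:
- name: nginx
  image: nginx
  livenessProbe:          # restart the container when this fails
    httpGet:
      path: /
      port: 80
  readinessProbe:         # remove the pod from Service endpoints when this fails
    httpGet:
      path: /
      port: 80
  startupProbe:           # hold off the other probes until the app has started
    httpGet:
      path: /
      port: 80
    failureThreshold: 30
    periodSeconds: 10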
We'll add a new kubectl plugin, neat, which is really useful for exporting a resource while stripping its useless metadata:
kubectl krew install neat
Updated the local copy of plugin index.
Installing plugin: neat
Installed plugin: neat
...
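neat can also clean up the YAML of any live resource; for instance, reusing the MySQL pod from the logs section above (assuming it still exists in your cluster):
kubectl get pod wordpress-mysql-6c597b98bd-4mbbd -o yaml | kubectl neat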
Let's create a new deployment using the image nginx
kubectl create deploy web --image=nginx --dry-run=client -o yaml | kubectl neat > /tmp/web.yaml
Edit its content and add an HTTP health check on port 80. The endpoint must return a status code greater than or equal to 200 and lower than 400, and it has to be a relevant test that reflects the actual availability of the service.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: web
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 3
          periodSeconds: 3
kubectl apply -f /tmp/web.yaml
deployment.apps/web created

kubectl describe deploy web | grep Liveness:
  Liveness:     http-get http://:80/ delay=3s timeout=1s period=3s #success=1 #failure=3
The pod should be up without any error
kubectl get po -l app=web
NAME                   READY   STATUS    RESTARTS   AGE
web-85575f4476-6qvd5   1/1     Running   0          92s
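You can also hit the probed endpoint yourself to confirm it answers with a successful code; a quick sketch using a local port-forward (the local port 8080 is arbitrary, and curl is assumed to be available on your workstation):
kubectl port-forward deploy/web 8080:80 &
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/   # should print 200
kill %1   # stop the port-forward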
We're going to simulate an unavailable service by changing the path being checked. Here we'll use another method to modify a resource: creating a patch and applying it.
Create a YAML file /tmp/patch.yaml
cat > /tmp/patch.yaml <<EOF
spec:
  template:
    spec:
      containers:
      - name: nginx
        livenessProbe:
          httpGet:
            path: /foobar
EOF
And we're going to apply our change as follows
kubectl patch deployment web --patch "$(cat /tmp/patch.yaml)" --record
deployment.apps/web patched

kubectl describe deployment web | grep Liveness:
  Liveness:     http-get http://:80/foobar delay=3s timeout=1s period=3s #success=1 #failure=3
Now our pod should start to fail, and its restart count increases
kubectl get po -l app=web
web-987f6cf9-n4rnb   1/1     Running   4          83s
This goes on until the pod enters a CrashLoopBackOff state, meaning that it constantly restarts.
kubectl get po -l app=web
NAME                 READY   STATUS             RESTARTS   AGE
web-987f6cf9-n4rnb   0/1     CrashLoopBackOff   5          3m23s
Describing the pod will give you a hint about why it restarts
kubectl describe po web-987f6cf9-n4rnb | tail -n 5
Normal   Created    4m7s (x3 over 4m30s)   kubelet  Created container nginx
Normal   Started    4m7s (x3 over 4m30s)   kubelet  Started container nginx
Warning  Unhealthy  3m56s (x9 over 4m26s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 404
Normal   Killing    3m56s (x3 over 4m20s)  kubelet  Container nginx failed liveness probe, will be restarted
Normal   Pulling    3m56s (x4 over 4m35s)  kubelet  Pulling image "nginx"
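When a container keeps restarting, the logs of the previous instance are often more telling than the current one; they can be retrieved with the --previous flag:
kubectl logs web-987f6cf9-n4rnb --previous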
Roll back the latest change in order to return to a working state.
Note that we used the --record option when we applied the patch. That keeps a history of the changes.
kubectl rollout history deployment web
deployment.apps/web
REVISION  CHANGE-CAUSE
1         <none>
2         kubectl patch deployment web --patch=spec:
  template:
    spec:
      containers:
      - name: nginx
        livenessProbe:
          httpGet:
            path: /foobar --record=true

kubectl rollout undo deployment web
deployment.apps/web rolled back
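After the rollback, the probe should point back at / (same check as earlier):
kubectl describe deployment web | grep Liveness:
  Liveness:     http-get http://:80/ delay=3s timeout=1s period=3s #success=1 #failure=3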
Cleanup
kubectl delete deploy web
deployment.apps "web" deleted
learnk8s documentation
There is a great guide that covers all the steps that help debugging a deployment: https://learnk8s.io/troubleshooting-deployments
➡️ Next: RBAC