Node Troubleshooting Guide

Check node status

    kubectl describe nodes $nodename
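Before describing a single node, it can help to see which nodes are unhealthy at all. The following is a minimal sketch, not part of the original notes: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (node name first, status second).

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS column does not start with "Ready".
# Assumes the default columns: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints can then be passed to `kubectl describe nodes` as shown above. Nodes reported as `Ready,SchedulingDisabled` are treated as ready here, since cordoning is not a failure by itself.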
Check NTP

    systemctl status chronyd
    systemctl restart chronyd
    journalctl -u chronyd
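Beyond the service status, the actual clock offset is often what matters. The sketch below assumes `chronyc` is installed alongside chronyd and that `chronyc tracking` prints a `System time` line of the form `System time     : 0.000012345 seconds slow of NTP time`. `clock_offset` is a hypothetical helper name.

```shell
# Hypothetical helper: read `chronyc tracking` output on stdin and print the
# absolute offset (in seconds) from its "System time" line.
clock_offset() {
    awk -F': *' '/^System time/ { split($2, parts, " "); print parts[1] }'
}

# Usage (on the node):
#   chronyc tracking | clock_offset
```

A large offset here suggests the node's clock has drifted, which is worth ruling out before looking at kubelet or the runtime.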
Restart kubelet and docker

    systemctl stop kubelet
    systemctl stop docker
    systemctl stop docker.socket
    systemctl stop containerd
    systemctl daemon-reload
    systemctl start containerd
    systemctl start docker
    systemctl start kubelet
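The same stop/start order can be wrapped in a small function so it stops at the first failure and can be previewed first. This is only a sketch: `run`, `restart_runtime`, and the `DRY_RUN` variable are hypothetical names introduced for illustration, not part of any tool.

```shell
# Hypothetical wrapper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop kubelet and the runtime, reload unit files, then start in reverse order.
# Returns non-zero as soon as any step fails.
restart_runtime() {
    for svc in kubelet docker docker.socket containerd; do
        run systemctl stop "$svc" || return 1
    done
    run systemctl daemon-reload || return 1
    for svc in containerd docker kubelet; do
        run systemctl start "$svc" || return 1
    done
}

# Usage:
#   DRY_RUN=1 restart_runtime   # preview the eight commands
#   restart_runtime             # actually run them (as root)
```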
Docker fails to start
If Docker hangs during startup, check the logs:

    systemctl status docker -l
    journalctl -ru docker
Error:

    Error (Unable to complete atomic operation, key modified) deleting object [endpoint 622bf1a499580702606742e5f5554ac99e7c0d61abcd5d9063881fc2da33d16f afdce62ce70de2cbe5a971b05521280940947e4968c163e48c3e5252919a4fae], retrying....
Fix: find the stuck docker process, kill it (replace xxx with its PID), then stop containerd and start docker again.

    ps -ef | grep docker
    kill -9 xxx
    systemctl stop containerd
    systemctl start docker
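Picking the PID out of `ps -ef | grep docker` by eye is error-prone, because the `grep` process itself also shows up in the list. The sketch below assumes the standard `ps -ef` layout (PID in column 2, command in column 8) and that the daemon binary is named `dockerd`. `dockerd_pids` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PID of
# every process whose command (column 8) is dockerd, with or without a path.
# The grep process itself never matches, since its command is "grep".
dockerd_pids() {
    awk '$8 ~ /(^|\/)dockerd$/ { print $2 }'
}

# Usage:
#   ps -ef | dockerd_pids
```

The printed PID is what goes in place of `xxx` in the `kill -9` step.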
Downgrade docker

    docker version
    yum list docker-ce --showduplicates | sort -r
    systemctl stop kubelet
    systemctl stop docker
    systemctl stop docker.socket
    systemctl stop containerd
    version=19.03.15
    yum downgrade --setopt=obsoletes=0 -y docker-ce-${version} docker-ce-cli-${version} docker-ce-selinux-${version} containerd.io
    systemctl start containerd
    systemctl start docker
    systemctl start kubelet
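Before stopping services, it may be worth confirming that the installed version really is newer than the target. The following sketch compares two version strings with `sort -V` (GNU coreutils). `version_gt` is a hypothetical helper name, and `19.03.15` is the target version from the steps above.

```shell
# Hypothetical helper: succeed if version $1 is strictly newer than version $2.
# Relies on GNU sort's -V (version sort) option.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

# Usage (on the node):
#   current="$(docker version --format '{{.Server.Version}}')"
#   if version_gt "$current" 19.03.15; then echo "downgrade needed"; fi
```

If Docker is not running, the server version cannot be queried, so the usage above only applies while the daemon still responds.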
PLEG is not healthy
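The Pod Lifecycle Event Generator (PLEG) records events in the Pod lifecycle, such as container start and termination. The "PLEG is not healthy" error is usually caused by an abnormal container runtime process on the node or by a defect in the node's systemd version.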
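To see how often the error occurs, the kubelet logs can be searched for the message. This is a minimal sketch that assumes kubelet runs as a systemd unit named `kubelet` and logs to the journal. `count_pleg_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: read kubelet log lines on stdin and print how many
# contain "PLEG is not healthy". Prints 0 (and still succeeds) on no match.
count_pleg_errors() {
    grep -c 'PLEG is not healthy' || true
}

# Usage (on the node):
#   journalctl -u kubelet --since "1 hour ago" | count_pleg_errors
```

A count that keeps growing points to the runtime being persistently unresponsive rather than a one-off stall, in which case the runtime checks and the restart sequence earlier in this post are the next things to try.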