Learning K8S: Common K8S Troubleshooting Commands

1. Introduction

This post records common commands for K8S troubleshooting, as a quick reference.

2. Viewing Resource Details

2.1. Listing All Resources in a Namespace

kubectl api-resources --verbs=list --namespaced -o name | \
xargs -n 1 kubectl get --show-kind --ignore-not-found -n <your-namespace>

2.2. Viewing Pod Details

kubectl get pods $pod_name -n $namespace -oyaml
kubectl describe pods $pod_name -n $namespace
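
The pod's IP and the node it was scheduled to are also visible at a glance with wide output:

```shell
# -o wide adds the pod IP and node to the listing
kubectl get pods $pod_name -n $namespace -o wide
```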

2.3. Viewing Node Details

kubectl describe node $node_name

2.4. Viewing Deployment Details

kubectl get deployment $controller_name -n $namespace -oyaml
kubectl describe deployment $controller_name -n $namespace

2.5. Viewing Service Details

kubectl get service $service_name -n $namespace
kubectl get service $service_name -n $namespace -oyaml

View the endpoints resource to see which pods and ports the service has selected:

kubectl get endpoints $service_name -n $namespace

3. Resource Usage Statistics

3.1. Viewing Node Resource Usage

kubectl top nodes
kubectl top nodes --sort-by='cpu'
kubectl top nodes --sort-by='memory'

3.2. Viewing Pod Resource Usage

kubectl top pods -A
kubectl top pods -A --sort-by='cpu'
kubectl top pods -A --sort-by='memory'

3.3. Viewing Pod Resource Requests

kubectl get po -A \
-o custom-columns="name:metadata.name,namespace:metadata.namespace,requests-cpu:spec.containers[*].resources.requests.cpu,requests-memory:spec.containers[*].resources.requests.memory"

kubectl get po -A --field-selector status.phase==Running \
-o custom-columns="name:metadata.name,namespace:metadata.namespace,requests-cpu:spec.containers[*].resources.requests.cpu,requests-memory:spec.containers[*].resources.requests.memory"

kubectl get po -A --field-selector status.phase==Running,spec.nodeName=worker0 \
-o custom-columns="name:metadata.name,namespace:metadata.namespace,requests-cpu:spec.containers[*].resources.requests.cpu,requests-memory:spec.containers[*].resources.requests.memory"

kubectl get po -A \
-o=jsonpath="{range .items[*]}{.metadata.namespace}:{.metadata.name}{'\n'}{range .spec.containers[*]} {.name}:{.resources.requests.cpu}{'\n'}{end}{'\n'}{end}"

kubectl get po -A \
-o=jsonpath="{range .items[*]}{.metadata.namespace}:{.metadata.name}{'\n'}{range .spec.containers[*]} {.name}:{.resources.requests.memory}{'\n'}{end}{'\n'}{end}"

3.4. Viewing Pod Resource Limits

kubectl get po -A \
-o custom-columns="name:metadata.name,namespace:metadata.namespace,limits-cpu:spec.containers[*].resources.limits.cpu,limits-memory:spec.containers[*].resources.limits.memory"

kubectl get po -A --field-selector status.phase==Running \
-o custom-columns="name:metadata.name,namespace:metadata.namespace,limits-cpu:spec.containers[*].resources.limits.cpu,limits-memory:spec.containers[*].resources.limits.memory"

kubectl get po -A --field-selector status.phase==Running,spec.nodeName=worker0 \
-o custom-columns="name:metadata.name,namespace:metadata.namespace,limits-cpu:spec.containers[*].resources.limits.cpu,limits-memory:spec.containers[*].resources.limits.memory"

kubectl get po -A \
-o=jsonpath="{range .items[*]}{.metadata.namespace}:{.metadata.name}{'\n'}{range .spec.containers[*]} {.name}:{.resources.limits.cpu}{'\n'}{end}{'\n'}{end}"

kubectl get po -A \
-o=jsonpath="{range .items[*]}{.metadata.namespace}:{.metadata.name}{'\n'}{range .spec.containers[*]} {.name}:{.resources.limits.memory}{'\n'}{end}{'\n'}{end}"
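
For a per-node summary, kubectl describe node already aggregates the requests and limits of everything scheduled on the node in its "Allocated resources" section:

```shell
# show the aggregated requests/limits table for one node
kubectl describe node $node_name | grep -A 10 'Allocated resources'
```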

4. Pod Operations

4.1. Viewing Container Logs

View a container's stdout logs:

kubectl logs $pod_name -n $namespace 
kubectl logs $pod_name -c $container_name -n $namespace
kubectl logs --tail=100 $pod_name -c $container_name -n $namespace
kubectl logs -f $pod_name -c $container_name -n $namespace
kubectl logs --tail=100 -l app=test -n $namespace
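
A few more standard kubectl logs flags that often help when narrowing things down:

```shell
kubectl logs $pod_name -n $namespace --since=1h        # only the last hour
kubectl logs $pod_name -n $namespace --timestamps      # prefix each line with a timestamp
kubectl logs $pod_name -n $namespace --all-containers  # all containers in the pod
```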

View the stdout logs of a crashed container's previous instance:

kubectl logs --previous $pod_name -c $container_name -n $namespace 

View stdout logs through the deployment (kubectl follows one of the deployment's pods):

kubectl logs -f deployment/$deployment_name -n $namespace

4.2. Viewing Log Files Inside a Container

kubectl exec $pod_name -n $namespace -- cat /var/log/cassandra/system.log
kubectl exec $pod_name -c $container_name -n $namespace -- cat /var/log/cassandra/system.log

4.3. Executing Commands

1. Log into a container

kubectl exec -it pod-name -- /bin/bash
kubectl exec -it pod-name -c container-name -- /bin/bash
kubectl exec -it pod-name -c container-name -- sh

2. Run a command directly

kubectl exec pod-name -- env
kubectl exec pod-name -it -- env
kubectl exec -n default pod-name -it -- env
# the double dash is required when the command takes arguments
# (newer kubectl versions require it for any command)
kubectl exec pod-name -- sh -c 'echo ${LANG}'

4.4. Copying Files

Copy a file from a pod to the host:

kubectl cp $podname:/tmp/$filename .

Copy a file from the host into a pod:

kubectl cp $filename $podname:/tmp/

This requires the tar command inside the container; otherwise the command fails with:

OCI runtime exec failed: exec failed: unable to start container process: exec: "tar": executable file not found in $PATH: unknown
command terminated with exit code 126
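
If the image ships without tar, kubectl cp cannot work at all; a workaround sketch is to stream the file through kubectl exec instead (assuming cat and base64 exist in the container; base64 keeps binary files intact across the pipe):

```shell
# pod -> host: stream the file out through stdout
kubectl exec $pod_name -n $namespace -- base64 /tmp/$filename | base64 -d > $filename

# host -> pod: feed the file in through stdin
base64 $filename | kubectl exec -i $pod_name -n $namespace -- sh -c "base64 -d > /tmp/$filename"
```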

5. Verification Tests

5.1. Test Pods

To verify that DNS resolution and service access work correctly, the best approach is to start a pod in the k8s cluster and test from inside it.

#kubectl run test --image=busybox:1.25 --command -- sleep 3600

kubectl run test --image=alpine:3.7.3 --command -- sleep 3600

kubectl run test --image=debian:buster --command -- sleep 3600

Note: avoid busybox as the test image, since its results can be misleading.
For example, nslookup may fail on a recent busybox version yet succeed on an older one; telnet may report Connection closed by foreign host and nc may return nothing, while the same checks succeed from other images.
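
With the test pod running, DNS and connectivity checks can then be run from inside it; a sketch (getent is available on glibc images such as debian, while nc must be installed, e.g. via the custom image in the next section; substitute your own service name and port):

```shell
# resolve a service name through cluster DNS
kubectl exec -it test -- getent hosts kubernetes.default.svc.cluster.local

# check TCP connectivity to a service port
kubectl exec -it test -- nc -zv kubernetes.default.svc.cluster.local 443
```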

5.2. Custom Test Image

1. Prepare a Dockerfile

FROM debian:buster
RUN sed -i 's#http://deb.debian.org#http://mirrors.tuna.tsinghua.edu.cn#g' /etc/apt/sources.list
RUN sed -i 's#http://security.debian.org/debian-security#http://mirrors.tuna.tsinghua.edu.cn/debian-security#g' /etc/apt/sources.list
RUN apt update && apt install -y telnet netcat curl

2. Build and push the image

docker build -t voidking/debian:buster .
docker push voidking/debian:buster

3. Start the test pod

kubectl run test --image=voidking/debian:buster --command -- sleep 3600

6. Viewing IP Ranges

6.1. Service CIDR

How do you find a k8s cluster's service IP range?

kubeadm config view | grep Subnet
kubectl get pods -n kube-system kube-apiserver-master -oyaml | grep service-cluster-ip-range
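
Note that kubeadm config view has been removed in newer kubeadm releases; the same information is stored in the kubeadm-config ConfigMap, so an alternative (assuming a kubeadm-managed cluster):

```shell
# the ClusterConfiguration, including serviceSubnet/podSubnet, lives in this ConfigMap
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -i subnet
```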

6.2. Pod CIDR

How do you find a k8s cluster's pod IP range?

kubeadm config view | grep Subnet
kubectl cluster-info dump | grep -i cidr

If neither method works, the range can also be found in the network plugin's logs; take weave as an example:

docker ps | grep weave
docker logs <weave-container-id> | grep ipalloc-range
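
The per-node pod ranges are also recorded on the Node objects themselves (when the controller manager allocates node CIDRs), so another quick check:

```shell
# print each node name with its allocated pod CIDR
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
```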

7. Cluster Operations

7.1. Viewing Cluster Info

kubectl cluster-info
kubectl cluster-info dump

7.2. Viewing Cluster Status

Note: the componentstatuses API has been deprecated since Kubernetes v1.19, but the commands still work.

kubectl get cs
kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                                                                       ERROR
controller-manager   Unhealthy   Get "http://127.0.0.1:10252/healthz": dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Unhealthy   Get "http://127.0.0.1:10251/healthz": dial tcp 127.0.0.1:10251: connect: connection refused
etcd-0               Healthy     {"health":"true"}

If you see output like this, don't panic: kubeadm-installed clusters all show it, and it can safely be ignored.
If you really want to fix it, edit the command arguments in kube-controller-manager.yaml and kube-scheduler.yaml (the static pod manifests under /etc/kubernetes/manifests) and remove the --port=0 argument, which is what disables the insecure healthz port this check probes. For example, for the scheduler:

spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    # remove this line:
    # - --port=0

7.3. Viewing Cluster Events

kubectl get ev
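
The default listing is unsorted; sorting by timestamp across all namespaces is usually more useful:

```shell
# all namespaces, ordered by when each event was last seen
kubectl get events -A --sort-by='.lastTimestamp'
```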