
Problems Caused by kubeadm Unexpectedly Configuring a Proxy

1. Problem Description

In the earlier post "Deploying a Highly Available K8S Cluster with ansible + kubeadm", we finished installing a highly available K8S cluster. On the surface everything looked normal, and running test pods worked fine.

But while following "Installing Milvus on K8S", a problem appeared.

helm install milvus-operator \
-n milvus-operator --create-namespace \
milvus-operator-0.6.5.tgz
kubectl get -n milvus-operator deploy/milvus-operator
kubectl -n milvus-operator logs job/milvus-operator-checker
kubectl describe pod/milvus-operator-5dbf664f8b-24hc9 -n milvus-operator

The error:

Warning  FailedMount  82s (x13 over 11m)   kubelet            MountVolume.SetUp failed for volume "cert" : secret "milvus-operator-webhook-cert" not found
Warning FailedMount 27s (x5 over 9m36s) kubelet Unable to attach or mount volumes: unmounted volumes=[cert], unattached volumes=[cert kube-api-access-xnkbb]: timed out waiting for the condition

From the error, this looks like a cert-manager problem.

2. Troubleshooting cert-manager

Following the cert-manager docs (Verifying the Installation), check whether cert-manager-webhook works properly:

kubectl get pods --namespace cert-manager

cat <<EOF > test-resources.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: test-selfsigned
  namespace: cert-manager-test
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: selfsigned-cert
  namespace: cert-manager-test
spec:
  dnsNames:
    - example.com
  secretName: selfsigned-cert-tls
  issuerRef:
    name: test-selfsigned
EOF

kubectl apply -f test-resources.yaml
kubectl describe certificate -n cert-manager-test
kubectl delete -f test-resources.yaml

The error:

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": proxyconnect tcp: read tcp 192.168.56.102:33020->192.168.56.1:7890: read: connection reset by peer
Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": proxyconnect tcp: read tcp 192.168.56.102:33022->192.168.56.1:7890: read: connection reset by peer

Seeing this error, it suddenly dawned on me: this has to be a proxy problem!

In "Deploying a Highly Available K8S Cluster with ansible + kubeadm", we had configured both an internet proxy and a Docker proxy:

export http_proxy=http://192.168.56.1:7890
export https_proxy=http://192.168.56.1:7890
export no_proxy=127.0.0.1,localhost,192.168.56.0/24,10.96.0.0/12,10.244.0.0/16,172.31.0.0/16
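The Docker proxy in that setup typically lives in a systemd drop-in for the Docker service; a sketch of what it might look like (the drop-in path is an assumption, and the values mirror the variables above):

```ini
# /etc/systemd/system/docker.service.d/http-proxy.conf  (assumed path)
[Service]
Environment="HTTP_PROXY=http://192.168.56.1:7890"
Environment="HTTPS_PROXY=http://192.168.56.1:7890"
Environment="NO_PROXY=127.0.0.1,localhost,192.168.56.0/24,10.96.0.0/12,10.244.0.0/16,172.31.0.0/16"
```

After changing or deleting this file, run `systemctl daemon-reload && systemctl restart docker` for it to take effect.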

3. Removing the Proxy

Since the proxy is the cause, let's remove the proxy settings and try again.

1. Removed the internet proxy: problem persists
2. Removed the Docker proxy: problem persists
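Concretely, removing the internet proxy means clearing the variables from the current shell and from whichever profile file exports them (e.g. ~/.bashrc; an assumption about this setup). A minimal sketch:

```shell
# Simulate the proxy being configured, as in the original setup
export http_proxy=http://192.168.56.1:7890
export https_proxy=http://192.168.56.1:7890

# Remove all three variables from the current shell session
unset http_proxy https_proxy no_proxy

# Verify: both now expand to empty
echo "http_proxy=${http_proxy:-<unset>} https_proxy=${https_proxy:-<unset>}"
```

Note that any shell started later will re-import the variables if the `export` lines are still in a profile file, so delete them there as well.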

Huh, what? All the proxies are removed, yet the problem remains. Is a proxy configured somewhere else?

4. kube-proxy

Services in a K8S cluster are implemented mainly by the kube-proxy component.
Checking the kube-proxy configuration, its environment variables did indeed contain proxy settings, presumably inherited from the master host when kubeadm installed the cluster.

kubectl get daemonset.apps/kube-proxy -n kube-system -oyaml
spec:
  containers:
  - command:
    - /usr/local/bin/kube-proxy
    - --config=/var/lib/kube-proxy/config.conf
    - --hostname-override=$(NODE_NAME)
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: http_proxy
      value: http://192.168.56.1:7890
    - name: https_proxy
      value: http://192.168.56.1:7890
    - name: no_proxy
      value: 127.0.0.1,localhost,192.168.56.0/24,10.96.0.0/12,10.244.0.0/16,172.31.0.0/16
kubectl edit daemonset.apps/kube-proxy -n kube-system
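The same edit can also be applied non-interactively with `kubectl -n kube-system patch daemonset kube-proxy --type=json -p "$(cat remove-proxy-env.json)"`, where `remove-proxy-env.json` is a hypothetical file holding a JSON patch like the sketch below. The indices 1-3 assume the env order shown in the YAML above; entries are removed from the highest index down so the remaining indices stay valid:

```json
[
  { "op": "remove", "path": "/spec/template/spec/containers/0/env/3" },
  { "op": "remove", "path": "/spec/template/spec/containers/0/env/2" },
  { "op": "remove", "path": "/spec/template/spec/containers/0/env/1" }
]
```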

After removing the proxy environment variables from kube-proxy, the problem still wasn't solved...

5. kubeadm proxy

If kubeadm put proxy environment variables into kube-proxy when installing the cluster, could it have put them into other components as well?
Sure enough, it did! Besides kube-proxy, they also show up in kube-apiserver, kube-controller-manager, and kube-scheduler.

How to remove the proxy settings:
in the /etc/kubernetes/manifests/ directory, edit each component's yaml file and delete the proxy environment variables.
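To confirm nothing was missed, a small check can scan a manifests directory for leftover proxy variables (a sketch; `check_manifests` is a helper invented here). On each control-plane node, point it at /etc/kubernetes/manifests:

```shell
# Report whether any file under the given directory still mentions a proxy variable
check_manifests() {
  if grep -rl -i -E 'http_proxy|https_proxy|no_proxy' "$1" 2>/dev/null; then
    echo "proxy settings still present"
  else
    echo "manifests are clean"
  fi
}

# On a control-plane node:
check_manifests /etc/kubernetes/manifests
```

Since these are static pods, kubelet re-creates them automatically shortly after the manifest files change; no manual restart is needed.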

The problem is finally solved, and Milvus installs successfully. I could almost cry with relief...


6. A no_proxy Question

Although the proxy problem is solved, one question remains:
we clearly added 192.168.56.0/24 to no_proxy, so why did the traffic still go through the proxy?
This is because Docker cannot handle CIDR notation in no_proxy, and neither can the K8S core components; see "Configuring a Network Proxy on Linux" for details.
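The behavior can be illustrated with a sketch of how such clients typically match no_proxy: each entry is compared as a plain string or domain suffix, with no CIDR parsing, so 192.168.56.1 never matches the entry 192.168.56.0/24:

```shell
# no_proxy matching as plain string/suffix comparison (no CIDR awareness),
# which is how the affected clients behave per the explanation above.
check() {
  match=no
  IFS=,
  for entry in $2; do
    case $1 in
      "$entry"|*."$entry") match=yes ;;
    esac
  done
  echo "$match"
}

# The CIDR entry never matches the proxy-bound address 192.168.56.1...
echo "CIDR entry matches:    $(check 192.168.56.1 '127.0.0.1,localhost,192.168.56.0/24')"
# ...but listing the literal IP does
echo "literal entry matches: $(check 192.168.56.1 '127.0.0.1,localhost,192.168.56.1')"
```

So to bypass the proxy for hosts on that subnet, the literal addresses have to be listed (or a client that actually understands CIDR has to be used).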

7. Uninstalling the Cluster

Actually, there is a simpler, more brute-force fix: tear down the whole K8S cluster and reinstall it. Before reinstalling, remember to remove the network proxy from the environment variables.

ansible all -i hosts -m command -a "sudo kubeadm reset -f"

ansible all -i hosts -m command -a "sudo rm -rf /var/lib/cni/"
ansible all -i hosts -m command -a "sudo rm -rf /etc/cni/"
ansible all -i hosts -m command -a "sudo ifconfig cni0 down"
ansible all -i hosts -m command -a "sudo ip link delete cni0"

8. Other Problems Caused by the Proxy

8.1. Unable to Log In to KubeSphere

Error on login:
Internal error occurred: failed calling webhook "users.iam.kubesphere.io": Post "https://ks-controller-manager.kubesphere-system.svc:443/validate-email-iam-kubesphsere-io-v1alpha2?timeout=30s": proxyconnect tcp: read tcp 192.168.56.102:48188->192.168.56.1:7890: read: connection reset by peer

It can be resolved with the same method described in this article.