1. Problem Description

In 《ansible+kubeadm部署K8S高可用集群》 we finished installing a highly available K8S cluster. On the surface everything looked fine, and test pods ran without issue.

But while following 《K8S中安装Milvus》, a problem appeared.
```shell
helm install milvus-operator \
  -n milvus-operator --create-namespace \
  milvus-operator-0.6.5.tgz

kubectl get -n milvus-operator deploy/milvus-operator
kubectl -n milvus-operator logs job/milvus-operator-checker
kubectl describe pod/milvus-operator-5dbf664f8b-24hc9 -n milvus-operator
```
The error:
```
Warning  FailedMount  82s (x13 over 11m)  kubelet  MountVolume.SetUp failed for volume "cert" : secret "milvus-operator-webhook-cert" not found
Warning  FailedMount  27s (x5 over 9m36s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[cert], unattached volumes=[cert kube-api-access-xnkbb]: timed out waiting for the condition
```
Judging from the error, this looks like a cert-manager problem.
2. Troubleshooting cert-manager

Following cert-manager's Verifying the Installation guide, check whether the cert-manager webhook is working:
```shell
kubectl get pods --namespace cert-manager

cat <<EOF > test-resources.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: test-selfsigned
  namespace: cert-manager-test
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: selfsigned-cert
  namespace: cert-manager-test
spec:
  dnsNames:
    - example.com
  secretName: selfsigned-cert-tls
  issuerRef:
    name: test-selfsigned
EOF

kubectl apply -f test-resources.yaml
kubectl describe certificate -n cert-manager-test
kubectl delete -f test-resources.yaml
```
The error:
```
Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": proxyconnect tcp: read tcp 192.168.56.102:33020->192.168.56.1:7890: read: connection reset by peer
Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": proxyconnect tcp: read tcp 192.168.56.102:33022->192.168.56.1:7890: read: connection reset by peer
```
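The decisive clue is the `proxyconnect tcp` fragment: the API server is not failing to reach the webhook service itself, it is failing to reach a proxy on the way there. As a small illustration of reading such errors, this sketch extracts the proxy endpoint from the message (the error text is pasted into a variable here purely for demonstration):

```shell
# Hypothetical: the error message is stored in a shell variable for this demo.
err='failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": proxyconnect tcp: read tcp 192.168.56.102:33020->192.168.56.1:7890: read: connection reset by peer'

# "->ADDR:PORT" names the endpoint the connection was actually attempted to,
# i.e. the proxy, not the webhook.
echo "$err" | grep -o -- '->[0-9.]*:[0-9]*' | tr -d '>-'
# → 192.168.56.1:7890
```

If the address printed is your proxy rather than a cluster service IP, the request is being routed through the proxy when it should not be.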
Seeing this error, it suddenly dawned on me: it must be the proxy!!!
In 《ansible+kubeadm部署K8S高可用集群》 we had configured both a network proxy and a Docker proxy:
```shell
export http_proxy=http://192.168.56.1:7890
export https_proxy=http://192.168.56.1:7890
export no_proxy=127.0.0.1,localhost,192.168.56.0/24,10.96.0.0/12,10.244.0.0/16,172.31.0.0/16
```
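Before removing anything, it helps to see exactly which proxy variables are set in the current shell; a minimal sketch:

```shell
# List any proxy-related variables currently exported; print a note when
# none are set (grep exits non-zero on no match).
env | grep -i '_proxy=' || echo "no proxy variables set"

# Clear them for the current session. Note this does not touch Docker's own
# proxy config, nor anything already baked into the cluster components.
unset http_proxy https_proxy no_proxy
```

As the following sections show, clearing the shell environment alone is not enough once the settings have been copied into cluster components.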
3. Removing the Proxies

Since the proxy seemed to be the culprit, the next step was to remove it everywhere and try again.
1. Removed the network proxy: the problem persisted.
2. Removed the Docker proxy: the problem persisted.
Huh, what? Both proxies are gone, yet the problem remains. Is a proxy configured somewhere else?
4. kube-proxy

Services in a K8S cluster are implemented mainly by the kube-proxy component. Inspecting kube-proxy's configuration showed that its environment variables did indeed contain proxy settings, presumably inherited from the master host's environment when kubeadm installed the cluster.
```shell
kubectl get daemonset.apps/kube-proxy -n kube-system -oyaml
```
```yaml
spec:
  containers:
  - command:
    - /usr/local/bin/kube-proxy
    - --config=/var/lib/kube-proxy/config.conf
    - --hostname-override=$(NODE_NAME)
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: http_proxy
      value: http://192.168.56.1:7890
    - name: https_proxy
      value: http://192.168.56.1:7890
    - name: no_proxy
      value: 127.0.0.1,localhost,192.168.56.0/24,10.96.0.0/12,10.244.0.0/16,172.31.0.0/16
```
```shell
kubectl edit daemonset.apps/kube-proxy -n kube-system
```
After removing the proxy environment variables from kube-proxy, the problem still was not solved...
5. kubeadm proxy

If kubeadm injected the proxy environment variables into kube-proxy during installation, could it have injected them into other components as well? It turns out it did! Besides kube-proxy, they also appear in kube-apiserver, kube-controller-manager, and kube-scheduler.
To remove the proxy configuration, edit each component's YAML file under /etc/kubernetes/manifests/ and delete the proxy environment variables.
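Before editing the files one by one, a quick scan confirms which manifests actually carry proxy settings. A sketch (the directory is parameterized here so the snippet can run anywhere; it defaults to kubeadm's static-pod directory):

```shell
# Scan every static-pod manifest for proxy-related environment variables.
# MANIFEST_DIR is an override hook for illustration/testing purposes.
MANIFEST_DIR="${MANIFEST_DIR:-/etc/kubernetes/manifests}"
grep -li '_proxy' "$MANIFEST_DIR"/*.yaml 2>/dev/null \
  || echo "no proxy settings found under $MANIFEST_DIR"
```

After saving the edited manifests, no manual restart is needed: the kubelet watches this directory and recreates the static pods when the files change.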
The problem was finally solved and Milvus installed successfully. I could have cried...
6. The no_proxy Question

Although the proxy problem is solved, one question remains: we explicitly listed 192.168.56.0/24 in no_proxy, so why did the traffic still go through the proxy? The reason is that Docker cannot interpret CIDR notation in no_proxy, and neither can the K8S core components; see 《Linux配置网络代理》 for details.
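For clients that match no_proxy entries by literal string comparison rather than parsing CIDR ranges, one common workaround is to expand the range into an explicit host list. A sketch for a /24 (the subnet is this article's 192.168.56.0/24; adjust the prefix and range for other networks):

```shell
# Expand 192.168.56.0/24 into individual addresses, since literal-matching
# clients cannot interpret the CIDR form.
no_proxy="127.0.0.1,localhost"
for i in $(seq 1 254); do
  no_proxy="$no_proxy,192.168.56.$i"
done
export no_proxy

# Spot-check: count the expanded entries.
echo "$no_proxy" | tr ',' '\n' | grep -c '^192\.168\.56\.'
# → 254
```

This keeps literal-matching implementations happy at the cost of a very long variable; the cleaner fix remains removing the proxy settings from the components entirely, as done in the previous sections.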
7. Tearing Down the Cluster

There is actually a simpler, more brute-force fix: tear down the entire K8S cluster and reinstall it. Just remember to remove the network proxy from the environment variables before reinstalling.
```shell
ansible all -i hosts -m command -a "sudo kubeadm reset -f"
ansible all -i hosts -m command -a "sudo rm -rf /var/lib/cni/"
ansible all -i hosts -m command -a "sudo rm -rf /etc/cni/"
ansible all -i hosts -m command -a "sudo ifconfig cni0 down"
ansible all -i hosts -m command -a "sudo ip link delete cni0"
```
8. Other Problems Caused by the Proxy

8.1. KubeSphere Login Failure

Logging in fails with:

```
Internal error occurred: failed calling webhook "users.iam.kubesphere.io": Post "https://ks-controller-manager.kubesphere-system.svc:443/validate-email-iam-kubesphsere-io-v1alpha2?timeout=30s": proxyconnect tcp: read tcp 192.168.56.102:48188->192.168.56.1:7890: read: connection reset by peer
```
It can be fixed with the same approach described in this article.