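1. Requirements
There are currently three K8S clusters, each with its own internal monitoring and alerting. However, if an entire cluster goes down, its internal alerting goes down with it, so no alert is ever received.
To solve this, an external liveness probe script is needed to check whether each cluster is still alive. The probe tests the SSH port on any two hosts in each cluster.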
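The check described above comes down to opening a TCP connection to the SSH port with a timeout. Here is a minimal sketch of that single check, assuming bash and the coreutils `timeout` command are available. `probe_port` is a hypothetical helper name, and the sketch uses bash's `/dev/tcp` redirection where probe.sh uses `nc`.

```shell
#!/bin/bash
# probe_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: returns 0 if a TCP connection to HOST:PORT can be
# opened within the timeout (default 5 seconds), non-zero otherwise.
probe_port() {
  local host=$1 port=$2 limit=${3:-5}
  # A child bash opens the connection on fd 3 via the /dev/tcp pseudo-device
  # and exits immediately, which closes the connection again.
  timeout "${limit}s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `probe_port 192.168.56.101 22 && echo alive` prints `alive` only when the port accepts a connection.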
2. Implementing the probe script
1) The probe script, probe.sh:
```bash
#!/bin/bash
# Feishu webhook, display name, and error flag for each cluster.
dev_feishu_url="https://open.feishu.cn/open-apis/bot/v2/hook/aaa"
dev_idc="Development data center"
dev_error=false

test_feishu_url="https://open.feishu.cn/open-apis/bot/v2/hook/bbb"
test_idc="Test data center"
test_error=false

prod_feishu_url="https://open.feishu.cn/open-apis/bot/v2/hook/ccc"
prod_idc="Production data center"
prod_error=false

# Load root's shell settings (cron jobs start with a minimal environment).
source /root/.bashrc

now=$(date +%Y%m%d-%H:%M:%S)

# Each line of host.txt has the form "<idc> <ip> <port>".
while read -r idc ip port; do
  # Skip blank lines.
  [[ -z "$idc" ]] && continue

  # Probe the port, giving up after 5 seconds. The exit status is checked
  # instead of the output text, because the success message differs between
  # nc implementations.
  value=$(timeout 5s nc -zv "$ip" "$port" 2>&1 </dev/null)
  status=$?
  echo "$value"
  [[ $status -eq 0 ]] && continue

  # Any unreachable host marks its cluster as failed.
  case "$idc" in
    dev)  dev_error=true ;;
    test) test_error=true ;;
    prod) prod_error=true ;;
  esac
done < /path/to/host.txt

# Send one Feishu text message for each failed cluster.
if $dev_error; then
  echo "dev_error"
  curl -X POST "${dev_feishu_url}" \
    -H "Content-Type: application/json" \
    -d "{\"msg_type\":\"text\",\"content\":{\"text\":\"${dev_idc} \n${now} \ndev cluster error!\"}}"
else
  echo "dev_working"
fi

if $test_error; then
  echo "test_error"
  curl -X POST "${test_feishu_url}" \
    -H "Content-Type: application/json" \
    -d "{\"msg_type\":\"text\",\"content\":{\"text\":\"${test_idc} \n${now} \ntest cluster error!\"}}"
else
  echo "test_working"
fi

if $prod_error; then
  echo "prod_error"
  curl -X POST "${prod_feishu_url}" \
    -H "Content-Type: application/json" \
    -d "{\"msg_type\":\"text\",\"content\":{\"text\":\"${prod_idc} \n${now} \nprod cluster error!\"}}"
else
  echo "prod_working"
fi
```
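The three notification blocks in probe.sh differ only in the webhook URL, the data center name, and the cluster name, so they could be folded into one helper. The following is a sketch only: `feishu_payload` and `send_alert` are hypothetical names that do not appear in the original script, and the payload simply mirrors the text message format used there.

```shell
#!/bin/bash
# feishu_payload IDC_NAME TIMESTAMP CLUSTER
# Hypothetical helper: prints the JSON body of a Feishu text message in the
# same format probe.sh sends. The \\n in the format string is emitted as a
# literal backslash-n, which JSON reads as a line break.
feishu_payload() {
  local idc=$1 now=$2 cluster=$3
  printf '{"msg_type":"text","content":{"text":"%s \\n%s \\n%s cluster error!"}}' \
    "$idc" "$now" "$cluster"
}

# send_alert WEBHOOK_URL IDC_NAME TIMESTAMP CLUSTER
# Hypothetical helper: posts the payload to the given webhook with curl.
send_alert() {
  local url=$1
  shift
  curl -X POST "$url" \
    -H "Content-Type: application/json" \
    -d "$(feishu_payload "$@")"
}
```

With these helpers, each block would shrink to a single call such as `send_alert "${dev_feishu_url}" "${dev_idc}" "${now}" dev`.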
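2) Prepare a file, host.txt, that lists each data center and its hosts, one host per line in the form "data center, IP, port":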
```
dev 192.168.56.101 22
dev 192.168.56.102 22
test 172.16.0.101 222
test 172.16.0.102 222
prod 172.16.10.101 2222
prod 172.16.10.102 2222
```
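Because probe.sh splits each line of host.txt into exactly three whitespace-separated fields, a malformed line would be probed with the wrong address or port without any warning. Here is a small sanity check, sketched under the assumption that `awk` is available; `validate_hosts` is a hypothetical helper name.

```shell
#!/bin/bash
# validate_hosts FILE
# Hypothetical helper: reports every non-blank line that does not consist of
# exactly three fields with a numeric port in the range 1-65535.
# Returns 0 if all lines are well formed, 1 otherwise.
validate_hosts() {
  awk '
    NF == 0 { next }    # ignore blank lines
    NF != 3 || $3 !~ /^[0-9]+$/ || $3 < 1 || $3 > 65535 {
      printf "line %d is malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

Running `validate_hosts /path/to/host.txt` before the probe loop would catch mistakes such as two entries joined on one line.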
3. Scheduling the probe
Add a crontab entry that runs the probe every 5 minutes:
```
*/5 * * * * /path/to/probe.sh
```
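If the entry is installed from a setup script rather than by hand, it helps to add it only when it is not already present. This is a sketch under the assumption that the local `crontab` command accepts a new table on standard input (`crontab -`), as common cron implementations do. `with_probe_entry` is a hypothetical helper name.

```shell
#!/bin/bash
# with_probe_entry ENTRY
# Hypothetical helper: reads an existing crontab on stdin and prints it with
# ENTRY appended, unless an identical line is already there.
with_probe_entry() {
  local entry=$1 current
  current=$(cat)
  if [ -n "$current" ]; then
    printf '%s\n' "$current"
  fi
  # -F: fixed string, -x: match the whole line, -q: no output
  if ! printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
    printf '%s\n' "$entry"
  fi
}

# Intended usage (not executed here):
#   crontab -l 2>/dev/null | with_probe_entry '*/5 * * * * /path/to/probe.sh' | crontab -
```

Running the pipeline twice leaves a single copy of the entry in the crontab.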