15 - k8s Daily Operations - Single Master¶
Changing a node's hostname¶
If we want to change a node's hostname, we have to remove the node from the cluster and rejoin it afterwards; only then can the hostname be changed.
For example, we will rename node3 to node-192e168e1e29.
First, check the nodes in the cluster:
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 19m v1.20.6
node1 Ready <none> 17m v1.20.6
node2 Ready <none> 16m v1.20.6
node3 Ready <none> 3m33s v1.20.6
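Before draining, it also helps to see which pods are currently running on node3; a quick sketch using the same grep approach used later in this doc:
# List pods in all namespaces that are scheduled on node3
kubectl get pod -A -o wide | grep " node3 "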
Mark the node as unschedulable:
kubectl cordon node3
Evict the pods running on this node:
[root@master1 ~]# kubectl drain node3 --delete-local-data --force --ignore-daemonsets
Flag --delete-local-data has been deprecated, This option is deprecated and will be deleted. Use --delete-emptydir-data.
node/node3 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-xv5cp, kube-system/kube-proxy-b9mt5
evicting pod kube-system/coredns-7f89b7bc75-4tlt6
evicting pod default/nginx-deployment-5d47ff8589-fd68t
evicting pod default/nginx-deployment-5d47ff8589-kkjtv
evicting pod default/nginx-deployment-5d47ff8589-klqfc
evicting pod default/nginx-deployment-5d47ff8589-lwmn9
pod/coredns-7f89b7bc75-4tlt6 evicted
pod/nginx-deployment-5d47ff8589-fd68t evicted
pod/nginx-deployment-5d47ff8589-kkjtv evicted
pod/nginx-deployment-5d47ff8589-klqfc evicted
pod/nginx-deployment-5d47ff8589-lwmn9 evicted
node/node3 evicted
Delete the node:
[root@master1 ~]# kubectl delete nodes node3
node "node3" deleted
Then run the following command on node3:
[root@node3 ~]# kubeadm reset
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
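Note that kubeadm reset does not clean up CNI configuration or iptables rules on its own; a commonly used follow-up cleanup on node3 is sketched below (optional, and only if you understand the impact on this host):
# Remove leftover CNI config and flush iptables rules created by kube-proxy/calico
rm -rf /etc/cni/net.d
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X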
Now change node3's hostname:
hostnamectl set-hostname node-192e168e1e29 && bash
Update the /etc/hosts entries on the master node:
[root@master1 ~]# cat /etc/hosts
192.168.1.26 master1
192.168.1.27 node1
192.168.1.28 node2
192.168.1.29 node-192e168e1e29
Add the node back to the cluster as a worker node. First generate a join command on master1:
[root@master1 ~]# kubeadm token create --print-join-command
kubeadm join 192.168.1.26:6443 --token vrx60x.1sq6s9g752fe1ufr --discovery-token-ca-cert-hash sha256:9495462d474420d5e4ee3b39bb8a258997f7dfb9d76926baa4aaeaba167b436d
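The token printed above is valid for 24 hours by default; to check which tokens currently exist on master1 before reusing one, a quick sketch:
# List bootstrap tokens and their expiry times
kubeadm token list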
Run the command above on the node-192e168e1e29 node.
The following output shows that node-192e168e1e29 has joined the cluster as a worker node:
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
Check the cluster nodes on master1:
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 2d19h v1.20.6
node-192e168e1e29 Ready <none> 18s v1.20.6
node1 Ready <none> 2d19h v1.20.6
node2 Ready <none> 4h33m v1.20.6
If you want to change the name back, just repeat the same procedure.
Changing a node's role name¶
We can set a node's ROLES name (for example, work) by adding a node-role label. First look at the existing labels:
kubectl get nodes --show-labels
Then add the label:
[root@master1 ~]# kubectl label nodes node3 node-role.kubernetes.io/work=
node/node3 labeled
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 2d21h v1.20.6
node1 Ready <none> 2d21h v1.20.6
node2 Ready <none> 6h50m v1.20.6
node3 Ready work 133m v1.20.6
Of course, we can also remove the role name again:
[root@master1 ~]# kubectl label nodes node3 node-role.kubernetes.io/work-
node/node3 labeled
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 2d21h v1.20.6
node1 Ready <none> 2d21h v1.20.6
node2 Ready <none> 6h50m v1.20.6
node3 Ready <none> 134m v1.20.6
Taking a node offline for maintenance¶
https://www.csdn.net/tags/MtTaEg2sMjU5OTY0LWJsb2cO0O0O.html
Scenario: a node in the k8s cluster is healthy but needs to be shut down for planned maintenance.
First, check the current cluster nodes:
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 2d19h v1.20.6
node1 Ready <none> 2d19h v1.20.6
node2 Ready <none> 4h36m v1.20.6
node3 Ready <none> 30s v1.20.6
We will simulate taking node2 offline for planned maintenance. First, mark the node as unschedulable:
kubectl cordon node2
The status of node2 then changes to Ready,SchedulingDisabled:
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 2d19h v1.20.6
node1 Ready <none> 2d19h v1.20.6
node2 Ready,SchedulingDisabled <none> 4h46m v1.20.6
node3 Ready <none> 9m50s v1.20.6
Next, evict the pods from node2:
[root@master1 ~]# kubectl drain node2 --delete-local-data --force --ignore-daemonsets
Flag --delete-local-data has been deprecated, This option is deprecated and will be deleted. Use --delete-emptydir-data.
node/node2 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-xv5cp, kube-system/kube-proxy-b9mt5
evicting pod kube-system/coredns-7f89b7bc75-4tlt6
evicting pod default/nginx-deployment-5d47ff8589-fd68t
evicting pod default/nginx-deployment-5d47ff8589-kkjtv
evicting pod default/nginx-deployment-5d47ff8589-klqfc
evicting pod default/nginx-deployment-5d47ff8589-lwmn9
pod/coredns-7f89b7bc75-4tlt6 evicted
pod/nginx-deployment-5d47ff8589-fd68t evicted
pod/nginx-deployment-5d47ff8589-kkjtv evicted
pod/nginx-deployment-5d47ff8589-klqfc evicted
pod/nginx-deployment-5d47ff8589-lwmn9 evicted
node/node2 evicted
The drain flags are:
--delete-local-data: delete local data, including emptyDir volumes (deprecated in this version; --delete-emptydir-data is the replacement);
--ignore-daemonsets: ignore DaemonSet-managed pods; without this flag the drain refuses to proceed, since DaemonSet pods would be recreated immediately anyway;
--force: without it, drain only evicts pods managed by a ReplicationController, ReplicaSet, DaemonSet, StatefulSet or Job; with it, bare (unmanaged) pods are deleted as well.
Note that, unlike a rolling update, drain terminates a pod first and its controller then recreates it elsewhere, so the **service interruption time = pod recreation time + application startup time + readiness probe pass time**; the service only recovers once the new pod reaches 1/1 Running. For single-replica workloads, an interruption is therefore unavoidable during migration.
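Since --delete-local-data is deprecated in this kubectl version, the equivalent drain with the replacement flag would look like this (sketch):
# Same effect as the command above, using the non-deprecated flag
kubectl drain node2 --delete-emptydir-data --force --ignore-daemonsets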
Then shut down the server. Once it has been repaired and powered back on, make the node schedulable again:
kubectl uncordon node2
The final cluster state looks like this:
[root@master1 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 2d19h v1.20.6
node1 Ready <none> 2d19h v1.20.6
node2 Ready <none> 4h54m v1.20.6
node3 Ready <none> 18m v1.20.6
Evacuating a failed node¶
Scenario: a node in the k8s cluster suddenly loses power and goes down, its operating system is damaged and will not boot. How do we recover the pods that were on that node, or get them rescheduled onto other nodes?
We power off node3 directly to simulate a sudden server failure:
[root@node3 ~]# poweroff
We can immediately see that the pod network on node3 is unreachable:
[root@master1 ~]# kubectl get pod -o wide|grep node3
nginx-deployment-5d47ff8589-288mj 1/1 Running 0 8m48s 10.244.135.28 node3 <none> <none>
nginx-deployment-5d47ff8589-2lnch 1/1 Running 0 8m38s 10.244.135.77 node3 <none> <none>
nginx-deployment-5d47ff8589-2pjf9 1/1 Running 0 8m22s 10.244.135.156 node3 <none> <none>
[root@master1 ~]# ping -c 4 10.244.135.28
PING 10.244.135.28 (10.244.135.28) 56(84) bytes of data.
--- 10.244.135.28 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 2999ms
After a while, the pods on node3 change to Terminating:
[root@master1 ~]# kubectl get pod -o wide|grep node3|head -3
nginx-deployment-5d47ff8589-288mj 1/1 Terminating 0 58m 10.244.135.28 node3 <none> <none>
nginx-deployment-5d47ff8589-2lnch 1/1 Terminating 0 57m 10.244.135.77 node3 <none> <none>
nginx-deployment-5d47ff8589-2pjf9 1/1 Terminating 0 57m 10.244.135.156 node3 <none> <none>
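How long this takes is governed by the controller manager's pod eviction timeout (roughly 5 minutes by default); a simple way to follow the process is to watch the node and its pods (sketch):
# Watch node3 go NotReady and its pods flip to Terminating
kubectl get node node3 -w
kubectl get pod -o wide -w | grep node3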
If we have confirmed that node3 cannot be recovered,
we start cleaning up the stuck pods that were on it:
[root@master1 ~]# cat clean_pod.sh
#!/bin/bash
# Print force-delete commands for pods stuck on the failed node(s), skipping kube-system.
node_list="node3"
for n in ${node_list}
do
    fail_pod_count=$(kubectl get pod -o wide -A | grep " ${n} " | grep -v kube-system | wc -l)
    for m in $(seq 1 "${fail_pod_count}")
    do
        # With -A, column 1 is the namespace and column 2 is the pod name
        fail_pod_name=$(kubectl get pod -o wide -A | grep " ${n} " | grep -v kube-system | awk 'NR=='$m'{print $2}')
        fail_pod_namespace=$(kubectl get pod -o wide -A | grep " ${n} " | grep -v kube-system | awk 'NR=='$m'{print $1}')
        echo "kubectl delete pod $fail_pod_name -n $fail_pod_namespace --force --grace-period=0"
        sleep 0.5
    done
done
Execute the printed commands.
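As a side note, kubectl can also filter pods by node directly with a field selector, which avoids the grep; a minimal sketch assuming the failed node is node3:
# List all pods, across namespaces, still bound to the failed node
kubectl get pods -A -o wide --field-selector spec.nodeName=node3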
[root@master1 ~]# kubectl get pod -o wide|grep node3
Once everything is cleaned up, the command above no longer returns any pods.
Delete the node:
[root@master1 ~]# kubectl delete nodes node3
node "node3" deleted
Whenever you delete Kubernetes resources such as a namespace or a pod, the resource sometimes gets stuck in Terminating and cannot be deleted for a long time, even with the --force flag. In that case, edit the resource and set its finalizers field to null, after which the resource is deleted normally.
When a pod deletion hangs, the pod stays in Terminating and cannot be removed:
(1) Force delete:
kubectl delete pod xxx -n xxx --force --grace-period=0
(2) If force deletion still does not work, set finalizers to null.
(If a resource is already running and you need to modify some of its attributes without deleting it, and replace is inconvenient, Kubernetes also provides a way to modify the running object directly: the patch command.)
kubectl patch pod xxx -n xxx -p '{"metadata":{"finalizers":null}}'
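Before patching, you can check whether a finalizer is really what is blocking deletion (sketch; replace xxx with the actual pod name and namespace):
# Print the finalizers set on the stuck pod, if any
kubectl get pod xxx -n xxx -o jsonpath='{.metadata.finalizers}'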
Node pod limits¶
https://blog.51cto.com/zhangxueliang/2969910
Changing the pod count limit in k8s (by default each node runs at most 110 pods), and troubleshooting the error 0/3 nodes are available: 3 Insufficient cpu.
We currently have 3 worker nodes, so at most 330 pods can be created. Now scale nginx-deployment to 350 replicas and see what happens:
kubectl scale deployment nginx-deployment --replicas 350
Many pods end up in Pending state, which means each node has already reached its 110-pod limit and cannot accept more:
[root@master1 ~]# kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-deployment 322/350 350 322 2d18h
The pods that could not be created stay in Pending:
[root@master1 pod]# kubectl get pod -ALL |grep Pending
default nginx-deployment-5d47ff8589-2jgmn 0/1 Pending 0 3m1s
default nginx-deployment-5d47ff8589-2vctp 0/1 Pending 0 3m2s
default nginx-deployment-5d47ff8589-5nqgl 0/1 Pending 0 3m2s
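To confirm why a given pod is Pending, describe it and look at the Events section; when the per-node limit is hit the scheduler reports a "Too many pods" style message (sketch, using one of the Pending pod names above):
# The Events section shows the scheduling failure reason
kubectl describe pod nginx-deployment-5d47ff8589-2jgmn | grep -A 5 Events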
We can raise the per-node pod limit by passing --max-pods to the kubelet on each worker node (node2 is shown here):
cat >/etc/sysconfig/kubelet<<\EOF
KUBELET_EXTRA_ARGS="--fail-swap-on=false --max-pods=1000"
EOF
Make sure the kubelet systemd unit loads this environment file:
[root@node2 ~]# vi /usr/lib/systemd/system/kubelet.service
[Service]
EnvironmentFile=-/etc/sysconfig/kubelet
systemctl daemon-reload
systemctl restart kubelet
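To confirm that the new limit took effect, check the node's reported pod capacity from master1 (sketch; node2 is the node modified above):
# Should now print 1000 instead of the default 110
kubectl get node node2 -o jsonpath='{.status.capacity.pods}'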
The Pending pods are then created one after another:
[root@master1 ~]# kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-deployment 350/350 350 350 2d18h
To restore the default pod limit:
cat >/etc/sysconfig/kubelet<<\EOF
KUBELET_EXTRA_ARGS=
EOF
systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet
Finally, do not forget to clean up the environment:
[root@master1 ~]# kubectl delete deployment nginx-deployment
deployment.apps "nginx-deployment" deleted