
K8s cluster restart and recovery - how to stop and start a Worker node

1 Application scenarios

Scenario

  • In actual work, a Worker node may need maintenance or migration
  • We need to stop and start the node smoothly
  • The impact on the cluster and its services during the stop and start should be minimized

Notes

  • This procedure covers taking a Worker node out of service
  • The workloads (Pods) on the Worker node will be evicted to other nodes
  • Make sure the remaining cluster resources are sufficient to absorb them (a quick capacity check is sketched below)
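Before cordoning anything, it helps to confirm that the remaining nodes can absorb the evicted Pods. A minimal capacity check, assuming the metrics-server add-on is installed (required by kubectl top):

# Current CPU/memory usage per node (requires metrics-server)
kubectl top nodes
# CPU/memory requests and limits already allocated on each node
kubectl describe nodes | grep -A 5 "Allocated resources"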

2 Operation steps

2.1 Stop Worker node scheduling

# View node information (<node_name> below stands for the node being maintained)
root@sh-gpu091:~# kubectl get node
NAME                 STATUS     ROLES   AGE    VERSION
172.19.13.31         Ready      node    403d   v1.14.1
<node_name>          Ready      node    403d   v1.14.1
...
# Stop scheduling on the Worker node
root@sh-gpu091:~# kubectl cordon <node_name>
node/<node_name> cordoned
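Note that cordon only marks the node unschedulable; Pods already running on it are untouched. Under the hood it just sets spec.unschedulable=true on the Node object, so the following patch is an equivalent sketch, shown only to illustrate what cordon does:

# Equivalent to "kubectl cordon <node_name>"
kubectl patch node <node_name> -p '{"spec":{"unschedulable":true}}'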
  • Check the node status
root@sh-gpu091:~# kubectl get node
NAME                 STATUS                     ROLES   AGE    VERSION
172.19.13.31         Ready                      node    403d   v1.14.1
...
<node_name>          Ready,SchedulingDisabled   node    403d   v1.14.1

2.2 Evict workloads on the Worker node

# --ignore-daemonsets: ignore DaemonSet-managed Pods when draining
# --delete-local-data: delete the Pods' emptyDir (temporary) data when draining; persistent data is not deleted
root@sh-gpu091:~# kubectl drain <node_name> --delete-local-data --ignore-daemonsets --force
node/<node_name> already cordoned
WARNING: ignoring DaemonSet-managed Pods: cattle-system/cattle-node-agent-8wcvs, kube-system/kube-flannel-ds-kqzhc, kube-system/nvidia-device-plugin-daemonset-rr2lf, monitoring/prometheus-node-exporter-xtbxp
evicting pod "model-server-0"
evicting pod "singleview-proxy-client-pbdownloader-0"
evicting pod "singleview-proxy-service-0"
pod/singleview-proxy-client-pbdownloader-0 evicted
pod/singleview-proxy-service-0 evicted
pod/model-server-0 evicted
node/<node_name> evicted
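If drain hangs at "evicting pod ...", a PodDisruptionBudget may be refusing the eviction; the check below is a starting point. Also note that on newer kubectl releases (v1.20+) the --delete-local-data flag has been renamed --delete-emptydir-data.

# List PodDisruptionBudgets that could block eviction
kubectl get pdb -A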

2.3 Stop Docker, Kubelet and other services

systemctl stop kubelet 
systemctl stop docker
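kubelet is stopped first so that it cannot restart containers while Docker shuts down. As a quick sanity check on the node itself (assuming both services are managed by systemd):

# Both should report "inactive"
systemctl is-active kubelet docker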
  • Check whether any Pods remain on the node
kubectl get pod -A -o wide | grep <node_name>
  • If the node does not need to be recovered, you can delete it from the cluster and confirm the remaining nodes
root@sh-gpu091:~# kubectl delete node <node_name>
node "<node_name>" deleted
root@sh-gpu091:~# kubectl get node
NAME                 STATUS     ROLES   AGE    VERSION
172.19.13.31         Ready      node    403d   v1.14.1
...
root@sh-gpu091:~#
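If instead you keep the node registered for later recovery, expect its STATUS to flip to NotReady once the control plane notices the stopped kubelet (after the node-monitor grace period, 40s by default):

kubectl get node <node_name>
# NAME          STATUS                        ROLES   AGE    VERSION
# <node_name>   NotReady,SchedulingDisabled   node    403d   v1.14.1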

2.4 Recover Worker nodes

systemctl start docker
systemctl status docker
systemctl start kubelet
systemctl status kubelet
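Once kubelet is back, the node should return to Ready but remain cordoned: the unschedulable flag is stored on the Node object, so SchedulingDisabled survives the restart (assuming the node was not deleted in step 2.3). A quick check before moving on:

# Expect STATUS to read Ready,SchedulingDisabled
kubectl get node <node_name>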

2.5 Allow Worker node scheduling

# Re-enable scheduling on the node
kubectl uncordon <node_name>
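Finally, confirm the node is schedulable again. Note that uncordon does not move the evicted Pods back; the node simply becomes a candidate for newly scheduled Pods.

# STATUS should now read plain "Ready"
kubectl get node <node_name>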

Summary

The above is based on my personal experience. I hope it can serve as a useful reference, and I appreciate your support.