
Analysis of all the situations in which a Pod is retried after it fails to schedule


When a Pod in Kubernetes fails to be scheduled for some reason, it is placed in the unschedulable queue. What happens to the Pods sitting in this queue?

How do they get another chance at being scheduled? In this article, let's walk through the whole story from the source code's perspective.

The scheduler starts two goroutines that periodically move Pods from backoffQ and unschedulableQ back into activeQ:

func (p *PriorityQueue) Run() {
   go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
   go wait.Until(p.flushUnschedulablePodsLeftover, 30*time.Second, p.stop)
}

flushUnschedulablePodsLeftover

// flushUnschedulablePodsLeftover moves pods which stay in unschedulablePods
// longer than podMaxInUnschedulablePodsDuration to backoffQ or activeQ.
func (p *PriorityQueue) flushUnschedulablePodsLeftover() {
   p.lock.Lock()
   defer p.lock.Unlock()

   var podsToMove []*framework.QueuedPodInfo
   currentTime := p.clock.Now()
   for _, pInfo := range p.unschedulablePods.podInfoMap {
      lastScheduleTime := pInfo.Timestamp
      if currentTime.Sub(lastScheduleTime) > p.podMaxInUnschedulablePodsDuration {
         podsToMove = append(podsToMove, pInfo)
      }
   }

   if len(podsToMove) > 0 {
      p.movePodsToActiveOrBackoffQueue(podsToMove, UnschedulableTimeout)
   }
}

// NOTE: this function assumes the lock has been acquired in the caller.
func (p *PriorityQueue) movePodsToActiveOrBackoffQueue(podInfoList []*framework.QueuedPodInfo, event framework.ClusterEvent) {
   activated := false
   for _, pInfo := range podInfoList {
      // If the event doesn't help making the Pod schedulable, continue.
      // Note: we don't run the check if pInfo.UnschedulablePlugins is nil, which denotes
      // either there is some abnormal error, or scheduling the pod failed by plugins other than PreFilter, Filter and Permit.
      // In that case, it's desired to move it anyways.
      if len(pInfo.UnschedulablePlugins) != 0 && !p.podMatchesEvent(pInfo, event) {
         continue
      }
      pod := pInfo.Pod
      if p.isPodBackingoff(pInfo) {
         if err := p.podBackoffQ.Add(pInfo); err != nil {
            klog.ErrorS(err, "Error adding pod to the backoff queue", "pod", klog.KObj(pod))
         } else {
            metrics.SchedulerQueueIncomingPods.WithLabelValues("backoff", event.Label).Inc()
            p.unschedulablePods.delete(pod)
         }
      } else {
         if err := p.activeQ.Add(pInfo); err != nil {
            klog.ErrorS(err, "Error adding pod to the scheduling queue", "pod", klog.KObj(pod))
         } else {
            activated = true
            metrics.SchedulerQueueIncomingPods.WithLabelValues("active", event.Label).Inc()
            p.unschedulablePods.delete(pod)
         }
      }
   }
   p.moveRequestCycle = p.schedulingCycle
   if activated {
      p.cond.Broadcast()
   }
}

Pods that have stayed in unschedulableQ for more than podMaxInUnschedulablePodsDuration (default 5 minutes) are moved into activeQ or backoffQ. Which of the two queues a Pod lands in is decided by the following rules:

  • Based on the number of times the Pod has already attempted scheduling, compute how long it should wait before the next attempt. The wait grows exponentially (1s, 2s, 4s, 8s, ...), but it does not grow without bound: it is capped at podMaxBackoffDuration (default 10s), the maximum time a Pod spends in backoff. Once the computed wait exceeds podMaxBackoffDuration, the Pod only waits podMaxBackoffDuration before it can be scheduled again (see the sketch after this list);
  • If the current time minus the last scheduling time is greater than the wait computed in rule 1, the Pod is put into activeQ and will be scheduled; otherwise it is placed in backoffQ and keeps waiting. Pods waiting in backoffQ are picked up later by flushBackoffQCompleted.

So a Pod that meets the condition here will definitely be moved from unschedulableQ into either backoffQ or activeQ.
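
To make the first rule concrete, here is a small, self-contained sketch of the exponential backoff calculation. It mirrors the logic of the queue's calculateBackoffDuration/getBackoffTime helpers, but the standalone function and the main wrapper are illustrative only; the defaults correspond to podInitialBackoffDuration (1s) and podMaxBackoffDuration (10s).

package main

import (
   "fmt"
   "time"
)

// backoffDuration doubles the wait for every additional scheduling attempt,
// starting at initial and capping the result at max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
   d := initial
   for i := 1; i < attempts; i++ {
      if d > max-d { // doubling would exceed the cap (also avoids overflow)
         return max
      }
      d += d
   }
   return d
}

func main() {
   // With the defaults (1s initial, 10s max) this prints 1s, 2s, 4s, 8s, 10s, 10s.
   for attempts := 1; attempts <= 6; attempts++ {
      fmt.Printf("attempt %d -> wait %v\n", attempts, backoffDuration(attempts, time.Second, 10*time.Second))
   }
}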

flushBackoffQCompleted

This goroutine pops the Pods whose backoff has expired from the backoff queue (a priority queue ordered by backoff expiry time) and puts them into activeQ.

// flushBackoffQCompleted moves all pods from backoffQ which have completed backoff into activeQ.
func (p *PriorityQueue) flushBackoffQCompleted() {
   p.lock.Lock()
   defer p.lock.Unlock()
   activated := false
   for {
      rawPodInfo := p.podBackoffQ.Peek()
      if rawPodInfo == nil {
         break
      }
      pod := rawPodInfo.(*framework.QueuedPodInfo).Pod
      boTime := p.getBackoffTime(rawPodInfo.(*framework.QueuedPodInfo))
      if boTime.After(p.clock.Now()) {
         break
      }
      _, err := p.podBackoffQ.Pop()
      if err != nil {
         klog.ErrorS(err, "Unable to pop pod from backoff queue despite backoff completion", "pod", klog.KObj(pod))
         break
      }
      p.activeQ.Add(rawPodInfo)
      metrics.SchedulerQueueIncomingPods.WithLabelValues("active", BackoffComplete).Inc()
      activated = true
   }
   if activated {
      p.cond.Broadcast()
   }
}

Besides these two periodic loops that proactively decide whether the Pods in unschedulableQ or backoffQ can be scheduled again, are there any other situations?

There are.

Four other kinds of events also trigger a re-evaluation of whether the Pods in these two queues should be re-queued for scheduling:

  • A new node joins the cluster
  • A node's configuration or status changes
  • An already existing (assigned) Pod changes
  • A Pod in the cluster is deleted

informerFactory.Core().V1().Nodes().Informer().AddEventHandler(
   cache.ResourceEventHandlerFuncs{
      AddFunc:    sched.addNodeToCache,
      UpdateFunc: sched.updateNodeInCache,
      DeleteFunc: sched.deleteNodeFromCache,
   },
)

Newly joined node

func (sched *Scheduler) addNodeToCache(obj interface{}) {
   node, ok := obj.(*v1.Node)
   if !ok {
      klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", obj)
      return
   }

   nodeInfo := sched.Cache.AddNode(node)
   klog.V(3).InfoS("Add event for node", "node", klog.KObj(node))
   sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.NodeAdd, preCheckForNode(nodeInfo))
}

func preCheckForNode(nodeInfo *framework.NodeInfo) queue.PreEnqueueCheck {
   // Note: the following checks doesn't take preemption into considerations, in very rare
   // cases (e.g., node resizing), "pod" may still fail a check but preemption helps. We deliberately
   // chose to ignore those cases as unschedulable pods will be re-queued eventually.
   return func(pod *v1.Pod) bool {
      admissionResults := AdmissionCheck(pod, nodeInfo, false)
      if len(admissionResults) != 0 {
         return false
      }
      _, isUntolerated := corev1helpers.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
         return t.Effect == v1.TaintEffectNoSchedule
      })
      return !isUntolerated
   }
}

As you can see, when a node joins the cluster, the Pods in unschedulableQ are examined one by one with the following checks:

  • Whether the Pod's node affinity matches the new node
  • If the Pod's nodeName is not empty, whether it equals the new node's name
  • Whether the host ports requested by the Pod's containers conflict with ports already in use on the new node
  • Whether the Pod tolerates the node's taints

Only when all of these checks pass does the node-add event cause the unschedulable Pod to be moved into backoffQ or activeQ; which of the two queues it goes to was analyzed above. A simplified sketch of these checks follows.
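
As a rough illustration, here is a simplified, self-contained version of what such a pre-check amounts to. The function name, the usedHostPorts map, and the exact package choices are assumptions made for this sketch; the real logic lives in preCheckForNode/AdmissionCheck and may cover more than what is shown here.

package sketch

import (
   v1 "k8s.io/api/core/v1"
   corev1helpers "k8s.io/component-helpers/scheduling/corev1"
   "k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

// preCheckSketch approximates the per-Pod checks run when a new node appears:
// node affinity, an explicit nodeName, host-port conflicts, and NoSchedule taints.
// usedHostPorts is a simplified stand-in for the port bookkeeping the scheduler
// keeps in NodeInfo.
func preCheckSketch(pod *v1.Pod, node *v1.Node, usedHostPorts map[int32]bool) bool {
   // 1. The Pod's node affinity / nodeSelector must match the node.
   if match, _ := nodeaffinity.GetRequiredNodeAffinity(pod).Match(node); !match {
      return false
   }
   // 2. If spec.nodeName is set, it must name this node.
   if pod.Spec.NodeName != "" && pod.Spec.NodeName != node.Name {
      return false
   }
   // 3. Requested host ports must not collide with ports already in use on the node.
   for _, c := range pod.Spec.Containers {
      for _, port := range c.Ports {
         if port.HostPort != 0 && usedHostPorts[port.HostPort] {
            return false
         }
      }
   }
   // 4. The Pod must tolerate the node's NoSchedule taints.
   _, untolerated := corev1helpers.FindMatchingUntoleratedTaint(node.Spec.Taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
      return t.Effect == v1.TaintEffectNoSchedule
   })
   return !untolerated
}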

Node update

func (sched *Scheduler) updateNodeInCache(oldObj, newObj interface{}) {
   oldNode, ok := oldObj.(*v1.Node)
   if !ok {
      klog.ErrorS(nil, "Cannot convert oldObj to *v1.Node", "oldObj", oldObj)
      return
   }
   newNode, ok := newObj.(*v1.Node)
   if !ok {
      klog.ErrorS(nil, "Cannot convert newObj to *v1.Node", "newObj", newObj)
      return
   }

   nodeInfo := sched.Cache.UpdateNode(oldNode, newNode)
   // Only requeue unschedulable pods if the node became more schedulable.
   if event := nodeSchedulingPropertiesChange(newNode, oldNode); event != nil {
      sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(*event, preCheckForNode(nodeInfo))
   }
}

func nodeSchedulingPropertiesChange(newNode *v1.Node, oldNode *v1.Node) *framework.ClusterEvent {
   if nodeSpecUnschedulableChanged(newNode, oldNode) {
      return &queue.NodeSpecUnschedulableChange
   }
   if nodeAllocatableChanged(newNode, oldNode) {
      return &queue.NodeAllocatableChange
   }
   if nodeLabelsChanged(newNode, oldNode) {
      return &queue.NodeLabelChange
   }
   if nodeTaintsChanged(newNode, oldNode) {
      return &queue.NodeTaintChange
   }
   if nodeConditionsChanged(newNode, oldNode) {
      return &queue.NodeConditionChange
   }
   return nil
}

First, the scheduler determines what changed on the node; the possibilities are:

  • The node's unschedulable flag (spec.unschedulable) changed
  • The node's allocatable resources changed
  • The node's labels changed
  • The node's taints changed
  • The node's conditions changed

If the reason a Pod failed scheduling can be matched against one of these events, the node-update event triggers the unschedulable Pod to be moved into backoffQ or activeQ. The individual change detectors are simple comparisons; see the sketch below.
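
Reconstructed roughly, the change detectors compare the old and new Node objects field by field with apimachinery's semantic DeepEqual (equality here is k8s.io/apimachinery/pkg/api/equality); the exact details may differ slightly between scheduler versions.

func nodeSpecUnschedulableChanged(newNode *v1.Node, oldNode *v1.Node) bool {
   // Only interesting when the node flips to schedulable (unschedulable == false).
   return newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && !newNode.Spec.Unschedulable
}

func nodeAllocatableChanged(newNode *v1.Node, oldNode *v1.Node) bool {
   return !equality.Semantic.DeepEqual(oldNode.Status.Allocatable, newNode.Status.Allocatable)
}

func nodeLabelsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
   return !equality.Semantic.DeepEqual(oldNode.GetLabels(), newNode.GetLabels())
}

func nodeTaintsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
   return !equality.Semantic.DeepEqual(newNode.Spec.Taints, oldNode.Spec.Taints)
}

func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
   // Compare only condition type -> status, ignoring timestamps and messages.
   strip := func(conditions []v1.NodeCondition) map[v1.NodeConditionType]v1.ConditionStatus {
      m := make(map[v1.NodeConditionType]v1.ConditionStatus, len(conditions))
      for i := range conditions {
         m[conditions[i].Type] = conditions[i].Status
      }
      return m
   }
   return !equality.Semantic.DeepEqual(strip(oldNode.Status.Conditions), strip(newNode.Status.Conditions))
}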

The scheduler registers similar event handlers for Pods that are already assigned to a node:

informerFactory.Core().V1().Pods().Informer().AddEventHandler(
   cache.FilteringResourceEventHandler{
      FilterFunc: func(obj interface{}) bool {
         switch t := obj.(type) {
         case *v1.Pod:
            return assignedPod(t)
         case cache.DeletedFinalStateUnknown:
            if _, ok := t.Obj.(*v1.Pod); ok {
               // The carried object may be stale, so we don't use it to check if
               // it's assigned or not. Attempting to cleanup anyways.
               return true
            }
            utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
            return false
         default:
            utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
            return false
         }
      },
      Handler: cache.ResourceEventHandlerFuncs{
         AddFunc:    sched.addPodToCache,
         UpdateFunc: sched.updatePodInCache,
         DeleteFunc: sched.deletePodFromCache,
      },
   },
)

Already existing pods have changed

func (sched *Scheduler) addPodToCache(obj interface{}) {
   pod, ok := obj.(*v1.Pod)
   if !ok {
      klog.ErrorS(nil, "Cannot convert to *v1.Pod", "obj", obj)
      return
   }
   klog.V(3).InfoS("Add event for scheduled pod", "pod", klog.KObj(pod))

   if err := sched.Cache.AddPod(pod); err != nil {
      klog.ErrorS(err, "Scheduler cache AddPod failed", "pod", klog.KObj(pod))
   }

   sched.SchedulingQueue.AssignedPodAdded(pod)
}

// AssignedPodAdded is called when a bound pod is added. Creation of this pod
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
   p.lock.Lock()
   p.movePodsToActiveOrBackoffQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod), AssignedPodAdd)
   p.lock.Unlock()
}

// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
   nsLabels := interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)

   var podsToMove []*framework.QueuedPodInfo
   for _, pInfo := range p.unschedulablePods.podInfoMap {
      for _, term := range pInfo.RequiredAffinityTerms {
         if term.Matches(pod, nsLabels) {
            podsToMove = append(podsToMove, pInfo)
            break
         }
      }
   }
   return podsToMove
}

As you can see, when an already assigned Pod is added or changes, its labels are matched in turn against the required pod-affinity terms of every Pod in unschedulableQ. If a term matches, the event triggers that unschedulable Pod to be moved into backoffQ or activeQ.

Pod deletion in the cluster

func (sched *Scheduler) deletePodFromCache(obj interface{}) {
   var pod *v1.Pod
   switch t := obj.(type) {
   case *v1.Pod:
      pod = t
   case cache.DeletedFinalStateUnknown:
      var ok bool
      pod, ok = t.Obj.(*v1.Pod)
      if !ok {
         klog.ErrorS(nil, "Cannot convert to *v1.Pod", "obj", t.Obj)
         return
      }
   default:
      klog.ErrorS(nil, "Cannot convert to *v1.Pod", "obj", t)
      return
   }
   klog.V(3).InfoS("Delete event for scheduled pod", "pod", klog.KObj(pod))
   if err := sched.Cache.RemovePod(pod); err != nil {
      klog.ErrorS(err, "Scheduler cache RemovePod failed", "pod", klog.KObj(pod))
   }

   sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.AssignedPodDelete, nil)
}

Notice that, unlike the other events, the Pod-delete event does not need any extra check: the preCheck function passed here is nil, so all Pods in unschedulableQ are moved into activeQ or backoffQ.
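
For context, the queue method that all of these event handlers end up calling looks roughly like this (reconstructed from the scheduling queue; treat the details as approximate). A nil preCheck simply selects every Pod in unschedulableQ.

func (p *PriorityQueue) MoveAllToActiveOrBackoffQueue(event framework.ClusterEvent, preCheck PreEnqueueCheck) {
   p.lock.Lock()
   defer p.lock.Unlock()
   unschedulablePods := make([]*framework.QueuedPodInfo, 0, len(p.unschedulablePods.podInfoMap))
   for _, pInfo := range p.unschedulablePods.podInfoMap {
      // A nil preCheck (e.g. the Pod-delete event) selects every Pod.
      if preCheck == nil || preCheck(pInfo.Pod) {
         unschedulablePods = append(unschedulablePods, pInfo)
      }
   }
   p.movePodsToActiveOrBackoffQueue(unschedulablePods, event)
}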

From the situations above we can see that cluster events speed up the rescheduling of failed Pods. Without them, a Pod that failed scheduling would wait up to 5 minutes before being moved back to backoffQ or activeQ, and Pods in backoffQ also have to sit out their backoff before being rescheduled. This is why, when you modify a node's configuration, you can see a pending Pod get scheduled almost immediately.

Those are all the situations in which a Pod whose scheduling failed is retried.

For more information about how failed Pod scheduling is retriggered, please follow my other related articles!