Saturday, November 30, 2019

Building on Kube-Batch Primitives for Scheduling

In previous posts, I have talked about the logical structures of a Synchronization Set,  a Generation, and a Data Pond.   I have also identified the need to delay the creation of pods in order to conserve resources.

It appears that the kube-batch infrastructure is well positioned to address the question of how these objects get scheduled on a cluster.  In previous posts, it was envisioned that resources would be created directly by client code in our operators, but now I am seeing the need for another layer of abstraction here.

The CRD we have envisioned can capture the high-level details of the system, and kube-batch will then handle the translation into runtime scheduling; this should provide a more robust execution environment.

Upon initial investigation of the kube-batch documentation, I am seeing a few interesting areas:


Via the customizable priority functions, a node can be selected for the execution of a workload (the execution of a Generation)

Workloads will be scheduled on Nodes with higher priority, and these priorities will be calculated based on different parameters like ImageLocality, Most/Least Requested Nodes, etc. A basic flow for the Node priority functions is depicted below.

It appears we should be able to schedule pods according to which DataPond is already positioned on the node.
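To make the idea concrete, here is a minimal sketch of what a DataPond-aware priority function might compute. It is plain Python rather than a real kube-batch plugin, and the names `Node`, `data_pond_locality_score`, and `pick_node`, along with the 0 to 10 score range, are my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a cluster node and the DataPonds already on it."""
    name: str
    data_ponds: set = field(default_factory=set)


def data_pond_locality_score(node, required_ponds, max_score=10):
    """Score a node by the fraction of required DataPonds it already holds."""
    if not required_ponds:
        return 0
    hits = len(node.data_ponds & set(required_ponds))
    return max_score * hits // len(required_ponds)


def pick_node(nodes, required_ponds):
    """Return the highest-scoring node; ties go to the earliest node listed."""
    return max(nodes, key=lambda n: data_pond_locality_score(n, required_ponds))


nodes = [Node("node-a", {"pond-1"}), Node("node-b", {"pond-1", "pond-2"}), Node("node-c")]
best = pick_node(nodes, ["pond-1", "pond-2"])
print(best.name)
```

In a real scheduler this score would presumably be combined with the other priority functions rather than used alone.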


To delay pod creation, both kube-batch and PodGroupMinResources will watch ResourceQuota to decide which PodGroup should be queued first. The decision may be invalid because of a race condition, e.g. other controllers creating Pods. In that case, PodGroupMinResources will reject the PodGroup creation and keep the InQueue state until kube-batch transforms it back to Pending. To avoid the race condition, it is better to let kube-batch manage Pod count and resources (e.g. CPU, memory) instead of Quota.

So delay can be achieved at the PodGroup level.
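As a rough mental model of PodGroup-level delay, the sketch below admits groups only while their minimum CPU still fits. This is not kube-batch code: the `PodGroup` class, the `admit` function, the single CPU dimension, and the first-fit ordering are all simplifying assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class PodGroup:
    """Hypothetical stand-in for a PodGroup with a minimum CPU requirement."""
    name: str
    min_cpu: int            # millicores required before any pod is created
    phase: str = "Pending"


def admit(pod_groups, free_cpu):
    """Move Pending groups to InQueue, first fit, while their minimum CPU fits.

    Pods would only be created for InQueue groups; groups left Pending
    consume nothing. Returns the CPU still unreserved.
    """
    for pg in pod_groups:
        if pg.phase == "Pending" and pg.min_cpu <= free_cpu:
            pg.phase = "InQueue"
            free_cpu -= pg.min_cpu
    return free_cpu


groups = [PodGroup("gen-0", 2000), PodGroup("gen-1", 3000), PodGroup("gen-2", 1000)]
remaining = admit(groups, free_cpu=4000)
print([(g.name, g.phase) for g in groups], remaining)
```

Because the sketch is first fit, a small later group can be admitted ahead of a larger earlier one; a real queue would likely apply its own ordering policy.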

This looks to be related to Volcano's Pod Group Scheduling:

https://volcano.sh/docs/scheduler-pod-group-status/

This, in turn, means we can construct "football"-like plays, passing control from Pod Group to Pod Group.  The DSL for the execution of batch operations could handle this.
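A minimal sketch of such a play, assuming each stage maps to one Pod Group and that a later stage is only started once the previous one succeeds. The `run_play` function, its `run_stage` callback, and the stage names are hypothetical and exist only for illustration.

```python
def run_play(stages, run_stage):
    """Run stages in order, handing off only after the previous one succeeds.

    `run_stage` is a callback that starts the Pod Group for a stage and
    returns True on success. Returns the names of the stages that completed.
    """
    completed = []
    for name in stages:
        if not run_stage(name):
            break  # the handoff stops here; later Pod Groups are never created
        completed.append(name)
    return completed


# Hypothetical stage names; the callback fails the second stage on purpose.
done = run_play(["stage-1", "stage-2", "stage-3"], lambda s: s != "stage-2")
print(done)
```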


Volcano builds on top of the primitives provided by the kube-batch scheduler.



The vk-scheduler is based on kube-batch and includes enhancements for Volcano.

Since a Generation defines a time-sliced execution of a Synchronization Set, it maps naturally to a "workload" or, in Volcano terms, a JobEx.

So now the flow of our system is taking shape.  


  1. Define SyncSet definitions
  2. Create Generations based on the SyncSet
  3. Create Data Ponds based on the SyncSet definition
  4. For each Generation, create a corresponding JobEx
  5. Bind Pods to nodes based on a Node Priority Algorithm that uses DataPond location as the primary criterion
  6. Move from stage to stage in Generation execution, bringing up resources "on demand" through the Delayed Pod Creation from kube-batch
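The first part of that flow can be sketched as plain data structures. This only models steps 1, 2, and 4; the field names, the `plan` function, and the `<syncset>-gen-<n>` naming scheme are assumptions of mine, not Volcano's actual JobEx schema.

```python
from dataclasses import dataclass


@dataclass
class SyncSet:
    """Hypothetical SyncSet definition (step 1)."""
    name: str
    data_ponds: list
    stages: list


@dataclass
class Generation:
    """One time-sliced execution of a SyncSet (step 2)."""
    sync_set: SyncSet
    index: int


@dataclass
class JobEx:
    """Simplified stand-in for the job created per Generation (step 4)."""
    name: str
    stages: list
    preferred_ponds: list


def plan(sync_set, generation_count):
    """Derive Generations from a SyncSet, then one JobEx per Generation."""
    generations = [Generation(sync_set, i) for i in range(generation_count)]
    return [
        JobEx(
            name=f"{g.sync_set.name}-gen-{g.index}",
            stages=list(g.sync_set.stages),
            preferred_ponds=list(g.sync_set.data_ponds),
        )
        for g in generations
    ]


jobs = plan(SyncSet("orders", ["pond-a"], ["stage-1", "stage-2"]), generation_count=2)
print([j.name for j in jobs])
```

The `preferred_ponds` field is where the DataPond locality from step 5 would feed into node selection.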










