jmenke blog: Building on Kube-Batch Primitives for Scheduling

In previous posts, I have talked about the logical structures of a Synchronization Set, a Generation, and a Data Pond. I have also identified the need to delay the creation of pods in order to conserve resources.

It appears that the kube-batch infrastructure is well positioned to address the question of how these objects get scheduled on a cluster. In previous posts, I envisioned that resources would be created directly by client code in our operators, but I now see the need for another layer of abstraction here.

The CRD we have envisioned can capture the high-level details of the system, and kube-batch will then handle the translation into runtime scheduling; this should provide a more robust execution environment.

Upon initial investigation of the kube-batch documentation, I am seeing a few interesting areas:

Node Priority: https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/node-priority.md

Via the customizable priority functions, a node can be selected for the execution of a workload (in our case, the execution of a Generation).

Workloads will be scheduled on Nodes with higher priority, and these priorities will be calculated based on different parameters such as ImageLocality, Most/Least Requested Nodes, etc.

It appears we should be able to schedule pods according to which DataPond is already positioned on the node.
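
As a rough illustration of that idea, here is a minimal Python sketch of a locality-based priority function. It is not kube-batch's API (kube-batch is written in Go and has its own plugin interfaces); the `Node` type, the `data_pond_priority` function, the pond names, and the 0 to 10 score range are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a cluster node and the Data Ponds it already hosts."""
    name: str
    data_ponds: set = field(default_factory=set)


MAX_SCORE = 10  # assumed score range, chosen for this sketch


def data_pond_priority(node, required_ponds):
    """Score a node by the fraction of required Data Ponds already on it."""
    if not required_ponds:
        return 0
    local = len(node.data_ponds & required_ponds)
    return MAX_SCORE * local // len(required_ponds)


def pick_node(nodes, required_ponds):
    """Return the highest-scoring node; ties go to the first node listed."""
    return max(nodes, key=lambda n: data_pond_priority(n, required_ponds))


nodes = [
    Node("node-a", {"pond-1"}),
    Node("node-b", {"pond-1", "pond-2"}),
    Node("node-c"),
]
best = pick_node(nodes, {"pond-1", "pond-2"})
print(best.name)
```

In a real scheduler this score would presumably be combined with the other priority functions rather than used alone; the sketch only shows the DataPond-locality term.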

Delay Pod Creation: https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/delay-pod-creation.md

To delay pod creation, both kube-batch and PodGroupMinResources will watch ResourceQuota to decide which PodGroup should be queued first. The decision may be invalid because of a race condition, e.g. other controllers creating Pods. In that case, PodGroupMinResources will reject PodGroup creation and keep the InQueue state until kube-batch transforms it back to Pending. To avoid the race condition, it is better to let kube-batch manage Pod count and resources (e.g. CPU, memory) instead of Quota.

So delay can be achieved at the PodGroup level.
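
To make the PodGroup-level delay concrete, here is a toy Python model of the gating idea. It is only a sketch under my own assumptions: the `PodGroup` class below is a hypothetical stand-in (not the real CRD schema), resources are reduced to a single CPU number, and the phase names simply reuse the Pending and InQueue terms from the design doc.

```python
from dataclasses import dataclass


@dataclass
class PodGroup:
    """Hypothetical stand-in for a PodGroup; not the real CRD schema."""
    name: str
    min_cpu: int          # minimum CPU (in cores) the whole group needs
    phase: str = "Pending"
    pods_created: bool = False


def admit(groups, free_cpu):
    """Move Pending groups to InQueue while their minimum resources still fit.

    Groups are considered in list order. Pods are created only for admitted
    groups, so a group that does not fit consumes nothing.
    Returns the CPU left over after admission.
    """
    for group in groups:
        if group.phase == "Pending" and group.min_cpu <= free_cpu:
            group.phase = "InQueue"
            group.pods_created = True   # pod creation happens only now
            free_cpu -= group.min_cpu
    return free_cpu


groups = [PodGroup("gen-1", 4), PodGroup("gen-2", 6), PodGroup("gen-3", 2)]
remaining = admit(groups, free_cpu=8)
for g in groups:
    print(g.name, g.phase, g.pods_created)
```

The point of the sketch is that a group that does not fit stays Pending with no pods at all, which is the resource saving I am after.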

This looks to be related to Volcano's Pod Group Scheduling:

https://volcano.sh/docs/scheduler-pod-group-status/

This, in turn, means we can construct football-like "plays" that pass control from Pod Group to Pod Group. The DSL for the execution of batch operations could handle this.

Volcano builds on top of the primitives provided by the kube-batch scheduler.

The vk-scheduler is based on kube-batch and includes enhancements for Volcano.

Since a Generation defines a time-sliced execution of a Synchronization Set, it maps naturally to a "workload" or, in Volcano terms, a JobEx.

So now the flow of our system is taking shape.

Define SyncSet definitions
Create Generations based on the SyncSet
Create Data Ponds based on the SyncSet definition
For each Generation, create a corresponding JobEx
Bind Pods to nodes based on the Node Priority algorithm, which uses DataPond location as the primary criterion
Move from stage to stage in Generation execution, bringing up resources "on demand" through Delayed Pod Creation from kube-batch
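
The flow above can be sketched end to end in a few lines of Python. This is a thought experiment, not an implementation: the `Generation` class, the job naming scheme, and the stage names are hypothetical, and "bringing up resources" is reduced to appending events to a log.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """Hypothetical model: one time-sliced execution of a Synchronization Set."""
    sync_set: str
    index: int
    stages: list = field(default_factory=list)


def make_generations(sync_set, count, stages):
    """Derive a fixed number of Generations from one SyncSet definition."""
    return [Generation(sync_set, i, list(stages)) for i in range(count)]


def run_generation(generation, log):
    """Run one job per Generation, bringing each stage up only when reached."""
    job = f"{generation.sync_set}-gen-{generation.index}"
    for stage in generation.stages:
        # Resources for a stage exist only while that stage runs.
        log.append((job, stage, "up"))
        log.append((job, stage, "down"))
    return job


events = []
generations = make_generations("orders", 2, ["extract", "load"])
jobs = [run_generation(g, events) for g in generations]
for event in events:
    print(event)
```

Each stage is torn down before the next one comes up, which is the "on demand" behavior I am hoping to get from delayed pod creation.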

Saturday, November 30, 2019
