It appears that the kube-batch infrastructure is well positioned to address the question of how these objects get scheduled on a cluster. In previous posts, it was envisioned that resources would be created directly by client code in our operators, but I now see the need for another layer of abstraction here.
The CRD we have envisioned can identify the high-level details for the system, and kube-batch will then handle the translation into runtime scheduling; this should provide a more robust execution environment.
Upon initial investigation of the kube-batch documentation, I am seeing a few interesting areas:
Node Priority: https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/node-priority.md
Via the customizable priority functions, a node can be selected for the execution of a workload (in our case, the execution of a Generation).
"Workloads will be scheduled on Nodes with higher priority and these priorities will be calculated based on different parameters like ImageLocality, Most/Least Requested Nodes...etc. A basic flow for the Node priority functions is depicted below."
It appears we should be able to schedule pods according to which DataPond is already positioned on the node.
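To make the idea concrete, here is a minimal sketch in Python. It is not kube-batch's actual priority-function API; it only illustrates scoring each node by how many of the DataPonds a workload needs are already located on it. The names Node, score_node and pick_node, the pond and node names, and the 0 to 10 score range are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_SCORE = 10  # assumed score ceiling, chosen only for illustration


@dataclass
class Node:
    name: str
    data_ponds: set = field(default_factory=set)  # DataPonds already on the node


def score_node(node, required_ponds):
    """Score a node by the fraction of required DataPonds it already holds."""
    if not required_ponds:
        return 0
    local = len(node.data_ponds & required_ponds)
    return MAX_SCORE * local // len(required_ponds)


def pick_node(nodes, required_ponds):
    """Return the highest-scoring node; ties go to the alphabetically first name."""
    return min(nodes, key=lambda n: (-score_node(n, required_ponds), n.name))


# Hypothetical cluster: node-b already holds both ponds the workload needs.
nodes = [
    Node("node-a", {"pond-1"}),
    Node("node-b", {"pond-1", "pond-2"}),
    Node("node-c"),
]
best = pick_node(nodes, {"pond-1", "pond-2"})
print(best.name)
```

In a real scheduler this score would be one input among several (ImageLocality, requested resources and so on); the sketch treats DataPond location as the only criterion.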
Delay Pod Creation: https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/delay-pod-creation.md
"To delay pod creation, both kube-batch and PodGroupMinResources will watch ResourceQuota to decide which PodGroup should be in queue firstly. The decision maybe invalid because of race condition, e.g. other controllers create Pods. In such case, PodGroupMinResources will reject PodGroup creation and keep InQueue state until kube-batch transform it back to Pending. To avoid race condition, it's better to let kube-batch manage Pod number and resources (e.g. CPU, memory) instead of Quota."
So delay can be achieved at the PodGroup level.
This looks to be related to Volcano's Pod Group Scheduling:
https://volcano.sh/docs/scheduler-pod-group-status/
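As a rough illustration of delay at the PodGroup level, the following Python sketch moves a group from Pending to InQueue only when its minimum resources fit in the remaining capacity; pods would be created only for groups that are InQueue. This is a simplified model under my own assumptions, not the kube-batch or Volcano implementation: the PodGroup fields, the admit function, the group names and the resource figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodGroup:
    name: str
    min_cpu: int      # minimum CPU (millicores) the whole group needs
    min_memory: int   # minimum memory (MiB) the whole group needs
    phase: str = "Pending"


def admit(groups, free_cpu, free_memory):
    """Move each Pending group whose minimum fits into InQueue.

    Groups are visited in list order; a group that does not fit stays
    Pending and later groups may still be admitted.
    Returns the (cpu, memory) capacity left over after admission.
    """
    for group in groups:
        if group.phase != "Pending":
            continue
        if group.min_cpu <= free_cpu and group.min_memory <= free_memory:
            group.phase = "InQueue"   # pods may now be created for this group
            free_cpu -= group.min_cpu
            free_memory -= group.min_memory
    return free_cpu, free_memory


# Hypothetical stages of one Generation, each modeled as a PodGroup.
groups = [
    PodGroup("gen-1-extract", min_cpu=2000, min_memory=4096),
    PodGroup("gen-1-transform", min_cpu=4000, min_memory=8192),
    PodGroup("gen-1-load", min_cpu=1000, min_memory=1024),
]
remaining = admit(groups, free_cpu=4000, free_memory=8192)
print([(g.name, g.phase) for g in groups], remaining)
```

Because the scheduler itself tracks the remaining capacity, no pods exist for a group until it has been admitted, which is the point the design document makes about letting kube-batch manage pod counts and resources rather than relying on quota.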
This, in turn, means we can construct "football"-like plays that pass control from Pod Group to Pod Group. The DSL for the execution of batch operations could handle this.
The vk-scheduler is based on kube-batch and includes enhancements for Volcano; Volcano builds on top of the primitives provided by the kube-batch scheduler.
Since a Generation defines a time-sliced execution of a Synchronization Set, it maps naturally to a "workload" or, in Volcano terms, a JobEx.
So now the flow of our system is taking shape:
- Define SyncSet definitions
- Create Generations based on the SyncSet
- Create DataPonds based on the SyncSet definition
- For each Generation, create a corresponding JobEx
- Bind Pods to nodes based on the Node Priority algorithm, which uses DataPond location as the primary criterion
- Move from stage to stage in Generation execution, bringing up resources "on demand" through the Delayed Pod Creation from kube-batch
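The first steps of this flow can be sketched in Python as a plain data transformation: a SyncSet is expanded into Generations, and each Generation gets one job description that records which DataPonds it prefers and starts out Pending. Everything here (the SyncSet fields, build_plan, the naming scheme and the dictionary keys) is hypothetical and only stands in for the CRDs and the JobEx objects described above.

```python
from dataclasses import dataclass


@dataclass
class SyncSet:
    name: str
    data_ponds: list      # DataPond names this SyncSet defines
    generations: int      # number of time slices to run


def build_plan(sync_set):
    """Expand a SyncSet into one job description per Generation."""
    jobs = []
    for index in range(1, sync_set.generations + 1):
        generation = f"{sync_set.name}-gen-{index}"
        jobs.append({
            "generation": generation,
            "job": f"{generation}-job",                     # stands in for a JobEx
            "preferred_ponds": list(sync_set.data_ponds),   # input to node priority
            "phase": "Pending",                             # pods are created later, on demand
        })
    return jobs


plan = build_plan(SyncSet("orders", ["pond-1", "pond-2"], generations=2))
for job in plan:
    print(job["job"], job["preferred_ponds"], job["phase"])
```

Binding pods to nodes and moving from stage to stage would then be left to the scheduler, using the preferred DataPonds and the Pending state recorded in each entry.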