Friday, November 06, 2020

Processing of hierarchical data in a pod with data locality

In a previous post I presented the idea of "higher" level patterns living in "2nd layer" and beyond code, and I said patterns would emerge in this "2nd layer" - the orchestration layer. The first pattern identified was iteration, with its applicability to time slicing. Now a second pattern appears to be emerging: processing hierarchical data sets with modular logic and data locality.

The ability to cycle through different containers in a pod makes it possible to process hierarchical data sets that don't have homogeneous compute requirements - meaning that more than one type of operation is performed in the chain.

Parent levels could pass data via the pod to their children and vice versa. Processing of the hierarchy could proceed with any generic graph traversal algorithm: breadth first, depth first, dependency graph, etc. Data could be persisted to network storage for high availability but kept locally for speed during processing of the tree. This keeps the application's logic modular, with each container performing a specific task, which in turn helps with versioning and maintenance since changes can be isolated to single container definitions.
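As a concrete sketch, here is roughly what such a pod could look like, expressed with client-go's typed structs; the image names, the volume name, and the choice of a RAM-backed emptyDir are illustrative assumptions, not a prescription:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hierarchyPod sketches a pod whose containers each perform one step of
// the processing chain while sharing a node-local scratch volume, so
// data stays local between steps. Image names are hypothetical.
func hierarchyPod() *corev1.Pod {
	scratch := corev1.VolumeMount{Name: "scratch", MountPath: "/data"}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "hierarchy-worker"},
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{{
				Name: "scratch",
				VolumeSource: corev1.VolumeSource{
					// RAM-backed emptyDir: fast and node-local, but not
					// durable - durable copies go to network storage.
					EmptyDir: &corev1.EmptyDirVolumeSource{Medium: corev1.StorageMediumMemory},
				},
			}},
			Containers: []corev1.Container{
				{Name: "parse", Image: "example/parse:v1", VolumeMounts: []corev1.VolumeMount{scratch}},
				{Name: "transform", Image: "example/transform:v1", VolumeMounts: []corev1.VolumeMount{scratch}},
				{Name: "persist", Image: "example/persist:v1", VolumeMounts: []corev1.VolumeMount{scratch}},
			},
		},
	}
}

func main() {
	fmt.Println(hierarchyPod().Name)
}
```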

In previous posts I discussed an inner and an outer scope for orchestration. I think Macro and Micro are good terms for this concept.

Possible patterns for orderly execution of compute

Macro - this is for groups of pods

  • Cycling data through pods to digest streaming data
    • Circular Queue
  • Graph Execution
    • Depth First - sequential 
    • Breadth First - sequential
    • Graph with dependencies - sequential
  • Fan Out - parallel (a channel-based sketch of fan out / fan in follows this list)
    • Fan In - parallel
  • Mode of communication
    • To System: via Operator Framework
    • From Pod to Pod 
      • Network - (hits the router)
      • File System - network share
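For the parallel patterns above, here is a minimal fan out / fan in sketch built on Go channels; runPod is a hypothetical stand-in for whatever actually launches a pod and waits for its result:

```go
package main

import (
	"fmt"
	"sync"
)

// runPod stands in for launching one pod and waiting for its result;
// in a real operator this would go through the Kubernetes API.
func runPod(name string) string {
	return "result-from-" + name // hypothetical placeholder
}

// fanOutFanIn launches a pod per input in parallel (fan out) and
// collects the results on a single channel (fan in).
func fanOutFanIn(pods []string) []string {
	out := make(chan string, len(pods))
	var wg sync.WaitGroup
	for _, p := range pods {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			out <- runPod(p) // fan out: one goroutine per pod
		}(p)
	}
	wg.Wait()
	close(out)
	var results []string
	for r := range out { // fan in: drain the shared channel
		results = append(results, r)
	}
	return results
}

func main() {
	fmt.Println(fanOutFanIn([]string{"pod-a", "pod-b", "pod-c"}))
}
```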

Micro - this is inside a Pod; moving data through containers in a pod and changing containers as you go

  • Cycling - bring sets of containers in and out of scope via Advanced StatefulSet
    • Circular Queue
  • Graph Execution
    • Depth First 
    • Breadth First
    • Graph with dependencies
    • Fan Out - parallel
    • Fan In - parallel
  • Mode of communication
    • To System -  Kubernetes client 
    • From container to container
      • Network - shared IP 
      • File System - shared scratch volume on local disk or RAM (e.g. a RAM-backed emptyDir)
      • IPC - only available in Micro mode
NOTE: the key thing to observe here is that the entire graph is not created at once... the graph evolves at both the Macro and Micro levels. This is how, for instance, a depth first traversal might not need all of the containers or pods (depending on level) in a call chain active at one time.
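A minimal sketch of that evolving-graph behavior, where activate and release are hypothetical stand-ins for creating and tearing down the pod (Macro) or container (Micro) behind a node:

```go
package main

import "fmt"

// Node is one level of the hierarchy; children are discovered as the
// traversal proceeds, so the full graph never has to exist up front.
type Node struct {
	Name     string
	Children []*Node
}

// activate and release stand in for creating and tearing down the
// compute unit that services one node. Hypothetical placeholders.
func activate(n *Node) { fmt.Println("activate", n.Name) }
func release(n *Node)  { fmt.Println("release ", n.Name) }

// depthFirst keeps only the current root-to-leaf call chain active:
// a node's resources are released as soon as its subtree is done.
func depthFirst(n *Node) {
	activate(n)
	for _, c := range n.Children {
		depthFirst(c)
	}
	release(n)
}

func main() {
	root := &Node{Name: "root", Children: []*Node{
		{Name: "left", Children: []*Node{{Name: "leaf"}}},
		{Name: "right"},
	}}
	depthFirst(root)
}
```

At any moment only the current root-to-leaf chain is active, which is exactly the resource profile described in the note above.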

For the picture below, think of each level as representing the active resources in the system during a particular phase of operations.  Only portions of the entire board are on each level.  In 2D they work as a group, but in 3D you can separate by level.  



On to a discussion of Micro and Macro. The first thing to note is that the Macro level is responsible for keeping data consistent for a set of Micro level orchestrations; one controls the other in a divide-and-conquer type system.

You can see that the API for Macro is pretty much the same as that for Micro, which leads me to think it might be possible to create a generic interface that would be used for both. Both Micro and Macro house meta-logic: code that runs code. In the Macro layer, controlling pods would be done via the Kubernetes API; at the Micro level, changing containers would be done via the out-of-process Kubernetes API client.
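A sketch of what that shared interface could look like; Executor and Step are invented names:

```go
package flow

import "context"

// Step identifies one unit of compute: a pod at the Macro level or a
// container at the Micro level.
type Step struct {
	Name      string
	DependsOn []string
}

// Executor is the hypothetical common interface. The pattern code
// (depth first, breadth first, dependency graph, fan out / fan in)
// would be written once against it, with the Macro and Micro layers
// supplying different implementations behind it.
type Executor interface {
	Start(ctx context.Context, s Step) error // create the pod or container
	Wait(ctx context.Context, s Step) error  // block until it completes
	Stop(ctx context.Context, s Step) error  // tear it down
}
```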

Where does this Macro code live? In an Operator. I would say it needs to be coded in Go, since Go has channels, which are well suited to implementing all of these patterns. Kubernetes infrastructure provides work queues and access to the etcd data store for fault tolerance - keeping the state of the machine at the Macro level.
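A minimal skeleton of where that code would sit, using controller-runtime's Reconcile hook; the FlowReconciler type and the custom resource it would watch are assumptions for illustration:

```go
package flowoperator

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// FlowReconciler would own the Macro level: it reads the desired flow
// from a custom resource and nudges the cluster toward it.
type FlowReconciler struct {
	client.Client
}

func (r *FlowReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Compare the pods that exist with the pods the flow calls for, and
	// take one step toward the desired state. Because that state lives
	// in the API server (backed by etcd), the operator can crash and
	// resume where it left off.
	return ctrl.Result{}, nil
}
```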

Moving into the Micro level, you are at the level of the Pod. But this pod is self-aware: it knows it's part of a Macro structure. The brains of the pod are located in a quarterback sidecar container. The quarterback container knows about the containers beside it and has access to the Kubernetes API; this is where it can define the Micro execution flow of containers.
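A sketch of the quarterback's core move using client-go from inside the pod; it assumes POD_NAME and POD_NAMESPACE are injected via the downward API, and the worker container and image names are made up:

```go
package main

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// advanceFlow swaps the image of the "worker" container in this very
// pod, moving the Micro flow to its next step. A container's image is
// one of the few pod fields Kubernetes allows to be patched in place.
func advanceFlow(ctx context.Context, nextImage string) error {
	cfg, err := rest.InClusterConfig() // quarterback runs in-cluster
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	patch := []byte(`{"spec":{"containers":[{"name":"worker","image":"` + nextImage + `"}]}}`)
	_, err = clientset.CoreV1().
		Pods(os.Getenv("POD_NAMESPACE")).
		Patch(ctx, os.Getenv("POD_NAME"), types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	if err := advanceFlow(context.Background(), "example/next-step:v1"); err != nil {
		panic(err)
	}
}
```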

It might be possible for the "pattern" code behind both the Macro and Micro APIs to share common sections. The "pattern" logic is the code implementing the depth first algorithm, the breadth first algorithm, etc.; it could take the form of a flow language or API. One interesting consequence of having this flow language / API is that it could be generated or programmed by some higher level code. If that is possible, then entire systems could self-bootstrap from data.
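To make that concrete, a flow could be nothing more than data that higher level code emits and the generic executor interprets; the schema here is an invented example:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlowSpec is a flow expressed as data rather than code; a higher level
// system could generate it and hand it to the generic executor. The
// field names are hypothetical.
type FlowSpec struct {
	Pattern string `json:"pattern"` // "depth-first", "fan-out", ...
	Steps   []struct {
		Name      string   `json:"name"`
		DependsOn []string `json:"dependsOn,omitempty"`
	} `json:"steps"`
}

func main() {
	raw := []byte(`{
	  "pattern": "dependency-graph",
	  "steps": [
	    {"name": "parse"},
	    {"name": "transform", "dependsOn": ["parse"]},
	    {"name": "persist", "dependsOn": ["transform"]}
	  ]}`)
	var f FlowSpec
	if err := json.Unmarshal(raw, &f); err != nil {
		panic(err)
	}
	fmt.Printf("%s flow with %d steps\n", f.Pattern, len(f.Steps))
}
```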

This would lead to the next level. It would be interesting to see whether both of these layers could be abstracted further: if these very structures (sections of Micro and Macro logic) could themselves be orchestrated, we would start to get higher level power. This was mentioned in earlier posts as a progression to higher and higher levels of aggregation.

