Friday, September 06, 2019

Expanding on the idea of batch processing




In previous posts I have explored how sets of Kubernetes services could be partitioned into "time slices" that provide a static calculation context for complex problems.  A "time slice," or generation, is a set of Kubernetes services bound to a particular set of data.  Some of my previous work at Verizon Wireless involved Spring Batch, and I am realizing that this new Kubernetes pattern could provide the basis for something similar to Spring Batch.

In Spring Batch, you have one Item Processor per "Job," and each Job is provided with its own set of data via an Item Reader.  Output is handled by an Item Writer.
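The reader/processor/writer split can be sketched with simplified analogues of the Spring Batch interfaces. These are not the real `org.springframework.batch.item` signatures (the actual ones are richer, e.g. the writer receives a chunk wrapper), just a minimal illustration of the contract:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified analogues of Spring Batch's ItemReader, ItemProcessor and
// ItemWriter; illustrative only, not the actual Spring Batch signatures.
interface Reader<T> { T read(); }                 // returns null when input is exhausted
interface Processor<I, O> { O process(I item); }
interface Writer<T> { void write(List<T> items); }

class MiniJob {
    // One "Job" run: read until the reader is exhausted, process each item,
    // then hand the whole chunk to the writer.  Returns the processed items.
    static <I, O> List<O> run(Reader<I> reader, Processor<I, O> processor, Writer<O> writer) {
        List<O> chunk = new ArrayList<>();
        for (I item = reader.read(); item != null; item = reader.read()) {
            chunk.add(processor.process(item));
        }
        writer.write(chunk);
        return chunk;
    }

    public static void main(String[] args) {
        Iterator<String> source = List.of("alpha", "beta").iterator();
        List<String> sink = new ArrayList<>();
        run(() -> source.hasNext() ? source.next() : null,
            String::toUpperCase,
            sink::addAll);
        System.out.println(sink); // prints [ALPHA, BETA]
    }
}
```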

In designing this new Kubernetes-oriented batch system, I think Spring Batch provides a clue as to which abstractions will prove to be useful.  Indeed, we will need readers, processors and writers.

The Item Processor is described above.  As for the reader, synchronized sets should subscribe to Kafka topics or a similar message pipeline.  As for the writer, results should be published on a similar message pipeline.
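A rough sketch of that reader/writer contract, with an in-memory queue standing in for a Kafka topic so it runs without a broker. In a real deployment the reader would wrap a KafkaConsumer subscribed to the input topic and the writer a KafkaProducer publishing results; the class names here are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// In-memory stand-in for a Kafka topic (illustrative, not a Kafka API).
class Topic {
    private final Queue<String> messages = new ArrayDeque<>();
    void publish(String msg) { messages.add(msg); }
    String poll() { return messages.poll(); }     // null when the topic is empty
}

class PipelineDemo {
    public static void main(String[] args) {
        Topic input = new Topic();
        Topic output = new Topic();
        input.publish("record-1");

        // Reader side: consume the next item from the input pipeline.
        String item = input.poll();
        // Writer side: publish the processed result on the output pipeline.
        output.publish(item.toUpperCase());

        System.out.println(output.poll()); // prints RECORD-1
    }
}
```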

Interfaces should be created to represent these base abstractions, and Kubernetes CRDs should be created to represent the items that will exist in Kubernetes.
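One hypothetical shape such a CRD could take, modeled here as plain Java classes. A real CRD would be registered with the API server via an apiextensions.k8s.io/v1 manifest, and these fields would live under each object's .spec; the names are my own illustration, not an existing API:

```java
// Hypothetical model of a "Generation" custom resource (names are illustrative).
class GenerationSpec {
    String inputTopic;    // message pipeline the generation's reader subscribes to
    String outputTopic;   // pipeline its writer publishes results to
    String checkpointId;  // cutoff point for data in the stream
}

class GenerationResource {
    String name;          // would map to metadata.name
    final GenerationSpec spec = new GenerationSpec();
}
```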

In order to drive this engine, some type of partitioner will be needed, similar to the one in Spring Batch.  It is envisioned that checkpoint IDs will be passed to each Generation, identifying the cutoff point for data in a stream.  Data will flow into the Generation as a stream, but processing will take place only on data that arrives before the checkpoint.  The partitioner will be in charge of this checkpointing operation and will keep track of each Generation's current processing status.
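The partitioner's bookkeeping described above might look something like this. It is a minimal sketch under my own assumptions (names and the status states are hypothetical): the partitioner hands each Generation a checkpoint ID marking the stream cutoff and tracks that Generation's processing status:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the partitioner: issues checkpoint IDs to
// Generations and tracks each Generation's processing status.
class Partitioner {
    enum Status { PENDING, PROCESSING, COMPLETE }

    private final Map<String, String> checkpoints = new HashMap<>();
    private final Map<String, Status> statuses = new HashMap<>();

    // Hand a Generation its cutoff point; data arriving after this ID is
    // ignored until the next checkpoint is issued.
    void checkpoint(String generation, String checkpointId) {
        checkpoints.put(generation, checkpointId);
        statuses.put(generation, Status.PENDING);
    }

    void markProcessing(String generation) { statuses.put(generation, Status.PROCESSING); }
    void markComplete(String generation)   { statuses.put(generation, Status.COMPLETE); }

    String checkpointOf(String generation) { return checkpoints.get(generation); }
    Status statusOf(String generation)     { return statuses.get(generation); }
}
```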
