Saturday, October 05, 2019

DSL for execution of batch operations (Part 2)


Sharing my notes on data flow and data ponds:


This article describes patterns that could be used in an engine that generates pipelines in Go.

The engine would read a DSL that describes data flows. (The first basic pattern to implement is fan-out.)
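A minimal sketch of what such a DSL might deserialize into on the Go side; all type and field names are hypothetical, and the struct tags assume a YAML decoder such as gopkg.in/yaml.v3:

```go
package dsl

// Flow is a named, directed set of steps read from a DSL file.
type Flow struct {
	Name  string `yaml:"name"`
	Steps []Step `yaml:"steps"`
}

// Step is one node in the flow: the container that performs the work,
// plus the names of the downstream steps that receive its output
// (fan-out when there is more than one).
type Step struct {
	Name      string   `yaml:"name"`
	Container string   `yaml:"container"`  // container image or service address
	FanOutTo  []string `yaml:"fan_out_to"` // downstream step names
}
```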

Once this DSL is created (it would describe a flow of operations), it will be possible to tie into the Generation / Sync set ecosystem described in earlier blog posts.

So fan-out in terms of containers: each node in the flow is a container with simple input and output.

Execution would be a flow through a directed graph of nodes wired together with Go channels.
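As a rough illustration of that channel wiring, here is a broadcast-style fan-out in plain Go; the []byte message type and unbuffered channels are assumptions:

```go
package dsl

// fanOut copies every message arriving on in to each of the n
// downstream node channels, and closes them once the upstream drains.
func fanOut(in <-chan []byte, n int) []chan []byte {
	outs := make([]chan []byte, n)
	for i := range outs {
		outs[i] = make(chan []byte)
	}
	go func() {
		defer func() {
			for _, out := range outs {
				close(out)
			}
		}()
		for msg := range in {
			// Broadcast each message to every downstream node.
			for _, out := range outs {
				out <- msg
			}
		}
	}()
	return outs
}
```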

Instead of calling methods inline, signals on channels would result in method calls into local containers.

As the flow of execution moves through the path of channels generated from the DSL, “methods” are executed not inline but as gRPC calls between containers.
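A sketch of a single node in that graph, assuming a hypothetical Invoker interface that would in practice be backed by a generated gRPC client stub: each signal on the input channel becomes a remote call, and the result is forwarded downstream.

```go
package dsl

import "context"

// Invoker abstracts the remote call to a worker container. In practice
// this would be a generated gRPC client; the name and signature here
// are assumptions for illustration.
type Invoker interface {
	Invoke(ctx context.Context, input []byte) (output []byte, err error)
}

// runNode executes one node of the flow: every message on in triggers a
// remote call instead of an inline method, and the result goes to out.
func runNode(ctx context.Context, worker Invoker, in <-chan []byte, out chan<- []byte) error {
	defer close(out)
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case msg, ok := <-in:
			if !ok {
				return nil // upstream finished
			}
			result, err := worker.Invoke(ctx, msg)
			if err != nil {
				return err
			}
			out <- result
		}
	}
}
```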

A Quarterback container in the pod would orchestrate the flow with other containers.

It would be the Quarterback object that reads the DSL and processes the data flow; it would house the set of channel calls.
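Building on the Flow, Invoker, and runNode sketches above, the Quarterback might look roughly like this. The wiring is illustrative only: it assumes the first listed step is the entry point and that every other step has exactly one upstream producer.

```go
package dsl

import "context"

// Quarterback houses the channel plumbing described by the DSL; it never
// does the numerical work itself.
type Quarterback struct {
	flow    Flow
	workers map[string]Invoker // step name -> gRPC-backed worker client
}

// Run gives every step its own input channel, starts one goroutine per
// node, and copies each node's output to the inputs of its fan-out targets.
func (q *Quarterback) Run(ctx context.Context, source <-chan []byte) error {
	inputs := make(map[string]chan []byte, len(q.flow.Steps))
	for _, s := range q.flow.Steps {
		inputs[s.Name] = make(chan []byte)
	}
	errs := make(chan error, len(q.flow.Steps))

	for _, step := range q.flow.Steps {
		step := step
		out := make(chan []byte)
		go func() { errs <- runNode(ctx, q.workers[step.Name], inputs[step.Name], out) }()
		go func() {
			// One upstream producer per target, so closing here is safe.
			defer func() {
				for _, target := range step.FanOutTo {
					close(inputs[target])
				}
			}()
			for msg := range out {
				for _, target := range step.FanOutTo {
					select {
					case inputs[target] <- msg:
					case <-ctx.Done():
						return
					}
				}
			}
		}()
	}

	// Feed the entry step from the external source.
	go func() {
		defer close(inputs[q.flow.Steps[0].Name])
		for msg := range source {
			inputs[q.flow.Steps[0].Name] <- msg
		}
	}()

	// Wait for every node; report the first error, if any.
	var firstErr error
	for range q.flow.Steps {
		if err := <-errs; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```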

Cancellation would be supported via a callback to the master (Quarterback) container on errors or timeouts in the channels.
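A generic sketch of that cancellation path using Go's context package: workers report errors back to the master over a channel, and the master cancels the shared context, which every select in the flow observes via ctx.Done().

```go
package dsl

import (
	"context"
	"time"
)

// superviseFlow wraps the whole flow in a cancellable context with a
// deadline. A worker error triggers cancellation, which tears down all
// other nodes blocked on channel sends or receives.
func superviseFlow(parent context.Context, timeout time.Duration, workerErrs <-chan error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	select {
	case err := <-workerErrs:
		cancel() // error callback from a worker: shut down the rest of the flow
		return err
	case <-ctx.Done():
		return ctx.Err() // overall timeout for the flow
	}
}
```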

A basic gRPC wrapper is needed for this: Input and Output. Every container would expose this gRPC API.
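A sketch of that uniform Input/Output contract as a Go interface. In practice it would be generated from a small .proto with a single RPC; all the names here are assumptions rather than an existing API.

```go
package dsl

import "context"

// Input is the opaque request handed to a worker container.
type Input struct {
	StepName string // which node of the flow is being executed
	Payload  []byte // data handed to the worker
}

// Output is the result handed back to the Quarterback / next node.
type Output struct {
	Payload []byte
}

// Worker is the single-method contract each container implements behind
// its gRPC server.
type Worker interface {
	Execute(ctx context.Context, in *Input) (*Output, error)
}
```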

Containers in the same pod can share scratch space (local storage is very fast), and a local MemSQL database would be ready as the data source within the pod.

So a family of containers would exist...

These containers would share the same file system (“the land surrounding the pond”) and be co-located on the same node as a MemSQL database (the “data pond”) that is populated by an upstream data provider (the “data lake”, another MemSQL db). Operations could pull data out of the pond and share data on the land (the same filesystem).
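A sketch of how a worker might open both local data paths: the shared scratch filesystem and the pod-local MemSQL database, which speaks the MySQL wire protocol, so a standard MySQL driver can be used. The DSN and scratch directory are placeholders.

```go
package dsl

import (
	"database/sql"
	"os"
	"path/filepath"

	_ "github.com/go-sql-driver/mysql" // MemSQL is MySQL wire-protocol compatible
)

// openLocalResources prepares the shared scratch space ("the land") and
// connects to the pod-local MemSQL "data pond".
func openLocalResources(scratchDir, dsn string) (*sql.DB, error) {
	// Shared scratch space on the pod's filesystem.
	if err := os.MkdirAll(filepath.Join(scratchDir, "results"), 0o755); err != nil {
		return nil, err
	}
	// Pod-local MemSQL, e.g. dsn = "user:pass@tcp(127.0.0.1:3306)/pond".
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	return db, db.Ping()
}
```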

Each one of these containers would be a “scratch pad” for some type of numerical analysis.  It would take input and generate output.

Output could be aggregated into full solutions in a divide-and-conquer style of processing.
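The aggregation step could be as simple as folding partial outputs from a channel into one result; the merge function here is a placeholder.

```go
package dsl

// aggregate folds partial worker outputs into a full solution.
func aggregate(partials <-chan []byte, merge func(acc, part []byte) []byte) []byte {
	var full []byte
	for part := range partials {
		full = merge(full, part)
	}
	return full
}
```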

Note: in this data flow, containers could be called more than once. This would permit a "continuation" style of processing, where half-completed operations could be halted while waiting for a signal from another container further down the processing chain. This would mean data structures could stay in memory and would not need to be loaded again. This could be implemented via a flow of channels inside the worker containers.
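A minimal sketch of that continuation idea inside a worker: the expensive in-memory state is built once, then the goroutine parks on a resume channel until a container further down the chain signals that the rest of the work can proceed. All names are hypothetical.

```go
package dsl

import "context"

// continuableWork does the first half of an operation, keeps its state
// resident, and waits for a resume signal before finishing.
func continuableWork(ctx context.Context, input []byte, resume <-chan []byte) ([]byte, error) {
	// Phase 1: build in-memory state from the input (loaded only once).
	state := buildState(input)

	// Park here; the state stays in memory instead of being reloaded later.
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case extra := <-resume:
		// Phase 2: finish using the signal/data sent back up the chain.
		return finish(state, extra), nil
	}
}

// buildState and finish stand in for the real numerical work.
func buildState(input []byte) []byte    { return append([]byte(nil), input...) }
func finish(state, extra []byte) []byte { return append(state, extra...) }
```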



