Sharing my notes on data flow and data ponds:
This article has patterns in it that could be used in an
engine that generates pipelines in Go.
The engine would read a DSL that describes data flows.
(The first basic pattern to implement is fan-out.)
Once this DSL is created (it would describe a flow of operations),
it will be possible to tie into the Generation / Sync set ecosystem described in earlier blog posts.
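As a rough illustration, the parsed DSL might land in structs along these lines (the type names and fields are just illustrative, not a committed format):

```go
package flow

// Hypothetical in-memory form of the DSL after parsing. A Flow is a
// directed graph: each Node names a container-backed operation, and each
// Edge wires an upstream node's output to a downstream node's input.
// A node with several outgoing edges is a fan-out.
type Node struct {
	Name      string // logical name of the operation
	Container string // image or service that implements it
}

type Edge struct {
	From string // upstream node name
	To   string // downstream node name
}

type Flow struct {
	Nodes []Node
	Edges []Edge
}
```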
So, fan-out in terms of containers: each node in
the flow is a container with a simple input and output.
Execution will flow through a directed graph whose
nodes are connected by Go channels.
Instead of calling methods inline, signals on the
channels would result in method calls into local containers.
As execution moves along the path of channels
generated from the DSL, “methods” are executed not inline but as gRPC calls
between containers.
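A minimal sketch of one node in that graph, assuming a hypothetical OperationClient with a single Execute call: the goroutine waits for a signal on its input channel, makes the remote call, and fans the result out to its downstream channels.

```go
package flow

import "context"

// OperationClient is a hypothetical gRPC client for one worker container.
type OperationClient interface {
	Execute(ctx context.Context, payload []byte) ([]byte, error)
}

// runNode turns each signal on `in` into a gRPC call to the node's
// container and fans the result out to every downstream channel.
func runNode(ctx context.Context, client OperationClient, in <-chan []byte, outs []chan<- []byte) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case payload, ok := <-in:
			if !ok {
				return nil // upstream closed: this node is done
			}
			result, err := client.Execute(ctx, payload)
			if err != nil {
				return err // surfaces to the Quarterback, which cancels the flow
			}
			for _, out := range outs {
				out <- result // fan-out to all downstream nodes
			}
		}
	}
}
```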
A Quarterback container in the pod would orchestrate the flow with
the other containers.
It would be the Quarterback object that reads the DSL and
drives the data flow; it would house the set of channel calls.
Cancellation would be supported via a callback to the master container
on errors or timeouts in the channels.
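That cancellation path maps naturally onto context.Context and errgroup; here is a sketch, assuming the Quarterback simply runs every node under a shared context so the first error or a flow-level timeout cancels the rest (the deadline is made up):

```go
package flow

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// orchestrate is a sketch of the Quarterback's run loop: each node runs in
// its own goroutine under a shared context, and the first error or the
// overall deadline cancels all of the others.
func orchestrate(parent context.Context, nodes []func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Minute) // illustrative flow deadline
	defer cancel()

	g, ctx := errgroup.WithContext(ctx)
	for _, run := range nodes {
		run := run
		g.Go(func() error { return run(ctx) })
	}
	return g.Wait() // first non-nil error cancels ctx for every other node
}
```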
A basic gRPC wrapper is needed for this: Input and Output.
Every container would expose this gRPC API.
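The service surface could be as small as a single Execute method. In Go terms the server side of that wrapper might look roughly like this (the message and interface names are hypothetical, not generated code):

```go
package flow

import "context"

// Hypothetical request/response messages for the common worker API.
type Input struct {
	Payload []byte
}

type Output struct {
	Payload []byte
}

// OperationServer is the one gRPC service every worker container would
// implement: take an Input, do the work, return an Output.
type OperationServer interface {
	Execute(ctx context.Context, in *Input) (*Output, error)
}
```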
Containers in the same pod can share scratch space (local
storage is very fast), and a local MemSQL db is also ready as the data source in
the pod.
So a family of containers would exist...
These containers would share the same file system (“the land
surrounding the pond”) and be collocated on the same node as a MemSQL database (the
“data pond”) that is populated by an upstream data provider (the “data lake”, another
MemSQL db). Operations could pull data out of the pond and share data on
the land (the same filesystem).
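Since MemSQL speaks the MySQL wire protocol, a worker could pull from the pond with the standard database/sql package and drop intermediate results onto the shared filesystem. A sketch, with the DSN, query, and scratch path all made up for illustration:

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql" // MemSQL is MySQL wire compatible
)

func main() {
	// Hypothetical DSN for the pod-local MemSQL "data pond".
	db, err := sql.Open("mysql", "worker:secret@tcp(127.0.0.1:3306)/pond")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pull a value out of the pond (table and column are illustrative).
	var blob []byte
	if err := db.QueryRow("SELECT payload FROM inputs WHERE id = 1").Scan(&blob); err != nil {
		log.Fatal(err)
	}

	// Share the intermediate result on "the land": the filesystem the
	// pod's containers have in common (the path is illustrative).
	if err := os.WriteFile("/scratch/inputs-1.bin", blob, 0o644); err != nil {
		log.Fatal(err)
	}
}
```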
Each one of these containers would be a “scratch pad” for
some type of numerical analysis: it would take input and generate output.
Outputs could be aggregated into full solutions in a divide-and-conquer
style of processing.
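On the aggregation side, the Quarterback (or a dedicated fan-in node) could simply drain every worker's output channel and merge the pieces; a small sketch with a placeholder merge step:

```go
package flow

// aggregate fans in the partial outputs from a set of worker channels and
// folds them into one combined solution (the merge itself is a placeholder
// for whatever a real divide-and-conquer combine step would do).
func aggregate(parts []<-chan []byte) []byte {
	var solution []byte
	for _, ch := range parts {
		for piece := range ch { // each channel is closed by its worker
			solution = append(solution, piece...)
		}
	}
	return solution
}
```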
Note: in this data flow, containers could be called more than once. This would permit a "continuation" style of processing, where half-completed operations could be paused while waiting for a signal from another container further down the processing chain. This would mean data structures could stay in memory and would not need to be loaded again. It could be implemented via a flow of channels inside the worker containers.
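A minimal sketch of that continuation idea inside a worker, assuming a hypothetical resume channel that a later container (or the Quarterback) signals when the paused work should finish; the expensive in-memory state is built once and reused:

```go
package flow

import "context"

// continuationWorker does its expensive load once, emits a partial result,
// then parks with its state still in memory until a resume signal arrives.
func continuationWorker(ctx context.Context, resume <-chan struct{}, partials, finals chan<- []byte) error {
	state := loadExpensiveState() // stays resident across both phases

	partials <- firstPass(state) // phase one: emit a partial result

	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-resume: // signal from further down the processing chain
	}

	finals <- secondPass(state) // phase two: no reload needed
	return nil
}

// The helpers below stand in for whatever numerical analysis a real
// worker would perform.
func loadExpensiveState() []float64 { return make([]float64, 1<<20) }
func firstPass(s []float64) []byte  { return []byte("partial") }
func secondPass(s []float64) []byte { return []byte("final") }
```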