Monday, January 20, 2020

The impact of Envoy and the path to "type safety" in data

When discussing the reasoning behind creating Envoy, Matt Klein states:

The network should be transparent to applications. 

https://youtu.be/gQF23Vw0keg?t=90 


I think this is interesting because it not only simplifies many engineering problems but also opens the door to new ways of arranging compute resources to slice and dice problem sets.

Having a layer in your architecture where networking concerns can be managed orthogonally to application concerns presents developers with a new pathway to exploit.  What if an uber language existed that could take advantage of this control?  What would that, in turn, mean for developers?

The same tools used to abstract out networking concerns and streamline microservice development could be used to break bigger problems into smaller ones.

Questions arise such as:

Why only have 1 set of long-lived services when you can have sets of windowed versions of these services?

What impact would this have on the ability to parallelize compute efforts?

Is there such a thing as "type safety" for data?

When a set of services can be guaranteed to be working off the same base data, what does that mean for application logic?

Maybe it can be said that mutable data should be transparent to (streaming) applications?  Meaning that applications don't need to be aware of mutation - they can be written assuming data immutability.  Lower-level substrates (the windowing process of time-sliced universes) should enable a solution to scale without changing code: an infrastructure-level change should control the degree of parallelism.


Is there any evidence that this might be a good idea?  

One source might actually be the same talk; Matt mentions a strategy used by Envoy internally called Read-Copy-Update: it allows concurrency between multiple readers through the use of temporary pointers to the current data.

As per Wikipedia:

Read-copy-update allows multiple threads to efficiently read from shared memory by deferring updates after pre-existing reads to a later time while simultaneously marking the data, ensuring new readers will read the updated data. This makes all readers proceed as if there were no synchronization involved, hence they will be fast, but also making updates more difficult.
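
To make the idea concrete, here is a minimal Python sketch of the read-copy-update pattern (my own illustration, not Envoy's actual C++ implementation): readers simply grab a reference to the current immutable value, while a writer builds a new copy and swaps the pointer in.

import threading

class RcuCell:
    # A minimal read-copy-update-style cell: readers take the current
    # immutable value; a writer copies, modifies, and swaps it in.
    def __init__(self, value):
        self._value = value                  # always treated as immutable
        self._write_lock = threading.Lock()

    def read(self):
        # Readers never block: grabbing a reference to the current object
        # is atomic, and the object itself is never mutated.
        return self._value

    def update(self, update_fn):
        # Writers serialize among themselves: copy, modify, then publish.
        with self._write_lock:
            self._value = update_fn(self._value)

routes = RcuCell({"svc-a": "10.0.0.1"})
snapshot = routes.read()                                 # old readers keep this view
routes.update(lambda old: {**old, "svc-b": "10.0.0.2"})
print(snapshot)        # {'svc-a': '10.0.0.1'}
print(routes.read())   # {'svc-a': '10.0.0.1', 'svc-b': '10.0.0.2'}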

A similar idea is to use an append-only system to allow multiple concurrent updates and readers.  This is done by never updating rows in place but instead storing new versions as new rows (old versions are purged once they move out of the processing window).
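
A rough sketch of that append-only idea, using a hypothetical quotes table in SQLite (the table and column names are made up for illustration): writers only ever insert new versions, readers select the latest version at or before their timestamp, and a purge step drops versions that have fallen out of the processing window.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quotes (
        symbol   TEXT,
        price    REAL,
        valid_at REAL   -- event timestamp; rows are never updated in place
    )
""")

def write_quote(symbol, price, ts):
    # Append-only: a new version is inserted, existing rows are untouched.
    conn.execute("INSERT INTO quotes VALUES (?, ?, ?)", (symbol, price, ts))

def read_quote(symbol, as_of):
    # Readers pick the latest version at or before their window's timestamp.
    row = conn.execute(
        "SELECT price FROM quotes WHERE symbol = ? AND valid_at <= ? "
        "ORDER BY valid_at DESC LIMIT 1", (symbol, as_of)).fetchone()
    return row[0] if row else None

def purge_before(cutoff):
    # Old versions are dropped once they fall out of the processing window.
    conn.execute("DELETE FROM quotes WHERE valid_at < ?", (cutoff,))

write_quote("ABC", 10.0, ts=100)
write_quote("ABC", 10.5, ts=110)      # new version; the old row is untouched
print(read_quote("ABC", as_of=105))   # 10.0 - the value that existed at t=105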

In a snapshotting system, instead of having pointers, each time-sliced environment gets its own immutable version of the data.  This achieves the same data isolation, but instead of changing pointers to individual items, we take snapshots of many items at once via a select on the database that gets the current values of all relevant items.  There is a size penalty, but it can be reduced by purging old records.
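
Building on the hypothetical quotes table above, a snapshot for one time slice could be nothing more than a single select that freezes the latest value of every relevant item as of one timestamp:

def take_snapshot(conn, as_of):
    # One SELECT captures the latest value of every item as of a single
    # timestamp, giving this time slice its own immutable view of the data.
    rows = conn.execute(
        "SELECT symbol, price FROM quotes AS q "
        "WHERE valid_at = (SELECT MAX(valid_at) FROM quotes "
        "                  WHERE symbol = q.symbol AND valid_at <= ?)",
        (as_of,)).fetchall()
    return dict(rows)   # a plain dict, frozen in time, handed to the calculation

# Each time-sliced instance works off its own frozen view, e.g.:
# snapshot_t0 = take_snapshot(conn, as_of=100)
# snapshot_t1 = take_snapshot(conn, as_of=110)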

If configured correctly, a service should be able to provide either a single recurring set of answers for a set of calculations or MANY time-shifted answers.  The application itself should not have this logic.

Time for all calculations in the set:  100 sec.
Customer requirements are for updates every 10 sec.
The solution:  have 10 copies of the service running:  t, t+10, t+20, etc.

The update speed should be configurable, and the application logic should not need to change, or even be aware of, the other parallel instances.
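
As a sketch of how that configuration could be derived (just the arithmetic, not any particular framework's API): the number of staggered copies is the calculation time divided by the desired update interval, and each copy gets its own start offset.

import math

def schedule(calc_seconds, update_interval_seconds):
    # Number of staggered copies needed so that a fresh answer lands every
    # update interval even though a single pass takes calc_seconds.
    copies = math.ceil(calc_seconds / update_interval_seconds)
    offsets = [i * update_interval_seconds for i in range(copies)]
    return copies, offsets

# 100 sec of calculations with updates wanted every 10 sec -> 10 copies
# starting at t, t+10, t+20, ... t+90.
print(schedule(100, 10))   # (10, [0, 10, 20, 30, 40, 50, 60, 70, 80, 90])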

Update 2/3:  In doing more research on this, I came across this information:

https://stackoverflow.com/questions/39715803/what-is-the-difference-between-mini-batch-vs-real-time-streaming-in-practice-no

Perhaps I need to look into Apache Flink/Beam for real-time streaming?  More to come on this.  We need to make sure we are aiming at the right target here.  This started out as an exercise in how to use Kubernetes effectively, and I'm not sure the two paths are compatible at this point.  I think the path to doing mini-batches is clear, but I'm not sure what a Kubernetes equivalent of real-time streaming would be.

One thing that had concerned me about real-time at the start of my research was the consistency of input data.  If some values in a data set move faster than others, then we can update some of them more frequently, but what does that do to data consistency?  Does the result of a calculation have to represent real input values that actually existed at some time in the past?  I think it should.  Snapshotting handles this.  When you have a single stream of data you don't need to worry about this, but when you have a complex data set I think it may matter.

Calc1 takes 1 min to calculate
Calc2 takes 10 seconds to calculate

What combinations of answers are REAL?  Is the result from the first invocation of Calc1 together with the 3rd invocation of Calc2 a valid combination?  Technically these two calculations are not guaranteed to be using input data from the same time period.  The 3rd execution of Calc2 is using data from a different time than the slower first execution of Calc1.

It could be stated that the only REAL answers are those that represent inputs that are consistent on a time basis for all involved calculations.  This would make the calculations repeatable.  If input data were allowed to be based on the computation time of any given iteration, then across multiple runs it is possible, actually probable, that the answers would be different.
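
One hypothetical way to enforce this (the names below are illustrative, not an existing framework's API): tag every result with the snapshot time of its inputs and refuse to combine results whose tags differ.

from collections import namedtuple

# Every calculation result carries the snapshot time of the inputs it saw.
Result = namedtuple("Result", ["name", "value", "snapshot_ts"])

def combine(results):
    # Only combinations whose inputs share a single snapshot time are "REAL":
    # they are repeatable because every calculation saw the same frozen data.
    timestamps = {r.snapshot_ts for r in results}
    if len(timestamps) != 1:
        raise ValueError(f"inconsistent input times: {sorted(timestamps)}")
    return {r.name: r.value for r in results}

calc1 = Result("Calc1", 42.0, snapshot_ts=1000)   # slow calc, 1st run
calc2 = Result("Calc2", 7.5, snapshot_ts=1030)    # 3rd run of the fast calc
# combine([calc1, calc2]) raises: these two answers never coexisted in time.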

Something to think about, at least.  On the one hand, you have data consistency and repeatability; on the other hand, you are only updating at the speed of your slowest calculation.  But this is where having multiple instances executing at once helps: if you have 5 sets of calculations happening at once, you can get updates 5 times faster.  Note that the lag is still the lag of the slowest calculation.

Update  2/4

As it turns out, there is research confirming what I stated in the 2/3 update above.

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

This article states:

if you care about both correctness and the context within which events actually occurred, you must analyze data relative to their inherent event times, not the processing time at which they are encountered during the analysis itself.

Note: The article also mentions windowing as a method to ensure correctness.
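
For reference, here is a toy Python sketch of the event-time idea (far simpler than what Flink or Beam actually do): events are bucketed by when they occurred, so a late arrival still lands in the window it belongs to.

def event_time_windows(events, window_seconds):
    # Group events by the time they occurred (event time), not by the time
    # the pipeline happens to process them.
    windows = {}
    for ts, value in events:
        start = ts - (ts % window_seconds)
        windows.setdefault(start, []).append(value)
    return windows

# The third event arrives late but still lands in the window where it occurred.
events = [(101, "a"), (109, "b"), (103, "late"), (112, "c")]
print(event_time_windows(events, 10))   # {100: ['a', 'b', 'late'], 110: ['c']}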
