Sunday, April 29, 2018

Backpressure and wasted compute cycles

I have been reading through the various examples of how to use Akka Streams to process data with backpressure, and I am finding that our particular use case might demand something beyond what the Akka streaming API provides.

It seems like the streaming API will let you filter which events make it into a buffer of events to be processed by a consumer, but it ends up directing your compute cycles to those events in an order that does not prioritize the most recent event.  This can result in what I am calling wasted compute cycles.

Assumption: if you have a queue of size X and apply backpressure rules to make sure the buffer is not expanded beyond capacity, you end up with an ordered data structure that processes older data first.
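For reference, here is a minimal sketch of that arrangement in Akka Streams (the event feed, buffer size, timings, and parallelism are all made up for illustration). The bounded buffer backpressures the source when it fills up, and the downstream stage consumes elements strictly in arrival order, oldest first:

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.Source

import scala.concurrent.Future

object FifoBufferDemo extends App {
  implicit val system       = ActorSystem("fifo-demo")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // Stand-in for the real calculation; each Int plays the role of a snapshot.
  def expensiveCalculation(snapshot: Int): Future[Int] =
    Future { Thread.sleep(100); snapshot }

  Source(1 to 300)                                   // the event feed
    .buffer(16, OverflowStrategy.backpressure)       // bounded FIFO queue of size X = 16
    .mapAsync(parallelism = 4)(expensiveCalculation) // 4 workers, fed oldest-first
    .runForeach(result => println(s"processed snapshot $result"))
}
```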

My question is: why waste time processing old data first?  Suppose your data feed delivers Y events in the time it takes to make a single calculation (and each event is a cumulative snapshot of state), the average time for a single calculation is C, and the number of workers you have available to process data is P.  Then you can figure out the optimal spacing between the events you actually respond to and avoid wasted work.

This is to say that, given a set of continuously changing values, wasted work can be defined as dedicating processing time to snapshots that are effectively very similar to a previous processing effort, at the expense of ignoring data that is substantially different within the same processing window.  You should select the events you respond to based on optimal spacing, not on a first-in-first-out buffer or on the mere availability of resources to process an event.

If you define the following quantities:

Y = average number of events arriving during the time it takes for a single calculation to process one snapshot of the existing data

C = average time to process a calculation

P = number of processes available to process events

Then the optimal spacing between events (this determines which event should trigger a calculation) can be expressed as

Y / P = optimal number of events to ignore until an event is eligible to be considered for processing

or

C / P = optimal time span between allowing an event to trigger processing
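To make that concrete with illustrative numbers (nothing here is measured; Y, C, and P are just assumed values):

```scala
// Illustrative numbers, not measurements.
val Y = 40.0   // events arriving during one calculation window
val C = 0.200  // seconds per calculation
val P = 4      // workers processing in parallel

val eventsToSkip = Y / P  // 10.0  -> act on every 10th event
val timeSpacing  = C / P  // 0.05  -> one calculation every 50 ms

// Sanity check that the two rules agree: events arrive at
// Y / C = 200 events/sec, and one calculation every 50 ms means
// 20 triggered calculations/sec, i.e. 1 event in 10 is acted on.
```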

If you filter the events according to these formulas (right now I am thinking both equations could apply), then you will not be wasting time on sequential data that has merely accumulated in a buffer.

The main problem I see with appending sequential data to a buffer and processing off that buffer is that you are not triggering calculations with the most recent data; you are triggering off the oldest values in the buffer.
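As a sketch of what the count-based rule (the Y / P formula) might look like as a stream stage, assuming eventsToSkip has already been computed elsewhere; the operators here are stock Akka Streams stages, but the overall design is just a guess at this point:

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Source}

// Keep only every Nth event so workers see freshly spaced snapshots
// instead of draining a stale first-in-first-out backlog.
def spacedEvents[T](eventsToSkip: Long): Flow[T, T, NotUsed] =
  Flow[T].zipWithIndex
    .collect { case (event, idx) if idx % eventsToSkip == 0 => event }

val feed: Source[Int, NotUsed] = Source(1 to 300)
val spaced: Source[Int, NotUsed] = feed.via(spacedEvents(10)) // emits events 1, 11, 21, ...
```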

This will have to be proven... but I see the potential for a pool of workers to get into a state where multiple workers are triggered to work on data with less than the optimal amount of spacing, just because the workers happen to be available.  You then get a time series with clusters of calculation results followed by nothing (the processing time consumed by the wasted calculations), and then another cluster of the same.  I would say that if you have a time series with a window of, say, 300 events and all your processing time is directed at the first 50 of them, then your algorithm does not clearly accomplish its goals.  Backpressure should not be blind... it should be able to determine the optimal spacing between "processing" events.

I am hoping to find a solution that uses the Akka primitives to accomplish this goal.  Right now, I am leaning towards having a Scheduler actor that controls a set of worker actors.  This Scheduler actor will make sure that work is only assigned to a child actor if it satisfies the spacing rules.  One level above the Scheduler actor can sit the standard backpressure mechanisms.  If your data is coming in even faster than you can determine which events to filter, then you need some method of dropping events before they are ever passed to the Scheduler actor.
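Here is a rough sketch of how that Scheduler might look with classic Akka actors. The Event type, the nanosecond time gate, and the RoundRobinPool of workers are all assumptions about the design, not a finished implementation:

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.routing.RoundRobinPool

import scala.concurrent.duration._

// A snapshot event from the feed; the payload type is a placeholder.
final case class Event(payload: String)

class Worker extends Actor {
  def receive: Receive = {
    case Event(payload) =>
      // Stand-in for the real calculation.
      println(s"${self.path.name} calculating on $payload")
  }
}

// Forwards an event to a worker only if at least C / P time has passed
// since the last event it let through; everything else is dropped.
class Scheduler(minSpacing: FiniteDuration, workers: Int) extends Actor {
  private val pool: ActorRef =
    context.actorOf(RoundRobinPool(workers).props(Props[Worker]), "workers")

  private var lastDispatched: Long = 0L

  def receive: Receive = {
    case e: Event =>
      val now = System.nanoTime()
      if (now - lastDispatched >= minSpacing.toNanos) {
        lastDispatched = now
        pool ! e
      } // else: drop the event, it is too close to the previous one
  }
}

object SchedulerDemo extends App {
  // C = 200 ms and P = 4 gives a minimum spacing of 50 ms.
  val system    = ActorSystem("spacing-demo")
  val scheduler = system.actorOf(Props(new Scheduler(50.millis, workers = 4)), "scheduler")
}
```

Because the gate lives in a single actor, the spacing check is race-free even though the workers themselves run in parallel.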

