Details
Type: Brainstorming
Status: Open
Priority: Major
Resolution: Unresolved
Description
We have this little note in our ITBLL harness,
// If we cause enough chaos, RPC requests might get into long backoffs. During this
// time, it won't send keep alives to the map/reduce context. So increase the timeout
// a bunch
Investigating, I found that the ITBLL Generator's persist method updates the MR context progress only every 100 puts. You'd think that would be enough, but given chaos, it really isn't. What if we update progress with every put? Digging through the MR source code, it seems that calling context.progress() only sets an AtomicBoolean indicating that a progress update needs to be sent. The actual sending of progress reports is gated by mapreduce.task.progress-report.interval or, if that is unset, 1% of mapreduce.task.timeout, which by default works out to 1% of 300_000ms, or 3 seconds. So yeah, we should probably set this AtomicBoolean much more often in chaotic jobs, as doing so is effectively free and will improve reliability.
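To make the "effectively free" point concrete, here is a minimal, self-contained sketch of the flag-plus-interval pattern described above. It is not the Hadoop implementation: the class ProgressFlagSketch and its methods are hypothetical stand-ins, and the 1% ratio and 300_000ms timeout are simply the values quoted in this description.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the behavior described above; not Hadoop code. */
public class ProgressFlagSketch {
    // Set by the task whenever it makes progress; cleared by the reporter.
    private final AtomicBoolean progressFlag = new AtomicBoolean(false);
    // Counts how many reports the (simulated) reporter actually sent.
    private final AtomicInteger reportsSent = new AtomicInteger();

    /** Cheap call made by the task: it only flips a flag, no RPC happens here. */
    public void progress() {
        progressFlag.set(true);
    }

    /** Called by the reporter once per interval; sends only if the flag was set. */
    public boolean reportIfNeeded() {
        if (progressFlag.getAndSet(false)) {
            reportsSent.incrementAndGet();
            return true;
        }
        return false;
    }

    public int reportsSent() {
        return reportsSent.get();
    }

    /** Assumed default: the report interval is 1% of the task timeout. */
    public static long defaultReportIntervalMs(long taskTimeoutMs) {
        return taskTimeoutMs / 100;
    }

    public static void main(String[] args) {
        ProgressFlagSketch task = new ProgressFlagSketch();
        // Calling progress() on every put costs one atomic write per call.
        for (int put = 0; put < 1000; put++) {
            task.progress();
        }
        // However many times the flag was set, one reporter tick sends one report.
        task.reportIfNeeded();
        task.reportIfNeeded();
        System.out.println("reports sent: " + task.reportsSent());
        System.out.println("interval ms: " + defaultReportIntervalMs(300_000L));
    }
}
```

The point of the sketch is that the number of progress() calls has no effect on the number of reports sent; only the reporter's interval does.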
But still, every put is perhaps excessive. What if we add a pre-flush hook to (Async)BufferedMutator so that an MR job can set this progress flag right before the client disappears down into a retry loop? I bet other applications would find such a hook useful as well.
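One possible shape for such a hook, sketched against a toy buffer rather than the real client: ToyBufferedMutator, its constructor, and the preFlushHook parameter are all hypothetical names for illustration, and nothing here reflects the existing (Async)BufferedMutator API. In an MR job the hook would presumably be something like context::progress.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch of a pre-flush hook; not the HBase BufferedMutator API. */
public class PreFlushHookSketch {

    /** Toy buffer that runs a caller-supplied hook right before each flush. */
    public static final class ToyBufferedMutator {
        private final List<String> buffer = new ArrayList<>();
        private final int flushThreshold;
        private final Runnable preFlushHook;
        private final Consumer<List<String>> sink;

        public ToyBufferedMutator(int flushThreshold, Runnable preFlushHook,
                                  Consumer<List<String>> sink) {
            this.flushThreshold = flushThreshold;
            this.preFlushHook = preFlushHook;
            this.sink = sink;
        }

        /** Buffers one mutation and flushes once the threshold is reached. */
        public void mutate(String mutation) {
            buffer.add(mutation);
            if (buffer.size() >= flushThreshold) {
                flush();
            }
        }

        /** Runs the hook, then hands the batch to the sink (where retries would happen). */
        public void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            preFlushHook.run();
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Returns {hook invocations, mutations written} for the given workload. */
    public static int[] runDemo(int puts, int threshold) {
        AtomicInteger hookCalls = new AtomicInteger();
        AtomicInteger written = new AtomicInteger();
        ToyBufferedMutator mutator = new ToyBufferedMutator(
            threshold,
            hookCalls::incrementAndGet,              // stand-in for context::progress
            batch -> written.addAndGet(batch.size()) // stand-in for the RPC path
        );
        for (int i = 0; i < puts; i++) {
            mutator.mutate("row-" + i);
        }
        mutator.flush(); // flush whatever is left in the buffer
        return new int[] {hookCalls.get(), written.get()};
    }

    public static void main(String[] args) {
        int[] result = runDemo(250, 100);
        System.out.println("hook calls: " + result[0] + ", written: " + result[1]);
    }
}
```

With this shape the hook fires once per flush rather than once per put, so the progress flag is set exactly when the client is about to enter code that may back off and retry.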