This mainly fixes issues with "long" errors, for example a multi that blocked while trying to obtain a lock and finally failed after 60s. Previously we retried for only 5 minutes in total; we now execute all the configured retries. I've fixed the code around this area to make it work.
There are also more logs.
I've changed the backoff array: with the default pause of 100ms, even after 20 tries we still retry every 10s.
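A minimal sketch of how such a backoff table caps the retry interval. The class name, method name, and multiplier values below are illustrative assumptions, not the actual patch: the idea is that past the end of the table the last multiplier is reused, so with a 100ms base pause the interval never exceeds 100 × 100ms = 10s.

```java
// Hypothetical sketch of backoff-table-based pause computation.
// The multipliers below are illustrative, not the real HBase values.
public class BackoffSketch {
    // Multipliers applied to the base pause; the last entry is reused
    // for all later tries, which caps the retry interval.
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100};

    static long getPauseTime(long basePauseMs, int tries) {
        // Clamp to the last entry so retry 20, 30, ... all use the cap.
        int idx = Math.min(tries, RETRY_BACKOFF.length - 1);
        return basePauseMs * RETRY_BACKOFF[idx];
    }

    public static void main(String[] args) {
        long base = 100; // default pause of 100ms
        System.out.println(getPauseTime(base, 0));  // 100ms on the first retry
        System.out.println(getPauseTime(base, 20)); // capped at 10000ms (10s)
    }
}
```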
I've also lowered the maximum number of concurrent tasks per RS to a minimal value: if the cluster is not in a very good state, the client is less aggressive. It seems to be a better default.
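For illustration, a client would typically override such a default in `hbase-site.xml`. The property name below is an assumption for the sketch; check the client configuration constants for the actual key and default.

```xml
<!-- Hypothetical override of the per-RegionServer concurrent task limit;
     the property name is assumed, not confirmed by this patch. -->
<property>
  <name>hbase.client.max.perserver.tasks</name>
  <value>2</value>
</property>
```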
I've done two tests:
- on a small, homogeneous cluster, I had the same performance
- on a bigger but heterogeneous cluster, it was twice as fast.
|Status||Patch Available [ 10002 ]||Open [ 1 ]|
|Status||Patch Available [ 10002 ]||Resolved [ 5 ]|
|Hadoop Flags||Reviewed [ 10343 ]|
|Resolution||Fixed [ 1 ]|
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|3h 58m||1||Nicolas Liochon||26/Oct/13 00:47|
|2h 43m||2||Nicolas Liochon||26/Oct/13 00:47|
|4d 13h 20m||1||Nicolas Liochon||30/Oct/13 13:08|
|47d 5h 38m||1||stack||16/Dec/13 18:46|