I've been having trouble getting a sustained, large ITBLL run to complete over the last few days. I keep seeing the following sequence:
- A region splits or is moved
- Chaos kills the Master in the middle of the Split or Move Procedure after a Region has been offlined
- Master takes a while to come back: it is not restarted until a couple of minutes have passed, and then there is some recovery to be done.
So a region can be offline for minutes. By default we retry up to 16 times, which works out to about 2.5 minutes before we give up.
So, I can up the retries when running larger tests, but the region should also come back online faster.
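For reference, here is a rough sketch of where the "about 2.5 minutes" comes from and what upping the retries buys. It assumes the client sleeps hbase.client.pause (100 ms by default) times a multiplier from the HConstants.RETRY_BACKOFF table before each attempt, and that attempts beyond the end of the table reuse the last multiplier. It ignores jitter and the time spent in the RPCs themselves, so treat the numbers as a lower bound. The class and method names are made up for illustration.

```java
// Hypothetical helper, not part of HBase: estimates total client backoff time.
class RetryBackoffEstimate {
    // Multipliers as in HConstants.RETRY_BACKOFF (assumed unchanged in the version under test).
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Sum of the sleeps across `attempts` tries, in milliseconds.
    static long totalBackoffMillis(int attempts, long pauseMs) {
        long total = 0;
        for (int i = 0; i < attempts; i++) {
            // Past the end of the table, stick with the last multiplier.
            int idx = Math.min(i, RETRY_BACKOFF.length - 1);
            total += pauseMs * RETRY_BACKOFF[idx];
        }
        return total;
    }

    public static void main(String[] args) {
        long pauseMs = 100L; // assumed default for hbase.client.pause
        for (int attempts : new int[] {16, 30}) {
            long ms = totalBackoffMillis(attempts, pauseMs);
            System.out.println(attempts + " attempts: " + ms + " ms");
        }
    }
}
```

Under these assumptions 16 attempts add up to 148100 ms, a little under 2.5 minutes, while 30 attempts would ride out roughly 7 minutes of the region being offline. The knob to raise for the larger tests would be hbase.client.retries.number.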
Let me hang ITBLL fixes/notes off here.