This mainly fixes issues with "long" errors, for example a multi that blocks while trying to obtain a lock and only fails after 60s. Previously we stopped retrying after 5 minutes, whatever the number of tries left. We now do all the tries. I've fixed the code around this area to make it work.
There are also more logs.
I've changed the backoff array. With the default pause of 100ms, even after 20 tries we still retry every 10s.
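To illustrate how a backoff array and the base pause combine, here is a minimal sketch. The class name, the method name and the multiplier values are assumptions for illustration only, chosen so that the last multiplier gives the 10s cap mentioned above. They are not necessarily the values in the patch.

```java
// Illustrative sketch only: names and multiplier values are hypothetical,
// not taken from the patch.
class BackoffSketch {
    // Assumed multipliers applied to the base pause; the last entry is
    // reused for every later try, which caps the wait between retries.
    static final int[] BACKOFF_MULTIPLIERS = {1, 2, 3, 5, 10, 100};

    // Returns the pause in milliseconds before retry number `tries`
    // (0-based), given the base pause in milliseconds.
    static long pauseMillis(long basePauseMs, int tries) {
        int index = Math.max(0, Math.min(tries, BACKOFF_MULTIPLIERS.length - 1));
        return basePauseMs * BACKOFF_MULTIPLIERS[index];
    }

    public static void main(String[] args) {
        long basePauseMs = 100; // the default pause mentioned above
        for (int tries : new int[] {0, 1, 4, 5, 20}) {
            System.out.println("try " + tries + ": wait "
                + pauseMillis(basePauseMs, tries) + " ms");
        }
    }
}
```

With these assumed values and a 100ms pause, the wait grows from 100ms on the first retry to 10s from the sixth retry onwards, and stays at 10s for try 20.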
I've also changed the max per RS (region server) to a minimal value. If the cluster is not in a very good state, the client is less aggressive. It seems to be a better default.
I've done two tests:
- on a small, homogeneous cluster, I had the same performance
- on a bigger, but heterogeneous cluster, it was twice as fast.