Elliott Clark: I used "custom" because the current naming scheme is not appropriate, in my opinion (I started with medium/semi QOS, but then changed it to Custom). Using "priority" is kind of a misnomer, as there is no priority as such; it's just a different set of handlers serving the requests.
Though we call them priorityHandlers, etc., they are just regular handlers reserved for meta operations. I think we should rename them to metaOpsHandlers (or metaHandlers). Yes, I just used a threshold between 0 and 10.
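To make the point concrete, here is a minimal sketch of what is being described: the "priority" handlers are just a second thread pool, and a QOS value above a threshold routes the request there. This is an illustrative sketch, not HBase's actual RPC scheduler; the class and field names (QosDispatchSketch, metaOpsHandlers, PRIORITY_THRESHOLD) are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QosDispatchSketch {
    // Hypothetical cutoff, matching the 0-10 threshold mentioned above.
    static final int PRIORITY_THRESHOLD = 10;

    // Two independent pools: regular handlers and the so-called
    // "priority" handlers, which are really just handlers reserved
    // for meta operations.
    final ExecutorService regularHandlers = Executors.newFixedThreadPool(4);
    final ExecutorService metaOpsHandlers = Executors.newFixedThreadPool(2);

    final AtomicInteger regularCount = new AtomicInteger();
    final AtomicInteger metaCount = new AtomicInteger();

    void dispatch(int qos, Runnable request) {
        if (qos >= PRIORITY_THRESHOLD) {
            metaCount.incrementAndGet();
            metaOpsHandlers.submit(request);   // meta-operation pool
        } else {
            regularCount.incrementAndGet();
            regularHandlers.submit(request);   // ordinary client requests
        }
    }

    public static void main(String[] args) throws InterruptedException {
        QosDispatchSketch s = new QosDispatchSketch();
        s.dispatch(0, () -> {});    // normal client request
        s.dispatch(10, () -> {});   // meta operation
        s.regularHandlers.shutdown();
        s.metaOpsHandlers.shutdown();
        s.regularHandlers.awaitTermination(5, TimeUnit.SECONDS);
        s.metaOpsHandlers.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("regular=" + s.regularCount.get()
            + " meta=" + s.metaCount.get());
    }
}
```

There is no ordering or preemption here, which is why "priority" is a misnomer: the only thing the QOS value buys is a separate pool that cannot be starved by regular requests.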
Since this starts 0 "custom" priority handlers by default, it adds another undocumented step when enabling replication. We should either make the default handler count greater than 0, or make it depend on whether replication is enabled.
I am OK with a default > 0; I don't think it should be tied to replication, as these handlers can be used for other methods too (such as security, etc.).
The naming is weird. These are not "Custom" QOS, but "Medium" QOS methods, right?
Hope you find it rational now.
By default now (if hbase.regionserver.custom.priority.handler.count is not set), replicateWALEntry would use non-priority handlers... which is not right, I think. It should revert to the current behavior in that case (which is to use the priority QOS).
Does a default > 0 sound good?
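The fallback being asked for could be sketched as follows. This is a hedged illustration of the proposed routing logic only; the enum and method names are hypothetical, not HBase code. The one fact it encodes is from the discussion: when hbase.regionserver.custom.priority.handler.count is unset (i.e. zero), replicateWALEntry should keep using the priority QOS handlers rather than falling through to the regular ones.

```java
public class ReplicationQosFallback {
    enum Pool { REGULAR, PRIORITY, CUSTOM }

    // If hbase.regionserver.custom.priority.handler.count is not set,
    // it reads as 0; in that case revert to the current behavior
    // (priority QOS) instead of the regular handlers.
    static Pool poolForReplicateWALEntry(int customHandlerCount) {
        if (customHandlerCount > 0) {
            return Pool.CUSTOM;    // dedicated "custom" handlers configured
        }
        return Pool.PRIORITY;      // fall back to priority QOS, never REGULAR
    }

    public static void main(String[] args) {
        System.out.println(poolForReplicateWALEntry(0));  // unset config
        System.out.println(poolForReplicateWALEntry(3));  // custom handlers on
    }
}
```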
What I still do not understand... Does this problem always happen? Does it happen because replicateWALEntry takes too long to finish? Does this only happen when the slave is already degraded for other reasons? Should we also work on replicateWALEntry failing faster in case of problems (shorter/fewer retries, etc)?
It can occur when the slave cluster is slow, and whenever it happens it makes the entire cluster unresponsive. I have a patch that adds fail-fast behavior in the sink, and I have been testing it too. It looks good so far. I tried creating a new JIRA but got an IOE while creating it (see
INFRA-5131). I will attach the patch once it is created.
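Since the patch itself is not attached yet, here is only a generic sketch of what "fail fast" in a replication sink usually means: cap the number of attempts and keep the backoff short, so a slow slave returns the handler to the pool quickly instead of tying it up. The Sink interface, method names, and numbers below are all hypothetical and are not taken from the actual patch.

```java
public class FailFastSinkSketch {
    interface Sink {
        void apply(String walEntry) throws Exception;
    }

    // Bounded retries with a short backoff; on exhaustion we give up
    // promptly so the caller can re-queue the entry and the handler
    // thread is freed instead of blocking on a degraded slave.
    static boolean replicateWithFailFast(Sink sink, String entry,
                                         int maxRetries, long backoffMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                sink.apply(entry);
                return true;              // delivered
            } catch (Exception e) {
                if (attempt == maxRetries) {
                    return false;         // fail fast: stop retrying
                }
                Thread.sleep(backoffMs);  // short, bounded backoff
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Sink alwaysFailing = entry -> { throw new Exception("slave is slow"); };
        boolean ok = replicateWithFailFast(alwaysFailing, "wal-entry-1", 3, 10);
        System.out.println("delivered=" + ok);
    }
}
```

The design trade-off is the one raised above: fewer/shorter retries surface replication failures faster, at the cost of more re-queuing when the slave is only transiently slow.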