We have repeatedly seen ZooKeeper issues cause NodeFailoverWorker to silently fail to pick up the replication queue of a dead region server. One example is when the znode size for a particular queue exceeds the jute.maxBuffer value.
Other situations may lead to the same outcome and go undetected. We need a metric for the number of unclaimed replication queues. Alerting on this metric will help mitigate the problem and identify the underlying issues.
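One possible way to compute such a metric is sketched below. It is only an illustration of the idea, not the proposed patch: the class UnclaimedQueueCounter, the method countUnclaimed, and the shape of its inputs (queue ids grouped by owning region server, plus the set of live region servers) are all hypothetical and do not correspond to existing HBase APIs. A queue is counted as unclaimed when its owner is not among the live servers.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not an existing HBase class. */
final class UnclaimedQueueCounter {
    private UnclaimedQueueCounter() {}

    /**
     * Counts replication queues whose owning region server is no longer alive.
     *
     * @param queuesByOwner queue ids keyed by the name of the region server that owns them
     *                      (assumed to be read from the replication queue storage)
     * @param liveServers   names of the region servers currently alive
     * @return the number of queues still owned by a dead region server
     */
    static int countUnclaimed(Map<String, List<String>> queuesByOwner,
                              Set<String> liveServers) {
        int unclaimed = 0;
        for (Map.Entry<String, List<String>> entry : queuesByOwner.entrySet()) {
            // Every queue under a dead owner has not been claimed by anyone yet.
            if (!liveServers.contains(entry.getKey())) {
                unclaimed += entry.getValue().size();
            }
        }
        return unclaimed;
    }
}
```

The count can be briefly non-zero during a normal failover, so an alert would presumably fire only when the value stays above zero for longer than a failover is expected to take.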