Affects Version/s: 0.94.1, 0.95.2
Fix Version/s: 0.95.0
3-node cluster test, but it can occur on a much bigger one as well. It's all luck!
Release Note:The Split Log Manager now takes into account the state of the region server doing the split. If this region server is marked as dead (i.e. its ZooKeeper connection expires), its task is immediately resubmitted. If the region server is still in the "alive" state, then we wait for 2 minutes before resubmitting, instead of 25 seconds previously. This delay can be changed with the parameter "hbase.splitlog.manager.timeout" (milliseconds, new default since 0.96: 120000).
With default settings for "hbase.splitlog.manager.timeout" => 25s and "hbase.splitlog.max.resubmit" => 3.
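For reference, a minimal sketch (not the actual test code; the class and method names below are invented, only the two property names and the values shown come from the settings discussed here) of overriding these settings programmatically:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SplitLogSettings {
  // Hypothetical helper: returns a configuration with the split-log
  // settings discussed in this issue set explicitly.
  public static Configuration withSplitLogSettings() {
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.splitlog.manager.timeout", 25000L); // 25s, the default discussed here
    conf.setInt("hbase.splitlog.max.resubmit", 3);          // 3 resubmits
    return conf;
  }
}
{code}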
On the tests mentioned in HBASE-5843, I have variations around this scenario, 0.94 + HDFS 1.0.3:
The regionserver in charge of the split does not answer within 25s, so it gets interrupted but actually continues. Sometimes we exhaust the number of retries, sometimes not; sometimes we are out of retries but, as the interrupts were ignored, the split finishes nicely anyway. In the meantime, the same single task is executed in parallel by multiple nodes, increasing the probability of running into race conditions.
t0: unplug a box with DN+RS
t + x: other boxes are already connected to it, so their connections start to die. Nevertheless, they don't consider this node as suspect.
t + 180s: ZooKeeper -> master detects the node as dead; recovery starts. It can be less than 180s, sometimes around 150s.
t + 180s: distributed split starts. There is only 1 task, and it's immediately acquired by one RS.
t + 205s: the RS has multiple errors when splitting, because a datanode is missing as well. The master decides to give the task to someone else, but often the task continues on the first RS. Interrupts are often ignored, as it's well stated in the code ("// TODO interrupt often gets swallowed, do what else?")
t + 211s: two regionservers are processing the same task. They fight for the leases:
They can fight like this for many files, until the tasks finally get interrupted or finish.
The task on the second box can be cancelled as well. In this case, the task is created again for a new box.
The master seems to stop after 3 attempts. It can also give up on splitting the files. Sometimes the tasks were not cancelled on the RS side, so the split is finished despite what the master thinks and logs. In that case, the assignment starts. In the other, it's "we've got a problem".
t + 300s: split is finished. Assignment starts.
t + 330s: assignment is finished, regions are available again.
There are a lot of possible subcases depending on the number of log files, of region servers, and so on.
The issues are:
1) It's difficult, especially in HBase but not only there, to interrupt a task. The pattern is often along these lines:
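(A minimal illustrative sketch of that pattern; the class and method names are invented for the example, this is not the actual HBase code.)
{code:java}
public class SwallowedInterrupt {
  public void splitLog() {
    while (moreEditsToSplit()) {
      try {
        writeNextBatchToHdfs();          // long blocking call
      } catch (InterruptedException e) {
        // The interrupt is caught and only logged. The thread's interrupt
        // status is not restored, so the loop keeps running and the task
        // goes on even though the master already gave it to another RS.
        System.err.println("interrupted, ignoring: " + e);
      }
    }
  }

  // Stand-ins so the sketch compiles; the real work would be HDFS I/O.
  private boolean moreEditsToSplit() { return false; }
  private void writeNextBatchToHdfs() throws InterruptedException {
    Thread.sleep(1000);
  }
}
{code}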
This typically swallows the interrupt. There are other variations, but this one seems to be the standard.
Even if we fix this in HBase, we need the other layers to be interruptible as well. That's not proven.
2) 25s is very aggressive, considering that we have a default timeout of 180s for zookeeper. In other words, we give 180s to a regionserver before acting, but when it comes to split, it's 25s only. There may be reasons for this, but it seems dangerous, as during a failure the cluster is less available than during normal operations. We could do stuff around this, for example:
=> Obvious option: increase the timeout at each try, something like *2 (see the sketch after this list).
=> Also possible: increase the initial timeout.
=> check for an update instead of blindly cancelling + resubmitting.
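A minimal sketch of the first option, just doubling the previous timeout at each resubmit. All names below are invented for the example; only the two defaults (25s and 3 resubmits) come from the current configuration:
{code:java}
public class BackoffTimeout {
  static final long INITIAL_TIMEOUT_MS = 25000L; // current hbase.splitlog.manager.timeout
  static final int MAX_RESUBMIT = 3;             // current hbase.splitlog.max.resubmit

  // Double the wait at each attempt: 0 -> 25s, 1 -> 50s, 2 -> 100s, 3 -> 200s.
  static long timeoutForAttempt(int attempt) {
    return INITIAL_TIMEOUT_MS << attempt;
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt <= MAX_RESUBMIT; attempt++) {
      System.out.println("resubmit #" + attempt + " after "
          + timeoutForAttempt(attempt) + " ms");
    }
  }
}
{code}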
3) Globally, it seems that this retry mechanism duplicates the failure detection already in place with ZK. Would it not make sense to just hook into this existing detection mechanism, and resubmit a task if and only if we detect that the regionserver in charge died? During a failure scenario we should be much more gentle than during normal operation, not the opposite.
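A rough sketch of what that could look like, assuming the manager is fed ZK session-expiration events and otherwise waits for the long timeout. Every name below is hypothetical; this is not existing HBase code:
{code:java}
import java.util.HashSet;
import java.util.Set;

public class ResubmitPolicy {
  // Fed by ZK expiry events (the detection we already have).
  private final Set<String> deadServers = new HashSet<String>();
  private final long managerTimeoutMs;

  public ResubmitPolicy(long managerTimeoutMs) {
    this.managerTimeoutMs = managerTimeoutMs;
  }

  public void serverExpired(String serverName) {
    deadServers.add(serverName);
  }

  public boolean shouldResubmit(String workerServerName, long msSinceLastProgress) {
    if (deadServers.contains(workerServerName)) {
      return true; // the worker's ZK session expired: resubmit immediately
    }
    // Worker still looks alive: be gentle, only resubmit after the long timeout.
    return msSinceLastProgress > managerTimeoutMs;
  }
}
{code}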