This came from
I found the issue and I am able to reproduce it.
See the logs
Here we can observe that the SPLITTING node was created first. We then transition it from SPLITTING to SPLITTING so that the AM gets a nodeDataChange event. But for the nodeDataChange event to be delivered, a nodeChildrenChange event has to happen first, so that the master can set a watcher on the node.
Now, when the hang happens, we can see that the watcher is set by the nodeChildrenChange event only after the transition has already happened, so the SPLITTING to SPLITTING event itself is missed.
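To make the ordering problem concrete, here is a minimal sketch of the race using a toy in-memory model of one-shot data watches. It is not the ZooKeeper or HBase API: the class MissedWatchDemo, its methods, and the znode path are all hypothetical and exist only to illustrate that an update made before the watch is registered is never delivered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of one-shot data watches. Hypothetical; not the ZooKeeper API. */
class MissedWatchDemo {
    private final Map<String, String> znodes = new HashMap<>();
    private final Set<String> watchedPaths = new HashSet<>();
    final List<String> deliveredEvents = new ArrayList<>();

    /** Creates the znode with an initial state. No data watch exists yet. */
    void create(String path, String state) {
        znodes.put(path, state);
    }

    /** Registers a one-shot data watch on the path. */
    void watchData(String path) {
        watchedPaths.add(path);
    }

    /** Updates the znode; an event is delivered only if a watch is already set. */
    void setData(String path, String state) {
        znodes.put(path, state);
        if (watchedPaths.remove(path)) { // one-shot: the watch is consumed
            deliveredEvents.add("nodeDataChange:" + state);
        }
    }

    /** Runs the scenario with the watch set either before or after the transition. */
    static List<String> run(boolean watchBeforeTransition) {
        MissedWatchDemo zk = new MissedWatchDemo();
        String path = "/hbase/assignment/region-1"; // hypothetical znode name
        zk.create(path, "SPLITTING");
        if (watchBeforeTransition) {
            zk.watchData(path);            // master sets the watch first
            zk.setData(path, "SPLITTING"); // SPLITTING to SPLITTING is seen
        } else {
            zk.setData(path, "SPLITTING"); // transition happens too early
            zk.watchData(path);            // watch arrives late, event is lost
        }
        return zk.deliveredEvents;
    }

    public static void main(String[] args) {
        System.out.println("watch first:      " + run(true));
        System.out.println("transition first: " + run(false));
    }
}
```

In this model the second ordering delivers no event at all, which matches the hang described above: the master ends up holding a watch on a znode whose only update has already passed.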
Ideally, the nodeChildrenChange event handler iterates through the list of new znodes under /hbase/assignment and sets a watcher on each of them. One reason for the delay could be that there is more than one znode, so setting the watches takes time. The order of execution is also different when we run the test from Eclipse and when we run the mvn tests.
My conclusion is that the test case reveals a real problem, and the same thing can happen in any scenario where the SPLITTING event gets missed. Maybe some of the SPLIT-related bugs that were raised are due to this? This needs analysis.
Any suggestions are welcome. We should ensure that the transition from SPLITTING to SPLITTING happens only after the master has set the watch on the znode, and we need to be sure of that.
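As a rough illustration of the ordering we want, the sketch below gates the transition on a signal that the watch is in place. It uses a CountDownLatch between two threads in one JVM purely to show the ordering. The master and the region server are separate processes, so a real fix would need some cross-process signal instead, and every name here (WatchBeforeTransition, watchReady, the log strings) is hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: the transition waits until the watch is known to be set. */
class WatchBeforeTransition {
    static List<String> run() {
        List<String> log = new CopyOnWriteArrayList<>();
        AtomicBoolean watchSet = new AtomicBoolean(false);   // stands in for the one-shot watch
        CountDownLatch watchReady = new CountDownLatch(1);   // "watch is in place" signal

        // Plays the side that performs the SPLITTING to SPLITTING transition.
        Thread splitter = new Thread(() -> {
            try {
                if (!watchReady.await(5, TimeUnit.SECONDS)) {
                    log.add("timed out waiting for watch");
                    return;
                }
                log.add("transition SPLITTING to SPLITTING");
                if (watchSet.getAndSet(false)) {             // consume the one-shot watch
                    log.add("nodeDataChange delivered");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        splitter.start();

        // Plays the master: set the watch first, then release the splitter.
        watchSet.set(true);
        log.add("watch set");
        watchReady.countDown();

        try {
            splitter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for splitter", e);
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Because the splitter cannot pass the latch until the master has registered the watch, the log always shows the watch being set before the transition, and the event is delivered. The timeout is there so that a missing signal shows up as an explicit failure rather than a silent hang.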