TestSplitLogManager.testVanishingTaskZNode fails when run individually (i.e. when just that test case is run from Eclipse). I've also noticed that it is flaky on Windows.
The reason is a race condition, which for some reason rarely shows up when the whole class is run.
The sequence of events is something like this:
- we create 1 log file to split
- we call splitLogDistributed() in its own thread.
- splitLogDistributed() blocks in waitForSplittingCompletion(); since there are no SplitLogWorkers, it keeps waiting.
- we delete the task znode from zk
- SplitLogManager receives the zk callback in GetDataAsyncCallback, which calls setDone() and marks the task as successful.
- Meanwhile, however, the waitForSplittingCompletion() loop sees that remainingInZK == 0 and returns, concurrently with the callback above.
- On return from waitForSplittingCompletion(), splitLogDistributed() fails because the znode delete callback has not completed yet.
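The interleaving above can be reproduced with a small standalone model. This is only a sketch and not the SplitLogManager code: the class VanishingTaskRaceSketch, the Batch fields, the simulate() method and both latches are hypothetical. It also assumes that remainingInZK drops to zero before setDone() has finished, which is modelled here with a plain counter; the real bookkeeping may differ.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the race; none of this is actual HBase code.
class VanishingTaskRaceSketch {

    // Stand-in for the per-batch counters (assumed shape).
    static final class Batch {
        final AtomicInteger installed = new AtomicInteger();
        final AtomicInteger done = new AtomicInteger();
    }

    // Returns true if the final "done == installed" check passes.
    // waitForCallback == false models the waiter returning as soon as
    // remainingInZK == 0, before the delete callback has finished.
    static boolean simulate(boolean waitForCallback) {
        Batch batch = new Batch();
        batch.installed.set(1);                           // one log file to split
        AtomicInteger remainingInZK = new AtomicInteger(1);
        CountDownLatch znodeGone = new CountDownLatch(1); // deletion has been observed
        CountDownLatch mayFinish = new CountDownLatch(1); // holds the callback mid-way

        // Models the zk callback thread handling the vanished task znode.
        Thread callback = new Thread(() -> {
            remainingInZK.decrementAndGet();              // task no longer counted in zk
            znodeGone.countDown();
            await(mayFinish);                             // window in which the waiter can run
            batch.done.incrementAndGet();                 // models setDone() marking success
        });
        callback.start();

        // Models the waitForSplittingCompletion() loop: it only looks at remainingInZK.
        await(znodeGone);
        if (waitForCallback) {
            mayFinish.countDown();
            join(callback);                               // let the callback finish first
        }

        // Models the check made after the wait returns.
        boolean ok = remainingInZK.get() == 0
                && batch.done.get() == batch.installed.get();

        mayFinish.countDown();                            // always release and clean up
        join(callback);
        return ok;
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter returns early:     " + simulate(false));
        System.out.println("waiter waits for callback: " + simulate(true));
    }
}
```

In this model simulate(false) fails the check because done is still 0 when the waiter returns, while simulate(true) passes. The latches only make the window deterministic; in the real test the same window is opened by thread scheduling, which would explain why the failure is intermittent.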
This race only happens when the last task is deleted from zk, and normally only the SplitLogManager deletes the task znodes, after processing them, so I don't think this is a production issue.