Here are more details: rollEditLog() was called via RPC from the SNN, and opening the new edit files failed. The exception was sent back to the caller, but no action was taken locally. From this point on, the edit log state was BETWEEN_LOG_SEGMENTS and no further rolling was possible because endCurrentLogSegment() fails. Meanwhile, logSync() and logEdit() went on as if nothing were wrong.
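To illustrate the sequence described above, here is a minimal sketch of how a failed roll can leave the log stuck. It is not the real FSEditLog: the class `EditLogStateSketch`, the `failOpen` flag, the `IN_SEGMENT` state name, and the use of IllegalStateException are all assumptions made for the example.

```java
import java.io.IOException;

// Hypothetical, heavily simplified model of the edit log state handling.
class EditLogStateSketch {
    enum State { IN_SEGMENT, BETWEEN_LOG_SEGMENTS }

    State state = State.IN_SEGMENT;
    boolean failOpen = false;  // assumption: simulates failure to open new edit files
    int editsAccepted = 0;

    void endCurrentLogSegment() {
        // Only legal while a segment is open.
        if (state != State.IN_SEGMENT) {
            throw new IllegalStateException("Bad state: " + state);
        }
        state = State.BETWEEN_LOG_SEGMENTS;
    }

    void startLogSegment() throws IOException {
        if (failOpen) {
            throw new IOException("could not open new edit files");
        }
        state = State.IN_SEGMENT;
    }

    void rollEditLog() throws IOException {
        endCurrentLogSegment();
        // If this throws, the exception goes to the caller and the state
        // stays BETWEEN_LOG_SEGMENTS; nothing is done locally.
        startLogSegment();
    }

    void logEdit() {
        // No state check in this sketch, so edits keep being "accepted".
        editsAccepted++;
    }
}
```

In this model, once rollEditLog() has failed in startLogSegment(), every later rollEditLog() is rejected by endCurrentLogSegment(), while logEdit() still succeeds.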
Trunk does not have this issue. In mapJournalsAndReportErrors(), if a journal marked as required fails, the namenode terminates. If none is marked required, it simply throws an exception, even if all journals fail. However, logSync() will then log FATAL and terminate, since JournalSet#isEmpty() works differently in trunk.
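The trunk behavior described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual JournalSet code: `JournalSetSketch`, `Journal`, `JournalAction`, and `TerminateException` (a stand-in for the namenode exiting) are hypothetical names, and the real method signatures differ.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the trunk semantics; names are illustrative only.
class JournalSetSketch {
    interface JournalAction {
        void apply(Journal j) throws IOException;
    }

    static class Journal {
        final String name;
        final boolean required;
        boolean disabled = false;

        Journal(String name, boolean required) {
            this.name = name;
            this.required = required;
        }
    }

    // Stand-in for process termination, so the sketch can be exercised.
    static class TerminateException extends RuntimeException {
        TerminateException(String msg) { super(msg); }
    }

    final List<Journal> journals = new ArrayList<>();

    void mapJournalsAndReportErrors(JournalAction action, String status)
            throws IOException {
        int succeeded = 0;
        for (Journal j : journals) {
            if (j.disabled) continue;
            try {
                action.apply(j);
                succeeded++;
            } catch (IOException e) {
                if (j.required) {
                    // A failed required journal terminates immediately.
                    throw new TerminateException(status + " failed for required journal " + j.name);
                }
                j.disabled = true;
            }
        }
        if (succeeded == 0) {
            // No required journal failed, so this is only an exception.
            throw new IOException(status + " failed for all journals");
        }
    }

    // True when no usable journal remains, not merely when the list is empty.
    boolean isEmpty() {
        for (Journal j : journals) {
            if (!j.disabled) return false;
        }
        return true;
    }

    void logSync() {
        if (isEmpty()) {
            throw new TerminateException("FATAL: no journals available to flush");
        }
    }
}
```

With two non-required journals that both fail, mapJournalsAndReportErrors() in this sketch throws only an IOException, but the next logSync() sees isEmpty() return true and terminates.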
In branch-0.23, FSEditLog maintains a list of journals. logSync() invokes isEmpty(), but it does not check the validity of the journals in the list. Instead it checks them one by one in a loop. Although it already has logic for counting and disabling bad journals, there is nothing equivalent to the resource availability check in trunk/branch-2. I think the best place to add this is
. This will make the failure behavior almost the same as what is already implemented in trunk/branch-2.
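As a rough illustration of the kind of check meant here, the sketch below counts and disables bad journals during a sync and treats "no journal succeeded" as fatal. It is only a sketch under assumptions: `JournalListSketch`, `Journal`, the `failing` flag, and `NoJournalsException` (a stand-in for terminating the namenode) are hypothetical, and it does not show where the check would actually go in branch-0.23.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a journal list with an added availability check.
class JournalListSketch {
    static class Journal {
        boolean failing = false;   // assumption: simulates a bad journal
        boolean disabled = false;

        void flush() throws IOException {
            if (failing) throw new IOException("flush failed");
        }
    }

    // Stand-in for namenode termination, so the sketch can be exercised.
    static class NoJournalsException extends RuntimeException {
        NoJournalsException(String msg) { super(msg); }
    }

    final List<Journal> journals = new ArrayList<>();

    // Returns the number of journals that were flushed successfully.
    int logSync() {
        List<Journal> bad = new ArrayList<>();
        int ok = 0;
        for (Journal j : journals) {
            if (j.disabled) continue;
            try {
                j.flush();
                ok++;
            } catch (IOException e) {
                bad.add(j);  // collect bad journals while looping
            }
        }
        for (Journal j : bad) {
            j.disabled = true;  // disable bad journals
        }
        // The added check: with no working journal left, continuing would
        // silently lose edits, so fail hard instead.
        if (ok == 0) {
            throw new NoJournalsException("FATAL: no journals available");
        }
        return ok;
    }
}
```

Failing hard at this point mirrors the trunk/branch-2 outcome in spirit: a single bad journal is disabled and tolerated, but losing all of them is never ignored.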
This issue does not exist in branch-1, where rollEditLog() clears editStreams before creating new edit files. Since it calls exitIfNoStreams() before returning, the namenode will terminate if no edit stream was successfully created.
As for test cases, trunk already has TestEditLogJournalFailures. I will create a new patch for branch-0.23 and a test case.