Been thinking about this a bit... I don't think it's so trivial as to just use our existing JournalSet implementation across multiple remote directories. The issue is the following case:
- HA cluster configured with two NNs: NN1 and NN2, and two shared storage directories (SD1 and SD2)
- NN1 is active and writing to SD1 and SD2 when a network issue occurs which partitions NN1 from SD2 and NN2 from SD1 (this isn't that unlikely, for example if NN1 and SD1 share a rack while NN2 and SD2 share a rack).
- If a failover occurs at this point, NN2 could take over without reading the most recent edits from SD1, resulting in a divergent namespace.
I have two possible solutions in mind:
Solution 1: use a traditional quorum system
When a NN writes, require that all writes must be synced to W shared storage directories. When the SBN reads, require that it read edits from R shared storage directories. The quorum requirement is that R + W > N where N is the total number of shared dirs.
In the error case described above, we could set W = 1 and R = 2. Thus the active NN could continue to operate even if one of the SDs is down – but no failovers could occur during that window of time. Once the directory is restored, the system would be fully operational again and ready for failover.
The usual quorum requirement that W > N/2 would not be necessary here, since we already ensure a designed single writer by fencing operations.
Solution 2: use an external source of record to agree on storage directory state.
In this solution, an external system (likely zookeeper) is used to agree upon the state of the storage directories. ZK would contain a znode which lists the active shared directories. When the active NN writes, it writes to all of these active directories. If any of the writes fail, it must update the znode to mark the failed directory as out-of-date before acking the write.
When a directory is restored, it will be re-added to the znode listing active directories.
When the SBN processes a failover, it considers only those directories listed in the znode as active.
Any other solutions I'm not thinking of here?