Hadoop HDFS / HDFS-2782

HA: Support multiple shared edits dirs

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.24.0
    • Fix Version/s: None
    • Component/s: ha
    • Labels:
      None

      Description

      Supporting multiple shared dirs will improve availability (eg see HDFS-2769). You may want to use multiple shared dirs on a single filer (eg for better fault isolation) or because you want to use multiple filers/mounts. Per HDFS-2752 (and HDFS-2735) we need to do things like use the JournalSet in EditLogTailer and add tests.
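
      For illustration only, here is a minimal sketch of how multiple shared edits dirs might be read from configuration, assuming comma-separated values under the existing dfs.namenode.shared.edits.dir key (the multi-dir support itself was never committed, so the class and method names below are hypothetical):

      {code:java}
      import java.net.URI;
      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.conf.Configuration;

      // Hypothetical sketch: collect each configured shared edits dir so a
      // JournalSet-style wrapper (and the EditLogTailer) could iterate over them.
      public class SharedEditsDirs {
        // Assumption: multiple values would be comma-separated under this key.
        static final String SHARED_EDITS_KEY = "dfs.namenode.shared.edits.dir";

        public static List<URI> getSharedEditsDirs(Configuration conf) {
          List<URI> dirs = new ArrayList<URI>();
          for (String dir : conf.getTrimmedStrings(SHARED_EDITS_KEY)) {
            dirs.add(URI.create(dir)); // eg file:///mnt/filer1/hdfs/edits
          }
          return dirs;
        }
      }
      {code}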

          Activity

          Eli Collins added a comment -

          Given QJM (HDFS-3077) IMO this is no longer worth considering.

          Aaron T. Myers added a comment -

          Converting to top-level issue with commit of HDFS-1623.

          Eli Collins added a comment -

          There's also a variant of solution #2 where you don't use an external source, eg NN1 and NN2 communicate directly to determine a common set of shared dirs (they re-negotiate when either sees an SD fail). In your partition scenario failover would not be possible because NN2 cannot access all the directories it previously negotiated with NN1. This isn't a good solution though, because it requires that both the active and standby be available whenever an SD fails (so the list can be re-negotiated).
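
          Purely to illustrate the re-negotiation idea (no such protocol exists in HDFS, and all names below are hypothetical), the agreed set could be the intersection of the dirs each NN can currently reach, updated only when both NNs are up to acknowledge it:

          {code:java}
          import java.net.URI;
          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical sketch of the direct-negotiation variant: the agreed set
          // of shared dirs is whatever both NNs can reach. Both NNs must be up to
          // re-negotiate after a failure, which is why this variant was rejected.
          public class SharedDirNegotiation {
            static Set<URI> negotiate(Set<URI> reachableByActive,
                                      Set<URI> reachableByStandby) {
              Set<URI> agreed = new HashSet<URI>(reachableByActive);
              agreed.retainAll(reachableByStandby);
              return agreed;
            }
          }
          {code}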

          Todd Lipcon added a comment -

          Been thinking about this a bit... I don't think it's so trivial as to just use our existing JournalSet implementation across multiple remote directories. The issue is the following case:

          • HA cluster configured with two NNs: NN1 and NN2, and two shared storage directories (SD1 and SD2)
          • NN1 is active and writing to SD1 and SD2 when a network issue occurs which partitions NN1 from SD2 and NN2 from SD1 (this isn't that unlikely, for example if NN1 and SD1 share a rack while NN2 and SD2 share a rack).
          • If a failover occurs at this point, NN2 could take over without reading the most recent edits from SD1, resulting in a divergent namespace.

          I have two possible solutions in mind:

          Solution 1: use a traditional quorum system

          When a NN writes, require that all writes must be synced to W shared storage directories. When the SBN reads, require that it read edits from R shared storage directories. The quorum requirement is that R + W > N where N is the total number of shared dirs.

          In the error case described above, we could set W = 1 and R = 2. Thus the active NN could continue to operate even if one of the SDs is down – but no failovers could occur during that window of time. Once the directory is restored, the system would be fully operational again and ready for failover.

          The usual quorum requirement that W > N/2 would not be necessary here, since we already ensure a designated single writer via fencing operations.
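
          As a worked illustration (not HDFS code, all names hypothetical): with N = 2 shared dirs, W = 1 and R = 2 gives R + W = 3 > 2, so any set of R dirs read at failover must overlap every set of W dirs that acked a write. A minimal sketch of that bookkeeping:

          {code:java}
          // Hypothetical sketch of the quorum rule from Solution 1: a write is
          // acked once W shared dirs have synced it; the standby only completes a
          // failover after reading R dirs, where R + W > N guarantees overlap.
          public class EditsQuorum {
            final int totalDirs;   // N: total shared edits dirs
            final int writeQuorum; // W: dirs that must sync each write
            final int readQuorum;  // R: dirs the standby must read at failover

            EditsQuorum(int n, int w, int r) {
              if (r + w <= n) {
                throw new IllegalArgumentException("need R + W > N for overlap");
              }
              this.totalDirs = n;
              this.writeQuorum = w;
              this.readQuorum = r;
            }

            boolean writeSucceeded(int dirsSynced) {
              return dirsSynced >= writeQuorum;
            }

            boolean readIsComplete(int dirsConsulted) {
              return dirsConsulted >= readQuorum;
            }
          }
          {code}

          With new EditsQuorum(2, 1, 2), the active keeps writing with one dir down (W = 1), but a failover blocks until both dirs are readable again (R = 2), matching the behavior described above.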

          Solution 2: use an external source of record to agree on storage directory state.

          In this solution, an external system (likely zookeeper) is used to agree upon the state of the storage directories. ZK would contain a znode which lists the active shared directories. When the active NN writes, it writes to all of these active directories. If any of the writes fail, it must update the znode to mark the failed directory as out-of-date before acking the write.

          When a directory is restored, it will be re-added to the znode listing active directories.

          When the SBN processes a failover, it considers only those directories listed in the znode as active.
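
          A minimal sketch of that znode bookkeeping, assuming a hypothetical path /hdfs/shared-edits-dirs whose data is a newline-separated list of the currently active dirs (none of this was actually implemented):

          {code:java}
          import java.nio.charset.StandardCharsets;
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          // Hypothetical sketch: the znode's data is the authoritative list of
          // active shared edits dirs. The active NN drops a dir from the list
          // before acking any write that failed against it; the SBN reads the
          // list when it takes over.
          public class ActiveSharedDirs {
            static final String ZNODE = "/hdfs/shared-edits-dirs"; // assumed path

            // SBN side: dirs it must read from at failover time.
            static List<String> readActiveDirs(ZooKeeper zk)
                throws KeeperException, InterruptedException {
              byte[] data = zk.getData(ZNODE, false, new Stat());
              return Arrays.asList(
                  new String(data, StandardCharsets.UTF_8).split("\n"));
            }

            // Active NN side: mark a dir out-of-date before acking the write.
            static void markDirFailed(ZooKeeper zk, String failedDir)
                throws KeeperException, InterruptedException {
              Stat stat = new Stat();
              byte[] data = zk.getData(ZNODE, false, stat);
              List<String> dirs = new ArrayList<String>(Arrays.asList(
                  new String(data, StandardCharsets.UTF_8).split("\n")));
              dirs.remove(failedDir);
              // Conditional update on the znode version guards against races.
              zk.setData(ZNODE,
                  String.join("\n", dirs).getBytes(StandardCharsets.UTF_8),
                  stat.getVersion());
            }
          }
          {code}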

          Any other solutions I'm not thinking of here?

          Eli Collins added a comment -

          #1 Expecting both independent failures (eg dirs on two separate mounts backed by different underlying volumes may fail independently) and dependent failures (eg if the connection to the same server is severed).

          #2 Yes, a 2nd or 3rd shared dir is just like the first. Eg if the primary or standby marked a shared dir as bad, they could switch to another.

          Sanjay Radia added a comment -

          2 questions:

          • If you have multiple dirs within the same NFS server, are you expecting them to fail independently?
          • For multiple NFS servers, will the standby be expected to read from any/all of them?

            People

            • Assignee: Unassigned
            • Reporter: Aaron T. Myers
            • Votes: 0
            • Watchers: 11
