HDFS-7121: For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed.

Sub-task of HDFS-6185, HDFS operational and debuggability improvements (Hadoop HDFS).

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: journal-node
    • Labels: None

      Description

      Several JournalNode operations are not satisfied by a quorum; they must succeed on every JournalNode in the cluster. If such an operation succeeds on some nodes but fails on others, it may leave the nodes in an inconsistent state and force operators to perform manual recovery steps. For example, if doPreUpgrade succeeds on 2 nodes and fails on 1 node, the operator must correct the problem on the failed node and also manually restore the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade.
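      For concreteness, here is a minimal sketch of the directory transition involved, using plain java.nio.file calls; it simplifies the real JournalNode storage code, and the method names and paths are illustrative only:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PreUpgradeSketch {
  // Simplified view of the per-node step: rename the live "current"
  // directory aside to "previous.tmp". If this fails on one node after
  // succeeding on others, the cluster is left in a mixed state.
  static void doPreUpgrade(Path storageDir) throws IOException {
    Files.move(storageDir.resolve("current"),
               storageDir.resolve("previous.tmp"),
               StandardCopyOption.ATOMIC_MOVE);
  }

  // The manual recovery the description mentions: on each node where
  // the rename succeeded, move previous.tmp back to current before
  // reattempting the upgrade.
  static void manualRestore(Path storageDir) throws IOException {
    Files.move(storageDir.resolve("previous.tmp"),
               storageDir.resolve("current"),
               StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}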

        Activity

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Moving bugs out of previously closed releases into the next minor release 2.8.0.

        cmccabe Colin P. McCabe added a comment -

        Sounds good. Thanks for working on this.

        cnauroth Chris Nauroth added a comment -

        I agree that a pre-check strategy is likely to be good enough. Upgrade and rollback are operations that execute infrequently. Typically they're done during periods of low activity on the cluster with a close watch by an admin. Clients can't really connect anyway. It's highly likely that a pre-check would expose potential problems in most realistic scenarios, and the inherent time-of-check/time-of-use race condition is unlikely to happen. I'm going to draft a prototype patch for a pre-check RPC.

        cmccabe Colin P. McCabe added a comment - edited

        Good point. I wasn't thinking of that failure case.

        I think a "pre-check" should include checking that we have the ability to write to the target directory. POSIX has access() for this... this might not be accessible from Java, but we could get something similar by creating a directory there with a random UUID and then immediately deleting it. If we can do that, then it's almost certain that we can do the rename later, barring something exotic like ACLs or selinux.

        Of course, even if we did two-phase commit, we'd still have to do something meaningful in the "promise" phase. That would mean doing exactly this check that the filesystem permissions were sane. Otherwise the node would be making a promise it couldn't keep.

        I don't like the "have everyone do the rename and undo everyone if someone fails" solution that you mentioned earlier. I think it's rather complex and has a lot of weird corner cases (like if undo fails). Also, it seems confusingly similar to rollback (or maybe that's just me?)

        cnauroth Chris Nauroth added a comment -

        > I think it's probably good enough to just check if all JournalNodes are present before sending out the doPreUpgrade message.

        Hi Colin. This is coming out of a production support issue in which some invalid file system permissions caused the rename from current to previous.tmp to fail on 1 out of 3 JournalNodes. There weren't any nodes down. A pre-check like you suggested wouldn't have helped protect against this, because the failure wouldn't show up until actually attempting to do the work.

        cmccabe Colin P. McCabe added a comment -

        I think it's probably good enough to just check if all JournalNodes are present before sending out the doPreUpgrade message. This guards against the administrative misconfiguration case, or the case where one or more journal nodes are down. It's true that we could experience a failure in between that check and the pre-upgrade operation, but the chances of that happening are very low. If it does happen, it will simply result in a JN being dropped out of the quorum later, which monitoring tools will pick up, and admins will fix. I'm pretty sure that there isn't a complete solution to this problem because it can be reduced to the Two Generals Problem.
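        A rough sketch of that presence check; JournalProxy and its methods are hypothetical stand-ins, not the actual QuorumJournalManager API:

{code:java}
import java.io.IOException;
import java.util.List;

// Hypothetical per-node RPC handle for illustration only.
interface JournalProxy {
  void ping() throws IOException;          // liveness/reachability probe
  void doPreUpgrade() throws IOException;  // the non-quorum operation
}

class PreUpgradeDriver {
  static void preUpgradeAll(List<JournalProxy> journalNodes) throws IOException {
    // Phase 1: require every node (not just a quorum) to answer.
    for (JournalProxy jn : journalNodes) {
      jn.ping();  // throws if a node is down or misconfigured
    }
    // Phase 2: only now fan out the operation. A node can still fail
    // between the check and the call (Two Generals), but the window
    // is small, and monitoring will catch a dropped JN afterwards.
    for (JournalProxy jn : journalNodes) {
      jn.doPreUpgrade();
    }
  }
}
{code}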

        cnauroth Chris Nauroth added a comment -

        I don't have a specific design in mind yet, so brainstorming comments are welcome. Possible ideas so far are:

        1. If the QuorumJournalManager client gets an exception on any node, then send a corresponding undo message to the nodes that previously completed the operation successfully. This would be best effort only, because a well-timed network failure could prevent delivery of the undo message, and that JournalNode would still be left in an inconsistent state. (A rough sketch of this follows the list.)
        2. Do a full-fledged multi-phase commit. The operations involved are executed only rarely as "offline" events like software upgrade and rollback, so I don't expect typical criticisms of scalability on multi-phase commit protocols would be a problem here.
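        A rough shape of idea 1; NodeOp and its methods are hypothetical, not actual Hadoop types:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-JournalNode handle for illustration only.
interface NodeOp {
  void apply() throws IOException;
  void undo() throws IOException;
}

class BestEffortUndo {
  static void runOnAll(List<NodeOp> nodes) throws IOException {
    List<NodeOp> succeeded = new ArrayList<>();
    try {
      for (NodeOp n : nodes) {
        n.apply();
        succeeded.add(n);
      }
    } catch (IOException failure) {
      // Best effort only: an undo can itself fail (e.g. a well-timed
      // network failure), leaving that node inconsistent anyway.
      for (NodeOp done : succeeded) {
        try {
          done.undo();
        } catch (IOException ignored) {
          // Node left inconsistent; manual recovery required.
        }
      }
      throw failure;
    }
  }
}
{code}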

          People

          • Assignee: cnauroth Chris Nauroth
          • Reporter: cnauroth Chris Nauroth
          • Votes: 0
          • Watchers: 5
