Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: QuorumJournalManager (HDFS-3077)
    • Component/s: ha
    • Labels:
      None

      Description

      This fixes a bug in the HDFS-3077 branch:
      If one of the JNs goes down and comes back up, it may have a discontiguous set of logs in its storage directories. A later getEditLogManifest call to that JN will then return a manifest with gaps: e.g., it may list txns 1-3 and 7-9 but be missing 4-6. This is OK because a quorum of JNs is guaranteed to have those edits elsewhere.

      This situation currently causes an exception, since the RemoteEditLogManifest constructor verifies that the returned log files form a contiguous set of edits. That assumption held for the NameNode but no longer holds for this new usage.
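
      To illustrate the failure mode described above, here is a minimal sketch of a manifest whose constructor enforces contiguity, with a switch to relax it. This is not the attached patch and not the real RemoteEditLogManifest: LogSegment, ManifestSketch, and the requireContiguous flag are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one edit log segment covering [startTxId, endTxId]. */
final class LogSegment {
    final long startTxId;
    final long endTxId;

    LogSegment(long startTxId, long endTxId) {
        this.startTxId = startTxId;
        this.endTxId = endTxId;
    }
}

/** Hypothetical manifest; the contiguity check is optional (an assumption of this sketch). */
final class ManifestSketch {
    final List<LogSegment> segments;

    ManifestSketch(List<LogSegment> sortedSegments, boolean requireContiguous) {
        if (requireContiguous) {
            checkContiguous(sortedSegments);
        }
        this.segments = new ArrayList<>(sortedSegments);
    }

    /** Throws if a segment does not start exactly one txid after the previous one ends. */
    static void checkContiguous(List<LogSegment> sortedSegments) {
        LogSegment prev = null;
        for (LogSegment cur : sortedSegments) {
            if (prev != null && cur.startTxId != prev.endTxId + 1) {
                throw new IllegalStateException(
                    "Gap in manifest between txid " + prev.endTxId
                        + " and txid " + cur.startTxId);
            }
            prev = cur;
        }
    }
}
```

      With segments 1-3 and 7-9, constructing the sketch with requireContiguous set to true throws, which mirrors the exception described above; with false, the gappy manifest is accepted and the caller is responsible for obtaining txns 4-6 from another JN.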

      1. hdfs-3725.txt
        8 kB
        Todd Lipcon

        Activity

        Todd Lipcon added a comment -

        Thanks, committed to branch.

        Aaron T. Myers added a comment -

        Makes sense. Thanks for the explanation.

        +1, the patch looks good to me.

        Todd Lipcon added a comment -

        I think this is safe, because we do the same checking for gaps on the client side while we're loading. On loading each transaction, we verify that it's the "expected" one (i.e., the previous transaction + 1).
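
        The client-side check described in this comment can be sketched roughly as follows. This is an illustration only, not the actual loader code: EditLoaderSketch and loadEdits are hypothetical names, and transactions are reduced to bare txids.

```java
/** Hypothetical loader that verifies each txid is the previous txid + 1. */
final class EditLoaderSketch {

    /**
     * Walks the txids in order and returns the last one applied.
     * Throws if any txid is not the expected next value.
     */
    static long loadEdits(long[] txIds, long expectedStartTxId) {
        long expected = expectedStartTxId;
        for (long txId : txIds) {
            if (txId != expected) {
                throw new IllegalStateException(
                    "Expected txid " + expected + " but read txid " + txId);
            }
            // A real loader would apply the transaction here.
            expected++;
        }
        return expected - 1;
    }
}
```

        Under this sketch, a stream that jumps from txid 3 to txid 7 fails at load time, so a gap that slips through a relaxed manifest check is still caught before it is applied.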

        Aaron T. Myers added a comment -

        The patch looks pretty good to me. One question for you, though: does loosening this check for the QJM case not unnecessarily weaken the check in the non-QJM case?

        Todd Lipcon added a comment -

        This depends on HADOOP-8624 to increase the log level for RPCs. It made the test easier to write and understand. But if that one gets held up for any reason, we can just remove the line which enables trace logging.

        Todd Lipcon added a comment -

        The attached patch fixes the issue and adds a test. The test fails without the bug fix and passes with it.


          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Todd Lipcon
          • Votes:
            0
            Watchers:
            3
