
seen_txid in the shared edits directory is modified during bootstrapping

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: ha, namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The last known transaction id is stored in the seen_txid file of all known directories of an NNStorage when starting a new edit segment. However, we have seen a case where it contained an id that falls in the middle of an edit segment. This was the seen_txid file in the shared edits directory. The seen_txid in the active namenode's local storage looked valid.
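      The anomaly described above can be checked mechanically. The following is a minimal sketch, not code from Hadoop: the class `SeenTxidCheck` and the method `fallsMidSegment` are hypothetical names, and it assumes edit segment files are named `edits_<first>-<last>` when finalized and `edits_inprogress_<first>` while open. It flags a seen_txid that lies past the first transaction id of a segment containing it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeenTxidCheck {
    // Assumed file-name shapes for edit segments in a storage directory.
    private static final Pattern FINALIZED = Pattern.compile("edits_(\\d+)-(\\d+)");
    private static final Pattern IN_PROGRESS = Pattern.compile("edits_inprogress_(\\d+)");

    /** True if seenTxid is past the first txid of a segment that contains it. */
    static boolean fallsMidSegment(long seenTxid, List<String> segmentFileNames) {
        for (String name : segmentFileNames) {
            Matcher finalized = FINALIZED.matcher(name);
            if (finalized.matches()) {
                long first = Long.parseLong(finalized.group(1));
                long last = Long.parseLong(finalized.group(2));
                if (seenTxid > first && seenTxid <= last) {
                    return true;
                }
                continue;
            }
            Matcher inProgress = IN_PROGRESS.matcher(name);
            if (inProgress.matches() && seenTxid > Long.parseLong(inProgress.group(1))) {
                // An open segment has no known end, so anything past its start counts.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up segment names and txids, for illustration only.
        List<String> names = Arrays.asList(
            "edits_0000000000000000001-0000000000000000100",
            "edits_inprogress_0000000000000000101");
        System.out.println(fallsMidSegment(101L, names)); // prints false
        System.out.println(fallsMidSegment(157L, names)); // prints true
    }
}
```

      With these made-up names, 101 is the start of the open segment and passes, while 157 falls inside it and is flagged.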

        Activity

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing the JIRA as part of 2.7.3 release.

        zhz Zhe Zhang added a comment -

        Thanks for clarifying this Kihwal.

        kihwal Kihwal Lee added a comment -

        I haven't seen it actually creating problems other than our internal monitoring complaining about it. Since it is rare to do bootstrapStandby to existing HA clusters, we don't have many data points.

        zhz Zhe Zhang added a comment -

        Kihwal Lee / Daryn Sharp Thanks for reporting and fixing the issue.

        I wonder what the symptom of the bug was. Was it just a wrong seen_txid file, or was it causing other issues like a crash or corruption?

        I agree that the standby should not touch the shared edits dir at all. But more fundamentally, why does it even write a new seen_txid to the image dir? Before the change, we were always updating the seen_txid in all NNStorage dirs consistently. Could there be an issue if the image and edit dirs have different seen_txid values?

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8991 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8991/)
        HDFS-9533. seen_txid in the shared edits directory is modified during (kihwal: rev 5cb1e0118b173a95c1f7bdfae1e58d7833d61c26)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/FSImageTestUtil.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/BootstrapStandby.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestBootstrapStandby.java
        kihwal Kihwal Lee added a comment -

        Thanks for the review, Daryn. I've committed this to trunk, branch-2, branch-2.8 and branch-2.7. There was some significant context difference between trunk and branch-2 in the test case, but no added lines needed to be changed.

        daryn Daryn Sharp added a comment -

        +1. Fully agree non-active has no business touching the shared dir.

        kihwal Kihwal Lee added a comment -

        The test failures are not related to this patch. Besides, there is no intersection between the sets of failed tests in jdk7 and jdk8.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 0s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
        +1 mvninstall 10m 7s trunk passed
        +1 compile 1m 9s trunk passed with JDK v1.8.0_66
        +1 compile 0m 55s trunk passed with JDK v1.7.0_91
        +1 checkstyle 0m 21s trunk passed
        +1 mvnsite 1m 10s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 2m 25s trunk passed
        +1 javadoc 1m 39s trunk passed with JDK v1.8.0_66
        +1 javadoc 2m 22s trunk passed with JDK v1.7.0_91
        +1 mvninstall 1m 3s the patch passed
        +1 compile 1m 5s the patch passed with JDK v1.8.0_66
        +1 javac 1m 5s the patch passed
        +1 compile 0m 56s the patch passed with JDK v1.7.0_91
        +1 javac 0m 56s the patch passed
        +1 checkstyle 0m 21s the patch passed
        +1 mvnsite 1m 9s the patch passed
        +1 mvneclipse 0m 16s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 2m 37s the patch passed
        +1 javadoc 1m 34s the patch passed with JDK v1.8.0_66
        +1 javadoc 2m 30s the patch passed with JDK v1.7.0_91
        -1 unit 78m 16s hadoop-hdfs in the patch failed with JDK v1.8.0_66.
        -1 unit 78m 21s hadoop-hdfs in the patch failed with JDK v1.7.0_91.
        -1 asflicense 0m 27s Patch generated 56 ASF License warnings.
        192m 33s



        Reason Tests
        JDK v1.8.0_66 Failed junit tests hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
          hadoop.hdfs.security.TestDelegationTokenForProxyUser
          hadoop.hdfs.TestDatanodeDeath
          hadoop.hdfs.TestLeaseRecovery2
        JDK v1.7.0_91 Failed junit tests hadoop.hdfs.server.namenode.TestFSImageWithAcl
          hadoop.hdfs.qjournal.TestSecureNNWithQJM
          hadoop.hdfs.TestDFSUpgradeFromImage
          hadoop.hdfs.server.namenode.TestFsck



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:0ca8df7
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12777179/HDFS-9533.patch
        JIRA Issue HDFS-9533
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux ad583ac7ac20 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 7fb212e
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/13861/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/13861/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13861/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13861/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt
        JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13861/testReport/
        asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/13861/artifact/patchprocess/patch-asflicense-problems.txt
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Max memory used 76MB
        Powered by Apache Yetus 0.1.0 http://yetus.apache.org
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13861/console

        This message was automatically generated.

        kihwal Kihwal Lee added a comment -

        It turned out a standby namenode was bootstrapped.
        This is from BootstrapStandby#downloadImage() or doRun() in 2.6.

        // 1
        long curTxId = proxy.getTransactionID();
        
        // 2
        image.initEditLog(StartupOption.REGULAR);
        
        // 3
        image.getStorage().writeTransactionIdFileToStorage(curTxId);
        

        (1) gets the current txid from the active node via RPC. (2) causes editLog.initSharedJournalsForRead() to be called, and the NNStorage will contain the shared edits directory after that. When (3) is called, the txid obtained in (1) will be written to the shared edits directory.

        No matter what the intention of this code was, the shared edits directory shouldn't be altered by a non-active namenode.
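        The effect of the three steps can be reproduced with a toy model. This is only a sketch: `BootstrapSketch`, `ToyStorage`, `Dir`, `writeTxidToAll` and `writeTxidTo` are hypothetical names, not Hadoop classes or methods, and the txids are made up. It shows how a write to every known directory overwrites the shared directory's value once step (2) has registered it, and how restricting the write by directory type would leave it alone. Restricting by type is one possible way to honor the rule above; the snippet does not claim to be the committed change.

```java
import java.util.ArrayList;
import java.util.List;

class BootstrapSketch {
    enum DirType { IMAGE, EDITS }

    /** Hypothetical stand-in for a storage directory and its seen_txid file. */
    static final class Dir {
        final String name;
        final DirType type;
        long seenTxid;

        Dir(String name, DirType type, long seenTxid) {
            this.name = name;
            this.type = type;
            this.seenTxid = seenTxid;
        }
    }

    /** Toy model of a storage object: just the directories it knows about. */
    static final class ToyStorage {
        final List<Dir> dirs = new ArrayList<>();

        // Mirrors step (3): every known directory gets the new value.
        void writeTxidToAll(long txid) {
            for (Dir d : dirs) {
                d.seenTxid = txid;
            }
        }

        // Alternative: only directories of the requested type are written.
        void writeTxidTo(DirType type, long txid) {
            for (Dir d : dirs) {
                if (d.type == type) {
                    d.seenTxid = txid;
                }
            }
        }
    }

    /** Storage as the bootstrapping standby sees it after step (2). */
    static ToyStorage afterInitEditLog() {
        ToyStorage s = new ToyStorage();
        s.dirs.add(new Dir("local-image", DirType.IMAGE, 0L));
        // Step (2) registers the shared edits dir; assume the active last wrote 101 there.
        s.dirs.add(new Dir("shared-edits", DirType.EDITS, 101L));
        return s;
    }

    public static void main(String[] args) {
        long curTxId = 157L; // step (1): made-up current txid reported by the active

        ToyStorage all = afterInitEditLog();
        all.writeTxidToAll(curTxId);
        System.out.println("write to all: shared seen_txid = " + all.dirs.get(1).seenTxid); // 157

        ToyStorage restricted = afterInitEditLog();
        restricted.writeTxidTo(DirType.IMAGE, curTxId);
        System.out.println("image only: shared seen_txid = " + restricted.dirs.get(1).seenTxid); // 101
    }
}
```

        In the first run the shared directory ends up with 157, a value in the middle of the active's open segment; in the second it keeps 101.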


          People

          • Assignee:
            kihwal Kihwal Lee
            Reporter:
            kihwal Kihwal Lee