Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-9430

Remove waitForLoadingFSImage since checkNNStartup has ensured image loaded and namenode started.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      After initial startup,loading fsimage in between happens only for secondarynn..But it wn't server any client requests.

      So IMO it would be ok to remove it.

      Please check commement here

      1. HDFS-9430.patch
        7 kB
        Brahma Reddy Battula

        Activity

        Hide
        brahmareddy Brahma Reddy Battula added a comment -

        Thanks a lot Ming Ma for review and commit,And thanks to others for additional review.

        Show
        brahmareddy Brahma Reddy Battula added a comment - Thanks a lot Ming Ma for review and commit,And thanks to others for additional review.
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #665 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/665/)
        HDFS-9430 Remove waitForLoadingFSImage since checkNNStartup has ensured (mingma: rev 3fa33b5c2c289ceaced30c6c5451f3569110459d)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #665 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/665/ ) HDFS-9430 Remove waitForLoadingFSImage since checkNNStartup has ensured (mingma: rev 3fa33b5c2c289ceaced30c6c5451f3569110459d) hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8920 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8920/)
        HDFS-9430 Remove waitForLoadingFSImage since checkNNStartup has ensured (mingma: rev 3fa33b5c2c289ceaced30c6c5451f3569110459d)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #8920 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8920/ ) HDFS-9430 Remove waitForLoadingFSImage since checkNNStartup has ensured (mingma: rev 3fa33b5c2c289ceaced30c6c5451f3569110459d) hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        Hide
        mingma Ming Ma added a comment -

        Committed to trunk and branch-2. Thanks Brahma Reddy Battula for the contribution. Thanks Vinayakumar B Tsz Wo Nicholas Sze Haohui Mai for the review.

        Show
        mingma Ming Ma added a comment - Committed to trunk and branch-2. Thanks Brahma Reddy Battula for the contribution. Thanks Vinayakumar B Tsz Wo Nicholas Sze Haohui Mai for the review.
        Hide
        mingma Ming Ma added a comment -

        Thanks all. I agree with the above analysis and the patch is good to go. For the http scenario, NameNodeHttpServer is started up before fsimage load. So it seems possible for NamenodeFsck#fsck to get a NPE when it tries to access RPC server. But that is ok as the jetty thread will catch that exception and NN startup won't be impacted.

        Show
        mingma Ming Ma added a comment - Thanks all. I agree with the above analysis and the patch is good to go. For the http scenario, NameNodeHttpServer is started up before fsimage load. So it seems possible for NamenodeFsck#fsck to get a NPE when it tries to access RPC server. But that is ok as the jetty thread will catch that exception and NN startup won't be impacted.
        Hide
        vinayrpet Vinayakumar B added a comment -

        FsckServlet may not be okay.

        NamenodeFsck#fsck also uses rpcserver for calls. So this also fine.

        Also need to check the html and js files.

        html and js files completely depends on exposed jmx metrics. None of them call RPCs which have waitForLoadingFsImage().
        so thats also fine.

        And as mentioned by brahma, all other calls are safeguarded in nnrpc server.
        So its good to go.

        Show
        vinayrpet Vinayakumar B added a comment - FsckServlet may not be okay. NamenodeFsck#fsck also uses rpcserver for calls. So this also fine. Also need to check the html and js files. html and js files completely depends on exposed jmx metrics. None of them call RPCs which have waitForLoadingFsImage() . so thats also fine. And as mentioned by brahma, all other calls are safeguarded in nnrpc server. So its good to go.
        Hide
        brahmareddy Brahma Reddy Battula added a comment -

        I agree that the RPC server is fine. We should check the web interface:

        I have checked the call heirarchy of all FSNamesystem calls where waitForLoadingFsImage() is present. Everything goes through RPC server, except test calls and LeaseManager#checkLeases().
        1. TestCalls will be safe, as MiniDfsCluster ensures that it waits till complete startup of cluster.
        2. LeaseManager#checkLeases() will be called only when NN is out of safemode, which inturn ensures that image is already loaded.

        So IMO, nothing should be affected by removal of waitForLoadingFsImage().

        Show
        brahmareddy Brahma Reddy Battula added a comment - I agree that the RPC server is fine. We should check the web interface: I have checked the call heirarchy of all FSNamesystem calls where waitForLoadingFsImage() is present. Everything goes through RPC server, except test calls and LeaseManager#checkLeases() . 1. TestCalls will be safe, as MiniDfsCluster ensures that it waits till complete startup of cluster. 2. LeaseManager#checkLeases() will be called only when NN is out of safemode, which inturn ensures that image is already loaded. So IMO, nothing should be affected by removal of waitForLoadingFsImage() .
        Hide
        szetszwo Tsz Wo Nicholas Sze added a comment -

        > My main question was around the order of fsimage loading and RPC server startup. ...

        I agree that the RPC server is fine. We should check the web interface:

        • WebHDFS is fine since it uses rpc server.
        • FsckServlet may not be okay.
        • Also need to check the html and js files.
        Show
        szetszwo Tsz Wo Nicholas Sze added a comment - > My main question was around the order of fsimage loading and RPC server startup. ... I agree that the RPC server is fine. We should check the web interface: WebHDFS is fine since it uses rpc server. FsckServlet may not be okay. Also need to check the html and js files.
        Hide
        szetszwo Tsz Wo Nicholas Sze added a comment -

        waitForLoadingFSImage() and checkNNStartup() safeguards FSNamesystem and NameNodeRpcServer, respectively. In namenode, waitForLoadingFSImage() is not needed as mentioned.

        However, FSNamesystem is also used in SecondaryNameNode and BackupNode and they don't have checkNNStartup(). We may forget about BackupNode since it is not working for a long time anyway. We need to make sure SecondaryNameNode still work correctly after the change.

        Show
        szetszwo Tsz Wo Nicholas Sze added a comment - waitForLoadingFSImage() and checkNNStartup() safeguards FSNamesystem and NameNodeRpcServer, respectively. In namenode, waitForLoadingFSImage() is not needed as mentioned. However, FSNamesystem is also used in SecondaryNameNode and BackupNode and they don't have checkNNStartup(). We may forget about BackupNode since it is not working for a long time anyway. We need to make sure SecondaryNameNode still work correctly after the change.
        Hide
        wheat9 Haohui Mai added a comment -

        It's okay to remove the waitForLoadingFsImage() function but I think it requires more refactoring to get things right.

        I think its fair to remove waitForLoadingFsImage() but leave the tracker FSNamesystem#imageLoaded and get/setter for it as is.

        Today the paradigm is broken. We assume that when replaying editlogs that the NN ops will call functions (i.e., unprotected*)that bypass all checks including quotas, etc. This is clearly broken for all callers of fsn.isImageLoaded() and fsd.shouldSkipQuotaChecks(). I think it's okay to file a follow-up jira to address this.

        Show
        wheat9 Haohui Mai added a comment - It's okay to remove the waitForLoadingFsImage() function but I think it requires more refactoring to get things right. I think its fair to remove waitForLoadingFsImage() but leave the tracker FSNamesystem#imageLoaded and get/setter for it as is. Today the paradigm is broken. We assume that when replaying editlogs that the NN ops will call functions (i.e., unprotected* )that bypass all checks including quotas, etc. This is clearly broken for all callers of fsn.isImageLoaded() and fsd.shouldSkipQuotaChecks() . I think it's okay to file a follow-up jira to address this.
        Hide
        mingma Ming Ma added a comment -

        Thanks Vinayakumar B for the explanation. Make sense to keep the FSNamesystem#imageLoaded get/setter around for quota check during edit log replay at NN startup.

        My main question was around the order of fsimage loading and RPC server startup. In {NameNode}}'s initialize function, loadNamesystem sets imageLoaded to true and it is called before RPC server is started. Thus when RPC methods are processed, imageLoaded should have been set to true.

            loadNamesystem(conf);
            rpcServer = createRpcServer(conf);
        

        I will wait until the end of the week for Tsz Wo Nicholas Sze and Haohui Mai to provide any additional comments they might have before commit.

        Show
        mingma Ming Ma added a comment - Thanks Vinayakumar B for the explanation. Make sense to keep the FSNamesystem#imageLoaded get/setter around for quota check during edit log replay at NN startup. My main question was around the order of fsimage loading and RPC server startup. In {NameNode}}'s initialize function, loadNamesystem sets imageLoaded to true and it is called before RPC server is started. Thus when RPC methods are processed, imageLoaded should have been set to true. loadNamesystem(conf); rpcServer = createRpcServer(conf); I will wait until the end of the week for Tsz Wo Nicholas Sze and Haohui Mai to provide any additional comments they might have before commit.
        Hide
        vinayrpet Vinayakumar B added a comment -

        +1 for the current patch, unless Tsz Wo Nicholas Sze or Haohui Mai doesnt agree with my above points.

        Show
        vinayrpet Vinayakumar B added a comment - +1 for the current patch, unless Tsz Wo Nicholas Sze or Haohui Mai doesnt agree with my above points.
        Hide
        vinayrpet Vinayakumar B added a comment -

        In addition, should we remove the tracking of FSNamesystem#imageLoaded altogether?

        I think tracking is used to decide to whether quota check is required or not. Quota check not required during editlog processing, during which FSNamesystem#imageLoaded will be false.
        So, IMO, removing of this tracker not required.

        Haohui Mai could clarify the intention of checking imageLoaded during RPC processing

        IMO, waitForLoadingFsImage() was trying to ensure full loading of image before serving RPC requests. This was a movement one level higher from FSDirectory to FSNameSystem in HDFS-6480 and included RPC requests as well. At the time this check was added, checkNNStartup() was not present. checkNNStartup() was added to ensure full services startup,(mainly transition RPCs which doesnt come under waitForLoadingFsImage() check), along with fsImage load in HDFS-3443 which committed in Jan-2015.
        I think its fair to remove waitForLoadingFsImage() but leave the tracker FSNamesystem#imageLoaded and get/setter for it as is.

        Show
        vinayrpet Vinayakumar B added a comment - In addition, should we remove the tracking of FSNamesystem#imageLoaded altogether? I think tracking is used to decide to whether quota check is required or not. Quota check not required during editlog processing, during which FSNamesystem#imageLoaded will be false. So, IMO, removing of this tracker not required. Haohui Mai could clarify the intention of checking imageLoaded during RPC processing IMO, waitForLoadingFsImage() was trying to ensure full loading of image before serving RPC requests. This was a movement one level higher from FSDirectory to FSNameSystem in HDFS-6480 and included RPC requests as well. At the time this check was added, checkNNStartup() was not present. checkNNStartup() was added to ensure full services startup,(mainly transition RPCs which doesnt come under waitForLoadingFsImage() check), along with fsImage load in HDFS-3443 which committed in Jan-2015. I think its fair to remove waitForLoadingFsImage() but leave the tracker FSNamesystem#imageLoaded and get/setter for it as is.
        Hide
        mingma Ming Ma added a comment -

        Thanks Brahma Reddy Battula. Overall looks good. In addition, should we remove the tracking of FSNamesystem#imageLoaded altogether? Haohui Mai could clarify the intention of checking imageLoaded during RPC processing; it seems unnecessary given RPC server won't start up until fsimage has been loaded. Vinayakumar B and Tsz Wo Nicholas Sze might be able to confirm that checkNNStartup should be enough.

        Show
        mingma Ming Ma added a comment - Thanks Brahma Reddy Battula . Overall looks good. In addition, should we remove the tracking of FSNamesystem#imageLoaded altogether? Haohui Mai could clarify the intention of checking imageLoaded during RPC processing; it seems unnecessary given RPC server won't start up until fsimage has been loaded. Vinayakumar B and Tsz Wo Nicholas Sze might be able to confirm that checkNNStartup should be enough.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 0s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 8m 3s trunk passed
        +1 compile 0m 43s trunk passed with JDK v1.8.0_66
        +1 compile 0m 48s trunk passed with JDK v1.7.0_85
        +1 checkstyle 0m 18s trunk passed
        +1 mvnsite 0m 56s trunk passed
        +1 mvneclipse 0m 14s trunk passed
        +1 findbugs 2m 9s trunk passed
        +1 javadoc 1m 17s trunk passed with JDK v1.8.0_66
        +1 javadoc 2m 4s trunk passed with JDK v1.7.0_85
        +1 mvninstall 0m 55s the patch passed
        +1 compile 0m 53s the patch passed with JDK v1.8.0_66
        +1 javac 0m 53s the patch passed
        +1 compile 0m 47s the patch passed with JDK v1.7.0_85
        +1 javac 0m 47s the patch passed
        -1 checkstyle 0m 18s Patch generated 2 new checkstyle issues in hadoop-hdfs-project/hadoop-hdfs (total was 248, now 248).
        +1 mvnsite 0m 59s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 2m 22s the patch passed
        +1 javadoc 1m 14s the patch passed with JDK v1.8.0_66
        +1 javadoc 2m 1s the patch passed with JDK v1.7.0_85
        -1 unit 64m 34s hadoop-hdfs in the patch failed with JDK v1.8.0_66.
        -1 unit 59m 38s hadoop-hdfs in the patch failed with JDK v1.7.0_85.
        -1 asflicense 0m 20s Patch generated 56 ASF License warnings.
        153m 52s



        Reason Tests
        JDK v1.8.0_66 Failed junit tests hadoop.hdfs.TestBlockStoragePolicy
          hadoop.hdfs.server.datanode.TestBlockScanner
          hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
          hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA
        JDK v1.7.0_85 Failed junit tests hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
          hadoop.hdfs.server.datanode.TestBlockReplacement
          hadoop.hdfs.TestDFSStripedOutputStreamWithFailure170



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:0ca8df7
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12775071/HDFS-9430.patch
        JIRA Issue HDFS-9430
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 55e9b8b427d2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 485c346
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_85.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_85.txt
        JDK v1.7.0_85 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13713/testReport/
        asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-asflicense-problems.txt
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Max memory used 76MB
        Powered by Apache Yetus http://yetus.apache.org
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13713/console

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 0s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 8m 3s trunk passed +1 compile 0m 43s trunk passed with JDK v1.8.0_66 +1 compile 0m 48s trunk passed with JDK v1.7.0_85 +1 checkstyle 0m 18s trunk passed +1 mvnsite 0m 56s trunk passed +1 mvneclipse 0m 14s trunk passed +1 findbugs 2m 9s trunk passed +1 javadoc 1m 17s trunk passed with JDK v1.8.0_66 +1 javadoc 2m 4s trunk passed with JDK v1.7.0_85 +1 mvninstall 0m 55s the patch passed +1 compile 0m 53s the patch passed with JDK v1.8.0_66 +1 javac 0m 53s the patch passed +1 compile 0m 47s the patch passed with JDK v1.7.0_85 +1 javac 0m 47s the patch passed -1 checkstyle 0m 18s Patch generated 2 new checkstyle issues in hadoop-hdfs-project/hadoop-hdfs (total was 248, now 248). +1 mvnsite 0m 59s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 2m 22s the patch passed +1 javadoc 1m 14s the patch passed with JDK v1.8.0_66 +1 javadoc 2m 1s the patch passed with JDK v1.7.0_85 -1 unit 64m 34s hadoop-hdfs in the patch failed with JDK v1.8.0_66. -1 unit 59m 38s hadoop-hdfs in the patch failed with JDK v1.7.0_85. -1 asflicense 0m 20s Patch generated 56 ASF License warnings. 153m 52s Reason Tests JDK v1.8.0_66 Failed junit tests hadoop.hdfs.TestBlockStoragePolicy   hadoop.hdfs.server.datanode.TestBlockScanner   hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes   hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA JDK v1.7.0_85 Failed junit tests hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints   hadoop.hdfs.server.datanode.TestBlockReplacement   hadoop.hdfs.TestDFSStripedOutputStreamWithFailure170 Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12775071/HDFS-9430.patch JIRA Issue HDFS-9430 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 55e9b8b427d2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 485c346 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_85.txt unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_85.txt JDK v1.7.0_85 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13713/testReport/ asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/13713/artifact/patchprocess/patch-asflicense-problems.txt modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs Max memory used 76MB Powered by Apache Yetus http://yetus.apache.org Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13713/console This message was automatically generated.
        Hide
        brahmareddy Brahma Reddy Battula added a comment -

        Ming Ma Uploaded the patch..Kindly Review..thanks

        Show
        brahmareddy Brahma Reddy Battula added a comment - Ming Ma Uploaded the patch..Kindly Review..thanks

          People

          • Assignee:
            brahmareddy Brahma Reddy Battula
            Reporter:
            brahmareddy Brahma Reddy Battula
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development