Hadoop HDFS / HDFS-12984

BlockPoolSlice can leak in a mini dfs cluster

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.5
    • Fix Version/s: 3.1.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When running some unit tests for Storm we found that we would occasionally get out-of-memory errors on the HDFS integration tests.

      When I took a heap dump I found that the ShutdownHookManager was full of BlockPoolSlice$1 instances, each of which holds a reference to its BlockPoolSlice, which in turn holds a reference to the DataNode, and so on.

      It looks like when shutdown is called on the BlockPoolSlice there is no way to remove the shutdown hook, because no reference to it is saved.
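      The leak pattern described above can be sketched as follows. This is a simplified, hypothetical model: HookLeakSketch and its HOOKS set are stand-ins for Hadoop's real ShutdownHookManager, not its actual API. The point is that an anonymously registered hook keeps its enclosing instance reachable for the life of the JVM, while keeping a reference and removing it on shutdown lets the slice be collected.

```java
import java.util.HashSet;
import java.util.Set;

public class HookLeakSketch {
    // Stand-in for a shutdown-hook registry: entries live until JVM exit.
    static final Set<Runnable> HOOKS = new HashSet<>();

    // Leaky variant: the hook is registered anonymously, so shutdown()
    // has nothing to remove. The anonymous inner class pins the slice,
    // which pins the DataNode, and so on.
    static class LeakySlice {
        LeakySlice() {
            HOOKS.add(new Runnable() {
                @Override public void run() { /* persist replica cache */ }
            });
        }
        void shutdown() { /* no reference saved, nothing to unregister */ }
    }

    // Fixed variant: keep a reference to the hook and remove it on
    // shutdown, so neither the hook nor the slice outlives the cluster.
    static class FixedSlice {
        private final Runnable hook;
        FixedSlice() {
            hook = new Runnable() {
                @Override public void run() { /* persist replica cache */ }
            };
            HOOKS.add(hook);
        }
        void shutdown() { HOOKS.remove(hook); }
    }
}
```

      After a LeakySlice is created and shut down, its hook is still registered; a FixedSlice leaves nothing behind.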

      Attachments

        1. HDFS-12984.001.patch
          2 kB
          Ajay Kumar
        2. Screen Shot 2018-01-05 at 4.38.06 PM.png
          82 kB
          Ajay Kumar
        3. Screen Shot 2018-01-05 at 5.26.54 PM.png
          277 kB
          Ajay Kumar
        4. Screen Shot 2018-01-05 at 5.31.52 PM.png
          422 kB
          Ajay Kumar

        Activity

          ajayydv Ajay Kumar added a comment -

          Hi revans2, thanks for reporting this issue. I tried to recreate it by setting up a MiniDFSCluster in a loop. It eventually runs out of heap memory, but I don't see it happening due to BlockPoolSlice (I took 15+ heap dumps on OOM and didn't find a single instance of BlockPoolSlice in any of them). However, there is a genuine OOM problem when a MiniDFSCluster is built and shut down repeatedly in a loop. In MiniDFSCluster#shutdown we are calling ShutdownHookManager#clearShutdownHooks, which removes all the shutdown hooks before they are called by the Runtime. I think this is not correct, as it defeats the purpose of a shutdown hook. I will attach an initial patch for review. On the bigger OOM issue in MiniDFSCluster, heap dumps show that 80-90% of memory is retained by entries in BlockMap, which has references in multiple classes.

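          The hazard called out above can be illustrated with a toy model. ToyHookManager below is a hypothetical stand-in for Hadoop's ShutdownHookManager (its method names mirror the real ones, but the class itself is invented for illustration): clearing every registered hook before the runtime runs them means cleanup registered by unrelated components silently never executes.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyHookManager {
    private final List<Runnable> hooks = new ArrayList<>();
    public final List<String> log = new ArrayList<>();

    public void addShutdownHook(Runnable r) { hooks.add(r); }

    // What MiniDFSCluster#shutdown was doing: drop every registered hook.
    public void clearShutdownHooks() { hooks.clear(); }

    // What the runtime would do at JVM exit: run whatever hooks remain.
    public void runHooksAtExit() { for (Runnable r : hooks) r.run(); }
}
```

          If another component registers a hook (say, to close a filesystem cache) and clearShutdownHooks() is then called, runHooksAtExit() runs nothing and that cleanup is lost.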

          revans2 Robert Joseph Evans added a comment -

          ajayydv,

          I also ran into issues trying to reproduce this in some environments. Specifically I could never make it happen on my MBP and I don't know why. But if you look at the code inside the BlockPoolSlice

          https://github.com/apache/hadoop/blob/01f3f2167ec20b52a18bc2cf250fb4229cfd2c14/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L165-L173

          If an instance of this is ever created it can never be collected. I am not sure why BlockPoolSlice instances are sometimes created by a MiniDFSCluster and not at other times. I am not familiar enough with the internals of the DataNode to say off the top of my head. Glad to see you going in the right direction, and I agree that removing everything from the ShutdownHooksManager is far from ideal, but I didn't see this happening, at least not with 2.7.5 and 2.6.2.

          ajayydv Ajay Kumar added a comment - edited

          revans2 thanks for the info. Attaching a patch to address the issue you raised. The OOM I am getting in MiniDFSCluster seems to be another issue altogether. Will create a separate jira for it.


          revans2 Robert Joseph Evans added a comment -

          Thanks ajayydv,

          Looks good to me. I am +1.

          kihwal,

          It has been a long time since I checked anything into Hadoop. Would you be willing to merge this in, and preferably take a look at it too?

          arp Arpit Agarwal added a comment -

          +1 pending Jenkins.

          genericqa genericqa added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
                trunk Compile Tests
          +1 mvninstall 19m 30s trunk passed
          +1 compile 1m 6s trunk passed
          +1 checkstyle 0m 51s trunk passed
          +1 mvnsite 1m 20s trunk passed
          +1 shadedclient 13m 19s branch has no errors when building and testing our client artifacts.
          +1 findbugs 2m 10s trunk passed
          +1 javadoc 1m 1s trunk passed
                Patch Compile Tests
          +1 mvninstall 1m 18s the patch passed
          +1 compile 1m 3s the patch passed
          +1 javac 1m 3s the patch passed
          +1 checkstyle 0m 43s hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 16 unchanged - 1 fixed = 16 total (was 17)
          +1 mvnsite 1m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 shadedclient 12m 8s patch has no errors when building and testing our client artifacts.
          +1 findbugs 2m 19s the patch passed
          +1 javadoc 0m 51s the patch passed
                Other Tests
          -1 unit 111m 9s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          170m 6s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestReconstructStripedBlocks
            hadoop.hdfs.server.namenode.TestReencryptionWithKMS



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-12984
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12905169/HDFS-12984.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
          uname Linux 7410a0c9d868 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision trunk / 12d0645
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_151
          findbugs v3.1.0-RC1
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/22638/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/22638/testReport/
          Max. process+thread count 3032 (vs. ulimit of 5000)
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/22638/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          ajayydv Ajay Kumar added a comment -

          revans2, arpitagarwal thanks for the review. The test failures are unrelated; both of them passed locally.

          arp Arpit Agarwal added a comment -

          revans2, any objections if I commit this? LGTM.


          revans2 Robert Joseph Evans added a comment -

          +1 for committing it.

          arp Arpit Agarwal added a comment -

          I've committed this. Thanks for reporting and reviewing this Robert.

          Thanks for the fix Ajay.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13485 (See https://builds.apache.org/job/Hadoop-trunk-Commit/13485/)
          HDFS-12984. BlockPoolSlice can leak in a mini dfs cluster. Contributed (arp: rev b278f7b29305cb67d22ef0bb08b067c422381f48)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
          ajayydv Ajay Kumar added a comment -

          revans2 thanks for reporting and review. arpitagarwal thanks for review and commit.


          People

            Assignee: ajayydv Ajay Kumar
            Reporter: revans2 Robert Joseph Evans
            Votes: 0
            Watchers: 8
