Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change
    • Release Note: The default value of "yarn.nodemanager.vmem-check-enabled" was changed to false.

Description

In our Hadoop 2 + Java 8 effort, we found that a few jobs are being killed by Hadoop due to excessive virtual memory allocation, although their physical memory usage is low.

      The most common error message is "Container [pid=??,containerID=container_??] is running beyond virtual memory limits. Current usage: 365.1 MB of 1 GB physical memory used; 3.2 GB of 2.1 GB virtual memory used. Killing container."

We see this problem for MR jobs as well as for Spark drivers/executors.

Issue Links

Activity

          kamrul Mohammad Kamrul Islam added a comment -

          My findings and quick resolutions:
          By default, Java 8 allocates more virtual memory than Java 7. However, we can control the non-heap memory usage by limiting the maximum allowed values of some JVM parameters, such as "-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m".
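
          Most of that extra space is reserved rather than resident (note the low physical vs. high virtual usage in the error message), and pmap on the container's JVM makes the reservations visible. A sketch; the pid is a placeholder:

          pmap -x <container-jvm-pid> | tail -n 1             # the "total kB" row: the JVM's virtual footprint
          pmap -x <container-jvm-pid> | sort -n -k 2 | tail   # the largest individual reservations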

          For M/R-based jobs (such as Pig, Hive, etc.), users can pass the following JVM -XX parameters as part of mapreduce.reduce.java.opts or mapreduce.map.java.opts:

          mapreduce.reduce.java.opts  '-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Xmx1536m -Xms512m -Djava.net.preferIPv4Stack=true'
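
          For jobs that go through the generic options parser (distcp, for example), the same opts can also be supplied per invocation with -D. A sketch; the source and destination paths are placeholders:

          hadoop distcp \
            -Dmapreduce.map.java.opts='-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Xmx768m' \
            hdfs://nn1/src hdfs://nn2/dst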
          

          Similarly, for Spark jobs we need to pass the same parameters to the Spark AM/driver and executors. The Spark community is working on ways to pass these types of parameters more easily. In Spark 1.1.0, users can pass them for cluster-mode job submission as follows. For general job submission, users have to wait until https://issues.apache.org/jira/browse/SPARK-4461 is released.

          spark.driver.extraJavaOptions = -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
          

          For Spark executor, pass the following.

           
          spark.executor.extraJavaOptions = -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
          

          These parameters can be set in conf/spark-defaults.conf as well.
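
          A sketch of the equivalent conf/spark-defaults.conf entries (whitespace-separated key/value form):

          spark.driver.extraJavaOptions      -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
          spark.executor.extraJavaOptions    -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m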

          jira.shegalov Gera Shegalov added a comment -

          Hi Mohammad Kamrul Islam, can you confirm your settings of mapreduce.reduce.memory.mb, irrespective of Java 8? "2.1 GB virtual memory used" is suspicious in the sense that it looks like vmem-pmem-ratio (2.1) * the default mapreduce.reduce.memory.mb (1024 MB); indeed, 2.1 × 1024 MB ≈ 2.1 GB. However, the reducer is declared to allocate -Xmx1536m for the Java heap alone, to say nothing of the JVM's total virtual footprint including the C heap.

          kamrul Mohammad Kamrul Islam added a comment -

          Sorry, Gera Shegalov, for the late reply. The failure was coming from a distcp command, which uses 1 GB as mapreduce.map.memory.mb. I think distcp is a map-only job.

          But in other cases we used a higher memory.mb (2 GB) and got a similar exception with a maximum of 4.2 GB of virtual memory (consistent with 2.1 × 2 GB = 4.2 GB).

          ajisakaa Akira Ajisaka added a comment -

          I just hit this issue.

          15/04/06 04:14:47 INFO mapreduce.Job: Task Id : attempt_1428293579539_0001_m_000003_0, Status : FAILED
          Container [pid=7847,containerID=container_1428293579539_0001_01_000005] is running beyond virtual memory limits. Current usage: 123.5 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
          
          • Change yarn.nodemanager.vmem-pmem-ratio to some higher value (4 or 5?)
          • Set yarn.nodemanager.vmem-check-enabled to false

          Either way is fine for me (see the yarn-site.xml sketch below).
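
          A sketch of the corresponding yarn-site.xml entries; the ratio value 4 is illustrative, and NodeManagers must be restarted to pick up either change:

          <!-- Option 1: raise the virtual-to-physical memory ratio -->
          <property>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>4</value>
          </property>

          <!-- Option 2: disable the virtual memory check entirely -->
          <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
          </property>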

          timyitong Yitong Zhou added a comment -

          Are yarn.nodemanager.vmem-pmem-ratio and yarn.nodemanager.vmem-check-enabled only supposed to take effect when set in yarn-site.xml, i.e. before a container is launched? I tried to set them via JobConf and did not see an effect. Mohammad's suggestion of tweaking the JVM opts worked, though.

          stevel@apache.org Steve Loughran added a comment -

          They are set in yarn-site.xml and read by the NodeManager: it's a cluster-wide policy.

          kdmalviyan Kuldeep Singh added a comment -

          Hey guys, I am facing the same issue with Hadoop 2.7.1 and Java 7. I tried all the solutions mentioned above but could not resolve the problem.

          ajisakaa Akira Ajisaka added a comment -

          This issue is specific to Java 8. If you are using Java 7 and seeing this message, the allocated memory really is too small for the task; you should increase the memory allocated to the container to fix it.
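
          For an MR job that means raising the container size and, if needed, the heap along with it; a sketch with illustrative values, set per job or in mapred-site.xml:

          <property>
            <name>mapreduce.map.memory.mb</name>
            <value>2048</value>
          </property>
          <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx1638m</value>
          </property>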

          ajisakaa Akira Ajisaka added a comment -

          Attaching a patch to change the default value of "yarn.nodemanager.vmem-check-enabled" to false. I suppose almost all Java 8 users are hitting this problem, so I'd like to change the default value in trunk/branch-2/branch-2.8.

          ozawa Tsuyoshi Ozawa added a comment -

          Akira Ajisaka, I don't think the workaround of turning the vmem check off is acceptable. It's an incompatible change, as described in YARN-2225.

          ozawa Tsuyoshi Ozawa added a comment -

          Because of the discussion on YARN-2225, canceling the patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 9s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          0 mvndep 0m 11s Maven dependency ordering for branch
          +1 mvninstall 7m 2s trunk passed
          +1 compile 2m 0s trunk passed with JDK v1.8.0_72
          +1 compile 2m 14s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 36s trunk passed
          +1 mvnsite 1m 1s trunk passed
          +1 mvneclipse 0m 25s trunk passed
          +1 findbugs 2m 28s trunk passed
          +1 javadoc 1m 10s trunk passed with JDK v1.8.0_72
          +1 javadoc 3m 40s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 11s Maven dependency ordering for patch
          +1 mvninstall 0m 51s the patch passed
          +1 compile 1m 57s the patch passed with JDK v1.8.0_72
          +1 javac 1m 57s the patch passed
          +1 compile 2m 13s the patch passed with JDK v1.7.0_95
          +1 javac 2m 13s the patch passed
          +1 checkstyle 0m 35s the patch passed
          +1 mvnsite 0m 58s the patch passed
          +1 mvneclipse 0m 22s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          +1 findbugs 2m 56s the patch passed
          +1 javadoc 1m 6s the patch passed with JDK v1.8.0_72
          +1 javadoc 3m 36s the patch passed with JDK v1.7.0_95
          +1 unit 0m 22s hadoop-yarn-api in the patch passed with JDK v1.8.0_72.
          +1 unit 1m 59s hadoop-yarn-common in the patch passed with JDK v1.8.0_72.
          +1 unit 0m 24s hadoop-yarn-api in the patch passed with JDK v1.7.0_95.
          +1 unit 2m 12s hadoop-yarn-common in the patch passed with JDK v1.7.0_95.
          +1 asflicense 0m 20s Patch does not generate ASF License warnings.
          42m 14s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12788024/HADOOP-11364.01.patch
          JIRA Issue HADOOP-11364
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml
          uname Linux 5b2096da8f9d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8ed07bd
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/8631/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/8631/console
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          ajisakaa Akira Ajisaka added a comment -

          It's an incompatible change, as described in YARN-2225.

          Incompatible changes can be made in trunk, though. I'll reopen YARN-2225 shortly.

          ajisakaa Akira Ajisaka added a comment -

          Thanks, Tsuyoshi, for the comment and for sharing the JIRA information.

          ajisakaa Akira Ajisaka added a comment -

          In YARN-2225, Tsuyoshi commented:

          IMHO, I prefer to make the default value of the vmem ratio larger. How about closing this issue and doing it in another JIRA (or moving HADOOP-11364 to a YARN issue), since the problem being addressed is different from this one?

          So I moved this issue to YARN. Let's make the default value of the vmem ratio larger in this issue.

          stevel@apache.org Steve Loughran added a comment -

          True, but it's still traumatic: cluster performance can seriously suffer. I also expect management tools and their Hadoop installations to restore the 2.6 value. Changing the ratio is the solution that would be viable in production with vmem checks enabled.

          krishnaChaitanya Krishna Chaitanya added a comment -

          Same issue with JDK 1.8.0_60. Changing yarn.nodemanager.vmem-pmem-ratio to a higher value (4) and setting yarn.nodemanager.vmem-check-enabled to false had no effect on container kills. With JDK 1.7.0_67, no issues.

          ozawa Tsuyoshi Ozawa added a comment -

          Hi Krishna, have you changed the configuration on all NodeManagers and restarted all of them?

          krishnaChaitanya Krishna Chaitanya added a comment -

          Yes, I did.


People

  • Assignee: kamrul Mohammad Kamrul Islam
  • Reporter: kamrul Mohammad Kamrul Islam
  • Votes: 3
  • Watchers: 33

Dates

  • Created:
  • Updated:

Development