Hadoop Common / HADOOP-7154

Should set MALLOC_ARENA_MAX in hadoop-config.sh

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 1.0.4, 0.22.0
    • Component/s: scripts
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we've seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We've observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.

      Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance - we've been recommending MALLOC_ARENA_MAX=4. We should set this in hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.
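      As a rough illustration (a sketch, not necessarily the exact patch attached below), the fix amounts to exporting the variable from the launcher script unless the user has already set it:

      # Sketch: respect an existing user setting, otherwise cap the arena count.
      export MALLOC_ARENA_MAX=${MALLOC_ARENA_MAX:-4}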

      Attachments

      1. hadoop-7154.txt (0.5 kB) - Todd Lipcon


          Activity

          Todd Lipcon added a comment -

          We should also consider adding a "ulimit -u <some high number>" automatically for RHEL6 users, since RHEL6 sets the soft max thread limit to 1024 by default (see red hat bug: https://bugzilla.redhat.com/show_bug.cgi?id=432903)

          Without this option set, I find I get "Unable to create native thread" errors under any heavy MR load.

          Clearly we need to wrap this stuff so that it won't cause problems on non-RHEL6 operating systems.
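
          A minimal sketch of the kind of guard this would need (the OS check and the limit value here are illustrative, not taken from any attached patch):

          # Hypothetical guard: only raise the limit where `ulimit -u` works and
          # the soft limit is low (RHEL6 defaults to 1024 threads/processes per user).
          if [ "$(uname -s)" = "Linux" ] && ulimit -u >/dev/null 2>&1; then
            if [ "$(ulimit -u)" != "unlimited" ] && [ "$(ulimit -u)" -lt 16384 ]; then
              ulimit -u 16384 2>/dev/null || true
            fi
          fi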

          Todd Lipcon added a comment -

          Attached patch only does the MALLOC_ARENA_MAX bit. The ulimit fix is tougher since it appears to not be very portable.

          Tom White added a comment -

          +1

          Todd Lipcon added a comment -

          Committed to trunk and branch-22. Thanks Tom for reviewing.

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #625 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk/625/)
          HADOOP-7154. Should set MALLOC_ARENA_MAX in hadoop-env.sh. Contributed by Todd Lipcon.

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk-Commit/523/)

          Hudson added a comment -

          Integrated in Hadoop-Common-22-branch #32 (See https://hudson.apache.org/hudson/job/Hadoop-Common-22-branch/32/)

          Matt Foley added a comment -

          Committed to branch-1.1, where it is also very important.
          Merged to branch-1, and branch-1.0.
          Thanks, Todd!

          Matt Foley added a comment -

          changed title to reflect the change actually done in the patch.

          Eric Charles added a comment -

          For the record, the memory issue also arises on Ubuntu 12 64-bit.
          'export MALLOC_ARENA_MAX=4' also fixes it on Ubuntu 12.

          Andy Isaacson added a comment -

          I was very confused by this discussion and dug into it a bit more; here's what I learned. The takeaway is, ARENA_MAX=4 is a win for Java apps.

          1. Java doesn't use malloc() for object allocations; instead it uses its own directly mmap()ed arenas.
          2. However, a few things such as direct ByteBuffers do end up calling malloc on arbitrary threads. There's not much thread locality in the use of such buffers.

          As a result, the glibc arena allocator is using a lot of VSS to optimize a codepath that's not very hot. So decreasing the number of arenas is a win, overall, even though it will increase contention (the malloc arena locks are pretty cold so this doesn't matter much) and potentially increase cache churn. But fewer arenas should decrease total cache footprint by increasing reuse.
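
          For anyone who wants to see this directly, a quick heuristic is to count the ~64 MB anonymous mappings of a running JVM (pmap is standard; the process name and the size window in the awk filter are assumptions):

          # Count arena-sized mappings in a running process (e.g. a DataNode).
          # glibc arenas show up as ~64 MB anonymous regions, often split into a
          # small committed part plus a large reserved part.
          PID=$(pgrep -f DataNode | head -n1)
          pmap -x "$PID" | awk '$2 >= 60000 && $2 <= 66000 { count++ } END { print count+0, "arena-sized mappings" }'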

          Eric Charles added a comment -

          I wonder if MALLOC_ARENA_MAX="4" fixes everything.
          More tests on my Ubuntu 12 laptop dev env (with export MALLOC_ARENA_MAX=4 in .bashrc):

          [1] (Pi Estimator) and [2] (RandomWriter) complete successfully.

          [3] DistributedShell (see 'is running beyond virtual memory limits' message):

          12/07/25 14:23:15 INFO distributedshell.Client: Got application report from ASM for, appId=3, clientToken=null, appDiagnostics=Application application_1343218758332_0003 failed 1 times due to AM Container for appattempt_1343218758332_0003_000001 exited with exitCode: 143 due to: Container [pid=31510,containerID=container_1343218758332_0003_01_000001] is running beyond virtual memory limits. Current usage: 82.8mb of 128.0mb physical memory used; 873.6mb of 268.8mb virtual memory used. Killing container.
          Dump of the process-tree for container_1343218758332_0003_01_000001 :

          • PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
          • 31510 28157 31510 31510 (bash) 0 1 17031168 369 /bin/bash -c /d/opt/jdk1.6.0_31/bin/java -Xmx128m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_memory 10 --num_containers 1 --priority 0 --shell_command ls --shell_args / 1>/data/hadoop-3.0.0-SNAPSHOT-yarn-log-dirs/application_1343218758332_0003/container_1343218758332_0003_01_000001/AppMaster.stdout 2>/data/hadoop-3.0.0-SNAPSHOT-yarn-log-dirs/application_1343218758332_0003/container_1343218758332_0003_01_000001/AppMaster.stderr
          • 31514 31510 31510 31510 (java) 139 10 899014656 20830 /d/opt/jdk1.6.0_31/bin/java -Xmx128m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_memory 10 --num_containers 1 --priority 0 --shell_command ls --shell_args /
            .Failing this attempt.. Failing the application., appMasterHost=, appQueue=default, appMasterRpcPort=0, appStartTime=1343218992022, yarnAppState=FAILED, distributedFinalState=FAILED, appTrackingUrl=, appUser=echarles

          Looking at the Yarn tmp files, I see that MALLOC_ARENA_MAX=4 is indeed exported:
          more ./usercache/echarles/appcache/application_1343218758332_0003/container_1343218758332_0003_01_000001/launch_container.sh | grep MALLOC
          export MALLOC_ARENA_MAX="4"

          Thx, Eric

          [1] hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 100 100
          [2] hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar randomwriter randomwriter-out
          [3] hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-*.jar org.apache.hadoop.yarn.applications.distributedshell.Client -jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-3.0.0-SNAPSHOT.jar -shell_command pwd -num_containers 1

          Matt Foley added a comment -

          Closed upon release of Hadoop-1.0.4.

          Ben Roling added a comment -

          Todd Lipcon - I know this bug is pretty old, but do you mind doing me the favor of explaining this statement:

          We've observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.

          Perhaps I am revealing too much of my naivety, but neither the issues the vmem size presents nor the reasons for them are obvious to me. The reason I ask is not directly related to this JIRA, nor even to Hadoop. I am just trying to learn more about the glibc change and its potential impacts. I've noticed high virtual memory size in another Java-based application (a Zabbix agent process, if you care) and I'm struggling slightly to decide whether I should worry about it. http://journal.siddhesh.in/posts/malloc-per-thread-arenas-in-glibc.html presents what appears to me to be a rational explanation of why the virtual memory size shouldn't matter too much.

          I could push on Zabbix to implement a change to set MALLOC_ARENA_MAX and I feel relatively confident the change wouldn't hurt anything but I'm not sure it would actually help anything either. The Zabbix agent appears to be performing fine and the only reason I noticed the high vmem size was because someone pointed me to this JIRA and I did an audit looking for processes with virtual memory sizes that looked suspicious.

          I guess the biggest problem I have with the effect the glibc change has on reported vmem size is that it seems to make vmem size meaningless: previously you could get some idea of what a process was doing from its vmem size. But your comment suggests maybe there are other things I should be concerned about as well. If you could share those with me I would greatly appreciate it, and perhaps others will benefit as well.

          Thanks!

          Ben Roling added a comment -

          Ok, so after further consideration I think my last comment/question was probably somewhat silly. I think the problems the high vmem sizes present to Hadoop are probably obvious to many as Todd originally suggested. I feel sort of dumb for not realizing more quickly.

          MapReduce (and YARN) monitor virtual memory sizes of task processes and kill them when they get too big. For example, mapreduce.map.memory.mb controls the max virtual memory size of a map task. Without MALLOC_ARENA_MAX this would be broken since tasks would have super-inflated vmem sizes.

          Todd Lipcon - do I have that about right? Are there other types of problems you were noticing?

          Basically it seems any piece of software that tries to make decisions based on process vmem size is going to be messed up by the glibc change and likely has to implement MALLOC_ARENA_MAX. For some reason the fact that Hadoop was making such decisions was escaping me when I made my last comment.
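
          As an illustration of what that monitoring is based on, the numbers come from procfs; roughly (Linux only, and the PID here is a hypothetical container/task process):

          # Inspect the figures a vmem-based check relies on.
          PID=12345   # hypothetical container/task PID
          grep -E '^(VmSize|VmRSS)' "/proc/${PID}/status"
          # With many glibc arenas, VmSize can be tens of times larger than VmRSS,
          # which is exactly what makes a virtual-memory limit fire spuriously.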

          Todd Lipcon added a comment -

          Ben Roling yep, you got it. The issues are specific to how Hadoop monitors its tasks, etc. Using lots of vmem on a 64-bit system isn't problematic in and of itself.

          Ben Roling added a comment -

          Thanks for the confirmation Todd!

          Lari Hotari added a comment -

          There might be other environment settings that should be tuned besides MALLOC_ARENA_MAX.

          The mallopt man page ("man mallopt") contains an important notice about dynamic mmap threshold in glibc malloc.

          Note: Nowadays, glibc uses a dynamic mmap threshold by
          default. The initial value of the threshold is 128*1024, but
          when blocks larger than the current threshold and less than or
          equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold
          is adjusted upward to the size of the freed block. When
          dynamic mmap thresholding is in effect, the threshold for
          trimming the heap is also dynamically adjusted to be twice the
          dynamic mmap threshold. Dynamic adjustment of the mmap
          threshold is disabled if any of the M_TRIM_THRESHOLD,
          M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.

          More information: https://sourceware.org/bugzilla/show_bug.cgi?id=11044 .

          I'd suggest disabling the dynamic mmap threshold by setting these environment variables (besides MALLOC_ARENA_MAX):

          # tune glibc memory allocation, optimize for low fragmentation
          # limit the number of arenas
          export MALLOC_ARENA_MAX=4
          # disable dynamic mmap threshold, see M_MMAP_THRESHOLD in "man mallopt"
          export MALLOC_MMAP_THRESHOLD_=131072
          export MALLOC_TRIM_THRESHOLD_=131072
          export MALLOC_TOP_PAD_=131072
          export MALLOC_MMAP_MAX_=65536
          

          That would prevent memory fragmentation by continuing to use mmap for allocations over 128K (the default behaviour before the dynamic mmap threshold was introduced).

          Lari Hotari added a comment -

          A note about MALLOC_ARENA_MAX:
          MALLOC_ARENA_MAX is broken on glibc < 2.15 (like Ubuntu 10.04). The fix was made for 2.16 and backported to 2.15. MALLOC_ARENA_MAX doesn't work on Ubuntu 10.04 because of this bug.
          The same bug seems to be reported to Red Hat as https://bugzilla.redhat.com/show_bug.cgi?id=799327 . Other reports: https://sourceware.org/bugzilla/show_bug.cgi?id=13137 , https://sourceware.org/bugzilla/show_bug.cgi?id=13754 , https://sourceware.org/bugzilla/show_bug.cgi?id=11261 .
          This is the commit to glibc fixing the bug: https://github.com/bminor/glibc/commit/41b81892f11fe1353123e892158b53de73863d62 (the backport for 2.15 is https://github.com/bminor/glibc/commit/7cf8e20d03a43b1375e90d381a16caa2686e4fdf ).
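
          If you need to account for this, a rough guard before relying on MALLOC_ARENA_MAX might look like the following sketch (the 2.15 cutoff comes from the comment above; the version parsing is illustrative):

          # Sketch: warn when the running glibc predates the MALLOC_ARENA_MAX fix.
          GLIBC_VERSION=$(getconf GNU_LIBC_VERSION | awk '{print $2}')
          MAJOR=${GLIBC_VERSION%%.*}
          MINOR=${GLIBC_VERSION#*.}; MINOR=${MINOR%%.*}
          if [ "$MAJOR" -lt 2 ] || { [ "$MAJOR" -eq 2 ] && [ "$MINOR" -lt 15 ]; }; then
            echo "warning: glibc $GLIBC_VERSION may ignore MALLOC_ARENA_MAX" >&2
          fi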


            People

            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 21
