Hadoop Map/Reduce / MAPREDUCE-177

Hadoop performance degrades significantly as more and more jobs complete

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When I ran the gridmix 2 benchmark load on a fresh cluster of 500 nodes with Hadoop trunk,
      the gridmix load, consisting of 202 map/reduce jobs of various sizes, completed in 32 minutes.
      Then I ran the same set of jobs on the same cluster; they completed in 43 minutes.
      When I ran them a third time, it took (almost) forever: the job tracker became non-responsive.

      The job tracker's heap size was set to 2GB.
      The cluster is configured to keep up to 500 jobs in memory.

      The job tracker kept one CPU busy all the time. It looked like this was due to GC.

      I believe releases 0.18 and 0.19 exhibit similar behavior.

      Attachments

      1. HADOOP-4766-v1.patch
        3 kB
        Amar Kamat
      2. HADOOP-4766-v2.10.patch
        22 kB
        Amar Kamat
      3. HADOOP-4766-v2.4.patch
        21 kB
        Amar Kamat
      4. HADOOP-4766-v2.6.patch
        21 kB
        Amar Kamat
      5. HADOOP-4766-v2.7.patch
        21 kB
        Amar Kamat
      6. HADOOP-4766-v2.7-0.18.patch
        16 kB
        Amar Kamat
      7. HADOOP-4766-v2.7-0.19.patch
        22 kB
        Amar Kamat
      8. HADOOP-4766-v2.8.patch
        23 kB
        Amar Kamat
      9. HADOOP-4766-v2.8-0.18.patch
        19 kB
        Amar Kamat
      10. HADOOP-4766-v2.8-0.19.patch
        24 kB
        Amar Kamat
      11. HADOOP-4766-v3.4-0.19.patch
        27 kB
        Amar Kamat
      12. map_scheduling_rate.txt
        4 kB
        Runping Qi

        Issue Links

          Activity

          Runping Qi added a comment -

          After setting the heap size of the job tracker to 3GB, the situation became even worse: the first set of gridmix 2 jobs did not finish in 4+ hours.

          Something is badly wrong.

          Devaraj Das added a comment -

          Marking it a blocker for the coming releases of the versions this bug was observed on.

          Amar Kamat added a comment -

          What is the total number of tasks/tips in one gridmix 2 benchmark load?

          Runping Qi added a comment -

          Around 100K mappers and a few thousand reducers.

          Devaraj Das added a comment -

          I believe releases 0.18 and 0.19 exhibit similar behavior.

          Runping, could you please clarify this? Did you actually run gridmix with both 0.18 and 0.19?

          The cluster is configured to keep up to 500 jobs in memory.

          This seems quite a large number. So when the second gridmix is run, the JobTracker has the entire set of jobs from the first gridmix, i.e., nearly 100K maps and thousands of reducers, right? Although the JobTracker should behave gracefully, I am wondering whether the memory pressure is too high. Again, was such a configuration (500 jobs in memory) ever used with gridmix earlier?

          After setting the heap size of the job tracker to 3GB, the situation became even worse: the first set of gridmix 2 jobs did not finish in 4+ hours.

          Agreed, this is really weird.

          Runping Qi added a comment -

          Ops ran multiple gridmix 2 loads on a cluster with Hadoop 0.18.2, and they observed the performance degradation too.

          I can run them against 0.18.2 and 0.19.0 as well to confirm.

          Runping Qi added a comment -

          This is the first time I have set 500 jobs in memory. I set that because
          I intended to compare the behavior of the two gridmix2 runs.

          If the number of tasks kept in memory is critical for the performance of the JobTracker (and thus for the whole cluster), then we should set a limit on that instead of the
          number of jobs, because the number of tasks per job can vary a lot.
          Also, we need to understand how the number of tasks kept in memory impacts performance.

          Runping Qi added a comment -

          Did one more test with the JT heap size set to 3GB (Hadoop trunk).
          It looks like the previously reported weird behavior is not reproducible; that is the good news.
          The results of this test are better than those obtained with the JT heap size set to 2GB.
          Here are the details:

          First run of gridmix2: 32 minutes
          Second run of gridmix2: 35 minutes
          Third run: 44 minutes
          Fourth run: JT non-responsive after the 650th job was submitted, maxing out one CPU.

          Even though I set the maximum number of jobs kept in memory to 500, I actually saw 590 jobs through the web UI.

          Christian Kunz added a comment -

          We had similar issues on a test cluster where we run a lot of small jobs, with the JobTracker running out of memory after a while.
          We have now started to run the JobTracker in 64-bit mode with a larger heap size; it is currently using 3.2GB of memory.

          Runping Qi added a comment -

          The attached file counts the number of mapper tasks the JT scheduled at each one-minute time point.
          The first spike corresponds to the first gridmix2 run,
          the second spike to the second run,
          the third spike to the third run,
          and the tail to the fourth run.

          Amar Kamat added a comment -

          If the number of tasks kept in memory is critical for the performance of the JobTracker (and thus for the whole cluster), then we should set a limit on that instead of the
          number of jobs, because the number of tasks per job can vary a lot.

          Yeah. It seems the performance is dependent on the number of tasks kept in memory. One way to test this would be to start a job on a jobtracker (~2GB heap) with 400K tasks (no-op map tasks). Initially you will see the jobtracker working fine, but later it will slow down and start losing heartbeats. I tried this on 0.17; the best I could get was roughly 200K-250K tasks, after which I started seeing this slowdown. I never tried with back-to-back jobs though.

          Also, we need to understand how the number of tasks kept in memory impacts performance.

          There was an effort made by Dhruba (HADOOP-4018) to cap the JobTracker's memory, but ultimately we ended up doing something else. HADOOP-2573 should also help solve this problem.

          Devaraj Das added a comment -

          I'd propose that we set the configuration for the number of completed jobs kept in memory to zero and see how the JobTracker performs. Though we all suspect the number of tasks kept in memory to be the cause of the problem, it will be good to confirm this.

          Amar Kamat added a comment -

          Ran 3 sleep jobs back to back on 496 nodes with the JobTracker's heap size set to 1.93GB, each job having 100K maps with a map sleep time of 1 sec. Here are the results:

          run-no  time taken  heap size
          1       498 secs    620MB
          2       518 secs    1.1GB
          3       535 secs    1.65GB

          Note that the jobtracker was configured to support 0 user jobs in memory (using the patch attached).

          It looks like there is a memory leak, though the job runtime varies only by roughly half a minute.

          Owen O'Malley added a comment -

          -1 to the current patch. I think we have to leave the job that is currently completing in memory. We want the final status messages to be handled correctly.

          Amar Kamat added a comment -

          Owen, the patch was to prove that there is a memory leak and also to allow others to reproduce the results. Let me know if this is not a good way to check/prove the memory leak.

          Robert Chansler added a comment -

          Any news on this? Should 18.3 really wait for this?

          Devaraj Das added a comment -

          Could this be caused by HADOOP-4797?

          Amar Kamat added a comment -

          I ran the same setup (see here) with the default config (i.e. mapred.jobtracker.completeuserjobs.maximum = 100) and found that the time-taken and heap-size values remain the same, which means the code for removing the job from memory (upon hitting the limit) is buggy.

          Devaraj Das added a comment -

          Can we run the same test with the patch for HADOOP-4797? Also, we should verify whether this issue shows up on the 0.18 branch.

          Amar Kamat added a comment (edited) -

          I tried running 5 sleep jobs with 100,000 maps (1 sec wait) on 200 nodes back to back. Here are the runtimes:

          run-no time
          1 25min 58sec
          2 26min 14sec
          3 26min 19sec
          4 25min 53sec

          Note that the total memory used after running 9 sleep jobs (100,000 maps with 1 sec wait) back to back (a few were killed) was ~384MB. Note that the cluster is configured to keep 0 jobs in memory (using the patch) and a 5 min tracker-expiry-interval. This experiment was to prove that eventually the job leaves the JobTracker's memory.

          Amar Kamat added a comment -

          Note that the total memory used after running 9 sleep jobs (100,000 maps with 1 sec wait) back to back (a few were killed) was ~384MB.

          I realized that while performing the above experiment I was constantly analyzing job history, which loads the parsed job history into memory.

          Here are the results for the same experiment on 200 nodes without any interference.

          run-no  memory before job run  job runtime
          1       9.74 MB                25.78 min
          2       71 MB                  25.58 min
          3       4.88 MB                25.63 min
          4       6.14 MB                25.60 min
          5       4.92 MB                25.63 min
          6       10.32 MB               25.63 min

          Even after running a few large (100,000 maps) jobs, the job tracker memory usage went as low as 3MB. It went up to a maximum of ~80MB. Note that I triggered GC in the ExpireLaunchingTasks thread.

          Some points to note:

          • I think the JobTracker should have a mechanism where it drops completed jobs whenever it suspects that it is running low on memory. There is no point in keeping 100 jobs per user and slowing down/killing the JT. One way to do this would be to drop completed jobs whenever the JT's memory (used memory) crosses x% of the maximum available memory; x can default to 75%. Completed jobs might be evicted based on their age (job finish time). This cleanup should continue until the JT's memory goes below the limit (a rough sketch follows this list).
          • Also, a job should only be accepted (expanded) once there is sufficient memory, i.e. within the usable memory (x * max_available_memory).
          • Job history analysis caches some job analysis results (see loadhistory.jsp). This might cause problems if large jobs are analyzed. I feel we should not cache job-history analysis results and should redo the analysis every time.
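
          As a rough illustration of the memory-threshold eviction described in the first point above (only a sketch with placeholder names, not the actual patch; the Runtime calls are the only real Java APIs used):

            import java.util.List;

            // Minimal sketch of the proposed eviction policy. "CompletedJob" and
            // "retire()" are placeholders, not the JobTracker's real types/methods.
            class MemoryBasedRetirePolicy {
              static final float LIMIT = 0.75f; // the proposed x% of the max heap

              static void maybeRetire(List<CompletedJob> completedJobs) {
                Runtime rt = Runtime.getRuntime();
                long used = rt.totalMemory() - rt.freeMemory();
                // Evict the oldest completed jobs (by finish time) until usage
                // drops back below the threshold.
                while (used > LIMIT * rt.maxMemory() && !completedJobs.isEmpty()) {
                  completedJobs.remove(0).retire();
                  used = rt.totalMemory() - rt.freeMemory();
                }
              }

              interface CompletedJob { void retire(); }
            }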
          Amar Kamat added a comment -

          Attaching a patch that cleans up completed jobs as and when the jobtracker runs low on memory. The jobtracker is assumed to be running low on memory if its memory usage crosses the predefined limit passed via mapred.jobtracker.maximum.usable.memory.percent.

          Ran 7 back-to-back (sleep) jobs of 100,000 tasks each on a jobtracker with a 2GB heap, and the results are as follows:

          no job-runtime
          1 17mins
          2 17mins
          3 18mins
          4 17mins
          5 17mins
          6 18mins
          7 17mins

          Also (manually) killed a few jobs to check that they are cleaned up.

          Result of test-patch is as follows

          [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 8 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          
          Amar Kamat added a comment -

          Ran 14 jobs back to back, each with 100,000 tasks; the runtimes are as follows:

          no job-runtime
          1 17mins
          2 17mins
          3 17mins
          4 18mins
          5 17mins
          6 17mins
          7 17mins
          8 17mins
          9 17mins
          10 17mins
          11 17mins
          12 17mins
          13 17mins
          14 17mins

          It shows that there is no performance degradation.

          Show
          Amar Kamat added a comment - Ran 14 jobs back to back each of 100,000 tasks and the runtimes are as follows no job-runtime 1 17mins 2 17mins 3 17mins 4 18mins 5 17mins 6 17mins 7 17mins 8 17mins 9 17mins 10 17mins 11 17mins 12 17mins 13 17mins 14 17mins It shows that there is no performance degradation.
          Hide
          Amar Kamat added a comment -

          Incorporated Devaraj's offline comments:

          • Consider only completed jobs while expiring, as opposed to all jobs. Only completed jobs are considered, and a comparator is defined to sort jobs based on their completion time.
          • Changed the info msg to indicate that the memory capping is only for completed jobs.
          • TestCompletedJobs now uses new APIs for job submission.

          Result of test-patch on my box

          [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 8 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          
          Amar Kamat added a comment -

          Incorporated Devaraj's offline comment. Made sure that if there are jobs to be retired, then all such jobs are retired irrespective of memory issues. Attaching patches for hadoop-0.19 and hadoop-0.18. The patch for trunk (HADOOP-4766-v2.7.patch) applies to hadoop-0.20 too.

          Amar Kamat added a comment -

          Attaching a patch incorporating Devaraj's offline comment. Devaraj pointed out that if the JobTracker is running low on memory and there are some jobs to retire, then the (old) patch will retire those jobs, ignore the memory issue, and go back to sleep; only in the next pass will it detect that other completed jobs need to be cleared. The current patch first tries to clear old jobs, and if the JobTracker is still running low on memory, more completed jobs are cleared until memory usage is under control.

          Devaraj Das added a comment -

          +1 (though I should note that the policy for purging a completed job could be tweaked to take the user into account, for example by purging the oldest completed jobs in a round-robin fashion on a per-user basis). But this patch is good for now.

          Arun C Murthy added a comment -

          Explicitly invoking the garbage collector is a bad idea in all except the most exceptional of cases... so this is the wrong direction. I'd rather have a much simpler model: if the JobTracker is low on memory (as defined by the newly introduced config), we just purge all completed jobs. They are available in JobHistory anyway.

          I am pushing this out of 0.18.* as this patch needs more work and isn't critical for 0.18.*, given that it isn't a very common problem...

          Sameer Al-Sakran added a comment -

          We're seeing this issue (or some other memory leak in the JobTracker) very frequently, to the point where we are restarting the jobtracker every 3-4 days.

          Amar Kamat added a comment -

          Result of running 10 back-to-back sleep jobs of 100,000 tasks each on a 400-node cluster with the latest patch (v2.8):

          no runtime
          1 1069 sec
          2 1069 sec
          3 1088 sec
          4 1123 sec
          5 1061 sec
          6 1072 sec
          7 1105 sec
          8 1060 sec
          9 1057 sec
          10 1122 sec
          Amar Kamat added a comment -

          I had a discussion with Vivek on this and we both feel that cleaning off everything makes more sense. Here is what we could do:

          • Clean up some X% of jobs and then check whether memory is under control. With this we avoid immediate (memory) cleanups and also avoid frequent calls to GC. X could be 25%.
          • Sort jobs by num-tasks instead of finish time, as the chances of freeing up memory will definitely be higher.

          We can do all these things as we have no contract regarding the web-UI display of completed jobs. The only drawback with these approaches is that we have to manually invoke GC, which might be problematic; there is no point in cleaning X% if there is no way to know whether memory usage is under control (which requires doing a GC). So for now I think it makes more sense to clean up everything and be sure that, after a single GC, the JobTracker will be safe. We can improve on that later if needed. Thoughts?

          Devaraj Das added a comment -

          Yeah, I too think purging all jobs keeps things simpler rather than relying on GC. +1

          Amar Kamat added a comment -

          The only catch with this approach is that from the time the RetireJobs thread detects that the JobTracker is running low on memory to the time the GC actually happens, all the jobs that complete will be cleaned up even if it is unnecessary. I think it is OK to state that "From the point the JobTracker detects that it is running low on memory to the point where it detects that it is safe, all the jobs that complete will be evicted".

          Amar Kamat added a comment -

          The result of running 20 sleep jobs of 100,000 map tasks back to back is as follows:

          no runtime
          1 1065 sec
          2 1075 sec
          3 1094 sec
          4 1070 sec
          5 1060 sec
          6 1074 sec
          7 1070 sec
          8 1051 sec
          9 1069 sec
          10 1050 sec
          11 1052 sec
          12 1057 sec
          13 1055 sec
          14 1057 sec
          15 1058 sec
          16 1053 sec
          17 1058 sec
          18 1053 sec
          19 1057 sec
          20 1053 sec
          Arun C Murthy added a comment -

          The patch looks good. +1.

          Minor nit: should we set the default value of mapred.jobtracker.maximum.usable.memory.percent to 0.75 or some such, rather than 1.0? Also, should we call it mapred.jobtracker.jobhistory.heap.limit? The current name might confuse users.

          Amar Kamat added a comment -

          should we set the default value of mapred.jobtracker.maximum.usable.memory.percent to 0.75 or some such, rather than 1.0?

          Earlier versions of the patch had the default set to 0.75, but then I realized that by default we should turn off this feature/code and let the admins decide what the right percentage should be. For example, for a heap size of 1GB, 750MB makes sense, but for a heap size of 8GB, 6GB is too little and perhaps 7.5GB, i.e. ~0.9, makes more sense.

          Also, should we call it mapred.jobtracker.jobhistory.heap.limit? The current name might confuse users.

          Actually the memory is the overall usage, i.e. JT + all jobs, and hence I named it usable memory. Let me know if it needs to be renamed.

          Koji Noguchi added a comment -

          Depending on the GC and the config you use, a JVM with a 1GB heap can easily use up all the heap space and then release 500MB with one full GC. Using runtime memory usage sounds a little tricky. Instead, can we set a threshold on the total number of tasks in memory?

          Sharad Agarwal added a comment -

          One altogether different approach could be to use soft references, which are good for memory-sensitive caches, and this case is pretty much the same.
          This way we won't need the retire thread or any parameter defining the maximum jobs/tasks in history; the JVM would take care of it on its own.

          The way it could work: when a job finishes, the job handle is removed from the data structure and a new SoftReference(job) handle is put back. When the JVM is short of memory, it will clear the soft references. In JobInProgress#finalize() we can do the cleanup work corresponding to that job, like clearing other references and calling removeJobTasks().
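
          A rough Java sketch of this idea (placeholder types; this is not how the JobTracker actually stores jobs):

            import java.lang.ref.SoftReference;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Completed jobs are held only via soft references, so the JVM may
            // reclaim them under memory pressure instead of a retire thread.
            class SoftJobCache {
              private final Map<String, SoftReference<Object>> completed =
                  new ConcurrentHashMap<String, SoftReference<Object>>();

              void jobCompleted(String jobId, Object jobInProgress) {
                completed.put(jobId, new SoftReference<Object>(jobInProgress));
              }

              // A lookup may find that the GC has already cleared the reference,
              // in which case the caller falls back to job history.
              Object getJob(String jobId) {
                SoftReference<Object> ref = completed.get(jobId);
                return ref == null ? null : ref.get();
              }
            }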

          Sharad Agarwal added a comment -

          OK, I looked into it in detail and discussed this with Devaraj. It seems that in finalize(), the lock on the JobTracker needs to be acquired to clean up the data structures. As finalize() is called by the GC thread, it would block the GC thread until it gets the lock on the JobTracker. The lock on the JobTracker is coarse-grained and we don't want to block the GC on it. So my earlier suggested approach of using finalize() won't work out.

          Amar Kamat added a comment -

          @koji:
          Using tasks as a unit of memory usage is very tricky. Ideally we would need a memory model that will help us derive memory requirements per task/tip/job etc. Until we have a memory model in place, I think it's better to go with the current solution, as we only care about the overall memory used.

          @Sharad:
          Using soft references might be a better solution and might work well. But I think it will be a major change in the framework and might be filed as an improvement. Since this issue is more of a bug fix, I think we should go ahead and use the current approach. Memory bottleneck with the JobTracker is separately tracked in HADOOP-4974.

          @Arun:
          I am waiting for your input on the comments made here.

          Devaraj Das added a comment -

          After having done some experiments and talking offline to some folks, I have begun to think that FOR NOW the better way to solve this issue is to keep track of the total number of tasks in memory (as Runping & Koji had suggested earlier). Things like freeMemory and totalMemory seem to be quite dependent on the GC in use. Also it is not guaranteed that the used memory will go down once some jobs are removed, since it is entirely up to the JVM when to actually do a full GC. Until that happens, we might end up in a situation where we purge every newly completed job even though memory is available (since the check for totalMemory - freeMemory in the patch might still show memory usage to be above the threshold). This would mean that users would start hitting the JobHistory for viewing jobs much more frequently. JobHistory file-loading/parsing is quite an expensive operation and should be avoided if possible.
          As a follow-up jira, we might move the JobHistory server to a completely different process outside the JobTracker, and always purge completed jobs immediately. That would keep the UI aspects for serving completed jobs completely outside the JobTracker and should really help the JobTracker overall.

          Amar Kamat added a comment -

          +1. I feel using num-tasks will provide a simple and scalable solution, but the only catch is how to set the max limit. This limit should scale with larger heap sizes. We could derive the memory required per task by benchmarking memory usage and then updating the estimate whenever the JobTracker undergoes major changes. Is there any nicer way of using the number of tasks?
          +1 for moving the JobHistory server out of the JobTracker JVM. I have opened HADOOP-4974 to track that.

          Devaraj Das added a comment -

          I'd say make the max limit configurable with a sane default value (analogous to mapred.jobtracker.completeuserjobs.maximum). Since this is a short-term fix, I'd say we don't invest time in deriving this value from the heap size, etc.

          Arun C Murthy added a comment -

          For a short-term fix, worrying about limiting tasks seems like overkill; we don't even know how much memory a TIP really needs. Deriving it is non-trivial, hence depending on it seems risky.

          A simple fix is to keep the current patch, i.e. remove all completed jobs when we hit the configured heap limit, and move job-history serving to a stand-alone daemon. Thoughts?

          Devaraj Das added a comment -

          The thing that worries me about the existing patch is that it is not at all predictable how many jobs/tasks would be there in memory at any point. In my experiments with this patch and a standalone program simulating the same behavior as what the patch is trying to do, I saw that even after purging all the jobs, the memory usage as per Runtime.totalMemory - Runtime.freeMemory didn't come down for quite a while and the thread was trying to free up memory needlessly (note that things like whether incremental GC is in use would also influence this behavior).
          The approach of basing things on keeping at most 'n' completed tasks in memory at least leads to much more predictability. True, we don't know the exact memory consumed by a TIP, but we can make a good estimate and tweak the value of the max tasks in memory if need be. Also, in the current patch, the configuration to do with the memory usage threshold is equally dependent on estimation; I am not sure what the threshold should be (0.75, 0.8, or 0.9?).
          Why do you say it is overkill? I thought basing things on estimating total memory usage is trickier. Basing it on the number of completed tasks seems very similar to the "number of completed jobs" limit that we currently have; it's just that we are stepping one level below and specifying a value for something whose base size is always going to remain in control. Also, completed jobs should be treated as one unit w.r.t. removal. For example, if the configured max tasks is 1000 and we have a job with 1100 tasks, the entire job should be removed (as opposed to removing only 1000 of the job's tasks), keeping this whole thing really simple.
          Again, this is a short term fix until we move to the model of having a separate History server process.
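
          A minimal sketch of this task-count policy (placeholder types; the real JobTracker bookkeeping differs): completed jobs are dropped whole, oldest first, whenever the total number of retained tasks exceeds the configured cap.

            import java.util.LinkedList;

            // Sketch of "at most n completed tasks in memory"; CompletedJob is a
            // placeholder, not the JobTracker's JobInProgress.
            class TaskCountRetirePolicy {
              static class CompletedJob {
                final int numTasks;
                CompletedJob(int numTasks) { this.numTasks = numTasks; }
                void retire() { /* drop tasks; users go to job history instead */ }
              }

              private final int maxCompletedTasks; // configurable cap
              private final LinkedList<CompletedJob> completed = new LinkedList<CompletedJob>();
              private int tasksInMemory = 0;

              TaskCountRetirePolicy(int maxCompletedTasks) {
                this.maxCompletedTasks = maxCompletedTasks;
              }

              void jobCompleted(CompletedJob job) {
                completed.addLast(job); // newest at the tail
                tasksInMemory += job.numTasks;
                // Remove whole jobs, oldest first, until we are under the cap;
                // a job is never partially removed.
                while (tasksInMemory > maxCompletedTasks && !completed.isEmpty()) {
                  CompletedJob oldest = completed.removeFirst();
                  tasksInMemory -= oldest.numTasks;
                  oldest.retire();
                }
              }
            }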

          Owen O'Malley added a comment -

          I think this is the wrong direction. I think that, long term, jobs should be retired instantly when they finish. We need to improve the job history browsing functionality and performance so that it becomes the only way users look at completed jobs.

          Possibilities:
          1. Use a Java applet to display the results in the client's browser.
          2. Use a command line tool to browse the completed job's information.
          3. Use a dedicated server for formatting the job history.

          Devaraj Das added a comment -

          Owen, that truly is the long-term solution and everyone agrees on that. The thing that we are trying to address in this issue is a stop-gap that attempts to reduce the number of history lookups, if possible. But I am okay with punting this issue and instead addressing the separate dedicated history server issue (option 3 above). Option 2 already exists, but only for history files on a filesystem that both the JT and the client can access (like HDFS).

          Amar Kamat added a comment -

          3. Use a dedicated server for formatting the job history.

          HADOOP-5083 has been opened to address this.

          Arun C Murthy added a comment -

          So, should we just retire jobs immediately on completion via HADOOP-5083 itself and mark this jira as 'INVALID' or 'DUPLICATE'?

          Devaraj Das added a comment -

          So, should we just retire jobs immediately on completion via HADOOP-5083 itself and mark this jira as 'INVALID' or 'DUPLICATE'?

          +1

          dhruba borthakur added a comment -

          It would be really nice if we could have a stop-gap solution for 0.19. Maybe a very simple fix would be to make the JT retire most completed jobs when memory usage exceeds a certain threshold. This would at least give the JT a chance to recover from high memory pressure (instead of just hanging).

          Amar Kamat added a comment -

          Attaching a patch that ports the latest patch to 0.19. This patch removes completed jobs as soon as the JobTracker detects that it is running low on memory. Testing in progress.

          Allen Wittenauer added a comment -

          What is the latest status of this patch? It doesn't appear to be committed or, heck, even resolved as to how the fix is going to be applied.

          Evan Pollan added a comment -

          This defect affects me after just 10-15 iterations of a daily job that has on the order of 10K mappers and a thousand or so reducers. This is cropping up using 0.20.2 (from CDH2U3). This seems like a pretty serious problem affecting the longevity of the job tracker. Is there a reason a fix hasn't been committed and released?


            People

            • Assignee: Ioannis Koltsidas
            • Reporter: Runping Qi
            • Votes: 3
            • Watchers: 29
