Hadoop Map/Reduce
MAPREDUCE-3402 (sub-task of MAPREDUCE-3561, "[Umbrella ticket] Performance issues in YARN+MR")

AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.1
    • Component/s: applicationmaster, mrv2
    • Labels: None

      Description

      The world was rosier before October 19-25, Karam Singh says.

      The 100K 1-second sleep job used to take around 800 secs, i.e. 13-14 mins. It now runs for 45 mins and still manages to complete only about 45K tasks.

      One/more of the flurry of commits for 0.23.0 deserve(s) the blame.
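
      For reference, a sleep job of roughly this shape is usually launched from the jobclient tests jar via the "sleep" example program; the jar name below is a placeholder that depends on the build, and the reducer settings are illustrative:

        # ~100K map tasks sleeping 1 second each (jar path/version is a placeholder)
        hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar sleep \
            -m 100000 -mt 1000 -r 1 -rt 1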

        Activity

        Vinod Kumar Vavilapalli added a comment -

        Fixed after MAPREDUCE-3511.

        Vinod Kumar Vavilapalli added a comment -

        Karam Singh had been extremely helpful in running various tests to hunt this down. And we finally got some results after a couple of weeks of hard work.

        Turns out that most of the issues are because we made a switch from 32-bit JVMs to 64-bit. Using compressed references dramatically increased the AM's speed, and the job finishes in around 30-35 mins. That is still a regression, but at least the job finishes after enabling the compressed-oops setting and/or changing the JVM back to 32-bit.

        Giving more heap to the 32-bit JVM, around 3GB, helps finish the job in around 7-8 mins. But that isn't something we want to do for all jobs. The fact that extra heap brings back the original speed definitely means the AM is wasting time in GCs. Some of the observations Sid made above may hint at the root culprit.
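
        A rough sketch of the two workarounds above, i.e. compressed oops on the 64-bit AM JVM and a bigger AM heap. The property names below are the usual MRv2 AM knobs and should be treated as assumptions for this particular build; keep -Xmx below the AM container size:

          # enable compressed references and give the AM ~3GB of heap in a ~3.5GB container
          hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar sleep \
              -Dyarn.app.mapreduce.am.resource.mb=3584 \
              -Dyarn.app.mapreduce.am.command-opts="-Xmx3072m -XX:+UseCompressedOops" \
              -m 100000 -mt 1000 -r 1 -rt 1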

        Will file separate tickets to fix the inefficiencies.

        Siddharth Seth added a comment -

        Possibly different from Vinod's leads. With some changes to the environment, and maybe as a result of a few more commits, the job does complete.
        A couple of observations:

        • The first tens of thousands of maps finish pretty fast.
        • GC kicks in midway through the job and can't reclaim much. Spends several cycles where nothing is reclaimed before managing to reclaim a small amount.
        • Counters are taking up a good amount of heap.
        • JobHistory writes cannot keep up.
        • Bumping up the AM heapsize does help.

        This doesn't explain why the performance was better pre-Oct 19, though. Opening and linking 2 JIRAs (non-blockers, since increasing the heap works well) for possible changes to counters and JobHistory.
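
        To confirm the pattern of back-to-back collections that reclaim almost nothing, GC logging can be switched on for the AM; the flags are standard HotSpot options of this era, and the AM JVM-options property name is again an assumption for this build:

          # verbose GC on the AM: shows collection frequency and how much each cycle reclaims
          -Dyarn.app.mapreduce.am.command-opts="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"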

        Vinod Kumar Vavilapalli added a comment -

        I got quite a few leads. Multiple issues in play.

        Still debugging with some raw patches.

        Vinod Kumar Vavilapalli added a comment -

        Independent invention! I was so into debugging that I didn't check the JIRA posts. Yes, I am just using the same benchmark, reproduced many of the oddities with 100K maps, and was extolling you along the way for the benchmark.

        Playing with heap-dumps and profilers on this benchmark now.
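
        For anyone repeating this, a heap dump of the running AM can be grabbed with the stock JDK tools; the process lookup and output path below are placeholders:

          # locate the AM JVM, then dump its live heap for offline analysis
          jps -lm | grep MRAppMaster
          jmap -dump:live,format=b,file=/tmp/am-heap.hprof <AM_PID>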

        Sharad Agarwal added a comment -

        Just FYI: org.apache.hadoop.mapreduce.v2.app.MRAppBenchmark can be used to benchmark the AM, mainly for memory usage, job latencies and state-machine transitions. However, it doesn't capture remoting/RPC issues, as it doesn't run on a real cluster.
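
        A quick way to run that benchmark is through the module's unit tests; the module path below is an assumption about the source-tree layout:

          # from a source checkout, run only the MRAppBenchmark class
          cd hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          mvn test -Dtest=MRAppBenchmark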


          People

          • Assignee: Vinod Kumar Vavilapalli
          • Reporter: Vinod Kumar Vavilapalli
          • Votes: 0
          • Watchers: 5
