Hadoop Map/Reduce
MAPREDUCE-3768

MR-2450 introduced a significant performance regression (Hive)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.1
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels: None

      Description

      MAPREDUCE-2450 introduced, or at least triggered, a significant performance regression in Hive. With MR-2450 in place, the execution time of TestCliDriver.skewjoin goes from 2 minutes to 15 minutes; reverting the change from the build fixes the issue.

      Here's the relevant query:

      FROM src src1 JOIN src src2 ON (src1.key = src2.key)
      INSERT OVERWRITE TABLE dest_j1 SELECT src1.key, src2.value; 
      

      You can reproduce this by running the following from Hive 0.8.0 against Hadoop built from branch-0.23.

      ant very-clean package test -Dtestcase=TestCliDriver -Dqfile=skewjoin.q
      

        Activity

        Vinod Kumar Vavilapalli added a comment -

        Also, given MAPREDUCE-3809 is in, we believe this is fixed. Hence reducing the priority; please reprioritize if it isn't fixed yet. Thanks.

        Amol Kekre added a comment -

        Eli,
        Please test and confirm as we do not have cycles now.

        Siddharth Seth added a comment -

        The patch essentially gets rid of up to 3 seconds per task. Looking at the trace, it seems related. OTOH, the runtime difference was 2 minutes versus 15 minutes; that 13-minute gap works out to roughly 260 serialized 3-second waits, so unless this Hive test runs a large number of jobs / task waves, this patch probably won't fix the issue. Have opened a separate JIRA for the 3-second wait: MAPREDUCE-3809.
        Haven't had a chance to run the Hive test. Don't have the cycles right now to understand how the Hive build system deals with 0.23 and local jars.

        Eli Collins added a comment -

        I don't have cycles right now, but you can reproduce with the above Hive command w/ and w/o your patch.

        Siddharth Seth added a comment -

        Raw patch. Eli, could you please try this out with Hive?
        The patch interrupts the TaskReporter sleep.
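
        For context, here is a minimal sketch of that pattern (illustrative names, not the actual Task.java code, so treat the details as assumptions): the stop path interrupts the reporter thread's sleep so it re-checks its done flag immediately instead of waiting out the rest of the progress interval.

        import java.util.concurrent.atomic.AtomicBoolean;

        public class InterruptibleReporter implements Runnable {
            // Illustrative interval; the real TaskReporter has its own constant.
            private static final long PROGRESS_INTERVAL_MS = 3000;
            private final AtomicBoolean done = new AtomicBoolean(false);
            private final Thread thread = new Thread(this, "reporter");

            public void start() {
                thread.setDaemon(true);
                thread.start();
            }

            @Override
            public void run() {
                while (!done.get()) {
                    try {
                        Thread.sleep(PROGRESS_INTERVAL_MS); // heartbeat pacing
                    } catch (InterruptedException e) {
                        // Interrupted by stop(): fall through and re-check the flag.
                    }
                    if (!done.get()) {
                        // ... send a progress/heartbeat update here ...
                    }
                }
            }

            public void stop() throws InterruptedException {
                done.set(true);
                thread.interrupt(); // wake the sleeper immediately
                thread.join();      // returns promptly instead of after up to
                                    // PROGRESS_INTERVAL_MS more milliseconds
            }
        }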

        Arun C Murthy added a comment -

        Thanks, I'll take a look.

        Eli Collins added a comment -

        Here's a jstack taken during test execution. It's possible this is a bug in Hive, but more likely an issue of MR-2450 interacting with the LocalJobRunner.

        "LocalJobRunner Map Task Executor #0" prio=5 tid=102804800 nid=0x10ef1e000 in Object.wait() [10ef1d000]
           java.lang.Thread.State: WAITING (on object monitor)
        	at java.lang.Object.wait(Native Method)
        	- waiting on <7bd6c6890> (a java.lang.Object)
        	at java.lang.Object.wait(Object.java:485)
        	at org.apache.hadoop.mapred.Task$TaskReporter.stopCommunicationThread(Task.java:726)
        	- locked <7bd6c6890> (a java.lang.Object)
        	at org.apache.hadoop.mapred.Task.done(Task.java:935)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:232)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        	at java.lang.Thread.run(Thread.java:680)
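
        The shape of that stall can be reproduced outside Hadoop. Below is a small, self-contained demo (hypothetical names, only loosely modeled on the trace above): the stopper takes the monitor and wait()s until the communication thread finishes its current, uninterruptible sleep cycle, so stopping can cost up to a full interval per task.

        import java.util.concurrent.atomic.AtomicBoolean;

        public class StopStallDemo {
            private static final Object lock = new Object();
            private static final AtomicBoolean done = new AtomicBoolean(false);

            public static void main(String[] args) throws InterruptedException {
                Thread comm = new Thread(() -> {
                    while (!done.get()) {
                        try {
                            Thread.sleep(3000); // progress interval, never interrupted
                        } catch (InterruptedException ignored) {
                        }
                    }
                    synchronized (lock) {
                        lock.notifyAll();       // finally wake the stopper
                    }
                }, "comm");
                comm.start();

                Thread.sleep(100);              // let comm enter its sleep

                long t0 = System.nanoTime();
                synchronized (lock) {
                    done.set(true);             // ask comm to stop...
                    lock.wait();                // ...and block, like the WAITING
                }                               // frame in the jstack above
                System.out.printf("stop took %d ms%n",
                        (System.nanoTime() - t0) / 1_000_000); // close to 3000
            }
        }

        Interrupting comm before waiting, as the raw patch above does for the TaskReporter, would collapse that delay to near zero.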
        
        Arun C Murthy added a comment -

        Eli, the question was: did you look into the root cause? Yes, it could have caused a regression or just exposed a bug in Hive; I've seen several examples of both.

        Eli Collins added a comment -

        I don't. This is a pretty straightforward query, and it runs 7x faster both on MR1 and on the same MR2 build w/o MR-2450, so I suspect MR-2450 had an unintended side effect. Making a thousand jobs faster doesn't imply it didn't introduce a new bug.

        Arun C Murthy added a comment -

        Eli - Do we know why Hive behaves this way? We've seen significant gains, with literally thousands of jobs, with MAPREDUCE-2450. Maybe something odd with Hive?


          People

          • Assignee: Unassigned
          • Reporter: Eli Collins
          • Votes: 0
          • Watchers: 11
