  1. Hive
  2. HIVE-7292 Hive on Spark
  3. HIVE-10989

HoS can't control number of map tasks for runtime skew join [Spark Branch]

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch, 1.3.0, 2.0.0
    • Component/s: Spark
    • Labels: None

    Description

      The flags hive.skewjoin.mapjoin.map.tasks and hive.skewjoin.mapjoin.min.split control the number of map tasks for the map join in a runtime skew join. They work well for MR but have no effect on Spark.
      This makes runtime skew join less useful on Spark: we just end up with slow mappers instead of slow reducers.
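
      As a rough illustration of how a requested map task count and a minimum split size can interact, here is a minimal sketch modeled on the classic Hadoop FileInputFormat rule (aim for total size divided by the task count, cap at the block size, never go below the minimum). The class name, method names, and sizes are hypothetical; this is not the code from the patch, and Hive's actual input format may compute splits differently.

```java
// Hypothetical sketch: how a task-count hint and a minimum split size
// could determine the number of map-side splits. Not Hive's actual code.
class SplitSizeSketch {

    // Aim for totalSize / requestedMapTasks, cap at the block size,
    // but never go below the configured minimum split size.
    static long splitSize(long totalSize, int requestedMapTasks,
                          long minSplit, long blockSize) {
        long goalSize = totalSize / Math.max(1, requestedMapTasks);
        return Math.max(minSplit, Math.min(goalSize, blockSize));
    }

    // Ceiling division: number of splits needed to cover the input.
    static long numSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 192 * mb;    // assumed size of the skewed-key data
        long block = 128 * mb;    // assumed block size
        long minSplit = 32 * mb;  // illustrative minimum split size

        for (int tasks : new int[] {2, 6, 100}) {
            long size = splitSize(total, tasks, minSplit, block);
            System.out.println("requested=" + tasks
                + " splitSize=" + size
                + " splits=" + numSplits(total, size));
        }
    }
}
```

      In this sketch the task count acts as a hint and the minimum split size as a floor: requesting 100 tasks for 192 MB of input still yields only 6 splits, because no split may be smaller than 32 MB.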

        Activity

          lirui Rui Li added a comment -

          The flags were properly set in the MapWork. We just need to create the RDD accordingly.

          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12739538/HIVE-10989.1-spark.patch

          ERROR: -1 due to 2 failed/errored test(s), 7567 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestCliDriver.initializationError
          org.apache.hive.jdbc.TestSSL.testSSLConnectionWithProperty
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/878/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/878/console
          Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-878/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12739538 - PreCommit-HIVE-SPARK-Build

          lirui Rui Li added a comment -

          Failed tests are not related.

          xuefuz Xuefu Zhang added a comment -

          lirui, Thanks for working on this. Changes look good except that I don't quite understand the 3rd part of the change. Could you please explain? Thanks.

          lirui Rui Li added a comment -

          Hi xuefuz, these flags should only be set for the MapWork that handles the big table, i.e. in this case the skewed data. Previously, we set the flags for all the MapWorks, including those for the small tables. This was copied from MR, where there's only one MapWork for the big table, and small tables are processed in MapredLocalWork. So the 3rd part brings our implementation in line with the MR version.

          Also, some performance data in case you want to know: I tested joining the skewed data using 6 mappers (configured) vs 2 mappers (default), and the run time was 31s vs 43s. The improvement should be more obvious on bigger data.

          xuefuz Xuefu Zhang added a comment -

          Makes sense. +1

          lirui Rui Li added a comment -

          Committed to spark branch. Thanks Xuefu for the review.

          xuefuz Xuefu Zhang added a comment -

          Merged to master and branch-1.


          People

            Assignee: Rui Li (lirui)
            Reporter: Rui Li (lirui)
            Votes: 0
            Watchers: 3
