[HIVE-10989] HoS can't control number of map tasks for runtime skew join [Spark Branch] - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: spark-branch, 1.3.0, 2.0.0
Component/s: Spark
Labels:
None

Description

The flags hive.skewjoin.mapjoin.map.tasks and hive.skewjoin.mapjoin.min.split control the number of map tasks for the map join of a runtime skew join. They work well for MR but have no effect on Spark.
This makes runtime skew join less useful on Spark: the skewed keys are moved out of the reducers, but we just end up with slow mappers instead of slow reducers.
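
To illustrate why these two flags matter, here is a rough sketch of how a requested map-task count and a minimum split size could translate into a number of input splits. It follows the sizing rule of Hadoop's old-API FileInputFormat (split size = max(min size, min(goal size, block size))) as a simplified model: it ignores file boundaries and the split slop factor, and it is not the code from the patch. The function name and the input sizes are assumptions chosen for illustration.

```python
import math

MIB = 1024 * 1024


def estimate_num_splits(total_bytes, requested_map_tasks, min_split_bytes, block_bytes):
    """Estimate the split count with old-API FileInputFormat-style sizing.

    Simplified model: treats the input as one large file and ignores slop.
    """
    # Goal size: how large each split would be if the task-count hint were met exactly.
    goal_size = total_bytes // max(requested_map_tasks, 1)
    # The split size is capped by the block size and floored by the minimum split size.
    split_size = max(min_split_bytes, min(goal_size, block_bytes))
    split_size = max(split_size, 1)  # guard against a zero-sized split
    return math.ceil(total_bytes / split_size)


# Assumed example: 1 GiB of skewed rows, 128 MiB blocks,
# a large task-count hint (10000) and a 32 MiB minimum split.
many_tasks_hint = estimate_num_splits(1024 * MIB, 10000, 32 * MIB, 128 * MIB)

# Assumed example: 600 MiB of skewed rows, a hint of 6 tasks, no effective minimum.
six_tasks_hint = estimate_num_splits(600 * MIB, 6, 1, 128 * MIB)
```

In this model, a large task-count hint pushes the split size down to the minimum split size, so the minimum split acts as the real limit on parallelism. If the hints are ignored, as reported here for Spark, the split count falls back to whatever the input format picks by default.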

Attachments

HIVE-10989.1-spark.patch
15/Jun/15 02:41
4 kB
Rui Li

Activity

Rui Li added a comment - 15/Jun/15 02:41

The flags were properly set in the MapWork. We just need to create the RDD accordingly.

Hive QA added a comment - 15/Jun/15 04:15

Overall: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12739538/HIVE-10989.1-spark.patch

ERROR: -1 due to 2 failed/errored test(s), 7567 tests executed
Failed tests:

org.apache.hadoop.hive.cli.TestCliDriver.initializationError
org.apache.hive.jdbc.TestSSL.testSSLConnectionWithProperty

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/878/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/878/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-878/

Messages:

Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed

This message is automatically generated.

ATTACHMENT ID: 12739538 - PreCommit-HIVE-SPARK-Build

Rui Li added a comment - 15/Jun/15 04:41

Failed tests are not related.

Xuefu Zhang added a comment - 15/Jun/15 04:43

lirui, thanks for working on this. The changes look good, except that I don't quite understand the 3rd part of the change. Could you please explain? Thanks.

Rui Li added a comment - 15/Jun/15 05:06

Hi xuefuz, these flags should only be set for the MapWork that handles the big table, i.e. in this case the skewed data. Previously, we set the flags for every MapWork, including those for the small tables. This was copied from MR, where there's only one MapWork, for the big table, and small tables are processed in MapredLocalWork. So the 3rd part brings our implementation in line with the MR version.

Also, some performance data in case you want to know: I tested joining the skewed data using 6 mappers (configured) vs 2 mappers (default), and the run time was 31s vs 43s. The improvement should be more obvious on bigger data.
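
The big-table-only rule explained in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: `MapWorkStub`, its fields, and `apply_skew_join_flags` are hypothetical stand-ins and do not mirror Hive's actual MapWork class or planner code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MapWorkStub:
    """Hypothetical stand-in for a MapWork; not Hive's actual class."""
    name: str
    is_big_table: bool
    num_map_tasks: Optional[int] = None
    min_split_size: Optional[int] = None


def apply_skew_join_flags(works: List[MapWorkStub],
                          num_map_tasks: int,
                          min_split_size: int) -> List[MapWorkStub]:
    """Set the split hints only on the work that scans the big (skewed) table."""
    for work in works:
        if work.is_big_table:
            work.num_map_tasks = num_map_tasks
            work.min_split_size = min_split_size
        # Small-table works are left untouched, so they keep default splitting.
    return works


works = apply_skew_join_flags(
    [MapWorkStub("big", True), MapWorkStub("small", False)],
    num_map_tasks=6,
    min_split_size=32 * 1024 * 1024,
)
```

Leaving the small-table works untouched keeps the hints from changing how small inputs are split, which is the behavior the comment attributes to the MR version.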

Xuefu Zhang added a comment - 15/Jun/15 13:03

Makes sense. +1

Rui Li added a comment - 16/Jun/15 01:25

Committed to spark branch. Thanks Xuefu for the review.

Xuefu Zhang added a comment - 02/Aug/15 02:54

Merged to master and branch-1.

People

Assignee: Rui Li

Reporter: Rui Li

Votes: 0

Watchers: 3

Dates

Created: 12/Jun/15 10:48

Updated: 16/Feb/16 23:51

Resolved: 16/Jun/15 01:25