Hive / HIVE-1093

Add a "skew join map join size" variable to control the input size of skew join's following map join job.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In a test, many skew join keys were each larger than 250M, and the follow-up map join job took several hours to process those big skew keys.
      This can be improved by using a smaller map input size for the follow-up map join job.
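
      As a rough illustration of why a smaller map input size helps, the sketch below shows how the split size determines how many map tasks share the data of one large skew key. It assumes splits are cut purely by byte size; the function name and the sizes are hypothetical and are not taken from the patch.

```python
def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """Number of map tasks when the input is cut into fixed-size splits.

    Simplified model (an assumption for illustration): ignores block
    boundaries, file boundaries, and any scheduler limits.
    """
    if input_bytes < 0 or split_bytes <= 0:
        raise ValueError("input_bytes must be >= 0 and split_bytes > 0")
    # Integer ceiling division; at least one task even for empty input.
    return max(1, -(-input_bytes // split_bytes))


MB = 1024 * 1024
skew_key_bytes = 256 * MB  # one large skew key, as in the description

# With a 256M split, a single mapper handles the whole key.
one_big_split = num_map_tasks(skew_key_bytes, 256 * MB)
# With a 32M split, the same data is spread over several mappers.
smaller_splits = num_map_tasks(skew_key_bytes, 32 * MB)
print(one_big_split, smaller_splits)
```

      Under this model, shrinking the split from 256M to 32M spreads the same key over eight mappers instead of one.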

      1. hive-1093.patch
        6 kB
        He Yongqiang
      2. hive-1093.2.patch
        7 kB
        He Yongqiang

        Activity

        Namit Jain added a comment -

        Committed. Thanks Yongqiang

        Namit Jain added a comment -

        +1

        looks good

        He Yongqiang added a comment -

        >> Does it work for combinehiveinputsplit also?
        No. We should not use the combine input format for this, because CombineFileInputFormat uses the block size as the minimum split size. We need to explicitly set the second job to use HiveInputFormat. I will update the patch to do that.

        Namit Jain added a comment -

        The changes look good. Does it work for combinehiveinputsplit also?

        He Yongqiang added a comment -

        >> Do you have performance numbers for the test case?
        Yes. In my test case, a split of 256M joined with 100K currently takes more than 5 hours. (The join values can be ignored, so 256M and 100K are essentially pure key sizes.)
        The 'map join size' should not be determined only by the big side (e.g. 256M). The size of the small side is more important in this case.

        The point is that KEY1 ("256M join 100K") should use a much smaller split size than KEY2 ("256M join 1K"). The problem is that we currently process KEY1 and KEY2 in the same job, so if we choose a split size according to KEY1, it may be a bit small for KEY2.

        If we choose to use a bucket join for the follow-up map join job, we will be able to choose the split size independently for different keys (because they are processed in different jobs).
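
        The KEY1 versus KEY2 argument can be sketched with a simplified cost model. It assumes that the work a mapper does for one skew key is proportional to (bytes of the big side in its split) × (bytes of the small side). That model, the `work_budget` value, and the function name are illustrative assumptions, not measurements or code from the patch.

```python
KB = 1024
MB = 1024 * KB


def split_size_for(small_side_bytes: int, work_budget: int) -> int:
    """Largest split (in bytes) whose per-mapper work stays within budget.

    Assumed model: work per mapper = split_bytes * small_side_bytes.
    """
    if small_side_bytes <= 0 or work_budget <= 0:
        raise ValueError("sizes and budget must be positive")
    return max(1, work_budget // small_side_bytes)


# Hypothetical budget: a 32M split joined against a 1K small side is acceptable.
work_budget = (32 * MB) * (1 * KB)

key1_split = split_size_for(100 * KB, work_budget)  # KEY1: "256M join 100K"
key2_split = split_size_for(1 * KB, work_budget)    # KEY2: "256M join 1K"

# When both keys run in the same job, one split size must fit the stricter key.
shared_split = min(key1_split, key2_split)
print(key1_split, key2_split, shared_split)
```

        In this model the shared job is forced onto KEY1's split size, which is roughly 100 times smaller than what KEY2 could tolerate. Running the keys in separate jobs would let each use its own value.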

        Namit Jain added a comment -

        Do you have performance numbers for the test case? I mean, a small map size will lead to more mappers, each of which reads the other tables.
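
        The concern can be made concrete with a small sketch: if every mapper loads the small side in full, the total bytes read from the small side grow linearly with the number of mappers. The function name and the sizes are hypothetical, and the model ignores caching and any shared distribution of the small tables.

```python
KB = 1024
MB = 1024 * KB


def small_side_bytes_read(big_bytes: int, split_bytes: int, small_bytes: int) -> int:
    """Total bytes read from the small side across all mappers.

    Assumed model: one mapper per split of the big side, and each mapper
    reads the entire small side once.
    """
    if big_bytes <= 0 or split_bytes <= 0 or small_bytes < 0:
        raise ValueError("big_bytes and split_bytes must be > 0, small_bytes >= 0")
    mappers = -(-big_bytes // split_bytes)  # integer ceiling division
    return mappers * small_bytes


# 256M big side, 100K small side, two candidate split sizes.
coarse = small_side_bytes_read(256 * MB, 32 * MB, 100 * KB)
fine = small_side_bytes_read(256 * MB, 1 * MB, 100 * KB)
print(coarse // KB, fine // KB)  # totals in KB
```

        Going from 32M to 1M splits multiplies the small-side reads by 32 in this model, which is the overhead that has to be weighed against the shorter per-mapper join time.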


          People

          • Assignee:
            He Yongqiang
            Reporter:
            He Yongqiang
          • Votes:
            0
            Watchers:
            0
