Hive / HIVE-2146

Block Sampling should adjust number of reducers accordingly to make it useful

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently the number of reducers is not adjusted for block sampling, so queries like:
      select c from tab tablesample(1 percent) group by c;
      can launch a huge number of reducers even though the sampled input is small.
      We need to shrink the number of reducers to make block sampling more useful.
      Because the number of reducers is currently determined before the splits are computed, the way to do this is probably not very clean, but we can make a good guess.
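
The "good guess" described above can be sketched as scaling the input size by the sampling percentage before estimating the reducer count. This is a minimal illustration, not the code from the attached patches: the class name, method name, and parameters (samplePercent, bytesPerReducer, maxReducers) are hypothetical, and the ceiling-division estimate is an assumption about how the reducer count is derived from input size.

```java
public class SampledReducerEstimate {

    /**
     * Estimates a reducer count from the input size, scaling the size by the
     * block-sampling percentage first. All names are hypothetical.
     */
    public static int estimateReducers(long totalInputBytes, double samplePercent,
                                       long bytesPerReducer, int maxReducers) {
        long effectiveBytes = totalInputBytes;
        if (samplePercent >= 0 && samplePercent < 100) {
            // Assume only samplePercent of the input will actually be read;
            // never let the estimate exceed the real input size.
            effectiveBytes = Math.min(totalInputBytes,
                    (long) (totalInputBytes * samplePercent / 100.0));
        }
        // Ceiling division: one reducer per bytesPerReducer of (sampled) input.
        long reducers = (effectiveBytes + bytesPerReducer - 1) / bytesPerReducer;
        reducers = Math.max(1, reducers);           // always run at least one reducer
        reducers = Math.min(maxReducers, reducers); // respect the configured cap
        return (int) reducers;
    }

    public static void main(String[] args) {
        long input = 100_000_000_000L;    // 100 GB of table data
        long perReducer = 1_000_000_000L; // 1 GB per reducer
        System.out.println(estimateReducers(input, 100.0, perReducer, 999)); // 100
        System.out.println(estimateReducers(input, 1.0, perReducer, 999));   // 1
    }
}
```

With these assumed numbers, a 1 percent sample drops the estimate from 100 reducers to 1, which is the kind of reduction the issue asks for.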

      1. HIVE-2146.1.patch
        2 kB
        Siying Dong
      2. HIVE-2146.2.patch
        2 kB
        Siying Dong

        Activity

        Carl Steinbach made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Hudson added a comment -

        Integrated in Hive-trunk-h0.20 #712 (See https://builds.apache.org/hudson/job/Hive-trunk-h0.20/712/)

        Ning Zhang made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.8.0 [ 12316178 ]
        Resolution Fixed [ 1 ]
        Ning Zhang added a comment -

        Committed. Thanks Siying!

        Ning Zhang added a comment -

        +1. Will commit if tests pass.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/685/
        -----------------------------------------------------------

        Review request for hive, Ning Zhang and namit jain.

        Summary
        -------

        Currently the number of reducers is not adjusted for block sampling, so queries like:
        select c from tab tablesample(1 percent) group by c;
        can launch a huge number of reducers even though the sampled input is small.
        We need to shrink the number of reducers to make block sampling more useful.
        Because the number of reducers is currently determined before the splits are computed, the way to do this is probably not very clean, but we can make a good guess.

        This addresses bug HIVE-2146.
        https://issues.apache.org/jira/browse/HIVE-2146

        Diffs
        -----

        trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 1098885

        Diff: https://reviews.apache.org/r/685/diff

        Testing
        -------

        Thanks,

        Siying

        Siying Dong made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Siying Dong added a comment -

        review board: https://reviews.apache.org/r/685/

        Siying Dong made changes -
        Attachment HIVE-2146.2.patch [ 12478119 ]
        Siying Dong added a comment -

        For 2), the more likely case in which the input can't be sampled is when CombineHiveInputFormat.getSplits() ends up calling super.getSplits() for some reason. In those cases, the data is not sampled at all.
        Another possibility is that, for example, two aliases of the MapReduce job include the same directory. We can't sample it then.

        For 1) and 3), I thought about it more. I'll remove the extra bytesPerReducer that was added. The worst case is that we run one fewer reducer, which shouldn't be so bad.
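
The "one fewer reducer" claim in point 3) can be checked with a little arithmetic. The sketch below assumes the estimate is a ceiling division of the input size by bytesPerReducer, and that the first patch added one extra bytesPerReducer to the size beforehand. Both are assumptions read from this discussion rather than from the patch itself, and the class and method names are hypothetical.

```java
public class ExtraBytesPerReducer {

    /** Ceiling division: reducers needed for the given number of bytes. */
    public static long ceilDiv(long bytes, long bytesPerReducer) {
        return (bytes + bytesPerReducer - 1) / bytesPerReducer;
    }

    public static void main(String[] args) {
        long bytesPerReducer = 1_000_000_000L; // assumed 1 GB per reducer
        long[] sizes = {1L, 500_000_000L, 1_000_000_000L, 2_500_000_000L};
        for (long size : sizes) {
            // Estimate with the extra bytesPerReducer added before rounding up.
            long withExtra = ceilDiv(size + bytesPerReducer, bytesPerReducer);
            // Estimate after removing the extra term.
            long withoutExtra = ceilDiv(size, bytesPerReducer);
            System.out.println(size + " bytes: " + withExtra
                    + " reducers with the extra term, " + withoutExtra + " without");
        }
    }
}
```

Under these assumptions the two estimates always differ by exactly one, because adding a whole bytesPerReducer before the ceiling division shifts the result up by one.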

        Ning Zhang made changes -
        Status: Patch Available [ 10002 ] -> Open [ 1 ]
        Ning Zhang added a comment -

        Siying, can you create a review request?

        I have a couple of comments as well:
        1) The comment on line 387 is incomplete.
        2) Comments on lines 388-389: can you give an example of a case where all input aliases are sampled according to the syntax but the input actually cannot be sampled?
        3) Line 391: if we want to mimic the old estimation algorithm, it seems we shouldn't add bytesPerReducer here. It is already added on line 399, right?

        Siying Dong made changes -
        Status: In Progress [ 3 ] -> Patch Available [ 10002 ]
        Siying Dong made changes -
        Status: Open [ 1 ] -> In Progress [ 3 ]
        Siying Dong made changes -
        Attachment HIVE-2146.1.patch [ 12478013 ]
        Siying Dong made changes -
        Attachment HIVE-2146.1.patch [ 12478011 ]
        Siying Dong made changes -
        Attachment HIVE-2146.1.patch [ 12478011 ]
        Siying Dong made changes -
        Field Original Value New Value
        Assignee Siying Dong [ sdong ]
        Siying Dong created issue -

          People

          • Assignee:
            Siying Dong
            Reporter:
            Siying Dong
          • Votes:
            0
            Watchers:
            0
