  1. Hive
  2. HIVE-9025

join38.q (without map join) produces incorrect result when testing with multiple reducers

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.14.0
    • Fix Version/s: 1.0.0
    • Component/s: Logical Optimizer
    • Labels: None

    Description

      I have this query from a modified version of join38.q, which does NOT use map join:

      FROM src a JOIN tmp b ON (a.key = b.col11)
      SELECT a.value, b.col5, count(1) as count
      where b.col11 = 111
      group by a.value, b.col5;
      

      If I set mapred.reduce.tasks to 1, the result is correct. But if I set it to a larger number (3, for instance), the result is

      val_111	105	1
      

      which is wrong.

      I think the issue is that, in this case, ConstantPropagationProcFactory overwrites the partition columns of the reduce sink desc with an empty list. Later, in ReduceSinkOperator#computeHashCode, since partitionEval has length 0, a random number is used as the hash code for each row. As a result, rows with the same key are distributed to different reducers, which leads to the incorrect result.
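      The suspected mechanism can be illustrated with a small standalone simulation. This is only a sketch under assumptions, not Hive's actual code: the class PartitionSketch, its methods, the fixed seed, and the modulo-based reducer assignment are all hypothetical stand-ins for what ReduceSinkOperator#computeHashCode and the partitioner do. It shows that an empty partition column list scatters rows with the same key across reducers, while a single reducer hides the problem.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class PartitionSketch {

    // Hypothetical stand-in for the hash computation in a reduce sink:
    // with no partition columns, fall back to a random value per row.
    static int hashCodeFor(String key, List<String> partitionCols, Random random) {
        if (partitionCols.isEmpty()) {
            return random.nextInt();
        }
        return key.hashCode();
    }

    // Assumed reducer assignment: non-negative hash modulo reducer count.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    // Returns the set of reducers that receive rows carrying the same key.
    static Set<Integer> reducersSeen(String key, int rows,
                                     List<String> partitionCols, int numReducers) {
        Random random = new Random(12345); // arbitrary seed, for repeatability
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < rows; i++) {
            int hash = hashCodeFor(key, partitionCols, random);
            seen.add(reducerFor(hash, numReducers));
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> intact = Arrays.asList("key");
        List<String> emptied = Collections.emptyList();

        // Partition columns intact: every row with this key goes to one reducer.
        System.out.println("intact, 3 reducers:  "
                + reducersSeen("111", 100, intact, 3));
        // Partition columns emptied: rows with the same key are scattered.
        System.out.println("emptied, 3 reducers: "
                + reducersSeen("111", 100, emptied, 3));
        // A single reducer masks the bug, matching the correct result seen
        // when mapred.reduce.tasks is 1.
        System.out.println("emptied, 1 reducer:  "
                + reducersSeen("111", 100, emptied, 1));
    }
}
```

      If the real code path behaves like this sketch, matching rows from the two join inputs would often land on different reducers, which is consistent with the join producing too few rows when more than one reducer is used.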

      Attachments

        1. HIVE-9025.1.patch
          40 kB
          Ted Xu
        2. HIVE-9025.patch
          22 kB
          Ted Xu

            People

              Assignee: Ted Xu (tedxu)
              Reporter: Chao Sun (csun)
              Votes: 0
              Watchers: 7