Hive
  1. Hive
  2. HIVE-560

column pruning not working with map joins

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      drop table tst1;
      drop table tst2;

      create table tst1(a1 string, a2 string, a3 string, a4 string);
      create table tst2(b1 string, b2 string, b3 string, b4 string);

      explain select /*+ MAPJOIN(a) */ a.a1, a.a2 from tst1 a join tst2 b ON a.a2=b.b2;

      the select is after the join - column pruning is not happening

      1. hive-560-2009-06-25.patch
        271 kB
        He Yongqiang
      2. hive-560-2009-06-24-normal(mapjoin-after-column-pruning).patch
        272 kB
        He Yongqiang
      3. hive-560-2009-06-24-mapjoin-before-column-pruning.patch
        267 kB
        He Yongqiang
      4. hive-560-2009-06-19-5.patch
        204 kB
        He Yongqiang
      5. hive-560-2009-06-19-2.patch
        27 kB
        He Yongqiang

        Issue Links

          Activity

          Hide
          He Yongqiang added a comment -

          A draft version. But it does not work for join32.q(both with and without mapjoin hint). And ppd_join2.q's output data seems not correct.
          Other tests passed.

          Show
          He Yongqiang added a comment - A draft version. But it does not work for join32.q(both with and without mapjoin hint). And ppd_join2.q's output data seems not correct. Other tests passed.
          Hide
          He Yongqiang added a comment -

          hive-560-2009-06-19-5.patch passed all test clidriver and testparse tests in my local.
          Also it fixed the problems of the earlier patch.
          it added a proc in column pruner for processing join op.

          Show
          He Yongqiang added a comment - hive-560-2009-06-19-5.patch passed all test clidriver and testparse tests in my local. Also it fixed the problems of the earlier patch. it added a proc in column pruner for processing join op.
          Hide
          Namit Jain added a comment -

          I will take a look and get back to you

          Show
          Namit Jain added a comment - I will take a look and get back to you
          Hide
          Namit Jain added a comment -

          Can you explain how is it working ? I was not able to understand it from the code.
          Also, there still seems to be a problem: for eg. in join1.q, src1.value is not used but
          it is not pruned right away.

          Show
          Namit Jain added a comment - Can you explain how is it working ? I was not able to understand it from the code. Also, there still seems to be a problem: for eg. in join1.q, src1.value is not used but it is not pruned right away.
          Hide
          He Yongqiang added a comment -

          From the explan, i think src1.value is pruned, isn't it?

          STAGE PLANS:
            Stage: Stage-1
              Map Reduce
                Alias -> Map Operator Tree:
                  src2 
                      Reduce Output Operator
                        key expressions:
                              expr: key
                              type: string
                        sort order: +
                        Map-reduce partition columns:
                              expr: key
                              type: string
                        tag: 1
                        value expressions:
                              expr: value
                              type: string
                  src1 
                      Reduce Output Operator
                        key expressions:
                              expr: key
                              type: string
                        sort order: +
                        Map-reduce partition columns:
                              expr: key
                              type: string
                        tag: 0
                        value expressions:
                              expr: key
                              type: string
                Reduce Operator Tree:
                  Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {VALUE._col0}
                      1 {VALUE._col1}
                    Select Operator
                      expressions:
                            expr: _col0
                            type: string
                            expr: _col3
                            type: string
                      Select Operator
                        expressions:
                              expr: UDFToInteger(_col0)
                              type: int
                              expr: _col1
                              type: string
                        File Output Operator
                          compressed: false
                          GlobalTableId: 1
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                              name: dest_j1
          
          Show
          He Yongqiang added a comment - From the explan, i think src1.value is pruned, isn't it? STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: src2 Reduce Output Operator key expressions: expr: key type: string sort order: + Map-reduce partition columns: expr: key type: string tag: 1 value expressions: expr: value type: string src1 Reduce Output Operator key expressions: expr: key type: string sort order: + Map-reduce partition columns: expr: key type: string tag: 0 value expressions: expr: key type: string Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col1} Select Operator expressions: expr: _col0 type: string expr: _col3 type: string Select Operator expressions: expr: UDFToInteger(_col0) type: int expr: _col1 type: string File Output Operator compressed: false GlobalTableId: 1 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: dest_j1
          Hide
          He Yongqiang added a comment - - edited

          Some commets about the modification:
          join is pruned by seeing which of its output column names is used by its children and removing its unused exprs.
          And for each parent reduce sink, the join also remember which of the parent output columns it used. These are saved in the newly added field 'joinPrunedColLists' in pruner context.
          In reduce sink, if its child is a join, it will gets the information of which output is used by the child join, and prunes itself accordingly.

          Show
          He Yongqiang added a comment - - edited Some commets about the modification: join is pruned by seeing which of its output column names is used by its children and removing its unused exprs. And for each parent reduce sink, the join also remember which of the parent output columns it used. These are saved in the newly added field 'joinPrunedColLists' in pruner context. In reduce sink, if its child is a join, it will gets the information of which output is used by the child join, and prunes itself accordingly.
          Hide
          Namit Jain added a comment -

          You are assuming that Column Pruning is happening before map join conversion - which is true today, but may be a bad assumption in the long term.
          Ideally, the optimizations should be completely independent of each other.

          Therefore, both kinds of trees - before and after map join conversion should be supported.

          Show
          Namit Jain added a comment - You are assuming that Column Pruning is happening before map join conversion - which is true today, but may be a bad assumption in the long term. Ideally, the optimizations should be completely independent of each other. Therefore, both kinds of trees - before and after map join conversion should be supported.
          Hide
          He Yongqiang added a comment -

          Modifed the code to let column pruning be independ with map join conversion.
          hive-560-2009-06-24-mapjoin-before-column-pruning.patch and hive-560-2009-06-24-normal(mapjoin-after-column-pruning).patch share the same set of modifications.
          They both passed tests in my local.

          hive-560-2009-06-24-mapjoin-before-column-pruning.patch is not created against truck. And in it, map join conversion is done before column prune started( by reordering them in optimizer).

          hive-560-2009-06-24-normal(mapjoin-after-column-pruning).patch is the one we needed and i retest and patched with truck code.

          Show
          He Yongqiang added a comment - Modifed the code to let column pruning be independ with map join conversion. hive-560-2009-06-24-mapjoin-before-column-pruning.patch and hive-560-2009-06-24-normal(mapjoin-after-column-pruning).patch share the same set of modifications. They both passed tests in my local. hive-560-2009-06-24-mapjoin-before-column-pruning.patch is not created against truck. And in it, map join conversion is done before column prune started( by reordering them in optimizer). hive-560-2009-06-24-normal(mapjoin-after-column-pruning).patch is the one we needed and i retest and patched with truck code.
          Hide
          Namit Jain added a comment -

          List<Boolean> removed = conf.getRemovedList().get(alias);
          for (int i = 0; i < sz; i++) {
          for (; pos < removed.size(); pos++) {
          if (pos < structFields.size() && !removed.get(pos))

          { structFieldObjectInspectors.add(structFields.get(pos) .getFieldObjectInspector()); pos++; break; }

          }
          }
          }

          Looks like there is some bug in above code in MapJoinOperator - what if nothing is removed i.e all columns are selected.
          Also, it might be a good idea not to use removedList at runtime, but create the correct list from removedList at compile time itself.

          Show
          Namit Jain added a comment - List<Boolean> removed = conf.getRemovedList().get(alias); for (int i = 0; i < sz; i++) { for (; pos < removed.size(); pos++) { if (pos < structFields.size() && !removed.get(pos)) { structFieldObjectInspectors.add(structFields.get(pos) .getFieldObjectInspector()); pos++; break; } } } } Looks like there is some bug in above code in MapJoinOperator - what if nothing is removed i.e all columns are selected. Also, it might be a good idea not to use removedList at runtime, but create the correct list from removedList at compile time itself.
          Hide
          He Yongqiang added a comment -
                if (conf.getOutputColumnNames().size() < structFields.size()) {
                  List<ObjectInspector> structFieldObjectInspectors = new ArrayList<ObjectInspector>();
                  for (Byte alias : order) {
                    int sz = conf.getExprs().get(alias).size();
                    int pos = 0;
                    List<Boolean> removed = conf.getRemovedList().get(alias);
                    for (int i = 0; i < sz; i++) {
                      for (; pos < removed.size(); pos++) {
                        if (pos < structFields.size() && !removed.get(pos)) {
                          structFieldObjectInspectors.add(structFields.get(pos)
                              .getFieldObjectInspector());
                          pos++;
                          break;
                        }
                      }
                    }
                  }
          

          if (conf.getOutputColumnNames().size() < structFields.size()) can avoid the situations you mentioned(if all columns are selected).

          Show
          He Yongqiang added a comment - if (conf.getOutputColumnNames().size() < structFields.size()) { List<ObjectInspector> structFieldObjectInspectors = new ArrayList<ObjectInspector>(); for (Byte alias : order) { int sz = conf.getExprs().get(alias).size(); int pos = 0; List<Boolean> removed = conf.getRemovedList().get(alias); for (int i = 0; i < sz; i++) { for (; pos < removed.size(); pos++) { if (pos < structFields.size() && !removed.get(pos)) { structFieldObjectInspectors.add(structFields.get(pos) .getFieldObjectInspector()); pos++; break; } } } } if (conf.getOutputColumnNames().size() < structFields.size()) can avoid the situations you mentioned(if all columns are selected).
          Hide
          He Yongqiang added a comment -

          >>it might be a good idea not to use removedList at runtime, but create the correct list from removedList at compile time itself.
          you mean what we should use is a retain list instead of a remove list?

          Show
          He Yongqiang added a comment - >>it might be a good idea not to use removedList at runtime, but create the correct list from removedList at compile time itself. you mean what we should use is a retain list instead of a remove list?
          Hide
          Namit Jain added a comment -

          yes, and dont increment pos in 2 places - can you rewrite this portion

          Show
          Namit Jain added a comment - yes, and dont increment pos in 2 places - can you rewrite this portion
          Hide
          He Yongqiang added a comment -

          Will try to upload a new one today.

          Show
          He Yongqiang added a comment - Will try to upload a new one today.
          Hide
          Namit Jain added a comment -

          Thanks, otherwise it looks good

          Show
          Namit Jain added a comment - Thanks, otherwise it looks good
          Hide
          Namit Jain added a comment -

          In ColumnPrunerProcFactory: line 482, you are checking for
          mapJoin with mapJoin

          if(mapJoin) {

          ...
          if (mapJoin)

          { ... }

          }

          The inner check can be omitted.

          Show
          Namit Jain added a comment - In ColumnPrunerProcFactory: line 482, you are checking for mapJoin with mapJoin if(mapJoin) { ... if (mapJoin) { ... } } The inner check can be omitted.
          Hide
          He Yongqiang added a comment -

          done. Namit, thanks! I will test it locally and upload.

          Show
          He Yongqiang added a comment - done. Namit, thanks! I will test it locally and upload.
          Hide
          He Yongqiang added a comment -

          Upload a new patch integrating with Namit's comments (Thanks Namit).
          Since the modification is not major comparing to the previous one, I test the new patch with a full test of test parse, but only a subset of test cli driver testcases (only queries with join*.q are tested).

          Show
          He Yongqiang added a comment - Upload a new patch integrating with Namit's comments (Thanks Namit). Since the modification is not major comparing to the previous one, I test the new patch with a full test of test parse, but only a subset of test cli driver testcases (only queries with join*.q are tested).
          Hide
          He Yongqiang added a comment -

          reupload hive-560-2009-06-25.patch.

          Show
          He Yongqiang added a comment - reupload hive-560-2009-06-25.patch.
          Hide
          Namit Jain added a comment -

          Committed. Thanks Yongqiang

          Show
          Namit Jain added a comment - Committed. Thanks Yongqiang

            People

            • Assignee:
              He Yongqiang
              Reporter:
              Namit Jain
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development