Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-16923 Hive-on-Spark DPP Improvements
  3. HIVE-17087

Remove unnecessary HoS DPP trees during map-join conversion

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersConvert to IssueMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 3.0.0
    • Spark
    • None

    Description

      Ran the following query in the TestSparkCliDriver:

      set hive.spark.dynamic.partition.pruning=true;
      set hive.auto.convert.join=true;
      
      create table partitioned_table1 (col int) partitioned by (part_col int);
      create table partitioned_table2 (col int) partitioned by (part_col int);
      create table regular_table (col int);
      insert into table regular_table values (1);
      
      alter table partitioned_table1 add partition (part_col = 1);
      insert into table partitioned_table1 partition (part_col = 1) values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
      
      alter table partitioned_table2 add partition (part_col = 1);
      insert into table partitioned_table2 partition (part_col = 1) values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
      
      explain select * from partitioned_table1, partitioned_table2 where partitioned_table1.part_col = partitioned_table2.part_col;
      

      and got the following explain plan:

      STAGE DEPENDENCIES:
        Stage-2 is a root stage
        Stage-3 depends on stages: Stage-2
        Stage-1 depends on stages: Stage-3
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-2
          Spark
      #### A masked pattern was here ####
            Vertices:
              Map 3 
                  Map Operator Tree:
                      TableScan
                        alias: partitioned_table1
                        Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: col (type: int), part_col (type: int)
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col1 (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                            Group By Operator
                              keys: _col0 (type: int)
                              mode: hash
                              outputColumnNames: _col0
                              Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                              Spark Partition Pruning Sink Operator
                                partition key expr: part_col
                                Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                                target column name: part_col
                                target work: Map 2
      
        Stage: Stage-3
          Spark
      #### A masked pattern was here ####
            Vertices:
              Map 2 
                  Map Operator Tree:
                      TableScan
                        alias: partitioned_table2
                        Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: col (type: int), part_col (type: int)
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                          Spark HashTable Sink Operator
                            keys:
                              0 _col1 (type: int)
                              1 _col1 (type: int)
                  Local Work:
                    Map Reduce Local Work
      
        Stage: Stage-1
          Spark
      #### A masked pattern was here ####
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: partitioned_table1
                        Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: col (type: int), part_col (type: int)
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 10 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            keys:
                              0 _col1 (type: int)
                              1 _col1 (type: int)
                            outputColumnNames: _col0, _col1, _col2, _col3
                            input vertices:
                              1 Map 2
                            Statistics: Num rows: 11 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                            File Output Operator
                              compressed: false
                              Statistics: Num rows: 11 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                              table:
                                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  Local Work:
                    Map Reduce Local Work
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      Stage-2 seems unnecessary, given that Stage-1 is going to do a full table scan of partitioned_table1 when running the map-join

      Attachments

        1. HIVE-17087.5.patch
          33 kB
          Sahil Takiar
        2. HIVE-17087.4.patch
          33 kB
          Sahil Takiar
        3. HIVE-17087.3.patch
          33 kB
          Sahil Takiar
        4. HIVE-17087.2.patch
          34 kB
          Sahil Takiar
        5. HIVE-17087.1.patch
          31 kB
          Sahil Takiar

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            stakiar Sahil Takiar Assign to me
            stakiar Sahil Takiar
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment