Hive / HIVE-21196

Support semijoin reduction on multiple column join


Details

    Description

      Currently, a query involving a join on multiple columns creates a separate semijoin edge for each key, which in turn creates a bloom filter for each of them, as shown below:

      EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on (srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = srcpart_small_n3.value1)

      Map 1 <- Reducer 5 (BROADCAST_EDGE)
              Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
              Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE)
      #### A masked pattern was here ####
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: srcpart_date_n7
                        filterExpr: (key is not null and value is not null and (key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter)))) (type: boolean)
                        Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null and value is not null) (type: boolean)
                          Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: key (type: string), value (type: string)
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: string), _col1 (type: string)
                              sort order: ++
                              Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                              Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized, llap
                  LLAP IO: all inputs
              Map 4 
                  Map Operator Tree:
                      TableScan
                        alias: srcpart_small_n3
                        filterExpr: (key1 is not null and value1 is not null) (type: boolean)
                        Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
                        Filter Operator
                          predicate: (key1 is not null and value1 is not null) (type: boolean)
                          Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
                          Select Operator
                            expressions: key1 (type: string), value1 (type: string)
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
                            Reduce Output Operator
                              key expressions: _col0 (type: string), _col1 (type: string)
                              sort order: ++
                              Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                              Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
                            Select Operator
                              expressions: _col0 (type: string)
                              outputColumnNames: _col0
                              Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL
                              Group By Operator
                                aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=20)
                                mode: hash
                                outputColumnNames: _col0, _col1, _col2
                                Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
                                Reduce Output Operator
                                  sort order: 
                                  Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
                                  value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary)
                  Execution mode: vectorized, llap
                  LLAP IO: all inputs
              Reducer 2 
                  Execution mode: llap
                  Reduce Operator Tree:
                    Merge Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 _col0 (type: string), _col1 (type: string)
                        1 _col0 (type: string), _col1 (type: string)
                      Statistics: Num rows: 2200 Data size: 391600 Basic stats: PARTIAL Column stats: NONE
                      Group By Operator
                        aggregations: count()
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
                        Reduce Output Operator
                          sort order: 
                          Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
                          value expressions: _col0 (type: bigint)
              Reducer 3 
                  Execution mode: vectorized, llap
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: count(VALUE._col0)
                      mode: mergepartial
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Reducer 5 
                  Execution mode: vectorized, llap
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=20)
                      mode: final
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
                      Reduce Output Operator
                        sort order: 
                        Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL
                        value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary)
      

      Instead, it should create a single semijoin branch with one bloom filter covering all the join keys.

       

      The bloom filter implementation requires hashing all the key columns together, converting the result to a long, and feeding that long to the bloom filter as input. This requires a new UDF, which will be called in both the bloom filter generation and lookup phases.

      The min and max will stay independent for each column, as they are today.
      A vectorized implementation of this UDF is also required.
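      The idea described above can be sketched in plain Java. This is a simplified illustration, not Hive's actual UDF or bloom filter code: the class and method names (`MultiKeyBloomFilter`, `combinedHash`) are hypothetical, and the hash-combining scheme is a placeholder for whatever hash the real UDF would use. The key point it demonstrates is that the same multi-column hash function must be applied on both the generation (build) side and the lookup (probe) side.

      ```java
      import java.util.BitSet;

      // Hypothetical sketch: fold a multi-column join key into one long hash so a
      // single bloom filter can cover all key columns at once.
      public class MultiKeyBloomFilter {
          private final BitSet bits;
          private final int numBits;
          private final int numHashes;

          public MultiKeyBloomFilter(int numBits, int numHashes) {
              this.numBits = numBits;
              this.numHashes = numHashes;
              this.bits = new BitSet(numBits);
          }

          // Combine all key columns into a single 64-bit hash. The identical
          // function must be used at bloom-filter generation and at lookup.
          static long combinedHash(Object... keyColumns) {
              long h = 17L;
              for (Object col : keyColumns) {
                  h = h * 31L + (col == null ? 0L : col.hashCode());
              }
              return h;
          }

          public void add(long hash) {
              for (int i = 0; i < numHashes; i++) {
                  bits.set(indexFor(hash, i));
              }
          }

          public boolean mightContain(long hash) {
              for (int i = 0; i < numHashes; i++) {
                  if (!bits.get(indexFor(hash, i))) {
                      return false;
                  }
              }
              return true;
          }

          // Derive the i-th probe position from the single 64-bit hash.
          private int indexFor(long hash, int i) {
              long h = hash + (long) i * 0x9E3779B97F4A7C15L;
              h ^= (h >>> 32);
              return (int) ((h & Long.MAX_VALUE) % numBits);
          }

          public static void main(String[] args) {
              MultiKeyBloomFilter bf = new MultiKeyBloomFilter(1 << 16, 3);
              // Build side: insert the combined hash of (key1, value1).
              bf.add(combinedHash("k1", "v1"));
              // Probe side: the same pair is always found (no false negatives).
              System.out.println(bf.mightContain(combinedHash("k1", "v1"))); // true
              // A different pair is usually (not always) rejected, since bloom
              // filters can report false positives.
              System.out.println(bf.mightContain(combinedHash("k1", "v2")));
          }
      }
      ```

      With a single combined hash, the plan above would carry one bloom-filter branch instead of one per key column; only the per-column min/max aggregations would remain separate.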

      Attachments

        Issue Links

        Activity


          People

            Assignee: zabetak Stamatis Zampetakis
            Reporter: djaiswal Deepak Jaiswal
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 3h 40m
