[HIVE-17178] Spark Partition Pruning Sink Operator can't target multiple Works - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0
Component/s: Spark
Labels: None
Target Version/s: 3.0.0

Description

A Spark Partition Pruning Sink Operator cannot target more than one Map Work. As a result, when a single table is used to prune multiple Map Works, the entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated, once per target.

The following query shows the issue:

set hive.spark.dynamic.partition.pruning=true;
set hive.auto.convert.join=true;

create table part_table_1 (col int) partitioned by (part_col int);
create table part_table_2 (col int) partitioned by (part_col int);
create table regular_table (col int);

insert into table regular_table values (1);

alter table part_table_1 add partition (part_col=1);
insert into table part_table_1 partition (part_col=1) values (1), (2), (3), (4);

alter table part_table_1 add partition (part_col=2);
insert into table part_table_1 partition (part_col=2) values (1), (2), (3), (4);

alter table part_table_2 add partition (part_col=1);
insert into table part_table_2 partition (part_col=1) values (1), (2), (3), (4);

alter table part_table_2 add partition (part_col=2);
insert into table part_table_2 partition (part_col=2) values (1), (2), (3), (4);

explain select * from regular_table, part_table_1, part_table_2 where regular_table.col = part_table_1.part_col and regular_table.col = part_table_2.part_col;

The explain plan is:

STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: regular_table
                  Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: col is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: col (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: int)
                          1 _col1 (type: int)
                          2 _col1 (type: int)
                      Select Operator
                        expressions: _col0 (type: int)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: part_col
                            Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                            target column name: part_col
                            target work: Map 2
                      Select Operator
                        expressions: _col0 (type: int)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: part_col
                            Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
                            target column name: part_col
                            target work: Map 3
            Local Work:
              Map Reduce Local Work
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: part_table_2
                  Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: col (type: int), part_col (type: int)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 _col0 (type: int)
                        1 _col1 (type: int)
                        2 _col1 (type: int)
                    Select Operator
                      expressions: _col1 (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          partition key expr: part_col
                          Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          target column name: part_col
                          target work: Map 2
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: part_table_1
                  Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: col (type: int), part_col (type: int)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 8 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      keys:
                        0 _col0 (type: int)
                        1 _col1 (type: int)
                        2 _col1 (type: int)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4
                      input vertices:
                        0 Map 1
                        2 Map 3
                      Statistics: Num rows: 17 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 17 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The two DPP subtrees on Map 1 are identical except for their target work (Map 2 and Map 3). We should be able to combine them into a single subtree, which avoids doing duplicate work.
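The combination proposed here can be illustrated with a minimal sketch. It is written in Python rather than Hive's Java and is only a conceptual model, not the actual fix. `PruningSink`, `CombinedSink`, and `combine_sinks` are hypothetical names, not Hive classes, and the sketch assumes that two DPP branches count as duplicates when they select and group by the same source expression.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PruningSink:
    """Hypothetical stand-in for one SEL-GBY-SPARKPRUNINGSINK branch."""
    source_expr: str    # expression selected and grouped on the source side
    target_column: str  # partition column pruned on the target side
    target_work: str    # name of the Map Work being pruned


@dataclass
class CombinedSink:
    """Hypothetical sink that carries a list of targets instead of one."""
    source_expr: str
    targets: List[Tuple[str, str]] = field(default_factory=list)


def combine_sinks(sinks: List[PruningSink]) -> List[CombinedSink]:
    """Merge branches with the same source expression (assumed duplicate rule)."""
    combined: Dict[str, CombinedSink] = {}
    for sink in sinks:
        merged = combined.setdefault(
            sink.source_expr, CombinedSink(sink.source_expr)
        )
        target = (sink.target_column, sink.target_work)
        if target not in merged.targets:
            merged.targets.append(target)
    return list(combined.values())


# The two branches on Map 1 from the plan above.
map1_sinks = [
    PruningSink("_col0", "part_col", "Map 2"),
    PruningSink("_col0", "part_col", "Map 3"),
]
result = combine_sinks(map1_sinks)
print(len(result))        # 1
print(result[0].targets)  # [('part_col', 'Map 2'), ('part_col', 'Map 3')]
```

In this model the SEL and GBY work is done once and the single sink lists both Map 2 and Map 3 as targets.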

Attachments

HIVE-17178.6.patch  08/Mar/18 07:05  158 kB  Rui Li
HIVE-17178.5.patch  26/Feb/18 14:01  157 kB  Rui Li
HIVE-17178.4.patch  18/Feb/18 02:05  157 kB  Rui Li
HIVE-17178.3.patch  13/Feb/18 07:04  156 kB  Rui Li
HIVE-17178.2.patch  22/Jan/18 13:24  61 kB  Rui Li
HIVE-17178.1.patch  18/Jan/18 13:24  42 kB  Rui Li

Issue Links

is required by: HIVE-17193 HoS: don't combine map works that are targets of different DPPs (Closed)
links to: RB link
