Description
With SparkMapJoinResolver and SparkReduceSinkMapJoinProc enabled, the following query:
explain select * from src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON (src1.key + src2.key = src3.key);
produces too many stages (six) and too many HashTable Sink operators (four). Stage-7 and Stage-6 repeat the work of Stage-5 and Stage-4:
STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-4 depends on stages: Stage-5
  Stage-3 depends on stages: Stage-4
  Stage-7 is a root stage
  Stage-6 depends on stages: Stage-7
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-5
    Spark
      DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: src2
                  Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      condition expressions:
                        0 {key} {value}
                        1 {key} {value}
                      keys:
                        0 key (type: string)
                        1 key (type: string)

  Stage: Stage-4
    Spark
      DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:2
      Vertices:
        Map 3
            Map Operator Tree:
                TableScan
                  alias: src1
                  Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {key} {value}
                        1 {key} {value}
                      keys:
                        0 key (type: string)
                        1 key (type: string)
                      outputColumnNames: _col0, _col1, _col5, _col6
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: (_col0 + _col5) is not null (type: boolean)
                        Statistics: Num rows: 8 Data size: 1653 Basic stats: COMPLETE Column stats: NONE
                        HashTable Sink Operator
                          condition expressions:
                            0 {_col0} {_col1} {_col5} {_col6}
                            1 {key} {value}
                          keys:
                            0 (_col0 + _col5) (type: double)
                            1 UDFToDouble(key) (type: double)

  Stage: Stage-3
    Spark
      DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:1
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: src3
                  Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: UDFToDouble(key) is not null (type: boolean)
                    Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {_col0} {_col1} {_col5} {_col6}
                        1 {key} {value}
                      keys:
                        0 (_col0 + _col5) (type: double)
                        1 UDFToDouble(key) (type: double)
                      outputColumnNames: _col0, _col1, _col5, _col6, _col10, _col11
                      input vertices:
                        0 Map 3
                      Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string), _col10 (type: string), _col11 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                        Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-7
    Spark
      DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: src2
                  Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      condition expressions:
                        0 {key} {value}
                        1 {key} {value}
                      keys:
                        0 key (type: string)
                        1 key (type: string)

  Stage: Stage-6
    Spark
      DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:2
      Vertices:
        Map 3
            Map Operator Tree:
                TableScan
                  alias: src1
                  Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {key} {value}
                        1 {key} {value}
                      keys:
                        0 key (type: string)
                        1 key (type: string)
                      outputColumnNames: _col0, _col1, _col5, _col6
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: (_col0 + _col5) is not null (type: boolean)
                        Statistics: Num rows: 8 Data size: 1653 Basic stats: COMPLETE Column stats: NONE
                        HashTable Sink Operator
                          condition expressions:
                            0 {_col0} {_col1} {_col5} {_col6}
                            1 {key} {value}
                          keys:
                            0 (_col0 + _col5) (type: double)
                            1 UDFToDouble(key) (type: double)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
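For illustration only, the duplication in the plan above can be checked mechanically. The following is a minimal Python sketch, not part of Hive: the helper names (parse_dependencies, duplicate_stages) are hypothetical, the dependency lines are copied from the EXPLAIN output in this issue, and STAGE_SUMMARY is a hand-written simplification of each stage's operator tree (scanned alias followed by operator names), not something Hive emits.

```python
import re

# Stage dependency lines copied from the EXPLAIN output in this issue.
STAGE_DEPENDENCIES = """\
Stage-5 is a root stage
Stage-4 depends on stages: Stage-5
Stage-3 depends on stages: Stage-4
Stage-7 is a root stage
Stage-6 depends on stages: Stage-7
Stage-0 is a root stage
"""

# ASSUMPTION: hand-written summary of each stage's operator tree
# (scanned alias, then operators in order), read off the plan by eye.
STAGE_SUMMARY = {
    "Stage-5": ("src2", "Filter", "HashTableSink"),
    "Stage-4": ("src1", "Filter", "MapJoin", "Filter", "HashTableSink"),
    "Stage-3": ("src3", "Filter", "MapJoin", "Select", "FileOutput"),
    "Stage-7": ("src2", "Filter", "HashTableSink"),
    "Stage-6": ("src1", "Filter", "MapJoin", "Filter", "HashTableSink"),
    "Stage-0": ("Fetch",),
}


def parse_dependencies(text):
    """Return {stage: [parent stages]} from STAGE DEPENDENCIES lines."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        if root:
            deps[root.group(1)] = []
            continue
        child = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if child:
            deps[child.group(1)] = re.findall(r"Stage-\d+", child.group(2))
    return deps


def duplicate_stages(summary):
    """Group stages whose summarized operator trees are identical."""
    groups = {}
    for stage, signature in summary.items():
        groups.setdefault(signature, []).append(stage)
    return [stages for stages in groups.values() if len(stages) > 1]


deps = parse_dependencies(STAGE_DEPENDENCIES)
roots = [stage for stage, parents in deps.items() if not parents]
dups = duplicate_stages(STAGE_SUMMARY)
sink_count = sum(sig.count("HashTableSink") for sig in STAGE_SUMMARY.values())

print("stages:", len(deps))
print("root stages:", roots)
print("duplicate groups:", dups)
print("HashTable Sink operators:", sink_count)
```

Under these assumptions the sketch reports six stages, three root stages, four HashTable Sink operators, and two duplicate groups (Stage-5 with Stage-7, Stage-4 with Stage-6), which matches the complaint in the description.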