Details
-
Sub-task
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
This is from TPC-DS Q7
Because we don't compute the selectivity of sub-expression in a HiveJoin we assume that selective and non-selective joins have the similar cost.
select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 from store_sales, customer_demographics, item where store_sales.ss_item_sk = item.i_item_sk and store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk and cd_gender = 'F' and cd_marital_status = 'W' and cd_education_status = 'Primary' group by i_item_id order by i_item_id limit 100
Cardinality
item 462,000 customer_demographics 1,920,800 store_sales 82,510,879,939
NDVs
item.i_item_sk 439501 customer_demographics.cd_demo_sk 1835839 store_sales.ss_cdemo_sk 1835839
From the logs
2015-04-20 21:09:58,055 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(60)) - Join algorithm selection for: HiveJoin(condition=[=($0, $10)], joinType=[inner], algorithm=[none], cost=[not available]) HiveJoin(condition=[=($1, $6)], joinType=[inner], algorithm=[MapJoin], cost=[{8.251089518344444E10 rows, 2.324083308641975E8 cpu, 275417.5666666666 io}]) HiveProject(ss_item_sk=[$1], ss_cdemo_sk=[$3], ss_quantity=[$9], ss_list_price=[$11], ss_sales_price=[$12], ss_coupon_amt=[$18]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.store_sales]]) HiveProject(cd_demo_sk=[$0], cd_gender=[$1], cd_marital_status=[$2], cd_education_status=[$3]) HiveFilter(condition=[AND(=($1, 'F'), =($2, 'W'), =($3, 'Primary'))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.customer_demographics]]) HiveProject(i_item_sk=[$0], i_item_id=[$1]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.item]]) 2015-04-20 21:09:58,056 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - CommonJoin cost: {6.553102534841269E8 rows, 4.0217814199458417E18 cpu, 3.499540319862703E7 io} 2015-04-20 21:09:58,056 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - MapJoin cost: {6.553102534841269E8 rows, 2.13444462E11 cpu, 1.0720709999999998E7 io} 2015-04-20 21:09:58,056 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(78)) - MapJoin selected 2015-04-20 21:09:58,057 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(60)) - Join algorithm selection for: HiveJoin(condition=[=($1, $8)], joinType=[inner], algorithm=[none], cost=[not available]) HiveJoin(condition=[=($0, $6)], joinType=[inner], algorithm=[MapJoin], cost=[{8.2511341939E10 rows, 2.13444462E11 cpu, 1.0720709999999998E7 io}]) HiveProject(ss_item_sk=[$1], ss_cdemo_sk=[$3], ss_quantity=[$9], ss_list_price=[$11], ss_sales_price=[$12], ss_coupon_amt=[$18]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.store_sales]]) HiveProject(i_item_sk=[$0], i_item_id=[$1]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.item]]) HiveProject(cd_demo_sk=[$0], cd_gender=[$1], cd_marital_status=[$2], cd_education_status=[$3]) HiveFilter(condition=[AND(=($1, 'F'), =($2, 'W'), =($3, 'Primary'))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_30000.customer_demographics]]) 2015-04-20 21:09:58,058 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - CommonJoin cost: {8.251089518344444E10 rows, 2.6089279242468144E21 cpu, 4.901146588836599E9 io} 2015-04-20 21:09:58,058 DEBUG [main]: cost.HiveCostModel (HiveCostModel.java:getJoinCost(69)) - MapJoin cost: {8.251089518344444E10 rows, 2.324083308641975E8 cpu, 275417.5666666666 io}