Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 3.4.0
Fix Version/s: None
Labels: ghx-label-9
Description
In the following query, the left semi join is flipped to a right semi join because of the relative cardinalities of the tables, but the parallelism of the HashJoin fragment (see fragment F01) remains hosts=1, instances=1. The correct behavior is to inherit the parallelism of the new probe input, store_sales, i.e. hosts=3, instances=3, so that the HashJoin is not under-parallelized.
[localhost:21000] default> set explain_level=2;
EXPLAIN_LEVEL set to 2
[localhost:21000] default> use tpcds_parquet;
Query: use tpcds_parquet
[localhost:21000] tpcds_parquet> explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales);
Query: explain select count(*) from store_returns where sr_customer_sk in (select ss_customer_sk from store_sales)

Max Per-Host Resource Reservation: Memory=10.31MB Threads=6
Per-Host Resource Estimates: Memory=85MB
Analyzed query: SELECT count(*) FROM tpcds_parquet.store_returns LEFT SEMI JOIN (SELECT ss_customer_sk FROM tpcds_parquet.store_sales) `$a$1` (`$c$1`) ON sr_customer_sk = `$a$1`.`$c$1`

F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
Per-Host Resources: mem-estimate=10.02MB mem-reservation=0B thread-reservation=1
PLAN-ROOT SINK
|  output exprs: count(*)
|  mem-estimate=0B mem-reservation=0B thread-reservation=0
|
09:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
|  tuple-ids=3 row-size=8B cardinality=1
|  in pipelines: 09(GETNEXT), 04(OPEN)
|
08:EXCHANGE [UNPARTITIONED]
|  mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
|  tuple-ids=3 row-size=8B cardinality=1
|  in pipelines: 04(GETNEXT)
|
F01:PLAN FRAGMENT [HASH(tpcds_parquet.store_sales.ss_customer_sk)] hosts=1 instances=1
Per-Host Resources: mem-estimate=23.88MB mem-reservation=5.81MB thread-reservation=1 runtime-filters-memory=1.00MB
04:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
|  tuple-ids=3 row-size=8B cardinality=1
|  in pipelines: 04(GETNEXT), 06(OPEN)
|
03:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
|  hash predicates: tpcds_parquet.store_sales.ss_customer_sk = sr_customer_sk
|  runtime filters: RF000[bloom] <- sr_customer_sk
|  mem-estimate=2.88MB mem-reservation=2.88MB spill-buffer=128.00KB thread-reservation=0
|  tuple-ids=0 row-size=4B cardinality=287.51K
|  in pipelines: 06(GETNEXT), 00(OPEN)
|
|--07:EXCHANGE [HASH(sr_customer_sk)]
|  |  mem-estimate=1.10MB mem-reservation=0B thread-reservation=0
|  |  tuple-ids=0 row-size=4B cardinality=287.51K
|  |  in pipelines: 00(GETNEXT)
|  |
|  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=2
|  00:SCAN HDFS [tpcds_parquet.store_returns, RANDOM]
|     HDFS partitions=1/1 files=1 size=15.43MB
|     stored statistics:
|       table: rows=287.51K size=15.43MB
|       columns: all
|     extrapolated-rows=disabled max-scan-range-rows=287.51K
|     mem-estimate=24.00MB mem-reservation=1.00MB thread-reservation=1
|     tuple-ids=0 row-size=4B cardinality=287.51K
|     in pipelines: 00(GETNEXT)
|
06:AGGREGATE [FINALIZE]
|  group by: tpcds_parquet.store_sales.ss_customer_sk
|  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0
|  tuple-ids=6 row-size=4B cardinality=90.63K
|  in pipelines: 06(GETNEXT), 01(OPEN)
|
05:EXCHANGE [HASH(tpcds_parquet.store_sales.ss_customer_sk)]
|  mem-estimate=142.01KB mem-reservation=0B thread-reservation=0
|  tuple-ids=6 row-size=4B cardinality=90.63K
|  in pipelines: 01(GETNEXT)
|
F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
Per-Host Resources: mem-estimate=27.00MB mem-reservation=3.50MB thread-reservation=2 runtime-filters-memory=1.00MB
02:AGGREGATE [STREAMING]
|  group by: tpcds_parquet.store_sales.ss_customer_sk
|  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB thread-reservation=0
|  tuple-ids=6 row-size=4B cardinality=90.63K
|  in pipelines: 01(GETNEXT)
|
01:SCAN HDFS [tpcds_parquet.store_sales, RANDOM]
   HDFS partitions=1824/1824 files=1824 size=200.95MB
   runtime filters: RF000[bloom] -> tpcds_parquet.store_sales.ss_customer_sk
   stored statistics:
     table: rows=2.88M size=200.95MB
     partitions: 1824/1824 rows=2.88M
     columns: all
   extrapolated-rows=disabled max-scan-range-rows=130.09K
   mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1
   tuple-ids=1 row-size=4B cardinality=2.88M
   in pipelines: 01(GETNEXT)
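To make the expected behavior concrete, here is a minimal sketch in Java. The types and method below are hypothetical stand-ins, not Impala's actual planner API; they only encode the rule proposed above, namely that after a join inversion the join fragment should derive its degree of parallelism from the new probe input instead of keeping the pre-inversion value.

// Hypothetical, simplified model of the planner decision; not Impala's real classes.
final class FragmentParallelism {
    final int hosts;
    final int instances;

    FragmentParallelism(int hosts, int instances) {
        this.hosts = hosts;
        this.instances = instances;
    }
}

final class JoinInversionSketch {
    // After inverting a join (e.g. LEFT SEMI -> RIGHT SEMI), re-derive the
    // join fragment's parallelism from the new probe side rather than keep
    // the value computed for the old probe side.
    static FragmentParallelism parallelismAfterInversion(
            FragmentParallelism oldProbe, FragmentParallelism newProbe) {
        // Reported bug: the fragment keeps oldProbe (hosts=1, instances=1).
        // Expected: inherit newProbe (hosts=3, instances=3 in the plan above).
        return newProbe;
    }

    public static void main(String[] args) {
        FragmentParallelism storeReturnsSide = new FragmentParallelism(1, 1); // old probe
        FragmentParallelism storeSalesSide = new FragmentParallelism(3, 3);   // new probe
        FragmentParallelism join = parallelismAfterInversion(storeReturnsSide, storeSalesSide);
        System.out.printf("join fragment: hosts=%d instances=%d%n", join.hosts, join.instances);
    }
}

Applied to the plan above, fragment F01 would pick up hosts=3, instances=3 from the store_sales scan fragment F00 instead of keeping hosts=1, instances=1.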
The same behavior is seen for inner joins as well, but to reproduce it I have to comment out the table-ordering step for joins so that the planner creates a sub-optimal initial join order; a sketch of such a query follows.
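For illustration only, an inner-join variant of the repro might look like the query below. This is a hypothetical example against the same tpcds_parquet schema, and as noted it only exercises the inversion if the planner's initial table ordering is forced to be sub-optimal.

-- Hypothetical inner-join repro. With the planner's table-ordering step
-- disabled, store_returns can end up as the initial probe side; the join is
-- then inverted, and the join fragment should inherit the parallelism of the
-- new probe side (store_sales).
explain select count(*)
from store_returns join store_sales
  on sr_customer_sk = ss_customer_sk;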
Issue Links
is related to: IMPALA-5612 Join inversion should avoid reducing the degree of parallelism (Resolved)