Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 0.8.0
Fix Version/s: None
Component/s: None
Description
#Fri Jan 09 20:39:31 EST 2015
git.commit.id.abbrev=487d98e
With hash join enabled (the default), the following join query hangs (it has been running for about 30 minutes now). The join column contains mostly duplicate values. Each table has 1 million rows. The data can be downloaded here:
https://s3.amazonaws.com/apache-drill/files/complex.json.gz
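To see how skewed the join key actually is, a quick check like the following can be run first (a sketch, not part of the original repro; it uses the same `complex.json` file and `gbyi` column as the query below):

-- Sketch: show the most frequent values of the join key `gbyi`
-- in complex.json, i.e. how many duplicates each key has.
select gbyi, count(*) as cnt
from `complex.json`
group by gbyi
order by cnt desc
limit 10;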
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = false;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | planner.enable_mergejoin updated. |
+------------+------------+
1 row selected (0.025 seconds)
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = true;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | planner.enable_hashjoin updated. |
+------------+------------+
1 row selected (0.045 seconds)
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select a.id, b.gbyi, a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
+------------+------------+------------+
|     id     |    gbyi    |    str     |
+------------+------------+------------+
Physical plan:
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, b.gbyi, a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(id=[$0], gbyi=[$1], str=[$2])
00-02        SelectionVectorRemover
00-03          Limit(fetch=[20])
00-04            SingleMergeExchange(sort0=[0 ASC])
01-01              SelectionVectorRemover
01-02                TopN(limit=[20])
01-03                  HashToRandomExchange(dist0=[[$0]])
02-01                    Project(id=[$1], gbyi=[$3], str=[$2])
02-02                      HashJoin(condition=[=($0, $3)], joinType=[inner])
02-04                        HashToRandomExchange(dist0=[[$0]])
03-01                          Project(gbyi=[$0], id=[$2], str=[$1])
03-02                            Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`, `id`, `str`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
02-03                        Project(gbyi0=[$0])
02-05                          HashToRandomExchange(dist0=[[$0]])
04-01                            Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
If I turn merge join on instead, the query finishes quickly, within about a minute.
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = false;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | planner.enable_hashjoin updated. |
+------------+------------+
1 row selected (0.026 seconds)
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = true;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | planner.enable_mergejoin updated. |
+------------+------------+
1 row selected (0.024 seconds)
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, b.gbyi, a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(id=[$0], gbyi=[$1], str=[$2])
00-02        SelectionVectorRemover
00-03          Limit(fetch=[20])
00-04            SingleMergeExchange(sort0=[0 ASC])
01-01              SelectionVectorRemover
01-02                TopN(limit=[20])
01-03                  HashToRandomExchange(dist0=[[$0]])
02-01                    Project(id=[$1], gbyi=[$3], str=[$2])
02-02                      MergeJoin(condition=[=($0, $3)], joinType=[inner])
02-04                        SelectionVectorRemover
02-06                          Sort(sort0=[$0], dir0=[ASC])
02-08                            HashToRandomExchange(dist0=[[$0]])
03-01                              Project(gbyi=[$0], id=[$2], str=[$1])
03-02                                Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`, `id`, `str`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
02-03                        Project(gbyi0=[$0])
02-05                          SelectionVectorRemover
02-07                            Sort(sort0=[$0], dir0=[ASC])
02-09                              HashToRandomExchange(dist0=[[$0]])
04-01                                Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
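One plausible reading, consistent with the Not A Problem resolution: with mostly duplicate join keys, inner join output grows quadratically with per-key multiplicity. A key appearing k times on each side contributes k*k output rows on its own, so a 1M x 1M self-join on a low-cardinality key can produce billions of rows, making the hash join look hung while it is actually streaming an enormous intermediate result. The exact self-join row count can be computed from per-key counts without running the join at all (a sketch, using the same `complex.json` file as above):

-- Sketch: compute the inner self-join cardinality on `gbyi`
-- from per-key counts, without executing the join itself.
select sum(t.cnt * t.cnt) as join_rows
from (
  select gbyi, count(*) as cnt
  from `complex.json`
  group by gbyi
) t;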