Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
For an unpartitioned row_number(), a <= predicate on the row number and a LIMIT mean the same thing, so these two queries should lead to equivalent plans:
a:
select * from (select l_orderkey, row_number() OVER (ORDER by l_orderkey) as rnum from tpch_parquet.lineitem) s
where rnum <= 10000;
b:
select * from (select l_orderkey, row_number() OVER (ORDER by l_orderkey) as rnum from
tpch_parquet.lineitem) s
limit 10000;
Currently a. will use a top-n node while b. will use a sort node.
For rnum <= 1000, a. will also use a top-n node.
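The equivalence of the two plan shapes can be sketched outside SQL. This is illustrative Python only, not Impala planner code: a top-n over the sort key returns exactly the same rows as a full sort followed by a rnum <= N filter.

```python
import heapq

def topn(rows, n):
    # What a top-n node computes: only the n smallest sort keys are kept.
    return heapq.nsmallest(n, rows)

def sort_then_filter(rows, n):
    # What a sort node plus a rnum <= n predicate on row_number() computes:
    # sort everything, number the rows, then drop the rows with rnum > n.
    ranked = [(i + 1, r) for i, r in enumerate(sorted(rows))]
    return [r for rnum, r in ranked if rnum <= n]

orderkeys = [5, 3, 9, 1, 7, 2, 8]
print(topn(orderkeys, 3))              # [1, 2, 3]
print(sort_then_filter(orderkeys, 3))  # [1, 2, 3]
```

Both paths produce the same result; the top-n path just avoids materializing and sorting the full input.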
Meanwhile, if there is also a rnum > X clause (essentially an OFFSET), then the LIMIT variant has lower bounds for using top-n:
c:
select * from (select l_orderkey, row_number() OVER (ORDER by l_orderkey) as rnum from tpch_parquet.lineitem) s
where rnum > 900 and rnum <= 1000;
d:
select * from (select l_orderkey, row_number() OVER (ORDER by l_orderkey) as rnum from tpch_parquet.lineitem) s
where rnum > 900 limit 1000;
c. will use a top-n node while d. will use a sort node.
Besides not using the more optimal top-n (for low limits), another problem is that the analytic-eval-node will process all rows, even though every further row will be dropped by the predicate on row_number(). This is problematic because it runs on a single node/thread.
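The missing early-out can be sketched as follows (plain Python, not Impala code): once row_number() exceeds the upper bound of the predicate, no later row can pass it, so the operator could stop instead of numbering every input row.

```python
def rownum_upto(sorted_rows, upper):
    # Emit (row, rnum) pairs, stopping as soon as rnum exceeds the upper
    # bound -- every later row is guaranteed to be dropped anyway.
    # The current analytic-eval-node lacks this early out and keeps going.
    out = []
    for rnum, row in enumerate(sorted_rows, start=1):
        if rnum > upper:
            break
        out.append((row, rnum))
    return out

print(rownum_upto(["a", "b", "c", "d"], 2))  # [('a', 1), ('b', 2)]
```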
A solution could be to recognize <= (and <) and > (and >=) predicates on an unpartitioned row_number() as LIMIT and OFFSET, respectively.
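A minimal sketch of that rewrite rule (a hypothetical helper, not existing planner code): bounds extracted from the row-number predicates map directly to LIMIT and OFFSET values.

```python
def bounds_to_limit_offset(lower=None, upper=None):
    """Map rnum > lower AND rnum <= upper on an unpartitioned
    row_number() to equivalent (limit, offset) values.
    None means the corresponding predicate is absent."""
    offset = lower if lower is not None else 0
    limit = None if upper is None else upper - offset
    return limit, offset

# rnum <= 10000               -> LIMIT 10000
print(bounds_to_limit_offset(upper=10000))            # (10000, 0)
# rnum > 900 and rnum <= 1000 -> LIMIT 100 OFFSET 900
print(bounds_to_limit_offset(lower=900, upper=1000))  # (100, 900)
```

With the bounds normalized this way, the planner could pick top-n using the same thresholds it already applies to explicit LIMIT/OFFSET.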