[DRILL-6118] Handle item star columns during project / filter push down and directory pruning - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12.0
Fix Version/s: 1.13.0
Component/s: None
Labels:
- doc-impacting
- ready-to-commit

Description

Project push down, filter push down and partition pruning does not work with dynamically expanded column with is represented as star in ITEM operator: ITEM($0, 'column_name') where $0 is a star.
This often occurs when view, sub-select or cte with star is issued.
To solve this issue we can create DrillFilterItemStarReWriterRule which will rewrite such ITEM operator before filter push down and directory pruning. For project into scan push down logic will be handled separately in already existing rule DrillPushProjectIntoScanRule. Basically, we can consider the following queries the same:
select col1 from t
select col1 from (select * from t)

Use cases
Since item star columns where not considered during project / filter push down and directory pruning, push down and pruning did not happen. This was causing Drill to read all columns from file (when only several are needed) or ready all files instead. Views with star query is the most common example. Such behavior significantly degrades performance for item star queries comparing to queries without item star.

EXAMPLES

Data set
will create table with three files each in dedicated sub-folder:

use dfs.tmp;
create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and date '1992-01-03';
create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and date '1992-01-06';
create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and date '1992-01-09';

Filter push down
select * from order_ctas where o_orderdate = date '1992-01-01' will read only one file

00-00    Screen
00-01      Project(**=[$0])
00-02        Project(T1¦¦**=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[=($1, 1992-01-01)])
00-05              Project(T1¦¦**=[$0], o_orderdate=[$1])
00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])

select * from (select * from order_ctas) where o_orderdate = date '1992-01-01' will ready all three files

00-00    Screen
00-01      Project(**=[$0])
00-02        SelectionVectorRemover
00-03          Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])

Directory pruning
select * from order_ctas where dir0 = 't1' will read data only from one folder

00-00    Screen
00-01      Project(**=[$0])
00-02        Project(**=[$0])
00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])

select * from (select * from order_ctas) where dir0 = 't1' will read content of all three folders

00-00    Screen
00-01      Project(**=[$0])
00-02        SelectionVectorRemover
00-03          Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])

Project into Scan push down
select o_orderdate, count(1) from order_ctas group by o_orderdate will ready only one column from the files

00-00    Screen
00-01      Project(o_orderdate=[$0], EXPR$1=[$1])
00-02        HashAgg(group=[{0}], EXPR$1=[COUNT()])
00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`o_orderdate`]]])

select o_orderdate, count(1) from (select * from order_ctas) group by o_orderdate will ready all columns from the files

00-00    Screen
00-01      Project(col_vrchr=[$0], EXPR$1=[$1])
00-02        StreamAgg(group=[{0}], EXPR$1=[COUNT()])
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            Project(col_vrchr=[ITEM($0, 'o_orderdate')])
00-05         Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])

This Jira aims to fix all three described cases above in order to improve performance for queries with item star columns.

Attachments

Issue Links

is related to

DRILL-6199 Filter push down doesn't work with more than one nested subqueries

Closed

relates to

DRILL-6219 Filter pushdown doesn't work with star operator if there is a subquery with it's own filter

Open

links to

GitHub Pull Request #1104

Activity

People

Assignee:: Arina Ielchiieva

Reporter:: Arina Ielchiieva

Reviewer:: Chunhui Shi

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 30/Jan/18 10:53

Updated:: 16/Apr/18 21:16

Resolved:: 25/Feb/18 06:03