[DRILL-4387] Improve execution side when it handles skipAll query - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.6.0
Component/s: None
Labels: None

Description

~~DRILL-4279~~ changed the planner side and the RecordReader on the execution side to handle skipAll queries. However, other places in the codebase still do not handle skipAll queries efficiently. In particular, in GroupScan or ScanBatchCreator, we replace a NULL or empty column list with the star column. This forces the execution side (RecordReader) to fetch all columns from the data source, which adds significant performance overhead to the SCAN operator.

To improve Drill's performance, we should change those places as well, as follow-up work to ~~DRILL-4279~~.

One simple example of this problem is:

   SELECT DISTINCT substring(dir1, 5) FROM dfs.`/Path/To/ParquetTable`;

The query does not require any regular column from the Parquet file. However, ParquetRowGroupScan and ParquetScanBatchCreator will set the column list to the star column. If the table has dozens or hundreds of columns, this makes the SCAN operator much more expensive than necessary.
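The difference between the two behaviors can be sketched roughly as below. This is an illustration only, not Drill's actual code: the class `SkipAllSketch`, both method names, and the use of plain strings for columns are hypothetical, and it assumes an empty column list is how a skipAll query reaches the scan while NULL means no projection was specified.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from the Drill codebase.
public class SkipAllSketch {
    static final String STAR = "*";

    // Behavior described in this issue: a NULL or empty column list
    // is widened to the star column, so the reader fetches every column.
    static List<String> widenToStar(List<String> columns) {
        if (columns == null || columns.isEmpty()) {
            return Collections.singletonList(STAR);
        }
        return columns;
    }

    // Assumed alternative: only a NULL list is widened to star.
    // An empty list is kept as-is, so a skipAll query stays a skipAll query.
    static List<String> preserveSkipAll(List<String> columns) {
        if (columns == null) {
            return Collections.singletonList(STAR);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> skipAll = new ArrayList<>(); // no regular columns needed
        System.out.println(widenToStar(skipAll));     // prints [*]
        System.out.println(preserveSkipAll(skipAll)); // prints []
    }
}
```

With the first method, the example query above would scan every column of the table; with the second, the reader would still see that no regular column is required.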

Issue Links

Dependent

ZOOKEEPER-704 GSoC 2010: Read-Only Mode

Open

People

Assignee: Jinfeng Ni

Reporter: Jinfeng Ni

Reviewer: Khurram Faraaz

Votes: 0

Watchers: 3

Dates

Created: 16/Feb/16 16:56

Updated: 14/Apr/21 05:57

Resolved: 22/Feb/16 21:02