[DRILL-5991] Performance improvements for Hive tables with skip header / footer logic - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.12.0
Fix Version/s: None
Component/s: Storage - Hive
Labels: None

Description

Currently, when a Hive table has a header or footer, all input splits of a file are processed by a single reader. This hurts performance. A better approach would be to keep one reader per input split and find a way to tell each reader how many rows it should skip.

To create a reader for each input split while preserving the skip header / footer functionality, we need to know how many rows each input split contains. Unfortunately, an input split does not hold that information, only the number of bytes. Nor can we simply apply skip header to the first input split and skip footer to the last one, because we don't know how many rows each split contains: we may need to skip the whole first input split and part of the second. In addition, we use the Hadoop reader for the data, which gives us no information about the number of rows in an input split.

Possible improvements:
1. For tables with a header only: before creating readers, skip the header rows first and, once done, create a reader at that position. For the remaining untouched input splits, create separate readers, although all readers will be on the same node.
2. Consider using the Drill text reader instead of the Hadoop one (as we do for parquet files), which might provide more flexibility in terms of byte offsets, etc. This should be investigated further.
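
The first option could be sketched roughly as follows. This is only an illustration under stated assumptions, not Drill or Hadoop code: `Split`, `headerEndOffset` and `skipHeader` are hypothetical names, the file is held in memory as a byte array, and lines are assumed to end with `\n`. It scans for the byte offset just past the header, drops splits that lie entirely inside the header, and trims the split in which the header ends.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderSkipSplits {

  /** Hypothetical stand-in for an input split: the byte range [start, start + length). */
  public static final class Split {
    public final long start;
    public final long length;

    public Split(long start, long length) {
      this.start = start;
      this.length = length;
    }

    public long end() {
      return start + length;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end() + ")";
    }
  }

  /**
   * Returns the byte offset just past the first headerLines lines,
   * or data.length if the file has fewer lines than that.
   * Assumes '\n' line terminators.
   */
  public static long headerEndOffset(byte[] data, int headerLines) {
    if (headerLines <= 0) {
      return 0;
    }
    int seen = 0;
    for (int i = 0; i < data.length; i++) {
      if (data[i] == '\n' && ++seen == headerLines) {
        return i + 1;
      }
    }
    return data.length;
  }

  /** Drops splits that are entirely header and trims the split in which the header ends. */
  public static List<Split> skipHeader(List<Split> splits, long headerEnd) {
    List<Split> result = new ArrayList<>();
    for (Split s : splits) {
      if (s.end() <= headerEnd) {
        continue; // the whole split is header rows
      }
      long start = Math.max(s.start, headerEnd);
      result.add(new Split(start, s.end() - start));
    }
    return result;
  }

  public static void main(String[] args) {
    // 3 header lines (9 bytes) followed by 3 data rows (15 bytes).
    byte[] data = "h1\nh2\nh3\nrow1\nrow2\nrow3\n".getBytes(StandardCharsets.US_ASCII);
    // Three 8-byte splits: the header covers all of the first and one byte of the second.
    List<Split> splits = Arrays.asList(new Split(0, 8), new Split(8, 8), new Split(16, 8));
    long headerEnd = headerEndOffset(data, 3);
    System.out.println(skipHeader(splits, headerEnd)); // prints [[9, 16), [16, 24)]
  }
}
```

The example deliberately makes the header span the whole first split and part of the second, which is the case described above. Note that Hadoop's `LineRecordReader` skips the first line of any split whose start offset is non-zero, so a real implementation would have to account for that when positioning the trimmed reader. Footer skipping is not covered by this sketch.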

Issue Links

is related to: DRILL-5941 Skip header / footer logic works incorrectly for Hive tables when file has several input splits (Closed)

People

Assignee: Unassigned

Reporter: Arina Ielchiieva

Votes: 0

Watchers: 1

Dates

Created: 24/Nov/17 10:29

Updated: 24/Nov/17 10:30