Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Description
While working on improving start-up time for queries spanning a large number of partitions, I found a particular inefficiency in CombineHiveInputFormat.
For each input partition we create a CombineFilter to make sure that files are only combined within that particular partition.
CombineHiveInputFormat:
Path filterPath = path;
if (!poolSet.contains(filterPath))
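A simplified standalone sketch of what this per-partition pool creation amounts to (all names here are hypothetical stand-ins; Hadoop's Path is replaced by String and createPool by a set insert):

```java
import java.util.*;

public class PoolCreationSketch {
    // Simplified stand-in for the CombineHiveInputFormat loop above:
    // one pool (CombineFilter) is created per distinct partition path,
    // so files are only ever combined within their own partition.
    static int createPools(List<String> partitionPaths) {
        Set<String> poolSet = new HashSet<>();
        for (String filterPath : partitionPaths) {
            if (!poolSet.contains(filterPath)) {
                // stands in for: createPool(job, new CombineFilter(filterPath))
                poolSet.add(filterPath);
            }
        }
        return poolSet.size();
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList(
            "/warehouse/t/ds=2010-01-01",
            "/warehouse/t/ds=2010-01-02",
            "/warehouse/t/ds=2010-01-03");
        System.out.println(createPools(partitions)); // one pool per partition
    }
}
```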
These filters are then passed to CombineFileInputFormat along with all the input paths.
CombineFileInputFormat computes a list of all the files in the input paths.
It then loops over each filter and checks whether each file matches that filter.
CombineFileInputFormat:
for (MultiPathFilter onepool : pools) {
  ArrayList<Path> myPaths = new ArrayList<Path>();
  // pick one input path. If it matches all the filters in a pool,
  // add it to the output set
  for (int i = 0; i < paths.length; i++) {
    ...
This results in work of order O(p*f), where p is the number of partitions and f is the total number of files across all partitions.
For 10,000 partitions with 10 files in each partition, f is 100,000 and this amounts to 1,000,000,000 iterations.
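To make the cost concrete, here is a small standalone sketch (hypothetical names, no Hadoop dependency) that counts the file-vs-filter checks the nested loop performs:

```java
public class MatchingCostSketch {
    // Counts how many file-vs-filter checks the nested loop above performs:
    // each of p pools scans the full list of f files.
    static long nestedLoopChecks(long partitions, long filesPerPartition) {
        long totalFiles = partitions * filesPerPartition; // f
        long checks = 0;
        for (long pool = 0; pool < partitions; pool++) {  // outer loop over pools
            checks += totalFiles;                         // inner loop over all f files
        }
        return checks;                                    // p * f
    }

    public static void main(String[] args) {
        // 10,000 partitions with 10 files each: f = 100,000 files in total.
        System.out.println(nestedLoopChecks(10_000, 10)); // 1,000,000,000 checks
    }
}
```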
We can replace this code by processing splits for one input path at a time, without passing any filters at all.
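A minimal sketch of this alternative, under the assumption that a file's partition is its parent directory (paths as plain Strings, names hypothetical): group files by partition and combine within each group, so no global filter matching is needed and the cost drops to O(p + f):

```java
import java.util.*;

public class PerPathSplitSketch {
    // Group each file under its parent directory (the partition path).
    // Each group can then be handed to the split combiner on its own,
    // with no cross-partition filter checks at all.
    static Map<String, List<String>> groupByPartition(List<String> files) {
        Map<String, List<String>> byPartition = new LinkedHashMap<>();
        for (String file : files) {
            String dir = file.substring(0, file.lastIndexOf('/'));
            byPartition.computeIfAbsent(dir, k -> new ArrayList<>()).add(file);
        }
        return byPartition; // one entry per partition: O(p + f) overall
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "/warehouse/t/ds=1/000000_0", "/warehouse/t/ds=1/000001_0",
            "/warehouse/t/ds=2/000000_0");
        System.out.println(groupByPartition(files).keySet());
    }
}
```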
Do you happen to see a case where this approach will not work?