[TAJO-1974] When calculating partitioned table volume, avoid to list partition directories. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Physical Operator, QueryMaster
Labels: None

Description

Currently, after storing the data of a partitioned table, Tajo calculates the table's volume by listing its partition directories. To do so, Tajo uses FileSystem::getContentSummary from Hadoop's generic FileSystem API.

With a small to medium number of partition directories, this is not a problem. With a large number of partition directories, it is. For example, three years of data organized into hourly directories results in 26,280 directories (3 × 365 × 24). If each directory contains 5 files, that makes a grand total of 131,400 files. This is a moderate load for HDFS, but it may result in very poor performance on S3. Thus we need to avoid listing partition directories.
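The counts in this example follow from simple arithmetic, ignoring leap days. A minimal sketch of the calculation, with class and method names that are purely illustrative:

```java
// Worked arithmetic for the hourly-partition example.
// The class and method names are illustrative and are not part of Tajo.
class PartitionCountEstimate {

    // Number of hourly partition directories for a given number of non-leap years.
    static long hourlyDirectories(int years) {
        return (long) years * 365 * 24;
    }

    // Total number of files when every directory holds the same number of files.
    static long totalFiles(long directories, int filesPerDirectory) {
        return directories * filesPerDirectory;
    }

    public static void main(String[] args) {
        long dirs = hourlyDirectories(3);
        long files = totalFiles(dirs, 5);
        // Prints: 26280 directories, 131400 files
        System.out.println(dirs + " directories, " + files + " files");
    }
}
```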

I think we can get the volume of each partition directory in the PhysicalOperator. If every task sets the volume of the partitions it writes, Query doesn't need to list partition directories using the HDFS API.
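A rough sketch of how that aggregation could look, assuming each task reports a map from partition path to bytes written. The class, the method names, and the shape of the report are hypothetical and do not correspond to actual Tajo classes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are Tajo classes or APIs.
class PartitionVolumeSketch {

    // Merge per-task reports (partition path -> bytes written) into per-partition totals.
    // Several tasks may write to the same partition, so their byte counts are summed.
    static Map<String, Long> mergeTaskVolumes(List<Map<String, Long>> taskReports) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> report : taskReports) {
            for (Map.Entry<String, Long> entry : report.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return totals;
    }

    // The table volume is the sum over all partitions; no directory listing is involved.
    static long tableVolume(Map<String, Long> partitionTotals) {
        long sum = 0L;
        for (long bytes : partitionTotals.values()) {
            sum += bytes;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumed partition paths and sizes, for illustration only.
        Map<String, Long> task1 = new HashMap<>();
        task1.put("dt=2015-11-12/hr=00", 100L);
        task1.put("dt=2015-11-12/hr=01", 200L);
        Map<String, Long> task2 = new HashMap<>();
        task2.put("dt=2015-11-12/hr=01", 50L);

        Map<String, Long> totals = mergeTaskVolumes(Arrays.asList(task1, task2));
        System.out.println(tableVolume(totals)); // prints 350
    }
}
```

The work done on the query side then scales with the number of task reports rather than with the number of directories and files in storage, which is the point of the proposal.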

Issue Links

is related to: TAJO-2111 Optimize Partition Table Split Computation for Amazon S3 (Open)

People

Assignee: JaeHwa Jung
Reporter: JaeHwa Jung
Votes: 0
Watchers: 2

Dates

Created: 12/Nov/15 01:52
Updated: 21/Apr/16 01:52
Resolved: 21/Apr/16 01:48