Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Description
Currently, after storing the data of a partitioned table, Tajo calculates the volume of the table by listing its partition directories. To list the directories, Tajo uses FileSystem::getContentSummary from the HDFS generic API.
With a small to medium number of partition directories, this is not a problem. With a large number of partition directories, it becomes one. For example, three years of data organized into hourly directories results in 26,280 directories (3 x 365 x 24). If each directory contains 5 files, this makes a grand total of 131,400 files. That is a moderate load on HDFS, but it might result in very poor performance on S3. Thus, we need to avoid listing partition directories.
I think we can get the volume of each partition directory in the PhysicalOperator. If every task sets the volume of the partitions it writes, Query doesn't need to list partition directories using the HDFS API.
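The idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tajo code: the class PartitionVolumeSketch, the TaskPartitionStats accumulator, and the methods addBytes, mergeTaskStats, and totalVolume are all hypothetical names. It assumes each task can count the bytes it writes per partition path, so the query side only merges the reported numbers and never lists directories.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionVolumeSketch {

    /** Hypothetical per-task accumulator; NOT an actual Tajo class. */
    public static final class TaskPartitionStats {
        private final Map<String, Long> bytesByPartition = new TreeMap<>();

        /** Called by the (assumed) writer each time it flushes bytes to a partition. */
        public void addBytes(String partitionPath, long bytes) {
            bytesByPartition.merge(partitionPath, bytes, Long::sum);
        }

        /** Returns a copy of what this task reports back to the query. */
        public Map<String, Long> snapshot() {
            return new TreeMap<>(bytesByPartition);
        }
    }

    /** Query side: sum the per-partition volumes reported by all tasks. */
    public static Map<String, Long> mergeTaskStats(List<TaskPartitionStats> tasks) {
        Map<String, Long> merged = new TreeMap<>();
        for (TaskPartitionStats task : tasks) {
            task.snapshot().forEach((path, bytes) -> merged.merge(path, bytes, Long::sum));
        }
        return merged;
    }

    /** Total table volume, derived without any file system listing. */
    public static long totalVolume(Map<String, Long> volumeByPartition) {
        long total = 0L;
        for (long bytes : volumeByPartition.values()) {
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) {
        TaskPartitionStats task1 = new TaskPartitionStats();
        task1.addBytes("dt=2015-01-01/hr=00", 100L);
        task1.addBytes("dt=2015-01-01/hr=01", 40L);

        TaskPartitionStats task2 = new TaskPartitionStats();
        task2.addBytes("dt=2015-01-01/hr=00", 25L);

        Map<String, Long> merged = mergeTaskStats(List.of(task1, task2));
        System.out.println(merged);
        System.out.println("total bytes = " + totalVolume(merged));
    }
}
```

With this shape, the cost of computing the table volume would depend on the number of tasks and the partitions they touched rather than on the number of directories in the store, which is the property that matters for S3.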
Issue Links
- is related to TAJO-2111 Optimize Partition Table Split Computation for Amazon S3 (Open)