Currently, after storing the data of a partitioned table, Tajo calculates the volume of the table by listing its partition directories. To list the directories, Tajo uses FileSystem::getContentSummary from the HDFS generic API.
For small to medium-size sets of partition directories, this should not be a problem. For large ones, it is. For example, three years of data organized into hourly directories results in 26,280 directories (3 × 365 × 24). If each directory contains 5 files, that makes a grand total of 131,400 files. This seems to be a moderate load on HDFS, but it might result in very poor performance on S3. Thus we need to avoid listing partition directories.
I think we can get the volume of each partition directory in PhysicalOperator. If every task records the volume of the partitions it writes, Query doesn't need to list partition directories using the HDFS API.
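
The idea above could be sketched roughly as follows. This is only an illustration of the aggregation step, not Tajo code: the class `PartitionVolumeAggregator` and its methods are hypothetical names, and it assumes each task can hand back a map from partition path to the number of bytes it wrote there.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not an existing Tajo class): merges the partition
// volumes reported by tasks so that no directory listing is needed.
final class PartitionVolumeAggregator {
    // Partition path -> total bytes written to it by all tasks so far.
    private final Map<String, Long> bytesByPartition = new HashMap<>();

    // Called once per finished task with the bytes it wrote per partition.
    // Synchronized because task reports may arrive from several threads.
    synchronized void addTaskReport(Map<String, Long> taskReport) {
        for (Map.Entry<String, Long> entry : taskReport.entrySet()) {
            bytesByPartition.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
    }

    // Volume of a single partition; 0 if no task wrote to it.
    synchronized long partitionBytes(String partition) {
        return bytesByPartition.getOrDefault(partition, 0L);
    }

    // Volume of the whole table, i.e. the sum over all partitions.
    synchronized long totalBytes() {
        long total = 0L;
        for (long bytes : bytesByPartition.values()) {
            total += bytes;
        }
        return total;
    }
}
```

With something like this, Query would call `addTaskReport` as each task completes and read `totalBytes()` at the end, instead of calling getContentSummary on every partition directory.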