Currently, split computation for a partitioned table proceeds as follows:
- List all partition directories of the specified partitioned table
- List all files in each partition directory
For example, assume a table with 1000 partitions, each containing 10 files. In that case the AWS S3 list API is called 1000 times (once per partition), which becomes a huge bottleneck.
To improve the current computation, we should use S3 ListObjects (prefix listing) and implement the following algorithm to efficiently list multiple input locations:
- Given a list of S3 locations, apply prefix listing to their common prefix, retrieving the metadata of up to 1000 objects per API call.
- While applying prefix listing in the above step, skip objects that do not fall under the input list of S3 locations, to avoid listing a large number of irrelevant objects in pathological cases.
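The two steps above can be sketched roughly as follows. This is a minimal simulation, not the proposed implementation: the bucket contents are an in-memory key list standing in for S3 ListObjects responses, and all paths and function names are illustrative.

```python
import os


def bulk_list(all_keys, locations, page_size=1000):
    """Simulated bulk prefix listing.

    `all_keys` stands in for the bucket contents; a real implementation
    would issue S3 ListObjectsV2 calls instead. Each page of up to
    `page_size` keys counts as one API call, and keys that do not fall
    under any requested location are filtered out.
    """
    # Step 1: list under the longest common prefix of all input locations,
    # so one call can cover objects from many partitions at once.
    prefix = os.path.commonprefix(locations)
    candidates = sorted(k for k in all_keys if k.startswith(prefix))

    matched, calls = [], 0
    for start in range(0, len(candidates), page_size):
        page = candidates[start:start + page_size]
        calls += 1  # one ListObjects call per page of up to 1000 keys
        # Step 2: skip objects outside the requested locations.
        matched.extend(k for k in page
                       if any(k.startswith(loc) for loc in locations))
    return matched, calls


# Illustrative layout: 3 partitions x 2 files, plus an irrelevant sibling table.
keys = [f"warehouse/t/part={p}/file{i}" for p in range(3) for i in range(2)]
keys += ["warehouse/t_backup/file0"]
locs = [f"warehouse/t/part={p}/" for p in range(3)]
files, calls = bulk_list(keys, locs)
# All 6 partition files are found with a single simulated list call,
# versus one call per partition in the current scheme.
```

In a real implementation the per-partition filtering would also use the listing's start-after/marker mechanism to skip over large irrelevant key ranges rather than paging through them.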
This approach is inspired by Qubole's blog post: https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/.