[TAJO-2111] Optimize Partition Table Split Computation for Amazon S3 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: S3, Storage
Labels:
None

Description

Currently, Split computation of partitioned table proceed as follows.

Listing all partition directories of specified partitioned table
Listing all files of each partition directories

For examples, assume a table with 1000 partitions and each partitions includes 10 files. In above case, AWS S3 API will be called 1000 times and it will become a huge bottleneck.

To improve current computation, we have to use S3::listObjects and implement the following algorithm to efficiently list multiple input locations:

Given a list of S3 locations, apply prefix listing to a common prefix to get the metadata of 1000 objects at a time.
While applying prefix listing in the above step, skip those objects that do not fall under the input list of S3 locations to avoid ending up listing large number of irrelevant objects in pathogenic cases.

Honestly, I'm inspired by Qubole's blog post as follows https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/.

Attachments

Issue Links

relates to

TAJO-1974 When calculating partitioned table volume, avoid to list partition directories.

Resolved

Activity

People

Assignee:: JaeHwa Jung

Reporter:: JaeHwa Jung

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 04/Apr/16 16:44

Updated:: 21/Apr/16 01:48