  Tajo > TAJO-1959 Improve AWS S3 file system support > TAJO-2111

Optimize Partition Table Split Computation for Amazon S3


    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: S3, Storage
    • Labels: None

      Description

      Currently, split computation for a partitioned table proceeds as follows:

      • Listing all partition directories of the specified partitioned table
      • Listing all files in each partition directory

      For example, assume a table with 1,000 partitions, each containing 10 files. In that case, the AWS S3 API will be called 1,000 times, which becomes a huge bottleneck.
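      The call-count difference can be sketched with a small calculation. The class and method names below are illustrative, not Tajo APIs; the only S3 fact assumed is that a single listing call returns at most 1,000 object summaries per page.

```java
// Illustrative arithmetic only: compares one LIST call per partition
// directory against bulk prefix listing over the whole table.
public class S3ListingCost {

    // Current scheme: one listing call per partition directory.
    static int callsPerPartitionListing(int partitions) {
        return partitions;
    }

    // Bulk prefix listing: each call returns up to pageSize object
    // summaries (S3 caps a listing page at 1000 keys), so the number
    // of calls is the ceiling of totalObjects / pageSize.
    static int callsWithPrefixListing(int partitions, int filesPerPartition, int pageSize) {
        int totalObjects = partitions * filesPerPartition;
        return (totalObjects + pageSize - 1) / pageSize; // ceiling division
    }
}
```

      For the example above (1,000 partitions x 10 files, 1,000 keys per page), this works out to 1,000 calls today versus 10 calls with bulk prefix listing.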

      To improve the current computation, we should use S3's listObjects API and implement the following algorithm to efficiently list multiple input locations:

      • Given a list of S3 locations, apply prefix listing to their common prefix to get the metadata of up to 1,000 objects at a time.
      • While applying prefix listing in the above step, skip objects that do not fall under the input list of S3 locations, to avoid listing a large number of irrelevant objects in pathological cases.
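      The two steps above can be sketched in plain Java. This is a minimal illustration, not the actual implementation: the class and method names are hypothetical, and the listed keys are passed in as a plain list, whereas the real code would obtain them from paginated S3 listObjects calls issued with the common prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bulk prefix listing for multiple partition locations:
// list once under the common prefix, then keep only the keys that
// belong to one of the requested partition paths.
public class BulkPrefixListing {

    // Step 1 helper: the longest common prefix of the requested
    // partition locations, used as the prefix of a single listing request.
    static String commonPrefix(List<String> locations) {
        String prefix = locations.get(0);
        for (String loc : locations) {
            // Shrink the candidate prefix until it matches this location too.
            while (!loc.startsWith(prefix)) {
                prefix = prefix.substring(0, prefix.length() - 1);
            }
        }
        return prefix;
    }

    // Step 2: skip objects outside the requested locations, so that a
    // pathological layout under the common prefix does not flood the
    // result with irrelevant keys.
    static List<String> filterKeys(List<String> listedKeys, List<String> locations) {
        List<String> relevant = new ArrayList<>();
        for (String key : listedKeys) {
            for (String loc : locations) {
                if (key.startsWith(loc)) {
                    relevant.add(key);
                    break;
                }
            }
        }
        return relevant;
    }
}
```

      For a table with many partitions, sorting the locations and using binary search in the filter step would avoid the inner linear scan; the nested loop here is kept only for clarity.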

      Honestly, I was inspired by Qubole's blog post: https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/.


              People

              • Assignee: blrunner JaeHwa Jung
              • Reporter: blrunner JaeHwa Jung
              • Votes: 0
              • Watchers: 2
