Tajo (Retired) > TAJO-1959 Improve AWS S3 file system support > TAJO-2111

Optimize Partition Table Split Computation for Amazon S3


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: S3, Storage
    • Labels: None

    Description

      Currently, split computation for a partitioned table proceeds as follows:

      • Listing all partition directories of the specified partitioned table
      • Listing all files in each partition directory

      For example, assume a table with 1,000 partitions, each containing 10 files. In that case, the AWS S3 API will be called 1,000 times (once per partition directory), which becomes a huge bottleneck. A sketch of this per-directory listing follows.
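
      The following is a rough sketch of the current per-directory behavior, assuming Hadoop's FileSystem API (which Tajo uses for S3 access); the class and method names are illustrative, not Tajo's actual code:

{code:java}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class NaivePartitionListing {
  /**
   * Lists all files by calling listStatus once per partition directory.
   * Over S3, every call maps to a separate LIST request, so a table with
   * 1,000 partitions issues 1,000 S3 API calls.
   */
  public static List<FileStatus> listAllFiles(FileSystem fs, List<Path> partitionDirs)
      throws IOException {
    List<FileStatus> files = new ArrayList<>();
    for (Path dir : partitionDirs) {
      for (FileStatus status : fs.listStatus(dir)) { // one S3 LIST per directory
        if (!status.isDirectory()) {
          files.add(status);
        }
      }
    }
    return files;
  }
}
{code}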

      To improve the current computation, we should use S3's listObjects API and implement the following algorithm to efficiently list multiple input locations (see the sketch after this list):

      • Given a list of S3 locations, apply prefix listing to their common prefix to fetch the metadata of up to 1,000 objects per request.
      • While applying prefix listing in the above step, skip objects that do not fall under the input list of S3 locations, to avoid listing a large number of irrelevant objects in pathological cases.
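
      Here is a minimal sketch of that bulk listing, assuming the AWS SDK for Java v1 (AmazonS3#listObjects); the bucket, prefixes, and helper method are hypothetical, for illustration only:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BulkPartitionListing {
  /**
   * Lists the objects of many partition directories with one paginated
   * prefix listing over their common prefix, instead of one listObjects
   * call per directory.
   */
  public static List<S3ObjectSummary> listPartitions(AmazonS3 s3, String bucket,
      String commonPrefix, Set<String> partitionPrefixes) {
    List<S3ObjectSummary> matched = new ArrayList<>();
    ListObjectsRequest request = new ListObjectsRequest()
        .withBucketName(bucket)
        .withPrefix(commonPrefix) // e.g. "warehouse/t1/" covers every partition
        .withMaxKeys(1000);       // S3 returns at most 1,000 keys per response

    ObjectListing listing = s3.listObjects(request);
    while (true) {
      for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        // Skip keys outside the requested partition directories to avoid
        // collecting irrelevant objects that share the common prefix.
        if (belongsToPartition(summary.getKey(), partitionPrefixes)) {
          matched.add(summary);
        }
      }
      if (!listing.isTruncated()) {
        break;
      }
      listing = s3.listNextBatchOfObjects(listing); // continue from the last key
    }
    return matched;
  }

  private static boolean belongsToPartition(String key, Set<String> partitionPrefixes) {
    for (String prefix : partitionPrefixes) {
      if (key.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}
{code}

      With this approach, the example table above (1,000 partitions × 10 files = 10,000 objects) needs only about 10 paginated listObjects calls instead of 1,000.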

      Honestly, I was inspired by Qubole's blog post: https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/


            People

              Assignee: blrunner JaeHwa Jung
              Reporter: blrunner JaeHwa Jung
              Votes: 0
              Watchers: 2
