  1. Tajo (Retired)
  2. TAJO-1959 Improve AWS S3 file system support
  3. TAJO-2030

List S3 files using AmazonS3Client instead of S3A


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: S3
    • Labels: None

    Description

      AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns every object whose key starts with that prefix, in blocks of up to 1000 objects.
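
      To make the idea concrete, here is a minimal sketch of the two steps described above: computing the common prefix of the input paths, then listing everything under that prefix in fixed-size blocks. It does not call S3. An in-memory sorted set stands in for the bucket, and the class name, method names, and example paths are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class BulkListingSketch {

    // Longest string that every input path starts with.
    static String commonPrefix(List<String> paths) {
        if (paths.isEmpty()) {
            return "";
        }
        String prefix = paths.get(0);
        for (String path : paths) {
            int i = 0;
            int max = Math.min(prefix.length(), path.length());
            while (i < max && prefix.charAt(i) == path.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    // Simulated bulk listing: returns the keys under `prefix`, in key order,
    // split into pages of at most `pageSize` keys.
    static List<List<String>> listPages(TreeSet<String> bucket, String prefix, int pageSize) {
        List<List<String>> pages = new ArrayList<>();
        List<String> page = new ArrayList<>();
        // Start at the first key >= prefix; matching keys form one contiguous range.
        for (String key : bucket.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // keys are sorted, so no later key can match
            }
            page.add(key);
            if (page.size() == pageSize) {
                pages.add(page);
                page = new ArrayList<>();
            }
        }
        if (!page.isEmpty()) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Hypothetical partition paths selected by a query filter.
        List<String> inputPaths = Arrays.asList(
            "lineitem/l_shipdate=1992-01-02/",
            "lineitem/l_shipdate=1992-01-15/");
        String prefix = commonPrefix(inputPaths);

        // Hypothetical bucket contents.
        TreeSet<String> bucket = new TreeSet<>(Arrays.asList(
            "lineitem/l_shipdate=1992-01-02/part-0",
            "lineitem/l_shipdate=1992-01-15/part-0",
            "lineitem/l_shipdate=1992-02-01/part-0",
            "orders/part-0"));

        // A page size of 1000 mirrors the block size mentioned in the description.
        for (List<String> page : listPages(bucket, prefix, 1000)) {
            System.out.println(page);
        }
    }
}
```

      Because the listing returns everything under the common prefix, the caller still has to drop keys that belong to partitions the query did not select. With the AWS SDK for Java (v1), the equivalent real calls would be along the lines of listObjects with a ListObjectsRequest carrying the prefix, followed by listNextBatchOfObjects while the result is truncated.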

      Using AmazonS3Client to list S3 files, instead of S3A, should improve performance. To test this idea, I swapped S3AFileSystem for PrestoFileSystem. During partition pruning, PrestoFileSystem was much faster than S3AFileSystem.

      Here are my benchmark results for the following queries:

      1 partition : select count(*) from lineitem where l_shipdate = '1992-01-02';
      30 partitions: select count(*) from lineitem  where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
      90 partitions: select count(*) from lineitem  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
      151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
      
      Partitions   PrestoFileSystem (ms)   S3AFileSystem (ms)
      1            677                     800
      30           2753                    6977
      90           6825                    13772
      151          13834                   25701
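
      As a quick reading aid, the following sketch computes the ratio of S3AFileSystem time to PrestoFileSystem time for each row of the table above. The class and method names are hypothetical. The numbers are copied from the table.

```java
import java.util.Locale;

public class ListingSpeedup {

    // Ratio of S3AFileSystem time to PrestoFileSystem time.
    // A value above 1 means PrestoFileSystem was faster.
    static double speedup(long prestoMs, long s3aMs) {
        return (double) s3aMs / prestoMs;
    }

    public static void main(String[] args) {
        // Each row: {partitions, PrestoFileSystem ms, S3AFileSystem ms}.
        long[][] rows = {
            {1, 677, 800},
            {30, 2753, 6977},
            {90, 6825, 13772},
            {151, 13834, 25701},
        };
        for (long[] row : rows) {
            System.out.println(String.format(
                Locale.ROOT, "%d partitions: %.2fx", row[0], speedup(row[1], row[2])));
        }
    }
}
```

      The ratio is about 1.2x for a single partition and roughly 1.9x to 2.5x for 30 to 151 partitions.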

      For reference, I used the TPC-H 1 GB dataset and set the l_shipdate column of the lineitem table as the partition column.

      I think there are two ways to resolve this:

      • Borrow PrestoFileSystem and the related code from Presto
      • Implement the necessary code in S3TableSpace, using Presto as a reference

          People

            Assignee: JaeHwa Jung (blrunner)
            Reporter: JaeHwa Jung (blrunner)
            Votes: 0
            Watchers: 3