IMPALA-3482

S3: Consider bulk listing of files in the catalog vs individually accessing them

Details

    Description

      The following query creates 2.4K partitions when using the tpch_300_parquet dataset:

      insert into table tmps3db.orders_part partition(o_orderdate) select
      O_ORDERKEY,
      O_CUSTKEY,
      O_ORDERSTATUS,
      O_TOTALPRICE,
      O_ORDERPRIORITY,
      O_CLERK,
      O_SHIPPRIORITY,
      O_COMMENT,
      O_ORDERDATE
      from
      tpch_300_parquet.orders

      If we skip the staging step (see IMPALA-3452), the INSERT itself completes in less than 2 minutes. However, the catalog then spends ~21 minutes after the INSERT updating the partition information. This is because the catalog issues multiple individual requests to S3 per file per partition, and S3 has protection mechanisms that detect and throttle a high rate of individual requests coming from a single IP:
      (http://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html)

      We should consider listing the files under the partitions' parent directory in batches of 1000 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we choose 1000), so that the number of requests to S3 is minimized, latency stays steady, and bandwidth is used more effectively.
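
      As a rough illustration of the proposal, here is a minimal Python sketch that compares the number of requests for per-file access with paged listing of the parent prefix. It is not Impala's catalog code: FakeObjectStore, list_partition_files, the key layout, and the assumption of two files per partition are all hypothetical, and the store is an in-memory stand-in rather than a real S3 client. The page size of 1000 reflects the S3 list operation returning at most 1000 keys per response.

```python
from collections import defaultdict

MAX_KEYS = 1000  # S3 list responses carry at most 1000 keys each


class FakeObjectStore:
    """Hypothetical in-memory stand-in for an S3 bucket that counts requests."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._key_set = set(keys)
        self.requests = 0

    def head(self, key):
        """One request per object, like an individual metadata lookup."""
        self.requests += 1
        return key in self._key_set

    def list_page(self, prefix, start_after="", max_keys=MAX_KEYS):
        """One request returning up to max_keys keys under prefix, in key order."""
        self.requests += 1
        matching = [k for k in self._keys
                    if k.startswith(prefix) and k > start_after]
        return matching[:max_keys], len(matching) > max_keys


def list_partition_files(store, table_prefix):
    """Page through the table prefix and group file names by partition directory."""
    files_by_partition = defaultdict(list)
    start_after = ""
    while True:
        page, truncated = store.list_page(table_prefix, start_after)
        for key in page:
            partition_dir, _, file_name = key.rpartition("/")
            files_by_partition[partition_dir].append(file_name)
        if not truncated:
            break
        start_after = page[-1]  # resume after the last key already seen
    return files_by_partition


# Assumed layout: 2400 partitions with two files each (made-up names).
keys = [f"orders_part/o_orderdate=d{p:04d}/f{i}.parq"
        for p in range(2400) for i in range(2)]

per_file = FakeObjectStore(keys)
for k in keys:
    per_file.head(k)

bulk = FakeObjectStore(keys)
partitions = list_partition_files(bulk, "orders_part/")

print("per-file requests:", per_file.requests)
print("bulk list requests:", bulk.requests)
print("partitions found:", len(partitions))
```

      Under these assumptions the per-file path issues 4,800 requests, while the paged listing needs 5, one per batch of up to 1000 keys. The real savings depend on how many files each partition holds and on how many requests the catalog currently makes per file.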

      Attachments

        1. invalidate_cs_3.jfr
          1.90 MB
          Mostafa Mokhtar

            People

              sailesh Sailesh Mukil
              sailesh Sailesh Mukil
              Votes: 1
              Watchers: 6
