IMPALA-3482

S3: Consider bulk listing of files in the catalog vs individually accessing them

    Details

      Description

      The following query creates 2.4K partitions when using the tpch_300_parquet dataset:

      insert into table tmps3db.orders_part partition(o_orderdate) select
      O_ORDERKEY,
      O_CUSTKEY,
      O_ORDERSTATUS,
      O_TOTALPRICE,
      O_ORDERPRIORITY,
      O_CLERK,
      O_SHIPPRIORITY,
      O_COMMENT,
      O_ORDERDATE
      from
      tpch_300_parquet.orders

      If we skip the staging step (see IMPALA-3452), the INSERTs themselves complete in less than 2 minutes. However, ~21 minutes are then spent in the catalog after the INSERT, updating the partition information. This is because the catalog makes multiple individual requests to S3 for every file in every partition. S3 has protection mechanisms that detect a high rate of individual requests from a single IP and slow them down:
      (http://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html)

      We should consider listing the files in the parent directory of the partitions in batches of 1000 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we choose 1000). This would minimize the number of requests to S3, give steadier latency, and make better use of bandwidth.
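The proposal above could be sketched roughly as follows. This is a minimal illustration in Python, not the catalog's actual code: `list_page`, `list_partition_files`, and `FakeStore` are hypothetical names, and the in-memory store only imitates a paginated list call that returns at most 1000 keys per request. The idea is to list once under the table's parent prefix and group the returned keys by partition directory, instead of issuing requests per file.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

PAGE_SIZE = 1000  # assumed per-request cap on keys returned by a list call

# Hypothetical callable: (prefix, continuation_token, max_keys) -> (keys, next_token)
ListPage = Callable[[str, Optional[str], int], Tuple[List[str], Optional[str]]]


def list_partition_files(list_page: ListPage, table_prefix: str) -> Dict[str, List[str]]:
    """Group every key under table_prefix by its partition directory."""
    files_by_partition: Dict[str, List[str]] = defaultdict(list)
    token: Optional[str] = None
    while True:
        keys, token = list_page(table_prefix, token, PAGE_SIZE)
        for key in keys:
            partition_dir, _, file_name = key.rpartition("/")
            if file_name:  # skip directory-marker keys that end in "/"
                files_by_partition[partition_dir].append(file_name)
        if token is None:  # no more pages
            break
    return dict(files_by_partition)


class FakeStore:
    """In-memory stand-in for a bucket (illustration only); counts list requests."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.requests = 0

    def list_page(self, prefix, token, max_keys):
        self.requests += 1
        matching = [k for k in self.keys if k.startswith(prefix)]
        start = int(token) if token is not None else 0
        page = matching[start:start + max_keys]
        end = start + len(page)
        next_token = str(end) if end < len(matching) else None
        return page, next_token
```

Under the assumption of 2400 partitions with one file each (the partition and file names here are made up), passing `FakeStore.list_page` to `list_partition_files` issues 3 list requests in total, since 2400 keys fit in three pages of 1000.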

        Attachments

        1. invalidate_cs_3.jfr
          1.90 MB
          Mostafa Mokhtar

              People

              • Assignee: Sailesh Mukil
              • Reporter: Sailesh Mukil
