The following query creates 2.4K partitions when using the tpch_300_parquet dataset:
insert into table tmps3db.orders_part partition(o_orderdate) select
If we skip the staging step (see IMPALA-3452), the INSERTs themselves complete in less than 2 minutes. However, ~21 minutes are then spent in the catalog after the INSERT, updating all the partition information. This is because the catalog makes multiple individual requests to S3 per file per partition. S3 employs protection mechanisms that detect and throttle a client when many individual requests come from a single IP:
We should consider listing the files in the parent directory of the partitions in batches of 1000 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we choose 1000). This would minimize the number of requests to S3, give us steadier latency, and make better use of bandwidth.
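To illustrate the proposal, here is a minimal sketch of listing the parent directory in pages of 1000 keys and grouping the results by partition. It is not the catalog's code: `FakeBucket`, `list_page`, `list_all`, `files_by_partition`, and the key layout are all hypothetical, and the bucket is an in-memory stand-in that only mimics marker-based paging so that requests can be counted.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Tuple

MAX_KEYS = 1000  # page size assumed from the S3 bucket listing limit


class FakeBucket:
    """Hypothetical in-memory stand-in for an S3 bucket; counts list requests."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = sorted(keys)
        self.list_requests = 0

    def list_page(
        self, prefix: str, marker: Optional[str]
    ) -> Tuple[List[str], Optional[str]]:
        """Return up to MAX_KEYS keys under prefix that sort after marker."""
        self.list_requests += 1
        matching = [
            k for k in self.keys
            if k.startswith(prefix) and (marker is None or k > marker)
        ]
        page = matching[:MAX_KEYS]
        # A next marker is returned only when more keys remain after this page.
        next_marker = page[-1] if len(matching) > MAX_KEYS else None
        return page, next_marker


def list_all(bucket: FakeBucket, prefix: str) -> Iterator[str]:
    """Yield every key under prefix, one page (one request) at a time."""
    marker = None
    while True:
        page, marker = bucket.list_page(prefix, marker)
        yield from page
        if marker is None:
            break


def files_by_partition(bucket: FakeBucket, table_prefix: str) -> Dict[str, List[str]]:
    """Group file names by partition directory using a single paged listing."""
    partitions: Dict[str, List[str]] = defaultdict(list)
    for key in list_all(bucket, table_prefix):
        relative = key[len(table_prefix):]
        partition_dir, sep, file_name = relative.rpartition("/")
        if sep and file_name:  # skip keys that are not files inside a partition
            partitions[partition_dir].append(file_name)
    return dict(partitions)
```

With an assumed layout of 2,400 partitions holding two files each (4,800 keys), this sketch issues 5 list requests for the whole table, whereas listing each partition separately would need at least 2,400.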