Details
- Improvement
- Status: Resolved
- Major
- Resolution: Duplicate
- Impala 2.6.0
Description
The following query creates 2.4K partitions when using the tpch_300_parquet dataset:
insert into table tmps3db.orders_part partition(o_orderdate) select
O_ORDERKEY,
O_CUSTKEY,
O_ORDERSTATUS,
O_TOTALPRICE,
O_ORDERPRIORITY,
O_CLERK,
O_SHIPPRIORITY,
O_COMMENT,
O_ORDERDATE
from
tpch_300_parquet.orders
If we skip the staging step (see IMPALA-3452), the INSERTs themselves complete in less than 2 minutes. However, ~21 minutes are then spent in the catalog after the INSERT, updating all the partition information. This is because the catalog makes multiple individual requests per file per partition to S3. S3 employs protection mechanisms that detect when many individual requests come from a single IP and slow them down:
(http://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html)
We should consider listing the files in the parent directory of the partitions in batches of 1000 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we choose 1000). This minimizes the number of requests to S3, gives steadier latency, and makes better use of bandwidth.
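The batching idea can be sketched roughly as follows. This is only an illustration in Python, not the catalog's actual code: `list_page` is a hypothetical callable standing in for one S3 list request (prefix, marker, max keys), and `make_fake_store` is an in-memory stand-in for a bucket so the sketch runs without S3 access. It assumes one list call returns at most 1000 keys and that partition directories sit directly under the table prefix.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical shape of one list request:
# (prefix, marker, max_keys) -> (keys on this page, next marker or None)
ListPage = Callable[[str, Optional[str], int],
                    Tuple[List[str], Optional[str]]]


def list_files_by_partition(list_page: ListPage, table_prefix: str,
                            page_size: int = 1000
                            ) -> Tuple[Dict[str, List[str]], int]:
    """List everything under the table directory page by page and group
    the file names by partition directory.

    Returns (files per partition, number of list requests issued).
    """
    files: Dict[str, List[str]] = defaultdict(list)
    requests = 0
    marker: Optional[str] = None
    while True:
        keys, marker = list_page(table_prefix, marker, page_size)
        requests += 1
        for key in keys:
            relative = key[len(table_prefix):]
            partition, sep, filename = relative.rpartition("/")
            # Skip keys that are not files inside a partition directory.
            if sep and filename:
                files[partition].append(filename)
        if marker is None:  # no more pages
            break
    return dict(files), requests


def make_fake_store(keys: List[str]) -> ListPage:
    """In-memory stand-in for a bucket (assumption, for illustration only)."""
    ordered = sorted(keys)

    def list_page(prefix: str, marker: Optional[str], max_keys: int):
        matching = [k for k in ordered
                    if k.startswith(prefix) and (marker is None or k > marker)]
        page = matching[:max_keys]
        next_marker = page[-1] if len(matching) > max_keys else None
        return page, next_marker

    return list_page


if __name__ == "__main__":
    # 2400 partitions with one file each, mirroring the ~2.4K partitions above.
    keys = [f"orders_part/o_orderdate={i:04d}/data.parq" for i in range(2400)]
    files, requests = list_files_by_partition(make_fake_store(keys),
                                              "orders_part/")
    print(len(files), "partitions listed in", requests, "requests")
```

With one file per partition, the 2400 keys are covered by three list requests (1000 + 1000 + 400) rather than several requests per file. How this maps onto the catalog's file system calls is left open here.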
Attachments
Issue Links
- is related to: IMPALA-4611 Checking perms on S3 files is a very expensive no-op (Resolved)
- relates to: IMPALA-4172 Switch from using getFileBlockLocations to BlockLocation methods (Potential 50% speedup in metadata loading) (Resolved)