[IMPALA-3482] S3: Consider bulk listing of files in the catalog vs individually accessing them - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: Impala 2.6.0
Fix Version/s: Impala 2.9.0
Component/s: Frontend
Labels:

Target Version: Impala 2.12.0

Description

The following query creates 2.4K partitions when using the tpch_300_parquet dataset:

insert into table tmps3db.orders_part partition(o_orderdate) select
O_ORDERKEY,
O_CUSTKEY,
O_ORDERSTATUS,
O_TOTALPRICE,
O_ORDERPRIORITY,
O_CLERK,
O_SHIPPRIORITY,
O_COMMENT,
O_ORDERDATE
from
tpch_300_parquet.orders

If we skip the staging step (see ~~IMPALA-3452~~), the INSERTs themselves complete in less than 2 minutes. However, ~21 minutes are then spent in the catalog after the INSERT, updating the partition information. This is because the catalog makes multiple individual requests to S3 per file per partition. S3 has protection mechanisms that detect and slow down a client when many individual requests come from a single IP (see http://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html).

We should consider listing the files under the parent directory of the partitions in batches of 1000 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we choose 1000). This would minimize the number of requests to S3, give steadier latency, and make better use of bandwidth.
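As a rough illustration of the proposal, the sketch below pages through every key under a table prefix 1000 at a time and groups the results by partition directory, instead of issuing requests per file. It is not the catalog's code: `FakeObjectStore`, `list_page`, and `bulk_list_partitions` are hypothetical names, and the store is an in-memory stand-in that only counts requests so the effect is visible without talking to S3.

```python
from collections import defaultdict

PAGE_SIZE = 1000  # batch size proposed in this issue


class FakeObjectStore:
    """Hypothetical in-memory stand-in for a bucket; counts list requests."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.requests = 0

    def list_page(self, prefix, start_after=None, max_keys=PAGE_SIZE):
        """Return (keys, truncated) for one page of keys under `prefix`."""
        self.requests += 1
        matching = [
            k for k in self.keys
            if k.startswith(prefix) and (start_after is None or k > start_after)
        ]
        return matching[:max_keys], len(matching) > max_keys


def bulk_list_partitions(store, table_prefix):
    """List all files under `table_prefix` in pages, grouped by partition dir."""
    files_by_partition = defaultdict(list)
    start_after = None
    while True:
        page, truncated = store.list_page(table_prefix, start_after)
        for key in page:
            # Everything before the last "/" is treated as the partition dir.
            partition_dir, _, file_name = key.rpartition("/")
            files_by_partition[partition_dir].append(file_name)
        if not truncated:
            break
        start_after = page[-1]  # resume after the last key already seen
    return dict(files_by_partition)


if __name__ == "__main__":
    # Assumed layout: 2400 partitions with one file each (file name is made up).
    keys = [f"orders_part/o_orderdate=d{i:04d}/data.parq" for i in range(2400)]
    store = FakeObjectStore(keys)
    listing = bulk_list_partitions(store, "orders_part/")
    print(len(listing), "partitions listed in", store.requests, "requests")
```

Under these assumptions, 2400 single-file partitions are covered by three list requests rather than several requests per file; the real saving would depend on how many files each partition holds and on how the S3 connector issues its calls.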

Attachments

invalidate_cs_3.jfr
07/Dec/16 01:36
1.90 MB
Mostafa Mokhtar

Issue Links

is related to

IMPALA-4611 Checking perms on S3 files is a very expensive no-op

Resolved

relates to

IMPALA-4172 Switch from using getFileBlockLocations to BlockLocation methods (Potential 50% speedup in metadata loading)

Resolved

People

Assignee: Sailesh Mukil

Reporter: Sailesh Mukil

Votes: 1

Watchers: 6

Dates

Created: 05/May/16 20:58

Updated: 10/Apr/18 16:55

Resolved: 10/Apr/18 16:55