[SPARK-10340] Use S3 bulk listing for S3-backed Hive tables - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.4.1, 1.5.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns every object whose key starts with that prefix, in pages of up to 1000 keys.

Since ~~SPARK-9926~~ allows us to list multiple partitions together, we can significantly speed up input split calculation by using S3 bulk listing instead of listing each partition directory separately. This optimization is particularly useful for queries like select * from partitioned_table limit 10, where listing the partitions can take longer than reading the data.
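The idea above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the linked pull request: `list_pages` is a hypothetical in-memory stand-in for the S3 listing call, and `bulk_list`, `common_prefix`, and the sample key layout are names invented for this example. It shows one listing pass over the common prefix, followed by client-side filtering back down to the requested partitions.

```python
import os
from typing import Dict, Iterator, List, Tuple

PAGE_SIZE = 1000  # S3 returns at most 1000 keys per listing response


def common_prefix(paths: List[str]) -> str:
    """Longest string prefix shared by all input paths (character-wise)."""
    return os.path.commonprefix(paths)


def list_pages(all_keys: List[str], prefix: str,
               page_size: int = PAGE_SIZE) -> Iterator[List[str]]:
    """Hypothetical stand-in for the S3 listing call.

    Yields the keys that start with `prefix`, in sorted order,
    one page of at most `page_size` keys at a time.
    """
    matching = sorted(k for k in all_keys if k.startswith(prefix))
    for start in range(0, len(matching), page_size):
        yield matching[start:start + page_size]


def bulk_list(all_keys: List[str],
              partition_paths: List[str]) -> Tuple[Dict[str, List[str]], int]:
    """List every requested partition with one pass over the common prefix.

    Partition paths are assumed to end with "/" so that "dt=1/" does not
    match keys under "dt=10/". Returns the keys grouped by partition and
    the number of pages (listing requests) that were consumed.
    """
    prefix = common_prefix(partition_paths)
    by_partition: Dict[str, List[str]] = {p: [] for p in partition_paths}
    pages = 0
    for page in list_pages(all_keys, prefix):
        pages += 1
        for key in page:
            # The common prefix may also cover partitions nobody asked for,
            # so keep only keys that fall under a requested partition.
            for p in partition_paths:
                if key.startswith(p):
                    by_partition[p].append(key)
                    break
    return by_partition, pages
```

The trade-off is visible in the filtering step: the common prefix can match objects outside the requested partitions, so the saving depends on how tightly the partitions share a prefix. With a real client, the stand-in would be replaced by the store's paginated listing call.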

This is a common optimization for S3. For example, a Qubole blog post covers this topic.

Issue Links

depends upon

SPARK-9926 Parallelize file listing for partitioned Hive table

Resolved

duplicates

SPARK-9926 Parallelize file listing for partitioned Hive table

Resolved

is superseded by

HADOOP-12810 FileSystem#listLocatedStatus causes unnecessary RPC calls

Closed

links to

[GitHub] Pull Request #8512 (piaozhexiu)

People

Assignee: Cheolsoo Park

Reporter: Cheolsoo Park

Votes: 0

Watchers: 7

Dates

Created: 28/Aug/15 20:24

Updated: 08/Apr/16 11:15

Resolved: 08/Apr/16 11:15