Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.4.1, 1.5.0
Description
AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns every object whose key starts with that prefix, in blocks of 1000.
Since SPARK-9926 allows us to list multiple partitions together, we can significantly speed up input split calculation by using S3 bulk listing. This optimization is particularly useful for queries like select * from partitioned_table limit 10.
This is a common optimization for S3. For example, see the blog post from Qubole on this topic.
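The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Spark's or Hadoop's implementation: `list_page` is a hypothetical in-memory stand-in for a single S3 listing call (sorted keys, at most 1000 per call), and the table layout and partition names are made up. It computes the common prefix of the requested partition paths, pages through everything under it, and assigns each key back to its partition.

```python
import os.path

PAGE_SIZE = 1000  # S3 returns at most 1000 keys per listing call


def list_page(store, prefix, start_after=None):
    """Hypothetical stand-in for one S3 listing call (not a real client API).

    Returns up to PAGE_SIZE keys that start with `prefix`, in sorted order,
    strictly after `start_after`, plus a flag saying whether more keys remain.
    """
    matching = sorted(
        k for k in store
        if k.startswith(prefix) and (start_after is None or k > start_after)
    )
    return matching[:PAGE_SIZE], len(matching) > PAGE_SIZE


def bulk_list(store, partition_prefixes):
    """List all requested partitions by paging over their common prefix."""
    common = os.path.commonprefix(list(partition_prefixes))
    by_partition = {p: [] for p in partition_prefixes}
    calls = 0
    start_after = None
    while True:
        page, truncated = list_page(store, common, start_after)
        calls += 1
        for key in page:
            # Keep only keys that belong to a requested partition; the common
            # prefix may also match partitions that were not asked for.
            for p in partition_prefixes:
                if key.startswith(p):
                    by_partition[p].append(key)
                    break
        if not truncated:
            break
        start_after = page[-1]
    return by_partition, calls


# Made-up table: three daily partitions with 700 files each.
store = [
    f"warehouse/t/dt=2015-08-0{d}/part-{i:05d}"
    for d in (1, 2, 3)
    for i in range(700)
]
requested = ["warehouse/t/dt=2015-08-01/", "warehouse/t/dt=2015-08-02/"]
files, calls = bulk_list(store, requested)
```

Note the trade-off the sketch makes visible: the common prefix here is `warehouse/t/dt=2015-08-0`, so the listing also returns keys from the unrequested `dt=2015-08-03` partition, which are then filtered out. Bulk listing pays off when the requested partitions cover most of what lives under the common prefix.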
Issue Links
- depends upon: SPARK-9926 Parallelize file listing for partitioned Hive table (Resolved)
- duplicates: SPARK-9926 Parallelize file listing for partitioned Hive table (Resolved)
- is superseded by: HADOOP-12810 FileSystem#listLocatedStatus causes unnecessary RPC calls (Closed)
- links to