Details
- Improvement
- Status: Patch Available
- Major
- Resolution: Unresolved
- 2.4.0
- None
- None
Description
FileInputFormat (both the mapreduce and mapred implementations) uses recursive listing while calculating splits, but it performs that listing level by level. To discover the files under /foo/bar, it first lists /foo/bar to get the immediate children, then makes the same call on each immediate child of /foo/bar to discover its children, and so on. This does not scale well for object-store-based fs implementations such as S3 and Swift, because every listStatus call ends up being a web service call to the backend. When a large number of files is considered for input, this makes the getSplits() call slow.
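The cost of the level-by-level walk can be illustrated with a small self-contained model. This is a sketch, not Hadoop code: the class `LevelByLevelListing`, its in-memory key set, and the `remoteCalls` counter are all hypothetical, and the only assumption carried over from the description is that each `listStatus` call costs one remote request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of an object store in which every list request is one remote call. */
public class LevelByLevelListing {
    // Flat key space; "directories" exist only as shared key prefixes.
    private final TreeSet<String> keys = new TreeSet<>();
    int remoteCalls = 0;

    void put(String key) {
        keys.add(key);
    }

    /** Immediate children of dir (dir must end in "/"); a trailing "/" marks a sub-directory. */
    List<String> listStatus(String dir) {
        remoteCalls++; // simulated web service call
        TreeSet<String> children = new TreeSet<>();
        for (String key : keys.tailSet(dir)) {
            if (!key.startsWith(dir)) break; // left the prefix range
            String rest = key.substring(dir.length());
            if (rest.isEmpty()) continue;
            int slash = rest.indexOf('/');
            children.add(slash < 0 ? key : dir + rest.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    /** The level-by-level walk: one listStatus call per directory visited. */
    List<String> listFilesLevelByLevel(String dir) {
        List<String> files = new ArrayList<>();
        for (String child : listStatus(dir)) {
            if (child.endsWith("/")) {
                files.addAll(listFilesLevelByLevel(child));
            } else {
                files.add(child);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        LevelByLevelListing store = new LevelByLevelListing();
        for (int d = 0; d < 10; d++) {
            for (int f = 0; f < 3; f++) {
                store.put("/foo/bar/part" + d + "/file" + f);
            }
        }
        List<String> files = store.listFilesLevelByLevel("/foo/bar/");
        System.out.println(files.size() + " files, " + store.remoteCalls + " remote calls");
    }
}
```

In this model the number of remote calls equals the number of directories visited (here 1 root plus 10 sub-directories), regardless of how few files each directory holds.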
This patch adds a new set of recursive list APIs that give fs implementations an opportunity to optimize. Behavior remains the same for other implementations: a default implementation is provided, so they do not have to implement anything new. For object-store-based fs implementations, however, a simple change to pass the recursive flag as true (as shown in the patch) improves listing performance.
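The shape of that design, a default recursive listing that any implementation inherits plus an override for object stores, might look roughly like the following. This is an illustrative sketch only: `ToyFs`, `ToyObjectStoreFs`, `listFiles(dir, recursive)` as written here, and the `remoteCalls` counter are hypothetical names and are not the API added by the patch. The override also assumes one prefix scan returns every key, whereas a real store may page its results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class RecursiveListSketch {
    /** Hypothetical base class with a default recursive listing. */
    static class ToyFs {
        final TreeSet<String> keys = new TreeSet<>();
        int remoteCalls = 0;

        /** Immediate children of dir; a trailing "/" marks a sub-directory. */
        List<String> listStatus(String dir) {
            remoteCalls++; // simulated web service call
            TreeSet<String> children = new TreeSet<>();
            for (String key : keys.tailSet(dir)) {
                if (!key.startsWith(dir)) break;
                String rest = key.substring(dir.length());
                if (rest.isEmpty()) continue;
                int slash = rest.indexOf('/');
                children.add(slash < 0 ? key : dir + rest.substring(0, slash + 1));
            }
            return new ArrayList<>(children);
        }

        /** Default: level-by-level walk, so existing implementations need no changes. */
        List<String> listFiles(String dir, boolean recursive) {
            List<String> files = new ArrayList<>();
            for (String child : listStatus(dir)) {
                if (!child.endsWith("/")) {
                    files.add(child);
                } else if (recursive) {
                    files.addAll(listFiles(child, true));
                }
            }
            return files;
        }
    }

    /** Hypothetical object store: answers a recursive listing with one prefix scan. */
    static class ToyObjectStoreFs extends ToyFs {
        @Override
        List<String> listFiles(String dir, boolean recursive) {
            if (!recursive) return super.listFiles(dir, false);
            remoteCalls++; // a single simulated prefix query
            List<String> files = new ArrayList<>();
            for (String key : keys.tailSet(dir)) {
                if (!key.startsWith(dir)) break;
                if (!key.endsWith("/")) files.add(key);
            }
            return files;
        }
    }

    static void fill(ToyFs fs) {
        for (int d = 0; d < 10; d++) {
            for (int f = 0; f < 3; f++) {
                fs.keys.add("/foo/bar/part" + d + "/file" + f);
            }
        }
    }

    public static void main(String[] args) {
        ToyFs plain = new ToyFs();
        ToyFs store = new ToyObjectStoreFs();
        fill(plain);
        fill(store);
        int plainFiles = plain.listFiles("/foo/bar/", true).size();
        int storeFiles = store.listFiles("/foo/bar/", true).size();
        System.out.println("default:      " + plainFiles + " files, " + plain.remoteCalls + " calls");
        System.out.println("object store: " + storeFiles + " files, " + store.remoteCalls + " calls");
    }
}
```

Callers use the same `listFiles(dir, true)` call in both cases; only the implementation behind it decides whether the listing is a walk or a single query.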
Attachments
Issue Links
- depends upon
  - HADOOP-10634 Add recursive list apis to FileSystem to give implementations an opportunity for optimization (Resolved)
- is depended upon by
  - HADOOP-14302 Test MR split optimisation with recursive listing (Open)
  - HADOOP-16829 Über-jira: S3A Hadoop 3.3.1 features (Resolved)
- is related to
  - MAPREDUCE-7092 MR examples to work better against cloud stores (Resolved)