Description
A call to listPaths on S3FileSystem results in an S3 access for each file in the directory being queried. If the input contains hundreds or thousands of files, this is prohibitively slow. The method is called from FileInputFormat.validateInput and FileInputFormat.getSplits. It would be easy to fix by overriding listPaths (all four variants) in S3FileSystem so that they do not go through listStatus, which creates a FileStatus object - and hence an S3 access - for each subpath. However, since listPaths is deprecated in favour of listStatus, this would only be acceptable as a short-term measure, not a long-term one.
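A rough sketch of that short-term workaround, assuming a Hadoop version in which FileSystem still declares the deprecated listPaths variants; the listChildPathsOnly helper is hypothetical and stands in for whatever single-request bucket listing the S3 store could provide without fetching per-child metadata:
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3.S3FileSystem;

public class FastListingS3FileSystem extends S3FileSystem {

  // Hypothetical helper: enumerate the children of 'dir' with a single S3
  // bucket listing, without building a FileStatus (one S3 GET) per child.
  private Path[] listChildPathsOnly(Path dir) throws IOException {
    throw new UnsupportedOperationException("sketch only");
  }

  /**
   * Override the deprecated listPaths(Path) so that directory listing does
   * not go through listStatus; the other three variants would be overridden
   * the same way.
   */
  @Override
  @Deprecated
  public Path[] listPaths(Path dir) throws IOException {
    return listChildPathsOnly(dir);
  }
}
{code}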
But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six times for each input file via these calls:
1. fs.isDirectory
2. fs.exists
3. fs.getLength
4. fs.getLength
5. fs.exists (from fs.getFileBlockLocations)
6. fs.getBlockSize
So it would be best to change getSplits to use listStatus and access S3 only once per file. (This would help HDFS too.) A sketch of that approach follows below. The change would require some care, since FileInputFormat has a protected listPaths method which subclasses can override (although, in passing, I notice that validateInput doesn't use listPaths - is this a bug?).
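A minimal sketch of the listStatus-based approach: all the per-file metadata that the six calls above fetch separately is already carried by the FileStatus objects returned from a single listStatus call. The class and method names here are illustrative, not the actual getSplits code, and the FileStatus-taking getFileBlockLocations overload is assumed to be available.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListStatusSplitSketch {

  /** One metadata request per directory instead of several per file. */
  public static List<String> describeInputs(FileSystem fs, Path dir) throws IOException {
    List<String> out = new ArrayList<String>();
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) {                      // replaces fs.isDirectory / fs.exists
        continue;
      }
      long len = stat.getLen();                // replaces the two fs.getLength calls
      long blockSize = stat.getBlockSize();    // replaces fs.getBlockSize
      BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, len);
      out.add(stat.getPath() + " len=" + len + " blockSize=" + blockSize
          + " blocks=" + blocks.length);
    }
    return out;
  }
}
{code}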
For input validation, one approach would be to disable it for S3 by creating a custom FileInputFormat. In this case, missing files would be detected during split generation. Alternatively, it may be possible to cache the input paths between validateInput and getSplits.
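A minimal sketch of the first approach, assuming a Hadoop version in which InputFormat still declares validateInput(JobConf); TextInputFormat is used only as an arbitrary concrete base class, and the class name is made up:
{code:java}
import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Skips the up-front input validation so that missing S3 paths are only
// detected when the splits are generated.
public class NoValidateTextInputFormat extends TextInputFormat {

  @Override
  public void validateInput(JobConf job) throws IOException {
    // Intentionally empty: no per-file S3 access before getSplits runs.
  }
}
{code}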
Attachments
Issue Links
- is depended upon by: HADOOP-2565 DFSPath cache of FileStatus can become stale (Resolved)
- is related to: HADOOP-3664 Remove deprecated methods introduced in changes to validating input paths (HADOOP-3095) (Closed)