Hadoop Common
  1. Hadoop Common
  2. HADOOP-3095

Validating input paths and creating splits is slow on S3

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: fs, fs/s3
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Hide
      Added overloaded method getFileBlockLocations(FileStatus, long, long). This is an incompatible change for FileSystem implementations which override getFileBlockLocations(Path, long, long). They should have the signature of this method changed to getFileBlockLocations(FileStatus, long, long) to work correctly.
      Show
      Added overloaded method getFileBlockLocations(FileStatus, long, long). This is an incompatible change for FileSystem implementations which override getFileBlockLocations(Path, long, long). They should have the signature of this method changed to getFileBlockLocations(FileStatus, long, long) to work correctly.

      Description

      A call to listPaths on S3FileSystem results in an S3 access for each file in the directory being queried. If the input contains hundreds or thousands of files this is prohibitively slow. This method is called in FileInputFormat.validateInput and FileInputFormat.getSplits. This would be easy to fix by overriding listPaths (all four variants) in S3FileSystem to not use listStatus which creates a FileStatus object for each subpath. However, listPaths is deprecated in favour of listStatus so this might be OK as a short term measure, but not longer term.

      But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six times for each input file via these calls:

      1. fs.isDirectory
      2. fs.exists
      3. fs.getLength
      4. fs.getLength
      5. fs.exists (from fs.getFileBlockLocations)
      6. fs.getBlockSize

      So it would be best to change getSplits to use listStatus, and only access S3 once for each file. (This would help HDFS too.) This change would require some care since FileInputFormat has a protected method listPaths which subclasses can override (although, in passing I notice validateInput doesn't use listPaths - is this a bug?).

      For input validation, one approach would be to disable it for S3 by creating a custom FileInputFormat. In this case, missing files would be detected during split generation. Alternatively, it may be possible to cache the input paths between validateInput and getSplits.

      1. hadoop-3095-v4.patch
        21 kB
        Tom White
      2. hadoop-3095-v3.patch
        17 kB
        Tom White
      3. hadoop-3095-v2.patch
        17 kB
        Tom White
      4. hadoop-3095.patch
        15 kB
        Tom White
      5. faster-job-init.patch
        9 kB
        Owen O'Malley

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          We should fix this generally, not just for S3, since, once we remove the HDFS status cache, it will be similarly slow. A good approach would be to rework FileInputFormat to retain FileStatus instances. FileInputFormat#listPaths() should be deprecated in favor of a listStatus() method.

          As for validateInput, this can be deprecated too, or at least gutted. Its utility was in the days when getSplits() was called on the JobTracker, and it was nice to get such error messages in the client before submitting the job. But now that getSplits() is called in the client, any needed checks can be done there (i.e., it should throw exceptions for non-existent input paths). So my vote would be to long-term eliminate validateInput() and short-term deprecate and gut its implementation in FileInputFormat.

          Show
          Doug Cutting added a comment - We should fix this generally, not just for S3, since, once we remove the HDFS status cache, it will be similarly slow. A good approach would be to rework FileInputFormat to retain FileStatus instances. FileInputFormat#listPaths() should be deprecated in favor of a listStatus() method. As for validateInput, this can be deprecated too, or at least gutted. Its utility was in the days when getSplits() was called on the JobTracker, and it was nice to get such error messages in the client before submitting the job. But now that getSplits() is called in the client, any needed checks can be done there (i.e., it should throw exceptions for non-existent input paths). So my vote would be to long-term eliminate validateInput() and short-term deprecate and gut its implementation in FileInputFormat.
          Hide
          Owen O'Malley added a comment -

          This changed in trunk. Look at HADOOP-2027. It should use getFileBlockLocations for each path now.

          Show
          Owen O'Malley added a comment - This changed in trunk. Look at HADOOP-2027 . It should use getFileBlockLocations for each path now.
          Hide
          Doug Cutting added a comment -

          FileInputFormat#getSplits() in trunk still calls FileSystem#exists(), FileSystem#isDirectory(), FileSystem#getLength() (twice) and FileSystem#getBlockSize() for each input file. These could all be eliminated if it used listStatus() instead of listPaths(). The cost of these is negligible for HDFS currently, since these are all cached in the Path instance. But once we eliminate that cache, performance on HDFS will suffer.

          FileSystem#getBlockLocations() is also called once per file and would still need to be called, since FileStatus doesn't (yet) include that information.

          Also, do you see a reason for validateInput to continue to exist? Without the status cache it too will become very slow, and its purpose is now outdated.

          Show
          Doug Cutting added a comment - FileInputFormat#getSplits() in trunk still calls FileSystem#exists(), FileSystem#isDirectory(), FileSystem#getLength() (twice) and FileSystem#getBlockSize() for each input file. These could all be eliminated if it used listStatus() instead of listPaths(). The cost of these is negligible for HDFS currently, since these are all cached in the Path instance. But once we eliminate that cache, performance on HDFS will suffer. FileSystem#getBlockLocations() is also called once per file and would still need to be called, since FileStatus doesn't (yet) include that information. Also, do you see a reason for validateInput to continue to exist? Without the status cache it too will become very slow, and its purpose is now outdated.
          Hide
          Owen O'Malley added a comment -

          Sorry about that. I thought we'd cleaned up more of the input split creation in HADOOP-2027. Of course we should keep the FileStatus around instead of calling FileSystem methods over and over again.

          Show
          Owen O'Malley added a comment - Sorry about that. I thought we'd cleaned up more of the input split creation in HADOOP-2027 . Of course we should keep the FileStatus around instead of calling FileSystem methods over and over again.
          Hide
          Owen O'Malley added a comment -

          I haven't really tested this, but it seems like the right direction. Thoughts?

          Show
          Owen O'Malley added a comment - I haven't really tested this, but it seems like the right direction. Thoughts?
          Hide
          Tom White added a comment -

          This looks good to me. A couple of points:

          1. As it stands FileInputFormat#listPaths() is deprecated and not called. However, this will silently change the behaviour of applications which have overridden this method - and which are not recompiled - since it will never be called. It would be nice to issue a warning at runtime, but I'm not sure how we do this though, since it depends on being able to detect the existence of an overridden method. Any thoughts?

          2. There is still an FileSystem#exists() call in FileSystem#getBlockLocations(). Could we change/overload the latter to be FileSystem#getBlockLocations(FileStatus, long, long)?

          Show
          Tom White added a comment - This looks good to me. A couple of points: 1. As it stands FileInputFormat#listPaths() is deprecated and not called. However, this will silently change the behaviour of applications which have overridden this method - and which are not recompiled - since it will never be called. It would be nice to issue a warning at runtime, but I'm not sure how we do this though, since it depends on being able to detect the existence of an overridden method. Any thoughts? 2. There is still an FileSystem#exists() call in FileSystem#getBlockLocations(). Could we change/overload the latter to be FileSystem#getBlockLocations(FileStatus, long, long)?
          Hide
          Lohit Vijayarenu added a comment -

          >2. There is still an FileSystem#exists() call in FileSystem#getBlockLocations(). Could we change/overload the latter to be FileSystem#getBlockLocations(FileStatus, long, long)?

          HADOOP-2634 might be related. It tries to implement exists() using getFileStatus.

          Show
          Lohit Vijayarenu added a comment - >2. There is still an FileSystem#exists() call in FileSystem#getBlockLocations(). Could we change/overload the latter to be FileSystem#getBlockLocations(FileStatus, long, long)? HADOOP-2634 might be related. It tries to implement exists() using getFileStatus.
          Hide
          Doug Cutting added a comment -

          Shouldn't we deprecate validateInput() too?

          Also, I second Tom's concern about breaking applications that override listPaths(). If we assume that application code only overrides listPaths() but doesn't call it, then we might change the default implementation to return a special value, indicating that listPaths() has not been overridden. Then listStatus() can call this and use the returned paths if the value is not the special value, also emitting a warning that the method has been deprecated. If we wanted to get really fancy, we could only return the special value when called under listStatus() by setting a threadlocal, for 100% back-compatibility. Ugly, huh?

          Show
          Doug Cutting added a comment - Shouldn't we deprecate validateInput() too? Also, I second Tom's concern about breaking applications that override listPaths(). If we assume that application code only overrides listPaths() but doesn't call it, then we might change the default implementation to return a special value, indicating that listPaths() has not been overridden. Then listStatus() can call this and use the returned paths if the value is not the special value, also emitting a warning that the method has been deprecated. If we wanted to get really fancy, we could only return the special value when called under listStatus() by setting a threadlocal, for 100% back-compatibility. Ugly, huh?
          Hide
          Tom White added a comment -

          I've extended Owen's patch to

          1. Add an overloaded version of FileSystem#getBlockLocations() that takes a FileStatus object so implementations of FileSystem have a way of avoiding an exists call per file.
          2. Deprecate validateInput.
          3. For back compatibility call listPath and listStatus, and if the paths are the same use the FileStatus objects, otherwise fall back to the existing behaviour (use the Path objects) and issue a warning. When we remove listPath in a subsequent release we can remove this check.

          I couldn't see how to get the ThreadLocal approach suggested by Doug to work if a subclass implementation of listPath calls its superclass (which is a common case, since the subclass typically modifies the list returned by the superclass). Calling both listPath and listStatus does add some overhead, but with validateInput gone the patch shouldn't make things worse overall.

          Show
          Tom White added a comment - I've extended Owen's patch to 1. Add an overloaded version of FileSystem#getBlockLocations() that takes a FileStatus object so implementations of FileSystem have a way of avoiding an exists call per file. 2. Deprecate validateInput. 3. For back compatibility call listPath and listStatus, and if the paths are the same use the FileStatus objects, otherwise fall back to the existing behaviour (use the Path objects) and issue a warning. When we remove listPath in a subsequent release we can remove this check. I couldn't see how to get the ThreadLocal approach suggested by Doug to work if a subclass implementation of listPath calls its superclass (which is a common case, since the subclass typically modifies the list returned by the superclass). Calling both listPath and listStatus does add some overhead, but with validateInput gone the patch shouldn't make things worse overall.
          Hide
          Doug Cutting added a comment -

          Tom, this looks good to me. Perhaps we should deprecate FileSystem#getBlockLocations(Path) and update all implementations to define FileSystem#getBlockLocations(FileStatus). We'll need to make that change for HADOOP-2565 anyway, and since the primary client of this method calls only the new signature, I don't see a need for implementations to support two signatures here.

          Show
          Doug Cutting added a comment - Tom, this looks good to me. Perhaps we should deprecate FileSystem#getBlockLocations(Path) and update all implementations to define FileSystem#getBlockLocations(FileStatus). We'll need to make that change for HADOOP-2565 anyway, and since the primary client of this method calls only the new signature, I don't see a need for implementations to support two signatures here.
          Hide
          Tom White added a comment -

          Reworked patch to make implementations override FileSystem#getFileBlockLocations(FileStatus) rather than FileSystem#getFileBlockLocations(Path) - which is now deprecated. The latter is now just delegates to the former, by calling

          return getFileBlockLocations(getFileStatus(f), start, len);
          

          KosmosFileSystem didn't override FileSystem#getFileBlockLocations(Path) as it had a method called getBlockLocations instead, which would have broken data locality in mapreduce. I've fixed this and marked it as @Override.

          Show
          Tom White added a comment - Reworked patch to make implementations override FileSystem#getFileBlockLocations(FileStatus) rather than FileSystem#getFileBlockLocations(Path) - which is now deprecated. The latter is now just delegates to the former, by calling return getFileBlockLocations(getFileStatus(f), start, len); KosmosFileSystem didn't override FileSystem#getFileBlockLocations(Path) as it had a method called getBlockLocations instead, which would have broken data locality in mapreduce. I've fixed this and marked it as @Override.
          Hide
          Doug Cutting added a comment -

          +1 This looks fine to me.

          Show
          Doug Cutting added a comment - +1 This looks fine to me.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383246/hadoop-3095-v2.patch
          against trunk revision 662515.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 448 javac compiler warnings (more than the trunk's current 410 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12383246/hadoop-3095-v2.patch against trunk revision 662515. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 448 javac compiler warnings (more than the trunk's current 410 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/console This message is automatically generated.
          Hide
          Tom White added a comment -

          New patch to fix unit test failures. Two failures were unrelated to this patch, but three showed up a problem with Path objects in FileStatus not being fully qualified. The problem arises since FileInputFormat#listStatus doesn't return fully-qualified paths so #getSplits can't pick up the correct FileSystem(s).

          I've changed DistributedFileSystem to return a fully-qualified Path in getFileStatus(). However, this doesn't fully solve the problem as other FileSystem implementations don't return fully-qualified paths in FileStatus. Should we strengthen the contract of FileSystem to require them to do so?

          (The increased warning count is due to the addition of deprecated methods that are still called, mainly from tests.)

          Show
          Tom White added a comment - New patch to fix unit test failures. Two failures were unrelated to this patch, but three showed up a problem with Path objects in FileStatus not being fully qualified. The problem arises since FileInputFormat#listStatus doesn't return fully-qualified paths so #getSplits can't pick up the correct FileSystem(s). I've changed DistributedFileSystem to return a fully-qualified Path in getFileStatus(). However, this doesn't fully solve the problem as other FileSystem implementations don't return fully-qualified paths in FileStatus. Should we strengthen the contract of FileSystem to require them to do so? (The increased warning count is due to the addition of deprecated methods that are still called, mainly from tests.)
          Hide
          Owen O'Malley added a comment -

          I think that getFileStatus should return a fully qualified Path. I've got a start to a patch that does it for the ramfs, because that was causing me trouble.

          Show
          Owen O'Malley added a comment - I think that getFileStatus should return a fully qualified Path. I've got a start to a patch that does it for the ramfs, because that was causing me trouble.
          Hide
          Doug Cutting added a comment -

          Yes, all Path instances returned from a FileSystem method should be fully-qualified.

          Show
          Doug Cutting added a comment - Yes, all Path instances returned from a FileSystem method should be fully-qualified.
          Hide
          Tom White added a comment -

          This patch makes all FileSystem implementations return fully-qualified Path objects in FileStatus.

          Show
          Tom White added a comment - This patch makes all FileSystem implementations return fully-qualified Path objects in FileStatus.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch
          against trunk revision 662913.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2556/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch against trunk revision 662913. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2556/console This message is automatically generated.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch
          against trunk revision 662976.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 447 javac compiler warnings (more than the trunk's current 409 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch against trunk revision 662976. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 447 javac compiler warnings (more than the trunk's current 409 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/console This message is automatically generated.
          Hide
          Tom White added a comment -

          The failed contrib test passes when I run it on my local machine.

          Show
          Tom White added a comment - The failed contrib test passes when I run it on my local machine.
          Hide
          Owen O'Malley added a comment -

          I just committed this. Thanks, Tom!

          Show
          Owen O'Malley added a comment - I just committed this. Thanks, Tom!

            People

            • Assignee:
              Tom White
              Reporter:
              Tom White
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development