Hadoop Map/Reduce: MAPREDUCE-972

distcp can timeout during rename operation to s3

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: distcp, documentation
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      rename() in S3 is implemented as copy + delete. The S3 copy operation can perform very slowly, which may cause task timeout.

      Attachments

      1. MAPREDUCE-972.2.patch
        3 kB
        Aaron Kimball
      2. MAPREDUCE-972.3.patch
        3 kB
        Aaron Kimball
      3. MAPREDUCE-972.4.patch
        8 kB
        Aaron Kimball
      4. MAPREDUCE-972.5.patch
        11 kB
        Aaron Kimball
      5. MAPREDUCE-972.6.patch
        2 kB
        Aaron Kimball
      6. MAPREDUCE-972.patch
        4 kB
        Aaron Kimball

        Issue Links

          • MAPREDUCE-1127 (short-term fix)
          • HADOOP-6324 (long-term fix)

          Activity

          Aaron Kimball added a comment -

          Attaching a patch which starts a background thread to increment mapper progress when the rename operation is running.

          We benchmarked S3 copy performance at roughly 4 MB/sec, which means that files in the 3–5 GB size range can cause task timeouts while they are renamed into their final locations. This patch fixes that issue.

          This patch was tested manually by running distcp to upload data to s3n and verifying that renames still worked as expected, and that log messages confirmed creation and destruction of the background progress thread.
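
          (Illustrative sketch only, not code from the attached patch; the class name ProgressThread and the helper renameWithProgress are assumptions.)

            import java.io.IOException;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.mapred.Reporter;

            /** Hypothetical sketch: report progress while a slow rename() runs. */
            class ProgressThread extends Thread {
              private final Reporter reporter;
              private volatile boolean complete = false;      // set once the rename returns

              ProgressThread(Reporter reporter) {
                this.reporter = reporter;
                setDaemon(true);                              // never block JVM exit
              }

              void markComplete() { complete = true; }

              @Override public void run() {
                while (!complete) {
                  reporter.progress();                        // keep the task from timing out
                  try { Thread.sleep(3000); } catch (InterruptedException ie) { }
                }
              }

              /** Wrap a rename so the task stays alive even when copy+delete on S3 is slow. */
              static boolean renameWithProgress(FileSystem fs, Path src, Path dst,
                                                Reporter reporter) throws IOException {
                ProgressThread keepAlive = new ProgressThread(reporter);
                keepAlive.start();
                try {
                  return fs.rename(src, dst);
                } finally {
                  keepAlive.markComplete();
                  try { keepAlive.join(); } catch (InterruptedException ie) { }
                }
              }
            }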

          Todd Lipcon added a comment -

          Couple bits of feedback:

          • ProgressThread could do with some short javadoc
          • isComplete should be marked volatile or made into an AtomicBoolean - then you don't have to worry about synchronization on it or the odd copying into myComplete
          • can ProgressThread be a static class? I think so.

          Other than that, lgtm.

          Aaron Kimball added a comment -

          New patch to address your comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419256/MAPREDUCE-972.2.patch
          against trunk revision 813660.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/27/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/27/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/27/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/27/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Fix findbugs warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419318/MAPREDUCE-972.3.patch
          against trunk revision 813944.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/29/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/29/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/29/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/29/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          That Hadoop QA output seems a bit confused; it's a failing build applying MAPREDUCE-973. Can anyone explain what's up there?

          Aaron Kimball added a comment -

          Attempting to cycle the patch again.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419318/MAPREDUCE-972.3.patch
          against trunk revision 816782.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/55/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/55/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/55/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/55/console

          This message is automatically generated.

          Tom White added a comment -

          Sorry to come late to this, but I wonder if the S3 rename problem could be solved more generally (not just in the context of distcp) using the approach outlined in HADOOP-5814. I think a lot of the code in your patch could be used.

          Aaron Kimball added a comment -

          Hmm. In HADOOP-5814, the FSDataOutputStream is given a Progressable in its c'tor, which is then guaranteed valid up through the call to close(). Using the same Progressable in the close method, and anywhere else in the lifetime of the OutputStream makes sense.

          But in NativeS3FileSystem itself, Progressable objects are only directly provided to the create() and append() methods. It's not guaranteed that one of these will be called before the call to rename(). Moreover, since FileSystem implementations are cached, with JVM reuse, it's possible that any Progressable memoized in such a method might not apply to the current task.

          I don't see a way to do this for rename(), delete(), etc, without modifying the API of FileSystem itself, which would be a pretty big change.
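
          (For reference, abbreviated signatures from org.apache.hadoop.fs.FileSystem: create() accepts a Progressable, while rename() and delete() do not, so there is no hook for reporting progress during a slow rename.)

            public abstract FSDataOutputStream create(Path f, FsPermission permission,
                boolean overwrite, int bufferSize, short replication, long blockSize,
                Progressable progress) throws IOException;    // caller-supplied progress hook

            public abstract boolean rename(Path src, Path dst) throws IOException;        // no Progressable
            public abstract boolean delete(Path f, boolean recursive) throws IOException; // no Progressable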

          Tom White added a comment -

          You're right, HADOOP-5814 is about calling progress during writes, which is different to rename. The approach here seems fine to me.

          Could you write a unit test for this? Perhaps by having a modified filesystem with a rename() method that can take a long time to complete (under the control of the test).

          Aaron Kimball added a comment -

          Here's a new patch with a test case. This creates a new FileSystem implementation that subclasses LocalFileSystem, and includes a Thread.sleep() inside rename(). If you disable the background progress thread, the test case will fail.
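
          (Illustrative sketch of the kind of test filesystem described, not the actual patch; the class name and delay constant are assumptions.)

            import java.io.IOException;
            import org.apache.hadoop.fs.LocalFileSystem;
            import org.apache.hadoop.fs.Path;

            /** LocalFileSystem whose rename() is artificially slow, for testing. */
            public class SlowRenameLocalFileSystem extends LocalFileSystem {
              static final long RENAME_DELAY_MS = 30 * 1000;  // longer than the test's task timeout

              @Override
              public boolean rename(Path src, Path dst) throws IOException {
                try {
                  Thread.sleep(RENAME_DELAY_MS);              // simulate a slow S3 copy + delete
                } catch (InterruptedException ie) {
                  throw new IOException("interrupted during slow rename");
                }
                return super.rename(src, dst);                // then perform the real rename
              }
            }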

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422018/MAPREDUCE-972.4.patch
          against trunk revision 824750.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/69/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/69/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/69/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/69/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Only failing test is hftpAccessControl (MAPREDUCE-1029)

          Tom White added a comment -

          Thanks for writing a test, Aaron. I notice that you have changed DistCp to always create a new FileSystem instance for the sake of the test. You could avoid this change by giving SlowRenameLocalFileSystem a new scheme (e.g. filetest) so that a cached instance of LocalFileSystem is never retrieved.

          Aaron Kimball added a comment -

          Trickier than it sounds. Merely subclassing LocalFileSystem and applying a different scheme (I was using slow://) doesn't work, because exists() checks the full URI of each path, and LocalFileSystem.exists() expects file:// in there. So I'd have to rewrite that method, and I don't know how many others as well. This seemed a more straightforward way to contain the problem.

          Aaron Kimball added a comment -

          Here's a new patch. SlowRenameFileSystem now contains shims over various LocalFS methods for URI conversion.
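
          (Illustrative sketch of the shim idea, not the actual patch; the slow:// scheme and the toLocal helper are assumptions. Each shim rewrites the path back to file:// before delegating to LocalFileSystem.)

            import java.io.IOException;
            import java.net.URI;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.LocalFileSystem;
            import org.apache.hadoop.fs.Path;

            public class SlowRenameFileSystem extends LocalFileSystem {
              /** Rewrite slow:///some/path to file:///some/path so LocalFileSystem accepts it. */
              private static Path toLocal(Path p) {
                URI u = p.toUri();
                return new Path("file", u.getAuthority(), u.getPath());
              }

              @Override
              public boolean exists(Path f) throws IOException {
                return super.exists(toLocal(f));
              }

              @Override
              public FileStatus getFileStatus(Path f) throws IOException {
                return super.getFileStatus(toLocal(f));
              }

              // ... similar shims for open(), delete(), listStatus(), mkdirs(), rename(), etc.
            }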

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422133/MAPREDUCE-972.5.patch
          against trunk revision 825083.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/73/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/73/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/73/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/73/console

          This message is automatically generated.

          Tom White added a comment -

          +1 looks good to me.

          Chris Douglas added a comment -

          Really sorry to find this issue so late, but a progress thread that forbids tasks from timing out is not a good solution, particularly for distcp, where task timeouts are both legal and useful. If s3 requires a more elaborate rename mechanism, is there a way to push this into its implementation? While distcp may be a heavier user than most user jobs, the latter would also appreciate a more robust solution.

          Starting and waiting on a thread for every rename is also not an ideal design; the current patch polls isComplete only every three seconds, which slows all the renames.

          Aaron Kimball added a comment -

          As discussed earlier, the FileSystem API does not provide a means for operations such as rename() to get access to a Progressable. I do not see a straightforward way to improve the S3FS / S3N implementations without extending the FileSystem API to add operations such as rename(src, dst, progress). Are you +1 on doing that?

          Either way, I agree with your criticisms of the progress thread implementation. I have the following plan for improving this:

          • Make the progress thread's lifetime equal to that of the mapper. The first rename() operation starts it, and the join() moves to close().
          • The progress thread is only active when a rename() operation is underway. Use a volatile boolean to track this state. Otherwise it just sleeps.
          • Use Thread.interrupt() / isInterrupted() to interrupt the sleep in the main loop, so that we don't have to wait the full three seconds before the thread exits.
          • Add distcp.rename.timeout as a parameter which sets a max lifetime for the inner loop of the progress thread. The default value will be 10 seconds, but if the destination filesystem is s3n:// or s3fs://, it is raised to fifteen minutes.

          - Aaron

          Chris Douglas added a comment -

          I see. Extending the FileSystem API is a non-starter, so we can move on from that. Progress threads in general are discouraged (e.g. HADOOP-5052).

          If I understand your proposal, the progress thread would report starting from the first rename, but stop after some configurable interval. In most cases, I'm not sure how this would differ from simply setting the task timeout higher, since progress is reported between renames. Also, this wouldn't help renames after the thread exits.

          Would it be sufficient to add a call to progress() right before the rename (after the delete)? In that case, setting the task timeout higher would extend the time allowed for each rename, which is the right level of granularity, anyway. It won't do this automatically for s3 destinations, but pushing that detail into distcp is not ideal, either. One could add a FilterFileSystem that resets a persistent progress thread for each rename, manage all the signaling/locking etc., but its behavior seems indistinguishable from this much simpler tweak. Would this be sufficient?

          Aaron Kimball added a comment -

          My proposal is slightly different than that:

          The progress thread is in one of three states:

          1) inRename == true && isComplete == false
          2) inRename == false && isComplete == false
          3) isComplete == true

          When inRename is set to true, the progress thread will call progress() every few seconds, for up to a max of distcp.rename.timeout seconds. If it is still in this state after distcp.rename.timeout seconds have elapsed since the state began, it will set inRename to false.

          When inRename is false, it just sits there, waiting for another rename operation to start. It sleeps and occasionally polls for a state change on inRename or isComplete. Changing inRename back to true again will go into the previously-described state; distcp.rename.timeout starts anew from this time point.

          If isComplete is true, the thread exits immediately. The Mapper.close() method will set isComplete to true to ensure that the thread shuts down. (As the thread is setDaemon(true), the JVM will exit even without this detail, but it is good hygiene to do so anyway.)

          It is not sufficient to simply call progress() right before rename(). Experience has shown that when uploading large files to S3, the rename() operation itself can take in excess of 10 minutes. rename() in S3 is implemented as copy-and-delete. For multi-GB files, this can take a long time.

          If we just tell people to set their global task timeout to 30 minutes, then this will delay task restarts under other conditions where the timeout value is expected to be considerably shorter (e.g., an individual file write() operation). This can adversely affect distcp performance in the general case.
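
          (Illustrative sketch of the three states described above, not committed code; the class name and the distcp.rename.timeout handling are assumptions based on this proposal.)

            import org.apache.hadoop.util.Progressable;

            /** One long-lived daemon thread per mapper, per the proposal above. */
            class RenameProgressThread extends Thread {
              private volatile boolean inRename = false;    // state 1 vs. state 2
              private volatile boolean isComplete = false;  // state 3: mapper is closing
              private final Progressable progress;
              private final long renameTimeoutMs;           // distcp.rename.timeout

              RenameProgressThread(Progressable progress, long renameTimeoutMs) {
                this.progress = progress;
                this.renameTimeoutMs = renameTimeoutMs;
                setDaemon(true);
              }

              void startRename()  { inRename = true; }      // called just before fs.rename()
              void finishRename() { inRename = false; }     // called when rename() returns
              void shutdown()     { isComplete = true; interrupt(); }  // called from close()

              @Override public void run() {
                while (!isComplete) {
                  if (inRename) {
                    long deadline = System.currentTimeMillis() + renameTimeoutMs;
                    while (inRename && !isComplete && System.currentTimeMillis() < deadline) {
                      progress.progress();                  // keep the task alive during the rename
                      sleepQuietly(3000);
                    }
                    inRename = false;                       // timed out: stop reporting for this rename
                  } else {
                    sleepQuietly(3000);                     // idle: wait for the next rename
                  }
                }
              }

              private void sleepQuietly(long ms) {
                try { Thread.sleep(ms); } catch (InterruptedException ie) { /* re-check flags */ }
              }
            }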

          Chris Douglas added a comment -

          Why wouldn't one just set the task timeout for the distcp job to 30 minutes?

          Aaron Kimball added a comment -

          Because there are other operations in the distcp job (e.g., the write() calls made during the actual upload) that should time out far faster than thirty minutes in the event of an error. Using a single timeout value for all operations makes program execution overall considerably less efficient than it should be. Writes and renames in distcp can expect different running times; we should treat them this way.

          Doug Cutting added a comment -

          > Extending the FileSystem API is a non-starter [ ... ]

          Isn't the FileSystem API mid-rewrite right now in HADOOP-6223? So now might actually be the rare time to consider something like this. It's unfortunate that Rename.Options is an Enum, so it'd be hard to add a progress function there without changing that. Perhaps Rename.Options.OVERWRITE could still be a constant, but Rename.Options#createProgress(Progressable) could return a subclass of Rename.Options that wraps a Progressable or some such. I don't mean to push this approach, rather just to question whether it should be ruled out completely. If it seems reasonable for file rename implementations to take a long time, then adding a progress callback might be a reasonable approach.
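
          (Hypothetical illustration of that idea, not the real Hadoop API; in the current API Rename.Options is an enum, so this shape would require changing it into a class.)

            import org.apache.hadoop.util.Progressable;

            // A class-based Rename.Options that can carry a Progressable while keeping
            // OVERWRITE available as a constant.
            public class Rename {
              public static class Options {
                public static final Options NONE = new Options();
                public static final Options OVERWRITE = new Options();

                /** Wraps a Progressable so a slow rename() implementation can report progress. */
                public static Options createProgress(final Progressable progress) {
                  return new Options() {
                    @Override public Progressable getProgress() { return progress; }
                  };
                }

                public Progressable getProgress() { return null; }   // default: no progress hook
              }
            }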

          Aaron Kimball added a comment -

          Hm. Looking at FilterFileSystem, I think that's the most general and non-invasive solution. I can add a FilterFS that includes the methods setProgressable() and setRenameTimeout(). The rename operation will then manage the background progress thread in the state-machine style described above. Since this isn't distcp-specific, I'll probably rename the config variable to something like fs.progressable.rename.timeout. (Other suggestions welcome.)

          Distcp will then wrap its dstfs in a ProgressFs instance.
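
          (Illustrative sketch of the wrapper being proposed here; hypothetical names, and never committed in this form since the issue was ultimately resolved as documentation.)

            import java.io.IOException;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.FilterFileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.util.Progressable;

            /** Wraps any FileSystem and reports progress while rename() runs. */
            public class ProgressFs extends FilterFileSystem {
              private Progressable progress;
              private long renameTimeoutMs = 10 * 1000;     // fs.progressable.rename.timeout

              public ProgressFs(FileSystem fs) { super(fs); }

              public void setProgressable(Progressable p) { this.progress = p; }
              public void setRenameTimeout(long ms) { this.renameTimeoutMs = ms; }

              @Override
              public boolean rename(Path src, Path dst) throws IOException {
                // A background thread, as sketched in the earlier comments, would call
                // progress.progress() every few seconds, for at most renameTimeoutMs,
                // while this call is in flight.
                return fs.rename(src, dst);   // 'fs' is the wrapped FileSystem
              }
            }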

          Chris Douglas added a comment -

          > Isn't the FileSystem API mid-rewrite right now in HADOOP-6223? So now might actually be the rare time to consider something like this. It's unfortunate that Rename.Options is an Enum, so it'd be hard to add a progress function there without changing that. Perhaps Rename.Options.OVERWRITE could still be a constant, but Rename.Options#createProgress(Progressable) could return a subclass of Rename.Options that wraps a Progressable or some such. I don't mean to push this approach, rather just to question whether it should be ruled out completely. If it seems reasonable for file rename implementations to take a long time, then adding a progress callback might be a reasonable approach.

          I agree. Long-running renames should be considered/debated in the design of FileContext/AFS, particularly since we often use rename to promote output. Adding a cstr for FileContext that takes a Progressable, then adding a Progressable to all the AFS APIs would probably work. The Util inner class could also be created with it, so listStatus and copy could also update progress. Most implementations can ignore it, but that would at least push the workaround for S3 into the right layer.

          However, for this issue, patching either the old or new API is a non-starter. DistCp uses old APIs, and I'd much rather upgrade it and address the more general Progressable questions in other issues. Approaching all that in a single issue, particularly one devoted to a timeout in S3, imports a lot of baggage. Interesting, important baggage, but this is only a use case in that broader context.

          > Using a single timeout value for all operations makes program execution overall considerably less efficient than it should be. Writes and renames in distcp can expect different running times; we should treat them this way.

          Every file is copied twice. I'm not sure a long task timeout leaves too much performance on the table. Your point about tuning timeouts for particular operations is taken, but the payoff is too low for the complexity this adds. Both this and raising the task timeout for the job are hacks, but as Doug points out: we're going to have to solve this in general, too. The task timeout is a hack we have and know.

          > Looking at FilterFileSystem, I think that's the most general and non-invasive solution.

          Yes, that would be the cleanest place to add a thread, but it's still not much of a win over bumping the task timeout for the job. Updating the DistCp guide with notes for S3 users is an unambiguous win.

          Doug Cutting added a comment -

          > However, for this issue, patching either the old or new API is a non-starter.

          +1 We should not change the FileSystem API in this issue.

          There appear to be both short- and long-term fixes for this issue. The shortest-term fix is just to bump up the timeout for S3 distcp jobs. The longest-term fix is adding FileSystem support for long renames.

          Questions:

          • Will such a short-term fix suffice until the long-term can be addressed (0.22, probably)? If so, then there's perhaps no point in considering a more complex interim solution.
          • Should we associate this Jira issue with the short or long-term fix? If long, then we might make it depend on a FileSystem API change, to support progress callbacks.

          Aaron Kimball added a comment -

          > Will such a short-term fix suffice until the long-term can be addressed (0.22, probably)? If so, then there's perhaps no point in considering a more complex interim solution.

          Probably

          > Should we associate this Jira issue with the short or long-term fix? If long, then we might make it depend on a FileSystem API change, to support progress callbacks.

          Why not both?

          Aaron Kimball added a comment -

          Created MAPREDUCE-1127 (short) and HADOOP-6324 (long-term) issues; linked.

          Chris Douglas added a comment -

          > Should we associate this Jira issue with the short or long-term fix? If long, then we might make it depend on a FileSystem API change, to support progress callbacks.

          It's more of a use case/motivation for the FileSystem API changes. If DistCp were ported to use new APIs supporting progress updates, then this would be resolved as a side effect, right?

          Can this be resolved as a documentation issue? A patch to the DistCp guide can go into 0.21.

          Aaron Kimball added a comment -

          That seems reasonable. Here is some documentation to attach to branches 0.20, 0.21 explaining the workaround.
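
          (For context, the workaround discussed above is to raise the task timeout for the distcp job, for example: hadoop distcp -Dmapred.task.timeout=1800000 <src> s3n://<bucket>/<path>, where the value is in milliseconds, so 1800000 is 30 minutes.)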

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422721/MAPREDUCE-972.6.patch
          against trunk revision 826979.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/194/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/194/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/194/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/194/console

          This message is automatically generated.

          Chris Douglas added a comment -

          +1

          I committed this. Thanks, Aaron!


            People

            • Assignee: Aaron Kimball
            • Reporter: Aaron Kimball
            • Votes: 0
            • Watchers: 4
