Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5968

Work directory is not deleted when downloadCacheObject throws IOException

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1
    • Fix Version/s: 1.3.0
    • Component/s: mrv1
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Work directory is not deleted in DistCache if Exception happen in downloadCacheObject. In downloadCacheObject, the cache file will be copied to temporarily work directory first, then the work directory will be renamed to the final directory. If IOException happens during the copy, the work directory will not be deleted. This will cause garbage data left in local disk cache. For example If the MR application use Distributed Cache to send a very large Archive/file(50G), if the disk is full during the copy, then the IOException will be triggered, the work directory will be not deleted or renamed and the work directory will occupy a big chunk of disk space.

      1. MAPREDUCE-5968.branch1_new.patch
        7 kB
        zhihai xu
      2. MAPREDUCE-5968.branch1_new1.patch
        7 kB
        zhihai xu
      3. MAPREDUCE-5968.branch1.patch
        7 kB
        zhihai xu

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        11m 40s 1 zhihai xu 11/Jul/14 21:13
        Patch Available Patch Available Resolved Resolved
        24d 10h 21m 1 Karthik Kambatla (Inactive) 05/Aug/14 07:35
        Karthik Kambatla (Inactive) made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s 1.3.0 [ 12324153 ]
        Resolution Fixed [ 1 ]
        Hide
        Karthik Kambatla (Inactive) added a comment -

        Thanks for the contribution, zhihai xu. Just committed this to branch-1.

        Show
        Karthik Kambatla (Inactive) added a comment - Thanks for the contribution, zhihai xu . Just committed this to branch-1.
        Karthik Kambatla (Inactive) made changes -
        Summary Work directory is not deleted in DistCache if Exception happen in downloadCacheObject. Work directory is not deleted when downloadCacheObject throws IOException
        Hide
        Karthik Kambatla (Inactive) added a comment -

        Latest patch looks good to me. +1. Committing this.

        Show
        Karthik Kambatla (Inactive) added a comment - Latest patch looks good to me. +1. Committing this.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12659787/MAPREDUCE-5968.branch1_new1.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4787//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12659787/MAPREDUCE-5968.branch1_new1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4787//console This message is automatically generated.
        Hide
        zhihai xu added a comment -

        I just upload a new patch MAPREDUCE-5968.branch1_new1.patch to remove the duplicate localFs.delete(workDir, true);.
        thanks Karthik for your review.

        Show
        zhihai xu added a comment - I just upload a new patch MAPREDUCE-5968 .branch1_new1.patch to remove the duplicate localFs.delete(workDir, true);. thanks Karthik for your review.
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1_new1.patch [ 12659787 ]
        Hide
        zhihai xu added a comment -

        Hi Karthik,
        I attached new patch to address these issues. please review it.

        thanks
        zhihai

        Show
        zhihai xu added a comment - Hi Karthik, I attached new patch to address these issues. please review it. thanks zhihai
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1_new.patch [ 12659779 ]
        Hide
        zhihai xu added a comment -

        thanks for the comments. these are good findings.
        1. Based on your suggestion, I can optimize the code: remove delWorkDir, check whether it exist before delete the work dir in final block.
        2. define "work" as a constant in the class, so it can be reused by both production and test code.

        Show
        zhihai xu added a comment - thanks for the comments. these are good findings. 1. Based on your suggestion, I can optimize the code: remove delWorkDir, check whether it exist before delete the work dir in final block. 2. define " work " as a constant in the class, so it can be reused by both production and test code.
        Hide
        Karthik Kambatla (Inactive) added a comment -

        Patch looks mostly good. A couple of minor comments:

        1. Do we need to set delWorkDir to false at both places? The latter is always executed and the former can be skipped.
          +      // promote the output to the final location
          +      if (!localFs.rename(workDir, finalDir)) {
          +        localFs.delete(workDir, true);
          +        delWorkDir = false;
          +        if (!localFs.exists(finalDir)) {
          +          throw new IOException("Failed to promote distributed cache object " +
          +                                workDir + " to " + finalDir);
          +        }
          +        // someone else promoted first
          +        return 0;
          +      }
          +      delWorkDir = false;
          
        2. I understand the "work" comes from how work directory name is generated. Can we create the work directory name in a method that can be accessed from both the production and test code so the test continues to be useful in the future.
          +    String workDir = destination.getParent().toString() + "-work-";
          
        Show
        Karthik Kambatla (Inactive) added a comment - Patch looks mostly good. A couple of minor comments: Do we need to set delWorkDir to false at both places? The latter is always executed and the former can be skipped. + // promote the output to the final location + if (!localFs.rename(workDir, finalDir)) { + localFs.delete(workDir, true ); + delWorkDir = false ; + if (!localFs.exists(finalDir)) { + throw new IOException( "Failed to promote distributed cache object " + + workDir + " to " + finalDir); + } + // someone else promoted first + return 0; + } + delWorkDir = false ; I understand the " work " comes from how work directory name is generated. Can we create the work directory name in a method that can be accessed from both the production and test code so the test continues to be useful in the future. + String workDir = destination.getParent().toString() + "-work-" ;
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12658867/MAPREDUCE-5968.branch1.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4780//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658867/MAPREDUCE-5968.branch1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4780//console This message is automatically generated.
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1.patch [ 12658867 ]
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1.patch [ 12658808 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12658808/MAPREDUCE-5968.branch1.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4779//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658808/MAPREDUCE-5968.branch1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4779//console This message is automatically generated.
        Hide
        zhihai xu added a comment -

        Add a unit test to verify the workDir is removed using a Test File System to create a fake IOException to trigger this error condition in this issue. Without the fix in downloadCacheObject, the unit test will fail.

        Show
        zhihai xu added a comment - Add a unit test to verify the workDir is removed using a Test File System to create a fake IOException to trigger this error condition in this issue. Without the fix in downloadCacheObject, the unit test will fail.
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1.patch [ 12658808 ]
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1.patch [ 12655286 ]
        Karthik Kambatla (Inactive) made changes -
        Affects Version/s 1.2.1 [ 12324149 ]
        Affects Version/s 0.23.10 [ 12324662 ]
        Target Version/s 1.3.0 [ 12324153 ]
        zhihai xu made changes -
        Affects Version/s 0.23.10 [ 12324662 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12655286/MAPREDUCE-5968.branch1.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4729//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655286/MAPREDUCE-5968.branch1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4729//console This message is automatically generated.
        zhihai xu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        zhihai xu made changes -
        Attachment MAPREDUCE-5968.branch1.patch [ 12655286 ]
        zhihai xu made changes -
        Field Original Value New Value
        Description Work directory is not deleted in DistCache if Exception happen in downloadCacheObject. In downloadCacheObject, the cache file will be copied to temporarily work directory first, then the work directory will be renamed to the final directory. If IOException happens during the copy, the
         work directory will not be deleted. This will cause garbage data left in local disk cache. For example If the MR application use Distributed Cache to send a very large Archive/file(50G), if the disk is full during the copy, then the IOException will be triggered, the work directory will be not deleted or renamed and occupy a big chunk of disk space.

        Work directory is not deleted in DistCache if Exception happen in downloadCacheObject. In downloadCacheObject, the cache file will be copied to temporarily work directory first, then the work directory will be renamed to the final directory. If IOException happens during the copy, the work directory will not be deleted. This will cause garbage data left in local disk cache. For example If the MR application use Distributed Cache to send a very large Archive/file(50G), if the disk is full during the copy, then the IOException will be triggered, the work directory will be not deleted or renamed and the work directory will occupy a big chunk of disk space.

        zhihai xu created issue -

          People

          • Assignee:
            zhihai xu
            Reporter:
            zhihai xu
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development