Hadoop Common
HADOOP-157

job fails because pendingCreates is not cleaned up after a task fails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.1
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

      Description

      When a task fails under map/reduce and the client doesn't abandon its in-progress files (usually because it was killed), the lease on the name node lasts 1 minute. During that minute, I see 3 backup copies of the task fail because pendingCreates is non-null.
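      The failure mode described above can be sketched with a small model. This is only an illustration, not the name node's actual code: the class PendingCreateTable, the method tryCreate, and the rule that any unexpired entry blocks a new create are assumptions made for the example. Only the one-minute lease comes from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a name node's table of files that are being created. */
class PendingCreateTable {
    /** One minute, the lease duration mentioned in the description. */
    static final long LEASE_MILLIS = 60_000L;

    /** Records who is creating a file and when the lease was last renewed. */
    private static final class Entry {
        final String holder;
        final long renewedAt;

        Entry(String holder, long renewedAt) {
            this.holder = holder;
            this.renewedAt = renewedAt;
        }
    }

    private final Map<String, Entry> pendingCreates = new HashMap<>();

    /**
     * Returns true if the create is accepted, or false if an earlier
     * create on the same path still holds an unexpired lease.
     */
    boolean tryCreate(String path, String holder, long nowMillis) {
        Entry existing = pendingCreates.get(path);
        if (existing != null && nowMillis - existing.renewedAt < LEASE_MILLIS) {
            // The entry left behind by the killed task still blocks the create.
            return false;
        }
        // No entry, or the old lease has expired: the new client takes over.
        pendingCreates.put(path, new Entry(holder, nowMillis));
        return true;
    }
}
```

      In this model, a backup task that calls tryCreate within 60 seconds of the killed task's create is rejected, which matches the repeated failures reported above.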

        Activity

        Owen O'Malley created issue
        Owen O'Malley added a comment:

        This patch improves failure reporting.

        1. I created an org.apache.hadoop.ipc.RemoteException class that includes the class name of the exception that was the cause.
        2. The IPC client now throws this RemoteException rather than java.rmi.RemoteException.
        3. DFSClient.create waits and retries if the file is already being created.
        4. Killed tasks no longer complain when their process exits with a non-zero code.
        5. Improved the error message when tasks are killed for not updating their progress.
        6. DFS's ClientProtocol.addBlock now takes the client name rather than the client machine.
        7. Problems renewing DFS leases are now logged.
        8. Exception messages include more detail when DFSClient.create fails.
        9. addBlock now checks that the client adding to the file is the one that owns the lease.
        10. FileUnderConstruction now records who is creating the file.
        11. Added some new exception classes for problems that DFSClient wants to catch.
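
        Items 1 through 3 can be sketched together: an exception that carries the remote cause's class name lets the client decide whether a failed create is worth retrying. The code below is a rough illustration under stated assumptions, not the patch itself. SketchRemoteException, CreateCall, CreateRetry, the "AlreadyBeingCreatedException" name, and the attempt and sleep parameters are all hypothetical and are not taken from the Hadoop source.

```java
import java.io.IOException;

/** Illustrative stand-in for an IPC exception that records the server-side exception class. */
class SketchRemoteException extends IOException {
    private final String className;

    SketchRemoteException(String className, String message) {
        super(message);
        this.className = className;
    }

    /** Name of the exception class that caused the failure on the remote side. */
    String getClassName() {
        return className;
    }
}

/** Hypothetical remote call that creates a file. */
interface CreateCall {
    void create(String path) throws IOException;
}

class CreateRetry {
    /** Hypothetical name for the "file is already being created" condition. */
    static final String ALREADY_BEING_CREATED = "AlreadyBeingCreatedException";

    /**
     * Calls create, sleeping and retrying only when the remote cause is
     * ALREADY_BEING_CREATED. Returns the number of attempts used.
     */
    static int createWithRetry(CreateCall call, String path, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                call.create(path);
                return attempt;
            } catch (SketchRemoteException e) {
                boolean retryable = ALREADY_BEING_CREATED.equals(e.getClassName());
                if (!retryable || attempt >= maxAttempts) {
                    throw e; // a different problem, or out of attempts
                }
                Thread.sleep(sleepMillis); // give the old lease time to expire
            }
        }
    }
}
```

        Matching on the class name rather than on the message text keeps the retry decision independent of how the error message is worded.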

        Owen O'Malley made changes -
        Field Original Value New Value
        Attachment pending-creates-wait.patch [ 12325679 ]
        Doug Cutting added a comment:

        I just committed this. Thanks, Owen!

        Doug Cutting made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Doug Cutting made changes -
        Workflow jira [ 12361404 ] no reopen closed [ 12373085 ]
        Doug Cutting made changes -
        Workflow no reopen closed [ 12373085 ] no-reopen-closed [ 12373421 ]
        Doug Cutting made changes -
        Workflow no-reopen-closed [ 12373421 ] no-reopen-closed, patch-avail [ 12377732 ]
        Owen O'Malley made changes -
        Component/s dfs [ 12310710 ]

          People

          • Assignee:
            Owen O'Malley
            Reporter:
            Owen O'Malley
          • Votes:
            0
            Watchers:
            0
