Hadoop Common / HADOOP-8640

DU thread transient failures propagate to callers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0-alpha, 1.2.1
    • Fix Version/s: None
    • Component/s: fs, io
    • Labels:
      None

      Description

      When running some stress tests, I saw a failure where the DURefreshThread failed due to the filesystem changing underneath it:

      org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/4/dfs/dn/current/BP-1928785663-172.20.90.20-1343880685858/current/rbw/blk_4637779214690837894': No such file or directory
      

      (The block was probably finalized while the du process was running, so du could no longer find the file under rbw and exited with an error.)

      The next block write then called getUsed(), and the exception was propagated to that caller, causing the write to fail. Because this was a pseudo-distributed cluster, the client had no other node to write to, so the client operation failed as well.

      The current behavior of propagating the exception to the next (and only the next) caller doesn't seem well-thought-out.
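
      To make the behavior concrete, here is a minimal sketch of the pattern described above. It is not the actual Hadoop DU code: the class UsageCache and all of its method names are hypothetical, invented for illustration. getUsedPropagating() models the reported behavior (a failure recorded by the background refresh is thrown to the next caller, and only the next caller), while getUsedTolerant() shows one possible alternative, assumed here rather than proposed in this issue, that keeps serving the last successfully measured value.

```java
import java.io.IOException;

/**
 * Hypothetical illustration only; this is NOT org.apache.hadoop.fs.DU.
 * It models a cached disk-usage value that a background thread refreshes.
 */
class UsageCache {
    private long lastGoodUsed;          // value from the last successful refresh
    private IOException pendingFailure; // failure from the latest refresh, if any

    UsageCache(long initialUsed) {
        this.lastGoodUsed = initialUsed;
    }

    /** Called by the (hypothetical) refresh thread when du succeeds. */
    synchronized void recordSuccess(long used) {
        lastGoodUsed = used;
        pendingFailure = null;
    }

    /** Called by the (hypothetical) refresh thread when du fails. */
    synchronized void recordFailure(IOException e) {
        pendingFailure = e;
    }

    /**
     * Behavior described in this report: the stored failure is thrown to
     * the next caller and then cleared, so later callers succeed again.
     */
    synchronized long getUsedPropagating() throws IOException {
        if (pendingFailure != null) {
            IOException e = pendingFailure;
            pendingFailure = null;
            throw e;
        }
        return lastGoodUsed;
    }

    /**
     * One possible alternative (an assumption, not a fix from this issue):
     * ignore transient refresh failures and return the last good value.
     */
    synchronized long getUsedTolerant() {
        return lastGoodUsed;
    }
}
```

      In this sketch, after a single recordFailure(...), the first getUsedPropagating() call throws and the second returns the stale value, which mirrors the "next (and only the next) caller" behavior. The tolerant variant trades accuracy for availability: a persistent du failure would go unnoticed unless it is logged or counted elsewhere.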

              People

              • Assignee: Unassigned
              • Reporter: Todd Lipcon (tlipcon)
              • Votes: 3
              • Watchers: 17
