Hadoop Common / HADOOP-8640

DU thread transient failures propagate to callers


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0-alpha, 1.2.1
    • Fix Version/s: None
    • Component/s: fs, io
    • Labels: None

    Description

      When running some stress tests, I saw a failure where the DURefreshThread failed due to the filesystem changing underneath it:

      org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/4/dfs/dn/current/BP-1928785663-172.20.90.20-1343880685858/current/rbw/blk_4637779214690837894': No such file or directory
      

      (The block was probably finalized while the du process was running, which caused du to fail.)

      The next block write then called getUsed(), and the stored exception was propagated to it, causing the write to fail. Since this was a pseudo-distributed cluster, the client had no other node to write to, so the client's write failed as well.

      The current behavior of propagating the exception to the next (and only the next) caller doesn't seem well-thought-out.
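
The behavior described above can be illustrated with a minimal sketch. This is not the Hadoop DU source: the class name DuSketch, the UsageProbe interface, and refreshOnce() are hypothetical names introduced only for illustration, and the sketch assumes the refresh thread stores a failure that the next getUsed() call rethrows and clears.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the pattern described in this issue; not the real DU class.
class DuSketch {
    /** Hypothetical stand-in for running the external `du` process. */
    interface UsageProbe {
        long measure() throws IOException;
    }

    private final UsageProbe probe;
    // Last successfully measured usage, in bytes.
    private final AtomicLong used = new AtomicLong();
    // Failure recorded by the background refresh, if any.
    private final AtomicReference<IOException> pendingFailure = new AtomicReference<>();

    DuSketch(UsageProbe probe) {
        this.probe = probe;
    }

    /** One iteration of the (assumed) refresh loop: measure, or remember the failure. */
    void refreshOnce() {
        try {
            used.set(probe.measure());
        } catch (IOException e) {
            // A transient error (e.g. a file vanished mid-scan) is stored here...
            pendingFailure.set(e);
        }
    }

    /** Returns the cached usage, unless a refresh failure is pending. */
    long getUsed() throws IOException {
        // ...and handed to exactly one caller, then cleared.
        IOException e = pendingFailure.getAndSet(null);
        if (e != null) {
            throw e;
        }
        return used.get();
    }
}
```

In this sketch, a caller that happens to arrive right after a failed refresh gets the exception, while the caller after that sees the last good value again, which is the "next (and only the next) caller" behavior the description questions.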

            People

              Assignee: Unassigned
              Reporter: Todd Lipcon (tlipcon)
              Votes: 3
              Watchers: 16
