Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-5546

race condition crashes "hadoop ls -R" when directories are moved/removed

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      This seems to be a rare race condition where we have a sequence of events like this:
      1. org.apache.hadoop.shell.Ls calls DFS#getFileStatus on directory D.
      2. someone deletes or moves directory D
      3. org.apache.hadoop.shell.Ls calls PathData#getDirectoryContents(D), which calls DFS#listStatus(D). This throws FileNotFoundException.
      4. ls command terminates with FNF

      1. HDFS-5546.1.patch
        1 kB
        Kousuke Saruta

        Activity

        Hide
        Colin Patrick McCabe added a comment -

        The best solution is probably to catch the FNF in #3, and simply not put that directory in the listing, since it doesn't exist by that name any more. I guess you could argue that we should re-list the parent directory in this case to make sure new stuff doesn't exist in it (like a renamed version of D), but that seems like it would open a difficult can of worms since we'd have arbitrary levels of backtracking. Also, we can't really know whether any of the work we had already done is still valid, since the names of directories could all have changed.

        Show
        Colin Patrick McCabe added a comment - The best solution is probably to catch the FNF in #3, and simply not put that directory in the listing, since it doesn't exist by that name any more. I guess you could argue that we should re-list the parent directory in this case to make sure new stuff doesn't exist in it (like a renamed version of D), but that seems like it would open a difficult can of worms since we'd have arbitrary levels of backtracking. Also, we can't really know whether any of the work we had already done is still valid, since the names of directories could all have changed.
        Hide
        Kousuke Saruta added a comment -

        I've tried to make a patch for this issue.
        How do you look that?

        Show
        Kousuke Saruta added a comment - I've tried to make a patch for this issue. How do you look that?
        Hide
        Colin Patrick McCabe added a comment -
        @@ -86,7 +87,11 @@ protected void processOptions(LinkedList<String> args)
           protected void processPathArgument(PathData item) throws IOException {
             // implicitly recurse once for cmdline directories
             if (dirRecurse && item.stat.isDirectory()) {
        -      recursePath(item);
        +      try {
        +        recursePath(item);
        +      } catch (FileNotFoundException e){
        +        displayError(e);
        +      } 
             } else {
               super.processPathArgument(item);
             }
        

        This will result in the first moved/removed file aborting the entire recursive ls with an error message. Basically, the same behavior as now, only with a process exit code of 0 rather than nonzero. That's not what we want.

        Show
        Colin Patrick McCabe added a comment - @@ -86,7 +87,11 @@ protected void processOptions(LinkedList< String > args) protected void processPathArgument(PathData item) throws IOException { // implicitly recurse once for cmdline directories if (dirRecurse && item.stat.isDirectory()) { - recursePath(item); + try { + recursePath(item); + } catch (FileNotFoundException e){ + displayError(e); + } } else { super .processPathArgument(item); } This will result in the first moved/removed file aborting the entire recursive ls with an error message. Basically, the same behavior as now, only with a process exit code of 0 rather than nonzero. That's not what we want.
        Hide
        Kousuke Saruta added a comment -

        I see, and I will try to modify that.

        Show
        Kousuke Saruta added a comment - I see, and I will try to modify that.
        Hide
        Colin Patrick McCabe added a comment -

        Thanks, Kousuke. I think the goal is to have it continue, ignoring the failure to stat that entry. This will be a bit tricky since when listing a single file, we can't ignore that error. It probably makes sense to print out a warning at the end, as well.

        Show
        Colin Patrick McCabe added a comment - Thanks, Kousuke. I think the goal is to have it continue, ignoring the failure to stat that entry. This will be a bit tricky since when listing a single file, we can't ignore that error. It probably makes sense to print out a warning at the end, as well.

          People

          • Assignee:
            Kousuke Saruta
            Reporter:
            Colin Patrick McCabe
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development