Hadoop Common
HADOOP-3497

File globbing with a PathFilter is too restrictive

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.20.0
    • Component/s: fs
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.

      Description

      Consider the file hierarchy

      /a
      /a/b
      

      Calling the globStatus method on FileSystem with a path of

      /*/*

      and a PathFilter that only accepts /a/b returns no matches. It should return a single match: /a/b.

      1. hadoop-3497-test.patch
        2 kB
        Tom White
      2. hadoop-3497.patch
        6 kB
        Tom White
      3. hadoop-3497-v2.patch
        5 kB
        Tom White
      4. hadoop-3497-v3.patch
        6 kB
        Tom White

        Activity

        Tom White created issue -
        Tom White added a comment -

        Here's a test that exposes the issue. The problem is that the filter is tested per path component: since the parent directory /a doesn't match the filter, the whole path /a/b is rejected.

        Tom White made changes -
        Field Original Value New Value
        Attachment hadoop-3497-test.patch [ 12383456 ]
        Tom White added a comment -

        Patch for review. I've removed the user filtering at each level in a path; instead, the filtering is done on the full path name after the globbing step.
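The difference between the two strategies can be illustrated with a small self-contained model. This is only a sketch, not the FileSystem code: paths are plain strings, the tree is the /a, /a/b hierarchy from the description, a glob made only of * components is represented by its depth, and the names GlobFilterModel, globPerComponent and globThenFilter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: not Hadoop's FileSystem implementation.
class GlobFilterModel {
    // The hierarchy from the issue description.
    static final List<String> TREE = Arrays.asList("/a", "/a/b");

    // Number of components in an absolute path: "/a" -> 1, "/a/b" -> 2.
    static int depth(String path) {
        return path.split("/").length - 1;
    }

    // Parent of an absolute path: "/a/b" -> "/a", "/a" -> "/".
    static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    // Old behaviour (as described): the user filter is applied at every level,
    // so a rejected parent prunes all of its children.
    static List<String> globPerComponent(int levels, Predicate<String> filter) {
        List<String> current = Collections.singletonList("/");
        for (int level = 1; level <= levels; level++) {
            List<String> next = new ArrayList<>();
            for (String p : TREE) {
                if (depth(p) == level && current.contains(parent(p)) && filter.test(p)) {
                    next.add(p);
                }
            }
            current = next;
        }
        return current;
    }

    // New behaviour (as described): expand the glob first, then filter full paths.
    static List<String> globThenFilter(int levels, Predicate<String> filter) {
        List<String> matches = new ArrayList<>();
        for (String p : TREE) {
            if (depth(p) == levels && filter.test(p)) {
                matches.add(p);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Predicate<String> onlyAB = p -> p.equals("/a/b");
        // "/*/*" has two wildcard components, hence levels = 2.
        System.out.println(globPerComponent(2, onlyAB)); // []
        System.out.println(globThenFilter(2, onlyAB));   // [/a/b]
    }
}
```

In the per-component variant /a is dropped at the first level because the filter rejects it, so /a/b is never considered. Filtering after expansion tests only the full path /a/b.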

        Tom White made changes -
        Attachment hadoop-3497.patch [ 12387747 ]
        Hairong Kuang added a comment -

        Tom, nice change! Two comments:
        1. I think the check for whether paths are accepted by the filter (lines 932-936 in FileSystem.java) should be performed whether or not the last component of the path pattern contains a glob pattern. Your patch performs the check only when the last component does not contain one.
        2. This patch should probably be marked as an incompatible change, since trunk implements the wrong semantics.

        Tom White made changes -
        Hadoop Flags [Incompatible change]
        Release Note Changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.
        Tom White added a comment -

        New patch to address Hairong's comments.

        Tom White made changes -
        Attachment hadoop-3497-v2.patch [ 12388141 ]
        Tom White made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hairong Kuang added a comment -

        +1. The patch looks good to me.

        Tom White made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Tom White added a comment -

        Resubmitting to hudson.

        Tom White made changes -
        Assignee Tom White [ tomwhite ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12388141/hadoop-3497-v2.patch
        against trunk revision 686420.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/console

        This message is automatically generated.

        Tom White made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Tom White added a comment -

        Retrying on hudson (it didn't run properly last time).

        Tom White made changes -
        Fix Version/s 0.19.0 [ 12313211 ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12388141/hadoop-3497-v2.patch
        against trunk revision 689380.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/console

        This message is automatically generated.

        Tom White added a comment -

        Cancelling while test failures are investigated.

        Tom White made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Fix Version/s 0.19.0 [ 12313211 ]
        Tom White added a comment -

        The failing test is TestFileInputFormatPathFilter#testWithPathFilterWithoutGlob. It creates files named a, b, aa, bb in a directory, then uses an input format with a filter that only accepts files whose last component is 1 character long. Only files a and b should match. The input path is the directory, not a glob path, and to work the test relies on the following behaviour of FileSystem#globStatus.

        If you call FileSystem#globStatus(Path pathPattern, PathFilter filter) with a pathPattern that has a fixed (non-globbing) final component, then the status for that path will always be returned, regardless of the filter.

        So, for a path /a which exists

        fs.globStatus(new Path("/a"), new PathFilter() {
          @Override
          public boolean accept(Path path) {
            return false;
          }})
        

        will return the status for /a, even though the filter rejects every path!
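A minimal model of this special case, assuming only what the comment states: paths are strings, /a is the only existing path, and the names FixedPathModel, oldGlobFixed and newGlobFixed are hypothetical stand-ins rather than Hadoop methods. The old variant never consults the filter for a fixed final component; the proposed variant does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Toy model only: not Hadoop's FileSystem implementation.
class FixedPathModel {
    // Paths assumed to exist in the model file system.
    static final Set<String> EXISTING = new HashSet<>(Arrays.asList("/a"));

    // Old behaviour (as described): for a fixed final component the filter is ignored.
    static List<String> oldGlobFixed(String path, Predicate<String> filter) {
        return EXISTING.contains(path)
                ? Collections.singletonList(path)
                : Collections.<String>emptyList();
    }

    // Proposed behaviour: the filter is applied to fixed paths as well.
    static List<String> newGlobFixed(String path, Predicate<String> filter) {
        return EXISTING.contains(path) && filter.test(path)
                ? Collections.singletonList(path)
                : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        Predicate<String> rejectAll = p -> false;
        System.out.println(oldGlobFixed("/a", rejectAll)); // [/a]
        System.out.println(newGlobFixed("/a", rejectAll)); // []
    }
}
```

Under the second variant, a caller that passes a directory as a fixed path has the directory itself tested by the filter, which would be consistent with the test failure described above.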

        This seems wrong, and should really be changed. However, the change has a potential impact on applications, since a filter is now applied where previously it wasn't. Does this seem like the right thing to do?

        I've attached a patch which fixes the test.

        Tom White made changes -
        Attachment hadoop-3497-v3.patch [ 12390290 ]
        Tom White made changes -
        Fix Version/s 0.20.0 [ 12313438 ]
        Tom White made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hairong Kuang added a comment -

        +1 on the change. Tom, could you please put the above semantic change in the release note?

        Tom White added a comment -

        Successfully ran the unit tests and test-patch

             [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        
        
        Tom White added a comment -

        I've just committed this.

        Tom White made changes -
        Hadoop Flags [Incompatible change] [Incompatible change, Reviewed]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Hadoop-trunk #680 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/ )
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Tom White
            Reporter:
            Tom White
          • Votes:
            0
            Watchers:
            2
