Issue Details (XML | Word | Printable)

Key: HADOOP-3497
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Tom White
Reporter: Tom White
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

File globbing with a PathFilter is too restrictive

Created: 05/Jun/08 12:43 PM   Updated: 23/Apr/09 07:17 PM
Return to search
Component/s: fs
Affects Version/s: 0.17.0
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File hadoop-3497-test.patch 2008-06-05 12:47 PM Tom White 2 kB
Text File Licensed for inclusion in ASF works hadoop-3497-v2.patch 2008-08-13 01:11 PM Tom White 5 kB
Text File Licensed for inclusion in ASF works hadoop-3497-v3.patch 2008-09-17 04:47 PM Tom White 6 kB
Text File Licensed for inclusion in ASF works hadoop-3497.patch 2008-08-07 05:01 PM Tom White 6 kB

Hadoop Flags: Reviewed, Incompatible change
Release Note:
Changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.
Resolution Date: 04/Dec/08 02:12 PM


 Description  « Hide
Consider the file hierarchy
/a
/a/b

Calling the globStatus method on FileSystem with a path of

/*/*
and a PathFilter that only accepts /a/b returns no matches. It should return a single match: /a/b.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tom White added a comment - 05/Jun/08 12:47 PM
Here's a test that exposes the issue. The problem is that the testing is done per path component, so since the parent directory /a doesn't match the filter the whole path /a/b is rejected.

Tom White added a comment - 07/Aug/08 05:01 PM
Patch for review. I've removed the user filtering at each level in a path, instead the filtering is done on the full path name after the globbing step.

Hairong Kuang added a comment - 08/Aug/08 11:29 PM
Tom, nice change! Two comments:
1. I think the checking if paths are accepted by the filter or not (lines 932-936 in Filesystem.java) should be performed no matter last component of the path pattern has a pattern or not. Your patch does the checking only when last component does not have a pattern.
2. This patch probably should be marked as an incompatible change. The trunk implements a wrong semantics.

Tom White added a comment - 13/Aug/08 01:11 PM
New patch to address Hairong's comments.

Hairong Kuang added a comment - 13/Aug/08 04:56 PM
+1. The patch looks good to me.

Tom White added a comment - 15/Aug/08 08:55 PM
Resubmitting to hudson.

Hadoop QA added a comment - 17/Aug/08 07:17 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388141/hadoop-3497-v2.patch
against trunk revision 686420.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3064/console

This message is automatically generated.


Tom White added a comment - 27/Aug/08 09:37 AM
Retrying on hudson (it didn't run properly last time).

Hadoop QA added a comment - 27/Aug/08 04:23 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388141/hadoop-3497-v2.patch
against trunk revision 689380.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3122/console

This message is automatically generated.


Tom White added a comment - 17/Sep/08 11:55 AM
Cancelling while test failures are investigated.

Tom White added a comment - 17/Sep/08 04:47 PM
The test that is failing is TestFileInputFormatPathFilter#testWithPathFilterWithoutGlob. This creates files named a, b, aa, bb in a directory, then uses an input format with a filter that only accepts files whose last component is 1 character long. Only files a and b should match. The input path is the directory, not a glob path, and to work it relies on the following following behaviour of FileSystem#globStatus.

If you call FileSystem#globStatus(Path pathPattern, PathFilter filter) with a pathPattern that has a fixed (non-globbing) final component, then the status for that path will always be returned, regardless of the filter.

So, for a path /a which exists

fs.globStatus(new Path("/a"), new PathFilter() {
  @Override
  public boolean accept(Path path) {
    return false;
  }})

will return the status for /a, even though the filter rejects every path!

This seems wrong, and should really be changed. It has a potential impact on applications however, since a filter is now being applied that previously wasn't. Does this seem the right thing to do?

I've attached a patch which fixes the test.


Hairong Kuang added a comment - 02/Dec/08 10:23 PM
+1 on the change. Tom, could you please put the above semantic change in the release note?

Tom White added a comment - 04/Dec/08 02:02 PM
Successfully ran the unit tests and test-patch
     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.


Tom White added a comment - 04/Dec/08 02:12 PM
I've just committed this.

Hudson added a comment - 06/Dec/08 02:02 PM