
Key: HADOOP-2055
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Alejandro Abdelnur
Reporter: Alejandro Abdelnur
Votes: 0
Watchers: 1
Hadoop Common

JobConf should have a setInputPathFilter method

Created: 15/Oct/07 10:55 AM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
patch2055.txt (Text File, 8 kB), attached 2008-03-26 10:05 AM by Alejandro Abdelnur. Licensed for inclusion in ASF works.
Environment: all

Hadoop Flags: Reviewed
Release Note:
This issue lets users specify which paths in the job input directory to ignore during processing (in addition to filenames that start with "_" or "."). It defines two new APIs: FileInputFormat.setInputPathFilter(JobConf, PathFilter) and FileInputFormat.getInputPathFilter(JobConf).
Resolution Date: 28/Mar/08 12:44 PM


Description
It should be possible to set a PathFilter for the input to avoid taking certain files as input data within the input directories.

Owen O'Malley added a comment - 15/Oct/07 03:53 PM
This should be a static method on the FileInputFormat instead of JobConf, since it won't affect the framework, but only the FileInputFormat's behavior.

Owen O'Malley added a comment - 15/Oct/07 03:55 PM
There should probably also be a getter. Most such methods look like:
public static void setInputPathFilter(JobConf job, PathFilter filter);
public static PathFilter getInputPathFilter(JobConf job);

Alejandro Abdelnur added a comment - 16/Oct/07 04:07 PM
Having a static method on FileInputFormat would make it difficult for an application that dispatches Hadoop jobs (e.g. a webapp) to set filters on a per-job basis.

IMO, it should be configurable at job level.


Doug Cutting added a comment - 16/Oct/07 05:10 PM
> IMO, it should be configurable at job level.

Please look more closely at the static methods Owen suggested. The job is a parameter.
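
The point about the job being a parameter can be illustrated with a minimal sketch. It is not the committed implementation: the JobConf and PathFilter types below are simplified stand-ins so the example runs without Hadoop on the classpath, and the property name and the FileInputFormatSketch, SkipCrc and PerJobFilterDemo classes are hypothetical. The setter takes the filter class, following the form proposed later in this thread. The static methods hold no state of their own; everything is stored in the job that is passed in, so each job gets its own filter.

```java
import java.util.HashMap;
import java.util.Map;

class PerJobFilterDemo {
    public static void main(String[] args) {
        JobConf jobA = new JobConf();
        JobConf jobB = new JobConf();
        // Only jobA gets a filter; jobB is left untouched.
        FileInputFormatSketch.setInputPathFilter(jobA, SkipCrc.class);
        System.out.println(FileInputFormatSketch.getInputPathFilter(jobA) != null); // true
        System.out.println(FileInputFormatSketch.getInputPathFilter(jobB) == null); // true
    }
}

// Stand-in for Hadoop's PathFilter (simplified to take a String).
interface PathFilter {
    boolean accept(String path);
}

// Stand-in for JobConf: a per-job bag of string properties.
class JobConf {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

class FileInputFormatSketch {
    // Property name assumed for illustration only.
    static final String FILTER_KEY = "example.input.pathFilter.class";

    // Static, but the setting is recorded in the given job's configuration.
    static void setInputPathFilter(JobConf job, Class<? extends PathFilter> filter) {
        job.set(FILTER_KEY, filter.getName());
    }

    // Returns a new filter instance for this job, or null if none was set.
    static PathFilter getInputPathFilter(JobConf job) {
        String name = job.get(FILTER_KEY);
        if (name == null) {
            return null;
        }
        try {
            return Class.forName(name).asSubclass(PathFilter.class)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + name, e);
        }
    }
}

// Example filter with a no-argument constructor, as reflection requires.
class SkipCrc implements PathFilter {
    public boolean accept(String path) { return !path.endsWith(".crc"); }
}
```

Storing the class name rather than an instance keeps the setting serializable along with the rest of the job configuration, at the cost of requiring a no-argument constructor on the filter.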


eric baldeschwieler added a comment - 17/Oct/07 11:19 AM
We support globbing in input paths now. Doesn't that address this need?

E.g. *.foo


Alejandro Abdelnur added a comment - 23/Oct/07 11:40 PM
Owen, Doug, I got the static methods thing; that would work.

Eric, using wildcards would not work, as it allows you to say what you want to include, but not what you don't want to include.

For example, I may have some files, similar to the CRC files, that track other types of information, and I would like to skip them.
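
As a rough illustration of such an exclusion filter, here is a minimal sketch. The PathFilter interface is a simplified stand-in so the example runs without Hadoop, and the SuffixExclusionFilter class and the ".meta" suffix are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExclusionFilterDemo {
    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "part-00000", "part-00000.meta", "part-00001", "part-00001.meta");
        PathFilter skipMeta = new SuffixExclusionFilter(".meta");
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (skipMeta.accept(name)) {
                kept.add(name);
            }
        }
        System.out.println(kept); // [part-00000, part-00001]
    }
}

// Stand-in for Hadoop's PathFilter (simplified to take a file name).
interface PathFilter {
    boolean accept(String name);
}

// Accepts everything except names ending with the given suffix.
class SuffixExclusionFilter implements PathFilter {
    private final String suffix;

    SuffixExclusionFilter(String suffix) {
        this.suffix = suffix;
    }

    public boolean accept(String name) {
        return !name.endsWith(suffix);
    }
}
```

A glob such as *.foo names what to take in; the filter above states only what to leave out, without having to enumerate the data files.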


Alejandro Abdelnur made changes - 25/Mar/08 10:19 AM
Field Original Value New Value
Assignee Alejandro Abdelnur [ tucu00 ]
Alejandro Abdelnur added a comment - 25/Mar/08 10:33 AM
I've figured out (IMO) a cleaner way of implementing this feature:

Adding the following 2 instance methods to the JobConf:

  • void setInputPathFilter(Class<? extends PathFilter> pathFilter);
  • PathFilter getInputPathFilter();

Modifying the FileInputFormat's listPaths() method to apply the hiddenFileFilter and (if set) the filter set in the jobconf.

Globbing still works for pattern-based inclusion, even if a path filter is set.

Being able to specify a custom PathFilter makes it possible to create more complex filters, such as exclusion filters and selections that cannot be expressed with a pattern.
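
The filtering step described above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the patch itself: PathFilter is a stand-in interface over file names, InputListing and ListPathsDemo are hypothetical, and the listing works on an in-memory list instead of a file system. The hidden-file rule (names starting with "_" or ".") is always applied; the job's filter is applied only if one is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListPathsDemo {
    public static void main(String[] args) {
        List<String> in = Arrays.asList(
                "part-00000", "_logs", ".part-00000.crc", "part-00000.meta");
        // No user filter: only hidden files are dropped.
        System.out.println(InputListing.listPaths(in, null)); // [part-00000, part-00000.meta]
        // With a user filter that excludes ".meta" files.
        System.out.println(InputListing.listPaths(in, n -> !n.endsWith(".meta"))); // [part-00000]
    }
}

// Stand-in for Hadoop's PathFilter (simplified to take a file name).
interface PathFilter {
    boolean accept(String name);
}

class InputListing {
    // Always applied: skip names that start with "_" or ".".
    static final PathFilter HIDDEN_FILE_FILTER =
            name -> !name.startsWith("_") && !name.startsWith(".");

    // Keeps a name only if it passes the hidden-file filter and,
    // when a user filter is set (non-null), that filter as well.
    static List<String> listPaths(List<String> candidates, PathFilter userFilter) {
        List<String> result = new ArrayList<>();
        for (String name : candidates) {
            if (!HIDDEN_FILE_FILTER.accept(name)) {
                continue;
            }
            if (userFilter != null && !userFilter.accept(name)) {
                continue;
            }
            result.add(name);
        }
        return result;
    }
}
```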


Alejandro Abdelnur made changes - 25/Mar/08 10:36 AM
Fix Version/s 0.17.0 [ 12312913 ]
Summary JobConf should have a setInputPathFilter(PathFilter filter) method JobConf should have a setInputPathFilter method
Alejandro Abdelnur made changes - 25/Mar/08 10:37 AM
Attachment patch2055.txt [ 12378554 ]
Alejandro Abdelnur made changes - 25/Mar/08 10:42 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Devaraj Das made changes - 25/Mar/08 11:01 AM
Component/s mapred [ 12310690 ]
Hadoop QA added a comment - 25/Mar/08 05:17 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378554/patch2055.txt
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/console

This message is automatically generated.


Alejandro Abdelnur made changes - 26/Mar/08 10:02 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Alejandro Abdelnur added a comment - 26/Mar/08 10:05 AM
Refactored the patch per Owen's suggestion, as the functionality is specific to file InputFormats.

Alejandro Abdelnur made changes - 26/Mar/08 10:05 AM
Attachment patch2055.txt [ 12378623 ]
Alejandro Abdelnur made changes - 26/Mar/08 10:05 AM
Attachment patch2055.txt [ 12378554 ]
Alejandro Abdelnur made changes - 26/Mar/08 10:06 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 26/Mar/08 12:34 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378623/patch2055.txt
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/console

This message is automatically generated.


Devaraj Das added a comment - 28/Mar/08 12:44 PM
I just committed this. Thanks, Alejandro!

Devaraj Das made changes - 28/Mar/08 12:44 PM
Resolution Fixed [ 1 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Repository Revision Date User Message
ASF #642211 Fri Mar 28 12:45:15 UTC 2008 ddas HADOOP-2055. Allows users to set PathFilter on the FileInputFormat. Contributed by Alejandro Abdelnur.
Files Changed
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/FileInputFormat.java

Repository Revision Date User Message
ASF #642260 Fri Mar 28 15:26:47 UTC 2008 ddas HADOOP-2055. Adding the testcase missed in the earlier commit of this issue.
Files Changed
ADD /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/TestFileInputFormatPathFilter.java

Hudson added a comment - 29/Mar/08 12:07 PM

Devaraj Das made changes - 17/Apr/08 06:06 AM
Release Note This issue lets users specify which paths in the job input directory to ignore during processing (in addition to filenames that start with "_" or "."). It defines two new APIs: FileInputFormat.setInputPathFilter(JobConf, PathFilter) and FileInputFormat.getInputPathFilter(JobConf).
Hadoop Flags [Reviewed]
Nigel Daley made changes - 21/May/08 08:05 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:52 PM
Component/s mapred [ 12310690 ]