Hadoop Map/Reduce
MAPREDUCE-1501

FileInputFormat to support multi-level/recursive directory listing

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.1
    • Component/s: None
    • Hadoop Flags: Reviewed

      Description

      As we have seen multiple times in the mailing list, users want to have the capability of getting all files out of a multi-level directory structure.

      4/1/2008: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3Ce75c02ef0804011433x144813e6x2450da7883de3aca@mail.gmail.com%3E

      2/3/2009: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3C7F80089C-3E7F-4330-90BA-6F1C5B0B0F3F@nist.gov%3E

      6/2/2009: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3C4A258A16.8050300@darose.net%3E

      One solution that our users had is to write a new FileInputFormat, but that means all existing FileInputFormat subclasses need to be changed in order to support this feature.

      We can easily provide a JobConf option (defaulting to false) that makes FileInputFormat.listStatus(...) recurse into the directory structure.
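
As a rough illustration of the proposal, the sketch below shows how a boolean flag can switch a listing between one level and full recursion. It is not the code from the patch: the `Entry` class is a hypothetical in-memory stand-in for a file system entry, and `listStatus` here only borrows the name of the Hadoop method. The sketch also assumes that the non-recursive mode returns subdirectories as plain entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecursiveListingSketch {

    /** Hypothetical stand-in for a file system entry; not a Hadoop type. */
    static final class Entry {
        final String name;
        final boolean isDir;
        final List<Entry> children;

        private Entry(String name, boolean isDir, List<Entry> children) {
            this.name = name;
            this.isDir = isDir;
            this.children = children;
        }

        static Entry file(String name) {
            return new Entry(name, false, new ArrayList<Entry>());
        }

        static Entry dir(String name, Entry... children) {
            return new Entry(name, true, Arrays.asList(children));
        }
    }

    /**
     * Lists the contents of dir. With recursive == false (the proposed
     * default) only the first level is returned, subdirectories included.
     * With recursive == true, subdirectories are descended into and only
     * the entries found inside them are returned.
     */
    static List<String> listStatus(String path, Entry dir, boolean recursive) {
        List<String> result = new ArrayList<>();
        for (Entry child : dir.children) {
            String childPath = path + "/" + child.name;
            if (child.isDir && recursive) {
                result.addAll(listStatus(childPath, child, true));
            } else {
                result.add(childPath);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Entry input = Entry.dir("input",
            Entry.file("a.txt"),
            Entry.dir("2009", Entry.file("b.txt"), Entry.dir("06", Entry.file("c.txt"))));
        // prints [input/a.txt, input/2009]
        System.out.println(listStatus("input", input, false));
        // prints [input/a.txt, input/2009/b.txt, input/2009/06/c.txt]
        System.out.println(listStatus("input", input, true));
    }
}
```

Keeping the flag off by default means existing jobs see the same listing as before, which matches the "defaults to false" point in the description.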

          Activity

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12436481/MAPREDUCE-1501.1.trunk.patch
          against trunk revision 912471.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/469/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/469/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/469/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/469/console

          This message is automatically generated.

          Zheng Shao added a comment -

          There are 2 test failures but I don't think they are related. Resubmitting patch to get it tested again.

          Ian Soboroff added a comment -

          I am one of the authors of the emails cited in the description. In my implementation (which I did not submit as a JIRA), I have path filters to make sure we don't add . and .. and other hidden directories. I haven't properly thought about this issue in months... does this patch need this kind of check?

          Zheng Shao added a comment -

          Thanks for the feedback Ian.
          I don't think FileSystem.listPath() returns "." or "..". If it does, I believe the current code in trunk will also break. The new unit test will also fail if that's the case.

          dhruba borthakur added a comment -

          I think Ian mentioned that you can enhance this feature by allowing the user to register a set of PathFilters. That will allow the job to process only a selected subset of the subdirectories.

          Zheng Shao added a comment -

          Thanks Dhruba. I missed the part "and other hidden directories". We do call PathFilter on the subdirectories as well (see addInputPathRecursively(...)). Is that good enough, or do we want to split the PathFilters for files from the PathFilters for directories?
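
To make the two options in this question concrete, here is a small sketch of what applying a filter at every directory level implies, and what a split into directory and file filters could look like. It does not reproduce addInputPathRecursively(...), whose body is not shown on this page. The class name, the `accept` helper, and the "not hidden" rule (names starting with `_` or `.` are rejected) are assumptions made for illustration, and paths are plain strings relative to the input directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class SplitFilterSketch {

    /** Assumed "hidden" rule for this sketch: names starting with '_' or '.'. */
    static final Predicate<String> NOT_HIDDEN =
        name -> !name.startsWith("_") && !name.startsWith(".");

    /**
     * Keeps a file only if every directory on its path passes dirFilter
     * and its own name passes fileFilter. Passing the same predicate for
     * both arguments models a single filter applied at every level.
     */
    static List<String> accept(List<String> files,
                               Predicate<String> dirFilter,
                               Predicate<String> fileFilter) {
        List<String> kept = new ArrayList<>();
        for (String file : files) {
            String[] parts = file.split("/");
            boolean ok = fileFilter.test(parts[parts.length - 1]);
            for (int i = 0; ok && i < parts.length - 1; i++) {
                ok = dirFilter.test(parts[i]); // a rejected directory prunes its subtree
            }
            if (ok) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "part-0", "2009/part-1", ".staging/part-2", "2009/_SUCCESS", "2009/tmp/part-3");

        // One filter for both files and directories.
        // prints [part-0, 2009/part-1, 2009/tmp/part-3]
        System.out.println(accept(files, NOT_HIDDEN, NOT_HIDDEN));

        // Split filters: directories named "tmp" are skipped as well.
        // prints [part-0, 2009/part-1]
        System.out.println(accept(files, NOT_HIDDEN.and(n -> !n.equals("tmp")), NOT_HIDDEN));
    }
}
```

With a single shared filter, a rule written for file names also decides which subtrees are visited. The split form lets a job skip whole directories without changing which file names it accepts.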

          dhruba borthakur added a comment -

          That should be good enough, unless Ian has some other ideas.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12436481/MAPREDUCE-1501.1.trunk.patch
          against trunk revision 916823.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/339/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/339/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/339/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/339/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          The failed unit test is TestMiniMRLocalFS.testWithLocal and is not related to this patch. I will commit this patch.

          dhruba borthakur added a comment -

          I just committed this. Thanks Zheng.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #257 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/257/).
          FileInputFormat supports multi-level, recursive directory listing. (Zheng Shao via dhruba)

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #248 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/248/).
          FileInputFormat supports multi-level, recursive directory listing. (Zheng Shao via dhruba)

          Chris Douglas added a comment -
          +import com.sun.org.apache.commons.logging.Log;
          +import com.sun.org.apache.commons.logging.LogFactory;
          

          Should these imports be org.apache.commons.logging, not com.sun... ?

          Is there a reason this feature was only added to a deprecated class, instead of the FileInputFormat in the mapreduce package?

          Zheng Shao added a comment -

          Reopened for Chris's comments.

          Zheng Shao added a comment -

          Will open a new one to address this issue.


            People

            • Assignee: Zheng Shao
            • Reporter: Zheng Shao
            • Votes: 0
            • Watchers: 12
