Hadoop Common
  1. Hadoop Common
  2. HADOOP-3162

Map/reduce stops working with comma separated input paths

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Hide
      The public methods org.apache.hadoop.mapred.JobConf.setInputPath(Path) and org.apache.hadoop.mapred.JobConf.addInputPath(Path) are deprecated. And the methods have the semantics of branch 0.16.
      The following public APIs are added in org.apache.hadoop.mapred.FileInputFormat :
      public static void setInputPaths(JobConf job, Path... paths);
      public static void setInputPaths(JobConf job, String commaSeparatedPaths);
      public static void addInputPath(JobConf job, Path path);
      public static void addInputPaths(JobConf job, String commaSeparatedPaths);
      Earlier code calling JobConf.setInputPath(Path), JobConf.addInputPath(Path) should now call FileInputFormat.setInputPaths(JobConf, Path...) and FileInputFormat.addInputPath(Path) respectively
      Show
      The public methods org.apache.hadoop.mapred.JobConf.setInputPath(Path) and org.apache.hadoop.mapred.JobConf.addInputPath(Path) are deprecated. And the methods have the semantics of branch 0.16. The following public APIs are added in org.apache.hadoop.mapred.FileInputFormat : public static void setInputPaths(JobConf job, Path... paths); public static void setInputPaths(JobConf job, String commaSeparatedPaths); public static void addInputPath(JobConf job, Path path); public static void addInputPaths(JobConf job, String commaSeparatedPaths); Earlier code calling JobConf.setInputPath(Path), JobConf.addInputPath(Path) should now call FileInputFormat.setInputPaths(JobConf, Path...) and FileInputFormat.addInputPath(Path) respectively

      Description

      When a job is given a comma separated input file list, FileInputFormat class throws an exception, complaining the input is invalid:

      org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : hdfs:/
      namenode:port/gridmix/data/MonsterQueryBlockCompressed/part-
      00000,/gridmix/data/MonsterQueryBlockCompressed/part-00001,/gridmix/data/Monster
      QueryBlockCompressed/part-00002
      at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputForma
      t.java:213)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
      at org.apache.hadoop.mapred.GenericMRLoadGenerator.run(GenericMRLoadGene
      rator.java:189)

      1. patch-3162.txt
        64 kB
        Amareshwari Sriramadasu
      2. patch-3162.txt
        71 kB
        Amareshwari Sriramadasu
      3. patch-3162.txt
        71 kB
        Amareshwari Sriramadasu
      4. patch-3162.txt
        72 kB
        Amareshwari Sriramadasu
      5. patch-3162.txt
        72 kB
        Amareshwari Sriramadasu
      6. patch-3162.txt
        73 kB
        Amareshwari Sriramadasu

        Activity

        Hide
        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #463 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/463/ )
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks, Amareshwari!

        Show
        Arun C Murthy added a comment - I just committed this. Thanks, Amareshwari!
        Hide
        Runping Qi added a comment -

        the streaming cli should work with globing too with this patch.
        the following should work:

        ...... -input a/b/c,d{e,f},g/h
        
        Show
        Runping Qi added a comment - the streaming cli should work with globing too with this patch. the following should work: ...... -input a/b/c,d{e,f},g/h
        Hide
        Owen O'Malley added a comment -

        Except for the streaming part, I think this is the right call.

        On the streaming cli processing, I'm pretty conflicted. On the one hand, we can't deprecate the cli, but without changing the semantics, we can't use the choice patterns. I think the best we can do is to use the current patch and make 0.18 streaming use the normal path globbing instead of the comma separated paths set path.

        Show
        Owen O'Malley added a comment - Except for the streaming part, I think this is the right call. On the streaming cli processing, I'm pretty conflicted. On the one hand, we can't deprecate the cli, but without changing the semantics, we can't use the choice patterns. I think the best we can do is to use the current patch and make 0.18 streaming use the normal path globbing instead of the comma separated paths set path.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12380174/patch-3162.txt
        against trunk revision 645773.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 138 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12380174/patch-3162.txt against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 138 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2241/console This message is automatically generated.
        Hide
        Runping Qi added a comment -

        The patch looks good.

        +1.

        Show
        Runping Qi added a comment - The patch looks good. +1.
        Hide
        Amareshwari Sriramadasu added a comment -

        Here is patch with Runping's comments incorporated. The following are the changes from earlier patch:
        1. getEscapedPathString(String) is changed as getPathStrings(String). getPathStrings(commaSeparatedPaths) returns a string array of the paths in commaSeparatedPaths. So, this avoids escaping and un-escaping to get splits.
        2. All the calls to add/setInputPath(new Path(String)) are replaced with add/setInputPaths(conf, String)

        Show
        Amareshwari Sriramadasu added a comment - Here is patch with Runping's comments incorporated. The following are the changes from earlier patch: 1. getEscapedPathString(String) is changed as getPathStrings(String). getPathStrings(commaSeparatedPaths) returns a string array of the paths in commaSeparatedPaths. So, this avoids escaping and un-escaping to get splits. 2. All the calls to add/setInputPath(new Path(String)) are replaced with add/setInputPaths(conf, String)
        Hide
        Amareshwari Sriramadasu added a comment -

        Implementation is as follows:

        public static setInputPaths(JobConf job,   String commaSeparatedFilePaths) {
            // treat the comma in commaSeparatedFilePaths that are not enclosed by '{' and '}' as separators
            // split commaSeparatedFilePaths into string arrays using those separators
            // Let Path [] paths be the array of paths created from those strings
            setInputPaths(job, paths);
        }
        public static addInputPaths(JobConf job,   String commaSeparatedFilePaths) {
            // treat the comma in commaSeparatedFilePaths that are not enclosed by '{' and '}' as separators
            // split commaSeparatedFilePaths into string arrays using those separators
            // Let Path [] paths be the array of paths created from those strings
            for each path in path[]
              addInputPath(job, path);
        }
        

        Here, setInputPaths replaces existing paths, but addInputPaths adds the values to the existsing input path list.

        Show
        Amareshwari Sriramadasu added a comment - Implementation is as follows: public static setInputPaths(JobConf job, String commaSeparatedFilePaths) { // treat the comma in commaSeparatedFilePaths that are not enclosed by '{' and '}' as separators // split commaSeparatedFilePaths into string arrays using those separators // Let Path [] paths be the array of paths created from those strings setInputPaths(job, paths); } public static addInputPaths(JobConf job, String commaSeparatedFilePaths) { // treat the comma in commaSeparatedFilePaths that are not enclosed by '{' and '}' as separators // split commaSeparatedFilePaths into string arrays using those separators // Let Path [] paths be the array of paths created from those strings for each path in path[] addInputPath(job, path); } Here, setInputPaths replaces existing paths, but addInputPaths adds the values to the existsing input path list.
        Hide
        Runping Qi added a comment - - edited

        I assume the two new methods you refer to are meant to be:

        public static setInputPaths(JobConf job,   String commaSeparatedFilePaths);
        public static addInputPaths(JobConf job,   String commaSeparatedFilePaths);
        

        They don't break backward compatibility.
        The implementation in the patch seems overly complicated.
        I am not sure its correctness. .
        The intended behavior should look like:

        public static addInputPaths(JobConf job,   String commaSeparatedFilePaths) {
            // treat the comma in commaSeparatedFilePaths that are not enclosed by '{' and '}' as separators
            // split commaSeparatedFilePaths into string arrays using those separators
            // Let Path [] paths be the array of paths created from those strings
            return setInputPaths(job, paths);
        }
        

        Also,when you replace the code using the existing api with the one using the new api like:

        -      grepJob.setInputPath(new Path(args[0]));
        +      FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
        

        That is incorrect.

        The correct one should be:

        -      grepJob.setInputPath(new Path(args[0]));
        +      FileInputFormat.setInputPaths(grepJob, args[0]);
        
        Show
        Runping Qi added a comment - - edited I assume the two new methods you refer to are meant to be: public static setInputPaths(JobConf job, String commaSeparatedFilePaths); public static addInputPaths(JobConf job, String commaSeparatedFilePaths); They don't break backward compatibility. The implementation in the patch seems overly complicated. I am not sure its correctness. . The intended behavior should look like: public static addInputPaths(JobConf job, String commaSeparatedFilePaths) { // treat the comma in commaSeparatedFilePaths that are not enclosed by '{' and '}' as separators // split commaSeparatedFilePaths into string arrays using those separators // Let Path [] paths be the array of paths created from those strings return setInputPaths(job, paths); } Also,when you replace the code using the existing api with the one using the new api like: - grepJob.setInputPath( new Path(args[0])); + FileInputFormat.setInputPaths(grepJob, new Path(args[0])); That is incorrect. The correct one should be: - grepJob.setInputPath( new Path(args[0])); + FileInputFormat.setInputPaths(grepJob, args[0]);
        Hide
        Amareshwari Sriramadasu added a comment -

        The newly added apis now takes care of both comma separated paths and glob paths. For example, if user gives commaSeparated string as a

        {b,c}d,ef,xyz. this is split as 3 paths: 1. a{b,c}

        d , 2. ef , 3. xyz and added to the input paths. So, this does not stop user from giving glob paths.

        Show
        Amareshwari Sriramadasu added a comment - The newly added apis now takes care of both comma separated paths and glob paths. For example, if user gives commaSeparated string as a {b,c}d,ef,xyz. this is split as 3 paths: 1. a{b,c} d , 2. ef , 3. xyz and added to the input paths. So, this does not stop user from giving glob paths.
        Hide
        Hairong Kuang added a comment -

        Runping, if streaming continues to use the deprecated API in JobConf, it keeps the old semantics. The two new methods break the backward compatibility. We should do it in release 18. We need more discussion on what's the right syntax and semantics for the comma seperated path names.

        Show
        Hairong Kuang added a comment - Runping, if streaming continues to use the deprecated API in JobConf, it keeps the old semantics. The two new methods break the backward compatibility. We should do it in release 18. We need more discussion on what's the right syntax and semantics for the comma seperated path names.
        Hide
        Runping Qi added a comment -

        The meaning of the comma in the user facing interface like streaming should not and need not change.
        path1,path2 should continue mean two paths separated by comma.
        It should not be changed. The comma should not be interpreted as a part of path.
        It is either a separator of path or a separator in glob.
        If we generalize the glob to accespt

        {a/b/,/d/e/f/,/g}

        , then we get a unified semantics of comma.
        Until then, the above two api methods are needed.

        Show
        Runping Qi added a comment - The meaning of the comma in the user facing interface like streaming should not and need not change. path1,path2 should continue mean two paths separated by comma. It should not be changed. The comma should not be interpreted as a part of path. It is either a separator of path or a separator in glob. If we generalize the glob to accespt {a/b/,/d/e/f/,/g} , then we get a unified semantics of comma. Until then, the above two api methods are needed.
        Hide
        Hairong Kuang added a comment -

        I do not think that we need add the following two methods to FileInputFormat:
        1. setInputPaths(JobConf conf, Sring commaSeparatedPaths)
        2. addInputPaths(JobConf conf, String commaSeparatedPaths).

        We have not discussed what a comma means in the user facing interfaces like streaming if a user provides a comma separated path name. In streaming, should we support commas in a path name? What if a user wants to use a glob that contains a comma? These questions need to be well discussed and documented before we make any code change to support it. We could do it in release 18.

        In release 17, the patch only needs to revert the old behavior of addInputPath and setInputPath of JobConf. Applications like streaming should continue to use JobConf.setInputPath while a user that needs to use a glob or a path containing commas can use the new APIs in FileInputFormat.

        Show
        Hairong Kuang added a comment - I do not think that we need add the following two methods to FileInputFormat: 1. setInputPaths(JobConf conf, Sring commaSeparatedPaths) 2. addInputPaths(JobConf conf, String commaSeparatedPaths). We have not discussed what a comma means in the user facing interfaces like streaming if a user provides a comma separated path name. In streaming, should we support commas in a path name? What if a user wants to use a glob that contains a comma? These questions need to be well discussed and documented before we make any code change to support it. We could do it in release 18. In release 17, the patch only needs to revert the old behavior of addInputPath and setInputPath of JobConf. Applications like streaming should continue to use JobConf.setInputPath while a user that needs to use a glob or a path containing commas can use the new APIs in FileInputFormat.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379981/patch-3162.txt
        against trunk revision 645773.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 138 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12379981/patch-3162.txt against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 138 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2220/console This message is automatically generated.
        Hide
        Amareshwari Sriramadasu added a comment -

        Fixed javac warning. Warning was due to MultiFileWordCount.

        Show
        Amareshwari Sriramadasu added a comment - Fixed javac warning. Warning was due to MultiFileWordCount.
        Hide
        Devaraj Das added a comment -

        The private static method FileInputFormat.getEscapedPathString worries me a little bit, at the very least we should put it in StringUtils or someother place like that

        Note that the path stuff applies only to FileInputFormat. So IMO this definition should be here rather than in some other place.

        Show
        Devaraj Das added a comment - The private static method FileInputFormat.getEscapedPathString worries me a little bit, at the very least we should put it in StringUtils or someother place like that Note that the path stuff applies only to FileInputFormat. So IMO this definition should be here rather than in some other place.
        Hide
        Arun C Murthy added a comment -

        Amareshwari, could you please fix the javac warnings?

        Minor nit: The private static method FileInputFormat.getEscapedPathString worries me a little bit, at the very least we should put it in StringUtils or someother place like that... at least other input formats which do not extend FileInputFormat can use it. What do others think?

        Show
        Arun C Murthy added a comment - Amareshwari, could you please fix the javac warnings? Minor nit: The private static method FileInputFormat.getEscapedPathString worries me a little bit, at the very least we should put it in StringUtils or someother place like that... at least other input formats which do not extend FileInputFormat can use it. What do others think?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379833/patch-3162.txt
        against trunk revision 645773.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 138 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 498 javac compiler warnings (more than the trunk's current 497 warnings).

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12379833/patch-3162.txt against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 138 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 498 javac compiler warnings (more than the trunk's current 497 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2209/console This message is automatically generated.
        Hide
        Devaraj Das added a comment -

        +1

        Show
        Devaraj Das added a comment - +1
        Hide
        Amareshwari Sriramadasu added a comment -

        Looks like it skipped hudson queue. trying to run hudson again.

        Show
        Amareshwari Sriramadasu added a comment - Looks like it skipped hudson queue. trying to run hudson again.
        Hide
        Amareshwari Sriramadasu added a comment -

        Indentation changed. And added a comment.

        Show
        Amareshwari Sriramadasu added a comment - Indentation changed. And added a comment.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379638/patch-3162.txt
        against trunk revision 645773.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 138 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12379638/patch-3162.txt against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 138 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2184/console This message is automatically generated.
        Hide
        Amareshwari Sriramadasu added a comment -

        Fixed javac warning.

        Show
        Amareshwari Sriramadasu added a comment - Fixed javac warning.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379539/patch-3162.txt
        against trunk revision 643282.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 138 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 499 javac compiler warnings (more than the trunk's current 498 warnings).

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12379539/patch-3162.txt against trunk revision 643282. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 138 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 499 javac compiler warnings (more than the trunk's current 498 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2181/console This message is automatically generated.
        Hide
        Amareshwari Sriramadasu added a comment -

        trying to run hudson again

        Show
        Amareshwari Sriramadasu added a comment - trying to run hudson again
        Hide
        Amareshwari Sriramadasu added a comment - - edited

        This patch addresses the following:
        1. Adds the following apis in FileInputFormat:

        • public static void setInputPaths(JobConf job, Path... paths);
        • public static void setInputPaths(JobConf job, String commaSeparatedPaths);
        • public static void addInputPath(JobConf job, Path path);
        • public static void addInputPaths(JobConf job, String commaSeparatedPaths);
          2. Deprecates JobConf.setInputPath(Path) and JobConf.addInputPath(Path). And the methods have 0.16 semantics
          3. Moves testInputPath() from conf/TestJobConf to mapred/TestInputPath. Added tests for testing the new apis.
        Show
        Amareshwari Sriramadasu added a comment - - edited This patch addresses the following: 1. Adds the following apis in FileInputFormat: public static void setInputPaths(JobConf job, Path... paths); public static void setInputPaths(JobConf job, String commaSeparatedPaths); public static void addInputPath(JobConf job, Path path); public static void addInputPaths(JobConf job, String commaSeparatedPaths); 2. Deprecates JobConf.setInputPath(Path) and JobConf.addInputPath(Path). And the methods have 0.16 semantics 3. Moves testInputPath() from conf/TestJobConf to mapred/TestInputPath. Added tests for testing the new apis.
        Hide
        Amareshwari Sriramadasu added a comment -

        Canceling patch to address Runping's comment.

        Show
        Amareshwari Sriramadasu added a comment - Canceling patch to address Runping's comment.
        Hide
        Runping Qi added a comment -

        Since the inputSpecs of a streaming job comes from command lines, it
        may contain comma separated files (very common use).
        In order to continue to support that usage, it is better to change

        FileInputFormat.addInputPath(jobConf_,   new Path(((String) inputSpecs_.get(i))));
        

        into

        FileInputFormat.addInputPaths(jobConf_,   (String) inputSpecs_.get(i));
        

        (do we have

        public static addInputPaths(JobConf job,   String commaSeparatedFilePaths);
        

        If not, we need to add it.

        Show
        Runping Qi added a comment - Since the inputSpecs of a streaming job comes from command lines, it may contain comma separated files (very common use). In order to continue to support that usage, it is better to change FileInputFormat.addInputPath(jobConf_, new Path((( String ) inputSpecs_.get(i)))); into FileInputFormat.addInputPaths(jobConf_, ( String ) inputSpecs_.get(i)); (do we have public static addInputPaths(JobConf job, String commaSeparatedFilePaths); If not, we need to add it.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12379389/patch-3162.txt
        against trunk revision 643282.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 135 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12379389/patch-3162.txt against trunk revision 643282. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 135 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2165/console This message is automatically generated.
        Hide
        Hairong Kuang added a comment -

        Just to make it clear that currently dfs does not support globs like

        {/p1/f1,/p2/f2}

        . Instead, it supports globs like /p1/

        {f1,f2}

        . This means that the closure pattern needs to be in one path component.

        Show
        Hairong Kuang added a comment - Just to make it clear that currently dfs does not support globs like {/p1/f1,/p2/f2} . Instead, it supports globs like /p1/ {f1,f2} . This means that the closure pattern needs to be in one path component.
        Hide
        Amareshwari Sriramadasu added a comment -

        Added following apis in FileInputFormat
        public static void addInputPath(JobConf job, Path path);
        public static void setInputPaths(JobConf job, Path... paths);
        public static void setInputPaths(JobConf job, String... strings);
        public static Path[] getInputPaths(JobConf job);

        Show
        Amareshwari Sriramadasu added a comment - Added following apis in FileInputFormat public static void addInputPath(JobConf job, Path path); public static void setInputPaths(JobConf job, Path... paths); public static void setInputPaths(JobConf job, String... strings); public static Path[] getInputPaths(JobConf job);
        Hide
        Amareshwari Sriramadasu added a comment -

        Fixed javadoc warning

        Show
        Amareshwari Sriramadasu added a comment - Fixed javadoc warning
        Hide
        Amareshwari Sriramadasu added a comment -

        there is javadoc warning in the patch.

        Show
        Amareshwari Sriramadasu added a comment - there is javadoc warning in the patch.
        Hide
        Amareshwari Sriramadasu added a comment -

        Here is a patch adding apis Owen suggested.
        Deprecated the apis in JobConf. restored them to their 0.16 semantics.
        Added public static void setInputPaths(JobConf conf, String... strings)

        Show
        Amareshwari Sriramadasu added a comment - Here is a patch adding apis Owen suggested. Deprecated the apis in JobConf. restored them to their 0.16 semantics. Added public static void setInputPaths(JobConf conf, String... strings)
        Hide
        Amareshwari Sriramadasu added a comment -

        However, I'd prefer a new API method still hornor camma as the path separator

        This is conficting with glob pattern requirement, right? Now, comma seperated paths can be added with global pattern as you suggested.

        We can add the followign api along with the ones Owen suggested.
        public static void setInputPaths(JobConf conf, String... strings)

        Show
        Amareshwari Sriramadasu added a comment - However, I'd prefer a new API method still hornor camma as the path separator This is conficting with glob pattern requirement, right? Now, comma seperated paths can be added with global pattern as you suggested. We can add the followign api along with the ones Owen suggested. public static void setInputPaths(JobConf conf, String... strings)
        Hide
        Runping Qi added a comment -

        Withthe new api, I assume that add/setInputPath accept a Path with a global pattern having multiple paths separated by comma, like:

        new Path("{p1/f1,p2/f2}")
        

        However, I'd prefer a new API method still hornor camma as the path separator

        public static void setInputPaths(JobConf job, String commaSeparatedPaths);
        
        Show
        Runping Qi added a comment - Withthe new api, I assume that add/setInputPath accept a Path with a global pattern having multiple paths separated by comma, like: new Path( "{p1/f1,p2/f2}" ) However, I'd prefer a new API method still hornor camma as the path separator public static void setInputPaths(JobConf job, String commaSeparatedPaths);
        Hide
        Owen O'Malley added a comment -

        I propose that we make this mirror the change in HADOOP-3041 and add to FileInputFormat:

          public static void addInputPath(JobConf job, Path path);
          public static void setInputPaths(JobConf job, Path[] paths);
          public static Path[] getInputPaths(JobConf job);
        

        while in JobConf we deprecate the input path methods and restore them to their 0.16 semantics (pre- HADOOP-3064, with no quoting and split on ',')

        @Deprecated
        public void addInputPath(Path p);
        @Deprecated
        public void setInputPath(Path p);
        @Deprecated
        public Path[] getInputPaths();
        
        Show
        Owen O'Malley added a comment - I propose that we make this mirror the change in HADOOP-3041 and add to FileInputFormat: public static void addInputPath(JobConf job, Path path); public static void setInputPaths(JobConf job, Path[] paths); public static Path[] getInputPaths(JobConf job); while in JobConf we deprecate the input path methods and restore them to their 0.16 semantics (pre- HADOOP-3064 , with no quoting and split on ',') @Deprecated public void addInputPath(Path p); @Deprecated public void setInputPath(Path p); @Deprecated public Path[] getInputPaths();
        Hide
        Runping Qi added a comment -

        this is not a math problem

        Show
        Runping Qi added a comment - this is not a math problem
        Hide
        Hairong Kuang added a comment -

        But this common practice of using a path parameter to set multiple input paths is wrong.

        Show
        Hairong Kuang added a comment - But this common practice of using a path parameter to set multiple input paths is wrong.
        Hide
        Runping Qi added a comment -

        When people need to pass inputs to a map/reduce through a command line, they commonly mormal use one argument for the multiple inputs, separated by comma (,) char. Then they will most likely call jobconf.setInputPath(inputs).

        The patch 3064 broke a common practice of using map/reduce since day one.

        Show
        Runping Qi added a comment - When people need to pass inputs to a map/reduce through a command line, they commonly mormal use one argument for the multiple inputs, separated by comma (,) char. Then they will most likely call jobconf.setInputPath(inputs). The patch 3064 broke a common practice of using map/reduce since day one.
        Hide
        Hairong Kuang added a comment - - edited

        The problem was caused by the patch to HADOOP-3064. HADOOP-3064 allows that a path name contains a comma. Methods setInputPath and addInputPath of JobConf escape all the commas in the path name. So before this change, if a user's map/reduce job has the following code:

        jobConf.setInputPath(new Path("aa,bb")));
        

        jobConf.getInputPaths returned an array of two paths: "aa" and "bb".
        After this patch, jobConf.getInputPaths returns an array of one path: "aa,bb".

        I guess the failed job might have used jobConf.setInputPath(new Path("path1,path2")) to set two input paths. It should use the following statements instead:

        jobConf.setInputPath(new Path("path1"));
        jobConf.addInputPath(new Path("path2"));
        
        Show
        Hairong Kuang added a comment - - edited The problem was caused by the patch to HADOOP-3064 . HADOOP-3064 allows that a path name contains a comma. Methods setInputPath and addInputPath of JobConf escape all the commas in the path name. So before this change, if a user's map/reduce job has the following code: jobConf.setInputPath( new Path( "aa,bb" ))); jobConf.getInputPaths returned an array of two paths: "aa" and "bb". After this patch, jobConf.getInputPaths returns an array of one path: "aa,bb". I guess the failed job might have used jobConf.setInputPath(new Path("path1,path2")) to set two input paths. It should use the following statements instead: jobConf.setInputPath( new Path( "path1" )); jobConf.addInputPath( new Path( "path2" ));
        Hide
        Mukund Madhugiri added a comment -

        The last successful run I saw was on trunk revision 642046

        Show
        Mukund Madhugiri added a comment - The last successful run I saw was on trunk revision 642046
        Hide
        Runping Qi added a comment -



        The problem started to happen on the hadoop 0.17 trunk after 3/27.

        Show
        Runping Qi added a comment - The problem started to happen on the hadoop 0.17 trunk after 3/27.

          People

          • Assignee:
            Amareshwari Sriramadasu
            Reporter:
            Runping Qi
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development