Hadoop Common / HADOOP-3498

File globbing alternation should be able to span path components

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: fs
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Extended file globbing alternation to cross path components. For example, {/a/b,/c/d} expands to a path that matches the files /a/b and /c/d.

      Description

      For example, {/a/b,/c/d} should expand to /a/b and /c/d. This change would also permit a consistent syntax for specifying multiple input paths to MapReduce, streaming and Pig: a single glob path with alternation, {/a/b,/c/d}, rather than a collection of comma-separated glob paths, /a/b,/c/d.

      This change would also make globbing more consistent with bash, which supports this feature.

      1. hadoop-3498.patch
        10 kB
        Tom White
      2. hadoop-3498-v2.patch
        10 kB
        Tom White
      3. hadoop-3498-v3.patch
        11 kB
        Tom White

          Activity

          Tom White added a comment -

          This could be implemented by recursively expanding alternations to produce a list of Paths, as the first stage of glob processing.
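
          The recursive expansion described above can be sketched roughly as follows. This is an illustration only, not code from the attached patches: the class name GlobAlternationSketch and the method expand are hypothetical, patterns are plain strings rather than Paths, and escaped braces are not handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: expands every {a,b,...} group into separate patterns.
public class GlobAlternationSketch {

    /** Expands the first brace group, then recurses on each resulting pattern. */
    public static List<String> expand(String pattern) {
        int open = pattern.indexOf('{');
        if (open < 0) {
            return Collections.singletonList(pattern); // nothing to expand
        }
        List<String> alternatives = new ArrayList<>();
        int depth = 0;
        int start = open + 1;
        int close = -1;
        for (int i = open; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {            // matching close brace found
                    alternatives.add(pattern.substring(start, i));
                    close = i;
                    break;
                }
            } else if (c == ',' && depth == 1) { // comma of the outer group only
                alternatives.add(pattern.substring(start, i));
                start = i + 1;
            }
        }
        if (close < 0) {
            return Collections.singletonList(pattern); // unbalanced: leave as-is
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        List<String> result = new ArrayList<>();
        for (String alt : alternatives) {
            // Recursion handles nested groups and groups later in the pattern.
            result.addAll(expand(prefix + alt + suffix));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("{/a/b,/c/d}"));
        System.out.println(expand("/{a,b}/{c/d,e}"));
    }
}
```

          Each resulting string would then go through ordinary glob matching as a separate path pattern.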

          Tom White added a comment -

          Patch for review. Alternations are only expanded if there is an embedded "/", for efficiency.
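
          One way to picture the "only if there is an embedded /" rule is the sketch below. It is an assumption-laden illustration, not the patch itself: the class SlashOnlyExpansionSketch and its helpers are hypothetical names, patterns are plain strings, and escaped braces are ignored. Groups without a "/" are left in place so that they can still be matched as a single pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: expand only brace groups whose body contains '/'.
public class SlashOnlyExpansionSketch {

    public static List<String> expand(String pattern) {
        return expandFrom(pattern, 0);
    }

    private static List<String> expandFrom(String pattern, int from) {
        int open = pattern.indexOf('{', from);
        while (open >= 0) {
            int close = matchingBrace(pattern, open);
            if (close < 0) {
                break; // unbalanced braces: leave the pattern untouched
            }
            String body = pattern.substring(open + 1, close);
            if (body.indexOf('/') >= 0) {
                String prefix = pattern.substring(0, open);
                String suffix = pattern.substring(close + 1);
                List<String> result = new ArrayList<>();
                for (String alt : splitTopLevel(body)) {
                    // Resume at 'open': the prefix holds only slash-free groups.
                    result.addAll(expandFrom(prefix + alt + suffix, open));
                }
                return result;
            }
            open = pattern.indexOf('{', close + 1); // slash-free group: skip it
        }
        return Collections.singletonList(pattern);
    }

    /** Index of the '}' matching the '{' at position open, or -1 if none. */
    private static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            if (s.charAt(i) == '{') {
                depth++;
            } else if (s.charAt(i) == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    /** Splits a group body on commas that are not inside a nested group. */
    private static List<String> splitTopLevel(String body) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(body.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(body.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(expand("{/a/b,/c/d}"));
        System.out.println(expand("/x/{a,b}/{c/d,e}"));
    }
}
```

          In this sketch, /x/{a,b}/{c/d,e} becomes two patterns, /x/{a,b}/c/d and /x/{a,b}/e, and the {a,b} group is kept for the later matching stage.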

          Hairong Kuang added a comment -

          This patch needs to scan filePattern multiple times if the pattern contains more than one pair of curly braces. With multilevel nested braces, the number of scans grows close to exponentially. If alternations were expanded whether or not there is an embedded "/", we could use an algorithm that requires only one scanning pass.

          Tom White added a comment -

          Hairong, thanks for the feedback.

          I wanted to avoid expanding all alternations so that we didn't get an explosion of paths for things like:

          {a,b,c,d,e}{f,g,h,i,j}{k,l,m,n,o}

          This would continue to be processed as a single path and matched using regular expressions.

          The patch does scan filePattern multiple times, so I've changed it so that it doesn't scan previously-expanded parts of the path. This change dramatically reduces the number of character scans.
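
          To make the size of the "explosion of paths" concrete, the small sketch below counts how many patterns full expansion would produce. The class ExpansionCount and its method are hypothetical, and the count assumes flat (non-nested) brace groups.

```java
// Hypothetical sketch: counts the patterns produced by expanding every group.
public class ExpansionCount {

    /** Multiplies the number of alternatives of each non-nested {..} group. */
    public static long countFullExpansion(String pattern) {
        long count = 1;
        int alternatives = 0;     // alternatives seen in the current group
        boolean inGroup = false;
        for (char c : pattern.toCharArray()) {
            if (c == '{') {
                inGroup = true;
                alternatives = 1;
            } else if (c == ',' && inGroup) {
                alternatives++;
            } else if (c == '}' && inGroup) {
                count *= alternatives;
                inGroup = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three groups of five alternatives each: 5 * 5 * 5 = 125
        System.out.println(countFullExpansion("{a,b,c,d,e}{f,g,h,i,j}{k,l,m,n,o}"));
    }
}
```

          Full expansion of the example pattern would therefore yield 125 separate paths, whereas leaving slash-free groups unexpanded keeps it as one pattern.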

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388914/hadoop-3498-v2.patch
          against trunk revision 689380.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 272 release audit warnings (more than the trunk's current 271 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3123/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3123/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3123/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3123/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          +1 The optimization looks good to me.

          Tom White added a comment -

          New patch fixing license headers. The test failures look unrelated to this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389496/hadoop-3498-v3.patch
          against trunk revision 692287.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3180/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3180/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3180/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3180/console

          This message is automatically generated.

          Tom White added a comment -

          I've just committed this.

          Nigel Daley added a comment -

          I wish this patch had updated org.apache.hadoop.cli.TestCLI. Tom, if that makes sense, can you open another Jira?

          Hudson added a comment -

          Integrated in Hadoop-trunk #595 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/595/ )

            People

            • Assignee: Tom White
            • Reporter: Tom White
            • Votes: 0
            • Watchers: 3
