  1. Hadoop Common
  2. HADOOP-1968

Wildcard input syntax (glob) should support {}

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.1
    • Fix Version/s: 0.15.0
    • Component/s: fs
    • Labels: None

    Description

      We have users who have organized data by day and would like to select several days in a single input specification. For example, they would like to be able to say:

      '/data/2007{0830,0831,0901}/typeX/'

      to input three days of data into map-reduce (or Pig in this case).
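
For illustration only, the requested behavior can be pictured as shell-style brace expansion. The following is a minimal sketch, not the code from the attached patch; the class name `CurlyGlobSketch` and method `expand` are hypothetical, and an unbalanced `{` is simply treated as a literal character.

```java
import java.util.ArrayList;
import java.util.List;

class CurlyGlobSketch {
    /** Expands the first top-level {a,b,c} group, then recurses on each result. */
    static List<String> expand(String glob) {
        List<String> out = new ArrayList<>();
        int open = glob.indexOf('{');
        if (open < 0) {              // no group left: the string is final
            out.add(glob);
            return out;
        }
        // Find the '}' that balances the first '{'.
        int depth = 0;
        int close = -1;
        for (int i = open; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                close = i;
                break;
            }
        }
        if (close < 0) {             // unbalanced '{': keep the text literally
            out.add(glob);
            return out;
        }
        String prefix = glob.substring(0, open);
        String suffix = glob.substring(close + 1);
        // Split the group body on commas that are not inside a nested group.
        List<String> alternatives = new ArrayList<>();
        int start = open + 1;
        depth = 0;
        for (int i = open + 1; i < close; i++) {
            char c = glob.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
            } else if (c == ',' && depth == 0) {
                alternatives.add(glob.substring(start, i));
                start = i + 1;
            }
        }
        alternatives.add(glob.substring(start, close));
        for (String alt : alternatives) {
            out.addAll(expand(prefix + alt + suffix));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String p : expand("/data/2007{0830,0831,0901}/typeX/")) {
            System.out.println(p);
        }
    }
}
```

With the example from the description, the sketch yields one path per listed day, each of which could then be passed on as an ordinary input path.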

      (Also, the use of regexp to resolve glob patterns looks like it might introduce some other bugs. I'd appreciate it if someone took another look at the code to see if there are any characters in file names that could be interpreted as regexp "instructions".)

      Attachments

        1. curlyGlob.patch
          13 kB
          Hairong Kuang
        2. curlyGlob1.patch
          14 kB
          Hairong Kuang

        Activity

          dhruba Dhruba Borthakur added a comment -

          Is this a blocker for 0.15 release?
          cutting Doug Cutting added a comment -

          > Is this a blocker for 0.15 release?

          I don't think so. We don't usually do new feature blockers, rather only regression bugs.


          eric14 Eric Baldeschwieler added a comment -

          It is not a blocker, but it would resolve some user issues we'd really like to fix. If we can get it into 15, it would make some people happy. But I would not hold the release for this feature.
          hairong Hairong Kuang added a comment -

          This patch allows a glob to use curly brackets as described in the jira. It also makes sure that a file name that contains Java Regex special characters does not get interpreted as an instruction.

          There is one problem left with globs which is that glob escape does not work. See HADOOP-1995 for more details. I will fix the escape problem once HADOOP-1995 is resolved.

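
To make the two points in the comment above concrete, here is a hedged sketch of translating one glob component into a Java regex: `{a,b}` becomes a non-capturing alternation, and every other character is quoted with `Pattern.quote` so that regex metacharacters in file names stay literal. This is an assumption about how such a translation could look, not the code in curlyGlob.patch; the class `GlobToRegexSketch` is hypothetical, and character classes (`[...]`) and glob escapes are deliberately left out.

```java
import java.util.regex.Pattern;

class GlobToRegexSketch {
    /** Translates a single glob path component (no '/') into a regex Pattern. */
    static Pattern compile(String glob) {
        StringBuilder re = new StringBuilder();
        int depth = 0;                       // nesting level of open '{' groups
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            switch (c) {
                case '*':
                    re.append("[^/]*");      // any run of characters within one component
                    break;
                case '?':
                    re.append("[^/]");       // exactly one character
                    break;
                case '{':
                    re.append("(?:");        // start of an alternation group
                    depth++;
                    break;
                case '}':
                    if (depth > 0) {
                        re.append(')');
                        depth--;
                    } else {
                        re.append("\\}");    // stray '}' is matched literally
                    }
                    break;
                case ',':
                    // A comma separates alternatives only inside a group.
                    re.append(depth > 0 ? "|" : ",");
                    break;
                default:
                    // Quote everything else so '.', '+', '(' etc. are literal.
                    re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unbalanced '{' in glob: " + glob);
        }
        return Pattern.compile(re.toString());
    }

    public static void main(String[] args) {
        System.out.println(compile("2007{0830,0831,0901}").matcher("20070831").matches());
        System.out.println(compile("a.b").matcher("axb").matches());
    }
}
```

Quoting character by character is verbose but keeps the translation loop simple; a real implementation would also need the escape handling tracked in HADOOP-1995.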
          szetszwo Tsz-wo Sze added a comment -

          +1
          Code looks good. Below are some thoughts.

          • Since this is a single-threaded situation, StringBuilder is more efficient than StringBuffer.
          • For this problem, using some parser generators (e.g. yacc) might be better than Java Regex.
          hairong Hairong Kuang added a comment -

          The patch uses StringBuilder instead of StringBuffer. I feel that using lex & yacc is too big a project for now, so the new patch does not incorporate this suggestion.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12367124/curlyGlob1.patch
          against trunk revision r582033.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/890/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/890/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/890/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/890/console

          This message is automatically generated.


          dhruba Dhruba Borthakur added a comment -

          I just committed this. Thanks Hairong!
          hudson Hudson added a comment -

          Integrated in Hadoop-Nightly #263 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/263/ )

          eric14 Eric Baldeschwieler added a comment -

          Why is glob going through path? It seems to me that a glob string is not a path string and shouldn't be processed as such. The resulting match is a list of paths.
          hairong Hairong Kuang added a comment -

          It is for splitting a path name into path components. Matching is done component by component.

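
As a rough picture of what "component by component" can mean, the sketch below splits both the glob and a candidate path on `/` and matches each pair of components separately, so a wildcard never spans a directory boundary. It is an illustrative assumption rather than the committed implementation; `ComponentMatchSketch` and its methods are hypothetical names, and only `*` and `?` are handled.

```java
import java.util.regex.Pattern;

class ComponentMatchSketch {
    /** Builds a regex for one component; only '*' and '?' are treated as wildcards. */
    static Pattern componentPattern(String globComponent) {
        StringBuilder re = new StringBuilder();
        for (char c : globComponent.toCharArray()) {
            if (c == '*') {
                re.append(".*");   // safe here: a component never contains '/'
            } else if (c == '?') {
                re.append('.');
            } else {
                re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(re.toString());
    }

    /** Returns true if every component of path matches the corresponding glob component. */
    static boolean matches(String globPath, String path) {
        String[] globParts = globPath.split("/");
        String[] pathParts = path.split("/");
        if (globParts.length != pathParts.length) {
            return false;          // a wildcard cannot absorb extra directory levels
        }
        for (int i = 0; i < globParts.length; i++) {
            if (!componentPattern(globParts[i]).matcher(pathParts[i]).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("/data/2007*/typeX", "/data/20070830/typeX"));
        System.out.println(matches("/data/*/typeX", "/data/a/b/typeX"));
    }
}
```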

          People

            Assignee: hairong Hairong Kuang
            Reporter: eric14 Eric Baldeschwieler
            Votes: 0
            Watchers: 0