Pig
  1. Pig
  2. PIG-965

PERFORMANCE: optimize common case in matches (PORegex)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: impl
    • Labels:
      None

      Description

      Some frequently seen use cases of 'matches' comparison operator have follow properties -
      1. The rhs is a constant string . eg "c1 matches 'abc%' "
      2. Regexes such that look for matching prefix , suffix etc are very common. eg - "abc%', "%abc", '%abc%'

      To optimize for these common cases , PORegex.java can be changed to -
      1. Compile the pattern (rhs of matches) re-use it if the pattern string has not changed.
      2. Use string comparisons for simple common regexes (in 2 above).

      The implementation of Hive like clause uses similar optimizations.

      1. automaton.jar
        168 kB
        Ankit Modi
      2. poregex2.patch
        33 kB
        Ankit Modi

        Issue Links

          Activity

          Hide
          Thejas M Nair added a comment -

          Committed to trunk.

          Show
          Thejas M Nair added a comment - Committed to trunk.
          Hide
          Thejas M Nair added a comment -

          Ankit is right, the patch is not present in trunk. I will apply it to trunk.

          Show
          Thejas M Nair added a comment - Ankit is right, the patch is not present in trunk. I will apply it to trunk.
          Hide
          Ankit Modi added a comment -

          I couldn't see the poregex2.patch patch applied in the code.

          automaton.jar is present in the trunk, but the files modified/added by above patch are not modified/added.

          Show
          Ankit Modi added a comment - I couldn't see the poregex2.patch patch applied in the code. automaton.jar is present in the trunk, but the files modified/added by above patch are not modified/added.
          Hide
          Olga Natkovich added a comment -

          patch committed. Thanks Ankit for contributing and Thejas for reviewing!

          Show
          Olga Natkovich added a comment - patch committed. Thanks Ankit for contributing and Thejas for reviewing!
          Hide
          Olga Natkovich added a comment -

          Thanks, Thejas. I will run test-commit tests manually and commit if it passes.

          Show
          Olga Natkovich added a comment - Thanks, Thejas. I will run test-commit tests manually and commit if it passes.
          Hide
          Thejas M Nair added a comment -

          +1 . new patch looks good
          Hudson findbugs and core-tests will fail because it does not include the attached jar while compiling.

          Show
          Thejas M Nair added a comment - +1 . new patch looks good Hudson findbugs and core-tests will fail because it does not include the attached jar while compiling.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12428184/poregex2.patch
          against trunk revision 890596.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/130/testReport/
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/130/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428184/poregex2.patch against trunk revision 890596. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to cause Findbugs to fail. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/130/testReport/ Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/130/console This message is automatically generated.
          Hide
          Ankit Modi added a comment -

          Rewrote some logic in case 1 and 3 of determineBestRegex. Also found a bug in case1 so updated that.

          Added Thejas's recommendation.

          Also added a few unit test patterns.

          Show
          Ankit Modi added a comment - Rewrote some logic in case 1 and 3 of determineBestRegex. Also found a bug in case1 so updated that. Added Thejas's recommendation. Also added a few unit test patterns.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12428066/poregex2.patch
          against trunk revision 890596.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/127/testReport/
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/127/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428066/poregex2.patch against trunk revision 890596. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to cause Findbugs to fail. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/127/testReport/ Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/127/console This message is automatically generated.
          Hide
          Ankit Modi added a comment -

          Here are numbers comparing comparing optimization 1&2 against optimization 1 & dk.brics

          dk.brics.Runautomaton is as fast as optimization 2 and also provides similar speeds in a set of additional expressions.

          Query svn_trunk std_dev Optimization 1 & 2 std_dev Optimization 1 & brics.RunAutomaton std_dev
          .*ABCD.* 33.87 0.71 18.77 0.71 18.94 0.02
          .*ABCD 30.06 2.91 18.44 0.05 18.94 0.03
          ABCD.* 21.93 2.91 18.35 0.1 18.85 0.04

          Values are averaged over 3 runs.

          Show
          Ankit Modi added a comment - Here are numbers comparing comparing optimization 1&2 against optimization 1 & dk.brics dk.brics.Runautomaton is as fast as optimization 2 and also provides similar speeds in a set of additional expressions. Query svn_trunk std_dev Optimization 1 & 2 std_dev Optimization 1 & brics.RunAutomaton std_dev .*ABCD.* 33.87 0.71 18.77 0.71 18.94 0.02 .*ABCD 30.06 2.91 18.44 0.05 18.94 0.03 ABCD.* 21.93 2.91 18.35 0.1 18.85 0.04 Values are averaged over 3 runs.
          Hide
          Thejas M Nair added a comment -

          The new patch does not address the following comment -

          • RegexInit.java, in determineBestRegexMethod Line 85 - 120
            There are while loops where we are testing only for preceding '\'

          For example -
          $

                      if( index != 0 ) {
                          while( pattern.charAt(index -1 ) == '\\' ) {
                              index = pattern.indexOf('[');
                          }
          
          $

          Should be -

          $

                      if( index != 0 ) {
                          while( isCharEscaped(pattern,index) ) {
                              index = pattern.indexOf('[');
                          }
          $

          There are 3 such while loops between lines 68-96 (in new patch) that needs to be fixed.

          • The indentation in RegexInit.isCharEscaped needs to be fixed.

          Everything else looks good.

          Show
          Thejas M Nair added a comment - The new patch does not address the following comment - RegexInit.java, in determineBestRegexMethod Line 85 - 120 There are while loops where we are testing only for preceding '\' For example - $ if ( index != 0 ) { while ( pattern.charAt(index -1 ) == '\\' ) { index = pattern.indexOf('['); } $ Should be - $ if ( index != 0 ) { while ( isCharEscaped(pattern,index) ) { index = pattern.indexOf('['); } $ There are 3 such while loops between lines 68-96 (in new patch) that needs to be fixed. The indentation in RegexInit.isCharEscaped needs to be fixed. Everything else looks good.
          Hide
          Ankit Modi added a comment -
          • NonConstantRegex - I did not think of equals. But I added a length check before as it could find out change in length faster and to best of my knowledge its a getMethod. And yes as you mentioned equals will check for same object and instanceOf which is not useful in our case.
          • The numbers published above are using dk.brics.automaton.RunAutomaton. Do you want me to publish numbers for more set of regexs ?

          I'll create a patch for rest of the comments.

          Show
          Ankit Modi added a comment - NonConstantRegex - I did not think of equals. But I added a length check before as it could find out change in length faster and to best of my knowledge its a getMethod. And yes as you mentioned equals will check for same object and instanceOf which is not useful in our case. The numbers published above are using dk.brics.automaton.RunAutomaton. Do you want me to publish numbers for more set of regexs ? I'll create a patch for rest of the comments.
          Hide
          Thejas M Nair added a comment -

          Reviewed the latest patch.
          Comments :

          • RegexInit.java, in determineBestRegexMethod Line 85 - 120
            There are while loops where we are testing only for preceding '\'
            The handling of preceding escapes could be done in a separate function, since the logic is used at multiple places
          • RegexInit.java lines 61,147
            // This is the case when an old number of escapes
            I believe you meant "odd"
          • RegexImpl.java - following comments are not relevant anymore
            +// LHSXXXX means LHS is constantExpression and RHS varies with each Tuple
            +// RHSXXXX means RHS is constantExpression and LHS varies with each Tuple
          • NonConstantRegex , line 34-35
            || rhs.length() != oldString.length()
            || rhs.compareTo(oldString) != 0
            

            could be simplified as -

            || !rhs.equals(oldString)
            

            Did you chose the former because it might be faster ? That can be the case in this situation, because equals has a additional check of - "instanceOf String" . So I think the existing code is fine. A comment there might be useful.

          Can you also publish your numbers for the comparison of dk.brics.automaton.RunAutomaton and optimization 2 (Use string comparisons for simple common regexes ) in the jira ?

          Show
          Thejas M Nair added a comment - Reviewed the latest patch. Comments : RegexInit.java, in determineBestRegexMethod Line 85 - 120 There are while loops where we are testing only for preceding '\' The handling of preceding escapes could be done in a separate function, since the logic is used at multiple places RegexInit.java lines 61,147 // This is the case when an old number of escapes I believe you meant "odd" RegexImpl.java - following comments are not relevant anymore +// LHSXXXX means LHS is constantExpression and RHS varies with each Tuple +// RHSXXXX means RHS is constantExpression and LHS varies with each Tuple NonConstantRegex , line 34-35 || rhs.length() != oldString.length() || rhs.compareTo(oldString) != 0 could be simplified as - || !rhs.equals(oldString) Did you chose the former because it might be faster ? That can be the case in this situation, because equals has a additional check of - "instanceOf String" . So I think the existing code is fine. A comment there might be useful. Can you also publish your numbers for the comparison of dk.brics.automaton.RunAutomaton and optimization 2 (Use string comparisons for simple common regexes ) in the jira ?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427913/poregex2.patch
          against trunk revision 889870.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/testReport/
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427913/poregex2.patch against trunk revision 889870. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to cause Findbugs to fail. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/testReport/ Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/console This message is automatically generated.
          Hide
          Ankit Modi added a comment -

          I have included changes suggested by Thejas.

          Show
          Ankit Modi added a comment - I have included changes suggested by Thejas.
          Hide
          Thejas M Nair added a comment -

          Review comments:

          • The regex will always be on the rhs. So we don't need the code/classes which tries to determine which side has the regular expression based on which side has constant.
          • in determineBestRegexMethod, need to add "(?" to the list of regex strings not supported in dk.bricks (in javaRegexOnly) . It has special meanings in java regex, which is not honored by dk.brics .
          • in determineBestRegexMethod, We are dealing with cases like "\d" (choose java regex), "
            d" (choose dk.brics), but not dealing with "\\\d" (which should be choose java regex). ie we need to go back until we find a non '\' char.
          • in RegexInit.compile(..), the following message is more appropriate at debug level, not at info . At info level, it might also confuse the user.
            + log.info("Got an IllegalArgumentException for Pattern: " + pattern );
            + log.info(e.getMessage());
            + log.info("Switching to java.util.regex" );
          • The following comment in PORegex.java seems to be out of place .
            // This is a BinaryComparisonOperator hence there can only be two inputs
          Show
          Thejas M Nair added a comment - Review comments: The regex will always be on the rhs. So we don't need the code/classes which tries to determine which side has the regular expression based on which side has constant. in determineBestRegexMethod, need to add "(?" to the list of regex strings not supported in dk.bricks (in javaRegexOnly) . It has special meanings in java regex, which is not honored by dk.brics . in determineBestRegexMethod, We are dealing with cases like "\d" (choose java regex), " d" (choose dk.brics), but not dealing with "\\\d" (which should be choose java regex). ie we need to go back until we find a non '\' char. in RegexInit.compile(..), the following message is more appropriate at debug level, not at info . At info level, it might also confuse the user. + log.info("Got an IllegalArgumentException for Pattern: " + pattern ); + log.info(e.getMessage()); + log.info("Switching to java.util.regex" ); The following comment in PORegex.java seems to be out of place . // This is a BinaryComparisonOperator hence there can only be two inputs
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427730/automaton.jar
          against trunk revision 889346.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/117/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427730/automaton.jar against trunk revision 889346. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/117/console This message is automatically generated.
          Hide
          Ankit Modi added a comment -

          One small change to JarManager.java is missing. Will add a new patch with it.

          Show
          Ankit Modi added a comment - One small change to JarManager.java is missing. Will add a new patch with it.
          Hide
          Ankit Modi added a comment -

          New patch with removed comments and added automaton.jar from http://www.brics.dk/~amoeller/automaton/automaton.jar.

          It fails findBugs due to missing symbols. I ran the findBugs after adding the jar to the build and it did not complain about any findBugs in the modified and added files.

          Show
          Ankit Modi added a comment - New patch with removed comments and added automaton.jar from http://www.brics.dk/~amoeller/automaton/automaton.jar . It fails findBugs due to missing symbols. I ran the findBugs after adding the jar to the build and it did not complain about any findBugs in the modified and added files.
          Hide
          Ankit Modi added a comment -

          Attaching one more file of patch. This one has all the changes, except changes to build.xml. Still trying to find a maven repo for dk.brics.automaton.

          Show
          Ankit Modi added a comment - Attaching one more file of patch. This one has all the changes, except changes to build.xml. Still trying to find a maven repo for dk.brics.automaton.
          Hide
          Ankit Modi added a comment -

          These are patches for two implementations

          One (poregex.patch) is an implementation applying optimization mentioned above in the JIRA.
          Second (poregex2.patch) implementation applies optimization 1 and uses dk.brics.automaton for running simple regular expressions. Otherwise it reverts back to java.util.regex.

          In 1 the decision to use optimization two or use java.util.regex is decided by getSimpleString method

          In 2 the decision to use dk.brics.automaton is done by determineBestRegexMethod. ( changes to build.xml is this patch are temporary )

          Both patches use RegexInit as an implementation which makes a decision ( calling the above mentioned decision functions ) and then sets the implementation to one decided by the decision function.

          In second patch, the decision function was created looking at the support of operators in dk.brics.automaton and its grammar. I tried out the classes supported and not supported in dk.brics.automaton and decided upon it.

          I could not find any specific page mentioning the difference between regex language java.util.regex and dk.brics.automaton.

          Show
          Ankit Modi added a comment - These are patches for two implementations One (poregex.patch) is an implementation applying optimization mentioned above in the JIRA. Second (poregex2.patch) implementation applies optimization 1 and uses dk.brics.automaton for running simple regular expressions. Otherwise it reverts back to java.util.regex. In 1 the decision to use optimization two or use java.util.regex is decided by getSimpleString method In 2 the decision to use dk.brics.automaton is done by determineBestRegexMethod. ( changes to build.xml is this patch are temporary ) Both patches use RegexInit as an implementation which makes a decision ( calling the above mentioned decision functions ) and then sets the implementation to one decided by the decision function. In second patch, the decision function was created looking at the support of operators in dk.brics.automaton and its grammar. I tried out the classes supported and not supported in dk.brics.automaton and decided upon it. I could not find any specific page mentioning the difference between regex language java.util.regex and dk.brics.automaton.
          Hide
          Thejas M Nair added a comment -

          Never mind the question above, I notice that you have mentioned that the second patch has "optimization 1,2 and dk.brics.automaton".

          Show
          Thejas M Nair added a comment - Never mind the question above, I notice that you have mentioned that the second patch has "optimization 1,2 and dk.brics.automaton".
          Hide
          Thejas M Nair added a comment -

          In the above performance numbers, I assume optimization 2 (custom string comparison) is used only for the regex ".ABCD." , while optimization 1 (re-using compiled pattern) is used with dk.brics.automaton as well. Can you please confirm ?

          From the performance numbers, it looks like we don't need to do optimization 2. We can just use dk.brics.automaton for the common regexes as well and keep the pig code simpler.

          Show
          Thejas M Nair added a comment - In the above performance numbers, I assume optimization 2 (custom string comparison) is used only for the regex ". ABCD. " , while optimization 1 (re-using compiled pattern) is used with dk.brics.automaton as well. Can you please confirm ? From the performance numbers, it looks like we don't need to do optimization 2. We can just use dk.brics.automaton for the common regexes as well and keep the pig code simpler.
          Hide
          Ankit Modi added a comment -

          I implemented a patch with optimization 1 and 2 mentioned above and another patch with optimization 1,2 and dk.brics.automaton.

          dk.brics.automaton does not support all features of java.util.regex hence the second patch considers that and switches to java.util.regex if the regex can only be handled by java.util.regex.

          Here are the numbers

          Regex svn_trunk Optimization 1 and 2 dk.brics.automaton comments
          .*ABCD.* 92.74 50.92 49.32 Here only optimization 2 is used
          .*[A-F] {2,3}

          .*

          152.3 133.48 105.93 dk.brics.automaton is used
          A.B.C.D 54.492 44.46 44.66 dk.brics.automaton is used
          .*([A-F] {4}

          )\w*\1.*

          129.29 112.89 109.43 java.util.regex used in all cases
          .*[A-F]{4}\w*[N-Z]{3}.* 129.63 108.11 54.42 dk.brics.automaton used

          These results were obtained using Local Mode on 1 Billion lines of data of following format
          f1:Chararray(100) of random chars from [A-Z]
          f2:int random integer

          dk.brics.automaton provides good performance in case of complex regex.

          Show
          Ankit Modi added a comment - I implemented a patch with optimization 1 and 2 mentioned above and another patch with optimization 1,2 and dk.brics.automaton. dk.brics.automaton does not support all features of java.util.regex hence the second patch considers that and switches to java.util.regex if the regex can only be handled by java.util.regex. Here are the numbers Regex svn_trunk Optimization 1 and 2 dk.brics.automaton comments .*ABCD.* 92.74 50.92 49.32 Here only optimization 2 is used .* [A-F] {2,3} .* 152.3 133.48 105.93 dk.brics.automaton is used A.B.C.D 54.492 44.46 44.66 dk.brics.automaton is used .*( [A-F] {4} )\w*\1.* 129.29 112.89 109.43 java.util.regex used in all cases .*[A-F]{4}\w* [N-Z] {3}.* 129.63 108.11 54.42 dk.brics.automaton used These results were obtained using Local Mode on 1 Billion lines of data of following format f1:Chararray(100) of random chars from [A-Z] f2:int random integer dk.brics.automaton provides good performance in case of complex regex.
          Hide
          Thejas M Nair added a comment -

          I found another regex library that is supposed to be faster than java.util.regex . - dk.brics.automaton.RegExp (BSD license, used in apache nutch). It does not support all features of java regex, but it is a candidate that can be used for purposes of this patch (common simpler regexes).

          It is faster than java regex, but much slower than 'optimization2' (see numbers in code comments below)

                  String prefix = "123";
                  Pattern p =  Pattern.compile("123.*");
                  RegExp r = new RegExp("123.*");
                  Automaton a = r.toAutomaton();
          
                  while((str = in.readLine()) != null ){
                      
                    // optimization 1 - takes 30 secs
          //            if((p.matcher(str).matches()))
          //                matches++;
                      
          
                      //optimization 2 - takes 15 secs
          //            int len = prefix.length();
          //            boolean matched = true;
          //            for(int i=0; i<len; i++){
          //                if(prefix.charAt(i) != str.charAt(i)){
          //                    matched = false;
          //                    break;
          //                }
          //            }
          //            if(matched)
          //                matches++;
          
                      // dk.brics.automaton - takes 25 secs
          //            if(a.run(str))
          //                matches++;
          
                      tot++;
                  }
          
          Show
          Thejas M Nair added a comment - I found another regex library that is supposed to be faster than java.util.regex . - dk.brics.automaton.RegExp (BSD license, used in apache nutch). It does not support all features of java regex, but it is a candidate that can be used for purposes of this patch (common simpler regexes). It is faster than java regex, but much slower than 'optimization2' (see numbers in code comments below) String prefix = "123" ; Pattern p = Pattern.compile( "123.*" ); RegExp r = new RegExp( "123.*" ); Automaton a = r.toAutomaton(); while ((str = in.readLine()) != null ){ // optimization 1 - takes 30 secs // if ((p.matcher(str).matches())) // matches++; //optimization 2 - takes 15 secs // int len = prefix.length(); // boolean matched = true ; // for ( int i=0; i<len; i++){ // if (prefix.charAt(i) != str.charAt(i)){ // matched = false ; // break ; // } // } // if (matched) // matches++; // dk.brics.automaton - takes 25 secs // if (a.run(str)) // matches++; tot++; }
          Hide
          Thejas M Nair added a comment -

          The 'common' use case to which these optimization apply usually has a constant string specifying the pattern. It makes sense to use this optimization only (specifically optimization 2) in such cases, so that the worst case is not worse off.

          Another thing to check is if there are alternative faster regex implementations .

          Show
          Thejas M Nair added a comment - The 'common' use case to which these optimization apply usually has a constant string specifying the pattern. It makes sense to use this optimization only (specifically optimization 2) in such cases, so that the worst case is not worse off. Another thing to check is if there are alternative faster regex implementations .
          Hide
          Thejas M Nair added a comment -

          Hive like clause implementation is here - http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop
          /hive/ql/udf/UDFLike.java?revision=802066&view=markup

          I ran simple tests with a simple java program to see the impact of these optimizations. Optimization 1 reduces runtime to 1/2, optimization 2 reduces runtime to 1/4 .

                  int matches =0;
                  int tot = 0;
                  String prefix = "123";
                  Pattern p =  Pattern.compile("123.*");
                  while((str = in.readLine()) != null ){
          
          
          
                      //without proposed optimizations
                      //test setups 1 and 2 took 9secs, 126 secs respectively
          //            if(str.matches("123.*"))
          //                matches++;
          
          
          
                      // with optimization 1
          //            test sestups 1, 2 took  4, 57 secs respectively
          //            if((p.matcher(str).matches()))
          //                matches++;
                      
          
                      // with optimization 1
          //            test sestups 1, 2 took  2.5, 25 secs respectively
                      //takes 2.5, 25 secs
          //            int len = prefix.length();
          //            boolean matched = true;
          //            for(int i=0; i<len; i++){
          //                if(prefix.charAt(i) != str.charAt(i)){
          //                    matched = false;
          //                    break;
          //                }
          //            }
          //            if(matched)
          //                matches++;
          
                      tot++;
                  }
                 }
                  System.out.println("matches " + matches + " tot " + tot);
          
          Show
          Thejas M Nair added a comment - Hive like clause implementation is here - http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop /hive/ql/udf/UDFLike.java?revision=802066&view=markup I ran simple tests with a simple java program to see the impact of these optimizations. Optimization 1 reduces runtime to 1/2, optimization 2 reduces runtime to 1/4 . int matches =0; int tot = 0; String prefix = "123" ; Pattern p = Pattern.compile( "123.*" ); while ((str = in.readLine()) != null ){ //without proposed optimizations //test setups 1 and 2 took 9secs, 126 secs respectively // if (str.matches( "123.*" )) // matches++; // with optimization 1 // test sestups 1, 2 took 4, 57 secs respectively // if ((p.matcher(str).matches())) // matches++; // with optimization 1 // test sestups 1, 2 took 2.5, 25 secs respectively //takes 2.5, 25 secs // int len = prefix.length(); // boolean matched = true ; // for ( int i=0; i<len; i++){ // if (prefix.charAt(i) != str.charAt(i)){ // matched = false ; // break ; // } // } // if (matched) // matches++; tot++; } } System .out.println( "matches " + matches + " tot " + tot);

            People

            • Assignee:
              Ankit Modi
              Reporter:
              Thejas M Nair
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development