Hive
  1. Hive
  2. HIVE-2327

Optimize REGEX UDFs with constant parameter information

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0, 2.0.0
    • Component/s: UDF
    • Labels:
      None

      Description

      There are a lot of UDFs which would show major performance differences if one assumes that some of its arguments are constant.

      Consider, for example, any UDF that takes a regular expression as input: This can be complied once (fast) if it's a constant, or once per row (wicked slow) if it's not a constant.

      Or, consider any UDF that reads from a file and/or takes a filename as input; it would have to re-read the whole file if the filename changes.

      1. HIVE-2327.2.patch
        34 kB
        Alexander Pivovarov
      2. HIVE-2327.01.patch
        17 kB
        Alexander Pivovarov

        Issue Links

          Activity

          Adam Kramer created issue -
          Adam Kramer made changes -
          Field Original Value New Value
          Summary UDFs should be made aware of their arguments are constants. UDFs should be made aware when their arguments are constants.
          Hide
          John Sichi added a comment -

          Is this different from HIVE-1360?

          Show
          John Sichi added a comment - Is this different from HIVE-1360 ?
          John Sichi made changes -
          Link This issue duplicates HIVE-1360 [ HIVE-1360 ]
          John Sichi made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Hide
          Carl Steinbach added a comment -

          HIVE-1360 made it possible for UDFs to detect constant parameters, but it didn't update any of the regex UDFs to take advantage of this feature.

          Show
          Carl Steinbach added a comment - HIVE-1360 made it possible for UDFs to detect constant parameters, but it didn't update any of the regex UDFs to take advantage of this feature.
          Carl Steinbach made changes -
          Resolution Duplicate [ 3 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Hide
          John Sichi added a comment -

          OK, then I guess this issue should be renamed?

          Show
          John Sichi added a comment - OK, then I guess this issue should be renamed?
          Carl Steinbach made changes -
          Summary UDFs should be made aware when their arguments are constants. Optimize REGEX UDFs with constant parameter information
          Alexander Pivovarov made changes -
          Component/s UDF [ 12313585 ]
          Alexander Pivovarov made changes -
          Assignee Alexander Pivovarov [ apivovarov ]
          Alexander Pivovarov made changes -
          Remote Link This issue links to "https://reviews.apache.org/r/32807/ (Web Link)" [ 24653 ]
          Hide
          Alexander Pivovarov added a comment -

          patch #01
          Refactor UDF to derive from GenericUDF which supports constant arguments

          Show
          Alexander Pivovarov added a comment - patch #01 Refactor UDF to derive from GenericUDF which supports constant arguments
          Alexander Pivovarov made changes -
          Attachment HIVE-2327.01.patch [ 12709172 ]
          Alexander Pivovarov made changes -
          Status Reopened [ 4 ] Patch Available [ 10002 ]
          Hide
          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12709172/HIVE-2327.01.patch

          ERROR: -1 due to 4 failed/errored test(s), 8704 tests executed
          Failed tests:

          TestMinimrCliDriver-smb_mapjoin_8.q - did not produce a TEST-*.xml file
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorization_short_regress
          org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_short_regress
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3274/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3274/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3274/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 4 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12709172 - PreCommit-HIVE-TRUNK-Build

          Show
          Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12709172/HIVE-2327.01.patch ERROR: -1 due to 4 failed/errored test(s), 8704 tests executed Failed tests: TestMinimrCliDriver-smb_mapjoin_8.q - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorization_short_regress org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_vectorization_short_regress Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3274/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3274/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3274/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 4 tests failed This message is automatically generated. ATTACHMENT ID: 12709172 - PreCommit-HIVE-TRUNK-Build
          Hide
          Ashutosh Chauhan added a comment -

          Vectorization failure looks relevant.

          Show
          Ashutosh Chauhan added a comment - Vectorization failure looks relevant.
          Ashutosh Chauhan made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Alexander Pivovarov added a comment -

          I run vectorization_short_regress.q test locally and noticed couple changes in q.out file

          • rlike is replaced with regexp (they are synonyms)
          • parentesises around "(a regexp b)" are removed in explain query output.

          I also noticed that old UDFRegExp registration set isOperator=true

          system.registerUDF("rlike", UDFRegExp.class, true);
          system.registerUDF("regexp", UDFRegExp.class, true);
          

          But new implementation extends GenericUDF. Generic UDF registration does not have isOperator parameter. Can it cause any issues?

          system.registerGenericUDF("rlike", GenericUDFRegExp.class);
          system.registerGenericUDF("regexp", GenericUDFRegExp.class);
          
          Show
          Alexander Pivovarov added a comment - I run vectorization_short_regress.q test locally and noticed couple changes in q.out file rlike is replaced with regexp (they are synonyms) parentesises around "(a regexp b)" are removed in explain query output. I also noticed that old UDFRegExp registration set isOperator=true system.registerUDF( "rlike" , UDFRegExp.class, true ); system.registerUDF( "regexp" , UDFRegExp.class, true ); But new implementation extends GenericUDF. Generic UDF registration does not have isOperator parameter. Can it cause any issues? system.registerGenericUDF( "rlike" , GenericUDFRegExp.class); system.registerGenericUDF( "regexp" , GenericUDFRegExp.class);
          Hide
          Alexander Pivovarov added a comment -

          I found the reason for diff in q.out files and the place where isOperator is used.

          1. GenericUDFBridge.isOperator is only used in getDisplayString method
          This method adds parenthesis around "(a regexp b)"

          if (isOperator) {
            ...
            return "(" + children[0] + " " + udfName + " " + children[1] + ")";
          }
          

          I do not think we need parenthesis in getDisplayString output. This is why new GenericUDFRegExp.getDisplayString() returns just

          @Override
          public String getDisplayString(String[] children) {
            return children[0] + " regexp " + children[1];
          }
          

          2. The reason why rlike is replaced with regexp in query plan is because GenericUDFRegExp.getFuncName returns "regexp" (because it's primary name for the function)

            @Override
            protected String getFuncName() {
              return "regexp";
            }
          

          I'll update q.out files soon

          Show
          Alexander Pivovarov added a comment - I found the reason for diff in q.out files and the place where isOperator is used. 1. GenericUDFBridge.isOperator is only used in getDisplayString method This method adds parenthesis around "(a regexp b)" if (isOperator) { ... return "(" + children[0] + " " + udfName + " " + children[1] + ")" ; } I do not think we need parenthesis in getDisplayString output. This is why new GenericUDFRegExp.getDisplayString() returns just @Override public String getDisplayString( String [] children) { return children[0] + " regexp " + children[1]; } 2. The reason why rlike is replaced with regexp in query plan is because GenericUDFRegExp.getFuncName returns "regexp" (because it's primary name for the function) @Override protected String getFuncName() { return "regexp" ; } I'll update q.out files soon
          Hide
          Alexander Pivovarov added a comment -

          patch #2:

          • updated q.out files
          Show
          Alexander Pivovarov added a comment - patch #2: updated q.out files
          Alexander Pivovarov made changes -
          Attachment HIVE-2327.2.patch [ 12732752 ]
          Alexander Pivovarov made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12732752/HIVE-2327.2.patch

          ERROR: -1 due to 3 failed/errored test(s), 8882 tests executed
          Failed tests:

          TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto_partitioned
          org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver_encryption_insert_partition_static
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3888/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3888/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3888/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 3 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12732752 - PreCommit-HIVE-TRUNK-Build

          Show
          Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12732752/HIVE-2327.2.patch ERROR: -1 due to 3 failed/errored test(s), 8882 tests executed Failed tests: TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto_partitioned org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver_encryption_insert_partition_static Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3888/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3888/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3888/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 3 tests failed This message is automatically generated. ATTACHMENT ID: 12732752 - PreCommit-HIVE-TRUNK-Build
          Hide
          Alexander Pivovarov added a comment -

          Ashutosh Chauhan, tests look good now. I do not think two failed partition tests relate to patch #2

          Show
          Alexander Pivovarov added a comment - Ashutosh Chauhan , tests look good now. I do not think two failed partition tests relate to patch #2
          Hide
          Ashutosh Chauhan added a comment -

          This patch ports regex udf to GenericUDF framework, which in itself is a good enough reason to commit this, since we want all our udfs to move to that eventually. But I am skeptical that whether it will actually regress in terms of performance. Reason being said optimization of not repeatedly compiling regex is already there in previous version. New version introduces additional branches, virtual function calls and actual logic computation in evaluate()
          Can you do before and after perf comparison for this change ?

          Show
          Ashutosh Chauhan added a comment - This patch ports regex udf to GenericUDF framework, which in itself is a good enough reason to commit this, since we want all our udfs to move to that eventually. But I am skeptical that whether it will actually regress in terms of performance. Reason being said optimization of not repeatedly compiling regex is already there in previous version. New version introduces additional branches, virtual function calls and actual logic computation in evaluate() Can you do before and after perf comparison for this change ?
          Hide
          Alexander Pivovarov added a comment -

          Is there a standard approach to do UDF performance test in hive?
          Is there any tables which are usually used for UDF performance tests in hive?

          If not then I can create 1 Mil rows table with string column t1 containing random strings 100 chars each.
          Then I run smth like:

          select count(*) from t where t.t1 regexp '.*abc.*';
          
          Show
          Alexander Pivovarov added a comment - Is there a standard approach to do UDF performance test in hive? Is there any tables which are usually used for UDF performance tests in hive? If not then I can create 1 Mil rows table with string column t1 containing random strings 100 chars each. Then I run smth like: select count(*) from t where t.t1 regexp '.*abc.*';
          Hide
          Ashutosh Chauhan added a comment -

          No there is no standard approach. Your test sounds good. I will modify the query to have a map-only query, so that other overheads don't come in play, something like:

          select t1 from t where t.t1 regexp '.*abc.*';
          
          Show
          Ashutosh Chauhan added a comment - No there is no standard approach. Your test sounds good. I will modify the query to have a map-only query, so that other overheads don't come in play, something like: select t1 from t where t.t1 regexp '.*abc.*';
          Hide
          Alexander Pivovarov added a comment -

          I checked old and new regexp UDF performance. New implementation is 1% faster.

          table files generator

          import java.io.FileNotFoundException;
          import java.io.PrintWriter;
          import java.math.BigInteger;
          import java.security.SecureRandom;
          
          public class RandomStringGenerator {
            private SecureRandom random = new SecureRandom();
          
            public String nextSessionId() {
              return new BigInteger(130, random).toString(32);
            }
          
            public static void main(String[] args) throws FileNotFoundException {
              RandomStringGenerator g = new RandomStringGenerator();
          
              // lets generate 10 files 10Mil rows each
              for (int j = 0; j < 10; j++) {
                System.out.println("start file " + j);
                PrintWriter pw = new PrintWriter("/tmp/rexexp_test/00000" + j + "_0");
                try {
                  for (int i = 0; i < 1000000; i++) {
                    String id = g.nextSessionId();
                    pw.println(id);
                  }
                } finally {
                  pw.close();
                }
              }
              System.out.println("All Done");
            }
          }
          

          create table

          hadoop fs -put -f /tmp/regexp_test /tmp
          
          create table regexp_test (
            a string
          )
          stored as textfile
          location '/tmp/regexp_test';
          

          test queries

          --1
          time bin/hive -e "select * from regexp_test where regexp(a, '.*abcd.*')"
          --2
          time bin/hive -e "select * from regexp_test where regexp(a, '.*efgh.*')"
          --3
          time bin/hive -e "select * from regexp_test where regexp(a, '.*ijkl.*')"
          --4
          time bin/hive -e "select a from regexp_test where regexp(a, '.*mnop.*')"
          

          old regexp implementation

          --1  233 rows
          real	1m6.881s
          user	1m10.582s
          sys	0m1.652s
          
          --2  247 rows
          real	1m6.520s
          user	1m10.082s
          sys	0m1.534s
          
          --3  224 rows
          real	1m8.037s
          user	1m11.718s
          sys	0m1.608s
          
          --4   rows 232
          real	1m6.698s
          user	1m10.378s
          sys	0m1.499s
          
          --AVG 67.034
          

          new regexp implementation

          --1  233 rows
          real	1m6.762s
          user	1m10.517s
          sys	0m1.471s
          
          --2  247 rows
          real	1m6.362s
          user	1m9.961s
          sys	0m1.558s
          
          --3  224 rows
          real	1m5.854s
          user	1m9.534s
          sys	0m1.452s
          
          --4  232 rows
          real	1m6.435s
          user	1m10.816s
          sys	0m1.571s
          
          --AVG 66.35325
          

          delta = AVG2 - AVG1 = 0.68075 sec
          new implementation is 1% faster (delta / max(AVG1, AVG2))

          Show
          Alexander Pivovarov added a comment - I checked old and new regexp UDF performance. New implementation is 1% faster. table files generator import java.io.FileNotFoundException; import java.io.PrintWriter; import java.math.BigInteger; import java.security.SecureRandom; public class RandomStringGenerator { private SecureRandom random = new SecureRandom(); public String nextSessionId() { return new BigInteger(130, random).toString(32); } public static void main( String [] args) throws FileNotFoundException { RandomStringGenerator g = new RandomStringGenerator(); // lets generate 10 files 10Mil rows each for ( int j = 0; j < 10; j++) { System .out.println( "start file " + j); PrintWriter pw = new PrintWriter( "/tmp/rexexp_test/00000" + j + "_0" ); try { for ( int i = 0; i < 1000000; i++) { String id = g.nextSessionId(); pw.println(id); } } finally { pw.close(); } } System .out.println( "All Done" ); } } create table hadoop fs -put -f /tmp/regexp_test /tmp create table regexp_test ( a string ) stored as textfile location '/tmp/regexp_test'; test queries --1 time bin/hive -e "select * from regexp_test where regexp(a, '.*abcd.*')" --2 time bin/hive -e "select * from regexp_test where regexp(a, '.*efgh.*')" --3 time bin/hive -e "select * from regexp_test where regexp(a, '.*ijkl.*')" --4 time bin/hive -e "select a from regexp_test where regexp(a, '.*mnop.*')" old regexp implementation --1 233 rows real 1m6.881s user 1m10.582s sys 0m1.652s --2 247 rows real 1m6.520s user 1m10.082s sys 0m1.534s --3 224 rows real 1m8.037s user 1m11.718s sys 0m1.608s --4 rows 232 real 1m6.698s user 1m10.378s sys 0m1.499s --AVG 67.034 new regexp implementation --1 233 rows real 1m6.762s user 1m10.517s sys 0m1.471s --2 247 rows real 1m6.362s user 1m9.961s sys 0m1.558s --3 224 rows real 1m5.854s user 1m9.534s sys 0m1.452s --4 232 rows real 1m6.435s user 1m10.816s sys 0m1.571s --AVG 66.35325 delta = AVG2 - AVG1 = 0.68075 sec new implementation is 1% faster (delta / max(AVG1, AVG2))
          Hide
          Ashutosh Chauhan added a comment -

          Sounds good. +1

          Show
          Ashutosh Chauhan added a comment - Sounds good. +1
          Hide
          Alexander Pivovarov added a comment -

          Committed to trunk. Thank you Ashutosh for your review!

          Show
          Alexander Pivovarov added a comment - Committed to trunk. Thank you Ashutosh for your review!
          Alexander Pivovarov made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 1.3.0 [ 12332154 ]
          Resolution Fixed [ 1 ]
          Thejas M Nair made changes -
          Fix Version/s 2.0.0 [ 12332641 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          42d 23h 14m 1 John Sichi 12/Sep/11 23:06
          Resolved Resolved Reopened Reopened
          33m 34s 1 Carl Steinbach 12/Sep/11 23:39
          Reopened Reopened Patch Available Patch Available
          1298d 7h 1m 1 Alexander Pivovarov 03/Apr/15 06:41
          Patch Available Patch Available Open Open
          40d 15h 31m 1 Ashutosh Chauhan 13/May/15 22:12
          Open Open Patch Available Patch Available
          5h 32m 1 Alexander Pivovarov 14/May/15 03:45
          Patch Available Patch Available Resolved Resolved
          5d 19h 46m 1 Alexander Pivovarov 19/May/15 23:31

            People

            • Assignee:
              Alexander Pivovarov
              Reporter:
              Adam Kramer
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development