Hadoop Common / HADOOP-4070

[Hive] Provide a mechanism for registering UDFs from the query language

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      UDFs (user-defined functions) in Hive are currently actually built-in functions. This issue is to develop a packaging and registration mechanism so that users can add their own functions at runtime.

      1. hadoop-4070-v4.patch
        18 kB
        Ashish Thusoo
      2. hadoop-4070-v3.patch
        17 kB
        Tom White
      3. hadoop-4070-v2.patch
        18 kB
        Tom White
      4. hadoop-4070.patch
        12 kB
        Tom White

        Activity

        Tom White added a comment -

        This patch extends the grammar to support creating UDFs.

        For example, you can do:

        CREATE FUNCTION my_udf AS com.example.MyUdf;
        

        and then use my_udf just like any other function.

        The class defining the UDF (com.example.MyUdf) should be packaged in a jar file and made available to Hive by setting the HIVE_AUX_JARS_PATH environment variable (which specifies the directory containing the jar).
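
        For illustration, here is a minimal sketch of what a class like com.example.MyUdf might look like. The UDF base class and the reflection-resolved evaluate() convention are assumptions based on Hive's UDF API of this era, not something spelled out in this issue:

        package com.example;

        import org.apache.hadoop.hive.ql.exec.UDF;
        import org.apache.hadoop.io.Text;

        // Hypothetical example UDF that upper-cases a string value.
        public class MyUdf extends UDF {
          public Text evaluate(Text input) {
            if (input == null) {
              return null;
            }
            return new Text(input.toString().toUpperCase());
          }
        }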

        Note that there is a change in ExecDriver which removes some code that was setting the "tmpjars" property incorrectly. The problem was that the paths did not have the "file://" scheme prepended to them, as GenericOptionsParser does, which caused them to be interpreted as HDFS paths. With that code removed, GenericOptionsParser sets the variable correctly.

        I have tested it manually on a simple UDF; the next step is to write some automated tests.

        Zheng Shao added a comment -

        Can you add a test case, Tom?

        To add a test case, add a "my.q" file in hive/ql/src/test/queries/positive, run "ant test -Dtestcase=TestParse -Dqfile=my.q", and collect the expected output by searching for the failed "diff" command in the test log.
        Then do the same for hive/ql/src/test/queries/clientpositive, with "ant test -Dtestcase=TestCliDriver -Dqfile=my.q".

        Another thing: do you think AdminWork will be responsible for other work besides creating UDFs? If not, let's give it a more specific name (like CreateUDFWork).

        Prasad Chakka added a comment -

        Hi Tom,

        The reason for that code was that the user could specify the list of auxiliary jars in hive-default.xml (and also by setting hive.aux.jars.path as a session variable in the Hive command-line client), in which case the GenericOptionsParser code doesn't work. I think appending 'file://' is new in 0.19.

        Can you persist this mapping in the metastore? Otherwise CREATE needs to be issued every time a Hive session is brought up. Management functions such as drop and rename also need to be provided.

        Another alternative (a simple initial implementation, but sufficient for now) is to use reflection to load the function class: the function name and class name would be the same, the package would be predefined (org.apache.hadoop.hive.udf) or configurable, and the class would be loaded at runtime, as in the sketch below.
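
        A rough sketch of what that could look like; the package name comes from the proposal above, everything else is hypothetical:

        // Hypothetical loader: the function name doubles as the class name,
        // and the package is predefined (or read from configuration).
        public class ReflectiveFunctionLoader {
          private static final String UDF_PACKAGE = "org.apache.hadoop.hive.udf";

          public static Object loadFunction(String functionName) throws Exception {
            String className = UDF_PACKAGE + "." + functionName;
            // Load and instantiate the class at runtime; no registration step needed.
            Class<?> udfClass = Class.forName(className);
            return udfClass.getDeclaredConstructor().newInstance();
          }
        }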

        Ashish Thusoo added a comment -

        Cool...
        Could this be made part of DDLWork instead of AdminWork, so that it can all be launched from DDLTask? We have different DDL descriptors in DDLWork for things like create/drop/alter table, and functions could belong there too. Having a DDLFunctionsWork is also fine. I do think we should be more specific than "admin".

        There is also the question of making this mapping persistent (both Raghu and Prasad mentioned that) so that functions can be reused across sessions. Maybe we can have that discussion here as well.

        Tom White added a comment -

        Thanks for all your feedback. I'm working on a test case.

        The reason I called it AdminWork is that I was following MySQL's terminology. "DDLWork" and "DDLFunctionsWork" don't sound right, as function creation is not really part of DDL. I can rename it to something more specific: CreateFunctionWork?

        I agree that it would be good to make the mapping persistent, and add drop and rename, but these can be done in later Jiras.

        The reason for that code was that the user could specify the list of auxiliary jars in hive-default.xml (and also by setting hive.aux.jars.path as a session variable in the Hive command-line client), in which case the GenericOptionsParser code doesn't work. I think appending 'file://' is new in 0.19.

        Not sure about that, but HADOOP-3743 certainly changed things in that area. The problem is that if you set HIVE_AUX_JARS_PATH to, say, /foo, then the hive script sets the session variable hive.aux.jars.path to /foo rather than file:///foo. So for the moment I've been working around this by redefining the session variable. What's the best way of fixing this?

        Prasad Chakka added a comment -

        You are correct that we do need to prepend file:// to files specified in the session parameter hive.aux.jars.path. But what I meant is that instead of removing the code snippet in ExecDriver, we should mimic what GenericOptionsParser does (possibly refactoring that code out), as sketched below. That would also handle the case where the user changes this variable between queries. Another option would be to mandate the file:// scheme in the hive.aux.jars.path parameter itself, but I prefer the former.
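
        A minimal sketch of the kind of normalization being suggested, assuming a standalone helper rather than the actual GenericOptionsParser code: any bare local path in the comma-separated jar list gets a file scheme so it is not read as an HDFS path.

        import java.io.File;
        import java.net.URI;

        public class AuxJarsPathNormalizer {
          // Hypothetical helper: prepend a file scheme to schemeless entries in a
          // comma-separated list such as hive.aux.jars.path.
          public static String normalize(String jarList) {
            StringBuilder result = new StringBuilder();
            for (String entry : jarList.split(",")) {
              if (result.length() > 0) {
                result.append(',');
              }
              String path = entry.trim();
              if (URI.create(path).getScheme() == null) {
                // Bare path: resolve to an absolute file: URI.
                result.append(new File(path).toURI().toString());
              } else {
                result.append(path);
              }
            }
            return result.toString();
          }
        }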

        Ashish Thusoo added a comment -

        I'm not sure it's true that creating/renaming/altering functions is not part of DDL; most databases bin them in that bucket. Case in point: the MySQL manual (link below) clearly lists CREATE FUNCTION in the data definition section.

        http://dev.mysql.com/doc/refman/5.0/en/sql-syntax-data-definition.html

        I am not too hung up on this, though. FunctionWork (instead of CreateFunctionWork) is also fine with me, though I would like it to bin all the different descriptors for manipulating functions rather than have a separate work class for each manipulation (create/alter/rename/drop, etc.).

        We can start off with this being non-persistent, but then the syntax should make it explicit that the function is non-persistent.

        One proposal could be

        CREATE TEMPORARY FUNCTION blah AS blah

        Also, for temporary functions, reflection is another way to structure this, as Prasad pointed out, but that would not work if we allow non-Java functions, so I think having syntax is useful. We could still optimize the Java case to the point that creating a new function just means dropping in a jar that implements the UDF interfaces (including something like a displayName() method, which would allow the query-language name to differ from the implementation name); see the sketch below.
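
        A purely illustrative sketch of that displayName() idea; nothing like it exists in this patch:

        // Hypothetical interface a drop-in UDF jar could implement so that the
        // query-language name is decoupled from the implementing class name.
        public interface NamedFunction {
          String displayName();
        }

        // This UDF would register itself as "my_udf" rather than "MyUdf".
        class MyUdf implements NamedFunction {
          public String displayName() {
            return "my_udf";
          }
        }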

        Tom White added a comment -

        Here's an updated patch that addresses some of the comments.

        • I've reverted the change to ExecDriver. (Set hive.aux.jars.path as a session variable to pick up jars.)
        • AdminWork is now FunctionWork. Other descriptors can be added to this class as needed.
        • The syntax is now
          CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf';
          
        • I've added a client unit test which uses a test UDF packaged in a jar.
        Ashish Thusoo added a comment -

        +1

        Looks good. Thanks for the contribution.

        The only comment I have is that the patch duplicates the compile-test code in ql/build.xml.

        It would be better to introduce a new target for building the UDF jar (along the same lines as gen-test) and have a dummy gen-test-jar in build-common.xml that gets overridden in ql/build.xml.

        It's up to you whether you want to incorporate that; this is good as is.

        Thanks again for the contribution... keep them rolling.

        Zheng Shao added a comment -

        +1

        This diff probably conflicts with Ashish's diff on explain (which merged UDFRegistry and UDAFRegistry). We should find a way to merge them.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12390274/hadoop-4070-v2.patch
        against trunk revision 696551.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/console

        This message is automatically generated.

        Tom White added a comment -

        New patch which includes Ashish's suggestion to reduce duplication in the build files.

        Ashish Thusoo added a comment -

        +1

        Looks good.

        As Zheng points out, this will conflict with my changes, but we can work that out at commit time.

        Ashish Thusoo added a comment -

        The conflict will be on line 45 of FunctionTask.java: instead of UDFRegistry we now have FunctionRegistry.

        We can fix explain plan for this later...

        Ashish Thusoo added a comment -

        Cancelling, as this will no longer compile now that Dhruba has checked in my explain plan transaction.

        Ashish Thusoo added a comment -

        Resolved conflict that Tom's patch had with explain plan and added an explain plan test.

        Tom, I hope you don't mind me doing this; otherwise I think this may not make it into tomorrow's 0.19 code freeze.

        Ashish Thusoo added a comment -

        Resolved conflicts with my explain plan transaction.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12390402/hadoop-4070-v4.patch
        against trunk revision 696846.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/console

        This message is automatically generated.

        Tom White added a comment -

        I've just committed this.

        The test failures are unrelated to this patch. Thanks Ashish for fixing the conflict with the explain plan patch.

        Hudson added a comment -

        Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/)

          People

          • Assignee: Tom White
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 2
