Issue Details (XML | Word | Printable)

Key: HADOOP-4070
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Tom White
Reporter: Tom White
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

[Hive] Provide a mechanism for registering UDFs from the query language

Created: 04/Sep/08 04:40 PM   Updated: 08/Jul/09 05:06 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works hadoop-4070-v2.patch 2008-09-17 02:30 PM Tom White 18 kB
Text File Licensed for inclusion in ASF works hadoop-4070-v3.patch 2008-09-18 11:53 AM Tom White 17 kB
Text File Licensed for inclusion in ASF works hadoop-4070-v4.patch 2008-09-18 06:54 PM Ashish Thusoo 18 kB
Text File Licensed for inclusion in ASF works hadoop-4070.patch 2008-09-11 04:21 PM Tom White 12 kB

Hadoop Flags: Reviewed
Resolution Date: 19/Sep/08 10:20 AM


 Description  « Hide
UDFs (user defined functions) in Hive are currently actually built-in functions. This issue is to develop a packaging and registration mechanism so users can add their own functions at runtime.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tom White added a comment - 11/Sep/08 04:21 PM
This patch extends the grammar to support creating UDFs.

For example you can do

CREATE FUNCTION my_udf AS com.example.MyUdf;

then use the my_udf function just like any other function.

The class defining the UDF (com.example.MyUdf) should be packed in a jar file then made available to hive by setting the HIVE_AUX_JARS_PATH environment variable (which specifies the directory that the jar is in).

Note that there is a change in ExecDriver which removes some code that was setting the "tmpjars" property incorrectly. The problem was that the paths were not have the "file://" scheme prepended to them as GenericOptionsParser does, which caused the paths to be interpreted as HDFS paths. By removing the code GenericOptionsParser sets the variable correctly.

I have tested it manually on a simple UDF, the next step is to write some automated tests.


Zheng Shao added a comment - 11/Sep/08 07:57 PM
Can you add a test case, Tom?

For adding a test case, please add a "my.q" file in hive/ql/src/test/queries/positive, and then run "ant test -Dtestcase=TestParse -Dqfile=my.q", collect the standard result by searching for the failed "diff" command in the test log.
And then do the same for hive/ql/src/test/queries/clientpositive, and "ant test -Dtestcase=TestCliDriver -Dqfile=my.q".

Another thing is that do you think AdminWork will be responsible for other works besides creating UDF? if not then let's call it a more specific name (like CreateUDFWork).


Prasad Chakka added a comment - 11/Sep/08 08:16 PM
Hi Tom,

The reason for that code was that user could specify the list of auxiliary jars in hive-default.xml (and also by setting the hive.aux.jars.path in hive command line client as a session variable). In which case GenericOptionsParser code doesn't work. I think appending 'file://' is new to 19.

Can you persist this mapping in metastore? otherwise create needs to be specified everytime hive session is brought up. And also management functions such as drop, rename and some such needs to be provided.

Another alternative (as an initial implementation which is simple but sufficient for now) is to use reflection to load the function class. The function name and class name can be same and the package could be predefined (org.apache.hadoop.hive.udf) or configured and at runtime this class would be loaded.


Ashish Thusoo added a comment - 11/Sep/08 08:19 PM
cool...
Actually, could this be made a part of DDLWork instead of AdminWork and then it can all be launched from the DDLTask. We have different DDL descriptors in DDL Work for things like create/drop/alter tables and the functions could also belong to that. Having a DDLFunctionsWork is also fine. I do think we should be more specific than admin.

Also there is this whole thing of making this mapping persistent(both Raghu and Prasad were mentioning that) so that they can be reused across sessions. Maybe we can have that discussion here as well.


Tom White added a comment - 12/Sep/08 12:37 PM
Thanks for all your feedback. I'm working on a test case.

The reason I called it AdminWork was because I was following the terminology used in MySQL. "DDLWork" and "DDLFunctionsWork" don't sound right as function creation is not really a part of DDL. I can rename it something more specific - CreateFunctionWork?

I agree that it would be good to make the mapping persistent, and add drop and rename, but these can be done in later Jiras.

The reason for that code was that user could specify the list of auxiliary jars in hive-default.xml (and also by setting the hive.aux.jars.path in hive command line client as a session variable). In which case GenericOptionsParser code doesn't work. I think appending 'file://' is new to 19.

Not sure about that, but HADOOP-3743 certainly changed things in that area. The problem is that if you set the variable HIVE_AUX_JARS_PATH to say /foo then the session variable hive.aux.jars.path is set to /foo by the hive script, rather than file:///foo. So I've been overriding this by redefining the session variable at the moment. What's the best way of fixing this?


Prasad Chakka added a comment - 12/Sep/08 03:35 PM
You are correct that we do need to preppend file:// to files specified in session parameter hive.aux.jars.path. But what I meant is that instead of removing the code snippet in ExecDriver, we should mimic what GenericCommandOptionParser does (possibly refactoring the code out). This would be proper if user keeps changing this variable between queries. Another option would be to mandate specification of file:// in the hive.aux.jars.path param itself. But i prefer the former.

Ashish Thusoo added a comment - 12/Sep/08 05:24 PM
Not exactly true about treating create/rename/alter functions as part of the DDL. Most DBs bin them in that bucket. Case to point is the mysql doc (link below) which clearly lists create function in the DDL section.

http://dev.mysql.com/doc/refman/5.0/en/sql-syntax-data-definition.html

I am not too hung up on this though. FunctionWork(instead of CreateFunctionWork) is also fine with me - though I would like to have that bin all the different descriptors for manipulating functions rather than create a different work class for each manipulation (create/alter/rename/drop etc...)

We can start of this with being non persistent but then in the syntax it should be made explicit that this is non persistent.

One proposal could be

CREATE TEMPORARY FUNCTION blah AS blah

Also for temporary functions, reflection is also another way to structure this as Prasad pointed out but then that would not work if we allow non Java functions, so I think having syntax is useful, though we could optimize the java case to the point that all it means to create a new function is to drop in a jar which implements the udf interfaces (including something like a displayName() interface which would allow the language name to be different from the implementation name.


Tom White added a comment - 17/Sep/08 02:30 PM
Here's an updated patch that addresses some of the comments.
  • I've reverted the change to ExecDriver. (Set hive.aux.jars.path as a session variable to pick up jars.)
  • AdminWork is now FunctionWork. Other descriptors can be added to this class as needed.
  • The syntax is now
    CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf';
    
  • I've added a client unit test which uses a test UDF packaged in a jar.

Ashish Thusoo added a comment - 18/Sep/08 01:43 AM
+1

Looks good. Thanks for the contribution.

Only comment that I have is that in the patch you have duplicated the code for compile-test in ql/build.xml

Would be better if instead we introduced a new target for building the udf jar (on similar lines as gen-test) and have a dummy gen-test-jar in build-common.xml which gets overwritten in ql/build.xml...

Its upto you if you want to incorporate that... But this is good as is...

Thanks again for the contribution... keep them rolling


Zheng Shao added a comment - 18/Sep/08 02:07 AM
+1

This diff probably conflicts with Ashish's diff on explain (which merged UDFRegistry and UDAFRegistry). We should find a way to merge them.


Hadoop QA added a comment - 18/Sep/08 05:30 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390274/hadoop-4070-v2.patch
against trunk revision 696551.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 12 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3294/console

This message is automatically generated.


Tom White added a comment - 18/Sep/08 11:53 AM
New patch which includes Ashish's suggestion to reduce duplication in the build files.

Ashish Thusoo added a comment - 18/Sep/08 05:42 PM
+1

Looks good.

As Zheng points out, this will conflict with my changes, but we can work that out at commit time.


Ashish Thusoo added a comment - 18/Sep/08 05:45 PM
The conflict will be on line 45 in FunctionTask.java - instead of UDFRegistry we have FunctionRegistry now.

We can fix explain plan for this later...


Ashish Thusoo added a comment - 18/Sep/08 06:52 PM
Cancelling as this is going to not compile now since Dhrubha has already checked in my txn of explain plan.

Ashish Thusoo added a comment - 18/Sep/08 06:54 PM
Resolved conflict that Tom's patch had with explain plan and added an explain plan test.

Tom I hope you don't mind me doing this as otherwise I think this may not make it for tomorrows 0.19 code freeze


Ashish Thusoo added a comment - 18/Sep/08 06:54 PM
resolved conflicts with my explain plan transaction.

Hadoop QA added a comment - 19/Sep/08 07:38 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390402/hadoop-4070-v4.patch
against trunk revision 696846.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 12 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3315/console

This message is automatically generated.


Tom White added a comment - 19/Sep/08 10:20 AM
I've just committed this.

The test failures are unrelated to this patch. Thanks Ashish for fixing the conflict with the explain plan patch.


Hudson added a comment - 22/Sep/08 03:18 PM