Hive
  1. Hive
  2. HIVE-6047 Permanent UDFs in Hive
  3. HIVE-6167

Allow user-defined functions to be qualified with database name

    Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: UDF
    • Labels:
      None

      Description

      Function names in Hive are currently unqualified and there is a single namespace for all function names. This task would allow users to define temporary UDFs (and eventually permanent UDFs) with a database name, such as:

      CREATE TEMPORARY FUNCTION userdb.myfunc 'myudfclass';

      1. HIVE-6167.4.patch
        18 kB
        Jason Dere
      2. HIVE-6167.3.patch
        18 kB
        Jason Dere
      3. HIVE-6167.2.patch
        83 kB
        Jason Dere
      4. HIVE-6167.1.patch
        18 kB
        Jason Dere

        Activity

        Hide
        Jason Dere added a comment -

        Xuefu Zhang brought up this point:

        Reading the document, I found one thing that seems to be debatable:
        1. Creating a function w/o database name means "in the current database of the session".
        2. Creating a temp function 2/o database name means global in the system as built-in functions.
        I understand the consideration of backward compatibility, but the discrepancy can confuse the user a great deal. Why cannot we change #1 in the same way for temp functions?
        1'. Creating a function w/o database name means global in the system as built-in functions.

        It would make sense for there to be some consistency with the default database name for temporary/permanent functions. I've also been playing with qualifying built-in functions with schema names and I don't really like the effect on "describe functions":

        hive> show functions;
        OK
        default.myfunc
        mydb.myfunc
        sys.!
        sys.!=
        sys.%
        sys.&
        sys.*
        sys.+
        sys.-
        sys./
        sys.<
        sys.<=
        sys.<=>
        sys.<>
        sys.=
        sys.==
        sys.>
        sys.>=
        sys.^
        sys.abs
        sys.acos
        sys.and
        sys.array
        sys.array_contains
        sys.ascii
         ...
        

        Many of these functions will not work if they are called as they are labelled, such as any of the operators (sys.+ wouldn't work), or any keywords that are implemented as functions. So I'm wondering if the built-in functions can be in the registry as non-qualified function names. However, I think that if we do have permanent functions, that they should be qualified. But we also want consistency in default db name between temp and permanent functions.

        So how about this behavior:

        • Built-in functions are not qualified
        • User-defined functions (temp/permanent) that are created without a database name will be created using the database name "default".
        • Function resolution of non-qualified functions will be in the following order:
          1. Lookup using non-qualified name, which will catch built-ins
          2. Lookup by qualifying function name with "default". This will catch non-qualified user-defined functions
          3. Lookup by qualifying function name with user's current database.

        I suppose a future enhancement could be to allow users to specify a custom set of db names when resolving function names, but I think the above would be a suitable default. Does this approach make sense?

        Show
        Jason Dere added a comment - Xuefu Zhang brought up this point: Reading the document, I found one thing that seems to be debatable: 1. Creating a function w/o database name means "in the current database of the session". 2. Creating a temp function 2/o database name means global in the system as built-in functions. I understand the consideration of backward compatibility, but the discrepancy can confuse the user a great deal. Why cannot we change #1 in the same way for temp functions? 1'. Creating a function w/o database name means global in the system as built-in functions. It would make sense for there to be some consistency with the default database name for temporary/permanent functions. I've also been playing with qualifying built-in functions with schema names and I don't really like the effect on "describe functions": hive> show functions; OK default.myfunc mydb.myfunc sys.! sys.!= sys.% sys.& sys.* sys.+ sys.- sys./ sys.< sys.<= sys.<=> sys.<> sys.= sys.== sys.> sys.>= sys.^ sys.abs sys.acos sys.and sys.array sys.array_contains sys.ascii ... Many of these functions will not work if they are called as they are labelled, such as any of the operators (sys.+ wouldn't work), or any keywords that are implemented as functions. So I'm wondering if the built-in functions can be in the registry as non-qualified function names. However, I think that if we do have permanent functions, that they should be qualified. But we also want consistency in default db name between temp and permanent functions. So how about this behavior: Built-in functions are not qualified User-defined functions (temp/permanent) that are created without a database name will be created using the database name "default". Function resolution of non-qualified functions will be in the following order: 1. Lookup using non-qualified name, which will catch built-ins 2. Lookup by qualifying function name with "default". This will catch non-qualified user-defined functions 3. Lookup by qualifying function name with user's current database. I suppose a future enhancement could be to allow users to specify a custom set of db names when resolving function names, but I think the above would be a suitable default. Does this approach make sense?
        Hide
        Xuefu Zhang added a comment -

        I think a direct consequence of the proposed behavior is that user who defines a function (temp or perm) without qualified name will not be able to use it as the way the function is defined unless the current database is 'default'. Plus, it seems unusual that in such case that a function defines in the context of a database other than 'default' actually registered with another database ('default').

        I can think of two alternatives:
        1. break backward compatibility so that non-qualified temp function goes to the current database.
        2. don't qualify user defined functions (temp or perm) with database names so that every user-defined function as well as built-in function is global.

        I'd like to read other's thoughts as well.

        Show
        Xuefu Zhang added a comment - I think a direct consequence of the proposed behavior is that user who defines a function (temp or perm) without qualified name will not be able to use it as the way the function is defined unless the current database is 'default'. Plus, it seems unusual that in such case that a function defines in the context of a database other than 'default' actually registered with another database ('default'). I can think of two alternatives: 1. break backward compatibility so that non-qualified temp function goes to the current database. 2. don't qualify user defined functions (temp or perm) with database names so that every user-defined function as well as built-in function is global. I'd like to read other's thoughts as well.
        Hide
        Alan Gates added a comment -

        My thoughts:

        1. I agree that built-in UDFs should be in a global space that need not (probably cannot) be qualified. Everyone needs access to them, and I can't see any value in adding a namespace here.
        2. Permanent functions added by users or admins should be in a database. This gives us namespace separation, it fits with the SQL standard, and I think it will fit in better with where we're trying to take the security model. I'm fine with create function defaulting to the current (not necessarily the default) data base if users don't specify which database.
        3. Temporary functions are really session specific (they go away once the user disconnects) and can't be accessed by any other session, correct? So I think it's fine to view them as "global" for the purposes of that session. The following sequence should work:
          1. use db foo
          2. create temporary function a();
          3. use db bar
          4. select a( x ) from T
        Show
        Alan Gates added a comment - My thoughts: I agree that built-in UDFs should be in a global space that need not (probably cannot) be qualified. Everyone needs access to them, and I can't see any value in adding a namespace here. Permanent functions added by users or admins should be in a database. This gives us namespace separation, it fits with the SQL standard, and I think it will fit in better with where we're trying to take the security model. I'm fine with create function defaulting to the current (not necessarily the default) data base if users don't specify which database. Temporary functions are really session specific (they go away once the user disconnects) and can't be accessed by any other session, correct? So I think it's fine to view them as "global" for the purposes of that session. The following sequence should work: use db foo create temporary function a(); use db bar select a( x ) from T
        Hide
        Jason Dere added a comment -

        I could get on board with that approach. Xuefu, does that work for you?

        Show
        Jason Dere added a comment - I could get on board with that approach. Xuefu, does that work for you?
        Hide
        Xuefu Zhang added a comment -

        I largely agree except for one thought on #2. If a user adds a perm function without db name, could we assume it's global. Image an admin register a particular function across all DBs. If Hive insists the function go to a DB, then he/she will have to do the same for each DB:

        use DB1;
        create function a();
        use Db2;
        create function a();
        ...
        

        or

        create function DB1.a();
        create function DB1.a();
        ...
        

        It becomes much simpler if a function without DB name goes global (in line with temp functions). Is this completely anti SQL standard? Even so, It seems that usability outweighs, not to mention that Hive function is non-standard anyway.

        Show
        Xuefu Zhang added a comment - I largely agree except for one thought on #2. If a user adds a perm function without db name, could we assume it's global. Image an admin register a particular function across all DBs. If Hive insists the function go to a DB, then he/she will have to do the same for each DB: use DB1; create function a(); use Db2; create function a(); ... or create function DB1.a(); create function DB1.a(); ... It becomes much simpler if a function without DB name goes global (in line with temp functions). Is this completely anti SQL standard? Even so, It seems that usability outweighs, not to mention that Hive function is non-standard anyway.
        Hide
        Alan Gates added a comment -

        There's no need to add it to every database. Users can invoked the function across database. So it would be

        use DB1;
        create function a();
        use DB2;
        select DB1.a() from T;
        

        I disagree that this is a usability issue. I think it's much more usable to give users namespaces then to push them into one flat namespace.

        Show
        Alan Gates added a comment - There's no need to add it to every database. Users can invoked the function across database. So it would be use DB1; create function a(); use DB2; select DB1.a() from T; I disagree that this is a usability issue. I think it's much more usable to give users namespaces then to push them into one flat namespace.
        Hide
        Edward Capriolo added a comment -

        In my opinion we must keep the current syntax working as is. Current users of hive do not want there scripts to break just to match a standard. If we wish to add new syntax that matches a given standard that makes sense. I do not think the current standard forbids keeping our current syntax and functionality. Also realistically we have to be practical. Users have sessions, most users are not going to care what database/schema a function is associated with. Most are going to want global functions. Most people are not going to have so many functions that a conflict would ever arise. Lets not make and solve problems we really don't have.

        Show
        Edward Capriolo added a comment - In my opinion we must keep the current syntax working as is. Current users of hive do not want there scripts to break just to match a standard. If we wish to add new syntax that matches a given standard that makes sense. I do not think the current standard forbids keeping our current syntax and functionality. Also realistically we have to be practical. Users have sessions, most users are not going to care what database/schema a function is associated with. Most are going to want global functions. Most people are not going to have so many functions that a conflict would ever arise. Lets not make and solve problems we really don't have.
        Hide
        Jason Dere added a comment -

        The current syntax and usage (for temporary functions) will continue to work as they do now. The new syntax additions and behavior (for "permanent" functions registered in the metastore) is what would have extra name qualifiers.

        Show
        Jason Dere added a comment - The current syntax and usage (for temporary functions) will continue to work as they do now. The new syntax additions and behavior (for "permanent" functions registered in the metastore) is what would have extra name qualifiers.
        Hide
        Jason Dere added a comment -

        Initial patch. Since built-in/temp functions do not have qualifiers there isn't won't actually be any functions that are resolvable when qualified with a db name, but this does allow Hive to parse qualified function names.

        Show
        Jason Dere added a comment - Initial patch. Since built-in/temp functions do not have qualifiers there isn't won't actually be any functions that are resolvable when qualified with a db name, but this does allow Hive to parse qualified function names.
        Hide
        Jason Dere added a comment -

        submitting patch to see how tests look

        Show
        Jason Dere added a comment - submitting patch to see how tests look
        Hide
        Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12623290/HIVE-6167.1.patch

        ERROR: -1 due to 8 failed/errored test(s), 4929 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_describe_func_quotes
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_functions
        org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_clusterbyorderby
        org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_join_alt_syntax_comma_on
        org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_lateral_view_join
        org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_uniquejoin3
        org.apache.hadoop.hive.ql.parse.TestMacroSemanticAnalyzer.testCannotUseReservedWordAsName
        org.apache.hadoop.hive.ql.parse.TestParseNegative.testParseNegative_macro_reserved_word
        

        Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/932/testReport
        Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/932/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 8 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12623290

        Show
        Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12623290/HIVE-6167.1.patch ERROR: -1 due to 8 failed/errored test(s), 4929 tests executed Failed tests: org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_describe_func_quotes org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_functions org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_clusterbyorderby org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_join_alt_syntax_comma_on org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_lateral_view_join org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_uniquejoin3 org.apache.hadoop.hive.ql.parse.TestMacroSemanticAnalyzer.testCannotUseReservedWordAsName org.apache.hadoop.hive.ql.parse.TestParseNegative.testParseNegative_macro_reserved_word Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/932/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/932/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 8 tests failed This message is automatically generated. ATTACHMENT ID: 12623290
        Show
        Jason Dere added a comment - https://reviews.apache.org/r/17493/
        Hide
        Jason Dere added a comment -

        patch v4 - add apache header to FunctionUtils

        Show
        Jason Dere added a comment - patch v4 - add apache header to FunctionUtils
        Hide
        Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12626955/HIVE-6167.4.patch

        ERROR: -1 due to 1 failed/errored test(s), 5012 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket_num_reducers
        

        Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1189/testReport
        Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1189/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 1 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12626955

        Show
        Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12626955/HIVE-6167.4.patch ERROR: -1 due to 1 failed/errored test(s), 5012 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket_num_reducers Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1189/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1189/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 1 tests failed This message is automatically generated. ATTACHMENT ID: 12626955
        Hide
        Jason Dere added a comment -

        I don't think the 1 failure is related to these changes, bucket_num_reducers has failed on a number of precommit tests.

        Show
        Jason Dere added a comment - I don't think the 1 failure is related to these changes, bucket_num_reducers has failed on a number of precommit tests.
        Hide
        Ashutosh Chauhan added a comment -

        Patch looks good. I have one question tho. Earlier, function names were identifier. I assume DOT can be part of an identifier. If so, than can it be a problem now if some one who has earlier created a function name with dot in it, may now confuse Hive.

        Show
        Ashutosh Chauhan added a comment - Patch looks good. I have one question tho. Earlier, function names were identifier. I assume DOT can be part of an identifier. If so, than can it be a problem now if some one who has earlier created a function name with dot in it, may now confuse Hive.
        Hide
        Jason Dere added a comment -

        So DOT isn't normally allowed as part of an identifier. I think this would only be possible if the dot was included as part of a quoted name, which seems like a bit of an unusual case. If this case would need to be supported this may be better off being done as a separate issue.

        Show
        Jason Dere added a comment - So DOT isn't normally allowed as part of an identifier. I think this would only be possible if the dot was included as part of a quoted name, which seems like a bit of an unusual case. If this case would need to be supported this may be better off being done as a separate issue.
        Hide
        Ashutosh Chauhan added a comment -

        OK, if need be that could be done in a follow-up +1

        Show
        Ashutosh Chauhan added a comment - OK, if need be that could be done in a follow-up +1
        Hide
        Ashutosh Chauhan added a comment -

        Committed to trunk. Thanks, Jason!

        Show
        Ashutosh Chauhan added a comment - Committed to trunk. Thanks, Jason!
        Hide
        Lefty Leverenz added a comment -

        Jason Dere documented this in the wiki's DDL doc, so I added a brief description in the Hive Plugins doc and linked it to the DDL.

        But the jira description currently has things backwards: "This task would allow users to define temporary UDFs (and eventually permanent UDFs) with a database name" – it's permanent UDFs that can be defined with a database name, not temporary UDFs.

        Show
        Lefty Leverenz added a comment - Jason Dere documented this in the wiki's DDL doc, so I added a brief description in the Hive Plugins doc and linked it to the DDL. DDL: Permanent Functions Hive Plugins But the jira description currently has things backwards: "This task would allow users to define temporary UDFs (and eventually permanent UDFs) with a database name" – it's permanent UDFs that can be defined with a database name, not temporary UDFs.

          People

          • Assignee:
            Jason Dere
            Reporter:
            Jason Dere
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development