Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: UDF
    • Labels: None

      Description

      Currently Hive only supports temporary UDFs which must be re-registered when starting up a Hive session. Provide some support to register permanent UDFs with Hive.

      1. PermanentFunctionsinHive.pdf
        169 kB
        Jason Dere
      2. PermanentFunctionsinHive.pdf
        176 kB
        Jason Dere

          Activity

          ehans Eric Hanson added a comment -

          Vectorized execution works with temporary UDFs through an adaptor. If you could verify that "permanent" UDFs added by users also work in vectorized mode with that adaptor, that'd be great.

          jdere Jason Dere added a comment -

          Attaching initial proposal

          xuefuz Xuefu Zhang added a comment -

          The document looks good. I have a few questions/comments:

          1. The document seems mostly about functionality. Will there be a design document as well?
          2. Will there be any metastore api changes/additions? If so, it would be good to specify them here also.
          3. It would be good to specify in the doc what is in line with the SQL standard and what deviates from it.
          4. For using a UDF on data from a different database, it seems to me that Hive should prevent this, at least as a functional goal. Otherwise, all UDFs will be global, and the db name becomes a namespace, similar to a Java package name.
          5. There seems to be a small inconsistency between the text and the syntax, though it is a very minor point:

          Sets of JARs can be saved to the warehouse using CREATE JARSET.
          CREATE JARSET | FILESET | ARCHIVESET ...

          6. Does "SHOW JARSET | FILESET | ARCHIVESET" need to specify database name?

          jdere Jason Dere added a comment -

          Thanks for the feedback.

          1. Yeah, will provide some design details. Still need to think through some of this.
          2. Definitely will be metastore changes - will provide those as part of the design details.
          3. I wasn't aware of any CREATE FUNCTION spec in the SQL standard .. is there one?
          4. True, the db name ends up being more of a namespace, but I was under the impression that Hive databases were more like a schema name. Even if a user has specified "use database A", are they still able to select/join data from a table in another database? I thought this was the case.
          5. Do you mean it should read "Sets of JARs/files/archives can be saved .."? Sure, will fix.
          6. If we are able to create them using a database name, then yes it would make sense to be able to specify the DB name to get the jar/file/archive sets for that database. Without a db name, then I guess it would default to the current database.

          xuefuz Xuefu Zhang added a comment -

          #5: I meant "CREATE JARSET | FILESET | ARCHIVESET.." should just be "CREATE JARSET ..." because you're only talking about JARSET here, especially since you later have a dedicated section on FILESET and ARCHIVESET:

          In addition to JAR sets, file and archive sets can also be created in a similar manner:
          ...

          Of course, this is very minor. Readers should be able to understand you regardless.

          jdere Jason Dere added a comment -

          Updated doc with anticipated metastore changes. Also, there is CREATE FUNCTION syntax in the SQL standard. However, as detailed in the doc, the proposal makes additions based on the current (and non-standard) Hive syntax.

          appodictic Edward Capriolo added a comment -

          We just added the ability to write UDFs in Groovy. Can those be persisted as well? It would be easier to save the Groovy string rather than the compiled classes.

          jdere Jason Dere added a comment -

          Ok, I'll try to keep that in mind. Would there be anything besides UDFs that a user might want to use the groovy compile command for? Rather than having a separate compile statement we could put the groovy code inline in the function declaration, something like "create function myfunc as 'Pyth' language groovy `... groovy code ...`".

          appodictic Edward Capriolo added a comment -

          Theoretically you could compile anything, even input formats or serdes, but I do not imagine anyone using it that way.

          xuefuz Xuefu Zhang added a comment -

          Reading the document, I found one thing that seems to be debatable:
          1. Creating a function w/o database name means "in the current database of the session".
          2. Creating a temp function w/o database name means global in the system, like built-in functions.
          I understand the consideration of backward compatibility, but the discrepancy can confuse the user a great deal. Why can't we change #1 to behave the same way as temp functions?
          1'. Creating a function w/o database name means global in the system as built-in functions.
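
To make the difference between rules 1 and 2 concrete, here is a toy sketch of a function registry in Python. It is not Hive's implementation: the class, method names, and example class names are hypothetical, and the lookup order for unqualified names (global functions first, then the current database) is an assumption made only for illustration.

```python
# Toy model of the lookup rules discussed above; not Hive's actual code.
class FunctionRegistry:
    def __init__(self):
        # Built-in and temporary functions: name -> class name (global scope).
        self.global_functions = {}
        # Permanent functions: (database, name) -> class name.
        self.db_functions = {}

    def create_temporary(self, name, class_name):
        # Rule 2: a temp function without a db name is global.
        self.global_functions[name.lower()] = class_name

    def create_permanent(self, name, class_name, current_db):
        # Rule 1: an unqualified permanent function lands in the current database.
        db, _, fn = name.lower().rpartition(".")
        self.db_functions[(db or current_db.lower(), fn)] = class_name

    def resolve(self, name, current_db):
        db, _, fn = name.lower().rpartition(".")
        if db:
            # Qualified name: look only in the named database.
            return self.db_functions.get((db, fn))
        # Unqualified name: global functions first, then the current database
        # (this order is an assumption of the sketch).
        if fn in self.global_functions:
            return self.global_functions[fn]
        return self.db_functions.get((current_db.lower(), fn))
```

Under rule 1, a permanent function created while using one database is not visible by its bare name from another database; under the alternative rule 1', `create_permanent` would write to the global map instead.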

          jdere Jason Dere added a comment -

          I'll continue Xuefu's discussion about database name for functions over at HIVE-6167

          jdere Jason Dere added a comment -

          Updating the doc to change the jar/file management. Rather than using jar sets, each jar/file would be created as a separate resource and referenced by the UDF. This would make the metastore changes a bit simpler.
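
One way to picture the per-resource model described here is the Python sketch below. It is only an illustration under stated assumptions: the type names, field names, and URIs are hypothetical and are not taken from the attached proposal or from the metastore schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class ResourceType(Enum):
    JAR = "JAR"
    FILE = "FILE"
    ARCHIVE = "ARCHIVE"


@dataclass(frozen=True)
class ResourceUri:
    # One jar/file/archive stored as its own resource (hypothetical shape).
    resource_type: ResourceType
    uri: str


@dataclass
class FunctionDef:
    # A permanent function that references its resources directly,
    # instead of pointing at a named jar set.
    db_name: str
    function_name: str
    class_name: str
    resources: list = field(default_factory=list)


def resources_to_load(fn):
    """Return the function's resources in declaration order, without duplicates."""
    seen = set()
    ordered = []
    for res in fn.resources:
        if res not in seen:
            seen.add(res)
            ordered.append(res)
    return ordered
```

With this shape, a session that first uses the function only needs to walk `fn.resources` and add each URI before instantiating the class, with no separate set object to manage.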

          prasadm Prasad Mujumdar added a comment -

          Jason Dere, thanks for the patch. I will also take a look at the view soon.
          BTW, the subtasks don't include schema scripts (new 0.13 and upgrade from 0.12). Are you planning to address that via a separate patch?

          jdere Jason Dere added a comment -

          Thanks Prasad Mujumdar, you're right, this does not yet include schema upgrade scripts. Will add an item for that.

          leftylev Lefty Leverenz added a comment -

          Jason Dere added documentation to the wiki here: Language Manual DDL: Permanent Functions

          So I added a link to it from Hive Plugins: Creating Custom UDFs

          Note that all four subtasks are committed in Hive 0.13.0 so that's what the doc says, but this parent jira hasn't been committed yet.

          leftylev Lefty Leverenz added a comment -

          Ping: shouldn't this jira be closed now?

          jdere Jason Dere added a comment -

          Yeah this should be closed, thanks for the reminder Lefty.


            People

            • Assignee:
              jdere Jason Dere
              Reporter:
              jdere Jason Dere
            • Votes:
              1
              Watchers:
              11
