Hive / HIVE-538

make hive_jdbc.jar self-containing

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0, 0.4.0, 0.6.0, 0.13.0
    • Fix Version/s: 0.14.0
    • Component/s: JDBC
    • Labels: None

    Description

      Currently, most jars in hive/build/dist/lib and the hadoop-*-core.jar are required in the classpath to run JDBC applications on Hive. We need to do at least the following to get rid of most unnecessary dependencies:
      1. get rid of dynamic serde and use a standard serialization format, maybe tab-separated, JSON or Avro
      2. don't use Hadoop configuration parameters
      3. repackage thrift and fb303 classes into hive_jdbc.jar

      Attachments

        1. HIVE-538.patch
          5 kB
          Nick White
        2. ASF.LICENSE.NOT.GRANTED--HIVE-538.D2553.2.patch
          5 kB
          Phabricator
        3. ASF.LICENSE.NOT.GRANTED--HIVE-538.D2553.1.patch
          5 kB
          Phabricator

          Activity

            billgraham William W. Graham Jr added a comment -

            From a purely empirical approach, it appears that the following jars are currently required to use the Hive JDBC driver (version 0.5.0):

            dist/lib/hive-exec-0.7.0.jar
            dist/lib/hive-jdbc-0.7.0.jar
            dist/lib/hive-metastore-0.7.0.jar
            dist/lib/hive-service-0.7.0.jar
            dist/lib/libfb303.jar
            dist/lib/libthrift.jar

            hadoop-core-{version}.jar

            I propose modifying the build process to combine the classes from the first set of jars into a single jar. That way users only need to add the hadoop-core jar and the hive-jdbc-combined jar to their classpath. As other dependencies are removed or refactored away, we could thin out what goes in the jar.

            I can take on this JIRA if others agree with the approach.
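
The combining step proposed in this comment can be sketched roughly as follows. This is a minimal illustration, not the build change attached to this issue: the class name `CombinedJar` and its `merge` method are hypothetical, it assumes Java 9 or later (for `InputStream.transferTo` and `List.of`), and it resolves duplicate entries by keeping the first one seen. Per-jar manifests and signature files are skipped because they would no longer describe the merged archive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical helper: copies the entries of several jars into one jar.
class CombinedJar {

    // Manifests and signature files describe a single input jar only.
    static boolean isPerJarMetadata(String name) {
        if (!name.startsWith("META-INF/")) {
            return false;
        }
        return name.equals("META-INF/MANIFEST.MF")
                || name.endsWith(".SF")
                || name.endsWith(".DSA")
                || name.endsWith(".RSA");
    }

    // Returns the number of entries written to the output jar.
    static int merge(List<Path> inputs, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        int written = 0;
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(output))) {
            for (Path input : inputs) {
                try (JarFile jar = new JarFile(input.toFile())) {
                    Enumeration<JarEntry> entries = jar.entries();
                    while (entries.hasMoreElements()) {
                        JarEntry entry = entries.nextElement();
                        String name = entry.getName();
                        // Skip per-jar metadata and duplicates (first jar wins).
                        if (isPerJarMetadata(name) || !seen.add(name)) {
                            continue;
                        }
                        out.putNextEntry(new JarEntry(name));
                        if (!entry.isDirectory()) {
                            try (InputStream in = jar.getInputStream(entry)) {
                                in.transferTo(out);
                            }
                        }
                        out.closeEntry();
                        written++;
                    }
                }
            }
        }
        return written;
    }
}
```

Service files such as `META-INF/services/java.sql.Driver` are deliberately kept, since JDBC driver discovery relies on them. A real build would normally delegate all of this to the build tool rather than hand-rolled code.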

            ashutoshc Ashutosh Chauhan added a comment -

            @Bill,
            Your approach sounds reasonable to me. Would you like to work on this? You can reference my patch at HIVE-2900 for how to do the repackaging easily.

            billgraham William W. Graham Jr added a comment -

            @Ashutosh I'm no longer actively using Hive these days, so it would be quite an effort for me to get set up again to contribute. Sorry, but I need to rescind my offer to take this one on.

            phabricator@reviews.facebook.net Phabricator added a comment -

            ashutoshc requested code review of "HIVE-538 [jira] make hive_jdbc.jar self-containing".
            Reviewers: JIRA

            https://issues.apache.org/jira/browse/HIVE-538

            This patch introduces two new targets:

            a) jar-jdbc-combined : This target generates a jar file containing all the Hive jars required for the JDBC driver.
            b) jar-jdbc-rt-deps : This target generates a jar file which contains all the Hive runtime dependencies in a single jar.

            Currently, most jars in hive/build/dist/lib and the hadoop-*-core.jar are required in the classpath to run JDBC applications on Hive. We need to do at least the following to get rid of most unnecessary dependencies:
            1. get rid of dynamic serde and use a standard serialization format, maybe tab-separated, JSON or Avro
            2. don't use Hadoop configuration parameters
            3. repackage thrift and fb303 classes into hive_jdbc.jar

            TEST PLAN
            EMPTY

            REVISION DETAIL
            https://reviews.facebook.net/D2553

            AFFECTED FILES
            build.xml
            ivy/libraries.properties
            ivy.xml


            ashutoshc Ashutosh Chauhan added a comment -

            Patch for generating artifacts for the JDBC driver, which makes it easier for people using the driver to include it in their projects. Note two notable omissions from hive-jdbc-rt-deps.jar: datanucleus-core.jar and datanucleus-rdbms.jar. If those are packaged in the same jar, DataNucleus has trouble loading them, so I excluded them. As a result, they still need to be put on the application's classpath separately.
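
The exclusion described in this comment amounts to partitioning the runtime jars into those that are bundled and those that stay separate. A small sketch of that selection, assuming jars are identified by file name; the class `RuntimeDeps` and its methods are hypothetical and are not taken from the patch's build.xml:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: decides which jars go into hive-jdbc-rt-deps.jar.
class RuntimeDeps {

    // The two DataNucleus jars the comment says must stay on the classpath separately.
    static final List<String> KEPT_SEPARATE =
            List.of("datanucleus-core", "datanucleus-rdbms");

    static boolean mustStaySeparate(String jarName) {
        return KEPT_SEPARATE.stream().anyMatch(jarName::startsWith);
    }

    // Jars that may be merged into the single runtime-dependencies jar.
    static List<String> bundled(List<String> jarNames) {
        return jarNames.stream()
                .filter(name -> !mustStaySeparate(name))
                .collect(Collectors.toList());
    }

    // Jars the application still has to add to its classpath itself.
    static List<String> separate(List<String> jarNames) {
        return jarNames.stream()
                .filter(RuntimeDeps::mustStaySeparate)
                .collect(Collectors.toList());
    }
}
```

Matching on a file-name prefix is only an illustrative shortcut; it also catches versioned file names, but a build would more likely list the excluded artifacts explicitly.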

            ashutoshc Ashutosh Chauhan added a comment -

            Ready for review.

            phabricator@reviews.facebook.net Phabricator added a comment -

            njain has commented on the revision "HIVE-538 [jira] make hive_jdbc.jar self-containing".

            INLINE COMMENTS
            build.xml:1262 Do you want to change it to jdo*api*

            REVISION DETAIL
            https://reviews.facebook.net/D2553

            phabricator@reviews.facebook.net Phabricator added a comment -

            ashutoshc has commented on the revision "HIVE-538 [jira] make hive_jdbc.jar self-containing".

            INLINE COMMENTS
            build.xml:1262 I don't see any advantage to it. But since it won't make a difference, I will do it in any case.

            REVISION DETAIL
            https://reviews.facebook.net/D2553

            phabricator@reviews.facebook.net Phabricator added a comment -

            ashutoshc updated the revision "HIVE-538 [jira] make hive_jdbc.jar self-containing".
            Reviewers: JIRA

            Addressing Namit's comments.
            Rebased to trunk.

            REVISION DETAIL
            https://reviews.facebook.net/D2553

            AFFECTED FILES
            build.xml
            ivy/libraries.properties
            ivy.xml

            njw45 Nick White added a comment -

            This should be easier to implement now that Hive uses Maven as its build system: the JDBC driver project could use the shade plugin (https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html).

            rmurthy@fb.com Raghu Murthy added a comment -

            Hi, Raghu Murthy is no longer at Facebook so this email address is no longer being monitored. If you need assistance, please contact another person who is currently at the company or peeps@fb.com.

            njw45 Nick White added a comment -

            I've attached a patch that builds a self-containing jar.


            ashutoshc Ashutosh Chauhan added a comment -

            njw45 Can you take a look at HIVE-6593 to see if it satisfies your needs?

            njw45 Nick White added a comment -

            ashutoshc Not really: it manually lists some dependencies (not the transitive ones) instead of using Maven to work them out, and it creates a tar.gz of many jars, not a single jar with all the dependencies in it. A tar.gz can't easily integrate with Maven, whereas this complete jar is easy to add as a dependency to a third-party Maven project because it's published with a distinct classifier.

            njw45 Nick White added a comment -

            Also, duplicating hive-jdbc's dependencies in an XML file in a different project will increase maintenance costs, as the two lists will have to be kept in sync manually.

            ashutoshc Ashutosh Chauhan added a comment - edited

            I think the name of the jar should be

             apache-hive-${project.version}-jdbc-client.jar 

            instead of

             apache-hive-${project.version}-jdbc.jar 

            Currently, Hadoop classes are excluded from this uber jar, but other transitive deps are added. It seems we have a few options for what to put in this JDBC jar:

            • The one currently implemented in the patch: all deps excluding Hadoop.
            • Only Hive classes.
            • All deps including Hadoop.

            I don't have a good sense of what the best choice is here. brocknoland / vgumashta / prasadm / thejas Do you guys have an opinion on this?

            brocknoland Brock Noland added a comment -

            I'd prefer either 1 or 3 and that we shade all non-hive deps.

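
Shading non-Hive dependencies means moving their classes under a private package prefix so they cannot clash with other copies on the application's classpath. The naming rule alone can be sketched as below. The class `Relocator` and the prefix `org.apache.hive.shaded.` are hypothetical choices for illustration, and real shading also has to rewrite the bytecode references to the moved classes, which this sketch does not attempt.

```java
// Hypothetical helper: computes the relocated name of a bundled class.
class Relocator {

    // Assumed prefix; the thread does not name one.
    static final String SHADED_PREFIX = "org.apache.hive.shaded.";

    // Hive's own classes live under these two package roots.
    static boolean isHiveClass(String className) {
        return className.startsWith("org.apache.hive.")
                || className.startsWith("org.apache.hadoop.hive.");
    }

    // JDK classes are never bundled, so they are never relocated.
    static boolean isJdkClass(String className) {
        return className.startsWith("java.");
    }

    // Hive and JDK class names are returned unchanged; everything else is moved.
    static String relocate(String className) {
        if (isHiveClass(className) || isJdkClass(className)) {
            return className;
        }
        return SHADED_PREFIX + className;
    }
}
```

Because the assumed prefix itself sits under `org.apache.hive.`, applying `relocate` twice leaves an already relocated name untouched.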
            thejas Thejas Nair added a comment -

            The Hadoop dependencies for the JDBC client are needed only when Kerberos authentication is used. However, only a small set of Hadoop jars is needed, and their transitive dependencies are not. In the case of Hadoop 1.x, just hadoop-core*jar is needed. I am not sure of the exact dependency in Hadoop 2.x.

            I am fine with option 1. But if we can do option 3 with the minimal set of Hadoop jars required, that would be awesome.


            ashutoshc Ashutosh Chauhan added a comment -

            +1

            ashutoshc Ashutosh Chauhan added a comment -

            Committed to trunk. Thanks, Nick!

            ekoifman Eugene Koifman added a comment -

            The current build system produces 2 JDBC jars:
            hive-jdbc-0.14.0-SNAPSHOT-standalone.jar - the 51MB uber jar
            hive-jdbc-0.14.0-SNAPSHOT.jar - the 135K jar

            The pom file hive-jdbc-0.14.0-SNAPSHOT.pom (which I will attach) does not mention hive-jdbc-0.14.0-SNAPSHOT-standalone.jar at all. The standalone jar is not part of the Hive tar bundle either. How is the end user supposed to access this standalone jar?

            njw45 Nick White added a comment -

            ekoifman Are you adding the driver as a dependency to a Maven project? If so, you should probably add the non-standalone version so you have more control over transitive dependency versioning. I'd use the standalone jar as an end-user download, e.g. if you want to drop it into an existing app (such as http://squirrel-sql.sourceforge.net or Sqoop). Or is your question "where is the standalone jar automatically published to?"

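
For the drop-in use case, one quick way to confirm that the standalone jar is actually visible to an application is to check whether the driver class can be loaded. A minimal sketch follows. The class `DriverCheck` is hypothetical, and the driver class name `org.apache.hive.jdbc.HiveDriver` is the HiveServer2 driver, which is an assumption here since the thread does not name it.

```java
// Hypothetical helper: reports whether a class is visible on the classpath.
class DriverCheck {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only test visibility.
            Class.forName(className, false, DriverCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed driver class name (HiveServer2 JDBC driver).
        String driver = "org.apache.hive.jdbc.HiveDriver";
        if (isOnClasspath(driver)) {
            System.out.println(driver + " found");
        } else {
            System.out.println(driver + " not found; add the standalone jar to the classpath");
        }
    }
}
```

Running it with the standalone jar on the classpath, for example `java -cp hive-jdbc-0.14.0-SNAPSHOT-standalone.jar:. DriverCheck`, only shows that the class is visible. Whether a connection succeeds may still depend on the Hadoop jars discussed earlier in the thread.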
            ekoifman Eugene Koifman added a comment -

            Yes, "where is it published to?" It seems like one would have to build Hive to get it.

            thejas Thejas Nair added a comment -

            This has been fixed in the 0.14 release. Please open a new JIRA if you see any issues.

            People

              Assignee: njw45 Nick White
              Reporter: rsm Raghotham Murthy
              Votes: 0
              Watchers: 9
