HIVE-538

make hive_jdbc.jar self-containing

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0, 0.4.0, 0.6.0, 0.13.0
    • Fix Version/s: 0.14.0
    • Component/s: JDBC
    • Labels:
      None

      Description

      Currently, most jars in hive/build/dist/lib and the hadoop-*-core.jar are required in the classpath to run JDBC applications on Hive. We need to do at least the following to get rid of most unnecessary dependencies:
      1. Get rid of dynamic serde and use a standard serialization format, maybe tab-separated, JSON, or Avro.
      2. Don't use Hadoop configuration parameters.
      3. Repackage thrift and fb303 classes into hive_jdbc.jar.

          Activity

          Bill Graham added a comment -

          From a purely empirical approach it appears that the following jars are currently required to use the hive JDBC driver (version 0.5.0):

          dist/lib/hive-exec-0.7.0.jar
          dist/lib/hive-jdbc-0.7.0.jar
          dist/lib/hive-metastore-0.7.0.jar
          dist/lib/hive-service-0.7.0.jar
          dist/lib/libfb303.jar
          dist/lib/libthrift.jar
          
          hadoop-core-{version}.jar
          

          I propose modifying the build process to combine the classes from the first set of jars into one single jar. That way users only need to add the hadoop-core jar and the hive-jdbc-combined jar to their classpath. As other dependencies are removed or refactored away, we could thin out what goes in the jar.

          I can take on this JIRA if others agree with the approach.
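
A minimal sketch of the empirical approach described above: given a class name taken from a ClassNotFoundException, report which of the candidate jars contains it. The class JarClassFinder and its methods are hypothetical helpers, not part of Hive; only java.util.jar from the JDK is used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Hive): finds which jars provide a class.
final class JarClassFinder {

    /** Converts a fully qualified class name to its entry path inside a jar. */
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns, in input order, the jars that contain the given class. */
    static List<Path> jarsContaining(String className, List<Path> jars) throws IOException {
        String entry = toEntryName(className);
        List<Path> hits = new ArrayList<>();
        for (Path jar : jars) {
            // JarFile is closed automatically at the end of each iteration.
            try (JarFile jarFile = new JarFile(jar.toFile())) {
                if (jarFile.getEntry(entry) != null) {
                    hits.add(jar);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: JarClassFinder <class-name> <jar>...");
            return;
        }
        List<Path> jars = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            jars.add(Paths.get(args[i]));
        }
        for (Path hit : jarsContaining(args[0], jars)) {
            System.out.println(hit);
        }
    }
}
```

For example, running it with a missing class name and dist/lib/*.jar as the jar arguments prints the jars that would have to stay on the classpath (or be merged into a combined jar) for that class to resolve.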

          Ashutosh Chauhan added a comment -

          @Bill,
          Your approach sounds reasonable to me. Would you like to work on this? You can reference my patch at HIVE-2900 for how to do repackaging easily.

          Bill Graham added a comment -

          @Ashutosh I'm no longer actively using Hive these days, so it would be quite an effort for me to get set up again to contribute. Sorry, but I need to rescind my offer to take this one on.

          Phabricator added a comment -

          ashutoshc requested code review of "HIVE-538 [jira] make hive_jdbc.jar self-containing".
          Reviewers: JIRA

          https://issues.apache.org/jira/browse/HIVE-538

          This patch introduces two new targets:

          a) jar-jdbc-combined : This target generates a jar file containing all the Hive jars required for the JDBC driver.
          b) jar-jdbc-rt-deps : This target generates a single jar file containing all the Hive runtime dependencies.

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D2553

          AFFECTED FILES
          build.xml
          ivy/libraries.properties
          ivy.xml

          Ashutosh Chauhan added a comment -

          This patch generates artifacts for the JDBC driver, which makes it easier for people using the driver to include it in their projects. Note two notable omissions from hive-jdbc-rt-deps.jar: datanucleus-core.jar and datanucleus-rdbms.jar. If those are packaged in the same jar, DataNucleus has trouble loading them, so I excluded them. As a result, they still need to be put on the application's classpath separately.

          Ashutosh Chauhan added a comment -

          Ready for review.

          Phabricator added a comment -

          njain has commented on the revision "HIVE-538 [jira] make hive_jdbc.jar self-containing".

          INLINE COMMENTS
          build.xml:1262 Do you want to change it to jdo*api*

          REVISION DETAIL
          https://reviews.facebook.net/D2553

          Phabricator added a comment -

          ashutoshc has commented on the revision "HIVE-538 [jira] make hive_jdbc.jar self-containing".

          INLINE COMMENTS
          build.xml:1262 I don't see any advantage in it. But since it won't make a difference, I will do it in any case.

          REVISION DETAIL
          https://reviews.facebook.net/D2553

          Phabricator added a comment -

          ashutoshc updated the revision "HIVE-538 [jira] make hive_jdbc.jar self-containing".
          Reviewers: JIRA

          Addressing Namit's comments.
          Rebased to trunk.

          REVISION DETAIL
          https://reviews.facebook.net/D2553

          AFFECTED FILES
          build.xml
          ivy/libraries.properties
          ivy.xml

          Nick White added a comment -

          This should be easier to implement now that Hive uses Maven as its build system - you could make the JDBC driver project use the shade plugin (https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html).

          Nick White added a comment -

          I've attached a patch that builds a self-containing jar -

          Ashutosh Chauhan added a comment -

          Nick White Can you take a look at HIVE-6593 to see if it satisfies your needs?

          Nick White added a comment -

          Ashutosh Chauhan not really, it manually lists some dependencies (not the transitive ones) instead of using maven to work them out, and creates a tar.gz of many jars, not a single jar with all the dependencies in. A tar.gz can't easily integrate with maven; it's easy to add this complete jar as a dependency to a third-party maven project as it's published with a distinct classifier.

          Nick White added a comment -

          Also, duplicating hive-jdbc's dependencies in an XML file in a different project will increase maintenance costs, as these two lists will have to be manually kept in sync.

          Ashutosh Chauhan added a comment - edited

          I think name of jar should be

           apache-hive-${project.version}-jdbc-client.jar 

          instead of

           apache-hive-${project.version}-jdbc.jar 

          Currently, Hadoop classes are excluded from this uber jar, but other transitive deps are added. It seems we have a few options for what to put in this JDBC jar:

          1. The one currently implemented in the patch: all deps excluding Hadoop.
          2. Only Hive classes.
          3. All deps including Hadoop.

          I don't have a good sense of what the best choice is here. Brock Noland / Vaibhav Gumashta / Prasad Mujumdar / Thejas M Nair Do you guys have an opinion on this?

          Brock Noland added a comment -

          I'd prefer either 1 or 3 and that we shade all non-hive deps.

          Thejas M Nair added a comment -

          The Hadoop dependencies for the JDBC client are needed only when Kerberos authentication is used. Even then, only a small set of Hadoop jars is needed, and their transitive dependencies are not. In the case of Hadoop 1.x, only the hadoop-core jar is needed. I am not sure of the exact dependency in Hadoop 2.x.

          I am fine with option 1. But if we can do option 3 with the minimal set of Hadoop jars required, that would be awesome.
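
A rough sketch of how a client could check, before connecting, whether the classes it needs are on the classpath. The class ClasspathProbe is hypothetical. The two probed class names are assumptions not stated in this thread: org.apache.hive.jdbc.HiveDriver as the HiveServer2 JDBC driver class, and org.apache.hadoop.security.UserGroupInformation as a representative Hadoop security class used for Kerberos.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Hive): reports which classes are loadable.
final class ClasspathProbe {

    /** True if the class can be located; its static initializers are not run. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from missing transitive classes.
            return false;
        }
    }

    /** Probes each class name and returns the results in input order. */
    static Map<String, Boolean> probe(String... classNames) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String name : classNames) {
            results.put(name, isPresent(name));
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumed class names; adjust to the Hive and Hadoop versions in use.
        Map<String, Boolean> results = probe(
                "org.apache.hive.jdbc.HiveDriver",
                "org.apache.hadoop.security.UserGroupInformation");
        for (Map.Entry<String, Boolean> e : results.entrySet()) {
            System.out.println(e.getKey() + " -> " + (e.getValue() ? "present" : "missing"));
        }
    }
}
```

With a jar built under option 1, one would expect the driver class to be present and the Hadoop class to be missing unless a Hadoop jar is added separately. That expectation is inferred from the discussion above, not verified here.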

          Ashutosh Chauhan added a comment -

          +1

          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Nick!

          Eugene Koifman added a comment -

          The current build system produces two JDBC jars:
          hive-jdbc-0.14.0-SNAPSHOT-standalone.jar - the 51MB uber jar
          hive-jdbc-0.14.0-SNAPSHOT.jar - the 135K jar

          The pom file hive-jdbc-0.14.0-SNAPSHOT.pom (which I will attach) does not mention hive-jdbc-0.14.0-SNAPSHOT-standalone.jar at all. The standalone jar is not part of the Hive tar bundle either. How is the end user supposed to access this standalone jar?

          Nick White added a comment -

          Eugene Koifman Are you adding the driver as a dependency to a Maven project? If so, you should probably add the non-standalone version so you have more control over transitive dependency versioning. I'd use the standalone jar as an end-user download, e.g. if you want to drop it into an existing app (e.g. http://squirrel-sql.sourceforge.net, Sqoop). Or is your question - "where is the standalone jar automatically published to?"?
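
For the end-user case, a minimal smoke test might look like the sketch below, run with only the standalone jar on the classpath. The class HiveJdbcSmokeTest is hypothetical, and several details are assumptions not taken from this thread: the driver class org.apache.hive.jdbc.HiveDriver, the jdbc:hive2:// URL scheme, port 10000, the "default" database, and empty credentials.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical smoke test (not part of Hive) for a dropped-in driver jar.
final class HiveJdbcSmokeTest {

    /** Builds a HiveServer2-style JDBC URL (assumed scheme: jdbc:hive2). */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        String url = jdbcUrl(host, 10000, "default"); // assumed port and database

        // Fails with ClassNotFoundException if the driver jar is not on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

If this runs against a live server with nothing but the standalone jar on the classpath, the jar is self-contained for that connection mode. A NoClassDefFoundError would point at a dependency that was left out.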

          Eugene Koifman added a comment -

          Yes, "where is it published to?" It seems like one would have to build Hive to get it.


            People

            • Assignee: Nick White
            • Reporter: Raghotham Murthy
            • Votes: 0
            • Watchers: 8
