Hive / HIVE-2055

Hive should add HBase classpath dependencies when available

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.13.0
    • Component/s: HBase Handler
    • Labels:
      None
    • Release Note:
      HBase will be detected via HBASE_HOME and HBASE_CONF_DIR. HBASE_HOME defaults to BigTop path /usr/lib/hbase.
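
      For illustration, a minimal sketch of how a launch script could pick HBase up under this scheme. This is not the committed patch; the HBASE_CONF_DIR default to $HBASE_HOME/conf is an assumption, and the mapredcp call and append ordering follow the discussion below.

      # Sketch only: detect an HBase installation and append its MapReduce
      # dependencies to HADOOP_CLASSPATH. Defaults follow the release note above.
      HBASE_HOME="${HBASE_HOME:-/usr/lib/hbase}"            # BigTop default
      HBASE_CONF_DIR="${HBASE_CONF_DIR:-$HBASE_HOME/conf}"  # assumed default

      if [ -x "$HBASE_HOME/bin/hbase" ]; then
        # Append rather than prepend so Hive's own jars keep precedence.
        export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${HBASE_CONF_DIR}"
        export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:$("$HBASE_HOME/bin/hbase" mapredcp 2>/dev/null)"
      fi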

      Description

      Created an external table in Hive that points to an HBase table. When trying to query a column by name in the select clause, got the following exception: java.lang.ClassNotFoundException: org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat (errorCode:12, SQLState:42000)
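
      For illustration, a minimal reproduction of the symptom; the table and column names are hypothetical.

      # Query an HBase-backed external table while the HBase handler jars are
      # missing from the client classpath:
      hive -e 'SELECT some_column FROM hbase_backed_table LIMIT 1'
      # FAILED: ... java.lang.ClassNotFoundException:
      #   org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat (errorCode:12, SQLState:42000)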

        Issue Links

          Activity

          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Nick!

          Roman Shaposhnik added a comment -

          Nick Dimiduk LGTM +1

          Ashutosh Chauhan added a comment -

          +1

          Nick Dimiduk added a comment -

          Updating patch according to Ashutosh's comments.

          Ashutosh Chauhan added a comment -

          We don't want the hbase conf and jars to take precedence over the rest of the classpath. So, instead of

          + export HADOOP_CLASSPATH="${HBASE_CONF_DIR}:${HADOOP_CLASSPATH}"
          + export HADOOP_CLASSPATH="${x}:${HADOOP_CLASSPATH}"

          do

          + export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${HBASE_CONF_DIR}"
          + export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${x}"

          Rest of the patch looks good.
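
          For illustration, why the ordering matters: the JVM resolves a class from the first jar on the classpath that contains it, so whichever side is listed first wins any version conflict. A minimal sketch, with a purely hypothetical jar path:

          HBASE_JARS="/usr/lib/hbase/lib/some-dep-1.2.jar"   # hypothetical path
          # Prepending would let the HBase copy shadow Hive's own jars:
          #   export HADOOP_CLASSPATH="${HBASE_JARS}:${HADOOP_CLASSPATH}"
          # Appending, as requested above, keeps Hive's jars in front:
          export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${HBASE_JARS}"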

          Nick Dimiduk added a comment -

          Here's an updated patch to the launch script based on the new hbase command. Please excuse my bash scripting; I'm not a native speaker.

          Roman Shaposhnik you're just in time

          Roman Shaposhnik added a comment -

          Sorry for dropping by somewhat late but it looks like you've got a pretty reasonable solution with mapredcp.

          Nick Dimiduk added a comment -

          Backport to 0.94 is also complete. For reference, here's its output:

          $ ./bin/hbase mapredcp | tr ':' '\n'
          /private/tmp/hbase-0.94.14-SNAPSHOT/hbase-0.94.14-SNAPSHOT.jar
          /private/tmp/hbase-0.94.14-SNAPSHOT/lib/protobuf-java-2.4.0a.jar
          /private/tmp/hbase-0.94.14-SNAPSHOT/lib/zookeeper-3.4.5.jar
          /private/tmp/hbase-0.94.14-SNAPSHOT/lib/guava-11.0.2.jar
          /private/tmp/hbase-0.94.14-SNAPSHOT/lib/hadoop-core-1.0.4.jar
          
          Hudson added a comment -

          SUCCESS: Integrated in HBase-0.94-security #337 (See https://builds.apache.org/job/HBase-0.94-security/337/)
          HBASE-9165 [mapreduce] Modularize building dependency jars

          Separate adding HBase and dependencies from adding other job dependencies, and
          expose it as a separate method that other projects can use (for PIG-3285,
          HIVE-2055). (ndimiduk: rev 1542414)

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapred/TableMapReduceUtil.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduce.java
          Hudson added a comment -

          FAILURE: Integrated in HBase-0.94 #1203 (See https://builds.apache.org/job/HBase-0.94/1203/)
          HBASE-9165 [mapreduce] Modularize building dependency jars

          Separate adding HBase and dependencies from adding other job dependencies, and
          expose it as a separate method that other projects can use (for PIG-3285,
          HIVE-2055). (ndimiduk: rev 1542414)

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapred/TableMapReduceUtil.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduce.java
          Ashutosh Chauhan added a comment -

          List looks good. As I suggested in a previous comment, within the hive script we can exclude zk, protobuf, and guava (obtained via the hbase cmd) to avoid adding them twice.
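
          A rough sketch of one way such filtering could look in the launch script; the exclusion patterns and variable names are illustrative, not from the committed patch.

          # Sketch: drop hbase-provided copies of jars Hive already ships.
          hbase_cp="$(hbase mapredcp 2>/dev/null)"
          filtered=""
          for entry in $(echo "$hbase_cp" | tr ':' ' '); do
            case "$(basename "$entry")" in
              zookeeper-*.jar|protobuf-java-*.jar|guava-*.jar) ;;   # already Hive deps, skip
              *) filtered="${filtered:+$filtered:}$entry" ;;
            esac
          done
          export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${filtered}"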

          Nick Dimiduk added a comment -

          Here's the output of the latest version of the patch on HBASE-8438; this is what you will get when running with HBase > 0.96.1. I intend to backport the feature in time for the 0.94.14 release as well.

          $ ./bin/hbase mapredcp | tr ':' '\n'
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/netty-3.6.6.Final.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/hbase-hadoop-compat-0.97.0-SNAPSHOT.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/protobuf-java-2.5.0.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/guava-12.0.1.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/htrace-core-2.01.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/hbase-protocol-0.97.0-SNAPSHOT.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/hbase-client-0.97.0-SNAPSHOT.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/zookeeper-3.4.5.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/hbase-server-0.97.0-SNAPSHOT.jar
          /private/tmp/hbase-0.97.0-SNAPSHOT/lib/hbase-common-0.97.0-SNAPSHOT.jar
          
          Ashutosh Chauhan added a comment -

          Let's use the new method. Also, since zk, protobuf, and guava are already Hive dependencies and will be on the classpath, the hive script can exclude them from being included via this command.

          Nick Dimiduk added a comment -

          Here's my comment on HBASE-8438 with a listing of the jars picked up.

          $ ./bin/hbase mapredcp 2>/dev/null | tr ':' '\n'
          /Users/ndimiduk/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.jar
          /Users/ndimiduk/repos/hbase/hbase-common/target/hbase-common-0.96.0.jar
          /Users/ndimiduk/.m2/repository/org/cloudera/htrace/htrace-core/2.01/htrace-core-2.01.jar
          /Users/ndimiduk/repos/hbase/hbase-client/target/hbase-client-0.96.0.jar
          /Users/ndimiduk/.m2/repository/io/netty/netty/3.6.6.Final/netty-3.6.6.Final.jar
          /Users/ndimiduk/repos/hbase/hbase-protocol/target/hbase-protocol-0.96.0.jar
          /Users/ndimiduk/repos/hbase/hbase-hadoop-compat/target/hbase-hadoop-compat-0.96.0.jar
          /Users/ndimiduk/repos/hbase/hbase-server/target/hbase-server-0.96.0.jar
          /Users/ndimiduk/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar
          /Users/ndimiduk/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.1.0-beta/hadoop-mapreduce-client-core-2.1.0-beta.jar
          /Users/ndimiduk/.m2/repository/org/apache/hadoop/hadoop-common/2.1.0-beta/hadoop-common-2.1.0-beta.jar
          /Users/ndimiduk/.m2/repository/com/google/guava/guava/12.0.1/guava-12.0.1.jar
          

          The hadoop jar chosen is based on what hadoop itself puts into the classpath, so I think it won't be an issue in practice. However, Pig has a similar need (PIG-3285) so my patch on HBASE-9165 implements a new method which will exclude the Hadoop pieces. We can modify this patch to call that new method if you prefer.

          Ashutosh Chauhan added a comment -

          Thanks Nick Dimiduk for picking this up. The Hadoop jar version could be a source of problems later; can you get rid of that in your hbase command? Also, for documentation purposes it would be good if you could paste the output of that command here, to list what deps are there at the moment.

          Nick Dimiduk added a comment -

          Hi Sushanth Sowmyan, Ashutosh Chauhan.

          See my latest patch on HBASE-8438. It provides a new command to the hbase bin script that prints out the jars our existing addDependencyJars detects. It's not perfect as it includes 2 unnecessary hadoop jars, but it's a lot better than blindly calling `hbase classpath`. Does this meet your expectations regarding environment variable length limitations?

          $ hbase mapredcp 2>/dev/null | tr ':' '\n' | wc
               12      12     660
          $ hbase classpath 2>/dev/null | tr ':' '\n' | wc
              116     115    5459
          
          Nick Dimiduk added a comment -

          My above link has turned stale. I'm referring to FileUtil#createJarWithClassPath.

          Calling this method to construct a classpath jar every time a script calls $(hbase classpath-min) sounds time-consuming and will leave temp jars orphaned on the FS.

          Another idea is to use maven to generate a dependency list that omits Hadoop and other runtime jars. Any friends at BigTop who could give advice on this approach? (cc Roman Shaposhnik, Sean Mackrory)

          Yet another option is to maintain the list manually on the HBase side. We already sort of do within TableMapReduceUtils#addDependencyJars. This is probably the simplest but most fragile and least future-proof option.

          Nick Dimiduk added a comment -

          Fair enough. I lack privileges to change this attribute.

          Brock Noland added a comment -

          Hey guys, I think the Patch Available status of this JIRA should be removed until these open questions are resolved.

          Nick Dimiduk added a comment -

          The topic of this issue came up in conversation with Chris Nauroth. He pointed me toward a Hadoop feature used by YARN to get around this very issue. Perhaps we can make use of the same.

          https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java#L1140

          Nick Dimiduk added a comment -

          Please have a look at the patch on HBASE-8438. It should be sufficient for our needs. Consumption of it in the hive launch script would look something like

          HBASE_CLASSPATH="$HADOOP_CLASSPATH:$(hbase classpath)"
          HADOOP_CLASSPATH="$(hbase classpath-min $HBASE_CLASSPATH)"
          
          Nick Dimiduk added a comment -

          I took a stab at implementing a bash function that will remove duplicate entries from a PATH-like string. Trouble is, bash3 doesn't support associative maps, which makes implementation difficult. How do you feel about shelling out to perl for the string manipulation?
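
          For illustration, one bash 3-compatible way to de-duplicate a colon-separated path without associative arrays; a sketch, not the patch.

          # Remove duplicate entries from a PATH-like string (bash 3 compatible).
          dedupe_path() {
            local input="$1" out="" entry
            local IFS=':'
            for entry in $input; do
              case ":$out:" in
                *":$entry:"*) ;;                     # already present, skip
                *) out="${out:+$out:}$entry" ;;
              esac
            done
            printf '%s\n' "$out"
          }
          # Example: dedupe_path "/a:/b:/a:/c"  ->  /a:/b:/c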

          Sushanth Sowmyan added a comment -

          Nick Dimiduk Yup, it could bite us on both of those in Windows. I'm a little less worried about HCATALOG-621 itself since the hcat commandline is for ddl only, and is another thing we want to deprecate eventually in favour of the hive commandline.

          Nick Dimiduk added a comment -

          Are you suggesting that the CLASSPATH length issue is going to bite us on PIG-2786 and HCATALOG-621 as well?

          I guess what we really want is the set of Hive's dependencies union HBase's dependencies minus Hadoop's runtime jars.

          Andrew Purtell added a comment -

          One possibility is that it could be maintained on the hbase side, so that running "hbase barebones-classpath" in a manner similar to the current "hbase classpath" can return only the deps that would have otherwise been added by addDependencyJars ?

          That sounds plausible. What do you think Nick?

          Sushanth Sowmyan added a comment -

          (Don't have edit privileges to edit cmdline above, but assume we clip the spaces around the assignment)
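
          With the spaces clipped, the invocation from the comment below would read as follows; note that "barebones-classpath" is a subcommand proposed in this discussion, not an existing hbase command.

          HIVE_AUX_JARS_PATH=$(hbase barebones-classpath) hive -f hbase_query.q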

          Sushanth Sowmyan added a comment -

          Hmm.. I agree with Ashutosh that the length of the classpath is a valid consideration, especially for Windows. For the HCat JsonSerDe as well, our approach was to document that users who wanted it should add the HCat jars to HIVE_AUX_JARS_PATH, rather than setting that explicitly for all the tasks which don't need it.

          Adding it to HIVE_AUX_JARS_PATH will result in it getting appended to the CLASSPATH for the job anyway, and people will get a long CLASSPATH, but that won't affect people who aren't using hbase and need a shorter classpath because they have other deps they want to add in.

          I like the shorter deps idea, but I also agree that maintaining that from hive's side is asking for desynch troubles later on. One possibility is that it could be maintained on the hbase side, so that running "hbase barebones-classpath" in a manner similar to the current "hbase classpath" can return only the deps that would have otherwise been added by addDependencyJars ? Then, users could do something like:

          HIVE_AUX_JARS_PATH = $(hbase barebones-classpath) hive -f hbase_query.q
          
          Nick Dimiduk added a comment -

          The classpath length is a restriction I hadn't considered. I don't understand the relationship between CLASSPATH and HIVE_AUX_JARS_PATH. Is the latter shipped to the cluster for MR jobs? Are they used in hive-cli runtime as well?

          HBase jobs against an online table or involved in offline HFile/Snapshot work will depend on the hbase-client, hbase-mapreduce, zookeeper, protobuf, and perhaps guava. I believe everything else in that list is provided by Hadoop runtime.

          Ashutosh Chauhan added a comment -

          Looking around, it looks like this list is short and is already available at http://hbase.apache.org/book.html#client_dependencies. We should only need to add the hbase jar (and possibly the zk jar) from this list, since all the other jars listed there will already be on the classpath, as they are hive's dependencies as well.

          Ashutosh Chauhan added a comment -

          I am not very comfortable with adding all hbase dependency jars to hive's classpath. A couple of issues:

          • The ordering argument may be fine for hive, but what about the hbase client itself? If it needs a different version of a jar than the one hive ships, it will have trouble.
          • The classpath will become too long. We already have this trouble on Windows, where, because of the 8K-character classpath limit, the classpath is truncated without any indication to the user. Linux is more relaxed and allows 256KB (I believe), but this could be a problem there too.
            Further, in either of these cases, when a problem does arise it is very hard to debug, because it usually manifests in a totally different area.

          If you guys think adding only specific jars is an unmaintainable solution, then until we find a better solution we can recommend that users use the HIVE_AUX_JARS_PATH env variable to supply these jars for hive's classpath.
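
          For illustration, a quick sketch of how one could check the length such a classpath would reach; the 8191-character threshold mirrors the 8K Windows limit mentioned above.

          # Sketch: measure the classpath that would result from adding all HBase jars.
          candidate="${HADOOP_CLASSPATH}:$(hbase classpath 2>/dev/null)"
          printf 'candidate classpath length: %d characters\n' "${#candidate}"
          if [ "${#candidate}" -gt 8191 ]; then
            echo "warning: would exceed the Windows classpath length limit" >&2
          fi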

          Giridharan Kesavan added a comment -

          Ashutosh Chauhan I don't think having a list of minimal jars is a scalable solution. What if the next release of hbase depends on a bunch of new jars that are not part of the list provided in the bin/hive script? That would require fixing the bin/hive script every time hbase comes up with a new runtime dependency.

          As Nick pointed out, the hbase libs are appended to hive's classpath, so the hive libs still take precedence.

          Nick Dimiduk added a comment -

          Tracking that list of dependencies across versions sounds like a terrible game of cat-and-mouse. Also, the list of jars is larger when using HBase's bulkload functionality. This patch appends the HBase jars to Hive's own list, so the versions from Hive will be found first. The jars are only involved in launching the command; the actual jars submitted with the job are based on use of addDependencyJars (via HIVE-2379).

          Ashutosh Chauhan added a comment -

          Nick Dimiduk hbase/lib contains 58 jars. I don't think the hbase client needs all of them. Further, most of them overlap with hive jars, and it's very easy to get out of sync w.r.t. the versions of these jars. Can you figure out the minimal set of jars that we need to add? We should exclude the jars which are already there in hive and also those which only the hbase servers use.

          Nick Dimiduk added a comment -

          Updated patch, removing HCatalog support.

          Ashutosh Chauhan added a comment -

          We don't need to load hcatalog jars. Now that hcatalog and hive jars are merged, all hcatalog jars will automatically get loaded. No need to explicitly add them to classpath.

          Nick Dimiduk added a comment -

          Ashutosh Chauhan Just as with PIG-2786 vs PIG-3285, there are two separate issues. The former is having HBase jars on the classpath for bin/hive invocations. The latter is for shipping the dependencies to the cluster for MR jobs. The former is also effectively identical to HCATALOG-621 in that this is necessary for DDL operations. This patch resolves the former. The latter must be resolved with a code change, as is underway in HIVE-2379.

          Ashutosh Chauhan added a comment -

          To be more specific, this patch just puts hbase jars in classpath of client, it doesn't ship those jars to cluster. HIVE-2379 achieves that as well.

          Ashutosh Chauhan added a comment -

          I think HIVE-2379 is a better fix for this. Giridharan Kesavan / Nick Dimiduk / Sushanth Sowmyan Can you try with the patch attached there and see if that solves your problem.

          Sushanth Sowmyan added a comment -

          As Nick notes in HCATALOG-621, there might be more to this - I only tested for ddl operations. That said, setting HIVE_AUX_JARS_PATH should work for this, right?

          Sushanth Sowmyan added a comment -

          This works for me. Non-binding +1.

          Giridharan Kesavan added a comment -

          HIVE-2055.patch fixes the bin/hive script to include the hbase and hcat libs in the classpath.

          Brock Noland added a comment -

          Michael Naumov CM should be discussed here: https://groups.google.com/a/cloudera.org/forum/?fromgroups#!forum/scm-users

          Michael Naumov added a comment -

          Somehow CM 4.5 is ignoring the Hive Client Configuration Safety Valve for hive-site.xml and throws the same error in Hive:
          java.io.IOException: Cannot create an instance of InputSplit class = org.apache.hadoop.hive.hbase.HBaseSplit:Class org.apache.hadoop.hive.hbase.HBaseSplit not found

          Guido Serra aka Zeph added a comment -

          agree... but I can't alter the status of the ticket... sajith v ?

          Brock Noland added a comment -

          Yeah the ultimate solution to this should be HIVE-2379. I think this one can be resolved as a duplicate.

          Guido Serra aka Zeph added a comment -

          uhmm... apparently HUE does not seem to pick it up

          java.io.IOException: Cannot create an instance of InputSplit class = org.apache.hadoop.hive.hbase.HBaseSplit:Class org.apache.hadoop.hive.hbase.HBaseSplit not found

          Guido Serra aka Zeph added a comment -

          btw... this is enough to fix it:

          <property>
          <name>hive.aux.jars.path</name>
          <value>file:///usr/lib/hive/lib/hbase.jar,file:///usr/lib/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar</value>
          </property>
          

          from Cloudera Manager into Hive Client Configuration Safety Valve for hive-site.xml

          Guido Serra aka Zeph added a comment -

          indeed, HIVE-2379 is the source of this

          Navis added a comment -

          I think HIVE-2379 is addressing the same issue. Could you try it?

          Guido Serra aka Zeph added a comment -
          Name        : hive                         Relocations: (not relocatable)
          Version     : 0.9.0+155                         Vendor: (none)
          Release     : 1.cdh4.1.2.p0.21.el6          Build Date: Fri Nov  2 01:34:37 2012
          Install Date: Mon Jan 14 15:50:48 2013         Build Host: rhel64-6-0-mk1.jenkins.cloudera.com
          Group       : Development/Libraries         Source RPM: hive-0.9.0+155-1.cdh4.1.2.p0.21.el6.src.rpm
          Size        : 153016181                        License: Apache License v2.0
          Signature   : DSA/SHA1, Mon Nov  5 16:16:55 2012, Key ID f90c0d8fe8f86acd
          
          Name        : hbase                        Relocations: (not relocatable)
          Version     : 0.92.1+160                        Vendor: (none)
          Release     : 1.cdh4.1.2.p0.24.el6          Build Date: Fri Nov  2 01:03:06 2012
          Install Date: Mon Jan 14 15:50:31 2013         Build Host: rhel64-6-0-mk1.jenkins.cloudera.com
          Group       : Development/Libraries         Source RPM: hbase-0.92.1+160-1.cdh4.1.2.p0.24.el6.src.rpm
          Size        : 39911513                         License: APL2
          Signature   : DSA/SHA1, Mon Nov  5 16:16:52 2012, Key ID f90c0d8fe8f86acd
          
          Guido Serra aka Zeph added a comment -

          I have the same issue... I set up an environment with Cloudera Manager, the latest with CDH4.

          The libraries are available under /usr/lib/hive/lib, and they are correctly in the CLASSPATH that gets computed by the hive-env.sh script (I printed it out), but they still do not seem to get picked up; if I force them from the command line, they work:

          [zeph@yallayalla ~]$ hive
          Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.9.0-cdh4.1.2.jar!/hive-log4j.properties
          Hive history file=/tmp/zeph/hive_job_log_zeph_201301311652_675003173.txt
          hive> add jar /usr/lib/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar;
          Added /usr/lib/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar to class path
          Added resource: /usr/lib/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar
          hive> add jar /usr/lib/hive/lib/hbase.jar;
          Added /usr/lib/hive/lib/hbase.jar to class path
          Added resource: /usr/lib/hive/lib/hbase.jar
          hive> select count(*) from yallayalla_vendors;                             
          Total MapReduce jobs = 1
          Launching Job 1 out of 1
          Number of reduce tasks determined at compile time: 1
          In order to change the average load for a reducer (in bytes):
            set hive.exec.reducers.bytes.per.reducer=<number>
          In order to limit the maximum number of reducers:
            set hive.exec.reducers.max=<number>
          In order to set a constant number of reducers:
            set mapred.reduce.tasks=<number>
          Starting Job = job_201301311629_0008, Tracking URL = http://yallayalla:50030/jobdetails.jsp?jobid=job_201301311629_0008
          Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=yallayalla:8021 -kill job_201301311629_0008
          Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
          2013-01-31 16:52:50,532 Stage-1 map = 0%,  reduce = 0%
          2013-01-31 16:52:56,591 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.32 sec
          2013-01-31 16:52:57,601 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.32 sec
          2013-01-31 16:52:58,608 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.32 sec
          2013-01-31 16:52:59,615 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.32 sec
          2013-01-31 16:53:00,629 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.14 sec
          2013-01-31 16:53:01,641 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.14 sec
          MapReduce Total cumulative CPU time: 4 seconds 140 msec
          Ended Job = job_201301311629_0008
          MapReduce Jobs Launched: 
          Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.14 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
          Total MapReduce CPU Time Spent: 4 seconds 140 msec
          OK
          749
          Time taken: 20.024 seconds
          hive> 
          
          Nick Dimiduk added a comment -

          This sounds like a classpath issue. Can you further describe your environment?


  People

  • Assignee: Nick Dimiduk
  • Reporter: sajith v
  • Votes: 1
  • Watchers: 15
