
[HUDI-281] HiveSync failure through Spark when useJdbc is set to false



    Description

      Table creation with Hive sync through Spark fails when useJdbc is set to false. For now, I had to modify the code to set useJdbc to false, as there is no DataSourceOption through which I can specify this field when running Hudi code.
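
      For context, this is roughly the write path that triggers the sync. A minimal sketch, assuming an existing DataFrame and base path; the table and field names are made up, and the hive_sync option keys shown are the ones the datasource exposes (note that none of them maps to useJdbc):

        import org.apache.spark.sql.{DataFrame, SaveMode}

        object HiveSyncWriteSketch {
          def write(df: DataFrame, basePath: String): Unit =
            df.write
              .format("org.apache.hudi")
              .option("hoodie.table.name", "test_table")
              .option("hoodie.datasource.write.recordkey.field", "id")
              .option("hoodie.datasource.write.partitionpath.field", "date")
              .option("hoodie.datasource.hive_sync.enable", "true")
              .option("hoodie.datasource.hive_sync.database", "default")
              .option("hoodie.datasource.hive_sync.table", "test_table")
              .option("hoodie.datasource.hive_sync.partition_fields", "date")
              // no hive_sync key exposes useJdbc, hence the code modification above
              .mode(SaveMode.Append)
              .save(basePath)
        }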

      Here is the failure:

      java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
        at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
        at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
        at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
        at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
        at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
        at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
        at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
        at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)

      I was expecting this to fail through Spark, because hive-exec is not shaded inside hudi-spark-bundle, while HiveConf is shaded and relocated. The SessionState here comes from the spark-hive jar, and naturally it does not accept the relocated HiveConf.
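
      To make the mismatch concrete, here is a toy illustration (stand-in classes, not Hudi code) of why the JVM raises NoSuchMethodError: methods are resolved by their exact parameter descriptor, so a SessionState.start that takes the original HiveConf is invisible to bytecode compiled against the relocated one.

        class OriginalConf   // stands in for org.apache.hadoop.hive.conf.HiveConf
        class RelocatedConf  // stands in for org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf

        object SessionStateStandIn {
          // spark-hive on the classpath only provides start(original HiveConf)
          def start(conf: OriginalConf): Unit = ()
        }

        object RelocationDemo extends App {
          // the lookup a caller compiled against RelocatedConf effectively performs;
          // it fails, mirroring the NoSuchMethodError in the stack trace above
          try {
            SessionStateStandIn.getClass.getDeclaredMethod("start", classOf[RelocatedConf])
          } catch {
            case e: NoSuchMethodException => println(s"no such method: ${e.getMessage}")
          }
        }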

      We in EMR are running into the same problem when trying to integrate with the Glue Catalog. For that, we have to create the Hive metastore client through Hive.get(conf).getMsc() instead of how it is being done now, so that alternate implementations of the metastore client can get created. However, because hive-exec is not shaded while HiveConf is relocated, we run into the same issue there.
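
      For reference, a minimal sketch of the client creation we have in mind, assuming Hive 2.x APIs (the Hive API spells it getMSC()). Hive.get(conf) consults hive.metastore.client.factory.class, which is how an alternate metastore client such as the Glue one gets created:

        import org.apache.hadoop.hive.conf.HiveConf
        import org.apache.hadoop.hive.metastore.IMetaStoreClient
        import org.apache.hadoop.hive.ql.metadata.Hive

        object MetastoreClientSketch {
          // Hive.get(conf) honors hive.metastore.client.factory.class, so a
          // non-thrift metastore implementation can supply the client; with the
          // current bundle this still breaks, because conf is the relocated HiveConf
          def create(conf: HiveConf): IMetaStoreClient = Hive.get(conf).getMSC()
        }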

      Shading hive-exec would not be recommended either, because it is itself an uber jar that shades a lot of things, all of which would end up in the hudi-spark-bundle jar. We would not want to go down that route. That is why we suggest considering removing the shading of Hive libraries altogether.

      We could put the shading behind a Maven profile, but the unshaded build would then have to be the default; otherwise the default build will still fail when useJdbc is set to false, and again later when we commit the Glue Catalog changes. A hypothetical sketch of such a profile follows.
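
      For illustration only, a hypothetical opt-in profile in hudi-spark-bundle's pom.xml; the relocation pattern is taken from the stack trace above, while the profile id and placement are made up:

        <profile>
          <id>shade-hive</id>
          <build>
            <plugins>
              <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <configuration>
                  <relocations>
                    <!-- today's always-on relocation, made opt-in -->
                    <relocation>
                      <pattern>org.apache.hadoop.hive.conf.</pattern>
                      <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.conf.</shadedPattern>
                    </relocation>
                  </relocations>
                </configuration>
              </plugin>
            </plugins>
          </build>
        </profile>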



          People

            Assignee: Udit Mehrotra
            Reporter: Udit Mehrotra