
[HUDI-281] HiveSync failure through Spark when useJdbc is set to false



    Description

      Table creation with Hive sync through Spark fails when useJdbc is set to false. For now, I had to modify the code to set useJdbc to false, as there is no DataSourceOption through which I can specify this field when running Hudi code.
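
      For context, this is roughly the write path that triggers the sync. A minimal sketch, assuming an existing DataFrame and base path; the table and field names are made up, and the hive_sync option keys shown are the ones the datasource exposes (note that none of them maps to useJdbc):

        import org.apache.spark.sql.{DataFrame, SaveMode}

        object HiveSyncWriteSketch {
          def write(df: DataFrame, basePath: String): Unit =
            df.write
              .format("org.apache.hudi")
              .option("hoodie.table.name", "test_table")
              .option("hoodie.datasource.write.recordkey.field", "id")
              .option("hoodie.datasource.write.partitionpath.field", "date")
              .option("hoodie.datasource.hive_sync.enable", "true")
              .option("hoodie.datasource.hive_sync.database", "default")
              .option("hoodie.datasource.hive_sync.table", "test_table")
              .option("hoodie.datasource.hive_sync.partition_fields", "date")
              // no hive_sync key exposes useJdbc, hence the code modification above
              .mode(SaveMode.Append)
              .save(basePath)
        }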

      Here is the failure:

      java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
        at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
        at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
        at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
        at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
        at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
        at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
        at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
        at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)

      I was expecting this to fail through Spark, because hive-exec is not shaded inside hudi-spark-bundle, while HiveConf is shaded and relocated. The SessionState here comes from the spark-hive jar, and naturally it does not accept the relocated HiveConf.
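
      To make the mismatch concrete, here is a toy illustration (stand-in classes, not Hudi code) of why the JVM raises NoSuchMethodError: methods are resolved by their exact parameter descriptor, so a SessionState.start that takes the original HiveConf is invisible to bytecode compiled against the relocated one.

        class OriginalConf   // stands in for org.apache.hadoop.hive.conf.HiveConf
        class RelocatedConf  // stands in for org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf

        object SessionStateStandIn {
          // spark-hive on the classpath only provides start(original HiveConf)
          def start(conf: OriginalConf): Unit = ()
        }

        object RelocationDemo extends App {
          // the lookup a caller compiled against RelocatedConf effectively performs;
          // it fails, mirroring the NoSuchMethodError in the stack trace above
          try {
            SessionStateStandIn.getClass.getDeclaredMethod("start", classOf[RelocatedConf])
          } catch {
            case e: NoSuchMethodException => println(s"no such method: ${e.getMessage}")
          }
        }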

      We in EMR are running into the same problem when trying to integrate with the Glue Catalog. For that, we have to create the Hive metastore client through Hive.get(conf).getMsc() instead of how it is being done now, so that alternate implementations of the metastore client can get created. However, because hive-exec is not shaded while HiveConf is relocated, we run into the same issue there.
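
      For reference, a minimal sketch of the client creation we have in mind, assuming Hive 2.x APIs (the Hive API spells it getMSC()). Hive.get(conf) consults hive.metastore.client.factory.class, which is how an alternate metastore client such as the Glue one gets created:

        import org.apache.hadoop.hive.conf.HiveConf
        import org.apache.hadoop.hive.metastore.IMetaStoreClient
        import org.apache.hadoop.hive.ql.metadata.Hive

        object MetastoreClientSketch {
          // Hive.get(conf) honors hive.metastore.client.factory.class, so a
          // non-thrift metastore implementation can supply the client; with the
          // current bundle this still breaks, because conf is the relocated HiveConf
          def create(conf: HiveConf): IMetaStoreClient = Hive.get(conf).getMSC()
        }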

      Shading hive-exec would not be recommended either, because it is itself an uber jar that shades a lot of things, all of which would end up in the hudi-spark-bundle jar. We would not want to go down that route. That is why we suggest considering removing the shading of Hive libraries altogether.

      We could put the shading behind a Maven profile, but the unshaded build would then have to be the default; otherwise the default build will still fail when useJdbc is set to false, and again later when we commit the Glue Catalog changes. A hypothetical sketch of such a profile follows.
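
      For illustration only, a hypothetical opt-in profile in hudi-spark-bundle's pom.xml; the relocation pattern is taken from the stack trace above, while the profile id and placement are made up:

        <profile>
          <id>shade-hive</id>
          <build>
            <plugins>
              <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <configuration>
                  <relocations>
                    <!-- today's always-on relocation, made opt-in -->
                    <relocation>
                      <pattern>org.apache.hadoop.hive.conf.</pattern>
                      <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.conf.</shadedPattern>
                    </relocation>
                  </relocations>
                </configuration>
              </plugin>
            </plugins>
          </build>
        </profile>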



          People

            Assignee: Udit Mehrotra
            Reporter: Udit Mehrotra