Apache Hudi / HUDI-6868

Hudi HiveSync doesn't support extracting passwords from credential store


Details

    Description

      We have a customer use-case of running PySpark on Dataproc Serverless with hudi-spark3-bundle. The PySpark job fails to sync the Hudi table with the HMS DB (a remote Cloud SQL DB instance) because it is not able to extract the password from the credential store.

      The same job works fine if we pass the Hive Metastore DB user password directly instead of using the credential store.

      Checking the code for the HiveSync configs (HiveSyncConfigHolder), I don't see any option that detects a credential store for extracting passwords - nothing like this code from the HMS ObjectStore.

      The Hive Sync Config documentation also has no reference to using a credential store.

      In order to find the password through the Hadoop Credential Provider API, Hudi would need to call `Configuration#getPassword(String)`. We don't see "getPassword" called anywhere in the Hudi codebase.
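      To illustrate the lookup order a fix would rely on: Hadoop's `Configuration#getPassword(String)` consults any providers configured via `hadoop.security.credential.provider.path` first and only then falls back to the clear-text config value. The sketch below models that fallback with plain-Java stand-ins (the class, maps, and method names other than `getPassword` are hypothetical, not Hudi or Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Stand-in for Configuration#getPassword's lookup order: credential
// provider first, then the clear-text configuration value.
public class PasswordLookupSketch {
    static final String PASSWORD_KEY = "javax.jdo.option.ConnectionPassword";

    // Stand-in for a JCEKS credential store: alias -> secret.
    private final Map<String, char[]> credentialStore = new HashMap<>();
    // Stand-in for plain Configuration entries.
    private final Map<String, String> conf = new HashMap<>();

    void setSecret(String alias, String secret) {
        credentialStore.put(alias, secret.toCharArray());
    }

    void set(String key, String value) {
        conf.put(key, value);
    }

    // Mirrors the documented behavior of Configuration#getPassword:
    // provider entries win, clear-text config is the fallback.
    Optional<char[]> getPassword(String name) {
        char[] fromStore = credentialStore.get(name);
        if (fromStore != null) {
            return Optional.of(fromStore);
        }
        String fromConf = conf.get(name);
        return fromConf == null ? Optional.empty() : Optional.of(fromConf.toCharArray());
    }
}
```

      If HiveSync resolved `javax.jdo.option.ConnectionPassword` this way instead of reading the raw config value, both the direct-password and the credential-store variants of the job would work unchanged.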

       

      Repro steps:

       

      Sample PySpark script - Attached. 

       

      Command that succeeds (Metastore DB password passed directly):

      gcloud dataproc batches submit --version 1.1 --container-image gcr.io/<container-repo>/new-custom-debian:v4 --region <region> pyspark gs://<gcs-bucket>/pyspark_hudi_test.py --jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" --properties "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.javax.jdo.option.ConnectionPassword=<hive-db-user-password>" --deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/* 

       

      Failing command (with credential store):

      gcloud dataproc batches submit --version 1.1 --container-image gcr.io/<container-repo>/new-custom-debian:v4 --region <region> pyspark gs://<gcs-bucket>/pyspark_hudi_test.py --jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" --properties "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.hadoop.security.credential.provider.path=jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks" --deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*  

       

      Error:

      23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Commit 20230911042953444 successful!
      23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.inlineCompactionEnabled ? false
      23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Compaction Scheduled is Optional.empty
      23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ? false
      23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Clustering Scheduled is Optional.empty
      23/09/11 04:30:42 INFO HiveConf: Found configuration file null
      [..]
      23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from gs://<gcs-bucket>/
      23/09/11 04:30:42 INFO HoodieTableConfig: Loading table properties from gs://<gcs-bucket>/.hoodie/hoodie.properties
      23/09/11 04:30:42 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs://<gcs-bucket>/
      23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading Active commit timeline for gs://<gcs-bucket>/
      23/09/11 04:30:42 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20230911042953444__commit__COMPLETED]}
      23/09/11 04:30:43 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
      23/09/11 04:30:43 INFO ObjectStore: ObjectStore, initialize called
      23/09/11 04:30:44 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
      Mon Sep 11 04:30:44 UTC 2023 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
      [..]
      Unable to open a test connection to the given database. JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
      java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' (using password: YES)
      at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3933)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3869)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:864)
      at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1707)
      at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1217)
      [..]
      ------
      org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
      java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' (using password: YES)
      at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
       
      [..]
       
      Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
      java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' (using password: YES)
      at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
       
      

       

      Note - metastore-pass-v2.jceks in the above example contains the value of "javax.jdo.option.ConnectionPassword", and there is no issue with the store itself: the same credential store works fine for other PySpark jobs (without Hudi, of course).
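      For reference, a JCEKS store like the one above can be created with Hadoop's credential CLI (a sketch; the placeholders match the repro commands, and `-value` is shown only for illustration since it puts the secret on the command line):

```shell
# Store the metastore password under the alias HiveSync would need to look up.
hadoop credential create javax.jdo.option.ConnectionPassword \
  -provider jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks \
  -value <hive-db-user-password>
```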

       

      We tried with "hudi-spark3-bundle_2.12-0.13.1.jar" as well; it did not help.

      Attachments

        1. pyspark_hudi_test.py (2 kB, Kuldeep Kulkarni)


            People

              Assignee: Unassigned
              Reporter: Kuldeep Kulkarni
