SPARK-31675

Inserting data into a table with a remote location fails because of the Hive encryption check


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.6, 3.0.0, 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Before the fix for HIVE-14380 (https://issues.apache.org/jira/browse/HIVE-14380) in Hive 2.2.0, Hive performs an encryption check on the source and destination paths when it moves files from the staging directory to the final table directory:

      // Some comments here
           if (!isSrcLocal) {
              // For NOT local src file, rename the file
              if (hdfsEncryptionShim != null && (hdfsEncryptionShim.isPathEncrypted(srcf) || hdfsEncryptionShim.isPathEncrypted(destf))
                  && !hdfsEncryptionShim.arePathsOnSameEncryptionZone(srcf, destf))
              {
                LOG.info("Copying source " + srcf + " to " + destf + " because HDFS encryption zones are different.");
                success = FileUtils.copy(srcf.getFileSystem(conf), srcf, destf.getFileSystem(conf), destf,
                    true,    // delete source
                    replace, // overwrite destination
                    conf);
              } else {
      

      The hdfsEncryptionShim instance holds a global FileSystem instance that belongs to the default file system. As a result, the check fails whenever it is applied to a path that belongs to a remote file system.
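The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Hadoop or Hive classes: `BoundFs`, `checkPath`, and `forPath` are hypothetical names invented for this illustration, and the check only approximates what a file system client does when it is handed a path from a different file system (the real Hadoop check also handles details such as default ports). The paths are shortened versions of the ones in this report.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;

class WrongFsSketch {

    /** Hypothetical stand-in for a file system client bound to a single URI. */
    static final class BoundFs {
        final URI root;

        BoundFs(String root) {
            this.root = URI.create(root);
        }

        /** Rejects any fully qualified path whose scheme or authority differs from this client's. */
        void checkPath(String path) {
            URI p = URI.create(path);
            // A path without a scheme is treated as relative to this file system.
            if (p.getScheme() == null) {
                return;
            }
            boolean sameScheme = p.getScheme().equalsIgnoreCase(root.getScheme());
            boolean sameAuthority =
                Objects.equals(lower(p.getAuthority()), lower(root.getAuthority()));
            if (!sameScheme || !sameAuthority) {
                throw new IllegalArgumentException(
                    "Wrong FS: " + path + ", expected: " + root);
            }
        }

        private static String lower(String s) {
            return s == null ? null : s.toLowerCase(Locale.ROOT);
        }
    }

    /** Hypothetical helper: bind a client to the file system named by the path itself. */
    static BoundFs forPath(String path) {
        URI p = URI.create(path);
        return new BoundFs(p.getScheme() + "://" + p.getAuthority());
    }

    public static void main(String[] args) {
        // One global client bound to the default file system, as in the report.
        BoundFs defaultFs = new BoundFs("hdfs://cluster1");
        String remote = "hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc";

        try {
            defaultFs.checkPath(remote);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }

        // Resolving the client from the path avoids the mismatch.
        forPath(remote).checkPath(remote);
        System.out.println("per-path client accepted " + remote);
    }
}
```

In real Hadoop code, the per-path lookup that `forPath` imitates is `Path.getFileSystem(conf)`, which the Hive snippet above already uses for the copy itself (`srcf.getFileSystem(conf)`), while the encryption check goes through the single shim bound to the default file system.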

      For example, consider the table described below:

      key	int	NULL
      
      # Detailed Table Information
      Database	bdms_hzyaoqin_test_2
      Table	abc
      Owner	bdms_hzyaoqin
      Created Time	Mon May 11 15:14:15 CST 2020
      Last Access	Thu Jan 01 08:00:00 CST 1970
      Created By	Spark 2.4.3
      Type	MANAGED
      Provider	hive
      Table Properties	[transient_lastDdlTime=1589181255]
      Location	hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc
      Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat	org.apache.hadoop.mapred.TextInputFormat
      OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Storage Properties	[serialization.format=1]
      Partition Provider	Catalog
      Time taken: 0.224 seconds, Fetched 18 row(s)
      

      The table abc resides on the remote HDFS 'hdfs://cluster2'. When we run the command below in a Spark SQL job whose default file system is 'hdfs://cluster1', it fails with a "Wrong FS" error:

      insert into bdms_hzyaoqin_test_2.abc values(1);
      
      
      Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc/.hive-staging_hive_2020-05-11_17-10-27_123_6306294638950056285-1/-ext-10000/part-00000-badf2a31-ab36-4b60-82a1-0848774e4af5-c000, expected: hdfs://cluster1
      

          People

            Assignee: Unassigned
            Reporter: Kent Yao 2 (Qin Yao)
