Spark / SPARK-24261

Spark cannot read renamed managed Hive table


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      When Spark creates a Hive table using df.write.saveAsTable, it creates a managed table in Hive with SERDEPROPERTIES such as:

      WITH SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')

      When an external user renames the Hive table via the Hive CLI or Hue, Hive updates the table name and moves the data to the new location, but it never updates the 'path' serde property shown above.
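      The stale property can be confirmed from the Hive CLI (the table and bucket names below are the illustrative ones used throughout this report):

```sql
-- Show the full table definition, including serde properties
SHOW CREATE TABLE some_db.some_new_table_buggy_path;
-- After the rename, LOCATION points at the new directory, but the
-- Spark-written serde property still carries the old path:
--   WITH SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table')
```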

      Steps to Reproduce:

      1. Save table using spark:
      spark.sql("select * from some_db.some_table").write.saveAsTable("some_db.some_new_table")

      2. In Hive CLI or Hue, run
      alter table some_db.some_new_table rename to some_db.some_new_table_buggy_path

      3. Try to read the renamed table some_db.some_new_table_buggy_path in Spark:
      spark.sql("select * from some_db.some_new_table_buggy_path limit 10").collect

      Spark logs the following warnings and returns an empty result, even though Hive can still read this table:

      18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from cache
      18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing from cache
      18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it deleted very recently?
      res2: Array[org.apache.spark.sql.Row] = Array()

      The DDLs for each of the tables are attached. 

      This creates an inconsistency, and end users will spend endless time hunting for the bug when data exists in both locations: Spark reads from the stale location while Hive processes write new data to the new one.
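      As a possible manual workaround (a sketch only, not an official fix; paths are illustrative), the stale serde property can be repointed at the table's current location from Hive:

```sql
-- In Hive: point the Spark-specific 'path' serde property
-- at the renamed directory
ALTER TABLE some_db.some_new_table_buggy_path
  SET SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path');
```

      After that, running spark.sql("REFRESH TABLE some_db.some_new_table_buggy_path") clears Spark's cached metadata for the table before re-reading it.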

      I went through similar JIRAs, but those address different issues.

      SPARK-15635 and SPARK-16570 address ALTER TABLE issued from within Spark, whereas in this JIRA an external process renames the table.

       

      Attachments


          People

            Assignee: Unassigned
            Reporter: Suraj Nayak (snayakm)
            Votes: 1
            Watchers: 7

