Spark / SPARK-24261

Spark cannot read renamed managed Hive table


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      When Spark creates a Hive table using df.write.saveAsTable, it creates a managed table in Hive with SERDEPROPERTIES such as:

      WITH SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table')

      When an external user renames the Hive table via the Hive CLI or Hue, Hive updates the table name and moves the data to the new location, but it never updates the 'path' serde property shown above.
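      The stale property can be confirmed from the Hive CLI (the table and bucket names below are the illustrative ones used throughout this report):

```sql
-- Show the full table definition, including serde properties
SHOW CREATE TABLE some_db.some_new_table_buggy_path;
-- After the rename, LOCATION points at the new directory, but the
-- Spark-written serde property still carries the old path:
--   WITH SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table')
```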

      Steps to Reproduce:

      1. Save table using spark:
      spark.sql("select * from some_db.some_table").write.saveAsTable("some_db.some_new_table")

      2. In Hive CLI or Hue, run
      alter table some_db.some_new_table rename to some_db.some_new_table_buggy_path

      3. Try to read the renamed table some_db.some_new_table_buggy_path in Spark:
      spark.sql("select * from some_db.some_new_table_buggy_path limit 10").collect

      Spark logs the following warnings and returns an empty result, even though Hive can still read this table:

      18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from cache
      18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing from cache
      18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was it deleted very recently?
      res2: Array[org.apache.spark.sql.Row] = Array()

      The DDLs for each of the tables are attached. 

      This creates an inconsistency, and end users will spend endless time hunting for the bug when data exists in both locations: Spark reads from the stale location while Hive processes write new data to the new one.
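      As a possible manual workaround (a sketch only, not an official fix; paths are illustrative), the stale serde property can be repointed at the table's current location from Hive:

```sql
-- In Hive: point the Spark-specific 'path' serde property
-- at the renamed directory
ALTER TABLE some_db.some_new_table_buggy_path
  SET SERDEPROPERTIES ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path');
```

      After that, running spark.sql("REFRESH TABLE some_db.some_new_table_buggy_path") clears Spark's cached metadata for the table before re-reading it.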

      I went through similar JIRAs, but those address different issues.

      SPARK-15635 and SPARK-16570 address ALTER TABLE issued from within Spark, whereas in this JIRA an external process renames the table.

       

      Attachments


          People

            Assignee: Unassigned
            Reporter: Suraj Nayak (snayakm)
            Votes: 1
            Watchers: 7

