[SPARK-31751] spark serde property path overwrites table property location - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.3.1, 2.4.5
Fix Version/s: None
Component/s: SQL
Labels: None

Description

This issue has caused us many data errors.

1) Using Spark (with Hive support enabled)

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable("test_spark")

2) From Hive

alter table test_spark rename to test_spark2

3) From the spark-sql command line (note: not pyspark or spark-shell)

select * from test_spark2

This produces the output:

NULL NULL NULL
Time taken: 0.334 seconds, Fetched 1 row(s)

The query returns NULL because the pyspark write API adds a serde property called path to the Hive metastore. When Hive renames the table, it does not understand this serde property and so leaves it unchanged. When spark-sql then reads the table, it honors the serde property first and tries to read from the HDFS location that no longer exists. An error would have been acceptable, but silently returning NULL causes applications to fail badly. Spark claims to support Hive tables, so it should respect the Hive metastore location property rather than the Spark serde property when reading a table. This cannot be classified as expected behavior.
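The mismatch described above can be illustrated with a small pure-Python model. This is only a sketch of the reported behavior, not Spark or Hive code: the TableEntry class, the three helper functions, and the /warehouse directory are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TableEntry:
    """Toy stand-in for a metastore table record (hypothetical, not a Hive or Spark class)."""
    name: str
    location: str  # table-level location property
    serde_properties: Dict[str, str] = field(default_factory=dict)


def spark_style_create(name: str, warehouse: str) -> TableEntry:
    # Models saveAsTable as described in the report: the same directory is
    # recorded twice, as the table location and as the serde property "path".
    loc = f"{warehouse}/{name}"
    return TableEntry(name, loc, {"path": loc})


def hive_style_rename(table: TableEntry, new_name: str, warehouse: str) -> TableEntry:
    # Models the rename from Hive: the location follows the new name, while
    # the serde property "path" is copied unchanged.
    return TableEntry(new_name, f"{warehouse}/{new_name}", dict(table.serde_properties))


def resolve_read_path(table: TableEntry, prefer_serde_path: bool) -> str:
    # prefer_serde_path=True models the behavior reported in this issue;
    # False models the behavior the reporter asks for.
    if prefer_serde_path and "path" in table.serde_properties:
        return table.serde_properties["path"]
    return table.location


warehouse = "/warehouse"  # hypothetical warehouse directory
renamed = hive_style_rename(
    spark_style_create("test_spark", warehouse), "test_spark2", warehouse
)
print(resolve_read_path(renamed, prefer_serde_path=True))   # stale pre-rename directory
print(resolve_read_path(renamed, prefer_serde_path=False))  # current table location
```

In this model the two properties disagree after the rename, so a reader that trusts the serde path first ends up at the old directory, which matches the symptom in step 3.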

Issue Links

links to

[Github] Pull Request #28882 (TJX2014)

People

Assignee: Unassigned

Reporter: Nithin

Votes: 0

Watchers: 4

Dates

Created: 18/May/20 16:04

Updated: 21/Jun/20 07:03