[SPARK-31770] spark double count issue when table is recreated from hive - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.2, 2.4.5
Fix Version/s: None
Component/s: Spark Core
Labels:
- bulk-closed

Description

Spark sets serde property named path when a table is created. Along with it , it will add a property called sql.source.provider that ensures the consistency. But if we recreate the same table using hive , then the hive will not respect the serde properties set by spark named sql.source.provider. This will lead to conflict if we read the newly created table using spark .

It can be argued that the issue is with hive because it do not properly ensures that the table properties are being copied. But these are the properties set by spark on a hive table. and spark should be the one adhering to the standards of hive. From a corporate standpoint , we cannot tell users to avoid creating or altering tables on hive because spark will end up giving wrong results. Worst of all , even if it failed then it was okay. but it simply gives double count without even a failure message and that is very dangerous for critical applications. Either end up giving error message if the serde property does not have path as well as sql.source.provider (or) do not even try to create the property path in first place. Few related issues :

SPARK-31751
~~SPARK-28266~~ was resolved as non-reproducible which is not entirely true

steps to reproduce :

df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.createOrReplaceTempView("test1")
spark.sql("CREATE TABLE test1_using_orc USING ORC AS (SELECT * from test1)")

Run the below from hive ( preferably tez )

create table test2_like_test1 like test1_using_orc;
insert into test2_like_test1 select * from test1_using_orc;

-------from spark again : Now test2_like_test1 will have serde property called path , but will not have sql.source.provider.

>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=false").show()
+--------------------+-----+
| key|value|
+--------------------+-----+
|spark.sql.hive.co...|false|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
| 1|
+--------+
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=true").show()
+--------------------+-----+
| key|value|
+--------------------+-----+
|spark.sql.hive.co...| true|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
| 2|
+--------+

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Nithin

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 20/May/20 09:05

Updated:: 25/May/21 01:53

Resolved:: 25/May/21 01:39