Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.2.0, 2.2.1, 2.2.2
Description
Spark returns duplicate rows when the `path` serde property is set on a Parquet table.
Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 at least).
Reproducer:
>>> spark.sql("create table ruslan_test.test55 as select 1 as id")
DataFrame[]
>>> spark.table("ruslan_test.test55").explain()
== Physical Plan ==
HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
>>> spark.table("ruslan_test.test55").count()
1
(All is good at this point. Now exit the session and run the following, in Hive for example:)
ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
The table LOCATION and the serde `path` property now point to the same location.
Back in Spark, count returns two records instead of one:
>>> spark.table("ruslan_test.test55").count()
2
>>> spark.table("ruslan_test.test55").explain()
== Physical Plan ==
*(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
Also notice that the presence of the `path` serde property makes the table location show up twice in the file index:
InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive...,
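The doubling seen above can be modelled with a few lines of plain Python. This is only a sketch of the observed behaviour, not Spark's actual code: it assumes the scan roots are built from the table LOCATION plus the serde `path` property, as the plan output suggests, and the function names `scan_roots`, `dedupe_roots`, and `count_rows` are hypothetical.

```python
def scan_roots(location, serde_properties):
    """Return the directories to scan: the table LOCATION plus the serde `path`, if set."""
    roots = [location]
    if "path" in serde_properties:
        roots.append(serde_properties["path"])
    return roots


def dedupe_roots(roots):
    """Drop repeated directories, ignoring trailing slashes and keeping first-seen order."""
    seen = set()
    unique = []
    for root in roots:
        key = root.rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def count_rows(roots, rows_by_dir):
    """Sum the rows found under each root; a directory listed twice is counted twice."""
    return sum(rows_by_dir[root.rstrip("/")] for root in roots)


location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
rows_by_dir = {location: 1}  # the table from the reproducer holds a single row

roots = scan_roots(location, {"path": location})
doubled = count_rows(roots, rows_by_dir)              # same directory scanned twice
fixed = count_rows(dedupe_roots(roots), rows_by_dir)  # each directory scanned once
print(doubled, fixed)
```

Under these assumptions, removing repeated roots before listing files is enough to bring the count back to the expected value.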
We have some applications that create Parquet tables in Hive with the `path` serde property set,
and this causes duplicated data in query results.
Hive, Impala, and Spark 2.1 and earlier read such tables correctly; Spark 2.2 and later releases do not.
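To find tables that could hit this, one option is to compare each table's LOCATION with its `path` serde property. The following is a minimal sketch, assuming the table metadata has already been collected into plain dicts. The keys `name`, `location`, and `serde_properties`, both function names, and the second table in the example are hypothetical, and how the metadata is extracted from the metastore is left out.

```python
def has_redundant_path(location, serde_properties):
    """True when the serde `path` property points at the table's own LOCATION."""
    path = serde_properties.get("path")
    if path is None:
        return False
    # Compare without trailing slashes so equivalent spellings match.
    return path.rstrip("/") == location.rstrip("/")


def affected_tables(tables):
    """Return the names of tables whose serde `path` duplicates their LOCATION."""
    return [
        t["name"]
        for t in tables
        if has_redundant_path(t["location"], t["serde_properties"])
    ]


tables = [
    {
        "name": "ruslan_test.test55",
        "location": "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55",
        "serde_properties": {
            "path": "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
        },
    },
    {
        # Hypothetical table without a `path` serde property.
        "name": "ruslan_test.other",
        "location": "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/other",
        "serde_properties": {},
    },
]

affected = affected_tables(tables)
print(affected)
```

Only the case reported here, where `path` equals LOCATION, is flagged; whether other `path` values cause problems is not covered by this report.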
Issue Links
- is duplicated by
  SPARK-37027 Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES (Open)
- is related to
  HIVE-21952 Hive should allow to delete serde properties too, not just add them (Resolved)
- links to