[SPARK-28266] data duplication when `path` serde property is present - ASF JIRA


Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0, 2.2.1, 2.2.2
Fix Version/s: 3.2.0, 3.1.3, 3.0.4
Component/s: Spark Core
Labels:
- correctness

Description

Spark returns duplicated rows when the `path` serde property is set on a Parquet table.

Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.

Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 at least).

Reproducer:

>>> spark.sql("create table ruslan_test.test55 as select 1 as id")
DataFrame[]

>>> spark.table("ruslan_test.test55").explain()

== Physical Plan ==
HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]

>>> spark.table("ruslan_test.test55").count()
1

(All is good at this point. Now exit the session and run the following, in Hive for example:)

ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )

The table LOCATION and the serde `path` property now point to the same directory.
Back in Spark, count returns two records instead of one:

>>> spark.table("ruslan_test.test55").count()
2

>>> spark.table("ruslan_test.test55").explain()
== Physical Plan ==
*(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
>>>

Also notice that the presence of the `path` serde property makes the table location
show up twice in the file index:

InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive...,
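The doubling can be illustrated without Spark. The sketch below is a toy model, not Spark's actual file index code: `FILES_BY_DIR`, `ROWS_PER_FILE`, `count_rows` and `dedupe_roots` are hypothetical names, and the Parquet file name is made up. It only shows why scanning the same directory once per listed root doubles the count, and that de-duplicating the roots restores it. Whether the linked pull request takes this approach is not something this sketch claims.

```python
# Toy model of a table scan over a list of root paths (NOT Spark internals).

TABLE_DIR = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"

# Hypothetical directory listing: one data file holding the single row (id = 1).
FILES_BY_DIR = {TABLE_DIR: ["part-00000.parquet"]}
ROWS_PER_FILE = {"part-00000.parquet": 1}


def count_rows(root_paths):
    """Sum the rows of every file under every root path, exactly as listed."""
    return sum(
        ROWS_PER_FILE[name]
        for root in root_paths
        for name in FILES_BY_DIR.get(root, [])
    )


def dedupe_roots(root_paths):
    """Drop repeated roots (ignoring a trailing slash), keeping first-seen order."""
    return list(dict.fromkeys(p.rstrip("/") for p in root_paths))


location = TABLE_DIR     # the table LOCATION
serde_path = TABLE_DIR   # the serde `path` property, set by ALTER TABLE

naive_roots = [location, serde_path]
print(count_rows(naive_roots))                # same directory scanned twice
print(count_rows(dedupe_roots(naive_roots)))  # each directory scanned once
```

With both roots listed, the single file is read twice, which matches the count of 2 reported above.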

We have some applications that create Parquet tables in Hive with the `path` serde property,
and this duplicates the data in query results.

Hive, Impala, and Spark 2.1 and earlier read such tables correctly; Spark 2.2 and later releases do not.

Issue Links

is duplicated by

SPARK-37027 Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES (Open)

is related to

HIVE-21952 Hive should allow to delete serde properties too, not just add them (Resolved)

links to

[Github] Pull Request #33328 (shardulm94)

People

Assignee: Shardul Mahadik

Reporter: Ruslan Dautkhanov

Votes: 0

Watchers: 7

Dates

Created: 05/Jul/19 23:29

Updated: 18/Oct/21 18:10

Resolved: 21/Jul/21 14:42