Spark / SPARK-28266

Data duplication when the `path` serde property is present

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Optimizer, Spark Core
    • Labels:

      Description

      Spark returns duplicated rows when the `path` serde property is set on a Parquet table.

      Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.

      Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 at least).

      Reproducer:

      >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
      DataFrame[]
      
      >>> spark.table("ruslan_test.test55").explain()
      
      == Physical Plan ==
      HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
      
      >>> spark.table("ruslan_test.test55").count()
      1
      
      

      (All is good at this point. Now exit the session and run the following, for example in Hive:)

      ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
      

      The table LOCATION and the serde `path` property now point to the same directory.
      The count now returns two records instead of one:

      >>> spark.table("ruslan_test.test55").count()
      2
      
      >>> spark.table("ruslan_test.test55").explain()
      == Physical Plan ==
      *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      >>>
      
      

      Also note that with the `path` serde property present, the table location
      shows up twice in the file index:

      InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive...,
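
The doubled count is consistent with the same directory being scanned once per listed root. The sketch below only illustrates that effect in plain Python. It is not Spark's implementation: `count_rows` and the `rows_per_dir` map are hypothetical stand-ins for a file index and the data under it.

```python
# Hypothetical illustration only: this is not Spark's code.
from typing import Dict, List


def count_rows(root_paths: List[str], rows_per_dir: Dict[str, int],
               dedupe: bool = False) -> int:
    """Sum the rows found under each root path, optionally de-duplicating roots."""
    roots = [p.rstrip("/") for p in root_paths]
    if dedupe:
        # dict.fromkeys drops repeated roots while keeping first-seen order
        roots = list(dict.fromkeys(roots))
    return sum(rows_per_dir[p] for p in roots)


location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
rows_per_dir = {location: 1}  # the table holds a single row

# LOCATION plus the serde `path` property, both naming the same directory
roots = [location, location]

print(count_rows(roots, rows_per_dir))               # 2
print(count_rows(roots, rows_per_dir, dedupe=True))  # 1
```

If the root paths were de-duplicated before listing files, the count would match the single row actually stored.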

      We have some applications that create Parquet tables in Hive with the `path` serde property,
      and this causes duplicated rows in query results.

      Hive, Impala, and Spark 2.1 and earlier read such tables correctly; Spark 2.2 and later do not.
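
To find tables that match the condition described here, one possible check is to compare each table's LOCATION with its serde `path` property. This is a minimal sketch under assumptions: the `tables` dictionary layout, the `tables_at_risk` helper, and the second table name are hypothetical, and fetching the metadata from the metastore is left out.

```python
import posixpath
from urllib.parse import urlparse


def normalize(uri: str):
    """Reduce a URI to (scheme, authority, normalized path) for comparison."""
    parts = urlparse(uri)
    return (parts.scheme, parts.netloc, posixpath.normpath(parts.path))


def tables_at_risk(tables: dict) -> list:
    """Return names of tables whose serde `path` property equals their LOCATION."""
    flagged = []
    for name, meta in tables.items():
        serde_path = meta.get("serde_properties", {}).get("path")
        if serde_path is not None and normalize(serde_path) == normalize(meta["location"]):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical metadata; only test55 mirrors the table from this report.
tables = {
    "ruslan_test.test55": {
        "location": "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55",
        "serde_properties": {
            "path": "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55/",
        },
    },
    "ruslan_test.no_path_property": {
        "location": "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/no_path_property",
        "serde_properties": {},
    },
}

print(tables_at_risk(tables))  # ['ruslan_test.test55']
```

The comparison ignores a trailing slash, so a `path` value that differs from LOCATION only in that respect is still flagged.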

              People

              • Assignee:
                Unassigned
                Reporter:
                Tagar Ruslan Dautkhanov