  1. Spark
  2. SPARK-28266

data duplication when `path` serde property is present


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 2.2.1, 2.2.2
    • Fix Version/s: 3.2.0, 3.1.3, 3.0.4
    • Component/s: Spark Core
    • Labels:

      Description

      Spark returns duplicate rows when the `path` serde property is set on a Parquet table.

      Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.

      Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 at least).

      Reproducer:

      >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
      DataFrame[]
      
      >>> spark.table("ruslan_test.test55").explain()
      
      == Physical Plan ==
      HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
      
      >>> spark.table("ruslan_test.test55").count()
      1
      
      

      (All is good at this point. Now exit the session and run the following, for example in Hive:)

      ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
      

      Now LOCATION and the serde `path` property point to the same location.
      The count now returns two records instead of one:

      >>> spark.table("ruslan_test.test55").count()
      2
      
      >>> spark.table("ruslan_test.test55").explain()
      == Physical Plan ==
      *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
      >>>
      
      

      Also notice that the presence of the `path` serde property makes the table location
      show up twice in the file index:

      InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, hdfs://epsdatalake/hive...,
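      The effect of a repeated root path can be illustrated without Spark. The following is a minimal sketch, not Spark's actual implementation or the eventual fix: `FAKE_FS`, `normalize`, and `count_rows` are hypothetical names, and the file name `part-00000.parquet` is made up. It assumes only that every root path handed to the file index is listed independently, so the same file is read once per root.

```python
# Hypothetical stand-in for a file listing: directory -> {file name: row count}.
FAKE_FS = {
    "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55": {
        "part-00000.parquet": 1,  # made-up file name, one row as in the reproducer
    },
}


def normalize(path):
    """Strip trailing slashes so equivalent spellings of a path compare equal."""
    return path.rstrip("/")


def count_rows(root_paths, dedupe=False):
    """Sum row counts over all files found under each root path.

    With dedupe=False every root is listed independently, so a root that
    appears twice contributes its files twice.
    """
    roots = [normalize(p) for p in root_paths]
    if dedupe:
        # dict.fromkeys keeps the first occurrence and preserves order.
        roots = list(dict.fromkeys(roots))
    return sum(
        rows
        for root in roots
        for rows in FAKE_FS.get(root, {}).values()
    )


location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
serde_path = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"

print(count_rows([location, serde_path]))               # 2
print(count_rows([location, serde_path], dedupe=True))  # 1
```

      Under that assumption, collapsing identical roots before listing restores the expected count of 1.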

      We have some applications that create Parquet tables in Hive with the `path` serde property,
      and this duplicates data in query results.

      Hive, Impala, and Spark 2.1 and earlier read such tables correctly; Spark 2.2 and later releases do not.
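      To find tables that may be affected, one option is to compare each table's LOCATION with its serde `path` property. The helper below is a hypothetical sketch: `has_redundant_path` is not a Spark or Hive API, and it assumes the location and serde properties have already been collected by some other means (for example, from `DESCRIBE FORMATTED` output).

```python
def has_redundant_path(location, serde_properties):
    """Return True if the serde `path` property points at the table location.

    `location` is the table's LOCATION string; `serde_properties` is a dict
    of serde property names to values. Trailing slashes are ignored.
    """
    path = serde_properties.get("path")
    if path is None:
        return False
    return path.rstrip("/") == location.rstrip("/")


loc = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"

# Before the ALTER TABLE in the reproducer: no `path` serde property.
print(has_redundant_path(loc, {}))             # False
# After the ALTER TABLE: `path` equals LOCATION.
print(has_redundant_path(loc, {"path": loc}))  # True
```

      This only flags the exact case shown in this report (both values pointing to the same directory); it says nothing about tables whose `path` points elsewhere.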


            People

            • Assignee: Shardul Mahadik (shardulm)
            • Reporter: Ruslan Dautkhanov (Tagar)
