Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-31770

spark double count issue when table is recreated from hive

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.2, 2.4.5
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None

      Description

      Spark sets serde property named path when a table is created. Along with it , it will add a property called sql.source.provider that ensures the consistency. But if we recreate the same table using hive , then the hive will not respect the serde properties set by spark named sql.source.provider. This will lead to conflict if we read the newly created table using spark . 

       

      It can be argued that the issue is with hive because it do not properly ensures that the table properties are being copied. But these are the properties set by spark on a hive table. and spark should be the one adhering to the standards of hive. From a corporate standpoint , we cannot tell users to avoid creating or altering tables on hive because spark will end up giving wrong results. Worst of all , even if it failed then it was okay. but it simply gives double count without even a failure message and that is very dangerous for critical applications. Either end up giving error message if the serde property does not have path as well as sql.source.provider (or) do not even try to create the property path in first place.  Few related issues : 

      1. SPARK-31751
      2. SPARK-28266 was resolved as non-reproducible which is not entirely true

       

      steps to reproduce :

       

      df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
      df.createOrReplaceTempView("test1")
      spark.sql("CREATE TABLE test1_using_orc USING ORC AS (SELECT * from test1)")
      

       


      Run the below from hive ( preferably tez )

       

      create table test2_like_test1 like test1_using_orc;
      insert into test2_like_test1 select * from test1_using_orc;

      -------from spark again : Now test2_like_test1 will have serde property called path , but will not have sql.source.provider. 

       

       

      >>> spark.sql("set spark.sql.hive.convertMetastoreOrc=false").show()
      +--------------------+-----+
      | key|value|
      +--------------------+-----+
      |spark.sql.hive.co...|false|
      +--------------------+-----+
      >>> spark.sql("select count(*) from test2_like_test1").show()
      +--------+
      |count(1)|
      +--------+
      | 1|
      +--------+
      >>> spark.sql("set spark.sql.hive.convertMetastoreOrc=true").show()
      +--------------------+-----+
      | key|value|
      +--------------------+-----+
      |spark.sql.hive.co...| true|
      +--------------------+-----+
      >>> spark.sql("select count(*) from test2_like_test1").show()
      +--------+
      |count(1)|
      +--------+
      | 2|
      +--------+
      

       

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              niths Nithin
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated: