Details
- Type: Improvement
- Status: Open
- Priority: Trivial
- Resolution: Unresolved
- Affects Version/s: 2.4.5, 3.1.2
- Fix Version/s: None
- Component/s: None
Description
If a Hive table is created with both WITH SERDEPROPERTIES ('path'='<tableLocation>') and LOCATION '<tableLocation>', Spark can return doubled rows when reading the table. This issue appears to be an extension of SPARK-30507.
Steps to reproduce:
- Create the table and insert records via Hive (Spark does not allow inserting into a table defined like this)
CREATE TABLE `test_table`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';

INSERT INTO TABLE `test_table` VALUES (0, '0');

SELECT * FROM `test_table`;
-- returns:
-- 0 0
- Read the table above from Spark
SELECT * FROM `test_table`;
-- returns:
-- 0 0
-- 0 0
However, if spark.sql.hive.convertMetastoreParquet is set to false, Spark returns the same result as Hive (i.e. a single row).
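For clarity, a minimal sketch of this comparison when run as Spark SQL, reusing `test_table` from the reproduce steps (expected output shown as comments):

-- default behaviour: converted Parquet read, rows are doubled
SET spark.sql.hive.convertMetastoreParquet=true;
SELECT * FROM `test_table`;
-- 0 0
-- 0 0

-- workaround: fall back to the Hive SerDe reader, single row as in Hive
SET spark.sql.hive.convertMetastoreParquet=false;
SELECT * FROM `test_table`;
-- 0 0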
A similar case: if a Hive table is created with both WITH SERDEPROPERTIES ('path'='<anotherPath>') and LOCATION '<tableLocation>', Spark reads both the rows under anotherPath and the rows under tableLocation, regardless of the value of spark.sql.hive.convertMetastoreParquet. Hive, however, appears to return only the rows under tableLocation.
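A sketch of the DDL for this case, mirroring the schema and SerDe from the reproduce steps; the table name `test_table_2` is hypothetical and the paths are placeholders:

CREATE TABLE `test_table_2`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<anotherPath>')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';
-- Spark reads rows from both <anotherPath> and <tableLocationPath>;
-- Hive appears to return only the rows under <tableLocationPath>.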
Another similar case: if the path is provided in TBLPROPERTIES instead, Spark does not double the rows when 'path'='<tableLocation>'. If 'path'='<anotherPath>', Spark reads both the rows under anotherPath and the rows under tableLocation, while Hive appears to ignore the path in TBLPROPERTIES entirely.
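Similarly, a sketch of the TBLPROPERTIES variant; the table name `test_table_3` is hypothetical and the paths are placeholders:

CREATE TABLE `test_table_3`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>'
TBLPROPERTIES ('path'='<tableLocationPath>');
-- with 'path'='<tableLocationPath>' Spark does not double the rows;
-- with 'path'='<anotherPath>' Spark reads both locations, while Hive ignores the property.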
Code examples for the above cases (a diff patch written against HiveParquetMetastoreSuite.scala) can be found in the Attachments section.
Attachments
Issue Links
- duplicates: SPARK-28266 "data duplication when `path` serde property is present" (Resolved)