Details
- Type: Improvement
- Status: Open
- Priority: Trivial
- Resolution: Unresolved
- Affects Version/s: 2.4.5, 3.1.2
- Fix Version/s: None
- Component/s: None
Description
If a Hive table is created with both WITH SERDEPROPERTIES ('path'='<tableLocation>') and LOCATION '<tableLocation>', Spark can return doubled rows when reading the table. This issue appears to be an extension of SPARK-30507.
Steps to reproduce:
- Create the table and insert records via Hive (Spark does not allow inserting into a table defined like this)
CREATE TABLE `test_table`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';

INSERT INTO TABLE `test_table` VALUES (0, '0');

SELECT * FROM `test_table`;
-- returns:
-- 0 0
- Read the table above from Spark
SELECT * FROM `test_table`;
-- returns:
-- 0 0
-- 0 0
However, if spark.sql.hive.convertMetastoreParquet is set to false, Spark returns the same result as Hive (i.e. a single row).
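For clarity, a minimal sketch of this comparison when run as Spark SQL, reusing `test_table` from the reproduce steps (expected output shown as comments):

-- default behaviour: converted Parquet read, rows are doubled
SET spark.sql.hive.convertMetastoreParquet=true;
SELECT * FROM `test_table`;
-- 0 0
-- 0 0

-- workaround: fall back to the Hive SerDe reader, single row as in Hive
SET spark.sql.hive.convertMetastoreParquet=false;
SELECT * FROM `test_table`;
-- 0 0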
A similar case: if a Hive table is created with both WITH SERDEPROPERTIES ('path'='<anotherPath>') and LOCATION '<tableLocation>', Spark reads both the rows under anotherPath and the rows under tableLocation, regardless of the value of spark.sql.hive.convertMetastoreParquet. Hive, however, appears to return only the rows under tableLocation.
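A sketch of the DDL for this case, mirroring the schema and SerDe from the reproduce steps; the table name `test_table_2` is hypothetical and the paths are placeholders:

CREATE TABLE `test_table_2`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<anotherPath>')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';
-- Spark reads rows from both <anotherPath> and <tableLocationPath>;
-- Hive appears to return only the rows under <tableLocationPath>.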
Another similar case: if the path is provided in TBLPROPERTIES instead, Spark does not double the rows when 'path'='<tableLocation>'. If 'path'='<anotherPath>', Spark reads both the rows under anotherPath and the rows under tableLocation, while Hive appears to ignore the path in TBLPROPERTIES entirely.
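Similarly, a sketch of the TBLPROPERTIES variant; the table name `test_table_3` is hypothetical and the paths are placeholders:

CREATE TABLE `test_table_3`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>'
TBLPROPERTIES ('path'='<tableLocationPath>');
-- with 'path'='<tableLocationPath>' Spark does not double the rows;
-- with 'path'='<anotherPath>' Spark reads both locations, while Hive ignores the property.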
Code examples for the above cases (a diff patch written against HiveParquetMetastoreSuite.scala) can be found in the Attachments section.
Attachments
Issue Links
- duplicates: SPARK-28266 "data duplication when `path` serde property is present" (Resolved)