[SPARK-35108] Pickle produces incorrect key labels for GenericRowWithSchema (data corruption) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Duplicate
Affects Version/s: 3.0.1, 3.0.2
Fix Version/s: None
Component/s: SQL
Labels:
- correctness

Description

I think this also affects all versions of Spark that pickle the data when doing a collect from Python.

When you do a collect in Python, the Java side does a collect and converts the UnsafeRows into GenericRowWithSchema instances before it sends them to the Pickler. By default, the Pickler tries to dedupe objects using hashCode and .equals. But .equals and .hashCode for GenericRowWithSchema only look at the data, not the schema, whereas the keys written out when a row is pickled come from the schema.

This can result in a form of data corruption in a few cases where a row has the same number of elements as a struct within the row, or as a sub-struct within another struct.

If the data happens to be the same, the keys for the resulting row or struct can be wrong.
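The failure mode can be illustrated without Spark. The following is a minimal pure-Python sketch, not the actual Pickler or GenericRowWithSchema code: `ToyRow` and `toy_serialize` are hypothetical names invented for this illustration, and the memo keyed by value equality is only an assumption modeled on the dedupe behavior described above.

```python
class ToyRow:
    """Hypothetical stand-in for a row with a schema.

    Equality and hashing look only at the values, never at the field
    names, mirroring the behavior described for GenericRowWithSchema.
    """

    def __init__(self, names, values):
        self.names = tuple(names)
        self.values = tuple(values)

    def __eq__(self, other):
        return isinstance(other, ToyRow) and self.values == other.values

    def __hash__(self):
        return hash(self.values)


def toy_serialize(obj, memo):
    """Turn a ToyRow into a dict keyed by field name.

    The memo dedupes by equality (hash + ==), so a row that compares
    equal to an earlier one reuses the earlier output, keys included.
    """
    if not isinstance(obj, ToyRow):
        return obj
    if obj in memo:
        return memo[obj]
    out = {name: toy_serialize(value, memo)
           for name, value in zip(obj.names, obj.values)}
    memo[obj] = out
    return out


# Two structs with different field names but identical data.
inner_a = ToyRow(["x", "y"], [1, 2])
inner_b = ToyRow(["p", "q"], [1, 2])
outer = ToyRow(["a", "b"], [inner_a, inner_b])

result = toy_serialize(outer, {})
print(result)
# {'a': {'x': 1, 'y': 2}, 'b': {'x': 1, 'y': 2}}
```

Column `b` should have come out with keys `p` and `q`, but it picks up the keys of `a` because the memo treated the two structs as the same object.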

My repro case is a bit convoluted, but it does happen.

Attachments

- test.py (5 kB), 16/Apr/21 21:37, Robert Joseph Evans
- test.sh (0.5 kB), 16/Apr/21 21:37, Robert Joseph Evans

Issue Links

- duplicates SPARK-34545: PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together (Resolved)

People

Assignee: Unassigned

Reporter: Robert Joseph Evans

Votes: 0

Watchers: 2

Dates

Created: 16/Apr/21 21:37

Updated: 12/Dec/22 18:10

Resolved: 04/May/21 12:46