SPARK-48950

Corrupt data from parquet scans


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0, 4.0.0, 3.5.1, 3.5.2
    • Fix Version/s: None
    • Component/s: Input/Output
    • Environment: Spark 3.5.0, running on Kubernetes, using Azure Blob storage with hierarchical namespace enabled

    Description

      It's very rare and non-deterministic, but since Spark 3.5.0 we have started seeing a correctness bug in parquet scans when using the vectorized reader.
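
      If the problem really is tied to the vectorized reader, one possible mitigation (an assumption, not something verified in this report) is falling back to the non-vectorized parquet reader via the standard `spark.sql.parquet.enableVectorizedReader` config, at some performance cost:

      ```python
      # Possible mitigation sketch, assuming the non-vectorized reader path
      # is unaffected. Not verified in this report.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
      ```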

      We've noticed this on double-type columns, where occasionally small groups (typically tens to hundreds) of rows are replaced with implausible values like `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, -7.60562076e+240, -3.18088886e-064, 2.89435993e-116`. I think this is the result of interpreting uniformly random bits as a double. Most of my testing has been on an array-of-double column, but we have also seen it on plain, un-nested double columns.
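
      As a sanity check on that theory, here is a small standalone Python sketch (illustrative only, not taken from the attached files) showing that decoding uniformly random bits as IEEE-754 doubles produces values with the same wildly scattered magnitudes:

      ```python
      # Illustrative only: decode random 8-byte patterns as little-endian
      # IEEE-754 doubles. The results jump across hundreds of orders of
      # magnitude, much like the corrupt values quoted above.
      import os
      import struct

      for _ in range(7):
          (value,) = struct.unpack("<d", os.urandom(8))
          print(f"{value:.8e}")
      ```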

      I've been testing this by adding a filter that should return zero results but returns a non-zero count when the parquet scan has problems. I've attached screenshots of this from the Spark UI.
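
      The check is roughly of this shape (a sketch with made-up column and path names, not the attached reproduce script):

      ```python
      # Sketch of the zero-result sanity filter; the `value` column and the
      # storage path are hypothetical. On a clean scan the count is exactly
      # 0; corrupt bit patterns tend to land far outside any plausible bound.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()
      df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/data")
      bad = df.filter(F.abs(F.col("value")) > 1e12).count()
      print(f"rows failing sanity filter: {bad}")  # expected: 0
      ```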

      I did a `git bisect` and found that the problem starts with https://github.com/apache/spark/pull/39950, but I haven't yet understood why. It's possible that this change is fine but reveals a problem elsewhere. I also noticed https://github.com/apache/spark/pull/44853, which appears to be a different implementation of the same thing, so maybe that could help.

      It's not a major problem by itself, but another symptom appears to be that parquet scan tasks fail at a rate of approximately 0.03%, with errors like those in the attached `example_task_errors.txt`. If I revert https://github.com/apache/spark/pull/39950, I get exactly 0 task failures on the same test.

       

      The problem seems to be somewhat dependent on how the parquet files happen to be organised on blob storage, so I don't yet have a reproduction I can share that doesn't depend on private data.
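
      For anyone trying to build an independent reproduction, data of the shape we see the problem on can be generated with something like the following (a hypothetical generator, not the attached notebook; the file count, rows per file, and row-group size are arbitrary knobs to vary, since the on-storage layout seems to matter):

      ```python
      # Hypothetical generator for array-of-double parquet files, the column
      # shape on which most of the corruption was observed. Layout knobs
      # (file count, rows per file, row-group size) are arbitrary.
      import os

      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq

      os.makedirs("/tmp/repro", exist_ok=True)
      rng = np.random.default_rng(42)
      for i in range(100):
          arrays = [rng.normal(size=rng.integers(1, 50)).tolist()
                    for _ in range(10_000)]
          table = pa.table({"values": pa.array(arrays, type=pa.list_(pa.float64()))})
          pq.write_table(table, f"/tmp/repro/part-{i:04d}.parquet",
                         row_group_size=1_000)
      ```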

      I tested on a pre-release build of 4.0.0 and the problem was still present.

      Attachments

        1. example_task_errors.txt (5 kB, Thomas Newton)
        2. sql_query_plan.png (23 kB, Thomas Newton)
        3. job_dag.png (11 kB, Thomas Newton)
        4. reproduce_spark-48950.py (2 kB, Thomas Newton)
        5. generate_data_to_reproduce_spark-48950.ipynb (5 kB, Thomas Newton)
        6. corrupt_data_examples.zip (3.19 MB, Thomas Newton)


            People

              Assignee: Unassigned
              Reporter: Thomas Newton
              Votes: 0
              Watchers: 5
