Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
Description
I found that when storing some boolean data in the cache, the results can change when the data is read back out.
It needs to be a non-trivial amount of data, and it is highly dependent on the order of the data. If I disable compression in the cache, the issue goes away. I was able to make this happen in 3.0.0. I am going to try to reproduce it in other versions too.
I'll attach the parquet file with boolean data in an order that causes this to happen. As you can see below, after the data is cached a single null value switches over to false.
scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order: org.apache.spark.sql.DataFrame = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null| 7153|
| true|54334|
|false|54021|
+-----+-----+

scala> bad_order.cache()
res1: bad_order.type = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null| 7152|
| true|54334|
|false|54022|
+-----+-----+
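For reference, the workaround mentioned above (disabling compression in the in-memory columnar cache) is controlled by the spark.sql.inMemoryColumnarStorage.compressed configuration. The sketch below is a minimal, self-contained version of the check in the transcript; the object name and the before/after assertion are illustrative, and it assumes the attached bad_order.snappy.parquet is in the working directory.

import org.apache.spark.sql.SparkSession

object CachedBooleanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-boolean-check")
      .master("local[*]")
      // Workaround: disabling cache compression avoids the corruption,
      // at the cost of a larger in-memory footprint. Remove this line
      // to reproduce the mismatch on affected versions.
      .config("spark.sql.inMemoryColumnarStorage.compressed", "false")
      .getOrCreate()

    val df = spark.read.parquet("./bad_order.snappy.parquet")

    // Per-value counts before caching.
    val before = df.groupBy("b").count().collect().toSet

    // Cache and force materialization.
    df.cache()
    df.count()

    // Per-value counts once the data is served from the cache.
    val after = df.groupBy("b").count().collect().toSet

    // With compression disabled these should match; with the default
    // (compressed) cache they diverge on the versions listed above.
    assert(before == after, s"cached counts diverged: $before vs $after")

    spark.stop()
  }
}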
Attachments
- bad_order.snappy.parquet
Issue Links
- is caused by: SPARK-20783 Enhance ColumnVector to support compressed representation (Resolved)