Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Fix Version/s: 3.4.1, 3.5.0, 4.0.0
Description
We have encountered a shuffle data corruption issue:
```
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:543)
at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450)
at org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497)
at org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356)
```
Spark shuffle can detect corruption for some stream operations, such as `read` and `skip`: an `IOException` raised there is rethrown as a `FetchFailedException`, which causes the failed shuffle task to be retried. However, the `available` call in the stack trace above is not covered by this mechanism, so no retry happened and the Spark application simply failed.
Since `available` can also trigger data decompression (as in `SnappyInputStream.hasNextChunk` above), it should be guarded the same way `read` and `skip` are, as sketched below.
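A minimal, self-contained sketch of the wrap-and-rethrow pattern this fix relies on. The names here (`CorruptionDetectingInputStream`, `FetchFailedLikeException`, `rethrowAsFetchFailed`) are illustrative stand-ins, not Spark's actual internals; in Spark the wrapper is `BufferReleasingInputStream` and the rethrown exception carries block/map/address metadata used to schedule the retry.
```scala
import java.io.{IOException, InputStream}

// Hypothetical stand-in for Spark's FetchFailedException.
class FetchFailedLikeException(cause: IOException)
  extends RuntimeException("Corruption detected while reading shuffle data", cause)

// Sketch of a corruption-detecting wrapper: every operation that may
// decompress data is routed through the same rethrow helper,
// including available().
class CorruptionDetectingInputStream(delegate: InputStream) extends InputStream {

  private def rethrowAsFetchFailed[T](op: => T): T =
    try op catch {
      // An IOException here typically means the compressed stream is corrupt,
      // so surface it as a retryable fetch failure rather than a plain IO error.
      case e: IOException => throw new FetchFailedLikeException(e)
    }

  override def read(): Int = rethrowAsFetchFailed(delegate.read())

  override def read(b: Array[Byte], off: Int, len: Int): Int =
    rethrowAsFetchFailed(delegate.read(b, off, len))

  override def skip(n: Long): Long = rethrowAsFetchFailed(delegate.skip(n))

  // The gap described above: available() on a SnappyInputStream can pull in
  // and decompress the next chunk, so it must be guarded like read/skip.
  override def available(): Int = rethrowAsFetchFailed(delegate.available())

  override def close(): Unit = delegate.close()
}
```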