[SPARK-5142] Possibly data may be ruined in Spark Streaming's WAL mechanism. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: DStreams
Labels:
- bulk-closed

Description

Currently, when a write fails, Spark Streaming's WAL manager retries writing the data to HDFS several times. Because there is no transactional guarantee, data partially written by a failed attempt is not rolled back, and the retried data is appended after it. This corrupts the file and causes WriteAheadLogReader to fail when reading it.
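To make the failure mode concrete, here is a minimal in-memory sketch, not Spark's actual code. It assumes a length-prefixed record layout (a 4-byte big-endian length followed by the payload); the helper names `encode`, `read_records`, and `corrupted_log` are hypothetical and exist only for this illustration.

```python
import io
import struct


def encode(payload: bytes) -> bytes:
    """Serialize one record as a 4-byte big-endian length plus the payload (assumed layout)."""
    return struct.pack(">I", len(payload)) + payload


def read_records(buf: bytes) -> list:
    """Read records sequentially; raise EOFError on a truncated header or payload."""
    records = []
    stream = io.BytesIO(buf)
    while True:
        header = stream.read(4)
        if not header:
            return records
        if len(header) < 4:
            raise EOFError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated payload")
        records.append(payload)


def corrupted_log() -> bytes:
    """Simulate: one good record, a partial write that is never rolled back, then a retry."""
    log = bytearray(encode(b"first"))
    record = encode(b"second-record")
    log += record[:7]  # failed attempt: only 7 bytes reached the file
    log += record      # retry: the full record is appended after the partial bytes
    return bytes(log)


if __name__ == "__main__":
    try:
        read_records(corrupted_log())
    except EOFError as exc:
        print("reader failed:", exc)
```

In this sketch the reader takes the length header from the partial write, consumes bytes that straddle the partial and retried copies, and then interprets payload bytes as the next length header, so the read fails even though the retry itself succeeded.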

First, I think this problem is hard to fix because HDFS does not support a truncate operation (HDFS-3107) or random writes at a specific offset.

Second, I think that if we hit such a write exception, it is better not to retry: retrying will corrupt the file and make subsequent reads fail.
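The no-retry idea can be sketched as follows. This is an illustration under stated assumptions, not a proposed patch: records are assumed to be length-prefixed (4-byte big-endian length, then payload), and `FlakySink`, `write_without_retry`, and `read_valid_prefix` are hypothetical names. The reader that tolerates a truncated tail is also an assumption added here to show why stopping at the first failure keeps earlier records readable.

```python
import struct


class FlakySink:
    """In-memory stand-in for an append-only file that fails once, mid-write."""

    def __init__(self, fail_on_call: int, partial_bytes: int):
        self.data = bytearray()
        self.calls = 0
        self.fail_on_call = fail_on_call
        self.partial_bytes = partial_bytes

    def append(self, chunk: bytes) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            # Only part of the chunk reaches the file before the error.
            self.data += chunk[: self.partial_bytes]
            raise IOError("simulated partial write")
        self.data += chunk


def write_without_retry(sink: FlakySink, payloads: list) -> int:
    """Append length-prefixed records; stop at the first failure instead of retrying."""
    written = 0
    for payload in payloads:
        try:
            sink.append(struct.pack(">I", len(payload)) + payload)
        except IOError:
            break  # abandon this file; remaining records would need a fresh file
        written += 1
    return written


def read_valid_prefix(buf) -> list:
    """Return every complete record, stopping at a truncated tail."""
    records, pos = [], 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        end = pos + 4 + length
        if end > len(buf):
            break  # incomplete final record: ignore it
        records.append(bytes(buf[pos + 4 : end]))
        pos = end
    return records


if __name__ == "__main__":
    sink = FlakySink(fail_on_call=2, partial_bytes=7)
    count = write_without_retry(sink, [b"first", b"second-record", b"third"])
    print(count, read_valid_prefix(sink.data))
```

Because nothing is appended after the failed write, the only damage is an incomplete record at the very end of the file, which a reader can detect by comparing the declared length with the bytes remaining. That check is not reliable once a retry has appended more data after the partial bytes.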

Sorry if I have misunderstood anything.

Issue Links

relates to

SPARK-6222 [STREAMING] All data may not be recovered from WAL when driver is killed

Resolved

People

Assignee: Unassigned

Reporter: Saisai Shao

Votes: 0

Watchers: 3

Dates

Created: 08/Jan/15 10:00

Updated: 21/May/19 05:37

Resolved: 21/May/19 05:37