Details
Description
There is a race condition introduced in SPARK-11141 which could cause data loss.
This affects all versions since 1.6.0.
Problematic situation:
- Start streaming job with 2 receivers with WAL enabled.
- Receiver 1 receives a block and does the following
-
- Writes a BlockAdditionEvent into WAL
- Puts the block into it's received block queue with ID 1
- Receiver 2 receives a block and does the following
-
- Writes a BlockAdditionEvent into WAL
- Spark allocates all blocks from it's received block queue and writes AllocatedBlocks(IDs=(1)) into WAL
- Driver crashes
- New Driver recovers from WAL
- Realise block with ID 2 never processed