Details
- Type: Bug
- Status: Resolved
- Priority: Normal
- Resolution: Fixed
- Fix Version/s: None
- Bug Category: Correctness - Unrecoverable Corruption / Loss
- Complexity: Normal
- Severity: Normal
- Discovered By: User Report
- Platform: All
- Since Version: None
Description
The bulk reader relies on a time provider that returns the current time during a read, and uses that time to guide compaction and validation.
Because the current time varies across Spark executors, rows/cells can be expired inconsistently. A second issue is that the post-compaction validation asserting no expired rows/cells remain can fail, since rows/cells may expire while the read is in progress; a read can take minutes or even hours.
This can lead to data being incorrectly omitted and to job failures.
The fix is to use a constant reference time that is decided by the Spark driver and distributed to all executors; that same reference time is then used for compaction and validation.
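A minimal sketch of the fix's shape, assuming hypothetical names (`TimeProvider`, `FixedTimeProvider`, `ExpirationCheck` are illustrative, not the project's actual API): the driver captures one reference instant, and because the provider is serializable it ships to every executor, so all tasks judge TTL expiration against the same instant regardless of how long the read takes.

```java
import java.io.Serializable;

// Hypothetical sketch: a time provider frozen at a reference instant chosen
// once on the Spark driver, then serialized out to executors.
interface TimeProvider extends Serializable {
    long referenceEpochMillis();
}

final class FixedTimeProvider implements TimeProvider {
    private final long referenceEpochMillis;

    FixedTimeProvider(long referenceEpochMillis) {
        this.referenceEpochMillis = referenceEpochMillis;
    }

    // Called once on the driver, before any executor work starts.
    static FixedTimeProvider now() {
        return new FixedTimeProvider(System.currentTimeMillis());
    }

    @Override
    public long referenceEpochMillis() {
        return referenceEpochMillis;
    }
}

final class ExpirationCheck {
    // A cell counts as expired iff its expiration time is at or before the
    // shared reference time; every executor reaches the same verdict, and
    // post-compaction validation checks against the same instant.
    static boolean isExpired(long cellExpirationEpochMillis, TimeProvider time) {
        return cellExpirationEpochMillis <= time.referenceEpochMillis();
    }
}
```

Using `System.currentTimeMillis()` (or any per-call clock) inside `isExpired` on the executors is exactly the bug described above: the verdict would then depend on which executor evaluates the cell and when.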
Attachments
Issue Links
- links to