Details
- Type: Bug
- Status: Resolved
- Priority: Normal
- Resolution: Fixed
- Fix Version/s: None
- Bug Category: Correctness - Unrecoverable Corruption / Loss
- Complexity: Normal
- Severity: Normal
- Discovered By: User Report
- Platform: All
- Since Version: None
Description
The bulk reader relies on a time provider that returns the current time during a read, and uses that time to guide compaction and validation.
Because the current time varies across Spark executors, rows/cells can be expired inconsistently. A second issue is that the post-compaction validation asserting no expired rows/cells remain can fail, since rows/cells may expire while the read is in progress; a read can take minutes or even hours.
This can lead to data being incorrectly omitted and to job failures.
The fix is to use a constant reference time that is decided by the Spark driver and distributed to all executors; that same reference time is then used for compaction and validation.
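A minimal sketch of the fix's shape, assuming hypothetical names (`TimeProvider`, `FixedTimeProvider`, `ExpirationCheck` are illustrative, not the project's actual API): the driver captures one reference instant, and because the provider is serializable it ships to every executor, so all tasks judge TTL expiration against the same instant regardless of how long the read takes.

```java
import java.io.Serializable;

// Hypothetical sketch: a time provider frozen at a reference instant chosen
// once on the Spark driver, then serialized out to executors.
interface TimeProvider extends Serializable {
    long referenceEpochMillis();
}

final class FixedTimeProvider implements TimeProvider {
    private final long referenceEpochMillis;

    FixedTimeProvider(long referenceEpochMillis) {
        this.referenceEpochMillis = referenceEpochMillis;
    }

    // Called once on the driver, before any executor work starts.
    static FixedTimeProvider now() {
        return new FixedTimeProvider(System.currentTimeMillis());
    }

    @Override
    public long referenceEpochMillis() {
        return referenceEpochMillis;
    }
}

final class ExpirationCheck {
    // A cell counts as expired iff its expiration time is at or before the
    // shared reference time; every executor reaches the same verdict, and
    // post-compaction validation checks against the same instant.
    static boolean isExpired(long cellExpirationEpochMillis, TimeProvider time) {
        return cellExpirationEpochMillis <= time.referenceEpochMillis();
    }
}
```

Using `System.currentTimeMillis()` (or any per-call clock) inside `isExpired` on the executors is exactly the bug described above: the verdict would then depend on which executor evaluates the cell and when.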
Attachments
Issue Links
- links to