Details
-
Bug
-
Status: Resolved
-
P2
-
Resolution: Fixed
-
None
-
None
Description
This is similar to BEAM-7547. The issue and identified there was related to the state cache. However for UnboundedSource there is a separate ReaderCache. In particular the following sequence could lead to using a Reader at the wrong position.
1. Work arrives indicating to use reader at state C1
2. Work processes and advances reader to C2, commit is prepared with elements output from C1 to C2 and reader checkpoint for C2
3. Commit of original processing fails (perhaps routed to previous backend during autoscaling)
4. Work is retried (still indicating to use reader at C1 and still same cache token) however the ReaderCache is used, there is a hit, and the existing reader (positioned at C2) is used.
5. Retry of work processes and advances reader to C3, commit is scheduled with elements output from C2 to C3 and reader checkpoint for C3
6. Commit succeeds
At this point there was never a successful commit for the elements between C1 to C2, though the reader is now advanced past them.
Possible fixes:
1. Use increasing work token as in BEAM-7547 to detect retry and not use the ReaderCache entry
2. Seek recovered reader from cache to checkpoint or only use if it matches the checkpoint. This would probably involve changing the Checkpoint interface so 1 is likely preferred.
Attachments
Issue Links
- links to