FLINK-20654: Unaligned checkpoint recovery may lead to corrupted data stream


Details

    Description

      The fix for FLINK-20433 exposes potential data corruption after recovery in all variations of UnalignedCheckpointITCase.

      To reproduce, run UnalignedCheckpointITCase a couple hundred times. The issue showed up for me in the following cases, in order of decreasing frequency:

      • execute [Parallel union, p = 5]
      • execute [Parallel union, p = 10]
      • execute [Parallel cogroup, p = 5]
      • execute [parallel pipeline with remote channels, p = 5]

      The issue manifests in one of the following ways:

      • stream corrupted exception
      • EOF exception
      • assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
      • (for union) an ArithmeticException caused by overflow, because a number that should be in [0;100000] has been mis-deserialized
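
      To illustrate the last symptom, here is a minimal sketch of how a byte-misaligned read can turn a small value into one that fails a checked conversion. It is not taken from the Flink code base: the class MisalignedReadDemo, its methods, the fixed-width long encoding, and the use of Math.toIntExact are all assumptions made for illustration. The actual serializer and the actual arithmetic in the test may differ.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; none of these names come from Flink.
final class MisalignedReadDemo {

    // Writes each value as a fixed-width 8-byte big-endian long (assumed encoding).
    static ByteBuffer serialize(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        buf.flip();
        return buf;
    }

    // Absolute read of 8 bytes starting at the given byte offset.
    static long readLongAt(ByteBuffer buf, int offset) {
        return buf.getLong(offset);
    }

    // Returns true if a checked narrowing of the value to int throws.
    static boolean overflowsInt(long value) {
        try {
            Math.toIntExact(value);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = serialize(100_000L, 100_000L);
        long aligned = readLongAt(buf, 0); // read starts on a record boundary
        long shifted = readLongAt(buf, 4); // read starts 4 bytes into a record
        System.out.println("aligned=" + aligned + " overflows=" + overflowsInt(aligned));
        System.out.println("shifted=" + shifted + " overflows=" + overflowsInt(shifted));
    }
}
```

      In this sketch, a read that starts 4 bytes late combines the low half of one record with the high half of the next, so a value within [0;100000] comes back shifted left by 32 bits. A stream that resumes at the wrong offset after recovery could plausibly produce the same kind of result.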


          People

            pnowojski Piotr Nowojski
            arvid Arvid Heise
            Votes: 0
            Watchers: 17
