Flink / FLINK-12858

Potential distributed deadlock in case of synchronous savepoint failure


Details

    Description

      The current implementation of stop-with-savepoint (FLINK-11458) blocks the thread that runs StreamTask.performCheckpoint() on syncSavepointLatch. For non-source tasks, this thread is expected to be the task's main thread (stop-with-savepoint deliberately stops all activity in the task's main thread).

      The thread is released only when the task is cancelled or when the corresponding checkpoint is acknowledged.

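      It is possible that other downstream tasks of the same Flink job "soft"-fail the checkpoint/savepoint for various reasons (for example, because the max buffered bytes limit is exceeded in BarrierBuffer.checkSizeLimit()). In that case, the checkpoint abort is reported to the JM. However, it looks like the checkpoint coordinator handles such an abort as usual and assumes that the Flink job continues running. The tasks that are already blocked on syncSavepointLatch would then be neither cancelled nor acknowledged, so they would stay blocked, which is the potential distributed deadlock.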
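
      The scenario described above can be illustrated with a minimal, self-contained Java sketch. It does not use Flink's code: SavepointLatch, runScenario and the abortReleasesLatch flag are hypothetical names introduced only for this illustration, and the timeout exists only so that the demo terminates (the wait described in this issue has no timeout). The first run models an abort that is handled "as usual" and never reaches the blocked task; the second run models one conceivable remedy, releasing the latch on abort, which is an assumption and not necessarily the fix that was adopted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SyncSavepointDeadlockSketch {

    /** Hypothetical stand-in for the latch a task blocks on; not Flink's actual class. */
    static final class SavepointLatch {
        private final CountDownLatch released = new CountDownLatch(1);

        // Models the two release paths named in the issue: acknowledgement or cancellation.
        void release() {
            released.countDown();
        }

        // Blocks the calling thread; returns false if nobody released it in time.
        // The timeout is a demo-only assumption so the example can finish.
        boolean awaitRelease(long timeoutMillis) throws InterruptedException {
            return released.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    /**
     * Simulates one task whose main thread blocks after the synchronous savepoint
     * while another task declines the same checkpoint.
     *
     * @param abortReleasesLatch whether the abort is propagated to the blocked task (hypothetical)
     * @return true if the blocked task thread was released before the timeout
     */
    static boolean runScenario(boolean abortReleasesLatch, long timeoutMillis)
            throws InterruptedException {
        SavepointLatch latch = new SavepointLatch();
        boolean[] unblocked = {false}; // written by the task thread, read after join()

        Thread taskThread = new Thread(() -> {
            try {
                unblocked[0] = latch.awaitRelease(timeoutMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        taskThread.start();

        // Another task "soft"-fails the checkpoint. If the abort is treated like any
        // other declined checkpoint, nothing releases the blocked task.
        if (abortReleasesLatch) {
            latch.release();
        }

        taskThread.join();
        return unblocked[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("abort ignored: released=" + runScenario(false, 200));
        System.out.println("abort propagated: released=" + runScenario(true, 200));
    }
}
```

      Without the demo timeout, the first run would wait indefinitely, which corresponds to the blocked main thread described in this issue.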

            People

              1u0 Alex

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h