Flink / FLINK-10753

Propagate and log snapshotting exceptions

    Details

      Description

      On failure, AbstractStreamOperator.snapshotState throws a new exception with the message "Could not complete snapshot {} for operator {}." and the original exception as its cause.

      While handling the error, the CheckpointCoordinator.discardCheckpoint method logs only this propagated message, not the original cause of the exception.
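
The difference between logging only the message and logging the throwable itself can be illustrated with a minimal, JDK-only sketch. The class and method names below are hypothetical stand-ins, not Flink code, and the checkpoint id, operator name, and IOException are made-up sample values.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for illustration only; this is not Flink's implementation.
class SnapshotCauseSketch {

    // Mimics the wrapping described above: a new exception whose cause is the original failure.
    static Exception wrap(long checkpointId, String operatorName, Throwable original) {
        return new Exception(
                "Could not complete snapshot " + checkpointId
                        + " for operator " + operatorName + ".",
                original);
    }

    // What ends up in the log when only getMessage() is logged: the cause is lost.
    static String messageOnly(Throwable t) {
        return t.getMessage();
    }

    // What ends up in the log when the throwable itself is logged:
    // the full stack trace, including the "Caused by:" section.
    static String withCause(Throwable t) {
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    public static void main(String[] args) {
        Exception wrapped = wrap(42L, "MyOperator", new IOException("disk full"));
        System.out.println(messageOnly(wrapped));
        System.out.println(withCause(wrapped));
    }
}
```

With most logging facades, the second behaviour is obtained by passing the exception object as the last argument of the log call rather than concatenating its message.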

      In addition, pendingCheckpoint.abortDeclined(), called from discardCheckpoint, reports the failed checkpoint with the misleading message "Checkpoint was declined (tasks not ready)". This is the message that is displayed in the UI (see attached).

      Proposal:

      1. Log the exception at the TaskManager (.snapshotState)
      2. Log the cause itself, instead of cause.getMessage(), at the JobManager (.discardCheckpoint)
      3. Pass the root cause to abortDeclined and propagate a more appropriate message to the UI.
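
Item 3 could look roughly like the following sketch, which derives a user-facing decline message from the root cause of a failure. All names here (DeclineReasonSketch, rootCause, declineMessage) and the message wording are assumptions made for illustration; they are not Flink's API or the fix that was eventually implemented.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical sketch for illustration only; names and wording are not taken from Flink.
class DeclineReasonSketch {

    // Walks the cause chain to the innermost throwable, stopping if a cycle is detected.
    static Throwable rootCause(Throwable t) {
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    // Builds the text that would be shown to the user; falls back to a generic
    // message when no failure cause is available.
    static String declineMessage(Throwable failure) {
        if (failure == null) {
            return "Checkpoint was declined.";
        }
        Throwable root = rootCause(failure);
        return "Checkpoint was declined: "
                + root.getClass().getSimpleName() + ": " + root.getMessage();
    }

    public static void main(String[] args) {
        Exception wrapped = new Exception(
                "Could not complete snapshot 42 for operator MyOperator.",
                new java.io.IOException("disk full"));
        System.out.println(declineMessage(wrapped));
    }
}
```

Reporting the root cause rather than the outermost wrapper keeps the UI message short while still naming the actual failure; the full chain would remain available in the logs.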

        Attachments

        1. Screen Shot 2018-11-01 at 16.27.01.png
          27 kB
          Alexander Fedulov

              People

              • Assignee:
                srichter Stefan Richter
                Reporter:
                afedulov Alexander Fedulov
              • Votes:
                0
                Watchers:
                6
