Flink / FLINK-6042

Display last n exceptions/causes for job restarts in Web UI

Details

      Flink now exposes the exception history through the REST API and the UI. The number of most recently handled exceptions to track can be configured through `web.exception-history-size`. Some values of the exception history's REST API JSON response are deprecated as part of this effort.

    Description

      Users have asked to see the last n exceptions that caused a job restart in the Web UI. This would make it easier to debug and operate a job.

      We could store the root causes of failures, similar to how prior executions are stored in the ExecutionVertex using the EvictingBoundedList, and then serve this information via the Web UI.
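
The storage idea above can be sketched as a small bounded buffer that keeps only the most recent root causes and evicts the oldest one when full. This is a minimal illustration, not Flink's `EvictingBoundedList`: the class name `BoundedFailureHistory` and its methods are hypothetical, and the capacity stands in for whatever limit the job is configured with.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; this is NOT Flink's EvictingBoundedList.
final class BoundedFailureHistory {
    private final int maxSize;
    private final Deque<Throwable> rootCauses = new ArrayDeque<>();

    BoundedFailureHistory(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    // Records the root cause of a restart, evicting the oldest entry when full.
    void add(Throwable rootCause) {
        if (rootCauses.size() == maxSize) {
            rootCauses.removeFirst();
        }
        rootCauses.addLast(rootCause);
    }

    // Returns a copy of the history, ordered from oldest to newest.
    List<Throwable> snapshot() {
        return new ArrayList<>(rootCauses);
    }
}
```

With a capacity of n, a handler serving the Web UI would only ever see the last n root causes, so memory use stays bounded no matter how often the job restarts.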

      -- Update: January 21, 2021 --

      The UI can already handle multiple exceptions through the Exception History. Right now, we list the one or more exceptions that caused the job to fail. Instead, we could adapt it so that the history contains one expandable entry per restart rather than only the exceptions of the most recent failure. If more than one exception is connected to a single restart, we would list their stack traces within one expandable entry.
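
One possible shape for such a per-restart entry is sketched below. The class `RestartEntry`, its fields, and the `summary()` format are all assumptions made for illustration; they do not reflect Flink's actual REST response or UI model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model: one history entry per restart, holding every
// stack trace connected to that restart.
final class RestartEntry {
    private final long timestampMillis;
    private final List<String> stackTraces;

    RestartEntry(long timestampMillis, List<String> stackTraces) {
        this.timestampMillis = timestampMillis;
        // Defensive copy so later changes by the caller do not alter the entry.
        this.stackTraces = Collections.unmodifiableList(new ArrayList<>(stackTraces));
    }

    long timestampMillis() {
        return timestampMillis;
    }

    // All stack traces to show when the entry is expanded.
    List<String> stackTraces() {
        return stackTraces;
    }

    // Collapsed label: first line of the first stack trace, plus a count of the rest.
    String summary() {
        if (stackTraces.isEmpty()) {
            return "<no exception recorded>";
        }
        String first = firstLine(stackTraces.get(0));
        int others = stackTraces.size() - 1;
        return others == 0 ? first : first + " (+" + others + " more)";
    }

    private static String firstLine(String stackTrace) {
        int newline = stackTrace.indexOf('\n');
        return newline < 0 ? stackTrace : stackTrace.substring(0, newline);
    }
}
```

Grouping by restart keeps the collapsed list short (one row per restart), while the expanded view still exposes every stack trace that belongs to it.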

            People

              mapohl Matthias Pohl
              trohrmann Till Rohrmann
