FLINK-33121

Failed precondition in JobExceptionsHandler due to concurrent global failures


Details

    Description

      We assume that global failures (those with a null task name) can only be root exceptions, and that only local/task exceptions can be part of the concurrent exceptions list (see JobExceptionsHandler#createRootExceptionInfo).
      However, when the Adaptive scheduler is in the Restarting phase because of an existing failure (which is now the new root), it can still, on rare occasions, capture new global failures. This violates the assumption, and assertLocalExceptionInfo fails its precondition check with a message like:

      The taskName must not be null for a non-global failure.  
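
To make the violated assumption concrete, here is a minimal sketch of the invariant in plain Java. The class `ExceptionHistorySketch`, its `Entry` type, and the `describe` method are hypothetical stand-ins invented for illustration; they are not Flink's actual classes, and only the error message is taken from this issue.

```java
import java.util.List;

/** Simplified, hypothetical model of the invariant; not Flink's actual classes. */
final class ExceptionHistorySketch {

    /** One failure in the history; a null taskName marks a global failure. */
    static final class Entry {
        final String message;
        final String taskName;

        Entry(String message, String taskName) {
            this.message = message;
            this.taskName = taskName;
        }
    }

    /** Rejects global failures, mirroring the check applied to concurrent entries. */
    static void assertLocal(Entry entry) {
        if (entry.taskName == null) {
            throw new IllegalArgumentException(
                    "The taskName must not be null for a non-global failure.");
        }
    }

    /** Summarizes a root failure and its concurrent failures, enforcing the invariant. */
    static String describe(Entry root, List<Entry> concurrent) {
        for (Entry entry : concurrent) {
            assertLocal(entry);
        }
        return root.message + " (+" + concurrent.size() + " concurrent)";
    }
}
```

In this model, a second global failure that ends up in the concurrent list makes `describe` throw, which corresponds to the failed precondition reported here.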

      We want to ignore global failures while the Adaptive scheduler is in the Restarting phase, until FLINK-34922 (https://issues.apache.org/jira/browse/FLINK-34922) adds proper support for multiple global failures in the exception history.
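
The proposed behavior could look roughly like the sketch below. It is only an illustration under stated assumptions: `RestartingFailureFilterSketch`, its `State` enum, and `onFailure` are hypothetical names, and the real Adaptive scheduler state machine is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: drop global failures that arrive while restarting. */
final class RestartingFailureFilterSketch {

    enum State { EXECUTING, RESTARTING }

    private State state = State.EXECUTING;
    private String rootFailure; // the failure that triggered the restart
    private final List<String> concurrentLocalFailures = new ArrayList<>();

    /** Records a failure; a null taskName marks a global failure. */
    void onFailure(String message, String taskName) {
        boolean global = taskName == null;
        if (state == State.RESTARTING) {
            if (global) {
                return; // ignored: the history cannot hold a second global failure yet
            }
            concurrentLocalFailures.add(taskName + ": " + message);
            return;
        }
        // The first failure becomes the root and moves the scheduler to RESTARTING.
        rootFailure = message;
        state = State.RESTARTING;
    }

    String rootFailure() {
        return rootFailure;
    }

    List<String> concurrentLocalFailures() {
        return concurrentLocalFailures;
    }
}
```

With this filter, the concurrent list only ever contains task-local failures, so the null task name check on concurrent entries cannot fail.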

      Note: DefaultScheduler does not suffer from this issue because it treats failures directly as HistoryEntries (no conversion step).

            People

              pgaref Panagiotis Garefalakis