Flink / FLINK-22506

YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.11.3
    • Fix Version/s: None
    • Component/s: Deployment / YARN
    • Labels: None

    Description

      If a non-retryable error occurs while the JobManager is being initialized (e.g. the savepoint is corrupted or inaccessible), the job cluster exits with an error code. However, because it does not mark the application attempt as failed, the exit is not counted as a failed attempt, and YARN keeps retrying indefinitely.
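      The retry behaviour described above can be illustrated with a toy model. This is only a sketch of the accounting problem, not YARN's or Flink's actual code: the class `AttemptAccounting`, the method `simulate`, and the `safetyCap` parameter are all hypothetical names introduced for illustration. It assumes a resource manager that stops restarting an application only once the number of counted failures reaches a configured maximum.

```java
// Toy model of attempt accounting. All names are hypothetical and exist only
// for illustration; this is not how YARN or Flink implement retries.
class AttemptAccounting {

    /**
     * Simulates a resource manager that relaunches an application until the
     * number of COUNTED failures reaches maxAttempts.
     *
     * @param maxAttempts      maximum number of counted failed attempts
     * @param failureIsCounted whether a failing attempt is recorded as failed
     * @param safetyCap        upper bound on launches so the simulation terminates
     * @return the number of attempts that were launched
     */
    static int simulate(int maxAttempts, boolean failureIsCounted, int safetyCap) {
        int launched = 0;
        int countedFailures = 0;
        while (countedFailures < maxAttempts && launched < safetyCap) {
            launched++;
            // Every attempt hits the same non-retryable error
            // (e.g. a corrupted savepoint) and exits.
            if (failureIsCounted) {
                countedFailures++;
            }
        }
        return launched;
    }

    public static void main(String[] args) {
        // Failures are counted: the retry loop stops after maxAttempts launches.
        System.out.println(simulate(2, true, 1000));
        // Failures are not counted: only the safety cap stops the loop.
        System.out.println(simulate(2, false, 1000));
    }
}
```

      In this model, the second call is bounded only by the artificial `safetyCap`, which corresponds to the "retrying forever" symptom reported here.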

      Attachments

        1. corrupted_savepoint.log
          89 kB
          Paul Lin
        2. yarn application attempts.png
          400 kB
          Paul Lin


          People

            Assignee: Unassigned
            Reporter: Paul Lin
            Votes: 0
            Watchers: 4
