Flink / FLINK-9030

JobManager fails to archive job to FS when TM is lost

    Description

      We are running Flink on Mesos and are finding that when a job fails because a TaskManager is lost (after an OOM kill), the job is not archived properly to the History Server directory on the filesystem.

      When this happens, the job does appear in the finished-jobs listing of the JobManager's in-memory archive and is accessible through the running JobManager's REST API, but it is not available through the History Server's REST API.
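
      A minimal sketch of how this discrepancy could be checked, assuming both the JobManager and the History Server expose a `/jobs/overview` endpoint that returns a JSON object with a `jobs` list whose entries carry `jid` and `state` fields (this may differ between Flink versions). The helper names and the base URLs passed to `fetch_overview` are hypothetical.

```python
import json
from urllib.request import urlopen

# Job states treated as terminal for this comparison (an assumption of this sketch).
TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELED"}


def finished_job_ids(overview: dict) -> set:
    """Collect IDs of terminal-state jobs from a /jobs/overview-style payload."""
    return {
        job["jid"]
        for job in overview.get("jobs", [])
        if job.get("state") in TERMINAL_STATES
    }


def missing_from_history(jm_overview: dict, hs_overview: dict) -> set:
    """Jobs the JobManager reports as done that the History Server does not list."""
    return finished_job_ids(jm_overview) - finished_job_ids(hs_overview)


def fetch_overview(base_url: str) -> dict:
    """Fetch the job overview from a REST endpoint; base_url is supplied by the caller."""
    with urlopen(base_url.rstrip("/") + "/jobs/overview") as resp:
        return json.load(resp)
```

      Calling `missing_from_history(fetch_overview(jm_url), fetch_overview(hs_url))` would then list the job IDs that were never archived, where `jm_url` and `hs_url` stand for the two services' REST addresses.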

      This is causing problems for us because we use the History Server as the system of record for canceled or failed jobs, in order to determine their previous savepoints or externalized checkpoints.

          People

            Assignee: Unassigned
            Reporter: Jared Stehler (jstehler)
            Votes: 0
            Watchers: 2