Flink / FLINK-10133

Finished job's JobGraph never cleaned up in ZooKeeper for standalone clusters (HA mode with multiple masters)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Versions: 1.5.0, 1.5.2, 1.6.0
    • Fix Version: 1.6.1
    • Component: Runtime / Coordination
    • Labels: None

    Description

      Hi,

      We have 3 servers in our test environment, denoted node1-3. The setup is as follows:

      • Hadoop HDFS: node1 as namenode, node2 and node3 as datanodes
      • ZooKeeper: node1-3 as a quorum (but we also tried node1 alone)
      • Flink: node1 and node2 as masters, node2 and node3 as slaves
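
      For reference, a minimal HA configuration matching the setup above might look like the following flink-conf.yaml fragment. The hostnames, ports, and storage path here are assumptions based on our node layout, not the exact values from our cluster:

      ```yaml
      # ZooKeeper-based HA; node1 and node2 are also listed in conf/masters
      high-availability: zookeeper
      high-availability.zookeeper.quorum: node1:2181,node2:2181,node3:2181
      high-availability.zookeeper.path.root: /flink
      high-availability.cluster-id: /default
      # Blob data and recovery metadata are stored here
      high-availability.storageDir: hdfs://node1:9000/flink/ha/
      ```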

      As we understand it, when a job finishes, the corresponding job's blob data is expected to be deleted from the HDFS path, and the node under ZooKeeper's path `/{zk path root}/{cluster-id}/jobgraphs/{job id}` should be deleted afterwards. However, we observe that whenever we submit a job and it finishes (via `bin/flink run WordCount.jar`), the blob data is gone, whereas the job id node under ZooKeeper is still there, with a UUID-style lock node inside it. In ZooKeeper's debug log we observed messages like "cannot be deleted because non empty". Because of this, as long as a finished job's jobgraph node persists, restarting the cluster or killing one job manager (to test HA mode) makes Flink try to recover the finished job; it cannot find the blob data in HDFS, and the whole cluster goes down.
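
      Concretely, inspecting ZooKeeper with zkCli.sh after a job finishes shows the leftover node. The session below is illustrative only; the paths assume the default `/flink` root and `/default` cluster-id, and the job id and lock-node name are made-up examples:

      ```
      # zkCli.sh -server node1:2181
      ls /flink/default/jobgraphs
      # expected after a finished job: []   observed: [<job id>]
      ls /flink/default/jobgraphs/<job id>
      # observed: a UUID-style lock node still present inside
      ```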

      If we use only node1 as master and node2 and node3 as slaves, the jobgraph nodes are deleted successfully. When the jobgraphs path is clean, killing one job manager makes a standby JM become leader, so it is only this jobgraphs issue that prevents HA from working.

      I'm not sure whether something is wrong with our configuration, because this happens every time a job finishes (we have only tested with WordCount.jar, though). I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens every time, rendering HA mode unusable for us.

      Any idea what might cause this?

      Attachments

        1. zookeeper.log (264 kB, Xiangyu Zhu)
        2. standalonesession.log (255 kB, Xiangyu Zhu)
        3. namenode.log (41 kB, Xiangyu Zhu)
        4. client.log (52 kB, Xiangyu Zhu)

        Activity


          People

            Assignee: Unassigned
            Reporter: Xiangyu Zhu (Frefreak)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
