Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-15007

Flink on YARN does not request required TaskExecutors in some cases

    XMLWordPrintableJSON

Details

    Description

      This was discovered while debugging FLINK-14834. In some cases Flink does not request new TaskExecutors even though new slots are requested. You can see this in some of the logs attached to FLINK-14834.

      You can reproduce this using https://github.com/aljoscha/docker-hadoop-cluster to bring up a YARN cluster and then running a compiled Flink in there.

      When you run

      bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
      

      the job waits and eventually fails because it does not have enough slots. (You can see in the log that 3 new slots are requested but only 2 TaskExecutors are requested.

      When you run

      bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
      

      runs successfully.

      This is the git bisect log that identifies the first faulty commit (https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4):

      git bisect start
      # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the class for multivariate Gaussian Distribution.
      git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
      # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] Disable flaky yarn_kerberos_docker (default input) test
      git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
      # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
      git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
      # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
      git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
      # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check managed memory fraction range when setting it into StreamConfig
      git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
      # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out slots creation from the TaskSlotTable constructor
      git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
      # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change Resource to use BigDecimal as its value
      git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
      # bad: [25f87ec208a642283e995811d809632129ca289a] [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid Gregorian cutover
      git bisect bad 25f87ec208a642283e995811d809632129ca289a
      # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add logging for loaded modules and functions
      git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
      # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary comments about memory size calculation before network init
      git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
      # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused number of slots in MemoryManager
      git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
      # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the scope of MemoryManager from TaskExecutor to slot
      git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
      # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the scope of MemoryManager from TaskExecutor to slot
      

      Attachments

        1. DualInputWordCount.jar
          10 kB
          Aljoscha Krettek

        Issue Links

          Activity

            People

              Unassigned Unassigned
              aljoscha Aljoscha Krettek
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: