  1. Flink
  2. FLINK-35035 Reduce job pause time when cluster resources are expanded in adaptive mode
  3. FLINK-36279

AdaptiveScheduler#hasDesiredResources doesn't rely on all available slots which causes problems in Executing state

Details

    Description

      FLINK-36014 aligned the triggering of the execution graph creation in the WaitingForResources state with the triggering of rescaling in the Executing state. Before that change, only WaitingForResources relied on AdaptiveScheduler#hasDesiredResources. Relying on free slots alone was good enough there because no slots are allocated yet in the WaitingForResources state.

      Using this method in the Executing state as well invalidates that premise: slots are already allocated to the running job while the availability check runs, and those slots would become available again after the restart. Hence, the slot availability check should also take the currently allocated slots into account. This does not break the premise for the WaitingForResources state, where no slots are allocated.
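
      The difference between the two checks can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual AdaptiveScheduler code: the class SlotAvailabilitySketch, its method names, and the modelling of slots as plain integer counts are all hypothetical.

```java
/**
 * Hypothetical sketch (not Flink code): slots are modelled as plain counts.
 */
final class SlotAvailabilitySketch {

    /** Check that only looks at free slots (sufficient while nothing is allocated). */
    static boolean hasDesiredResourcesFreeOnly(int desiredSlots, int freeSlots) {
        return freeSlots >= desiredSlots;
    }

    /**
     * Check that also counts slots currently held by the running job,
     * because they are released and become usable again after the restart.
     */
    static boolean hasDesiredResources(int desiredSlots, int freeSlots, int allocatedSlots) {
        return freeSlots + allocatedSlots >= desiredSlots;
    }

    public static void main(String[] args) {
        // Executing state: the job holds 2 slots, 2 more slots were just added,
        // and the desired parallelism needs 4 slots.
        System.out.println(hasDesiredResourcesFreeOnly(4, 2));
        System.out.println(hasDesiredResources(4, 2, 2));

        // WaitingForResources state: nothing is allocated, so both checks agree.
        System.out.println(hasDesiredResourcesFreeOnly(4, 4) == hasDesiredResources(4, 4, 0));
    }
}
```

      In this sketch the free-slots-only check never sees enough slots in the Executing state, which would match a rescale that is never triggered; with zero allocated slots the two checks are equivalent, so WaitingForResources is unaffected.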

      RescaleOnCheckpointITCase fails because of that issue:
      https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62105&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=11287

      Sep 13 17:16:55 "ForkJoinPool-1-worker-25" #28 daemon prio=5 os_prio=0 tid=0x00007f973f0c2800 nid=0x31a1 waiting on condition [0x00007f97089fc000]
      Sep 13 17:16:55    java.lang.Thread.State: TIMED_WAITING (sleeping)
      Sep 13 17:16:55 	at java.lang.Thread.sleep(Native Method)
      Sep 13 17:16:55 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:152)
      Sep 13 17:16:55 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
      Sep 13 17:16:55 	at org.apache.flink.test.scheduling.UpdateJobResourceRequirementsITCase.waitForRunningTasks(UpdateJobResourceRequirementsITCase.java:219)
      Sep 13 17:16:55 	at org.apache.flink.test.scheduling.RescaleOnCheckpointITCase.testRescaleOnCheckpoint(RescaleOnCheckpointITCase.java:139)
      Sep 13 17:16:55 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      Sep 13 17:16:55 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      [...]
      

      Attachments

        1. FLINK-36279.20240914.6.success.log
          54 kB
          Matthias Pohl
        2. FLINK-36279.fixed.success.log
          122 kB
          Matthias Pohl
        3. FLINK-36279-FLINK-36014-pr.success.log
          54 kB
          Matthias Pohl

            People

              mapohl Matthias Pohl