  1. Flink
  2. FLINK-34636

Requesting exclusive buffers timeout causes repeated restarts and cannot be automatically recovered

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.2
    • Fix Version/s: 1.14.0, 1.13.3
    • Component/s: Runtime / Network
    • Labels: None

    Description

      Logs and metrics show that a subtask deployed on the same TM repeatedly failed with a timeout while requesting exclusive buffers. During the restarts, the [Network] metric for that TM remained unchanged, while heap memory usage did change. I suspect that the network buffer memory was not properly released during the restart, so the newly deployed task could not obtain network buffers. The problem persisted across repeated restarts, and the application could not recover automatically.

      (I'm not sure whether there are other causes for this issue.)

      Attached below are screenshots of the exception stack and relevant metrics:

      2024-03-08 09:58:18,738 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - GroupWindowAggregate switched from DEPLOYING to FAILED with failure cause: java.io.IOException: Timeout triggered when requesting exclusive buffers: The total number of network buffers is currently set to 32768 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max',  or you may increase the timeout which is 30000ms by setting the key 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
      at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
      at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
      at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
      at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)  
      at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)  
      at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)  
      at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)  
      at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)  
      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)  
      at java.lang.Thread.run(Thread.java:748) 
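To put the numbers in the exception message in perspective, here is a minimal Java sketch of the arithmetic. The class and method names are illustrative only and are not part of Flink. The buffer count and buffer size come from the log line above. The `buffersPerChannel` argument is a hypothetical input, because the report does not state the job's channel count or per-channel setting.

```java
// Sketch only: arithmetic on the values printed in the exception message.
// NetworkBufferMath is an illustrative name, not a Flink class.
final class NetworkBufferMath {

    /** Total pool size in bytes: number of buffers times bytes per buffer. */
    static long totalPoolBytes(long bufferCount, long bufferSizeBytes) {
        return Math.multiplyExact(bufferCount, bufferSizeBytes);
    }

    /**
     * Exclusive buffers an input gate would reserve, assuming every input
     * channel reserves the same fixed number of buffers. Both arguments are
     * hypothetical inputs; the report does not give them.
     */
    static long exclusiveBuffersNeeded(int inputChannels, int buffersPerChannel) {
        return Math.multiplyExact((long) inputChannels, (long) buffersPerChannel);
    }

    public static void main(String[] args) {
        // 32768 buffers of 32768 bytes each, as reported in the log message.
        long total = totalPoolBytes(32_768, 32_768);
        System.out.println(total + " bytes = " + (total >> 20) + " MiB");
    }
}
```

With the reported values the pool is 1 GiB, so a request that still times out after 30000 ms suggests the buffers are held elsewhere rather than that the pool is simply too small. This is consistent with the suspicion above but does not prove it.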

      Network metric: only this TM stays at 100%, without any variation.

      Tasks deployed to this TM never reach RUNNING, and their status changes slowly.

      Although the root exception reported by the application is PartitionNotFoundException, the underlying cause found in the logs is IOException: Timeout triggered when requesting exclusive buffers.
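Because the reported root exception hides the buffer timeout, it can help to scan the TaskManager log for the timeout message directly. The following is a minimal Java sketch, assuming the log lines are already loaded into a list. The class name `BufferTimeoutLogScan` and its method are hypothetical helpers, and the pattern is taken from the message text quoted in this report.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a hypothetical helper, not part of Flink.
final class BufferTimeoutLogScan {

    // Matches the message text quoted in this report and captures the
    // buffer count and the buffer size in bytes.
    private static final Pattern TIMEOUT = Pattern.compile(
            "Timeout triggered when requesting exclusive buffers: "
                    + "The total number of network buffers is currently set to "
                    + "(\\d+) of (\\d+) bytes each");

    /** Returns {bufferCount, bufferSizeBytes} from the first matching line, if any. */
    static Optional<long[]> findBufferTimeout(List<String> logLines) {
        for (String line : logLines) {
            Matcher m = TIMEOUT.matcher(line);
            if (m.find()) {
                return Optional.of(new long[] {
                        Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Excerpt of the message from this report; a real run would read the log file.
        List<String> lines = List.of(
                "java.io.IOException: Timeout triggered when requesting exclusive buffers: "
                        + "The total number of network buffers is currently set to "
                        + "32768 of 32768 bytes each.");
        findBufferTimeout(lines).ifPresent(v ->
                System.out.println("buffers=" + v[0] + ", bytesEach=" + v[1]));
    }
}
```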

      Attachments

        1. image-20240308100308649.png
          214 kB
          Vincent Woo
        2. image-20240308101008765.png
          66 kB
          Vincent Woo

            People

              Assignee: Unassigned
              Reporter: Vincent Woo