FLINK-31293

Requesting a memory segment from LocalBufferPool may hang forever.


Details

    Description

      In our TPC-DS test, we found that some tasks may hang forever when there is heavy contention for network memory.

      The thread dump shows that the task is waiting for the LocalBufferPool to become available, even though other tasks have already finished and released their network memory. This is unexpected and points to a bug in either LocalBufferPool or NetworkBufferPool.

      A heap dump shows something odd: the LocalBufferPool has available buffers, yet it is considered unavailable. Note also that the pool currently holds an overdraft buffer.

      TL;DR: This problem is caused by a multi-threaded race condition related to the introduction of the overdraft buffer.

      Suppose we have two threads, called A and B. For simplicity, LocalBufferPool is called LocalPool and NetworkBufferPool is called GlobalPool.

      Thread A continuously requests buffers from LocalPool in blocking mode.
      Thread B continuously returns buffers to GlobalPool.

      1. Thread A takes the last available buffer of LocalPool. GlobalPool has no buffer at this time, so LocalPool registers a callback with GlobalPool.
      2. Thread B returns one buffer to GlobalPool, but has not yet triggered the callback.
      3. Thread A requests another buffer. Because the availableMemorySegments of LocalPool is empty, it requests an overdraft buffer instead. GlobalPool now holds a buffer, so the request succeeds.
      4. Thread B triggers the callback. Since GlobalPool has no buffer left, the callback is re-registered.
      5. Thread A requests another buffer. Because GlobalPool has no buffer, it blocks on CompletableFuture#get.
      6. Thread B returns another buffer and triggers the recently registered callback. As a result, LocalPool puts the buffer into availableMemorySegments. Unfortunately, the current logic of the shouldBeAvailable method is: if there is an overdraft buffer, LocalPool is considered unavailable.
      7. Because LocalPool is never marked available, the future that thread A waits on is never completed, and thread A blocks forever.
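
      The interleaving above can be replayed deterministically with a small single-threaded model. This is only a sketch under stated assumptions: ToyGlobalPool, ToyLocalPool, refillOrRegister and the other names are hypothetical stand-ins, not Flink's actual classes or method signatures, and the availability check is reduced to the rule described in step 6. Returning a buffer and firing the callback are split into two calls so the race window between steps 2 and 4 is visible.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy, single-threaded replay of steps 1-7; NOT Flink's real implementation. */
public class OverdraftRaceSketch {

    /** Hypothetical stand-in for NetworkBufferPool ("GlobalPool"). */
    static class ToyGlobalPool {
        int free = 0;       // number of free buffers
        Runnable listener;  // at most one registered callback in this sketch

        boolean tryTake() {
            if (free == 0) return false;
            free--;
            return true;
        }

        // Returning a buffer and firing the callback are separate calls on purpose.
        void returnBuffer() { free++; }

        void fireListener() {
            Runnable r = listener;
            listener = null;
            if (r != null) r.run();
        }
    }

    /** Hypothetical stand-in for LocalBufferPool ("LocalPool"). */
    static class ToyLocalPool {
        final ToyGlobalPool global;
        final Deque<Object> availableMemorySegments = new ArrayDeque<>();
        int overdraftBuffers = 0;

        ToyLocalPool(ToyGlobalPool global, int initialSegments) {
            this.global = global;
            for (int i = 0; i < initialSegments; i++) availableMemorySegments.add(new Object());
        }

        // Callback: take a buffer from the global pool, or re-register if none is free.
        void refillOrRegister() {
            if (global.tryTake()) availableMemorySegments.add(new Object());
            else global.listener = this::refillOrRegister;
        }

        // Assumed rule from step 6: holding any overdraft buffer means "unavailable".
        boolean shouldBeAvailable() {
            return !availableMemorySegments.isEmpty() && overdraftBuffers == 0;
        }

        /** Returns false where the real requester would block. */
        boolean request() {
            if (!availableMemorySegments.isEmpty()) {
                availableMemorySegments.poll();
                if (availableMemorySegments.isEmpty()) refillOrRegister();
                return true;
            }
            if (global.tryTake()) { // overdraft path
                overdraftBuffers++;
                return true;
            }
            return false;
        }
    }

    static ToyLocalPool replay() {
        ToyGlobalPool global = new ToyGlobalPool();
        ToyLocalPool local = new ToyLocalPool(global, 1);
        local.request();        // 1. A takes the last local buffer; callback registered
        global.returnBuffer();  // 2. B returns a buffer, callback not fired yet
        local.request();        // 3. A obtains an overdraft buffer from the global pool
        global.fireListener();  // 4. B fires the callback; pool empty, so it re-registers
        local.request();        // 5. returns false: A would block here
        global.returnBuffer();  // 6. B returns another buffer ...
        global.fireListener();  //    ... and the callback refills availableMemorySegments
        return local;           // 7. state in which A is never woken up
    }

    public static void main(String[] args) {
        ToyLocalPool local = replay();
        System.out.println("available segments: " + local.availableMemorySegments.size());
        System.out.println("overdraft buffers:  " + local.overdraftBuffers);
        System.out.println("shouldBeAvailable:  " + local.shouldBeAvailable());
    }
}
```

      In this model the final state has one segment in availableMemorySegments and one overdraft buffer, and shouldBeAvailable() returns false, which matches what the heap dump showed: a free buffer in a pool that is still reported as unavailable.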


          People

            Weijie Guo