FLINK-33879

Hybrid Shuffle may stop working for a while during redistribution

Details

    Description

      Currently, Hybrid Shuffle can use the memory tier and the disk tier together. However, in the following scenario the result partition stops working for a while.

      Suppose a shuffle task has 2 sub-partitions and its LocalBufferPool has 15 buffers. According to `TieredStorageMemoryManagerImpl#getMaxNonReclaimableBuffers`, the memory tier can use at most 15-(2*(2+1)+1) = 8 buffers. If the memory tier uses up all 8 buffers and the input channel consumes them very slowly, e.g. because of an unstable network, the disk tier can still work with its 1 reserved buffer. However, if a redistribution happens at this point and the pool size is decreased to less than 8, the BufferAccumulator cannot request buffers anymore, so the result partition stops working until the buffers held by the memory tier are recycled.
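
      The arithmetic in this scenario can be checked with a small toy model. This is only a sketch and not Flink code: the class name, the method names, and the constant are hypothetical, the value 7 is copied from the expression 2*(2+1)+1 above, and the rule that a pool hands out a buffer only while fewer buffers are outstanding than its current size is an assumption made for illustration.

```java
/** Toy model of the buffer arithmetic in the scenario above. Not Flink code. */
final class HybridShuffleBufferSketch {

    // Buffers subtracted for everything other than the memory tier.
    // The value 2 * (2 + 1) + 1 = 7 is taken as-is from the issue description.
    static final int RESERVED_FOR_OTHERS = 2 * (2 + 1) + 1;

    private HybridShuffleBufferSketch() {}

    /** Upper bound for the memory tier: pool size minus the buffers reserved for others. */
    static int maxMemoryTierBuffers(int poolSize) {
        return poolSize - RESERVED_FOR_OTHERS;
    }

    /**
     * Assumption for this sketch: a new buffer can be requested only while the
     * number of outstanding (not yet recycled) buffers is below the pool size.
     */
    static boolean canRequestBuffer(int poolSize, int outstandingBuffers) {
        return outstandingBuffers < poolSize;
    }
}
```

      Under these assumptions, `maxMemoryTierBuffers(15)` gives 8, `canRequestBuffer(15, 8)` is true (the disk tier can still obtain a buffer), and `canRequestBuffer(7, 8)` is false (after the pool shrinks to 7 while the memory tier still holds 8 buffers, nothing can be requested until some are recycled).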

      The goal is to keep the result partition working with the disk tier in this situation, writing the shuffle data to disk so that once the input channel is ready, the data on disk can be consumed immediately.

            People

              Jiang Xin