[FLINK-13100] Fix the bug of throwing IOException while FileBufferReader#nextBuffer - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Blocker
Resolution: Resolved
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Network
Labels:
- pull-request-available

Description

In the implementation of FileBufferReader#nextBuffer, we expect the next memory segment to always be available, on the assumption that nextBuffer is only called after the previous buffer has been recycled. If no segment is available, the current implementation throws an IOException.

In fact, this assumption does not hold given the credit-based flow control and zero-copy features of the network stack. The detailed sequence is as follows:

1. The netty thread finishes calling channel.writeAndFlush() in PartitionRequestQueue and adds a listener to handle the ChannelFuture later. Until the future completes, the corresponding buffer is not recycled because of the zero-copy improvement.

2. Before the previous future completes, the netty thread can trigger the next writeAndFlush while processing an addCredit message. FileBufferReader#nextBuffer then throws an exception because the previous buffer has not been recycled yet.
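
The sequence above can be replayed with a minimal, single-threaded sketch. Everything in it is an assumption made for illustration: EagerReaderSketch, its one-segment pool, and the recycle() method are hypothetical stand-ins, not Flink's actual FileBufferReader or netty code.

```java
import java.io.IOException;
import java.util.ArrayDeque;

/** Hypothetical, simplified stand-in for the reader in this issue; not Flink's actual class. */
final class EagerReaderSketch {

    /** Free segments; one returns here only when recycle() is called. */
    private final ArrayDeque<byte[]> freeSegments = new ArrayDeque<>();

    EagerReaderSketch(int numSegments, int segmentSize) {
        for (int i = 0; i < numSegments; i++) {
            freeSegments.add(new byte[segmentSize]);
        }
    }

    /** Encodes the assumption from the description: a free segment is expected on every call. */
    byte[] nextBuffer() throws IOException {
        byte[] segment = freeSegments.poll();
        if (segment == null) {
            throw new IOException("no free memory segment: previous buffer not yet recycled");
        }
        return segment;
    }

    /** Stands in for the ChannelFuture listener releasing the buffer once the write completes. */
    void recycle(byte[] segment) {
        freeSegments.add(segment);
    }

    /** Replays the two steps: a second read arrives before the first buffer is recycled. */
    static boolean secondReadFailsBeforeRecycle() {
        EagerReaderSketch reader = new EagerReaderSketch(1, 32);
        try {
            byte[] inFlight = reader.nextBuffer(); // step 1: written, future still pending
            reader.nextBuffer();                   // step 2: addCredit triggers another read
            return false;
        } catch (IOException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second read failed: " + secondReadFailsBeforeRecycle());
    }
}
```

The second read fails only because recycle() has not run yet, which is the window between writeAndFlush returning and its future completing.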

We considered two ways to solve this potential bug:

1. Do not trigger the next writeAndFlush before the previous future completes. This requires maintaining the future state and checking it in the relevant actions. I suspect this could cause a regression in network throughput and would add extra state management.

2. Adjust the implementation of the current FileBufferReader. We previously regarded the blocking partition view as always available because of the read-ahead next buffer, so it was always added to the available queue in PartitionRequestQueue. In practice, this read-ahead buffer only simplifies the process of BoundedBlockingSubpartitionReader#notifyDataAvailable. The view's availability could instead be judged by the available buffers in FileBufferReader. When a buffer is recycled into FileBufferReader after writeAndFlush completes, it could call notifyDataAvailable to add the view to the available queue in PartitionRequestQueue.

I prefer the second approach because it should not have any negative side effects.
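
A rough sketch of the second approach, under stated assumptions: RecycleNotifyingReaderSketch and its AvailabilityListener interface are hypothetical names introduced here, the pool holds plain byte arrays, and everything runs on one thread (as with the single netty thread). It ignores whether data remains in the file and is not the change made in the linked pull request.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch: availability follows free segments, and recycling re-announces the view. */
final class RecycleNotifyingReaderSketch {

    /** Hypothetical callback standing in for notifyDataAvailable on the subpartition view. */
    interface AvailabilityListener {
        void notifyDataAvailable();
    }

    private final ArrayDeque<byte[]> freeSegments = new ArrayDeque<>();
    private final AvailabilityListener listener;

    RecycleNotifyingReaderSketch(int numSegments, int segmentSize, AvailabilityListener listener) {
        this.listener = listener;
        for (int i = 0; i < numSegments; i++) {
            freeSegments.add(new byte[segmentSize]);
        }
    }

    /** The view counts as available only while a free segment exists. */
    boolean isAvailable() {
        return !freeSegments.isEmpty();
    }

    /** Returns null instead of throwing when every segment is still in flight. */
    byte[] nextBuffer() {
        return freeSegments.poll();
    }

    /** Called once the write completes; notifies only on the empty-to-non-empty transition. */
    void recycle(byte[] segment) {
        boolean wasExhausted = freeSegments.isEmpty();
        freeSegments.add(segment);
        if (wasExhausted) {
            listener.notifyDataAvailable();
        }
    }

    public static void main(String[] args) {
        int[] notifications = {0};
        RecycleNotifyingReaderSketch reader =
                new RecycleNotifyingReaderSketch(1, 32, () -> notifications[0]++);

        byte[] inFlight = reader.nextBuffer();   // buffer handed to the network layer
        System.out.println("available while in flight: " + reader.isAvailable());
        reader.recycle(inFlight);                // write completed, buffer returned
        System.out.println("notifications after recycle: " + notifications[0]);
    }
}
```

With this shape, a read that arrives while all segments are in flight simply finds the view unavailable, and the view is re-queued by the recycle callback rather than by an always-available read-ahead buffer.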

Issue Links

links to

GitHub Pull Request #9062

People

Assignee: Zhijiang

Reporter: Zhijiang

Votes: 0

Watchers: 4

Dates

Created: 04/Jul/19 09:58

Updated: 02/Oct/19 17:44

Resolved: 12/Jul/19 07:13

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m