[SPARK-30512] Use a dedicated boss event group loop in the netty pipeline for external shuffle service - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 2.4.5, 3.0.0
Component/s: Shuffle, Spark Core
Labels: None

Description

We have been seeing a large number of SASL authentication requests (RPC requests) time out against the external shuffle service.
The issue and our full analysis are described here:
https://github.com/netty/netty/issues/9890

I added a LoggingHandler to the Netty pipeline and found that even channel registration was delayed by 30 seconds.
In the Spark external shuffle service, the boss event loop group and the worker event loop group are the same object, which causes this delay.

    EventLoopGroup bossGroup =
      NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server");
    EventLoopGroup workerGroup = bossGroup;

    bootstrap = new ServerBootstrap()
      .group(bossGroup, workerGroup)
      .channel(NettyUtils.getServerChannelClass(ioMode))
      .option(ChannelOption.ALLOCATOR, allocator)
      .childOption(ChannelOption.ALLOCATOR, allocator);

When load on the shuffle service increases, the worker threads are busy servicing existing channels, so registration of newly accepted channels is delayed.

The fix is simple. I created a dedicated boss event loop group with 1 thread.

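    // Dedicated single-threaded boss group: only accepts and registers new connections.
    EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
      conf.getModuleName() + "-boss");
    // Worker group: handles I/O for the accepted channels.
    EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
      conf.getModuleName() + "-server");

    bootstrap = new ServerBootstrap()
      .group(bossGroup, workerGroup)
      .channel(NettyUtils.getServerChannelClass(ioMode))
      .option(ChannelOption.ALLOCATOR, allocator)
      .childOption(ChannelOption.ALLOCATOR, allocator);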

This fixed the issue.
One thread is enough for the boss group because there is only a single server bootstrap.
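
The starvation described above can be illustrated without Netty. The following is a minimal sketch using plain java.util.concurrent thread pools as a stand-in for event loop groups. It is an analogy, not the shuffle service code: Netty event loops each own a single thread and their own task queue, so the real scheduling differs in detail. The class name BossGroupSketch and the method acceptRunsWhileWorkersBusy are hypothetical names chosen for this example. The sketch occupies every worker thread, then submits a simulated "accept" task either to the same pool (shared group) or to a separate single-threaded pool (dedicated boss), and reports whether the accept ran while the workers were still busy.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BossGroupSketch {

  /**
   * Returns true if a simulated "accept" task ran while every worker
   * thread was still busy. Thread pools stand in for event loop groups.
   */
  public static boolean acceptRunsWhileWorkersBusy(boolean dedicatedBoss, int workerThreads) {
    ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
    // Shared case: the boss is the very same pool as the workers.
    ExecutorService boss = dedicatedBoss ? Executors.newSingleThreadExecutor() : workers;

    CountDownLatch allBusy = new CountDownLatch(workerThreads);
    CountDownLatch release = new CountDownLatch(1);
    CountDownLatch accepted = new CountDownLatch(1);
    try {
      // Occupy every worker thread, standing in for work on existing channels.
      for (int i = 0; i < workerThreads; i++) {
        workers.execute(() -> {
          allBusy.countDown();
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      allBusy.await();

      // Simulated accept/registration of a new connection.
      boss.execute(accepted::countDown);

      // Did the accept run while all workers were still blocked?
      return accepted.await(500, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      // Unblock the workers and let both pools drain and exit.
      release.countDown();
      workers.shutdown();
      if (boss != workers) {
        boss.shutdown();
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("shared group, accept ran:   " + acceptRunsWhileWorkersBusy(false, 4));
    System.out.println("dedicated boss, accept ran: " + acceptRunsWhileWorkersBusy(true, 4));
  }
}
```

In the shared case the accept task sits in the queue behind the busy workers until they are released. With the dedicated single-threaded pool it runs immediately, regardless of worker load.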

Issue Links

is related to

SPARK-24355 Improve Spark shuffle server responsiveness to non-ChunkFetch requests (Resolved)

SPARK-29206 Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads (Resolved)

links to

GitHub Pull Request #27240

People

Assignee: Chandni Singh

Reporter: Chandni Singh

Votes: 0

Watchers: 7

Dates

Created: 14/Jan/20 19:54

Updated: 17/May/20 18:30

Resolved: 29/Jan/20 21:10