Description
When running erasure coding (EC) on a cluster, one DataNode accumulated millions of connections in the CLOSE_WAIT state:
$ grep CLOSE_WAIT lsof.out | wc -l
10358700

// All CLOSE_WAITs belong to the same DataNode process (pid=88527)
$ grep CLOSE_WAIT lsof.out | awk '{print $2}' | sort | uniq
88527
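The two checks above can be combined into a single per-process count. The following is a minimal sketch, assuming the dump is standard lsof output in which the second column is the PID (as the awk command above implies). The function name count_close_wait_by_pid is hypothetical and not part of any tool.

```shell
# Hypothetical helper: count CLOSE_WAIT entries per PID in a saved lsof dump.
# Assumes column 2 of each line is the PID, as in default lsof output.
count_close_wait_by_pid() {
  # $1: path to the saved lsof output (lsof.out in this report)
  awk '/CLOSE_WAIT/ { count[$2]++ }
       END { for (pid in count) print pid, count[pid] }' "$1" |
    sort -k2,2nr   # largest count first
}
```

Running `count_close_wait_by_pid lsof.out` prints one "PID count" line per process, so a single leaking process stands out at the top of the list.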
As a result, the DataNode cannot open any more files or sockets, as shown in the log:
2018-01-19 06:47:09,424 WARN io.netty.channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
	at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
	at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
	at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
	at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:563)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:504)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:418)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:390)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
	at java.lang.Thread.run(Thread.java:748)