Flume / FLUME-1641

Quickly reconnecting with Netty Avro RPC client causes OOME from lack of direct memory

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Client SDK
    • Labels: None

      Description

      An OutOfMemoryError can occur when the client attempts to reconnect too quickly. Stack trace:

      Exception in thread "main" java.lang.OutOfMemoryError: Direct buffer memory
      at java.nio.Bits.reserveMemory(Bits.java:632)
      at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:97)
      at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
      at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$Preallocation.<init>(SocketSendBufferPool.java:156)
      at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.<init>(SocketSendBufferPool.java:43)
      at org.jboss.netty.channel.socket.nio.NioWorker.<init>(NioWorker.java:84)
      at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.<init>(NioClientSocketPipelineSink.java:74)
      at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.<init>(NioClientSocketChannelFactory.java:135)
      at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.<init>(NioClientSocketChannelFactory.java:105)
      at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:116)
      at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:120)
      at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:109)
      at org.apache.flume.api.NettyAvroRpcClient.<init>(NettyAvroRpcClient.java:94)
      at org.apache.flume.api.RpcClientFactory.getDefaultInstance(RpcClientFactory.java:131)
      at org.apache.flume.api.RpcClientFactory.getDefaultInstance(RpcClientFactory.java:107)
      at org.apache.flume.api.FailoverRpcClient.getNextClient(FailoverRpcClient.java:270)
      at org.apache.flume.api.FailoverRpcClient.getClient(FailoverRpcClient.java:140)
      at org.apache.flume.api.FailoverRpcClient.append(FailoverRpcClient.java:174)
      ... snip ...

      Appears to be related to https://issues.jboss.org/browse/NETTY-424

        Activity

        Mike Percy added a comment -

        Since Netty now lives at www.netty.io, I don't know whether they still pay attention to the JIRA instance at issues.jboss.org, but I don't see an equivalent issue filed at https://github.com/netty/netty/issues?state=open

        The workaround for this issue appears to be to sleep before trying to reconnect, to allow the JVM to release the direct memory resources.
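
        To check whether direct memory is still held when the next connection is attempted, usage can be read from inside the JVM. This is a minimal sketch, assuming Java 7 or later (for `BufferPoolMXBean`). The class name `DirectMemoryProbe` is hypothetical and not part of Flume or Netty.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper (not part of Flume or Netty): reports how much
// memory the JVM's "direct" buffer pool is currently using.
final class DirectMemoryProbe {

    private DirectMemoryProbe() {
    }

    /**
     * Returns the JVM's estimate of the bytes used by the direct buffer
     * pool, or -1 if no pool named "direct" is exposed by this JVM.
     */
    public static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("direct buffer bytes in use: " + directMemoryUsed());
    }
}
```

        Logging this value just before each reconnect attempt would show whether buffers from the previous client have been released by the time the next client is created.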

        Juhani Connolly added a comment -

        This bug appears to still be around.

        It also occurs in AvroSink when reconnecting (when using Avro reconnect for load balancing). On rare occasions, the fast disconnect-reconnect cycle appears to cause an OOM.

        I suspect https://github.com/netty/netty/issues/1393 may address it. If we wanted that fix, we would need to upgrade Netty to a version later than 4.0.

        Raising the memory limit might also fix it, though that shouldn't be necessary.

        Also, as suggested, we could add a sleep before the reconnect, though this may hurt throughput. Perhaps it could be configurable?
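
        One way to make the sleep configurable while limiting its effect on throughput is to sleep only when the previous connection attempt was too recent. This is a minimal sketch under that assumption, not an existing Flume API: the class `ReconnectThrottle` and the parameter `minIntervalMillis` are hypothetical names.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Flume): enforces a minimum interval
// between connection attempts. An interval of 0 disables the delay.
final class ReconnectThrottle {

    private final long minIntervalNanos;
    private long lastAttemptNanos;
    private boolean attempted;

    public ReconnectThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("minIntervalMillis must be >= 0");
        }
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Records that a connection attempt started at the given nanoTime value. */
    public synchronized void recordAttempt(long nowNanos) {
        lastAttemptNanos = nowNanos;
        attempted = true;
    }

    /**
     * Returns how many milliseconds the caller should still wait before the
     * next attempt, rounded up; 0 if no attempt was recorded yet or the
     * minimum interval has already elapsed.
     */
    public synchronized long remainingDelayMillis(long nowNanos) {
        if (!attempted) {
            return 0L;
        }
        long remainingNanos = minIntervalNanos - (nowNanos - lastAttemptNanos);
        if (remainingNanos <= 0) {
            return 0L;
        }
        return (remainingNanos + 999_999L) / 1_000_000L;
    }

    /** Sleeps only if the last attempt was too recent, then records this attempt. */
    public void awaitNextAttempt() throws InterruptedException {
        long delayMillis = remainingDelayMillis(System.nanoTime());
        if (delayMillis > 0) {
            Thread.sleep(delayMillis);
        }
        recordAttempt(System.nanoTime());
    }
}
```

        A caller would invoke `awaitNextAttempt()` immediately before creating a new client. When reconnects are already spaced further apart than the minimum interval, no sleep occurs, so normal load-balancing reconnects would not be slowed down.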


          People

          • Assignee: Unassigned
          • Reporter: Mike Percy
          • Votes: 0
          • Watchers: 3