Flume / FLUME-2812

Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.Error: Maximum permit count exceeded

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.8.0
    • Component/s: Channel, Sinks+Sources
    • Labels:
      None
    • Environment:

      Description

      We are finding that after around an hour or so of heavy processing of Flume data in an agent, we get the following exception. This happens after processing about 5-7k lines/second during that time.

      The agent is configured with the Kafka source that ships with 1.6.0, a memory channel, and a Thrift sink.

      =======
      Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.Error: Maximum permit count exceeded
      at java.util.concurrent.Semaphore$Sync.tryReleaseShared(Semaphore.java:192)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1341)
      at java.util.concurrent.Semaphore.release(Semaphore.java:609)
      at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:147)
      at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
      at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:379)
      at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
      at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
      at java.lang.Thread.run(Thread.java:745)
      =======

      The above error is printed to standard error when running the Flume agent. The effect is that the "SinkRunner-PollingRunner-DefaultSinkProcessor" thread disappears from the agent, which can be seen in a JMX console.
      For us this means the Flume agent needs to be restarted: the error is terminal for that instance of the Java process, because the thread is gone.

      It sounds like something in JDK 7+ got stricter?!
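      For what it's worth, the Error can be reproduced with a plain java.util.concurrent.Semaphore: release() refuses to let the permit count grow past Integer.MAX_VALUE. A minimal standalone sketch (the class name is just illustrative, not Flume code):

      ```java
      import java.util.concurrent.Semaphore;

      // Hypothetical demo class, not Flume code: the JDK itself throws the Error
      // as soon as a release would push the permit count past Integer.MAX_VALUE.
      public class PermitOverflowDemo {
          public static void main(String[] args) {
              Semaphore s = new Semaphore(Integer.MAX_VALUE);
              s.release(); // throws java.lang.Error: Maximum permit count exceeded
          }
      }
      ```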

        Issue Links

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Flume-trunk-hbase-1 #226 (See https://builds.apache.org/job/Flume-trunk-hbase-1/226/)
          FLUME-2812. Fix semaphore leak causing java.lang.Error: Maximum permit (bessbd: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=f581f6ed0e820da370c4a83a4ff7b05278d09fc3)

          • (edit) flume-ng-core/src/test/java/org/apache/flume/channel/TestMemoryChannel.java
          • (edit) flume-ng-core/src/main/java/org/apache/flume/channel/MemoryChannel.java
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flume/pull/83

          jira-bot ASF subversion and git services added a comment -

          Commit f581f6ed0e820da370c4a83a4ff7b05278d09fc3 in flume's branch refs/heads/trunk from Denes Arvay
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=f581f6e ]

          FLUME-2812. Fix semaphore leak causing java.lang.Error: Maximum permit count exceeded in MemoryChannel

          bytesRemaining.release(putByteCounter) gets called in MemoryTransaction.doRollback while acquire is called only in doCommit. This results in semaphore leak and the number of permits in the semaphore eventually exceeds Integer.MAX_VALUE and Semaphore.release() throws java.lang.Error: Maximum permit count exceeded.

          This closes #83

          Reviewers: Attila Simon, Bessenyei Balázs Donát

          (Denes Arvay via Bessenyei Balázs Donát)
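          To make the failure mode concrete, here is a minimal standalone sketch (hypothetical class, not the Flume code) of the pattern the commit message describes: a semaphore that is released on every rollback without a matching acquire grows without bound until the JDK's overflow check trips.

          ```java
          import java.util.concurrent.Semaphore;

          // Hypothetical standalone simulation of the leak described above, not Flume code.
          public class SemaphoreLeakSimulation {
              public static void main(String[] args) {
                  // Start near the limit so the overflow is reached quickly; in a real
                  // agent it takes many rolled-back transactions to get there.
                  Semaphore bytesRemaining = new Semaphore(Integer.MAX_VALUE - 25);
                  int putByteCounter = 10;
                  for (int rollback = 1; ; rollback++) {
                      // Unmatched release, as in the buggy doRollback: the permit count only grows.
                      bytesRemaining.release(putByteCounter);
                      System.out.println("rollback " + rollback + ": "
                              + bytesRemaining.availablePermits() + " permits");
                      // After a few iterations release() throws
                      // java.lang.Error: Maximum permit count exceeded
                  }
              }
          }
          ```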

          denes Denes Arvay added a comment -

          I have investigated this issue and the cause is that bytesRemaining.release(putByteCounter) gets called in MemoryTransaction.doRollback (https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/channel/MemoryChannel.java#L174) while the matching acquire is called only in doCommit. This results in a semaphore leak: the number of permits keeps growing until a release would push it past Integer.MAX_VALUE, at which point Semaphore.release() throws this error.
          I think this bug was introduced unintentionally when the bytesRemaining semaphore handling was refactored and moved from doPut to doCommit in FLUME-2233.
          If acquire were still called in doPut, then the release in doRollback would be correct; but since the acquire has been moved to doCommit, the release should have been deleted.
          GitHub Pull Request for this ticket: https://github.com/apache/flume/pull/83
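          A schematic of the two consistent pairings described above, using only the member names mentioned in this ticket (not the actual MemoryChannel source):

          ```java
          // Schematic only; uses the names from the description, not the real code.
          //
          // (a) Pre-FLUME-2233 pairing:
          //     doPut:      bytesRemaining.acquire(...);              // take permits per put
          //     doRollback: bytesRemaining.release(putByteCounter);   // give them back on rollback
          //
          // (b) Post-FLUME-2233 pairing (what the fix restores):
          //     doCommit:   bytesRemaining.acquire(putByteCounter);   // take permits at commit time
          //     doRollback: /* nothing to release: no permits were acquired yet */
          //
          // Flume 1.6.0 mixed the two: the acquire had moved to doCommit as in (b),
          // while doRollback kept the release from (a). Every rolled-back transaction
          // therefore added putByteCounter permits that were never acquired, and the
          // permit count eventually overflowed Integer.MAX_VALUE.
          ```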

          githubbot ASF GitHub Bot added a comment -

          GitHub user adenes opened a pull request:

          https://github.com/apache/flume/pull/83

          FLUME-2812: Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.Error: Maximum permit count exceeded

          Fix semaphore leak causing `java.lang.Error: Maximum permit count exceeded` in `MemoryChannel`

          ```java
          Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.Error: Maximum permit count exceeded
          at java.util.concurrent.Semaphore$Sync.tryReleaseShared(Semaphore.java:192)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1341)
          at java.util.concurrent.Semaphore.release(Semaphore.java:609)
          at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:147)
          at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
          at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:379)
          at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
          at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
          at java.lang.Thread.run(Thread.java:745)
          ```

          For more details and explanation, see my comment on FLUME-2812 (https://issues.apache.org/jira/browse/FLUME-2812).

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/adenes/flume FLUME-2812

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flume/pull/83.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #83


          commit 538837d5c035461c9b3073a2ce89da375aa139f3
          Author: Denes Arvay <denes@cloudera.com>
          Date: 2016-10-28T15:38:33Z

          FLUME-2812: Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.Error: Maximum permit count exceeded

          Change-Id: Ib6761364abefca91ca50d80c073637766bfd93ca


          jfield Jeff Field added a comment -

          We're encountering this when a source (Kafka) delivers data to the channel (Memory) much faster than the sink (HDFS) can write it out. The (repeatable) chain of events is: the channel fills up, HDFS keeps writing, the channel keeps throwing errors because it is full, and eventually all of the HDFS sinks die with this error. We don't encounter it on 1.6 because in 1.6 we're using Kafka Channel -> Sink directly. We were trying a similar config for 1.7, but a separate issue is preventing that (we're getting Kafka consumer timeout/rebalance problems with the source and channel in 1.7 that we didn't get in 1.6).

          rcritt Rollin Crittendon added a comment -

          One more note: as far as I can tell, this error is fatal when it occurs.

          We have needed to restart the process each time it happens.

          rcritt Rollin Crittendon added a comment -

          A note: we are also seeing this in some rollback cases on a staging system.
          I assume the number of releases on a semaphore ends up exceeding the number of takes.
          Is there an explanation for how that could be happening?

          We are running JDK 1.8, and from the JDK 1.7 source I have looked at, it appears there is an overflow check that we are hitting: https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/util/concurrent/Semaphore.java
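          For reference, the relevant check lives in Semaphore's internal Sync class; paraphrased from the OpenJDK source linked above (comments added):

          ```java
          // Paraphrased from java.util.concurrent.Semaphore.Sync in the OpenJDK
          // source linked above: release() fails with an Error rather than silently
          // wrapping the permit count past Integer.MAX_VALUE.
          protected final boolean tryReleaseShared(int releases) {
              for (;;) {
                  int current = getState();
                  int next = current + releases;
                  if (next < current) // integer overflow: too many permits
                      throw new Error("Maximum permit count exceeded");
                  if (compareAndSetState(current, next))
                      return true;
              }
          }
          ```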

          ====================
          06 Jan 2016 01:03:59,476 ERROR [Flume Thrift IPC Thread 81] (org.apache.thrift.ProcessFunction.process:41) - Internal error processing appendBatch
          java.lang.Error: Maximum permit count exceeded
          at java.util.concurrent.Semaphore$Sync.tryReleaseShared(Semaphore.java:192)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1341)
          at java.util.concurrent.Semaphore.release(Semaphore.java:609)
          at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doRollback(MemoryChannel.java:176)
          at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
          at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:194)
          at org.apache.flume.source.ThriftSource$ThriftSourceHandler.appendBatch(ThriftSource.java:457)
          at org.apache.flume.thrift.ThriftSourceProtocol$Processor$appendBatch.getResult(ThriftSourceProtocol.java:259)
          at org.apache.flume.thrift.ThriftSourceProtocol$Processor$appendBatch.getResult(ThriftSourceProtocol.java:247)
          at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
          at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
          at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
          at org.apache.thrift.server.Invocation.run(Invocation.java:18)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
          ====================

          rcritt Rollin Crittendon added a comment -

          This is what we have, with machine names, ports, and similar details redacted, for example <ZOOKEEPER-SETTINGS>.

          =============================
          bridge.channels = bridge-channel
          bridge.sources = spool-source kafka-source
          bridge.sinks = to-conduit-sink

          bridge.channels.bridge-channel.type = memory
          bridge.channels.bridge-channel.capacity = 100000
          bridge.channels.bridge-channel.transactionCapacity = 40000
          bridge.channels.bridge-channel.byteCapacityBufferPercentage = 20
          bridge.channels.bridge-channel.byteCapacity = 0

          bridge.sources.spool-source.channels = bridge-channel
          bridge.sources.spool-source.type = spooldir
          bridge.sources.spool-source.spoolDir = <SPOOL-DIRECTORY-LOCATION>
          bridge.sources.spool-source.fileHeader = false
          bridge.sources.spool-source.deletePolicy = immediate
          bridge.sources.spool-source.batchSize = 100
          bridge.sources.spool-source.deserializer.maxLineLength = 100000

          bridge.sources.kafka-source.channels = bridge-channel
          bridge.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
          bridge.sources.kafka-source.zookeeperConnect = <ZOOKEEPER-SETTINGS>
          bridge.sources.kafka-source.groupId = <GROUP-ID>
          bridge.sources.kafka-source.topic = <TOPIC>
          bridge.sources.kafka-source.batchSize = 100

          bridge.sinks.to-conduit-sink.channel = bridge-channel
          bridge.sinks.to-conduit-sink.type = thrift
          bridge.sinks.to-conduit-sink.hostname = <OTHER-AGENT-MACHINE-NAME>
          bridge.sinks.to-conduit-sink.port = <OTHER-AGENT-MACHINE-PORT>
          bridge.sinks.to-conduit-sink.batch-size = 1000

          hshreedharan Hari Shreedharan added a comment -

          What does your configuration look like?


            People

            • Assignee:
              denes Denes Arvay
              Reporter:
              rcritt Rollin Crittendon
            • Votes:
              0
              Watchers:
              8

              Dates

              • Created:
                Updated:
                Resolved:

                Development