Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-4455

CommitFailedException during rebalance doesn't release resources in tasks/processors

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.10.1.0
    • 0.10.2.0
    • streams
    • Kafka Streams were running on CentOS - I have observed this - after some time the locks were released even if the jvm/process wasn't restarted, so I guess CentOS has some lock cleaning policy.

    Description

      Problem description

      From time to time a rebalance in Kafka Streams causes the commit to throw CommitFailedException. When this exception is thrown, the tasks and processors are not closed. If some processor contains a state store (RocksDB), the RocksDB is not closed, which leads to not relasead LOCK's on OS level, and when the Kafka Streams app is trying to open tasks and their respective processors and state stores the org.rocksdb.RocksDBException: IO error: lock .../LOCK: No locks available is thrown. If the the jvm/process is restarted the locks are released.

      Additional info

      I have been running 3 Kafka Streams instances on separate machines with num.stream.threads=1 and each with it's own state directory. Other Kafka Streams apps were running on the same machines but they had separate directories for state stores. In the attached logs you can see StreamThread-1895, I'm not running 1895 StreamThreads, I have implemented some Kafka Streams restart policy in my UncaughtExceptionHandler which on some transient exceptions restarts the org.apache.kafka.streams.KafkaStreams topology, by calling org.apache.kafka.streams.KafkaStreams.stop() and then org.apache.kafka.streams.KafkaStreams.start(). This causes the thread names to have bigger numbers.

      Stacktrace

      RocksDBException_IO-error_stacktrace.txt

      Suggested solution

      To avoid restarting the jvm, modify Kafka Streams to close tasks, which will lead to release of resources - in this case - filesystem LOCK files.

      Possible solution code

      Branch: https://github.com/dpoldrugo/kafka/commits/infobip-fork
      Commit: BUGFIX: When commit fails during rebalance - release resources
      I have been running this fork in production for 3 days and the error doesn't come-up.

      Note

      This could be related this issues: KAFKA-3708 and KAFKA-3938
      Additinal conversation can be found here: stream shut down due to no locks for state store

      Attachments

        Issue Links

          Activity

            People

              ewencp Ewen Cheslack-Postava
              dpoldrugo Davor Poldrugo
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: