Kafka
  1. Kafka
  2. KAFKA-148

Shutdown During a Log Flush Throws Exception, Triggers Startup in Recovery Mode

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: core
    • Labels:
    • Environment:
      Independent. Observed on Linux data05 2.6.38-10-server #46~lucid1-Ubuntu SMP Wed Jul 6 20:19:32 UTC 2011 x86_64 GNU/Linux.

      Description

      If a log is being flushed when Kafka is shut down, KafkaScheduler immediately terminates the execution of the flush tasks, causing exceptions to be thrown and aborting the clean shutdown process. When restarted, the instance will launch in recovery mode, scanning the most recent log segment for each topic before launching.

      This issue is particularly highlighted by environments which have a large number of topics and a low flush threshold. In our case, nearly every deploy to our Kafka service in which the app is restarted triggers an unclean shutdown followed by startup in recovery mode. The real pain here is reading through every log file on each restart, but this can be avoided entirely.

      The behavior is caused by the way the KafkaScheduler threadpool (responsible for log flushing) is shut down. Currently, KafkaScheduler.shutdown() calls "executor.shutdownNow()." executor.shutdownNow()'s behavior immediately cancels all running tasks in a threadpool and terminates. This causes each Log to throw a java.nio.channels.ClosedByInterruptException as the flush to disk is interrupted, aborting the clean shutdown. Here's a sample log of the exceptions we see thrown when restarting Kafka: https://gist.github.com/1265940

      The fix for this is pretty simple. Rather than calling executor.shutdownNow() in KafkaScheduler.shutdown(), we can call executor.shutdown() (compare Javadoc: http://bit.ly/mZUSaD). This method "initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted." Currently-running flushes are allowed to complete (generally very quick), future flush tasks are cancelled, no exceptions are thrown, and the shutdown finishes cleanly within about a second.

      I've attached the patch we've applied and deployed. It applies cleanly to 0.6.1.

      • Scott

        Issue Links

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              C. Scott Andreas
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development