Kafka / KAFKA-380

Enhance single_host_multi_brokers test with failure to trigger leader re-election in replication

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None

    Attachments

    1. kafka-380-v1.patch
      11 kB
      John Fung
    2. kafka-380-v2.patch
      12 kB
      John Fung
    3. kafka-380-v3.patch
      13 kB
      John Fung
    4. kafka-380-v4.patch
      11 kB
      John Fung
    5. kafka-380-v5.patch
      11 kB
      John Fung

      Activity

      John Fung added a comment - edited

      Uploaded kafka-380-v1.patch with the following changes:

      A. Introduce failure to the brokers in a round robin fashion (as described in README):
      1. Start the Kafka cluster
      2. Create topic
      3. Find the leader
      4. Stop the broker found in Step 3
      5. Send n messages
      6. Consume the messages
      7. Restart the broker stopped in Step 4
      8. Go back to Step 3 and repeat for all servers in the cluster
      9. Validate test results

      B. Keep track of the leader re-election latency (the difference between the broker shutdown timestamp and the leader re-election timestamp).

      C. Report the max and min latency values (a sketch of this loop follows below)
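
      A bash sketch of the loop described above; find_leader, stop_broker, start_broker, wait_for_new_leader, produce_messages, consume_messages and validate_results are hypothetical placeholders for the real test helpers, not the actual patch code:

      # Hypothetical sketch of steps A and B: fail the current leader once per broker
      # and record the leader re-election latency for each iteration.
      max_latency_ms=0
      min_latency_ms=999999999

      for _ in "${broker_ids[@]}"; do                       # one failure per broker (Step 8)
          leader_id=$(find_leader "$topic")                  # Step 3: find the current leader
          shutdown_ts_ms=$(( $(date +%s%N) / 1000000 ))      # GNU date; ms timestamp at shutdown
          stop_broker "$leader_id"                           # Step 4: stop that broker
          elected_ts_ms=$(wait_for_new_leader "$topic")      # ms timestamp when a new leader is elected
          latency_ms=$(( elected_ts_ms - shutdown_ts_ms ))   # B: re-election latency

          (( latency_ms > max_latency_ms )) && max_latency_ms=$latency_ms
          (( latency_ms < min_latency_ms )) && min_latency_ms=$latency_ms

          produce_messages "$topic" "$num_messages"          # Step 5
          consume_messages "$topic"                          # Step 6
          start_broker "$leader_id"                          # Step 7: restart the stopped broker
      done

      validate_results                                       # Step 9
      echo "max leader re-election latency: ${max_latency_ms} ms"   # C
      echo "min leader re-election latency: ${min_latency_ms} ms"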

      Joel Koshy added a comment -
      • Looks good overall, but will revisit after KAFKA-350 is done.
      • Instead of sleeping 30s after shutting down a server, pgrep for it (a
        sketch follows this list).
      • We should get rid of as many sleeps as possible. E.g., with producer acks
        we don't need to sleep after producer-performance. Likewise for all other
        sleeps, let's see why they are needed and provide tooling (if necessary)
        to eliminate/reduce them.
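
      A minimal sketch of the pgrep-based wait, assuming a bash test driver; the function name and process pattern are hypothetical placeholders:

      # Instead of a fixed "sleep 30" after stopping a broker, poll with pgrep until
      # the broker's process has actually exited.
      wait_for_broker_to_stop() {
          local broker_id=$1
          # The -f pattern is a placeholder for whatever uniquely identifies the
          # broker's JVM command line in this test.
          while pgrep -f "kafka.Kafka.*server-${broker_id}" > /dev/null; do
              sleep 1
          done
      }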
      Joel Koshy added a comment -

      Filed KAFKA-392 which includes the two points above.

      John Fung added a comment -

      Thanks Joel for reviewing kafka-380-v1.patch. I have uploaded kafka-380-v2.patch (not including the changes you suggested, which are now tracked in KAFKA-392) with additional minor changes:

      The change in kafka-380-v2.patch fixes a problem running this test on MacOS, caused by the different argument syntax of the "date" shell command between Linux and Darwin (MacOS).
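
      For reference, a sketch of one portable way to get a millisecond timestamp in bash; the helper name now_ms is hypothetical and this is not necessarily how the patch solves it:

      # GNU date (Linux) understands %N (nanoseconds); BSD date (Darwin/MacOS) does not
      # and just prints a literal "N", so fall back to perl's Time::HiRes there.
      now_ms() {
          if date +%N | grep -q '^[0-9]'; then
              echo $(( $(date +%s%N) / 1000000 ))
          else
              perl -MTime::HiRes=time -e 'printf "%d\n", time() * 1000'
          fi
      }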

      Jun Rao added a comment -

      Patch v2 doesn't apply after kafka-306 is committed. Could you rebase?

      John Fung added a comment -

      Thanks Jun. Uploaded kafka-380-v3.patch with the following changes:

      1. Rebased from the 0.8 branch after KAFKA-306 was checked in.
      2. Fixed an issue running the test on MacOS

      Jun Rao added a comment -

      Thanks for patch v3. Shouldn't we set invoke_failures to true by default? However, when I do that, the test seems to hang after the following:

      2012-07-11 09:57:54 ---------------------------------------
      2012-07-11 09:57:54 leader re-election latency : 36 ms
      2012-07-11 09:57:54 ---------------------------------------
      2012-07-11 09:57:54 starting console consumer
      2012-07-11 09:57:54 sleeping for 5s
      2012-07-11 09:57:59 starting producer performance

      John Fung added a comment -

      Thanks Jun for reviewing. The test is hanging due to the following exception thrown by ProducerPerformance:

      [2012-07-10 15:38:31,510] INFO Beging shutting down ProducerSendThread (kafka.producer.async.ProducerSendThread)
      [2012-07-10 15:39:01,491] INFO Disconnecting from 127.0.0.1:9093 (kafka.producer.SyncProducer)
      [2012-07-10 15:39:01,491] INFO Disconnecting from 127.0.0.1:9093 (kafka.producer.SyncProducer)
      [2012-07-10 15:39:01,491] INFO Disconnecting from 127.0.0.1:9093 (kafka.producer.SyncProducer)
      [2012-07-10 15:39:01,491] INFO Disconnecting from 127.0.0.1:9093 (kafka.producer.SyncProducer)
      [2012-07-10 15:39:01,495] ERROR Error in handling batch of 1 events (kafka.producer.async.ProducerSendThread)
      java.net.SocketTimeoutException
      at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
      at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
      at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
      at kafka.utils.Utils$.read(Utils.scala:603)
      at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
      at kafka.network.Receive$class.readCompletely(Transmission.scala:55)
      at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
      at kafka.network.BlockingChannel.receive(BlockingChannel.scala:92)
      at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:78)
      at kafka.producer.SyncProducer.doSend(SyncProducer.scala:76)
      at kafka.producer.SyncProducer.send(SyncProducer.scala:115)
      at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:76)
      at kafka.producer.BrokerPartitionInfo.getBrokerPartitionInfo(BrokerPartitionInfo.scala:45)
      at kafka.producer.async.DefaultEventHandler.kafka$producer$async$DefaultEventHandler$$getPartitionListForTopic(DefaultEventHandler.scala:129)
      at kafka.producer.async.DefaultEventHandler$$anonfun$partitionAndCollate$1.apply(DefaultEventHandler.scala:95)
      at kafka.producer.async.DefaultEventHandler$$anonfun$partitionAndCollate$1.apply(DefaultEventHandler.scala:94)
      at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
      at scala.collection.immutable.List.foreach(List.scala:45)
      at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:44)
      at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:42)
      at kafka.producer.async.DefaultEventHandler.partitionAndCollate(DefaultEventHandler.scala:94)
      at kafka.producer.async.DefaultEventHandler.dispatchSerializedData(DefaultEventHandler.scala:65)
      at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:49)
      at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:96)
      at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:82)
      at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:60)
      at scala.collection.immutable.Stream.foreach(Stream.scala:254)
      at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:59)
      at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:37)

      John Fung added a comment -

      Uploaded kafka-380-v4.patch with the following changes:

      1. Rebased onto the latest 0.8 branch
      2. Fixed an issue running the test on MacOS
      3. Added validation for the re-election latency

      Neha Narkhede added a comment -

      1. We can probably get rid of the following information from the test output -

      2012-07-26 09:21:47 server_shutdown_unix_timestamp : 1343319697
      2012-07-26 09:21:47 server_shutdown_unix_timestamp_ms : 743
      2012-07-26 09:21:47 elected_leader_unix_timestamp : 1343319697
      2012-07-26 09:21:47 elected_leader_unix_timestamp_ms : 794
      2012-07-26 09:21:47 full_elected_leader_unix_timestamp : 1343319697.794
      2012-07-26 09:21:47 full_server_shutdown_unix_timestamp : 1343319697.743

      2. Can you make the test output also get logged to a file called test-output.log? This will be helpful for debugging. (One way to do this is sketched after this comment.)

      Other than that, this patch looks good.
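
      A minimal sketch of item 2, assuming a bash test driver; the exact placement of the redirection is hypothetical, only the file name test-output.log comes from the comment above:

      # Mirror everything the test prints (stdout and stderr) to test-output.log
      # while still showing it on the console; placed near the top of the script.
      TEST_OUTPUT_LOG="test-output.log"
      exec > >(tee -a "${TEST_OUTPUT_LOG}") 2>&1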

      John Fung added a comment -

      Hi Neha,

      Thanks for reviewing. Uploaded kafka-380-v5.patch which has the changes suggested.

      Neha Narkhede added a comment -

      +1.

      Neha Narkhede added a comment -

      Thanks for the patch, John!


        People

        • Assignee:
          John Fung
        • Reporter:
          John Fung
        • Votes:
          0
        • Watchers:
          4
