Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-3694

Transient system test failure ReplicationTest.test_replication_with_broker_failure.security_protocol

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0.0
    • Component/s: system tests
    • Labels:
      None

      Description

      We've seen this failure in several recent builds:

      ====================================================================================================
      test_id:    2016-05-10--001.kafkatest.tests.core.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=PLAINTEXT.failure_mode=hard_bounce.broker_type=leader
      status:     FAIL
      run time:   2 minutes 1.184 seconds
      
      
          Kafka server didn't finish startup
      Traceback (most recent call last):
        File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 106, in run_all_tests
          data = self.run_single_test()
        File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 162, in run_single_test
          return self.current_test_context.function(self.current_test)
        File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/mark/_mark.py", line 331, in wrapper
          return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
        File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/replication_test.py", line 157, in test_replication_with_broker_failure
          self.run_produce_consume_validate(core_test_action=lambda: failures[failure_mode](self, broker_type))
        File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 79, in run_produce_consume_validate
          raise e
      TimeoutError: Kafka server didn't finish startup
      

      After some investigation, the problem seems to be caused by an unexpected partition leader change which is triggered proactively by the controller when the preferred leader becomes alive again. The test currently assumes that it is safe to restart the broker as soon as it observes a leadership change since this is typically caused by a Zk session timeout. However, in this case, the session hasn't actually expired when the leadership change occurs. So after starting up, the broker sees its brokerId still registered and immediately shuts down, which causes the test failure above. To fix the problem, we should probably have a stronger check to ensure that the broker has actually been deregesitered from Zk prior to restarting.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                hachikuji Jason Gustafson
                Reporter:
                hachikuji Jason Gustafson
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: