Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-4673

Python VerifiableConsumer service has thread-safety bug for event_handlers

    XMLWordPrintableJSON

    Details

      Description

      From http://confluent-kafka-0-10-1-system-test-results.s3-us-west-2.amazonaws.com/2017-01-17--001.1484653357--apache--0.10.1--d436fa2/report.html

      ====================================================================================================
      test_id: 2017-01-17--001.kafkatest.tests.client.consumer_test.OffsetValidationTest.test_consumer_bounce.clean_shutdown=True.bounce_mode=rolling
      status: FAIL
      run time: 1 minute 42.663 seconds

      dictionary changed size during iteration
      Traceback (most recent call last):
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/tests/runner.py", line 106, in run_all_tests
      data = self.run_single_test()
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/tests/runner.py", line 162, in run_single_test
      return self.current_test_context.function(self.current_test)
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/mark/_mark.py", line 331, in wrapper
      return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/client/consumer_test.py", line 137, in test_consumer_bounce
      self.await_all_members(consumer)
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/verifiable_consumer_test.py", line 83, in await_all_members
      self.await_members(consumer, self.num_consumers)
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/verifiable_consumer_test.py", line 80, in await_members
      err_msg="Consumers failed to join in a reasonable amount of time")
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.3-py2.7.egg/ducktape/utils/util.py", line 31, in wait_until
      if condition():
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/tests/verifiable_consumer_test.py", line 78, in <lambda>
      wait_until(lambda: len(consumer.joined_nodes()) == num_consumers,
      File "/var/lib/jenkins/workspace/system-test-kafka-0.10.1/kafka/tests/kafkatest/services/verifiable_consumer.py", line 317, in joined_nodes
      return [handler.node for handler in self.event_handlers.itervalues()
      RuntimeError: dictionary changed size during iteration

      It looks like the background thread is incorrectly inserting elements into self.event_handlers during this iteration. This is the first time I've seen this, so I suspect it's rarely hit. Looking at the code, most access is protected by a lock except for the additions at the beginning of the _worker method. Looking through the stacktrace, this looks like the likely culprit since we start the consumer and immediately start calling `await_all_members`, which ultimately iterates over that list. If that call happens faster than the thread gets started we could hit a bad interleaving.

      Fix is probably easy – just add the proper locking around use of that object.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                ewencp Ewen Cheslack-Postava
                Reporter:
                ewencp Ewen Cheslack-Postava
                Reviewer:
                Ewen Cheslack-Postava
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - 1h
                  1h
                  Remaining:
                  Remaining Estimate - 1h
                  1h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified