Kafka
  1. Kafka
  2. KAFKA-567

Replication Data Loss in Mirror Maker Bouncing testcase

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: core
    • Labels:

      Description

      • Test Description:
        1. Start a 3-broker cluster as source
        2. Start a 3-broker cluster as target
        3. Start 1 instance of Mirror Maker to replicate data from source to target
        4. While producer is sending data into source cluster, stop Mirror Maker with "kill -15". Start Mirror Maker again after 1 second.
        5. Start a consumer to consume data from target cluster.
        6. Compare the MessageID in the data between producer log and consumer log.
      • To reproduce this issue, please do the followings:

      1. Download the latest 0.8 branch
      2. Apply the patch attached to this JIRA
      3. Build kafka by running "./sbt update package"
      4. Execute the test in directory "system_test" : "python -B system_test_runner.py"

      • The test result may look like the following:

      _test_case_name : testcase_5002
      _test_class_name : MirrorMakerTest
      arg : bounce_leader : false
      arg : bounce_mirror_maker : true
      arg : message_producing_free_time_sec : 15
      arg : num_iteration : 1
      arg : num_messages_to_produce_per_producer_call : 50
      arg : num_partition : 1
      arg : replica_factor : 3
      arg : sleep_seconds_between_producer_calls : 1
      validation_status :
      Log segment checksum matching across all replicas : FAILED
      Unique messages from consumer on [test_1] : 355
      Unique messages from producer on [test_1] : 400
      Validate for data matched on topic [test_1] : FAILED

      • Attached a tar file for the system test output log, the brokers' log4j files and data log segment files.
      • There are no unusual Exception / Error found in the logs. However, there are consistently data loss in this Mirror Maker bouncing test case. Not sure if this is related to KAFKA-552.
      1. system_test_1349971807_testcase_5002.tar
        2.01 MB
        John Fung
      2. mirror_maker_12.log
        219 kB
        John Fung
      3. kafka-mirror-maker-bouncing-data-loss.patch
        18 kB
        John Fung
      4. KAFKA-567-mirrormakershutdown-v1.patch
        2 kB
        Joel Koshy
      5. kafka-567.patch
        0.6 kB
        Jun Rao

        Activity

        Hide
        Jun Rao added a comment -

        Took a quick look.

        Log in source broker:
        system_test_1349971807_testcase_5002 jrao$ ls kafka_server_4_logs/test_1-0/*.log
        kafka_server_4_logs/test_1-0/00000000000000000000.log kafka_server_4_logs/test_1-0/00000000000000000140.log kafka_server_4_logs/test_1-0/00000000000000000280.log
        kafka_server_4_logs/test_1-0/00000000000000000020.log kafka_server_4_logs/test_1-0/00000000000000000160.log kafka_server_4_logs/test_1-0/00000000000000000300.log
        kafka_server_4_logs/test_1-0/00000000000000000040.log kafka_server_4_logs/test_1-0/00000000000000000180.log kafka_server_4_logs/test_1-0/00000000000000000320.log
        kafka_server_4_logs/test_1-0/00000000000000000060.log kafka_server_4_logs/test_1-0/00000000000000000200.log kafka_server_4_logs/test_1-0/00000000000000000340.log
        kafka_server_4_logs/test_1-0/00000000000000000080.log kafka_server_4_logs/test_1-0/00000000000000000220.log kafka_server_4_logs/test_1-0/00000000000000000360.log
        kafka_server_4_logs/test_1-0/00000000000000000100.log kafka_server_4_logs/test_1-0/00000000000000000240.log kafka_server_4_logs/test_1-0/00000000000000000380.log
        kafka_server_4_logs/test_1-0/00000000000000000120.log kafka_server_4_logs/test_1-0/00000000000000000260.log

        Log in target broker
        system_test_1349971807_testcase_5002 jrao$ ls kafka_server_7_logs/test_1-0/*.log
        kafka_server_7_logs/test_1-0/00000000000000000000.log kafka_server_7_logs/test_1-0/00000000000000000185.log kafka_server_7_logs/test_1-0/00000000000000000312.log
        kafka_server_7_logs/test_1-0/00000000000000000100.log kafka_server_7_logs/test_1-0/00000000000000000235.log
        kafka_server_7_logs/test_1-0/00000000000000000155.log kafka_server_7_logs/test_1-0/00000000000000000262.log

        It seems that the test didn't give mirror maker enough time to catch up data after bouncing. Could you change the test to wait a bit longer after bounce?

        Show
        Jun Rao added a comment - Took a quick look. Log in source broker: system_test_1349971807_testcase_5002 jrao$ ls kafka_server_4_logs/test_1-0/*.log kafka_server_4_logs/test_1-0/00000000000000000000.log kafka_server_4_logs/test_1-0/00000000000000000140.log kafka_server_4_logs/test_1-0/00000000000000000280.log kafka_server_4_logs/test_1-0/00000000000000000020.log kafka_server_4_logs/test_1-0/00000000000000000160.log kafka_server_4_logs/test_1-0/00000000000000000300.log kafka_server_4_logs/test_1-0/00000000000000000040.log kafka_server_4_logs/test_1-0/00000000000000000180.log kafka_server_4_logs/test_1-0/00000000000000000320.log kafka_server_4_logs/test_1-0/00000000000000000060.log kafka_server_4_logs/test_1-0/00000000000000000200.log kafka_server_4_logs/test_1-0/00000000000000000340.log kafka_server_4_logs/test_1-0/00000000000000000080.log kafka_server_4_logs/test_1-0/00000000000000000220.log kafka_server_4_logs/test_1-0/00000000000000000360.log kafka_server_4_logs/test_1-0/00000000000000000100.log kafka_server_4_logs/test_1-0/00000000000000000240.log kafka_server_4_logs/test_1-0/00000000000000000380.log kafka_server_4_logs/test_1-0/00000000000000000120.log kafka_server_4_logs/test_1-0/00000000000000000260.log Log in target broker system_test_1349971807_testcase_5002 jrao$ ls kafka_server_7_logs/test_1-0/*.log kafka_server_7_logs/test_1-0/00000000000000000000.log kafka_server_7_logs/test_1-0/00000000000000000185.log kafka_server_7_logs/test_1-0/00000000000000000312.log kafka_server_7_logs/test_1-0/00000000000000000100.log kafka_server_7_logs/test_1-0/00000000000000000235.log kafka_server_7_logs/test_1-0/00000000000000000155.log kafka_server_7_logs/test_1-0/00000000000000000262.log It seems that the test didn't give mirror maker enough time to catch up data after bouncing. Could you change the test to wait a bit longer after bounce?
        Hide
        Neha Narkhede added a comment -

        We should probably increase time outs after knowing why the test is running slower. Increasing timeouts can hide performance regressions.

        Show
        Neha Narkhede added a comment - We should probably increase time outs after knowing why the test is running slower. Increasing timeouts can hide performance regressions.
        Hide
        John Fung added a comment -

        Uploaded mirror_maker_12.log for a similar test but with the following changes:

        11. After Mirror Maker is bounced and wait 15 seconds, Producer (sending data in the source cluster) is stopped. Then wait for 60 seconds more to make sure that Mirror Maker is able to catch up.

        12. kafka.consumer.ConsumerIterator is configured to be at TRACE level in log4j.properties

        13. kafka.producer.Producer is also configured to be at TRACE level.

        14. The attached mirror maker log file shows that there are 400 messages each showing in the TRACE level messages which is matching the Producer data count but Target consumer is still missing some messages.

        Show
        John Fung added a comment - Uploaded mirror_maker_12.log for a similar test but with the following changes: 11. After Mirror Maker is bounced and wait 15 seconds, Producer (sending data in the source cluster) is stopped. Then wait for 60 seconds more to make sure that Mirror Maker is able to catch up. 12. kafka.consumer.ConsumerIterator is configured to be at TRACE level in log4j.properties 13. kafka.producer.Producer is also configured to be at TRACE level. 14. The attached mirror maker log file shows that there are 400 messages each showing in the TRACE level messages which is matching the Producer data count but Target consumer is still missing some messages.
        Hide
        Jun Rao added a comment -

        Attach a patch. The problem is that async producer didn't drain the last batch of events on close, a bug introduced during producer refactoring. Now, the test has no data loss.

        Show
        Jun Rao added a comment - Attach a patch. The problem is that async producer didn't drain the last batch of events on close, a bug introduced during producer refactoring. Now, the test has no data loss.
        Hide
        John Fung added a comment -

        Thanks Jun for the patch. The fix is tested with the latest 0.8 branch (r.1397616) by apply both your patch and the patch to reproduce the issue. The following is the test result: (Please ignore the checksum validating failure for the time being)

        _test_case_name : testcase_5002
        _test_class_name : MirrorMakerTest
        arg : bounce_leader : false
        arg : bounce_mirror_maker : true
        arg : message_producing_free_time_sec : 15
        arg : num_iteration : 1
        arg : num_messages_to_produce_per_producer_call : 50
        arg : num_partition : 1
        arg : replica_factor : 3
        arg : sleep_seconds_between_producer_calls : 1
        validation_status :
        Log segment checksum matching across all replicas : FAILED
        Unique messages from consumer on [test_1] : 400
        Unique messages from producer on [test_1] : 400
        Validate for data matched on topic [test_1] : PASSED

        Show
        John Fung added a comment - Thanks Jun for the patch. The fix is tested with the latest 0.8 branch (r.1397616) by apply both your patch and the patch to reproduce the issue. The following is the test result: (Please ignore the checksum validating failure for the time being) _test_case_name : testcase_5002 _test_class_name : MirrorMakerTest arg : bounce_leader : false arg : bounce_mirror_maker : true arg : message_producing_free_time_sec : 15 arg : num_iteration : 1 arg : num_messages_to_produce_per_producer_call : 50 arg : num_partition : 1 arg : replica_factor : 3 arg : sleep_seconds_between_producer_calls : 1 validation_status : Log segment checksum matching across all replicas : FAILED Unique messages from consumer on [test_1] : 400 Unique messages from producer on [test_1] : 400 Validate for data matched on topic [test_1] : PASSED
        Hide
        Joel Koshy added a comment -

        +1 for kafka-567.patch

        Also, right now mirror maker does not always shutdown cleanly. E.g., if you want to use consumer.timeout.ms (and avoid having to do an external wait for mirroring to finish). John, I'll attach a patch for that as well if it helps the testing framework.

        Show
        Joel Koshy added a comment - +1 for kafka-567.patch Also, right now mirror maker does not always shutdown cleanly. E.g., if you want to use consumer.timeout.ms (and avoid having to do an external wait for mirroring to finish). John, I'll attach a patch for that as well if it helps the testing framework.
        Hide
        Neha Narkhede added a comment -

        +1 on KAFKA-567.patch, good catch !

        Show
        Neha Narkhede added a comment - +1 on KAFKA-567 .patch, good catch !
        Hide
        Jun Rao added a comment -

        Thanks for the reviews. Committed to 0.8.

        Show
        Jun Rao added a comment - Thanks for the reviews. Committed to 0.8.

          People

          • Assignee:
            Jun Rao
            Reporter:
            John Fung
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development