KAFKA-514

Replication with Leader Failure Test: Log segment files checksum mismatch

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None

      Description

      Test Description:

      1. Produce and consume messages to 1 topic and 3 partitions.
      2. This test sends 10 messages every 2 sec to 3 replicas.
      3. At the end, the test verifies the log size and contents, and uses a consumer to verify that there is no message loss.

      The issue:
      When the leader is terminated by a controlled failure (kill -15), the resulting log segment file sizes do not all match. The mismatched log segment occurs in one of the partitions of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configuration:

      • zookeeper: 1-node (local)
      • brokers: 3-node cluster (all local)
      • replication factor: 3
      • no. of topics: 1
      • no. of partitions: 2
      • iterations of leader failure: 1
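
The controlled failure described above (kill -15, i.e. SIGTERM) can be sketched in Python as follows. This is a minimal illustration, not the system test's actual code: the helper names `controlled_stop` and `demo` are hypothetical, and a sleeping child process stands in for the leader broker. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def controlled_stop(proc, timeout=30.0):
    """Send SIGTERM (the signal `kill -15` sends) and wait for the process to exit."""
    proc.send_signal(signal.SIGTERM)
    # On POSIX, a process killed by a signal reports the negated signal number.
    return proc.wait(timeout=timeout)


def demo():
    # Hypothetical stand-in for a broker; a real test would target the leader's process.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    return controlled_stop(proc)
```

Unlike SIGKILL, SIGTERM can be handled by the target process, so a broker stopped this way has a chance to run its shutdown path before exiting.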

      Remarks:

      • It is rarely reproducible if the no. of partitions is 1.
      • Even though the file checksums do not match, the no. of messages in the producer and consumer logs is equal.

      Test result (shown with log file checksum):

      broker-1 :
      test_1-0/00000000000000000000.kafka => 1690639555
      test_1-1/00000000000000000000.kafka => 4068655384 <<<< not matching across all replicas

      broker-2 :
      test_1-0/00000000000000000000.kafka => 1690639555
      test_1-1/00000000000000000000.kafka => 4068655384 <<<< not matching across all replicas

      broker-3 :
      test_1-0/00000000000000000000.kafka => 1690639555
      test_1-1/00000000000000000000.kafka => 3530842923 <<<< not matching across all replicas
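
A comparison like the one above could be scripted roughly as follows. This is a sketch under assumptions: the report does not say which checksum tool produced these values, so `zlib.crc32` is only a stand-in and will not necessarily reproduce them, and the function names are hypothetical. The `reported` dictionary simply restates the checksums listed above.

```python
import zlib
from pathlib import Path


def segment_checksums(log_dir):
    """Map each '<topic-partition>/<segment>.kafka' file under log_dir to a CRC32.

    zlib.crc32 is an assumption; the tool used by the test is not stated.
    """
    root = Path(log_dir)
    return {
        p.relative_to(root).as_posix(): zlib.crc32(p.read_bytes()) & 0xFFFFFFFF
        for p in sorted(root.glob("*/*.kafka"))
    }


def mismatched_segments(checksums_by_broker):
    """Return the segments whose checksum differs (or is absent) on some broker."""
    all_segments = set()
    for sums in checksums_by_broker.values():
        all_segments.update(sums)
    result = {}
    for seg in sorted(all_segments):
        per_broker = {b: sums.get(seg) for b, sums in checksums_by_broker.items()}
        if len(set(per_broker.values())) > 1:
            result[seg] = per_broker
    return result


# Checksums as reported in this issue.
reported = {
    "broker-1": {"test_1-0/00000000000000000000.kafka": 1690639555,
                 "test_1-1/00000000000000000000.kafka": 4068655384},
    "broker-2": {"test_1-0/00000000000000000000.kafka": 1690639555,
                 "test_1-1/00000000000000000000.kafka": 4068655384},
    "broker-3": {"test_1-0/00000000000000000000.kafka": 1690639555,
                 "test_1-1/00000000000000000000.kafka": 3530842923},
}
```

Calling `mismatched_segments(reported)` flags only the test_1-1 segment, where broker-3 differs from the other two replicas.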

      Errors:
      The following error is found in the terminated leader:

      [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
      [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": " 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
      kafka.common.KafkaException: End index must be segment list size - 1
      at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
      at kafka.log.Log.truncateTo(Log.scala:471)
      at kafka.cluster.Partition.makeFollower(Partition.scala:171)
      at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
      at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
      at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
      at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
      at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
      at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
      at scala.collection.Iterator$class.foreach(Iterator.scala:631)
      at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
      at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
      at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
      at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
      at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
      at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
      at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
      at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
      at java.lang.Thread.run(Thread.java:662)

      1. testcase_2.tar
        10 kB
        John Fung
      2. system_test_output_archive.tar.gz
        81 kB
        John Fung
      3. kafka-514-reproduce-issue.patch
        330 kB
        John Fung
      4. kafka-514_v1.patch
        2 kB
        Jun Rao
      5. kafka-514_v2.patch
        5 kB
        Jun Rao

        Activity

        John Fung added a comment - edited

        This issue can be reproduced as follows:

        1. Download the latest 0.8 branch
        2. Apply kafka-502-v4.patch
        3. Under directory <kafka_home>, execute "./sbt update package" to build Kafka
        4. Untar testcase_2.tar to <kafka_home>/system_test/replication_testsuite/
        5. Modify <kafka_home>/system_test/testcase_to_run.json from "testcase_1" to "testcase_2"
        6. Under directory <kafka_home>/system_test, execute "python -B system_test_runner.py"
        7. The main test framework console output, broker logs, and broker data log segment files are archived in the attached file system_test_output_archive.tar.gz.

        In this specific test run, there is a log segment file missing in broker-2:

        broker-1 :
        test_1-0/00000000000000000000.kafka => 4201569950
        test_1-0/00000000000000102510.kafka => 1868104866
        test_1-0/00000000000000205020.kafka => 1753379349
        test_1-0/00000000000000307530.kafka => 1518305117
        test_1-0/00000000000000410040.kafka => 3676899141 <<<< not matching across all replicas

        broker-2 :
        test_1-0/00000000000000000000.kafka => 4201569950
        test_1-0/00000000000000102510.kafka => 1868104866
        test_1-0/00000000000000205020.kafka => 1753379349
        test_1-0/00000000000000307530.kafka => 1518305117

        broker-3 :
        test_1-0/00000000000000000000.kafka => 4201569950
        test_1-0/00000000000000102510.kafka => 1868104866
        test_1-0/00000000000000205020.kafka => 1753379349
        test_1-0/00000000000000307530.kafka => 1518305117
        test_1-0/00000000000000410040.kafka => 3676899141 <<<< not matching across all replicas
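
To spot a missing segment file like the one in this run, the per-broker file listings can be compared against their union. The following is a minimal sketch; the function name `missing_segments` is hypothetical, and `listing` restates the file names shown above.

```python
def missing_segments(segments_by_broker):
    """Return, per broker, the segment files that exist on some other broker but not on it."""
    union = set().union(*segments_by_broker.values())
    missing = {}
    for broker, segments in segments_by_broker.items():
        absent = sorted(union - set(segments))
        if absent:
            missing[broker] = absent
    return missing


# Segment files of partition test_1-0 as listed in this comment.
common = [
    "test_1-0/00000000000000000000.kafka",
    "test_1-0/00000000000000102510.kafka",
    "test_1-0/00000000000000205020.kafka",
    "test_1-0/00000000000000307530.kafka",
]
last = "test_1-0/00000000000000410040.kafka"
listing = {
    "broker-1": common + [last],
    "broker-2": common,
    "broker-3": common + [last],
}
```

Here `missing_segments(listing)` reports that broker-2 lacks the last segment, while broker-1 and broker-3 hold the same set of files.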

        John Fung added a comment -

        Uploaded kafka-514-reproduce-issue.patch to reproduce the issue:

        1. Download the latest 0.8 branch
        2. Apply kafka-514-reproduce-issue.patch
        3. Under directory <kafka_home>, execute "./sbt update package" to build Kafka
        4. Under directory <kafka_home>/system_test, execute "python -B system_test_runner.py"

        Jun Rao added a comment -

        This seems to be the same problem as in kafka-525, which is supposed to be fixed in kafka-42. Adding a temporary patch to fix this specific issue. Could you check whether this fixes the issue?

        John Fung added a comment -

        Thanks Jun for the fix.

        • This testcase consistently failed before your fix.
        • After applying the fix:
          • the testcase failed twice and passed once (with the full metrics.json mbeans launched)
          • the testcase passed twice in a row (with fewer mbeans specified in metrics.json)
        Jun Rao added a comment -

        Attached patch v2 (includes the v1 changes). This is just a temporary fix for kafka-551. Now the system test passes for me. Could you give it a try?

        John Fung added a comment -

        Thanks Jun for patch v2. The system test is now passing consistently with the original full metrics.json.

        Jun Rao added a comment -

        Fixed in kafka-551.


          People

          • Assignee: Unassigned
          • Reporter: John Fung
          • Votes: 0
          • Watchers: 2
