Kafka / KAFKA-573

System Test : Leader Failure Log Segment Checksum Mismatched When request-num-acks is 1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None

      Description

      • Test Description:

      1. Start a 3-broker cluster as source
      2. Send messages to source cluster
      3. Find leader and terminate it (kill -15)
      4. Start the broker again
      5. Start a consumer to consume data
      6. Compare the MessageID in the data between producer log and consumer log.

      • Issue: There will be data loss if request-num-acks is set to 1.

      • To reproduce this issue, do the following:

      1. Download the latest 0.8 branch
      2. Apply the patch attached to this JIRA
      3. Build Kafka by running "./sbt update package"
      4. Execute the test in directory "system_test" : "python -B system_test_runner.py"
      5. This test will execute testcase_2 with the following settings:
      Replica factor : 3
      No. of partitions : 1
      No. of bouncing : 1

      Attachments

      1. acks1_leader_failure_data_loss.tar.gz (708 kB, John Fung)
      2. kafka-573.patch (9 kB, Jun Rao)
      3. kafka-573-reproduce-issue.patch (38 kB, John Fung)

        Activity

        John Fung added a comment -

        Here is some manual verification of the differences in the log segment files:

        • Merging the log segment files for each broker:
        $ for i in `find kafka_server_1_logs/ -name '0*.log' | sort`; do cat $i >> broker-1-merged/00000000000000000000.log; done
        $ for i in `find kafka_server_2_logs/ -name '0*.log' | sort`; do cat $i >> broker-2-merged/00000000000000000000.log; done
        $ for i in `find kafka_server_3_logs/ -name '0*.log' | sort`; do cat $i >> broker-3-merged/00000000000000000000.log; done

        • Verify the checksums of the merged log segments; they differ:
        $ cksum broker-1-merged/00000000000000000000.log
        1742950004 1638036 broker-1-merged/00000000000000000000.log
        $ cksum broker-2-merged/00000000000000000000.log
        2050258314 1639080 broker-2-merged/00000000000000000000.log
        $ cksum broker-3-merged/00000000000000000000.log
        1802214049 1639080 broker-3-merged/00000000000000000000.log

        • Dump the log segments of the merged files:
        $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments broker-1-merged/00000000000000000000.log > broker-1-dump-log-segment.log
        $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments broker-2-merged/00000000000000000000.log > broker-2-dump-log-segment.log
        $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments broker-3-merged/00000000000000000000.log > broker-3-dump-log-segment.log

        • Diff the dumped log segments of broker-1 and broker-2:

        $ diff broker-1-dump-log-segment.log broker-2-dump-log-segment.log

        1c1
        < Dumping broker-1-merged/00000000000000000000.log
        ---
        > Dumping broker-2-merged/00000000000000000000.log
        113a114
        > offset: 112 isvalid: true payloadsize: 500 magic: 2 compresscodec: NoCompressionCodec crc: 2581499653
        168a170
        > offset: 168 isvalid: true payloadsize: 500 magic: 2 compresscodec: NoCompressionCodec crc: 3880215630
        387a390
        > offset: 389 isvalid: true payloadsize: 500 magic: 2 compresscodec: NoCompressionCodec crc: 3744939326
        2734d2736
        < offset: 2737 isvalid: true payloadsize: 500 magic: 2 compresscodec: NoCompressionCodec crc: 314900536

        John Fung added a comment -

        If ack is set to 1, we don't actually guarantee no data loss: when the client receives an ack, the data is only guaranteed to be on the leader, not necessarily on the other replicas. So if a leader is bounced, some acked data can be lost.

        Nevertheless, the merged checksums should still match across all replicas.
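        For reference, in the 0.8 producer the acknowledgement level discussed here corresponds to the request.required.acks property (the system test's request-num-acks parameter); a sketch of the relevant setting, with semantics as I understand the 0.8 config:

        ```properties
        # 0: don't wait for any acknowledgement
        # 1: wait for the leader to write the message (acked data can still be
        #    lost if the leader fails before followers replicate it)
        # -1: wait for all in-sync replicas to acknowledge
        request.required.acks=1
        ```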

        Jun Rao added a comment -

        Attached a patch. There are 2 problems; the first is the more severe. We recently changed FileMessageSet to remove the mutable flag. As a result, every time a new FileMessageSet is created, the constructor sets the file channel's position to the end of the file. What happens is that while a file channel is being appended to with newly produced data, the file position is moved by a FileMessageSet created for a fetch request. Since they are not properly synchronized, occasionally a message in the log is overwritten. The second issue is in ByteBufferMessageSet.writeTo: we try to reset the buffer position after writing the buffer's data to the channel, but since there is no guarantee that the whole buffer will be written to the channel in a single write, resetting the buffer position can cause incorrect bytes to be written to the channel.
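        The first problem can be demonstrated outside Kafka with a minimal Java NIO sketch (not the actual Kafka code; the file name and two-step sequence are illustrative): when an appender and a read-side view share one FileChannel, resetting the shared position makes the next intended append overwrite earlier bytes.

        ```java
        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        public class PositionRace {
            // Simulate the race: an appender and a read-path view share one FileChannel.
            static String demo() throws IOException {
                Path log = Files.createTempFile("segment", ".log");
                try (FileChannel ch = FileChannel.open(log,
                        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                    ch.write(ByteBuffer.wrap("AAAA".getBytes())); // append 4 bytes; position is now 4
                    // A read-path view constructed over the same channel moves the
                    // shared position (what the flag-less constructor effectively did):
                    ch.position(0);
                    ch.write(ByteBuffer.wrap("BB".getBytes()));   // intended append overwrites instead
                }
                String content = new String(Files.readAllBytes(log));
                Files.delete(log);
                return content;
            }

            public static void main(String[] args) throws IOException {
                System.out.println(demo()); // BBAA rather than AAAABB
            }
        }
        ```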

        The patch fixes both issues. The changes are: (1) Added a new flag to the constructor of FileMessageSet to control whether the channel position is set to the end of the file or not. (2) Changed ByteBufferMessageSet.writeTo so that we wait until the whole buffer is written to the channel before resetting the buffer position. (3) Added a few more log statements that I found useful while investigating the issues.
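        The second fix amounts to draining the buffer before touching its position. A minimal Java sketch of that pattern (hypothetical names, not the actual ByteBufferMessageSet code):

        ```java
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.Channels;
        import java.nio.channels.WritableByteChannel;

        public class FullWrite {
            // A single channel.write(buffer) may consume only part of the buffer,
            // so loop until the buffer is drained and only then reset its position.
            static long writeFully(WritableByteChannel channel, ByteBuffer buffer) throws IOException {
                long written = 0;
                while (buffer.hasRemaining()) {
                    written += channel.write(buffer);
                }
                buffer.rewind(); // safe only after the whole buffer has been written
                return written;
            }

            public static void main(String[] args) throws IOException {
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                ByteBuffer buf = ByteBuffer.wrap("hello".getBytes());
                long n = writeFully(Channels.newChannel(sink), buf);
                System.out.println(n + " bytes: " + sink); // 5 bytes: hello
            }
        }
        ```

        Resetting the position before the loop finishes would re-send bytes that were already written, which is exactly the corruption described above.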

        The system test passes now.

        John Fung added a comment -

        Thanks, Jun, for the patch. Applied together with the issue-reproducing patch to the latest 0.8 branch, it is working correctly now.

        John Fung added a comment -

        A full regression test (35 test cases) has been launched with this patch and the following is the summary of the results:

        1. Altogether there are 5 failures, in 5 of the 35 test cases.

        • testcase_0103 : 2 out of 2300 msg lost (checksum matched) - failure case with Ack == 1 in Sync mode (1 topic, 1 partition)
        • testcase_0105 : 2 out of 2300 msg lost (checksum matched) - failure case with Ack == 1 in Async mode (1 topic, 1 partition)
        • testcase_0114 : 11 out of 5100 msg lost (checksum matched) - failure case with Ack == 1 in Async mode (1 topic, 3 partitions)
        • testcase_0117 : 44 out of 5100 msg lost (checksum matched) - failure case with Ack == 1 in Sync mode (1 topic, 3 partitions)
        • testcase_0122 : 1524 out of 4600 msg lost (checksum matched) - failure case with Ack == 1 in Sync mode (2 topics, 3 partitions)

        2. The results suggest that the fix for KAFKA-573 is working well (the minor data loss in the first 4 cases is expected in leader-failure cases when ack == 1).

        3. The result in testcase_0122 could be related to a different issue and a new JIRA will be created to keep track of that.

        Joel Koshy added a comment -

        +1

        This must have been a pain to track down. Minor comment: in FileMessageSet, since we always open it mutable, can you fix the javadoc comment that says it can be opened immutable? Also, the info log reads a bit oddly.

        Neha Narkhede added a comment -

        +1, this one was fun to track down. Reminded me of the FileChannel truncate bug.

        Minor comment - Probably best to delete the info statement -
        info("After changed to position %d with size %d".format(channel.position(), channel.size()))

        Jun Rao added a comment -

        Thanks for the review. Removed the unnecessary log statement and committed to 0.8.


          People

          • Assignee: Jun Rao
          • Reporter: John Fung
          • Votes: 0
          • Watchers: 4
