Kafka / KAFKA-562

Non-failure System Test Log Segment File Checksums mismatched

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None

      Description

      To reproduce this issue:
      1. Download 0.8 branch (reproduced in r1396343)
      2. Apply the patch attached
      3. Build Kafka under <kafka_home> by running "./sbt update package"
      4. In the directory <kafka_home>/system_test, run "python -B system_test_runner.py" and it will run the case "testcase_0002" which will reproduce this issue.
      5. The log segment files will be located in /tmp

        Activity

        John Fung created issue
        John Fung made changes:
        Attachment: kafka-562-reproduce-issue.patch [ 12548475 ]
        John Fung added a comment (edited)

        In this testcase, the observations are as follows:

        1. By comparing the messages produced and consumed, there is no data loss.

        2. The brokers have unequal numbers of log segment files:

        [/tmp] find kafka_server_* -name '*.log' -ls
        20189793 12 -rw-r--r-- 1 jfung eng 10322 Oct  9 14:44 kafka_server_1_logs/test_1-0/00000000000000000000.log
        20189809  4 -rw-r--r-- 1 jfung eng  1537 Oct  9 14:44 kafka_server_1_logs/test_1-0/00000000000000000261.log

        20189797 12 -rw-r--r-- 1 jfung eng 10271 Oct  9 14:44 kafka_server_2_logs/test_1-0/00000000000000000000.log
        20189807 12 -rw-r--r-- 1 jfung eng 10189 Oct  9 14:44 kafka_server_2_logs/test_1-0/00000000000000000201.log
        20189803 12 -rw-r--r-- 1 jfung eng 10293 Oct  9 14:44 kafka_server_2_logs/test_1-0/00000000000000000101.log

        20189798 12 -rw-r--r-- 1 jfung eng 10322 Oct  9 14:44 kafka_server_3_logs/test_1-0/00000000000000000000.log
        20189819  4 -rw-r--r-- 1 jfung eng  1537 Oct  9 14:44 kafka_server_3_logs/test_1-0/00000000000000000261.log

        3. If the corresponding segment files for broker 1, broker 2, ... are concatenated in sequence, the resulting checksums do not match either.
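
        The concatenate-and-checksum comparison in observation 3 can be sketched as follows. This is a hypothetical helper, not part of the system-test harness; the directory paths in the usage comment are taken from the listing above and assume the test has been run.

        ```python
        import hashlib
        from pathlib import Path

        def merged_md5(log_dir):
            """MD5 of a partition's segment files concatenated in offset order.

            Segment filenames are the zero-padded base offset, so a plain
            lexicographic sort of the names equals offset order.
            """
            digest = hashlib.md5()
            for segment in sorted(Path(log_dir).glob("*.log")):
                digest.update(segment.read_bytes())
            return digest.hexdigest()

        # e.g. compare brokers 1-3 under /tmp (paths from the listing above):
        # for b in (1, 2, 3):
        #     print(b, merged_md5(f"/tmp/kafka_server_{b}_logs/test_1-0"))
        ```

        With byte-identical logs the digests would agree even if the segment boundaries differ, since only the concatenated bytes are hashed; here they disagree because the follower's bytes genuinely differ.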

        4. Running the DumpLogSegments tool on each individual file gives the output shown below.

        Broker 1
        ========
        bin/kafka-run-class.sh kafka.tools.DumpLogSegments kafka_server_1_logs/test_1-0/00000000000000000000.log

        Dumping kafka_server_1_logs/test_1-0/00000000000000000000.log

        Starting offset: 0

        offset: 4 isvalid: true payloadsize: 167 magic: 2 compresscodec: GZIPCompressionCodec crc: 2048048444
        offset: 9 isvalid: true payloadsize: 171 magic: 2 compresscodec: GZIPCompressionCodec crc: 594807606
        offset: 14 isvalid: true payloadsize: 172 magic: 2 compresscodec: GZIPCompressionCodec crc: 1696552621
        offset: 19 isvalid: true payloadsize: 172 magic: 2 compresscodec: GZIPCompressionCodec crc: 3794535639
        offset: 24 isvalid: true payloadsize: 172 magic: 2 compresscodec: GZIPCompressionCodec crc: 4167930995
        . . .
        offset: 245 isvalid: true payloadsize: 174 magic: 2 compresscodec: GZIPCompressionCodec crc: 3337681338
        offset: 250 isvalid: true payloadsize: 174 magic: 2 compresscodec: GZIPCompressionCodec crc: 2655434756
        offset: 255 isvalid: true payloadsize: 174 magic: 2 compresscodec: GZIPCompressionCodec crc: 5551624
        offset: 260 isvalid: true payloadsize: 171 magic: 2 compresscodec: GZIPCompressionCodec crc: 53200305

        bin/kafka-run-class.sh kafka.tools.DumpLogSegments kafka_server_1_logs/test_1-0/00000000000000000261.log

        Dumping kafka_server_1_logs/test_1-0/00000000000000000261.log

        Starting offset: 261

        offset: 265 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 1224901688
        offset: 270 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 2027726868
        offset: 275 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 559159044
        offset: 280 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 157990978
        offset: 285 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 2995943272
        offset: 290 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 3964443281
        offset: 295 isvalid: true payloadsize: 173 magic: 2 compresscodec: GZIPCompressionCodec crc: 140848011
        offset: 299 isvalid: true payloadsize: 150 magic: 2 compresscodec: GZIPCompressionCodec crc: 657039729

        Broker 2
        =========

        bin/kafka-run-class.sh kafka.tools.DumpLogSegments kafka_server_2_logs/test_1-0/00000000000000000000.log

        Dumping kafka_server_2_logs/test_1-0/00000000000000000000.log

        Starting offset: 0

        offset: 0 isvalid: true payloadsize: 77 magic: 2 compresscodec: GZIPCompressionCodec crc: 2305854709
        offset: 1 isvalid: true payloadsize: 79 magic: 2 compresscodec: GZIPCompressionCodec crc: 1768470661
        offset: 2 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 657973900
        offset: 3 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 3672345982
        offset: 4 isvalid: true payloadsize: 77 magic: 2 compresscodec: GZIPCompressionCodec crc: 3431890374
        . . .
        offset: 98 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 2479186795
        offset: 99 isvalid: true payloadsize: 79 magic: 2 compresscodec: GZIPCompressionCodec crc: 2127679297
        offset: 100 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 3367058812

        bin/kafka-run-class.sh kafka.tools.DumpLogSegments kafka_server_2_logs/test_1-0/00000000000000000101.log

        Dumping kafka_server_2_logs/test_1-0/00000000000000000101.log

        Starting offset: 101

        offset: 101 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 2061836440
        offset: 102 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 1118186556
        offset: 103 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 374092732
        offset: 104 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 1013512453
        . . .
        offset: 198 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 1754585683
        offset: 199 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 3604597143
        offset: 200 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 1508187619

        bin/kafka-run-class.sh kafka.tools.DumpLogSegments kafka_server_2_logs/test_1-0/00000000000000000201.log

        Dumping kafka_server_2_logs/test_1-0/00000000000000000201.log

        Starting offset: 201

        offset: 201 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 782997024
        offset: 202 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 1935012961
        offset: 203 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 554951891
        offset: 204 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 14519573
        . . .
        offset: 297 isvalid: true payloadsize: 80 magic: 2 compresscodec: GZIPCompressionCodec crc: 1204716196
        offset: 298 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 2682999595
        offset: 299 isvalid: true payloadsize: 81 magic: 2 compresscodec: GZIPCompressionCodec crc: 4284106376

        Jay Kreps added a comment -

        Yeah, this is definitely a real problem that is leading to not all messages being replicated. With the new logical offsets the followers should have messages with offsets 0, 1, 2, 3, 4... as the leader does. But instead the followers have offsets 4, 9, 14, 19, 24..., i.e. only every fifth offset. Not sure of the cause, looking into it.

        Jay Kreps added a comment -

        Okay, this is not a bug exactly, I was mistaken. Here is what is happening:

        The leader receives one message at a time, gzip'd. The follower fetches chunks of multiple gzip'd messages.

        The current logic is that when appending a message set we check if there are any compressed messages. If there are we need to uncompress all messages and re-compress with new offsets assigned. Because the follower is getting chunks of five messages at a time, it is compressing these together. The reason the follower logs are so much smaller is because they are batch compressed.

        Not sure what the best thing to do here is. On one hand it is much nicer if the follower has byte-for-byte identical logs. On the other hand batch compression is a good thing.
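
        Jay's explanation can be illustrated with a small self-contained sketch (synthetic payloads, not the actual test messages): five similar messages gzipped one at a time total more bytes than the same five gzipped as one batch, because each separate gzip stream pays its own header/trailer overhead and cannot exploit redundancy across messages. This is why the follower's batch-compressed segments come out smaller.

        ```python
        import gzip

        # Synthetic stand-ins for the five produced messages.
        messages = [f"message payload number {i}".encode() for i in range(5)]

        # Leader's path in this test: each message arrives alone and is
        # gzipped on its own.
        one_at_a_time = sum(len(gzip.compress(m)) for m in messages)

        # Follower's pre-fix path: a fetched chunk of five messages is
        # decompressed and re-gzipped together as one batch.
        batched = len(gzip.compress(b"".join(messages)))

        print(one_at_a_time, batched)  # batched is the smaller of the two
        ```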

        Jay Kreps added a comment -

        So the proposed fix for this is to special case appends that come from replication and not do any offset assignment, re-compression, or anything else during these. This had already been proposed as a performance improvement. It also ensures that the offset assignment on the leader and followers will match. A patch is attached to KAFKA-557, and I verified that it fixes this system test issue.
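
        The shape of that fix can be sketched conceptually (hypothetical Python, not the actual Scala patch attached to KAFKA-557): the log append path branches on whether the messages came from the replica fetcher, and replica appends keep the leader's offsets and compressed bytes verbatim.

        ```python
        import gzip

        def leader_append(payloads, next_offset):
            """Leader path: assign fresh offsets and (re)compress each entry.
            A hypothetical stand-in for the real offset-assignment logic."""
            entries = []
            for payload in payloads:
                entries.append((next_offset, gzip.compress(payload)))
                next_offset += 1
            return entries

        def append(entries_or_payloads, next_offset, from_replica=False):
            # The KAFKA-557 idea in miniature: appends coming from replication
            # skip offset assignment and re-compression entirely, so follower
            # segments stay byte-for-byte identical to the leader's.
            if from_replica:
                # Already (offset, compressed_bytes) pairs from the leader.
                return entries_or_payloads
            return leader_append(entries_or_payloads, next_offset)
        ```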

        Jay Kreps added a comment -

        This was addressed a while back.

        Jay Kreps made changes:
        Status: Open → Resolved
        Fix Version/s: 0.8
        Resolution: Fixed

          People

          • Assignee: Unassigned
          • Reporter: John Fung
          • Votes: 0
          • Watchers: 2
