KAFKA-372: Consumer doesn't receive all data if there are multiple segment files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

      Description

      This issue happens inconsistently but can be reproduced by following the steps below (repeat step 4 a few times if necessary):

      1. Check out 0.8 branch (currently reproducible with rev. 1352634)

      2. Apply kafka-306-v4.patch

      3. Please note that the log.file.size is set to 10000000 in system_test/broker_failure/config/server_*.properties (small enough to trigger multi segment files)
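
      For reference, the relevant setting in those properties files looks like this (a one-line excerpt; the value is the one given in the step above):

      log.file.size=10000000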

      4. Under the directory <kafka home>/system_test/broker_failure, execute the command:
      $ bin/run-test.sh 20 0

      5. After the test is completed, the result will probably look like the following:

      ========================================================
      no. of messages published : 14000
      producer unique msg rec'd : 14000
      source consumer msg rec'd : 7271
      source consumer unique msg rec'd : 7271
      mirror consumer msg rec'd : 6960
      mirror consumer unique msg rec'd : 6960
      total source/mirror duplicate msg : 0
      source/mirror uniq msg count diff : 311
      ========================================================

      6. Checking the Kafka log files shows that the sum of the segment file sizes in the source cluster equals the sum in the target cluster (a quick check is sketched after the listing below):

      [/tmp] $ find kafka* -name *.kafka -ls

      18620155 9860 -rw-r--r-- 1 jfung eng 10096535 Jun 21 11:09 kafka-source3-logs/test01-0/00000000000000000000.kafka
      18620161 9772 -rw-r--r-- 1 jfung eng 10004418 Jun 21 11:11 kafka-source3-logs/test01-0/00000000000020105286.kafka
      18620160 9776 -rw-r--r-- 1 jfung eng 10008751 Jun 21 11:10 kafka-source3-logs/test01-0/00000000000010096535.kafka
      18620162 4708 -rw-r--r-- 1 jfung eng 4819067 Jun 21 11:11 kafka-source3-logs/test01-0/00000000000030109704.kafka
      19406431 9920 -rw-r--r-- 1 jfung eng 10157685 Jun 21 11:10 kafka-target2-logs/test01-0/00000000000010335039.kafka
      19406429 10096 -rw-r--r-- 1 jfung eng 10335039 Jun 21 11:09 kafka-target2-logs/test01-0/00000000000000000000.kafka
      19406432 10300 -rw-r--r-- 1 jfung eng 10544850 Jun 21 11:11 kafka-target2-logs/test01-0/00000000000020492724.kafka
      19406433 3800 -rw-r--r-- 1 jfung eng 3891197 Jun 21 11:12 kafka-target2-logs/test01-0/00000000000031037574.kafka
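
      A quick way to verify that the totals match (a sketch assuming GNU find and awk; $7 is the size field of find -ls output, and both sums are computed from the listing above):

      [/tmp] $ find kafka-source3-logs -name '*.kafka' -ls | awk '{ sum += $7 } END { print sum }'
      34928771
      [/tmp] $ find kafka-target2-logs -name '*.kafka' -ls | awk '{ sum += $7 } END { print sum }'
      34928771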

      7. If the log.file.size in the target cluster is configured to a value large enough that there is only one data file, the result looks like this:

      ========================================================
      no. of messages published : 14000
      producer unique msg rec'd : 14000
      source consumer msg rec'd : 7302
      source consumer unique msg rec'd : 7302
      mirror consumer msg rec'd : 13750
      mirror consumer unique msg rec'd : 13750
      total source/mirror duplicate msg : 0
      source/mirror uniq msg count diff : -6448
      ========================================================

      8. The log files look like this:

      [/tmp] $ find kafka* -name *.kafka -ls

      18620160 9840 -rw-r--r-- 1 jfung eng 10075058 Jun 21 11:24 kafka-source2-logs/test01-0/00000000000010083679.kafka
      18620155 9848 -rw-r--r-- 1 jfung eng 10083679 Jun 21 11:23 kafka-source2-logs/test01-0/00000000000000000000.kafka
      18620162 4484 -rw-r--r-- 1 jfung eng 4589474 Jun 21 11:26 kafka-source2-logs/test01-0/00000000000030269045.kafka
      18620161 9876 -rw-r--r-- 1 jfung eng 10110308 Jun 21 11:25 kafka-source2-logs/test01-0/00000000000020158737.kafka
      19406429 34048 -rw-r--r-- 1 jfung eng 34858519 Jun 21 11:26 kafka-target3-logs/test01-0/00000000000000000000.kafka

        Activity

        John Fung added a comment -
          • Uploaded a patch with a simplified scenario to reproduce the data loss with multiple segment files.
          • This patch provides a script "run-test-debug.sh" to do the following:
            1. Start 1 broker
            2. Start a modified version of Producer to send 300 messages with a user-specified message string length (500 chars reproduces the issue while 50 chars does not). This producer tags each message with a sequence ID and sends the messages in order: 1, 2, 3, and so on (a minimal sketch of this approach follows the command below).
            3. Start ConsoleConsumer to receive data
          • To reproduce the issue, under <kafka home>/system_test/broker_failure, execute the following command:

        $ bin/run-test-debug.sh 500 (which means each message string is 500 chars long)
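
        For illustration, a minimal sketch of such a sequence-ID producer (this is not the patch's actual code; it is written against the released 0.8 Java producer API, and the broker address, topic name, and padding scheme are assumptions):

        import java.util.Properties;
        import kafka.javaapi.producer.Producer;
        import kafka.producer.KeyedMessage;
        import kafka.producer.ProducerConfig;

        public class SequenceIdProducer {
            public static void main(String[] args) {
                int messageLen = Integer.parseInt(args[0]);  // e.g. 500 or 50

                Properties props = new Properties();
                props.put("metadata.broker.list", "localhost:9092");  // assumed broker
                props.put("serializer.class", "kafka.serializer.StringEncoder");
                Producer<String, String> producer =
                        new Producer<String, String>(new ProducerConfig(props));

                StringBuilder sb = new StringBuilder();
                for (int i = 1; i <= 300; i++) {
                    sb.setLength(0);
                    sb.append(i).append(':');          // leading sequence ID: 1, 2, 3, ...
                    while (sb.length() < messageLen)   // pad to the requested length
                        sb.append('x');
                    producer.send(new KeyedMessage<String, String>("test01", sb.toString()));
                }
                producer.close();
            }
        }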

        The consumer only receives the first 120 messages (verified by checking the output of kafka.tools.DumpLogSegments; a sample invocation follows the summary below).
        ========================================================
        no. of messages published : 300
        producer unique msg rec'd : 300
        source consumer msg rec'd : 120
        source consumer unique msg rec'd : 120
        ========================================================
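
        For reference, a segment can be inspected with DumpLogSegments like this (the --files option is an assumption about the 0.8-era tool's CLI; the path matches the listing below):

        $ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-source1-logs/test01-0/00000000000000000000.kafka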

        The segment files are:

        $ ls -l /tmp/kafka-source1-logs/test01-0/
        -rw-r--r-- 1 jfung wheel 10431 Jun 24 20:59:21 2012 00000000000000000000.kafka
        -rw-r--r-- 1 jfung wheel 10440 Jun 24 20:59:22 2012 00000000000000010431.kafka
        -rw-r--r-- 1 jfung wheel 10440 Jun 24 20:59:23 2012 00000000000000020871.kafka
        -rw-r--r-- 1 jfung wheel 10440 Jun 24 20:59:24 2012 00000000000000031311.kafka
        -rw-r--r-- 1 jfung wheel 10441 Jun 24 20:59:26 2012 00000000000000041751.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:27 2012 00000000000000052192.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:28 2012 00000000000000062652.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:29 2012 00000000000000073112.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:31 2012 00000000000000083572.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:32 2012 00000000000000094032.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:33 2012 00000000000000104492.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:34 2012 00000000000000114952.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:35 2012 00000000000000125412.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:37 2012 00000000000000135872.kafka
        -rw-r--r-- 1 jfung wheel 10460 Jun 24 20:59:38 2012 00000000000000146332.kafka
        -rw-r--r-- 1 jfung wheel 0 Jun 24 20:59:38 2012 00000000000000156792.kafka
        -rw-r--r-- 1 jfung wheel 8 Jun 24 21:00:08 2012 highwatermark

          • However, if the length of each message string is reduced to 50, the issue does not show up:

        $ bin/run-test-debug.sh 50

        The consumer receives all data:
        ========================================================
        no. of messages published : 300
        producer unique msg rec'd : 300
        source consumer msg rec'd : 300
        source consumer unique msg rec'd : 300
        ========================================================

        The segment files are:

        $ ls -l /tmp/kafka-source1-logs/test01-0
        total 64
        -rw-r--r-- 1 jfung wheel 10039 Jun 24 20:29:26 2012 00000000000000000000.kafka
        -rw-r--r-- 1 jfung wheel 10001 Jun 24 20:29:34 2012 00000000000000010039.kafka
        -rw-r--r-- 1 jfung wheel 1752 Jun 24 20:29:36 2012 00000000000000020040.kafka
        -rw-r--r-- 1 jfung wheel 8 Jun 24 20:30:06 2012 highwatermark

        Jun Rao added a comment -

        There were several issues that caused the problem.

        1. Log.nextAppendOffset() calls flush each time. Since this method is called for every produce request, we force a disk flush for every produce request, independent of the flush interval in the broker config. This makes producers very slow (this anti-pattern and the usual fix are sketched after this list).

        2. The default value for MaxFetchWaitMs in consumer config is 3 secs, which is too long.

        3. The script runs the console consumer in the background and waits only 20 secs, which is too short. Instead, it should run the console consumer in the foreground and wait until it finishes (since the consumer has a timeout).
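
        A minimal sketch of the item 1 anti-pattern and the usual fix (illustrative Java, not Kafka's actual Scala code; the class and member names are hypothetical):

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.util.concurrent.atomic.AtomicLong;

        class LogSketch {
            private final FileChannel channel;
            // Fix: track the log end offset in memory as messages are appended,
            // so reading it never requires touching the disk.
            private final AtomicLong logEndOffset;

            LogSketch(FileChannel channel) throws IOException {
                this.channel = channel;
                this.logEndOffset = new AtomicLong(channel.size());
            }

            // Anti-pattern: fsync on every produce request just to learn the
            // durable file size, ignoring the configured flush interval.
            long nextAppendOffsetSlow() throws IOException {
                channel.force(true);  // disk flush on every call -- the bottleneck
                return channel.size();
            }

            long append(ByteBuffer messages) throws IOException {
                int written = channel.write(messages);
                return logEndOffset.addAndGet(written);
            }

            // Fixed version: no disk I/O at all.
            long nextAppendOffset() {
                return logEndOffset.get();
            }
        }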

        Attached patch v1, which fixes items 1 and 2. The test now passes. However, we should address item 3 in the script too; a sample foreground invocation is sketched below.
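
        For item 3, a sketch of running the console consumer in the foreground so the script blocks until the consumer times out (the ZooKeeper address, topic, and timeout value are assumptions; note the absence of a trailing "&"):

        $ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test01 \
            --from-beginning --consumer-timeout-ms 10000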

        John Fung added a comment -

        Thanks Jun. It is working correctly after applying kafka-372-v1.patch.

        Jun Rao added a comment -

        Thanks John for reviewing the patch. Committed to 0.8.


          People

          • Assignee: Unassigned
          • Reporter: John Fung
          • Votes: 0
          • Watchers: 2
