KAFKA-8814: Consumer benchmark test for paused partitions

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: consumer, system tests, tools
    • Labels: None

      Description

      A new performance benchmark and a corresponding ConsumerPerformance tool addition to support the paused partition performance improvement implemented in KAFKA-7548. Before the fix, when the user polled and there were completed fetched records for partitions that were paused, the consumer would throw the data away because those partitions were no longer fetchable. If a partition was later resumed, the data had to be fetched all over again. The fix caches completed fetched records for paused partitions indefinitely so they can potentially be returned once the partition is resumed.

      The Jira issue KAFKA-7548 shows several informal test results based on a number of different paused partition scenarios, but it was suggested that a test in the benchmarks test suite would be ideal to demonstrate the performance improvement. To implement this benchmark we must add a new feature that pauses partitions to ConsumerPerformance, which is used by both the benchmark test suite and the kafka-consumer-perf-test.sh bin script. I added the following parameter:

          val pausedPartitionsOpt = parser.accepts("paused-partitions-percent", "The percentage [0-1] of subscribed " +
            "partitions to pause each poll.")
              .withOptionalArg()
              .describedAs("percent")
              .withValuesConvertedBy(regex("^0(\\.\\d+)?|1\\.0$")) // matches [0-1] with decimals
              .ofType(classOf[Float])
              .defaultsTo(0F)
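
      To make the accepted values concrete, here is a minimal sketch that exercises the same pattern string with plain java.util.regex. The object and method names are hypothetical. Whether the option parser's regex converter applies a whole-string match or a substring search is not stated in this issue, so the sketch shows both and leaves that point to be verified.

```scala
import java.util.regex.Pattern

object PausedPercentPattern {
  // Same pattern string as in the option definition above.
  val pattern: Pattern = Pattern.compile("^0(\\.\\d+)?|1\\.0$")

  // Whole-string match: the entire argument must fit one of the two alternatives.
  def fullMatch(arg: String): Boolean = pattern.matcher(arg).matches()

  // Substring search: true if any part of the argument fits the pattern.
  def partialMatch(arg: String): Boolean = pattern.matcher(arg).find()

  def main(args: Array[String]): Unit = {
    for (arg <- Seq("0", "0.8", "1.0", "1", "1.5", "21.0", "0.5x"))
      println(s"$arg\tfullMatch=${fullMatch(arg)}\tpartialMatch=${partialMatch(arg)}")
  }
}
```

      Under a whole-string match the pattern accepts values such as 0, 0.8 and 1.0 but not a bare 1. Under a substring search, the alternation binds more loosely than the anchors, so inputs such as 21.0 or 0.5x would also pass. Wrapping the alternatives in a group, as in ^(0(\.\d+)?|1\.0)$, would remove that difference.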
      

      This lets the user specify the fraction of partitions (a floating-point value from 0 to 1) to pause on each poll interval. When the value is greater than 0, the tool takes the next n partitions to pause. The summaries below are from kafkatest.benchmarks.core.benchmark_test.Benchmark.test_consumer_throughput, run once on `trunk` and once with the change rebased onto the `2.3.0` tag. The test rotates through pausing 80% of the assigned partitions (5 of 6) each poll interval. Both runs were on my laptop.
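
      The rotation described above can be sketched roughly as follows. This is an illustration only, not the code from the pull request: the object name, both helper functions, and the round-to-nearest rule are assumptions (rounding was picked because it turns 0.8 of 6 partitions into the 5 mentioned above).

```scala
object PausedPartitionRotation {
  // Hypothetical helper: how many partitions to pause for a given fraction.
  // Rounding to the nearest integer is an assumption, not taken from the patch.
  def pauseCount(total: Int, fraction: Double): Int = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be within [0, 1]")
    math.round(total * fraction).toInt
  }

  // Hypothetical helper: take the next `n` partitions starting at index `start`,
  // wrapping around to the beginning of the assignment when the end is reached.
  def nextToPause[A](assigned: IndexedSeq[A], start: Int, n: Int): IndexedSeq[A] =
    if (assigned.isEmpty) IndexedSeq.empty
    else (0 until n).map(i => assigned((start + i) % assigned.size))

  def main(args: Array[String]): Unit = {
    // Plain strings stand in for the consumer's assigned partitions.
    val assigned = (0 until 6).map(p => s"topic-$p")
    val n = pauseCount(assigned.size, 0.8)
    var start = 0
    for (poll <- 1 to 3) {
      val paused = nextToPause(assigned, start, n)
      println(s"poll $poll pauses ${paused.mkString(", ")}")
      // Advance the window so the next poll pauses the following n partitions.
      start = (start + n) % assigned.size
    }
  }
}
```

      In a real consumer loop the selected partitions would be handed to KafkaConsumer.pause and the remaining ones to KafkaConsumer.resume before each poll. Strings are used here only so the sketch runs without a broker.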

      trunk (aa4ba8eee8e6f52a9d80a98fb2530b5bcc1b9a11)

      ================================================================================
      SESSION REPORT (ALL TESTS)
      ducktape version: 0.7.5
      session_id:       2019-08-18--010
      run time:         2 minutes 29.104 seconds
      tests run:        1
      passed:           1
      failed:           0
      ignored:          0
      ================================================================================
      test_id:    kafkatest.benchmarks.core.benchmark_test.Benchmark.test_consumer_throughput.paused_partitions_percent=0.8
      status:     PASS
      run time:   2 minutes 29.048 seconds
      {"records_per_sec": 450207.0953, "mb_per_sec": 42.9351}
      --------------------------------------------------------------------------------
      

      2.3.0

      ================================================================================
      SESSION REPORT (ALL TESTS)
      ducktape version: 0.7.5
      session_id:       2019-08-18--011
      run time:         2 minutes 41.228 seconds
      tests run:        1
      passed:           1
      failed:           0
      ignored:          0
      ================================================================================
      test_id:    kafkatest.benchmarks.core.benchmark_test.Benchmark.test_consumer_throughput.paused_partitions_percent=0.8
      status:     PASS
      run time:   2 minutes 41.168 seconds
      {"records_per_sec": 246574.6024, "mb_per_sec": 23.5152}
      --------------------------------------------------------------------------------
      

      The increase in record and data throughput is significant: trunk reaches roughly 1.8 times the throughput of 2.3.0 (about 450,207 vs. 246,575 records/sec, and 42.9 vs. 23.5 MB/sec). Other consumer fetch metrics also show an improved fetch rate. Depending on how often partitions are paused and resumed, the change can also save a lot of data transfer between the consumer and the broker.

      Please see the pull request for the associated changes. I was unsure whether I needed to create a KIP: technically I added a new public API to the ConsumerPerformance tool, but only to enable this benchmark to run. If you feel that a KIP is necessary, I'll create one.

            People

            • Assignee: Sean Glover (seglo)
            • Reporter: Sean Glover (seglo)