SPARK-24707

Enable spark-kafka-streaming to maintain min buffer using async thread to avoid blocking kafka poll


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: DStreams
    • Labels: None

    Description

      Currently the Spark Kafka RDD blocks on the Kafka consumer poll. In a Spark-Kafka-streaming job in particular, this poll duration is added to the batch processing time, which results in:

      • Increased batch processing time (on top of the time taken to process the records)
      • Unpredictable batch processing time, since it depends on poll time.

      If we poll Kafka in a background thread and maintain a buffer for each partition, the poll time is not added to the batch processing time, and processing time becomes more predictable (driven by the time taken to process each record rather than the extra time taken to poll records from the source).
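      As an illustration only, here is a minimal sketch of such a background poller, assuming one prefetcher per partition. The class name AsyncKafkaPrefetcher, the minBufferedRecords parameter, and the poll/back-off intervals are hypothetical and are not part of any existing Spark or Kafka API.

      {code:scala}
      import java.time.Duration
      import java.util.{Collections, Properties}
      import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

      import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
      import org.apache.kafka.common.TopicPartition

      // Hypothetical helper: keeps at least `minBufferedRecords` records buffered for one
      // partition by polling Kafka from a dedicated background thread, so that the
      // batch-processing path never blocks on consumer.poll().
      class AsyncKafkaPrefetcher[K, V](
          consumerProps: Properties,
          partition: TopicPartition,
          minBufferedRecords: Int) {

        private val buffer = new ConcurrentLinkedQueue[ConsumerRecord[K, V]]()
        private val consumer = new KafkaConsumer[K, V](consumerProps)
        consumer.assign(Collections.singletonList(partition))

        private val executor = Executors.newSingleThreadExecutor()
        executor.submit(new Runnable {
          override def run(): Unit = {
            // Poll only while the buffer is below the minimum; otherwise back off briefly.
            while (!Thread.currentThread().isInterrupted) {
              if (buffer.size < minBufferedRecords) {
                val records = consumer.poll(Duration.ofMillis(500))
                records.records(partition).forEach(r => buffer.add(r))
              } else {
                TimeUnit.MILLISECONDS.sleep(50)
              }
            }
          }
        })

        // Non-blocking read: returns None when the background thread has not buffered
        // anything yet (e.g. right after start or after the buffer was cleared).
        def next(): Option[ConsumerRecord[K, V]] = Option(buffer.poll())

        def stop(): Unit = {
          executor.shutdownNow()
          consumer.close()
        }
      }
      {code}

      A real implementation would also need consumer error handling, offset management, and rebalance handling (clearing the buffer), which are omitted here.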

      For example, we are facing issues where the Kafka poll sometimes takes ~30 seconds and sometimes returns within a second. With backpressure enabled, this slows our job down considerably. In this situation it is also difficult to scale our processing or to estimate the resources required for a future increase in records.

      Even for users who do not see varying Kafka poll times, maintaining a buffer for each partition provides a performance improvement, since each batch can then concentrate purely on processing records.

      Example:
      Let's assume

      • each Kafka poll takes 2 sec on average
      • the batch duration is 10 sec
      • processing 100 records takes 10 sec
      • each Kafka poll returns 300 records

      Then the timeline looks like:

        1. Spark job starts
        2. Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time
        3. Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec processing time
        4. Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec processing time
        5. Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time

      If we poll asynchronously and always maintain, say, 500 records for each partition, only Batch-1 takes 12 sec. After that, every batch completes in 10 sec (unless some rebalancing/failure happens, in which case the buffer is cleared and the next batch takes 12 sec again).
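      To make the timeline above concrete, here is a hedged sketch of the batch-side drain path, relying on the hypothetical AsyncKafkaPrefetcher from the earlier example; readBatch, batchSize and maxWaitMs are made-up names. A batch that finds the buffer already populated never waits on a poll, which is why only Batch-1 (and any batch that starts with an empty buffer) pays the extra 2 sec.

      {code:scala}
      import org.apache.kafka.clients.consumer.ConsumerRecord

      // Hypothetical drain path for one batch: take up to `batchSize` records from the
      // prefetch buffer, waiting at most `maxWaitMs` for the background thread to catch
      // up whenever the buffer runs empty.
      def readBatch[K, V](
          prefetcher: AsyncKafkaPrefetcher[K, V],
          batchSize: Int,
          maxWaitMs: Long): Seq[ConsumerRecord[K, V]] = {
        val out = scala.collection.mutable.ArrayBuffer.empty[ConsumerRecord[K, V]]
        val deadline = System.currentTimeMillis() + maxWaitMs
        while (out.size < batchSize && System.currentTimeMillis() < deadline) {
          prefetcher.next() match {
            case Some(record) => out += record   // buffered: no poll latency on this path
            case None         => Thread.sleep(10) // buffer empty: wait for the background poll
          }
        }
        out.toSeq
      }
      {code}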

      Attachments

        1. 40_partition_topic_without_buffer.pdf
          1.94 MB
          Sidhavratha Kumar


          People

            Assignee: Unassigned
            Reporter: Sidhavratha Kumar
            Votes: 0
            Watchers: 3
