SPARK-24707

Enable spark-kafka-streaming to maintain min buffer using async thread to avoid blocking kafka poll


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: DStreams
    • Labels: None

    Description

      Currently the Spark Kafka RDD blocks on the Kafka consumer poll. In a Spark-Kafka-streaming job in particular, this poll duration is added to the batch processing time, which results in:

      • Increased batch processing time (on top of the time taken to process the records)
      • Unpredictable batch processing time, since it depends on poll time.

      If we poll Kafka in a background thread and maintain a buffer for each partition, the poll time is not added to the batch processing time, and processing time becomes more predictable (driven by the time taken to process each record rather than the extra time taken to poll records from the source).
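      As an illustration only, here is a minimal sketch of such a background poller, assuming one prefetcher per partition. The class name AsyncKafkaPrefetcher, the minBufferedRecords parameter, and the poll/back-off intervals are hypothetical and are not part of any existing Spark or Kafka API.

      {code:scala}
      import java.time.Duration
      import java.util.{Collections, Properties}
      import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

      import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
      import org.apache.kafka.common.TopicPartition

      // Hypothetical helper: keeps at least `minBufferedRecords` records buffered for one
      // partition by polling Kafka from a dedicated background thread, so that the
      // batch-processing path never blocks on consumer.poll().
      class AsyncKafkaPrefetcher[K, V](
          consumerProps: Properties,
          partition: TopicPartition,
          minBufferedRecords: Int) {

        private val buffer = new ConcurrentLinkedQueue[ConsumerRecord[K, V]]()
        private val consumer = new KafkaConsumer[K, V](consumerProps)
        consumer.assign(Collections.singletonList(partition))

        private val executor = Executors.newSingleThreadExecutor()
        executor.submit(new Runnable {
          override def run(): Unit = {
            // Poll only while the buffer is below the minimum; otherwise back off briefly.
            while (!Thread.currentThread().isInterrupted) {
              if (buffer.size < minBufferedRecords) {
                val records = consumer.poll(Duration.ofMillis(500))
                records.records(partition).forEach(r => buffer.add(r))
              } else {
                TimeUnit.MILLISECONDS.sleep(50)
              }
            }
          }
        })

        // Non-blocking read: returns None when the background thread has not buffered
        // anything yet (e.g. right after start or after the buffer was cleared).
        def next(): Option[ConsumerRecord[K, V]] = Option(buffer.poll())

        def stop(): Unit = {
          executor.shutdownNow()
          consumer.close()
        }
      }
      {code}

      A real implementation would also need consumer error handling, offset management, and rebalance handling (clearing the buffer), which are omitted here.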

      For example, we are facing issues where the Kafka poll sometimes takes ~30 seconds and sometimes returns within a second. With backpressure enabled, this slows our job down considerably. In this situation it is also difficult to scale our processing or to estimate the resources required for a future increase in records.

      Even for users who do not see varying Kafka poll times, maintaining a buffer for each partition provides a performance improvement, since each batch can then concentrate purely on processing records.

      Example:
      Let's assume

      • each Kafka poll takes 2 sec on average
      • the batch duration is 10 sec
      • processing 100 records takes 10 sec
      • each Kafka poll returns 300 records

      Then the timeline looks like:

        1. Spark job starts
        2. Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time
        3. Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec processing time
        4. Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec processing time
        5. Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time

      If we poll asynchronously and always maintain, say, 500 records for each partition, only Batch-1 takes 12 sec. After that, every batch completes in 10 sec (unless some rebalancing/failure happens, in which case the buffer is cleared and the next batch takes 12 sec again).
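      To make the timeline above concrete, here is a hedged sketch of the batch-side drain path, relying on the hypothetical AsyncKafkaPrefetcher from the earlier example; readBatch, batchSize and maxWaitMs are made-up names. A batch that finds the buffer already populated never waits on a poll, which is why only Batch-1 (and any batch that starts with an empty buffer) pays the extra 2 sec.

      {code:scala}
      import org.apache.kafka.clients.consumer.ConsumerRecord

      // Hypothetical drain path for one batch: take up to `batchSize` records from the
      // prefetch buffer, waiting at most `maxWaitMs` for the background thread to catch
      // up whenever the buffer runs empty.
      def readBatch[K, V](
          prefetcher: AsyncKafkaPrefetcher[K, V],
          batchSize: Int,
          maxWaitMs: Long): Seq[ConsumerRecord[K, V]] = {
        val out = scala.collection.mutable.ArrayBuffer.empty[ConsumerRecord[K, V]]
        val deadline = System.currentTimeMillis() + maxWaitMs
        while (out.size < batchSize && System.currentTimeMillis() < deadline) {
          prefetcher.next() match {
            case Some(record) => out += record   // buffered: no poll latency on this path
            case None         => Thread.sleep(10) // buffer empty: wait for the background poll
          }
        }
        out.toSeq
      }
      {code}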

      Attachments

        1. 40_partition_topic_without_buffer.pdf
          1.94 MB
          Sidhavratha Kumar


          People

            Assignee: Unassigned
            Reporter: Sidhavratha Kumar
            Votes: 0
            Watchers: 3
