Spark / SPARK-18371

Spark Streaming backpressure bug - generates a batch with large number of records


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.0.0
    • Fix Version: 2.4.0
    • Component: DStreams
    • Labels: None

    Description

      When a streaming job runs with spark.streaming.backpressure.enabled=true, it generates a giant batch of records whenever the processing time plus scheduling delay is (much) larger than the batch duration. This single oversized batch creates a large backlog, and subsequent batches queue for hours while the job works through it.
      The expected behavior is that backpressure reduces the number of records per batch over time to match what the job can actually process.
      Attached are screenshots suggesting the issue is easily reproducible.
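      For jobs affected before the fix landed in 2.4.0, a common mitigation is to pair backpressure with an explicit rate cap so a mis-estimated rate cannot yield one giant batch. A minimal sketch in Scala, assuming a Kafka direct stream; the numeric values are illustrative, not recommendations:

      ```scala
      import org.apache.spark.SparkConf

      // Sketch: bound the first and worst-case batch sizes. Values are hypothetical.
      val conf = new SparkConf()
        .setAppName("backpressure-demo")
        // let the rate estimator adapt ingestion to observed processing speed
        .set("spark.streaming.backpressure.enabled", "true")
        // rate used for the very first batch, before any feedback exists
        .set("spark.streaming.backpressure.initialRate", "1000")
        // hard per-partition cap for Kafka direct streams; backpressure can
        // lower the effective rate below this, but never raise it above
        .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      ```

      With the per-partition cap in place, even a runaway rate estimate is bounded by maxRatePerPartition × partitions × batchDuration records per batch.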

      Attachments

        1. Look_at_batch_at_22_14.png
          267 kB
          mapreduced
        2. GiantBatch2.png
          213 kB
          mapreduced
        3. GiantBatch3.png
          289 kB
          mapreduced
        4. Giant_batch_at_23_00.png
          187 kB
          mapreduced
        5. 01.png
          78 kB
          Sebastian Arzt
        6. 02.png
          96 kB
          Sebastian Arzt
        7. Screen Shot 2019-09-16 at 12.27.25 PM.png
          86 kB
          Karthikeyan Ravi

        Activity

          People

            Assignee: seb.arzt Sebastian Arzt
            Reporter: mapreduced mapreduced
            Votes: 2
            Watchers: 10
