SPARK-18371: Spark Streaming backpressure bug - generates a batch with large number of records


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.4.0
    • Component/s: DStreams
    • Labels: None

    Description

      When the streaming job is configured with spark.streaming.backpressure.enabled=true, it generates a giant batch of records whenever the processing time plus scheduling delay grows (much) larger than batchDuration. This creates an enormous backlog, and subsequent batches queue up for hours until the job chews through the giant batch.
      The expectation is that backpressure should instead reduce the number of records per batch toward what the job can actually process within one batch interval (see the configuration sketch below).
      Attaching some screenshots where the issue appears to be easily reproducible.
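
      For reference, a minimal sketch of the configuration involved (not the reporter's exact job; the rate values below are illustrative). On affected versions, the usual mitigations are spark.streaming.backpressure.initialRate and the per-partition Kafka rate cap, because the PID rate estimator only starts throttling once it has feedback from completed batches:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object BackpressureSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("backpressure-repro")
              // Enables the feedback-based rate control this issue is about.
              .set("spark.streaming.backpressure.enabled", "true")
              // Workarounds on affected versions: bound the first batch and the
              // per-partition ingestion rate so a long scheduling delay cannot
              // turn into one giant batch.
              .set("spark.streaming.backpressure.initialRate", "1000")
              .set("spark.streaming.kafka.maxRatePerPartition", "1000")

            val ssc = new StreamingContext(conf, Seconds(5)) // batchDuration = 5s
            // ... create the input DStream and transformations here ...
            ssc.start()
            ssc.awaitTermination()
          }
        }

      Per the Fix Version/s field above, the underlying behavior is fixed in 2.4.0; the rate caps above are only a workaround for earlier releases.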

      Attachments

        1. Screen Shot 2019-09-16 at 12.27.25 PM.png (86 kB) - Karthikeyan Ravi
        2. Look_at_batch_at_22_14.png (267 kB) - mapreduced
        3. GiantBatch3.png (289 kB) - mapreduced
        4. GiantBatch2.png (213 kB) - mapreduced
        5. Giant_batch_at_23_00.png (187 kB) - mapreduced
        6. 02.png (96 kB) - Sebastian Arzt
        7. 01.png (78 kB) - Sebastian Arzt


          People

            Assignee: seb.arzt (Sebastian Arzt)
            Reporter: mapreduced
            Votes: 2
            Watchers: 10

