Spark / SPARK-18371

Spark Streaming backpressure bug - generates a batch with large number of records


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.0.0
    • Fix Version: 2.4.0
    • Component: DStreams
    • Labels: None

    Description

      When a streaming job runs with spark.streaming.backpressure.enabled=true, it generates a giant batch of records whenever the processing time plus scheduling delay is (much) larger than the batch duration. This single oversized batch creates a large backlog, and subsequent batches queue for hours while the job works through it.
      The expected behavior is that backpressure reduces the number of records per batch over time to match what the job can actually process.
      Attached are screenshots suggesting the issue is easily reproducible.
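      For jobs affected before the fix landed in 2.4.0, a common mitigation is to pair backpressure with an explicit rate cap so a mis-estimated rate cannot yield one giant batch. A minimal sketch in Scala, assuming a Kafka direct stream; the numeric values are illustrative, not recommendations:

      ```scala
      import org.apache.spark.SparkConf

      // Sketch: bound the first and worst-case batch sizes. Values are hypothetical.
      val conf = new SparkConf()
        .setAppName("backpressure-demo")
        // let the rate estimator adapt ingestion to observed processing speed
        .set("spark.streaming.backpressure.enabled", "true")
        // rate used for the very first batch, before any feedback exists
        .set("spark.streaming.backpressure.initialRate", "1000")
        // hard per-partition cap for Kafka direct streams; backpressure can
        // lower the effective rate below this, but never raise it above
        .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      ```

      With the per-partition cap in place, even a runaway rate estimate is bounded by maxRatePerPartition × partitions × batchDuration records per batch.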

      Attachments

        1. Look_at_batch_at_22_14.png
          267 kB
          mapreduced
        2. GiantBatch2.png
          213 kB
          mapreduced
        3. GiantBatch3.png
          289 kB
          mapreduced
        4. Giant_batch_at_23_00.png
          187 kB
          mapreduced
        5. 01.png
          78 kB
          Sebastian Arzt
        6. 02.png
          96 kB
          Sebastian Arzt
        7. Screen Shot 2019-09-16 at 12.27.25 PM.png
          86 kB
          Karthikeyan Ravi

        Activity

          People

            Assignee: seb.arzt Sebastian Arzt
            Reporter: mapreduced mapreduced
            Votes: 2
            Watchers: 10
