Spark / SPARK-18371

Spark Streaming backpressure bug - generates a batch with large number of records


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.4.0
    • Component/s: DStreams
    • Labels:
      None

      Description

      When a streaming job is configured with backpressureEnabled=true, it generates a giant batch of records whenever processing time plus scheduling delay grows (much) larger than batchDuration. This creates a huge backlog of records and results in batches queueing for hours until the job works through that giant batch.
      The expectation is that backpressure should reduce the number of records per batch to roughly what the job can actually process within batchDuration.
      Attaching some screenshots showing that this issue is quite easy to reproduce.
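
      To illustrate the failure mode, here is a minimal, hypothetical sketch (not Spark's actual implementation) of how a batch's record count is derived from an estimated ingestion rate and the batch interval. Without a hard cap such as spark.streaming.receiver.maxRate (or spark.streaming.kafka.maxRatePerPartition for direct Kafka streams), a badly inflated rate estimate translates directly into a giant batch; the function name and parameters below are illustrative only.

      ```python
      def records_for_batch(estimated_rate, batch_duration_s, max_rate=None):
          """Sketch: how a rate limit bounds the size of the next batch.

          estimated_rate   -- records/sec suggested by the rate estimator
          batch_duration_s -- streaming batch interval in seconds
          max_rate         -- optional hard cap on records/sec
                              (analogous to spark.streaming.receiver.maxRate)
          """
          rate = estimated_rate
          if max_rate is not None:
              # Clamp the estimator's suggestion to the configured ceiling,
              # so a runaway estimate cannot produce an unbounded batch.
              rate = min(rate, max_rate)
          return int(rate * batch_duration_s)
      ```

      With no cap, an estimate of 1,000,000 records/sec over a 10 s batch yields a 10,000,000-record batch; clamping at 1,000 records/sec bounds the same batch to 10,000 records, which is the behavior the reporter expects from backpressure.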

        Attachments

        1. Look_at_batch_at_22_14.png
          267 kB
          mapreduced
        2. GiantBatch2.png
          213 kB
          mapreduced
        3. GiantBatch3.png
          289 kB
          mapreduced
        4. Giant_batch_at_23_00.png
          187 kB
          mapreduced
        5. 01.png
          78 kB
          Sebastian Arzt
        6. 02.png
          96 kB
          Sebastian Arzt
        7. Screen Shot 2019-09-16 at 12.27.25 PM.png
          86 kB
          Karthikeyan Ravi

            People

            • Assignee: Sebastian Arzt (seb.arzt)
            • Reporter: mapreduced
            • Votes: 2
            • Watchers: 10
