  Spark / SPARK-34063

Major slowdown in spark streaming after 6 days

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core
    • Labels: None
    • Environment: AWS EMR 6.1.0, Spark 3.0.0, Kinesis

    Description

      The Spark Streaming application runs with a 60s batch interval.

      The application runs fine, processing each batch in around 40s. After ~8600 batches (around 6 days), it suddenly hits a wall: processing time jumps to 2-2.4 minutes, and the application eventually dies with exit code 137. This happens consistently every 6 days, regardless of data.
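
As a rough sanity check of the numbers above, the following Python sketch converts the batch count to elapsed days, decodes exit code 137 under the common POSIX shell convention of 128 + signal number, and estimates how fast scheduling delay builds up once processing time exceeds the batch interval. The function names are hypothetical helpers, and the delay estimate assumes batches are processed one at a time, which is the Spark Streaming default but is not stated in the report.

```python
import signal

SECONDS_PER_DAY = 86400


def batches_to_days(batches, interval_s):
    """Elapsed wall-clock days after `batches` batches of `interval_s` seconds."""
    return batches * interval_s / SECONDS_PER_DAY


def decode_exit_code(code):
    """Return the signal name if `code` follows the 128 + N convention, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None
    return None


def delay_growth_per_batch(processing_s, interval_s):
    """Seconds of extra scheduling delay added per batch (assumes sequential batches)."""
    return max(0, processing_s - interval_s)


print(round(batches_to_days(8600, 60), 2))   # 5.97
print(decode_exit_code(137))                 # SIGKILL (on POSIX systems)
print(delay_growth_per_batch(120, 60))       # 60
```

An exit code of 137 only shows that the process received SIGKILL. On YARN this is often, though not necessarily, the result of a container exceeding its memory limit.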

      Looking at the application logs, it seems that when the issue begins, executors are still completing tasks, but the driver takes a while to acknowledge them. I have taken numerous memory dumps of the driver (before it hits the 6-day wall) using jcmd and can see that org.apache.spark.scheduler.AsyncEventQueue is growing in size even though the application is keeping up with batches. I have yet to take a snapshot of the application in the broken state.
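
One way to quantify this kind of growth between two driver snapshots is to diff class histograms. The Python sketch below assumes the snapshots were saved as text from `jcmd <pid> GC.class_histogram`; the report does not say which jcmd command was used, so this is an illustration rather than a reconstruction of the reporter's method. `parse_histogram` and `top_growth` are hypothetical helper names.

```python
import re

# Matches data rows of `jcmd <pid> GC.class_histogram` output, for example:
#    1:        123456       12345678  [B
# The header and separator lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Map class name -> (instance count, bytes) for each data row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group(3)] = (int(match.group(1)), int(match.group(2)))
    return rows


def top_growth(before_text, after_text, n=10):
    """Return the n classes whose byte footprint grew the most, largest first."""
    before = parse_histogram(before_text)
    after = parse_histogram(after_text)
    growth = {
        name: size - before.get(name, (0, 0))[1]
        for name, (_count, size) in after.items()
    }
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

Calling `top_growth` on the contents of two histogram files taken a day or more apart would list the classes that account for most of the added heap, which can help confirm whether queued listener events are what is accumulating.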

      Attachments

        1. slow-job (86 kB, Calvin Pietersen)
        2. normal-job (75 kB, Calvin Pietersen)
        3. 2020-12-29.pdf (3.74 MB, Calvin Pietersen)

          People

            Assignee: Unassigned
            Reporter: Calvin Pietersen (milksteak)
            Votes: 0
            Watchers: 2