Spark / SPARK-46702

Spark Cluster Crashing


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0, 3.5.0
    • Fix Version/s: None
    • Component/s: Spark Core, Spark Docker

    Description


      • We have a Spark cluster deployed on a Kubernetes (k8s) cluster with one driver and 120 executors.
      • The batch duration is configured to 30 seconds.
      • The Spark cluster is reading from a 120-partition Kafka topic and writing to an hourly index in Elasticsearch.
      • ES has 30 DataNodes, with 1 shard per DataNode for each index.
      • The configuration of the driver StatefulSet (STS) is in the Appendix.
      • The driver is observed restarting periodically every 10 minutes. The restarts do not always occur, but when they do, they happen every 10 minutes.
      • The restart frequency increases as the throughput increases.
      • When the restarts happen, we see OptionalDataException; the attached “logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log” is the log resulting in a restart of the driver.
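For reference, the deployment described above can be sketched as a spark-submit invocation of roughly the following shape. This is a minimal illustration only: the API server address, image, class, and jar path are placeholders, not the values from our environment (those are in the Appendix).

```shell
# Hypothetical sketch of the Spark-on-k8s submission described above.
# All <...> values and the application class/jar are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=120 \
  --conf spark.kubernetes.namespace=<namespace> \
  --conf spark.kubernetes.container.image=<spark-image> \
  --class <streaming.core.MainClass> \
  local:///opt/app/<streaming-core>.jar
```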

      Analysis:

      1. We’ve done a test with 250K records/second, and processing was healthy, taking between 15 and 20 seconds per batch.
      2. We were able to avoid all the restarts simply by disabling the liveness checks.
      3. This resulted in NO RESTARTS of Streaming Core. We tried the above in two scenarios:
      • Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. During this period, the main error logs observed were “The executor with id 7 exited with exit code 50 (Uncaught exception).” Logs at WARN level and TRACE level were collected:
      • WARN: Logs attached “cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
      • TRACE: Logs attached “cveshv-events-streaming-TRACE (2).zip”
      • Speculation enabled --> the batch duration increased to minutes (big lag) only after around 2 hours; the related log is “cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
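Speculation in the two scenarios above was toggled via the standard Spark configuration properties; the interval and multiplier values shown below are Spark's documented defaults, shown for illustration, not necessarily our exact tuning:

```shell
# Standard Spark confs for speculative execution (submit command abbreviated).
# spark.speculation=false reproduces the "Speculation disabled" scenario.
--conf spark.speculation=true \
--conf spark.speculation.interval=100ms \
--conf spark.speculation.multiplier=1.5
```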

      Conclusion:

      • The liveness check is failing and thus causing the restarts.
      • The logs indicate that there are unhandled exceptions in the executors.
      • The issue may lie elsewhere as well; below is the liveness check that was disabled, which was initially causing the restarts every 10 minutes after 3 failed occurrences.
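Disabling the probe on a running driver StatefulSet can be done with a JSON patch like the one below. This is an illustrative sketch only: the namespace, StatefulSet name, and container index are placeholders and must match the actual deployment (the real probe definition is shown in the Appendix screenshot).

```shell
# Remove the livenessProbe from the driver container of the STS.
# <namespace>/<driver-sts> are placeholders; containers/0 assumes the
# driver is the first container in the pod template.
kubectl -n <namespace> patch statefulset <driver-sts> --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
```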

       

      Next Action:

      • Please help us identify the root cause of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.

       

      Appendix:

       

      Attachments

        1. logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
          5.73 MB
          Mohamad Haidar
        2. image-2024-01-12-10-45-50-427.png
          266 kB
          Mohamad Haidar
        3. image-2024-01-12-10-45-40-397.png
          449 kB
          Mohamad Haidar
        4. image-2024-01-12-10-45-30-398.png
          431 kB
          Mohamad Haidar
        5. image-2024-01-12-10-45-18-905.png
          391 kB
          Mohamad Haidar
        6. image-2024-01-12-10-44-45-717.png
          35 kB
          Mohamad Haidar
        7. cveshv-events-streaming-TRACE (2).zip
          53.21 MB
          Mohamad Haidar
        8. cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log
          12.57 MB
          Mohamad Haidar
        9. CV62A4~1.LOG
          955 kB
          Mohamad Haidar


          People

            Assignee: Unassigned
            Reporter: mohhai1 Mohamad Haidar
            Votes: 0
            Watchers: 3
