[SPARK-26713] PipedRDD may hold stdin writer and stdout reader threads even after the task is finished


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.3, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0
    • Fix Version/s: 2.4.5, 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      During an investigation of an OOM in one of our internal production jobs, I found that PipedRDD leaks memory. After some digging, the problem comes down to the fact that PipedRDD doesn't release its stdin writer and stdout reader threads even after the task is finished.


      PipedRDD creates two threads: a stdin writer and a stdout reader. If we are lucky and the task finishes normally, these two threads exit normally. But if the subprocess (the piped command) fails, the task is marked as failed while the stdin writer keeps running until it has fully consumed its parent RDD's iterator. There is even a race condition between ShuffledRDD and PipedRDD: the ShuffleBlockFetcherIterator is cleaned up at task completion, which hangs the stdin writer thread and leaks memory.
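      To make the failure mode concrete, here is a minimal sketch of the pattern (my own simplification, not the actual org.apache.spark.rdd.PipedRDD source; only the stdin writer thread is shown, and the reader side is analogous):

{code:scala}
// Simplified sketch of the PipedRDD pattern; not the actual Spark source.
import java.io.PrintWriter

def pipe(parentIterator: Iterator[String], command: Seq[String]): Iterator[String] = {
  val proc = new ProcessBuilder(command: _*).start()

  // stdin writer: feeds the parent RDD's records into the subprocess.
  new Thread("stdin writer for " + command.mkString(" ")) {
    override def run(): Unit = {
      val out = new PrintWriter(proc.getOutputStream)
      // Nothing interrupts this loop when the subprocess dies early: the
      // thread keeps draining parentIterator (and whatever backs it, e.g.
      // a ShuffleBlockFetcherIterator) until the iterator is exhausted,
      // or blocks forever if the iterator itself hangs.
      for (record <- parentIterator) out.println(record)
      out.close()
    }
  }.start()

  // stdout side: the task consumes the subprocess output lazily; if the
  // task fails, this iterator is abandoned rather than closed.
  scala.io.Source.fromInputStream(proc.getInputStream).getLines()
}
{code}

      A plausible mitigation (not claiming this is what the committed patch does) is to register a task completion listener, e.g. via TaskContext.addTaskCompletionListener, that interrupts both threads and destroys the subprocess so that they cannot outlive the task.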

          People

            Assignee: advancedxy YE
            Reporter: advancedxy YE