Spark / SPARK-45057

Deadlock caused by rdd replication level of 2

    Description

       
      When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs.

      A task releases its write lock on the RDD block only after it has written the block locally and replicated it to the remote executor.

       

      ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
      |T0|write lock of rdd| | | |
      |T1| | |write lock of rdd| |
      |T2|replicate -> UploadBlockSync (blocked by T4)| | | |
      |T3| | | |Received UploadBlock request from T1 (blocked by T3)|
      |T4| | |replicate -> UploadBlockSync (blocked by T2)| |
      |T5| |Received UploadBlock request from T3 (blocked by T1)| | |
      |T6|Deadlock|Deadlock|Deadlock|Deadlock|
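
The timeline above can be reproduced in miniature outside Spark. The following is a minimal sketch, not Spark code: `ToyExecutor`, `upload_block`, `run_task`, and `simulate` are hypothetical names chosen for illustration, and each executor is reduced to a single lock that stands in for the write lock on the RDD block. Two further simplifications are assumptions of the sketch: the upload handler runs in the caller's thread instead of a separate shuffle server thread, and the upload waits with a timeout (where the scenario in this issue blocks indefinitely) so that the script terminates and can report the cycle.

```python
import threading


class ToyExecutor:
    """Hypothetical stand-in for an executor with one write lock for one RDD block."""

    def __init__(self, name):
        self.name = name
        self.block_lock = threading.Lock()  # write lock for the shared RDD block
        self.peer = None                    # the only other executor

    def upload_block(self, timeout):
        # Shuffle-server side: storing the incoming replica needs the local write lock.
        acquired = self.block_lock.acquire(timeout=timeout)
        if acquired:
            self.block_lock.release()
        return acquired

    def run_task(self, barrier, timeout, results):
        with self.block_lock:   # T0 / T1: the task takes the write lock
            barrier.wait()      # both tasks now hold their local lock
            # T2 / T4: synchronous replication to the peer while still holding the lock
            results[self.name] = self.peer.upload_block(timeout)
            barrier.wait()      # keep the lock until both uploads have returned


def simulate(timeout=0.2):
    """Run both tasks; a False value means that executor's upload could never proceed."""
    exe1, exe2 = ToyExecutor("exe1"), ToyExecutor("exe2")
    exe1.peer, exe2.peer = exe2, exe1
    barrier = threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(target=e.run_task, args=(barrier, timeout, results))
        for e in (exe1, exe2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(simulate())
```

In this sketch both uploads report `False`: each task holds its own lock while waiting for the peer, and the peer cannot accept the replica until its own task releases the same kind of lock. Without the timeout, neither side would ever return, which matches row T6 of the timeline.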

            People

              warrenzhu25 Zhongwei Zhu