  1. Spark
  2. SPARK-26927

Race condition may cause dynamic allocation to stop working


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.4.0
    • Fix Version/s: 2.3.4, 2.4.1, 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Recently we hit a bug that caused our production Spark Thrift Server to hang:

      There is a race condition in the `ExecutorAllocationManager`: the `SparkListenerExecutorRemoved` event can be posted before the `SparkListenerTaskStart` event for a task on the same executor. This leaves `executorIds` in an incorrect state, with an entry for an executor that no longer exists. Later, when an executor goes idle, real executors can be removed even though the actual executor count already equals `minNumExecutors`, because `newExecutorTotal` is computed incorrectly and may be greater than `minNumExecutors`. This can eventually leave zero available executors while a wrong, nonzero number of `executorIds` is kept in memory.

      What's more, even a `SparkListenerTaskEnd` event cannot release the fake `executorIds`: a later idle event for a fake executor cannot trigger a real removal, because that executor has already been removed and no longer exists in the `executorDataMap` of `CoarseGrainedSchedulerBackend`.
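      The failure mode described above can be sketched with a small toy model. This is not Spark's code: the class `ToyAllocationManager`, its methods, and the `simulate` helper are hypothetical names, and the assumption that a task-start event registers a previously unknown executor is a simplification made for illustration. The sketch only shows how delivering the removal event before the task-start event can leave a stale executor ID that inflates the count used for the minimum-executor check.

```python
class ToyAllocationManager:
    """Toy model of the bookkeeping described in this issue (not Spark's code)."""

    def __init__(self, min_num_executors):
        self.min_num_executors = min_num_executors
        self.executor_ids = set()    # executors the manager believes exist
        self.live_executors = set()  # executors that really exist (backend's view)

    def on_executor_added(self, executor_id):
        self.executor_ids.add(executor_id)
        self.live_executors.add(executor_id)

    def on_executor_removed(self, executor_id):
        self.executor_ids.discard(executor_id)
        self.live_executors.discard(executor_id)

    def on_task_start(self, executor_id):
        # Assumption: a task start on an unknown executor registers that executor.
        if executor_id not in self.executor_ids:
            self.executor_ids.add(executor_id)

    def on_executor_idle(self, executor_id):
        # The minimum check uses the tracked count, which may include stale IDs.
        new_executor_total = len(self.executor_ids)
        if new_executor_total - 1 < self.min_num_executors:
            return False
        # A stale ID is unknown to the backend, so it can never be removed for real.
        if executor_id not in self.live_executors:
            return False
        self.on_executor_removed(executor_id)
        return True


def simulate(in_order):
    """Return (live executor count, tracked executor count) after the scenario."""
    manager = ToyAllocationManager(min_num_executors=1)
    manager.on_executor_added("130")
    manager.on_executor_added("131")

    events = [("task_start", "131"), ("removed", "131")]
    if not in_order:
        events.reverse()  # removal is delivered before the task start
    for kind, executor_id in events:
        if kind == "task_start":
            manager.on_task_start(executor_id)
        else:
            manager.on_executor_removed(executor_id)

    manager.on_executor_idle("130")  # the last real executor goes idle
    manager.on_executor_idle("131")  # idle event for the (possibly stale) ID
    return len(manager.live_executors), len(manager.executor_ids)


if __name__ == "__main__":
    print("in order:    ", simulate(in_order=True))
    print("out of order:", simulate(in_order=False))
```

      In this toy model, in-order delivery keeps one live executor, while out-of-order delivery ends with no live executors and one stale tracked ID, which mirrors the hang reported here.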

      Logs:

       

      Event logs (events out of order):

      {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor ID":"131","Removed Reason":"Container container_e28_1547530852233_236191_02_000180 exited from explicit termination request."}
      
      {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch Time":1549936032872,"Executor ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count Failed Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count Failed Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Value":39,"Internal":true,"Count Failed Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultSize","Update":3578,"Value":7156,"Internal":true,"Count Failed Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count Failed Values":true},{"ID":12923962,"Name":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count Failed Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Count Failed Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count Failed Values":true},{"ID":12921550,"Name":"number of output rows","Update":"158","Value":"289","Internal":true,"Count Failed Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output rows","Update":"23","Value":"45","Internal":true,"Count Failed Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memory total (min, med, max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed Values":true,"Metadata":"sql"}]}}
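      To check whether other event logs show the same ordering problem, a small scan along the following lines may help. It is a rough sketch, not part of Spark: the function name `find_task_starts_after_removal` is hypothetical, it relies only on the `Event`, `Executor ID`, and `Task Info` fields visible in the excerpt above, and it assumes the log holds one JSON event per line and that `SparkListenerExecutorAdded` events carry an `Executor ID` field as well.

```python
import json


def find_task_starts_after_removal(lines):
    """Return (task_id, executor_id) pairs for task starts logged after the
    executor they ran on was already logged as removed."""
    removed = set()
    suspicious = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerExecutorRemoved":
            removed.add(event["Executor ID"])
        elif kind == "SparkListenerExecutorAdded":
            # Assumption: an ID that is added again is no longer "removed".
            removed.discard(event["Executor ID"])
        elif kind == "SparkListenerTaskStart":
            info = event["Task Info"]
            if info["Executor ID"] in removed:
                suspicious.append((info["Task ID"], info["Executor ID"]))
    return suspicious


if __name__ == "__main__":
    # Minimal sample shaped like the two events above (most fields omitted).
    sample = [
        '{"Event":"SparkListenerExecutorRemoved","Executor ID":"131"}',
        '{"Event":"SparkListenerTaskStart","Task Info":{"Task ID":448048,"Executor ID":"131"}}',
    ]
    print(find_task_starts_after_removal(sample))
```

      Any pair the scan reports is a candidate for the disorder shown above; an open file handle over an event log can be passed in place of the sample list.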
      
      

       

      Attachments

        1. Selection_046.jpg
          56 kB
          liupengcheng
        2. Selection_045.jpg
          31 kB
          liupengcheng
        3. Selection_044.jpg
          5 kB
          liupengcheng
        4. Selection_043.jpg
          21 kB
          liupengcheng
        5. Selection_042.jpg
          77 kB
          liupengcheng

            People

              Assignee: liupengcheng
              Reporter: liupengcheng
              Votes: 0
              Watchers: 3
