Spark / SPARK-25563

Spark application hangs if a container is allocated on a lost NodeManager

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:

      Description

          I ran into an issue where a Spark application started in YARN client mode sometimes hangs.
          I checked the application logs: a container was allocated on a lost NodeManager, but the AM did not retry by requesting another executor.
          My Spark version is 2.3.1.
          Here is my ApplicationMaster log:
       
      2018-09-26 05:21:15 INFO YarnRMClient:54 - Registering the ApplicationMaster
      2018-09-26 05:21:15 INFO ConfiguredRMFailoverProxyProvider:100 - Failing over to rm2
      2018-09-26 05:21:15 WARN Utils:66 - spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
      2018-09-26 05:21:15 INFO Utils:54 - Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
      2018-09-26 05:21:15 INFO YarnAllocator:54 - Will request 1 executor container(s), each with 24 core(s) and 20275 MB memory (including 1843 MB of overhead)
      2018-09-26 05:21:15 INFO YarnAllocator:54 - Submitted 1 unlocalized container requests.
      2018-09-26 05:21:15 INFO ApplicationMaster:54 - Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
      2018-09-26 05:21:27 WARN YarnAllocator:66 - Cannot find executorId for container: container_1532951609168_4721728_01_000002
      2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container container_1532951609168_4721728_01_000002 (state: COMPLETE, exit status: -100)
      2018-09-26 05:21:27 WARN YarnAllocator:66 - Container marked as failed: container_1532951609168_4721728_01_000002. Exit status: -100. Diagnostics: Container released on a lost node
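
For readers triaging similar logs, here is a minimal sketch that pulls the container ID and exit status out of a "Completed container" line like the one above and labels the status. In Hadoop YARN's `ContainerExitStatus`, -100 is `ABORTED`, which covers containers released by the application or lost because of node failure, and that matches the "Container released on a lost node" diagnostic. The regular expression is an assumption based only on the log format shown in this report, and `parse_completed_container` and `YARN_EXIT_STATUS` are hypothetical names, not part of Spark or YARN.

```python
import re

# Small subset of Hadoop YARN ContainerExitStatus codes (not exhaustive).
YARN_EXIT_STATUS = {
    -100: "ABORTED",       # released by the app, or lost due to node failure
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
}

# Assumed pattern, derived from the YarnAllocator line in the log above.
COMPLETED_RE = re.compile(
    r"Completed container (?P<container>container_\S+) "
    r"\(state: (?P<state>\w+), exit status: (?P<status>-?\d+)\)"
)


def parse_completed_container(line):
    """Return container details from a 'Completed container' log line, or None."""
    match = COMPLETED_RE.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    return {
        "container": match.group("container"),
        "state": match.group("state"),
        "exit_status": status,
        "meaning": YARN_EXIT_STATUS.get(status, "UNKNOWN"),
    }


line = (
    "2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container "
    "container_1532951609168_4721728_01_000002 "
    "(state: COMPLETE, exit status: -100)"
)
info = parse_completed_container(line)
print(info)
```

In this log, no further "Will request ... executor container(s)" line follows the failure, which is the behaviour the report describes as the hang.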

            People

            • Assignee:
              Unassigned
            • Reporter:
              csudevinduan devinduan
            • Votes:
              0
            • Watchers:
              3