  1. Hadoop YARN
  2. YARN-7377

Duplicate containers allocated for a long-running application after an NM is lost and restarted and the RM is restarted

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha3
    • None
    • applications, nodemanager, RM, yarn
    • Environment: RM recovery and NM recovery enabled;
      Spark streaming application, a long-running application on YARN

    Description

      Case:
      A Spark streaming application named app1 has been running on YARN for a long time. app1 has 3 containers in total; one of them, named c1, runs on an NM named nm1.

      1. nm1 is lost for some reason, but the containers on it keep running normally.

      2. 10 minutes later, the RM marks nm1 as lost because it has received no heartbeats from it. The RM tells app1's AM that a container of app1 failed because its NM was lost, so the AM kills that container through RPC and then requests a new container, named c2, from the RM. c2 duplicates c1.

      3. The administrator notices that nm1 is lost and restarts it. Because NM recovery is enabled, nm1 restores all of its containers, including c1, but c1's status is now 'DONE'.
      A bug here: nm1 lists container c1 in its web UI forever.

      4. The RM restarts for some reason. Because RM recovery is enabled, the RM restores all applications, including app1, and every NM has to re-register with the RM. However, when nm1 re-registers, the RM sees that container c1's status is DONE and tells app1's AM that a container of app1 has completed. Because a Spark streaming application keeps a fixed number of containers, the AM requests a new container, named c3, from the RM. c3 also duplicates c1.

      A bug here:
      app1 now has 4 containers in total instead of 3, because c2 and c3 were both requested as replacements for the same container, c1.
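
      The sequence above can be replayed with a small toy model. This is only a sketch under stated assumptions, not YARN or Spark code: the class ToyAppMaster, the container names x1 and x2, and the dedupe flag are all hypothetical. It shows how reporting the completion of c1 to the AM twice (once in step 2, once in step 4) yields 4 containers, and how remembering already-handled container IDs would keep the count at 3.

```python
from itertools import count


class ToyAppMaster:
    """Hypothetical AM that replaces every container reported as completed."""

    def __init__(self, containers, dedupe=False):
        self.running = set(containers)
        self.completed = set()            # container IDs already handled
        self.dedupe = dedupe              # illustrative guard, not a YARN feature
        self._replacement_ids = count(2)  # replacements are named c2, c3, ...

    def on_completed(self, container_id):
        # With the guard on, a second report for the same container is ignored.
        if self.dedupe and container_id in self.completed:
            return None
        self.completed.add(container_id)
        self.running.discard(container_id)
        new_id = f"c{next(self._replacement_ids)}"
        self.running.add(new_id)
        return new_id


def replay(dedupe):
    # app1 starts with 3 containers; x1 and x2 are made-up names for the other two.
    am = ToyAppMaster(["c1", "x1", "x2"], dedupe=dedupe)
    am.on_completed("c1")  # step 2: RM reports c1 failed after nm1 is lost
    am.on_completed("c1")  # step 4: nm1 re-registers and c1 is reported DONE
    return sorted(am.running)


if __name__ == "__main__":
    print(replay(dedupe=False))  # c2 and c3 both present: 4 containers
    print(replay(dedupe=True))   # only c2 replaces c1: 3 containers
```

      Whether such a guard belongs in the RM or in the AM is left open here; the sketch only illustrates why the second completion report produces the extra container.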

          People

            Assignee: Unassigned
            Reporter: NeoMatrix rangjiaheng
            Votes: 0
            Watchers: 1
