Hadoop YARN: YARN-8958

Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: capacityscheduler
    • Labels: None

    Description

      We found an NPE in ClientRMService#getApplications when querying apps with a specified queue. The cause is an app that can no longer be found via RMContextImpl#getRMApps (it has finished and been swapped out of memory) but can still be queried from the fair ordering policy.
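A minimal, hypothetical Java sketch of the failure mode: the queue filter takes app IDs from the ordering policy and resolves each one through a map standing in for RMContextImpl#getRMApps, so a leaked ID resolves to null and is then dereferenced. StaleAppLookupSketch, App and listQueueApps are illustrative assumptions, not the actual ClientRMService code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StaleAppLookupSketch {
    // Hypothetical stand-in for an RMApp; only the queue name matters here.
    static final class App {
        final String queue;
        App(String queue) { this.queue = queue; }
    }

    // Mimics a queue-filtered listing: every app ID known to the ordering
    // policy is resolved against the RM's app map and then dereferenced.
    static List<String> listQueueApps(Map<String, App> rmApps, Set<String> policyAppIds) {
        List<String> result = new ArrayList<>();
        for (String appId : policyAppIds) {
            App app = rmApps.get(appId);          // null if the app was swapped out of memory
            result.add(appId + "@" + app.queue);  // NPE when app == null
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, App> rmApps = new HashMap<>();
        rmApps.put("app_1", new App("default"));  // app_2 already swapped out

        // app_2 is still held by the ordering policy (the leaked entity).
        Set<String> policyAppIds = new LinkedHashSet<>(List.of("app_1", "app_2"));
        try {
            listQueueApps(rmApps, policyAppIds);
        } catch (NullPointerException e) {
            System.out.println("NPE: app_2 is in the ordering policy but not in rmApps");
        }
    }
}
```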

      To reproduce the schedulable entities leak in the fair ordering policy:
      (1) create app1 and launch container1 on node1
      (2) restart the RM
      (3) remove the app1 attempt; app1 is removed from the schedulable entities
      (4) recover container1 after node1 reconnects to the RM; container1 then changes to COMPLETED, and when it is released app1 is put back into entitiesToReorder, so app1 is re-added to the schedulable entities the next time the scheduler calls FairOrderingPolicy#getAssignmentIterator
      (5) remove app1; app1 is now left behind in the schedulable entities
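The sequence above can be replayed against a simplified, hypothetical model of a comparator-based ordering policy. SimplePolicy, Entity, containerReleased and the liveUsage/cachedUsage split are assumptions made for illustration; only schedulableEntities, entitiesToReorder and reorderSchedulableEntity echo names from this issue, and the real FairOrderingPolicy differs in detail (thread safety, resource types, comparators).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class OrderingPolicyLeakSketch {
    // Hypothetical entity: live usage changes with containers,
    // cached usage is the sort key and only changes during reordering.
    static final class Entity {
        final String id;
        long liveUsage;
        long cachedUsage;
        Entity(String id, long usage) {
            this.id = id;
            this.liveUsage = usage;
            this.cachedUsage = usage;
        }
    }

    // Simplified model of a comparator-based ordering policy (not the YARN class).
    static final class SimplePolicy {
        final TreeSet<Entity> schedulableEntities = new TreeSet<>(
            Comparator.comparingLong((Entity e) -> e.cachedUsage)
                      .thenComparing((Entity e) -> e.id));
        final Map<String, Entity> entitiesToReorder = new HashMap<>();

        void addSchedulableEntity(Entity e) {
            schedulableEntities.add(e);
        }

        void removeSchedulableEntity(Entity e) {
            schedulableEntities.remove(e);
            entitiesToReorder.remove(e.id);
        }

        // A released container marks the entity for reordering without
        // checking whether it is still in schedulableEntities.
        void containerReleased(Entity e, long released) {
            e.liveUsage -= released;
            entitiesToReorder.put(e.id, e);
        }

        // Same shape as the current reorderSchedulableEntity: add() is unconditional.
        void reorderSchedulableEntity(Entity e) {
            schedulableEntities.remove(e);
            e.cachedUsage = e.liveUsage;
            schedulableEntities.add(e);
        }

        // Modeled as the work done when the scheduler asks for an assignment iterator.
        void reorderScheduleEntities() {
            for (Entity e : entitiesToReorder.values()) {
                reorderSchedulableEntity(e);
            }
            entitiesToReorder.clear();
        }
    }

    // Replays steps (2)-(5) and reports whether app1 is still in the policy.
    static boolean replayLeak() {
        SimplePolicy policy = new SimplePolicy();
        Entity app1 = new Entity("app1", 1024);  // container1 still running
        policy.addSchedulableEntity(app1);       // (2) attempt recovered after RM restart
        policy.removeSchedulableEntity(app1);    // (3) remove app1 attempt
        policy.containerReleased(app1, 1024);    // (4) recovered container1 completes
        policy.reorderScheduleEntities();        //     next assignment-iterator request
        // (5) removing app1 itself does not touch the ordering policy again
        return policy.schedulableEntities.contains(app1);
    }

    public static void main(String[] args) {
        System.out.println("app1 leaked: " + replayLeak()); // prints "app1 leaked: true"
    }
}
```

In this model the remove() in step (4) finds nothing, yet add() still inserts app1, which is exactly the leak described in step (5).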

      To solve this problem, we should ensure that schedulableEntities is changed only by adding or removing app attempts; the reordering process must never add a new entity to schedulableEntities.

        protected void reorderSchedulableEntity(S schedulableEntity) {
          //remove, update comparable data, and reinsert to update position in order
          schedulableEntities.remove(schedulableEntity);
          updateSchedulingResourceUsage(
            schedulableEntity.getSchedulingResourceUsage());
          // add() runs unconditionally, so an entity that was already removed
          // (e.g. when its app attempt was removed) is inserted back here
          schedulableEntities.add(schedulableEntity);
        }
      

      The code above can be improved as follows so that only an entity that is already present is re-added to schedulableEntities.

        protected void reorderSchedulableEntity(S schedulableEntity) {
          //remove, update comparable data, and reinsert to update position in order
          boolean exists = schedulableEntities.remove(schedulableEntity);
          updateSchedulingResourceUsage(
            schedulableEntity.getSchedulingResourceUsage());
          if (exists) {
            schedulableEntities.add(schedulableEntity);
          } else {
            LOG.info("Skip reordering non-existent schedulable entity: "
                + schedulableEntity.getId());
          }
        }
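A hypothetical, self-contained sketch of the guarded reorder on a TreeSet ordered by a cached usage value. GuardedReorderSketch, Entity and BY_CACHED_USAGE are illustrative assumptions; the assignment to cachedUsage stands in for updateSchedulingResourceUsage, and the LOG.info call is left out. It checks two things: an entity that is still present is repositioned, and an entity that is absent is not re-inserted.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class GuardedReorderSketch {
    // Hypothetical entity whose sort key is a cached copy of its live usage.
    static final class Entity {
        final String id;
        long liveUsage;
        long cachedUsage;
        Entity(String id, long usage) {
            this.id = id;
            this.liveUsage = usage;
            this.cachedUsage = usage;
        }
    }

    static final Comparator<Entity> BY_CACHED_USAGE =
        Comparator.comparingLong((Entity e) -> e.cachedUsage)
                  .thenComparing((Entity e) -> e.id);

    // Mirrors the proposed fix: re-insert only if the entity was actually present.
    static boolean reorderSchedulableEntity(TreeSet<Entity> entities, Entity e) {
        boolean exists = entities.remove(e);  // must run before the sort key changes
        e.cachedUsage = e.liveUsage;          // stand-in for updateSchedulingResourceUsage
        if (exists) {
            entities.add(e);
        }
        return exists;
    }

    public static void main(String[] args) {
        TreeSet<Entity> entities = new TreeSet<>(BY_CACHED_USAGE);
        Entity app1 = new Entity("app1", 2048);
        Entity app2 = new Entity("app2", 1024);
        entities.add(app1);
        entities.add(app2);

        app1.liveUsage = 0;                        // app1 releases its containers
        reorderSchedulableEntity(entities, app1);  // app1 moves ahead of app2
        System.out.println(entities.first().id);   // prints "app1"

        Entity app3 = new Entity("app3", 512);     // attempt already removed, never in the set
        reorderSchedulableEntity(entities, app3);
        System.out.println(entities.size());       // prints "2"
    }
}
```

The remove() call has to happen before the cached usage is updated: a TreeSet locates elements through the comparator, so once the key changes, the entity can no longer be found at its old position.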
      

      Attachments

        1. YARN-8958.001.patch (9 kB, Tao Yang)
        2. YARN-8958.002.patch (11 kB, Tao Yang)


      People

        Assignee: Tao Yang
        Reporter: Tao Yang
        Votes: 0
        Watchers: 6
