Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-7051

Avoid concurrent modification exception in FifoIntraQueuePreemptionPlugin

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue has the following code:

          Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
          Resource amUsed = Resources.createResource(0, 0);
      
          for (FiCaSchedulerApp app : runningApps) {
      

      runningApps is unmodifiable but not concurrent. This caused the preemption monitor thread to crash in the RM in one of our clusters.

      1. YARN-7051.001.patch
        2 kB
        Eric Payne
      2. YARN-7051.002.patch
        3 kB
        Eric Payne

        Activity

        Hide
        eepayne Eric Payne added a comment -

        In the cross-queue preemption policy, it is synchronizing on the leafqueue when it walks the applications list:

        FifoCandidatesSelector#selectCandidates
        ...
              synchronized (leafQueue) {
        ...
                Iterator<FiCaSchedulerApp> desc =
                    leafQueue.getOrderingPolicy().getPreemptionIterator();
                while (desc.hasNext()) {
                  FiCaSchedulerApp fc = desc.next();
        
        Show
        eepayne Eric Payne added a comment - In the cross-queue preemption policy, it is synchronizing on the leafqueue when it walks the applications list: FifoCandidatesSelector#selectCandidates ... synchronized (leafQueue) { ... Iterator<FiCaSchedulerApp> desc = leafQueue.getOrderingPolicy().getPreemptionIterator(); while (desc.hasNext()) { FiCaSchedulerApp fc = desc.next();
        Hide
        eepayne Eric Payne added a comment -

        The YARN-7051.001.patch adds a leafqueue synchronization around the vulnerable code. I am still doing manual testing.

        Show
        eepayne Eric Payne added a comment - The YARN-7051 .001.patch adds a leafqueue synchronization around the vulnerable code. I am still doing manual testing.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 17s Docker mode activated.
              Prechecks
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
              trunk Compile Tests
        +1 mvninstall 14m 27s trunk passed
        +1 compile 0m 37s trunk passed
        +1 checkstyle 0m 30s trunk passed
        +1 mvnsite 0m 44s trunk passed
        +1 findbugs 1m 13s trunk passed
        +1 javadoc 0m 27s trunk passed
              Patch Compile Tests
        +1 mvninstall 0m 40s the patch passed
        +1 compile 0m 39s the patch passed
        +1 javac 0m 39s the patch passed
        +1 checkstyle 0m 26s the patch passed
        +1 mvnsite 0m 40s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 15s the patch passed
        +1 javadoc 0m 22s the patch passed
              Other Tests
        -1 unit 49m 0s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        72m 58s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
        Timed out junit tests org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
          org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:14b5c93
        JIRA Issue YARN-7051
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882751/YARN-7051.001.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 6cd7d0d621d0 3.13.0-123-generic #172-Ubuntu SMP Mon Jun 26 18:04:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 436c263
        Default Java 1.8.0_144
        findbugs v3.1.0-RC1
        unit https://builds.apache.org/job/PreCommit-YARN-Build/17016/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17016/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/17016/console
        Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 17s Docker mode activated.       Prechecks +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.       trunk Compile Tests +1 mvninstall 14m 27s trunk passed +1 compile 0m 37s trunk passed +1 checkstyle 0m 30s trunk passed +1 mvnsite 0m 44s trunk passed +1 findbugs 1m 13s trunk passed +1 javadoc 0m 27s trunk passed       Patch Compile Tests +1 mvninstall 0m 40s the patch passed +1 compile 0m 39s the patch passed +1 javac 0m 39s the patch passed +1 checkstyle 0m 26s the patch passed +1 mvnsite 0m 40s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 15s the patch passed +1 javadoc 0m 22s the patch passed       Other Tests -1 unit 49m 0s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 72m 58s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation Timed out junit tests org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA   org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue YARN-7051 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882751/YARN-7051.001.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 6cd7d0d621d0 3.13.0-123-generic #172-Ubuntu SMP Mon Jun 26 18:04:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 436c263 Default Java 1.8.0_144 findbugs v3.1.0-RC1 unit https://builds.apache.org/job/PreCommit-YARN-Build/17016/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17016/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/17016/console Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        sunilg Sunil G added a comment -

        Thanks Eric Payne
        I just checked code base, and I think there is one more place we used getApplications w/o any lock.

        In computeAppsIdealAllocation,
        we are passing apps to createTempAppForResCalculation which internally is doing a loop

            // 2. tq.leafQueue will not be null as we validated it in caller side
            Collection<FiCaSchedulerApp> apps = tq.leafQueue.getAllApplications();
        
            // We do not need preemption for a single app
            if (apps.size() == 1) {
              return;
            }
        
            // 3. Create all tempApps for internal calculation and return a list from
            // high priority to low priority order.
            PriorityQueue<TempAppPerPartition> orderedByPriority = createTempAppForResCalculation(
                tq, apps, clusterResource, perUserAMUsed);
        
        Show
        sunilg Sunil G added a comment - Thanks Eric Payne I just checked code base, and I think there is one more place we used getApplications w/o any lock. In computeAppsIdealAllocation , we are passing apps to createTempAppForResCalculation which internally is doing a loop // 2. tq.leafQueue will not be null as we validated it in caller side Collection<FiCaSchedulerApp> apps = tq.leafQueue.getAllApplications(); // We do not need preemption for a single app if (apps.size() == 1) { return ; } // 3. Create all tempApps for internal calculation and return a list from // high priority to low priority order. PriorityQueue<TempAppPerPartition> orderedByPriority = createTempAppForResCalculation( tq, apps, clusterResource, perUserAMUsed);
        Hide
        eepayne Eric Payne added a comment - - edited

        Hi Sunil G. Thanks for the review and the detailed reply.

        I think there is one more place we used getApplications w/o any

             Collection<FiCaSchedulerApp> apps = tq.leafQueue.getAllApplications(); 

        The call to leafQueue.getApplications() within calculateUsedAMResourcesPerQueue gets the actual collection of apps from the ordering policy, which can obviously change because the leaf queue is modifying it. However, the call to getAllApplications makes a copy of the list of running and pending apps, so this won't be changing while createTempAppForResCalculation is looping over the list.

        Show
        eepayne Eric Payne added a comment - - edited Hi Sunil G . Thanks for the review and the detailed reply. I think there is one more place we used getApplications w/o any Collection<FiCaSchedulerApp> apps = tq.leafQueue.getAllApplications(); The call to leafQueue.getApplications() within calculateUsedAMResourcesPerQueue gets the actual collection of apps from the ordering policy, which can obviously change because the leaf queue is modifying it. However, the call to getAllApplications makes a copy of the list of running and pending apps, so this won't be changing while createTempAppForResCalculation is looping over the list.
        Hide
        eepayne Eric Payne added a comment -

        so this won't be changing while createTempAppForResCalculation is looping over the list.

        However, I did find a race condition that throws an NPE within createTempAppForResCalculation.

        java.lang.NullPointerException
                at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:155)
                at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.createTempAppForResCalculation(FifoIntraQueuePreemptionPlugin.java:403)
                at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.computeAppsIdealAllocation(FifoIntraQueuePreemptionPlugin.java:140)
                at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:283)
        

        The reason for this is that perUserAMUsed was populated with running apps prior to calling createTempAppForResCalculation, but then createTempAppForResCalculation loops through both running and pending apps.

        Attaching new patch that addresses this.

        Show
        eepayne Eric Payne added a comment - so this won't be changing while createTempAppForResCalculation is looping over the list. However, I did find a race condition that throws an NPE within createTempAppForResCalculation . java.lang.NullPointerException at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:155) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.createTempAppForResCalculation(FifoIntraQueuePreemptionPlugin.java:403) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.computeAppsIdealAllocation(FifoIntraQueuePreemptionPlugin.java:140) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:283) The reason for this is that perUserAMUsed was populated with running apps prior to calling createTempAppForResCalculation , but then createTempAppForResCalculation loops through both running and pending apps. Attaching new patch that addresses this.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 19s Docker mode activated.
              Prechecks
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
              trunk Compile Tests
        +1 mvninstall 14m 31s trunk passed
        +1 compile 0m 36s trunk passed
        +1 checkstyle 0m 26s trunk passed
        +1 mvnsite 0m 38s trunk passed
        +1 findbugs 1m 4s trunk passed
        +1 javadoc 0m 21s trunk passed
              Patch Compile Tests
        +1 mvninstall 0m 37s the patch passed
        +1 compile 0m 35s the patch passed
        +1 javac 0m 35s the patch passed
        +1 checkstyle 0m 23s the patch passed
        +1 mvnsite 0m 36s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 10s the patch passed
        +1 javadoc 0m 19s the patch passed
              Other Tests
        -1 unit 49m 48s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 18s The patch does not generate ASF License warnings.
        73m 1s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
          hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:14b5c93
        JIRA Issue YARN-7051
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12883610/YARN-7051.002.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux d6de4cf349de 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / c2cb7ea
        Default Java 1.8.0_144
        findbugs v3.1.0-RC1
        unit https://builds.apache.org/job/PreCommit-YARN-Build/17124/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17124/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/17124/console
        Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 19s Docker mode activated.       Prechecks +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.       trunk Compile Tests +1 mvninstall 14m 31s trunk passed +1 compile 0m 36s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 38s trunk passed +1 findbugs 1m 4s trunk passed +1 javadoc 0m 21s trunk passed       Patch Compile Tests +1 mvninstall 0m 37s the patch passed +1 compile 0m 35s the patch passed +1 javac 0m 35s the patch passed +1 checkstyle 0m 23s the patch passed +1 mvnsite 0m 36s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 10s the patch passed +1 javadoc 0m 19s the patch passed       Other Tests -1 unit 49m 48s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 18s The patch does not generate ASF License warnings. 73m 1s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation   hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue YARN-7051 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12883610/YARN-7051.002.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux d6de4cf349de 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / c2cb7ea Default Java 1.8.0_144 findbugs v3.1.0-RC1 unit https://builds.apache.org/job/PreCommit-YARN-Build/17124/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17124/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/17124/console Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        sunilg Sunil G added a comment -

        Thanks Eric Payne for clarification. Adding to same thought, getAllApplications in LeafQueue then has to be under readlock also, correct?

        Show
        sunilg Sunil G added a comment - Thanks Eric Payne for clarification. Adding to same thought, getAllApplications in LeafQueue then has to be under readlock also, correct?
        Hide
        eepayne Eric Payne added a comment -

        Unit test failures are not related to this patch.

        • TestContainerAllocation: YARN-7044
        • h.y.s.r.security.TestDelegationTokenRenewer: YARN-5816
        LeafQueue#getAllApplications
          public Collection<FiCaSchedulerApp> getAllApplications() {
            Collection<FiCaSchedulerApp> apps = new HashSet<FiCaSchedulerApp>(
                pendingOrderingPolicy.getSchedulableEntities());
            apps.addAll(orderingPolicy.getSchedulableEntities());
        
            return Collections.unmodifiableCollection(apps);
          }
        

        getAllApplications in LeafQueue then has to be under readlock also, correct?

        Possibly. It looks like HashSet#addAll will iterate through orderingPolicy, which could possibly change during the loop. However, I would like to have that discussion on a separate JIRA since I may be misinterpreting how addAll works and since the usage of getAAllApplications affects more than just preemption.

        Show
        eepayne Eric Payne added a comment - Unit test failures are not related to this patch. TestContainerAllocation : YARN-7044 h.y.s.r.security.TestDelegationTokenRenewer : YARN-5816 LeafQueue#getAllApplications public Collection<FiCaSchedulerApp> getAllApplications() { Collection<FiCaSchedulerApp> apps = new HashSet<FiCaSchedulerApp>( pendingOrderingPolicy.getSchedulableEntities()); apps.addAll(orderingPolicy.getSchedulableEntities()); return Collections.unmodifiableCollection(apps); } getAllApplications in LeafQueue then has to be under readlock also, correct? Possibly. It looks like HashSet#addAll will iterate through orderingPolicy , which could possibly change during the loop. However, I would like to have that discussion on a separate JIRA since I may be misinterpreting how addAll works and since the usage of getAAllApplications affects more than just preemption.
        Hide
        sunilg Sunil G added a comment -

        Makes sense Eric Payne. Thank you. I am fine with current patch, and we could open another ticket investigate orderingpolicy entities getters. I f no other objection, i could commit this patch late today.

        Show
        sunilg Sunil G added a comment - Makes sense Eric Payne . Thank you. I am fine with current patch, and we could open another ticket investigate orderingpolicy entities getters. I f no other objection, i could commit this patch late today.
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12253 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12253/)
        YARN-7051. Avoid concurrent modification exception in (sunilg: rev 02599bda04e0ef46f4628b006f2430ad63cac97e)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/FifoIntraQueuePreemptionPlugin.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12253 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12253/ ) YARN-7051 . Avoid concurrent modification exception in (sunilg: rev 02599bda04e0ef46f4628b006f2430ad63cac97e) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/FifoIntraQueuePreemptionPlugin.java
        Hide
        sunilg Sunil G added a comment -

        Thanks Eric Payne for the contribution.
        Committed to trunk/branch-2 and branch-2.8.

        Show
        sunilg Sunil G added a comment - Thanks Eric Payne for the contribution. Committed to trunk/branch-2 and branch-2.8.
        Hide
        eepayne Eric Payne added a comment -

        Thanks Sunil G for your help in resolving this problem.

        Show
        eepayne Eric Payne added a comment - Thanks Sunil G for your help in resolving this problem.
        Hide
        djp Junping Du added a comment -

        Sounds like we forget to merge patch to branch-2.8.2. Just merge it.

        Show
        djp Junping Du added a comment - Sounds like we forget to merge patch to branch-2.8.2. Just merge it.

          People

          • Assignee:
            eepayne Eric Payne
            Reporter:
            eepayne Eric Payne
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development