  1. Hadoop YARN
  2. YARN-7636

Re-reservation count may overflow when cluster resource exhausted for a long time


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 2.9.1
    • Fix Version/s: 3.1.0, 2.9.1, 3.0.3
    • Component/s: capacityscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      This has happened twice on our production cluster. When a request cannot be satisfied for a long time, it continually triggers re-reservation, and the re-reservation count eventually overflows. The resulting exception crashes the scheduler.

      Exception stack:

      java.lang.IllegalArgumentException: Overflow adding 1 occurrences to a count of 2147483647
              at com.google.common.collect.ConcurrentHashMultiset.add(ConcurrentHashMultiset.java:246)
              at com.google.common.collect.AbstractMultiset.add(AbstractMultiset.java:80)
              at com.google.common.collect.ConcurrentHashMultiset.add(ConcurrentHashMultiset.java:51)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.addReReservation(SchedulerApplicationAttempt.java:406)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.reserve(SchedulerApplicationAttempt.java:555)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1076)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546)
      

      Following the handling in SchedulerApplicationAttempt#addSchedulingOpportunity, we can catch and ignore this exception to avoid the problem.

      The same problem may also occur in SchedulerApplicationAttempt#addMissedNonPartitionedRequestSchedulingOpportunity; it should be fixed in the same way.
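
      The proposed handling can be illustrated with a minimal, self-contained sketch. This is not the attached patch: OverflowCheckedCounter is a hypothetical stand-in for Guava's ConcurrentHashMultiset that only mimics its behaviour of throwing IllegalArgumentException once a count would exceed Integer.MAX_VALUE, and ReReservationSketch, its field, and its method bodies are assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReReservationSketch {

    /** Hypothetical stand-in for a multiset whose add() rejects int overflow. */
    static final class OverflowCheckedCounter<K> {
        private final ConcurrentMap<K, Integer> counts = new ConcurrentHashMap<>();

        void add(K key) {
            counts.merge(key, 1, (old, one) -> {
                if (old == Integer.MAX_VALUE) {
                    // Mirrors the message seen in the stack trace above.
                    throw new IllegalArgumentException(
                        "Overflow adding 1 occurrences to a count of " + old);
                }
                return old + 1;
            });
        }

        void setCount(K key, int count) {
            counts.put(key, count);
        }

        int count(K key) {
            return counts.getOrDefault(key, 0);
        }
    }

    // Assumed field; the key type is simplified to String for this sketch.
    final OverflowCheckedCounter<String> reReservations = new OverflowCheckedCounter<>();

    /** Increment the counter, but let it saturate instead of crashing. */
    void addReReservation(String schedulerKey) {
        try {
            reReservations.add(schedulerKey);
        } catch (IllegalArgumentException e) {
            // The count is already Integer.MAX_VALUE; ignore and keep it there.
        }
    }

    int getReReservations(String schedulerKey) {
        return reReservations.count(schedulerKey);
    }

    public static void main(String[] args) {
        ReReservationSketch app = new ReReservationSketch();
        // Start just below the limit to avoid looping two billion times.
        app.reReservations.setCount("key", Integer.MAX_VALUE - 1);
        app.addReReservation("key"); // reaches Integer.MAX_VALUE
        app.addReReservation("key"); // would overflow; the exception is ignored
        System.out.println(app.getReReservations("key")); // prints 2147483647
    }
}
```

      With this approach the counter simply stays at its maximum value once the limit is reached, which is acceptable when the count is only used as a scheduling heuristic rather than as an exact tally.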

      Attachments

        1. YARN-7636.003.patch
          2 kB
          Tao Yang
        2. YARN-7636.002.patch
          2 kB
          Tao Yang
        3. YARN-7636.001.patch
          1 kB
          Tao Yang


          People

            Assignee: Tao Yang
            Reporter: Tao Yang
            Votes: 0
            Watchers: 4
