Details
Bug
Status: Resolved
Major
Resolution: Fixed
3.1.0, 2.9.1
None
Reviewed
Description
This has happened twice on our production cluster. When a request cannot be satisfied for a long time, it continually triggers re-reservation, and the re-reservation count eventually overflows. The resulting exception crashes the scheduler.
Exception stack:
java.lang.IllegalArgumentException: Overflow adding 1 occurrences to a count of 2147483647
    at com.google.common.collect.ConcurrentHashMultiset.add(ConcurrentHashMultiset.java:246)
    at com.google.common.collect.AbstractMultiset.add(AbstractMultiset.java:80)
    at com.google.common.collect.ConcurrentHashMultiset.add(ConcurrentHashMultiset.java:51)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.addReReservation(SchedulerApplicationAttempt.java:406)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.reserve(SchedulerApplicationAttempt.java:555)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1076)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546)
Following the handling in SchedulerApplicationAttempt#addSchedulingOpportunity, we can catch and ignore this exception in addReReservation, so the count simply stays at Integer.MAX_VALUE (2147483647).
The same problem may happen in SchedulerApplicationAttempt#addMissedNonPartitionedRequestSchedulingOpportunity; it should be fixed in the same way.
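A minimal sketch of the proposed handling, not the actual patch. To stay self-contained it does not use Guava: OverflowCheckedCounts is a hypothetical stand-in that only mimics the overflow behaviour of ConcurrentHashMultiset.add seen in the stack trace, and the addReReservation shown here is an illustrative static method, not the real SchedulerApplicationAttempt member.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReReservationSketch {

    // Hypothetical stand-in for a multiset: like the Guava class in the
    // stack trace, add() throws IllegalArgumentException instead of
    // wrapping around when a count is already Integer.MAX_VALUE.
    static final class OverflowCheckedCounts<K> {
        private final ConcurrentMap<K, Integer> counts = new ConcurrentHashMap<>();

        void add(K key) {
            counts.merge(key, 1, (old, one) -> {
                if (old == Integer.MAX_VALUE) {
                    throw new IllegalArgumentException(
                        "Overflow adding 1 occurrences to a count of " + old);
                }
                return old + one;
            });
        }

        // Test helper: force a count so the overflow can be reproduced quickly.
        void setCount(K key, int count) {
            counts.put(key, count);
        }

        int count(K key) {
            return counts.getOrDefault(key, 0);
        }
    }

    // Proposed handling: swallow the overflow so the counter saturates at
    // Integer.MAX_VALUE instead of propagating and killing the caller.
    static <K> void addReReservation(OverflowCheckedCounts<K> reReservations, K schedulerKey) {
        try {
            reReservations.add(schedulerKey);
        } catch (IllegalArgumentException e) {
            // Count is already Integer.MAX_VALUE; ignore the exception.
        }
    }

    public static void main(String[] args) {
        OverflowCheckedCounts<String> reReservations = new OverflowCheckedCounts<>();
        reReservations.setCount("key-1", Integer.MAX_VALUE);

        // Without the try/catch this call would throw.
        addReReservation(reReservations, "key-1");

        System.out.println(reReservations.count("key-1") == Integer.MAX_VALUE);
    }
}
```

The trade-off is that the count stops being exact once it saturates. That seems acceptable if, as assumed here, the value is only used as a heuristic for how long a request has been waiting.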