The bug is caused by NaN.
I wrote a test case to verify FairShareComparator (see patch) and found that it cannot handle weight = 0 correctly. We dumped the collection that broke the sort from our cluster (see timsort.log) to confirm whether the weight really is 0. I had assumed the weight should always be greater than or equal to 1; in fact it can be 0.
We get NaN when memorySize = 0 and weight = 0:

useToWeightRatio1 = s1.getResourceUsage().getMemorySize() /
    s1.getWeight(); // 0 / 0.0 => NaN
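A minimal standalone sketch of the failure mode (the Schedulable accessors above are from the comparator; the arithmetic below is plain Java):

```java
public class NanRatioDemo {
    public static void main(String[] args) {
        long memorySize = 0;  // s1.getResourceUsage().getMemorySize()
        double weight = 0.0;  // s1's weight when sizebasedweight gives 0

        double useToWeightRatio = memorySize / weight; // 0 / 0.0 => NaN
        System.out.println(Double.isNaN(useToWeightRatio)); // true

        // Every ordered comparison against NaN is false, so a comparator
        // built on < and > reports "equal" for NaN vs. anything. That is not
        // transitive (1 "equals" NaN, NaN "equals" 2, but 1 < 2), which is
        // exactly the contract violation TimSort detects.
        System.out.println(useToWeightRatio < 1.0); // false
        System.out.println(useToWeightRatio > 1.0); // false
    }
}
```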
I'm not sure whether this is a bug in the weight calculation itself; we can discuss that in a separate issue.
Weight can only be 0 when the demand memory is 0 and yarn.scheduler.fair.sizebasedweight is enabled.
Formula: weight = log2(1 + demand)
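Plugging demand = 0 into that formula gives weight = log2(1 + 0) = 0. A quick numeric check (the helper name is mine; it just evaluates the formula above):

```java
public class SizeBasedWeightDemo {
    // Hypothetical helper mirroring the formula weight = log2(1 + demand).
    static double weight(long demandMemory) {
        return Math.log1p(demandMemory) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(weight(0));    // 0.0 -- the problematic case
        System.out.println(weight(1));    // 1.0
        System.out.println(weight(1024)); // roughly 10
    }
}
```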
It seems a meaningful weight must be greater than or equal to 1, so the patch simply fixes the weight to 1 in that case. Either way, we need stricter handling in this code.
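A sketch of that guard applied at the point where the ratio is computed (the names are illustrative, not the exact patch):

```java
public class SafeRatioSketch {
    /** Clamp the weight to at least 1 so the use/weight ratio stays finite. */
    static double safeUseToWeightRatio(long usedMemory, double weight) {
        double w = Math.max(weight, 1.0); // weight <= 0 would yield NaN or Infinity
        return usedMemory / w;
    }

    public static void main(String[] args) {
        System.out.println(safeUseToWeightRatio(0, 0.0));    // 0.0, not NaN
        System.out.println(safeUseToWeightRatio(2048, 0.0)); // 2048.0
        System.out.println(safeUseToWeightRatio(2048, 2.0)); // 1024.0
    }
}
```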
I think there are still concurrency problems, as the issue description says.
If yarn.resourcemanager.work-preserving-recovery.enabled is set (it defaults to false in my version), the recoverContainer method can be invoked from another thread. That method modifies attemptResourceUsage, so the comparison keys can change while FSAppAttempt instances are being sorted, and the sort can go wrong.
If I'm right about this, we may need to open a new issue to fix it.
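One way to make the sort immune to whatever recoverContainer does would be to compare against immutable snapshots taken once before sorting (a sketch of the idea only, not the YARN code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SnapshotSortDemo {
    // Immutable copy of the fields the comparator needs, captured before
    // sorting so concurrent updates to the live objects cannot change the
    // ordering mid-sort.
    record UsageSnapshot(String name, long usedMemory, double weight) {
        double ratio() {
            return usedMemory / Math.max(weight, 1.0); // clamped, never NaN
        }
    }

    public static void main(String[] args) {
        List<UsageSnapshot> snapshots = new ArrayList<>(List.of(
            new UsageSnapshot("app-1", 4096, 2.0),   // ratio 2048
            new UsageSnapshot("app-2", 1024, 1.0),   // ratio 1024
            new UsageSnapshot("app-3", 0, 0.0)));    // ratio 0

        snapshots.sort(Comparator.comparingDouble(UsageSnapshot::ratio));
        System.out.println(snapshots.get(0).name()); // app-3
    }
}
```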