A long-running section of code in PendingRangeCalculatorService is synchronized on bootstrapTokens. When a large number of nodes (hundreds in our case) are bootstrapping, gossip stalls because it waits for the same lock. Consequently, the whole cluster becomes non-functional.
I experimented with the following change in PendingRangeCalculatorService.java, and it resolved the problem in our case. The prior code had a synchronized block around the for loop.
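    // Take a copy so the loop below can run without holding the lock on bootstrapTokens.
    bootstrapTokens = new LinkedHashMap<Token, InetAddress>(bootstrapTokens);
    for (Map.Entry<Token, InetAddress> entry : bootstrapTokens.entrySet())
    {
        InetAddress endpoint = entry.getValue();
        for (Range<Token> range : strategy.getAddressRanges(allLeftMetadata).get(endpoint))
            // (inner loop body as in the existing code)
    }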
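For anyone who wants to see the idea in isolation, here is a minimal, self-contained sketch of the copy-then-iterate pattern. It is not the actual patch. The class SnapshotIteration and its methods are hypothetical, and plain strings stand in for Token and InetAddress. One assumption differs from the snippet above: the copy is taken inside a short synchronized block, so the lock is held only while copying and the copy cannot race with a writer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class SnapshotIteration {
    // Hypothetical stand-in for the shared bootstrapTokens map (token -> endpoint).
    private final Map<String, String> bootstrapTokens = new LinkedHashMap<>();

    // Writers (gossip, in the real service) take the lock only for the put.
    public void addBootstrapToken(String token, String endpoint) {
        synchronized (bootstrapTokens) {
            bootstrapTokens.put(token, endpoint);
        }
    }

    public int size() {
        synchronized (bootstrapTokens) {
            return bootstrapTokens.size();
        }
    }

    // Runs the (potentially slow) work over a snapshot of the map.
    // The lock is held only while the copy is made, not during the loop.
    public int forEachSnapshotEntry(BiConsumer<String, String> work) {
        Map<String, String> snapshot;
        synchronized (bootstrapTokens) {
            snapshot = new LinkedHashMap<>(bootstrapTokens);
        }
        int processed = 0;
        for (Map.Entry<String, String> entry : snapshot.entrySet()) {
            // The lock is not held here, so writers are free to proceed.
            work.accept(entry.getKey(), entry.getValue());
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        SnapshotIteration service = new SnapshotIteration();
        service.addBootstrapToken("token-1", "10.0.0.1");
        service.addBootstrapToken("token-2", "10.0.0.2");

        // Mutating the shared map during the loop is safe because the loop
        // walks the snapshot, not the shared map.
        int processed = service.forEachSnapshotEntry((token, endpoint) ->
                service.addBootstrapToken(token + "-late", endpoint));

        System.out.println(processed + " processed, " + service.size() + " stored");
        // prints "2 processed, 4 stored"
    }
}
```

The trade-off is that the loop may work from slightly stale data: entries added after the copy is taken are not seen until the next calculation.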