Adding the design doc from SPARK-8426 here: https://docs.google.com/document/d/1R2CVKctUZG9xwD67jkRdhBR4sCgccPR2dhTYSRXFEmg/edit?usp=sharing
Also, I want to point out a change in behavior that I am proposing in the design doc: I think it's best if there is no timeout for the blacklist within one stage. Once a task gets blacklisted for a particular stage, it will stay blacklisted for the lifetime of that stage. The timeout will only apply when executors and nodes get blacklisted across all stages. This greatly simplifies the implementation, and I don't really think there is any significant downside.
OTOH, it is a behavior change from the old blacklisting.
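To make the proposed semantics concrete, here is a minimal sketch in Python (not Spark code). It assumes the per-stage entry is keyed by an executor ID; the class name `BlacklistSketch`, its methods, and `timeout_ms` are all hypothetical and are not taken from the design doc or from Spark's API. It only illustrates the difference between the two scopes: per-stage entries never expire, while cross-stage entries expire after a timeout.

```python
from dataclasses import dataclass, field


@dataclass
class BlacklistSketch:
    """Hypothetical model of the proposal; names are illustrative only."""

    timeout_ms: int
    # (stage_id, executor_id) pairs. No expiry time is stored, so an entry
    # holds for as long as the stage exists.
    stage_blacklist: set = field(default_factory=set)
    # executor_id -> expiry time in ms for the cross-stage blacklist.
    app_blacklist: dict = field(default_factory=dict)

    def blacklist_for_stage(self, stage_id, executor_id):
        # Per-stage blacklisting: recorded without any timeout.
        self.stage_blacklist.add((stage_id, executor_id))

    def blacklist_for_app(self, executor_id, now_ms):
        # Cross-stage blacklisting: recorded with an expiry time.
        self.app_blacklist[executor_id] = now_ms + self.timeout_ms

    def is_blacklisted(self, stage_id, executor_id, now_ms):
        if (stage_id, executor_id) in self.stage_blacklist:
            return True
        expiry = self.app_blacklist.get(executor_id)
        return expiry is not None and now_ms < expiry
```

With this model, an executor blacklisted for stage 1 stays excluded from stage 1 no matter how much time passes but remains usable for stage 2, whereas an executor blacklisted across all stages becomes usable again once `timeout_ms` has elapsed.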