Description
In our production environment, processing finalizeShuffleMerge takes much longer (p90 is around 20s) than other RPC requests, because finalizeShuffleMerge performs IO operations such as truncate and file open/close.
More importantly, processing finalizeShuffleMerge can block other critical lightweight messages such as authentication, which can cause authentication timeouts as well as fetch failures. These timeouts and fetch failures affect the stability of Spark job executions.
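
The blocking effect described above can be reproduced in isolation. The following is a minimal sketch, not Spark code: the class FinalizeOffloadSketch, the methods run and slowFinalize, and the 200 ms sleep are all hypothetical stand-ins. It assumes requests are handled in order on a single event-loop thread, and compares running the slow finalize step inline with handing it to a separate thread pool (one possible mitigation, shown only for illustration).

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FinalizeOffloadSketch {

    // Hypothetical stand-in for the slow file IO (truncate, open/close)
    // performed while finalizing a shuffle merge.
    static void slowFinalize(List<String> completed, CountDownLatch done) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        completed.add("finalize");
        done.countDown();
    }

    // Submits a slow "finalize" request followed by a lightweight "auth"
    // request to a single-threaded event loop and returns the order in
    // which the two requests completed.
    static List<String> run(boolean offload) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        ExecutorService ioPool = Executors.newSingleThreadExecutor();
        List<String> completed = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(2);
        try {
            eventLoop.execute(() -> {
                if (offload) {
                    // Hand the slow work to another pool; the event loop is free again.
                    ioPool.execute(() -> slowFinalize(completed, done));
                } else {
                    // Run the slow work inline; later requests must wait behind it.
                    slowFinalize(completed, done);
                }
            });
            eventLoop.execute(() -> {
                completed.add("auth");
                done.countDown();
            });
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            eventLoop.shutdown();
            ioPool.shutdown();
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("inline:    " + run(false));
        System.out.println("offloaded: " + run(true));
    }
}
```

In the inline case the auth request completes only after the finalize step has finished, which mirrors the authentication timeouts described above. In the offloaded case the auth request completes first, because the event-loop thread is no longer held by the IO work.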