For reducers in large jobs, our users cannot easily spot the portions of the log associated with problems in their own code. An example reducer with INFO-level logging generates ~3,500 lines (~700 KiB) of log per second. 95% of the log comes from the client side of the shuffle (org.apache.hadoop.mapreduce.task.reduce.*):
$ wc syslog
  3642  48192 691013 syslog
$ grep task.reduce syslog | wc
  3424  46534 659038
$ grep task.reduce.ShuffleScheduler syslog | wc
  1521  17745 251458
$ grep task.reduce.Fetcher syslog | wc
  1045  15340 223683
$ grep task.reduce.InMemoryMapOutput syslog | wc
   400   4800  72060
$ grep task.reduce.MergeManagerImpl syslog | wc
   432   8200 106555
Byte percentage breakdown:
Shuffle total: 95%
  ShuffleScheduler: 36%
  Fetcher: 32%
  InMemoryMapOutput: 10%
  MergeManagerImpl: 15%
While this information is often useful to devops for debugging shuffle performance issues, job users are often lost in it.
We propose to have a dedicated syslog.shuffle file.
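One way this could be wired up is a log4j configuration that routes the client-side shuffle loggers to their own appender. The fragment below is only an illustrative sketch, not the proposed implementation; the appender name shuffleAppender and the file location are hypothetical (task logs would in practice go through Hadoop's TaskLogAppender machinery):

```properties
# Hypothetical sketch: route client-side shuffle classes to a separate file.
# The logger prefix matches the package that dominates the syslog above.
log4j.logger.org.apache.hadoop.mapreduce.task.reduce=INFO, shuffleAppender
# Disable additivity so these lines no longer duplicate into the main syslog.
log4j.additivity.org.apache.hadoop.mapreduce.task.reduce=false

# shuffleAppender is an illustrative name; file path is an assumption.
log4j.appender.shuffleAppender=org.apache.log4j.FileAppender
log4j.appender.shuffleAppender.File=${hadoop.log.dir}/syslog.shuffle
log4j.appender.shuffleAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.shuffleAppender.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

With additivity off, the main syslog would shrink by roughly the 95% measured above, leaving the user's own code visible, while devops could still consult syslog.shuffle for shuffle performance debugging.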