I am observing cases where a single host in a cluster of 150 slaves "goes bad" with respect to Snappy compression.
Many, but not all, of its map-phase tasks produce the buggy exception message "java.lang.ClassNotFoundException: Ljava.lang.InternalError" (see HADOOP-8151) during on-disk merging. A smattering of reducer tasks across the cluster then report the same message on every attempt during the "reduce > reduce" phase, and the job fails unless I intervene manually. If I log into the rogue host and kill its tasktracker process while the job is still running, Hadoop's self-healing (rescheduling the map tasks from the dead tasktracker) seems to fix the next attempt of each formerly doomed reducer task, and the job succeeds. Subsequent jobs on the same cluster occasionally show a different message on that same bad host as well: "org.apache.hadoop.fs.ChecksumException: Checksum Error".
This evidence leads me to believe that the file system corrupted some of the intermediate map output, but that the corruption was only caught when the corrupt writes occurred during merging (and went uncaught when the last write was the corrupt one).
The strategy for aggressively detecting shuffle failures via exception regex matching (MAPREDUCE-2529) might be a useful way to solve this case as well: if a tasktracker process could shut itself down once it had detected this issue often enough, we would have no reason to intervene manually. Unfortunately, I'm only seeing this message show up after the shuffle phase is finished; we would need to scan for this exception during the map phase.
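To make the idea concrete, here is a minimal sketch of regex-based detection run over task log lines from a single host. It is not how MAPREDUCE-2529 is implemented; the function names, the threshold of 3, and the notion of feeding it log lines are all assumptions for illustration. Only the two exception messages are taken from this report.

```python
import re

# The two messages quoted in this report. Everything else in this sketch
# (function names, threshold) is hypothetical.
CORRUPTION_PATTERNS = [
    re.compile(r"java\.lang\.ClassNotFoundException: Ljava\.lang\.InternalError"),
    re.compile(r"org\.apache\.hadoop\.fs\.ChecksumException: Checksum Error"),
]


def count_corruption_errors(log_lines):
    """Count the lines that match any of the corruption patterns."""
    return sum(
        1
        for line in log_lines
        if any(pattern.search(line) for pattern in CORRUPTION_PATTERNS)
    )


def should_stop_tasktracker(log_lines, threshold=3):
    """Return True once the host has logged at least `threshold` matches.

    The threshold is an assumed tuning knob: it should be high enough to
    ignore a one-off failure, yet low enough to act before reducers fail.
    """
    return count_corruption_errors(log_lines) >= threshold
```

A monitoring script on each slave could call `should_stop_tasktracker` on the map-phase task logs and stop the tasktracker when it returns True, which would automate the manual kill described above.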
I did not see this issue on the previous version of Hadoop we were using on Amazon EMR (0.20), with LZO compression for intermediate map outputs.