Thanks for the patch, Maysam!
I think it would be a little easier for users if they could set the limit to zero or a negative number to disable it, rather than having to configure a giant value. The description of the property should explain that this limit only applies to writes that go through the Hadoop filesystem APIs within the task process (i.e.: writes that will update the local filesystem BYTES_WRITTEN counter). It does not cover other writes such as logging, sideband writes from subprocesses (e.g.: streaming jobs), etc.
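To make the suggestion concrete, here's a minimal sketch of the disable semantics (the class and method names are illustrative, not from the patch): a non-positive limit simply means "no limit".

```java
// Illustrative stand-in for the limit check, not the patch's actual code.
public class WriteLimitCheck {
    /**
     * Returns true if the task has exceeded its local-FS write limit.
     * A limit of zero or a negative number disables the check entirely,
     * so users don't have to configure a giant sentinel value.
     */
    public static boolean overLimit(long bytesWritten, long limitBytes) {
        if (limitBytes <= 0) {
            return false; // non-positive limit: check disabled
        }
        return bytesWritten > limitBytes;
    }
}
```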
Should we be using exit code 65? I got the impression the original code was trying to use different exit codes for different task failure reasons. Seems like this would deserve a separate code.
The warn message should say why the task is being killed; otherwise users will have little clue if the INFO message is suppressed. Speaking of users, this doesn't fail in a very graceful way for users to determine what happened. The history will just show the task exiting with exit code 65 (or some other number), with no useful diagnostic message sent to the AM explaining what went wrong. We should use umbilical.fatalError to report the fatal error before tearing down so the UI and history have a useful diagnostic that clearly shows why the task failed.
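Something along these lines is what I have in mind. The real umbilical is TaskUmbilicalProtocol and its fatalError signature differs; the stand-in interface below just illustrates the ordering I'm suggesting: send the diagnostic to the AM first, then tear down.

```java
// Hedged sketch only: FatalErrorReporter stands in for the task umbilical,
// and the Runnable stands in for the actual teardown/exit path.
public class WriteLimitFailure {
    /** Stand-in for the umbilical's fatal-error reporting method. */
    public interface FatalErrorReporter {
        void fatalError(String taskId, String diagnostic);
    }

    public static void failTask(FatalErrorReporter umbilical, String taskId,
                                long bytesWritten, long limitBytes,
                                Runnable exit) {
        String msg = "Task exceeded the local filesystem write limit: wrote "
            + bytesWritten + " bytes, limit is " + limitBytes + " bytes";
        // Report the diagnostic *before* exiting so the UI and job history
        // show why the task failed, not just a bare exit code.
        umbilical.fatalError(taskId, msg);
        exit.run(); // e.g. System.exit with a dedicated exit code
    }
}
```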
TASK_LOCAL_WRITE_LIMIT and DEFAULT_TASK_LOCAL_WRITE_LIMIT should be public static final.
Nit: write.limit.bytes should be write-limit-bytes to be consistent with local-fs; otherwise one would expect local.fs.write.limit.bytes. Normally dots separate namespaces and dashes take the place of spaces within an identifier inside a namespace. Granted, the existing code is very inconsistent about this.
It would be nice to see a test that verifies writes to the local filesystem actually trigger the failure. As it is now, the test would pass even if we were looking at the wrong counter or the counter wasn't working properly for some reason.
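Roughly, the test should drive real writes through a stream and check that the byte count those writes produce is what trips the limit. This sketch uses a byte-counting stream as a stand-in for the local filesystem statistics; the class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative only: the counting stream stands in for the local
// filesystem's BYTES_WRITTEN statistics, so the check is driven by
// actual writes rather than a hand-set counter value.
public class LocalWriteLimitTest {
    static class CountingOutputStream extends FilterOutputStream {
        long count;
        CountingOutputStream(OutputStream out) { super(out); }
        @Override public void write(int b) throws IOException {
            out.write(b);
            count++;
        }
    }

    /** Writes numBytes through the counting stream and reports whether
     *  the observed byte count exceeds limitBytes. */
    static boolean writeTrips(int numBytes, long limitBytes) throws IOException {
        CountingOutputStream cos =
            new CountingOutputStream(new ByteArrayOutputStream());
        for (int i = 0; i < numBytes; i++) {
            cos.write(0);
        }
        return cos.count > limitBytes;
    }
}
```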