Some relevant properties:
<description>A base for other temporary directories.</description>
The permissions on these dirs are 775. User and group match the user we run the tasktracker as. (So, with DefaultTaskController, this should work just fine.)
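For anyone who wants to repeat the permissions check on their own nodes, here is a minimal sketch of how it could be scripted. It is not part of Hadoop: the function name `check_local_dir` and the example paths are hypothetical, and the expected mode of 775 is simply the value reported above.

```python
import os
import stat


def check_local_dir(path, expected_mode=0o775, expected_uid=None, expected_gid=None):
    """Return a list of mismatches for one directory (empty list means OK).

    Hypothetical helper: only the rwx permission bits are compared, and
    owner/group are checked only when an expected value is supplied.
    """
    problems = []
    st = os.stat(path)
    if not stat.S_ISDIR(st.st_mode):
        return [f"{path}: not a directory"]

    # Mask off setuid/setgid/sticky so only the rwx bits are compared.
    mode = stat.S_IMODE(st.st_mode) & 0o777
    if mode != expected_mode:
        problems.append(f"{path}: mode {mode:o}, expected {expected_mode:o}")
    if expected_uid is not None and st.st_uid != expected_uid:
        problems.append(f"{path}: uid {st.st_uid}, expected {expected_uid}")
    if expected_gid is not None and st.st_gid != expected_gid:
        problems.append(f"{path}: gid {st.st_gid}, expected {expected_gid}")
    return problems


if __name__ == "__main__":
    # Placeholder paths; substitute the directories configured on your grid.
    for d in ["/tmp/example-local-dir-1", "/tmp/example-local-dir-2"]:
        if os.path.exists(d):
            for line in check_local_dir(d, expected_uid=os.getuid()):
                print(line)
```

Running it on every node as the tasktracker user and comparing the output is one way to confirm that the directories really are identical everywhere.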
Some other questions I've been asked over IM:
- Nodes can show failures in one run, be perfectly clean in the next, then show failures again during a third run. Some nodes will throw failures during all three.
- This problem is reflected in both map tasks and reduce tasks.
- The dir permissions really are the same across all dirs and all nodes.
- I have not tried LTC because my test grid is not configured to support it yet.
- I've been testing the Apache releases with no custom patches other than including the LZO bits.
- The number of failures per run is wildly inconsistent.
- Running 203 on the same gear with the same config shows zero failures. So this is clearly a result of something added in 204.
- Yes, enough tasks have failed during certain runs that tasktrackers are getting blacklisted from the job.
I'm currently playing with a debug jar from Owen to try to gather more information. Part of the problem is that there clearly isn't enough information on why tasks are failing. The tasktracker logs throw the symlink error, but see MAPREDUCE-2804.
The child error stack trace:
java.lang.Throwable: Child Error
Caused by: java.io.IOException: Task process exit with nonzero status of -1.
is equally unhelpful.