This is happening on a daily basis on our 25-node cluster running 0.12.3, causing serious delays.
I noticed there has been no news for about a month, so I thought I'd ask: would a workable hack be to just overwrite the file that already exists? I've not had time to look at the code, but as long as it doesn't hang the job I'm fine with a minor performance hit until a real solution can be found.

Actually, the problem might not be in the rename. The NPE should not happen in the first place.
Here is a patch that will kill the task if it ever encounters an NPE in the affected part of the code. It also logs the state of the ramfs (the files it currently holds and their lengths), along with a diagnostic message that you would see in the web UI for the killed task. Please upload those log messages whenever you see tasks failing with this exception.

Looked at this with Devaraj yesterday. Our theory about why this fails is that in MapTask.MapOutputCopier.copyOutput() there's a call to rename a temporary file to the actual .out file. After the rename, another call is made to get the length of the actual file (which is the same as that of the temporary file, obviously). If a MergeThread flushes the .out file to disk and deletes it between these two calls, the call to getLength() will fail.
Sameer's suggestion was to simply invoke getLength() on the temporary file before the rename and discard the value if the rename fails. After the file is renamed, it should be assumed that the local code no longer owns it. 1152.patch has these changes incorporated.

+1. Regardless of the race condition theory, it makes sense not to call getLength() on a renamed ramfs file (which the merge thread might already have deleted in the interval between rename and getLength()). But it seems very likely that the failure is caused by this race condition.
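The reordering discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the contents of 1152.patch: `TinyRamFs`, `copyRacy` and `copySafe` are hypothetical names, the "merge thread" is simulated by a callback that runs right after the rename, and the missing file surfaces as a `FileNotFoundException` here rather than the NPE seen in the real code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class RenameLengthSketch {

    /** Hypothetical stand-in for the ramfs; NOT Hadoop's InMemoryFileSystem. */
    static final class TinyRamFs {
        private final Map<String, byte[]> files = new HashMap<>();

        synchronized void create(String path, byte[] data) { files.put(path, data); }

        synchronized long getLength(String path) throws FileNotFoundException {
            byte[] data = files.get(path);
            if (data == null) throw new FileNotFoundException(path);
            return data.length;
        }

        /** Returns false if src is missing or dst already exists. */
        synchronized boolean rename(String src, String dst) {
            if (!files.containsKey(src) || files.containsKey(dst)) return false;
            files.put(dst, files.remove(src));
            return true;
        }

        synchronized void delete(String path) { files.remove(path); }
    }

    /** Racy order: rename first, then ask for the length of a file we no longer own. */
    static long copyRacy(TinyRamFs fs, String tmp, String out, Runnable merger)
            throws IOException {
        if (!fs.rename(tmp, out)) throw new IOException("rename failed: " + out);
        merger.run();                 // simulates the merge thread winning the race
        return fs.getLength(out);     // fails if the merger already deleted the file
    }

    /** Suggested order: read the length while the temporary file is still ours. */
    static long copySafe(TinyRamFs fs, String tmp, String out, Runnable merger)
            throws IOException {
        long length = fs.getLength(tmp);
        if (!fs.rename(tmp, out)) return -1;  // rename failed: discard the length
        merger.run();                 // the merger may delete the file; we no longer care
        return length;
    }

    public static void main(String[] args) throws IOException {
        TinyRamFs fs = new TinyRamFs();
        Runnable merger = () -> fs.delete("map.out");

        fs.create("map.out.tmp", new byte[42]);
        System.out.println("safe order length = " + copySafe(fs, "map.out.tmp", "map.out", merger));

        fs.create("map.out.tmp", new byte[42]);
        try {
            copyRacy(fs, "map.out.tmp", "map.out", merger);
        } catch (FileNotFoundException e) {
            System.out.println("racy order failed, file gone: " + e.getMessage());
        }
    }
}
```

In the sketch, `copySafe` never touches the destination path after the rename, which is the point of the suggestion: once the file has its final name, the copier treats it as owned by someone else.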
-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12355904/1152.patch
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/65/testReport/
Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

Integrated in Hadoop-Nightly #66 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/66/)
2007-03-22 18:17:28,670 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 copy failed: task_0026_m_007854_0 from __
2007-03-22 18:17:28,670 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /export/crawlspace3/kryptonite/hadoop/mapred/local/task_0026_r_000307_0/.map_7854.out.crc already exists
at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:480)
at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)