Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Duplicate
- 0.12.0
- None
- None
Description
We've observed that Crunch jobs can fail because the output temp dir already exists:
2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
One possible cause is the choice of random directory name, which is based on a random nonnegative 32-bit int, so there are only 2^31 possible names. By the birthday bound, the chance of at least one collision is more than 50% at about 55,000 temp dirs, which is not unimaginable.
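As a rough check on that figure, here is a minimal sketch of the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where n is the number of temp dirs and N the number of possible names. The class and method names are made up for illustration and are not part of Crunch. It assumes names are drawn uniformly and that all n directories can collide with one another.

```java
public class TempDirCollisionOdds {

    /**
     * Approximate probability of at least one collision when n names are
     * drawn uniformly at random from `space` possible values.
     */
    public static double collisionProbability(double n, double space) {
        // Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space)).
        // expm1 keeps precision when the exponent is very close to zero.
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    /** Approximate number of names at which the collision probability reaches 50%. */
    public static double namesForEvenOdds(double space) {
        // Solve 1 - exp(-n^2 / (2 * space)) = 0.5 for n.
        return Math.sqrt(2.0 * Math.log(2.0) * space);
    }

    public static void main(String[] args) {
        double space31 = Math.pow(2, 31); // nonnegative 32-bit int
        double space63 = Math.pow(2, 63); // nonnegative 64-bit long

        System.out.printf("p(55,000 names, 2^31 values) = %.3f%n",
                collisionProbability(55_000, space31));
        System.out.printf("50%% point for 2^31 values: %.0f names%n",
                namesForEvenOdds(space31));
        System.out.printf("50%% point for 2^63 values: %.3e names%n",
                namesForEvenOdds(space63));
    }
}
```

Under these assumptions the 50% point for 2^31 values lands a little under 55,000 names, while for 2^63 values it is on the order of a few billion.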
A suggested fix, at least for that theoretical cause, is to generate a much larger random value. 64 bits should put this firmly in the realm of the extremely improbable (a 50% chance of collision would take billions of temp dirs, not tens of thousands).
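A minimal sketch of what such a wider name could look like follows. This is not the actual Crunch code or the fix applied in CRUNCH-515. The class name, method names, and the use of SecureRandom are assumptions for illustration, and only the "crunch-" prefix is taken from the log above. SecureRandom is used here because java.util.Random has a 48-bit seed, so its nextLong() cannot produce every possible long value.

```java
import java.security.SecureRandom;
import java.util.Random;

public class TempDirNames {

    // Assumption: SecureRandom rather than plain java.util.Random, whose
    // 48-bit seed limits how many distinct long values nextLong() can return.
    private static final Random RANDOM = new SecureRandom();

    /** Name built from a nonnegative 32-bit int: 2^31 possible values. */
    public static String narrowName() {
        return "crunch-" + (RANDOM.nextInt() & Integer.MAX_VALUE);
    }

    /** Name built from a nonnegative 64-bit long: 2^63 possible values. */
    public static String wideName() {
        // Masking off the sign bit keeps the suffix free of a leading '-'.
        return "crunch-" + (RANDOM.nextLong() & Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(narrowName());
        System.out.println(wideName());
    }
}
```

A wider random value only lowers the odds of a clash. It does not rule one out, so the output directory check that raised the exception is still worth keeping.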
Attachments
Issue Links
- duplicates CRUNCH-515 Decrease probability of collision on Crunch temp directories (Closed)