I've heard reports of failures of Crunch pipelines at our organization due to collision on temp directories.
Take the following stack trace from an old internal email thread I dug up as an example:
What we found in this case is that the pre-existing directory was rather old. It hung around because we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory. We intend to set up a job to delete anything older than a couple of weeks or so from /tmp, but I think the chance of a collision will still be high enough that failures like this could happen on occasion.
The temp directory Crunch chooses is a random 31-bit value:
I say 31-bit value because it comes from a 32-bit random integer but only includes positive values, thereby excluding 1 bit.
The following blog post shows some probabilities for 32-bit hash collisions, which are essentially the same problem:
Since we're dealing with 31 bits instead of 32, the probabilities will be higher than those given there for 32 bits. Even with 32 bits, the probability of a collision reaches 1 in 100 with just 9292 values.
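To put rough numbers on this, here is a minimal sketch using the standard birthday-problem approximation p ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible values. The class and method names are made up for illustration, and the count of 3500 is only an assumed example (roughly two weeks of stray directories at about 250 per day), not a measured figure.

```java
public class CollisionOdds {
    // Birthday-problem approximation for the chance that at least two of
    // n uniformly random values collide in a space of 2^bits values:
    //   p ≈ 1 - exp(-n(n-1) / (2 * 2^bits))
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (n - 1)) / (2.0 * space);
        // expm1(x) = e^x - 1, so -expm1(x) = 1 - e^x, accurate for small x
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        // 3500 is an assumed directory count for illustration;
        // 9292 is the 32-bit "1 in 100" figure quoted above.
        long[] counts = {3500L, 9292L};
        for (long n : counts) {
            System.out.printf("n=%d  31-bit: %.5f  32-bit: %.5f%n",
                    n,
                    collisionProbability(n, 31),
                    collisionProbability(n, 32));
        }
    }
}
```

Dropping from 32 to 31 bits halves the value space, which roughly doubles the collision probability for the same number of directories while the probability is small.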
I have not done any thorough investigation to understand why, but in our production environment we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per day. Depending on how aggressive we get with a scheduled job to clean old stuff out of temp we could still have a realistic chance of hitting a collision.
My proposal is to change the random integer component of the temp path to a UUID or something similar to make it drastically more unlikely that a collision will ever occur regardless of whether or not "/tmp" is ever cleaned up.
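As a rough sketch of what I have in mind, and not the actual Crunch code: the class name, method name, and "crunch-" prefix below are placeholders for illustration. It uses only `java.util.UUID.randomUUID()`, which produces a version 4 UUID with 122 random bits.

```java
import java.util.UUID;

public class TempPaths {
    // Hypothetical helper: builds a temp directory path whose unique
    // component is a random (version 4) UUID instead of a random int.
    // The "crunch-" prefix is an assumption made for this sketch.
    static String newTempDir(String baseDir) {
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/crunch-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // Each call yields a different path under the given base directory.
        System.out.println(newTempDir("/tmp"));
        System.out.println(newTempDir("/tmp"));
    }
}
```

With 122 random bits in place of 31, the birthday bound moves far beyond any number of directories we could realistically accumulate, so stale entries left in /tmp stop being a practical collision risk.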