Uploaded image for project: 'Crunch'
  1. Crunch
  2. CRUNCH-515

Decrease probability of collision on Crunch temp directories



    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.4, 0.11.0
    • Fix Version/s: 0.13.0, 0.14.0
    • Component/s: Core
    • Labels:


      I've heard reports of failures of Crunch pipelines at our organization due to collision on temp directories.

      Take the following stack trace from an old internal email thread I dug up as an example:

      2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
          at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
          at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
          at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:394)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
          at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
          at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:340)
          at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:277)
          at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:316)
          at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:113)
          at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:55)
          at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:84)
          at java.lang.Thread.run(Thread.java:682)

      What we found in this case is the pre-existing directory was rather old. It hung around because we're doing a poor job of cleaning old garbage out of our HDFS /tmp directory. We intend to set up a job to delete stuff older than a couple of weeks or so out of /tmp but I think the chances of a collision will still be high enough that failures like this might still happen on occasion.

      The temp directory Crunch chooses is a random 31-bit value:

      I say 31 bit value because it comes from a 32-bit random integer but only includes positive values, thereby excluding 1 bit.

      The following blog post shows some probabilities for 32-bit hash collisions, which are essentially the same problem:

      Since we're dealing with 31 bits instead of 32 the probabilities will be higher than expressed there for 32 bits. Even with 32 bits the probability of collision is 1 in 100 with just 9292 values.

      I have not done any thorough investigation to understand why, but in our production environment we have a lot of Crunch jobs and we are leaving 200-300 stray Crunch temp directories per day. Depending on how aggressive we get with a scheduled job to clean old stuff out of temp we could still have a realistic chance of hitting a collision.

      My proposal is to change the random integer component of the temp path to a UUID or something similar to make it drastically more unlikely that a collision will ever occur regardless of whether or not "/tmp" is ever cleaned up.


        1. CRUNCH-515-1.patch
          1 kB
          Ben Roling
        2. CRUNCH_515.patch
          3 kB
          Sean R. Owen

          Issue Links



              • Assignee:
                jwills Josh Wills
                ben.roling Ben Roling
              • Votes:
                0 Vote for this issue
                5 Start watching this issue


                • Created: