CRUNCH-575

DistributedPipeline temp dir choice can collide with itself

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      We've observed that Crunch jobs can fail because the output temp dir already exists:

      2015-04-02 04:45:49,208 INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/crunch-686245394/p2/output already exists
      at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
      

      One possible cause is the choice of random directory name, which is based on a random nonnegative 32-bit int, so there are only 2^31 possible names. By the birthday bound, the chance of at least one collision exceeds 50% at about 55,000 temp dirs, which is not unimaginable.
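The 55,000 figure can be checked with the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible names and n the number of temp dirs. The following is a minimal sketch of that arithmetic only; the class and method names are hypothetical and none of it is taken from Crunch.

```java
// Sketch only: birthday-bound estimate, not code from Crunch.
final class BirthdayBound {

    // Approximate probability that at least two of n names collide when each
    // name is drawn uniformly from spaceSize possible values:
    // p ~= 1 - exp(-n(n-1) / (2 * spaceSize)).
    static double collisionProbability(double n, double spaceSize) {
        return -Math.expm1(-n * (n - 1) / (2.0 * spaceSize));
    }

    // Approximate number of names at which the collision probability
    // reaches 50%: n ~= sqrt(2 * ln(2) * spaceSize).
    static double namesForEvenOdds(double spaceSize) {
        return Math.sqrt(2.0 * Math.log(2.0) * spaceSize);
    }

    public static void main(String[] args) {
        double int31 = Math.pow(2, 31); // nonnegative 32-bit int
        double long63 = Math.pow(2, 63); // nonnegative 64-bit long
        System.out.printf("2^31 names: p(55000) = %.3f, even odds near %.0f dirs%n",
                collisionProbability(55_000, int31), namesForEvenOdds(int31));
        System.out.printf("2^63 names: even odds near %.0f dirs%n",
                namesForEvenOdds(long63));
    }
}
```

With N = 2^31 the even-odds point lands in the mid 54,000s, consistent with the "about 55,000" quoted above; with a nonnegative 64-bit value it moves into the billions.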

      A suggested fix, at least for that theoretical cause, is to generate a much larger random value. 64 bits should put this firmly in the realm of the extremely improbable (the 50% collision point moves to billions of temp dirs, not tens of thousands).
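A rough sketch of what a wider random suffix could look like is below. It is an illustration under assumptions, not the attached patch or the actual DistributedPipeline code: the class name, method names, and the use of SecureRandom are all hypothetical, and only the "crunch-" prefix is taken from the log line above.

```java
import java.security.SecureRandom;

// Hypothetical helper for illustration; not the Crunch implementation.
final class TempDirNames {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Narrow variant: nonnegative 32-bit suffix, 2^31 possible names.
    static String narrowName() {
        return "crunch-" + (RANDOM.nextInt() & Integer.MAX_VALUE);
    }

    // Wide variant: nonnegative 64-bit suffix, 2^63 possible names.
    // Masking with Long.MAX_VALUE clears the sign bit so the name
    // never contains a '-' after the prefix.
    static String wideName() {
        return "crunch-" + (RANDOM.nextLong() & Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(narrowName());
        System.out.println(wideName());
    }
}
```

The sketch uses SecureRandom rather than java.util.Random because the latter has a 48-bit seed and so cannot produce every possible long value; whether that matters in practice for temp dir names is a judgment call.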

      (HT Wilfred Spiegelenburg / CC Thomas White)

        Attachments

        1. CRUNCH_575.patch
          0.8 kB
          Sean R. Owen

              People

              • Assignee:
                jwills Josh Wills
                Reporter:
                srowen Sean R. Owen
              • Votes:
                0
                Watchers:
                3