  1. Mahout
  2. MAHOUT-834

RowSimilarityJob doesn't clean its temp dir, and fails when seeing it again


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6, 0.7
    • Fix Version/s: 0.7
    • Component/s: classic
    • Labels: None

    Description

      If I do this:

      mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity

      then clean my output:

      rm -rf sims/ # (though this step doesn't even seem needed)

      then try again:

      mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity

      The temp files left over from the first run make a re-run impossible. We get:

      Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/weights already exists

      Manually deleting the temp directory fixes this.
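
      Until the job cleans up after itself, the manual workaround can be scripted. The following is a minimal sketch, not part of Mahout: the function name clean_temp_dir and the TEMP_DIR variable are hypothetical, the default of temp is inferred from the error message above, and it assumes the temp directory lives on the local filesystem, as the rm -rf sims/ step suggests. On HDFS the equivalent would be something like hadoop fs -rm -r.

```shell
#!/bin/sh
# Workaround sketch: remove a leftover job temp directory before a re-run.
# clean_temp_dir and TEMP_DIR are hypothetical names used for illustration.

clean_temp_dir() {
  dir="$1"
  # Refuse obviously dangerous or empty targets.
  case "$dir" in
    ""|"/"|"."|"..")
      echo "refusing to delete '$dir'" >&2
      return 1
      ;;
  esac
  # rm -rf succeeds even if the directory does not exist yet.
  rm -rf -- "$dir"
}

# Assumption: "temp" is the default temp dir, as the error message suggests.
clean_temp_dir "${TEMP_DIR:-temp}"
```

      Running this immediately before each mahout rowsimilarity invocation should avoid the FileAlreadyExistsException; if --tempDir is passed, TEMP_DIR needs to be set to the same path.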

      I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:

      mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity --tempDir tmp2/

      Presumably something like HadoopUtil.delete(getConf(), tempDirPath) is needed somewhere (and maybe --overwrite too)?

      Attachments

        1. Mahout-834.patch
          2 kB
          Suneel Marthi
        2. Mahout-834.patch
          2 kB
          Suneel Marthi


          People

            Assignee: smarthi Suneel Marthi
            Reporter: danbri Dan Brickley