MAHOUT-834

rowsimilarityjob doesn't clean its temp dir, and fails when seeing it again

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6, 0.7
    • Fix Version/s: 0.7
    • Component/s: Integration
    • Labels:
      None

      Description

      If I do this:

      mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity

      then clean my output and rerun,

      rm -rf sims/ # (though this step doesn't even seem needed)

      then try again:

      mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity

      The temp files left from the first run make a re-run impossible - we get: "Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/weights already exists".

      Manually deleting the temp directory fixes this.

      I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:

      mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity --tempDir tmp2/

      Presumably something like HadoopUtil.delete(getConf(), tempDirPath) is needed somewhere? (And maybe --overwrite too?)
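
      (For illustration only, a minimal sketch of that idea, assuming the usual Mahout AbstractJob pattern; the exact placement inside RowSimilarityJob is an assumption, not the actual fix:)

      import org.apache.mahout.common.HadoopUtil;

      @Override
      public int run(String[] args) throws Exception {
        // ... existing option parsing ...
        // assumed placement: clear leftover intermediate output from a previous run
        // before the chained MapReduce jobs try to write temp/weights etc. again
        HadoopUtil.delete(getConf(), getTempPath());
        // ... existing job chain ...
        return 0;
      }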

      1. Mahout-834.patch
        2 kB
        Suneel Marthi
      2. Mahout-834.patch
        2 kB
        Suneel Marthi

        Activity

        Dan Brickley created issue -
        Dan Brickley added a comment -

        Ignore '(though this step doesn't even seem needed)'; that was a mistake: the org.apache.hadoop.mapred.FileAlreadyExistsException happens at the end of the job. The main point, I think, still stands.

        Sebastian Schelter added a comment -

        Isn't that the standard behavior of all jobs? Maybe we should add a generic option that makes a job delete the temp dir if it already exists.

        Sean Owen added a comment -

        Out of interest, I tried making all jobs delete their temp output. Many fail as they chain together and do use previous jobs' intermediate output, it seems. Certainly the tests inspect this output.

        This is just my own preferences or habits speaking here, but I'm accustomed to these big jobs tending to leave all their output, logs, and such around for inspection and possible debugging, with cleanup left to the caller to decide when and how to do.

        Certainly I think the current behavior is reasonable, and needs to stay the default behavior. Could there be a flag? Sure. But it seems almost harder to write (into all jobs), document, debug and support a flag that just replaces one "delete" line of code or script somewhere.

        I think it's a defensible idea; I would lean towards leaving it as-is though. Making a mistake here just gives you an exception that reasonably clearly states the problem. Making a mistake the other way might be surprising or undesirable as it's deleting stuff.

        Dan Brickley added a comment -

        Makes sense. Should --overwrite wipe out temp dirs, typically? Or should cleanup be left to outside scripting?

        Sean Owen added a comment -

        Oh, probably so. I didn't know about this option. Anything that requests to overwrite output also probably intends to overwrite temp files. Is there a spot where that's not happening?

        Dan Brickley added a comment -

        I've just gone back to do some tests. I composed a new commandline,

        mahout rowsimilarity --input matrixified/matrix --output sims_foo/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity

        ...where sims_foo/ is a new home for output. However, because the previous run was from the same base directory, we get "Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/weights already exists".

        This leads me to think that a cleaner pattern would be for these 'temp' files to be packaged inside the output directory. That seems quite common in other jobs, e.g. with clustering you get all the iterations in different subdirs, and when generating sparse vectors the output directory has something along these lines: "df-count dictionary.file-0 frequency.file-0 tf-vectors tfidf-vectors tokenized-documents wordcount". But in both cases, the mess is constrained to live within the output dir.

        To answer your question, "11/10/24 14:04:20 ERROR common.AbstractJob: Unexpected --overwrite while processing Job-Specific Options:" ... it seems rowsimilarity doesn't offer this option. Neither does the rowid job - my (mistakenly designed) patch for MAHOUT-839 also tried to sneak in --overwrite there.

        As a user, it does seem counter-intuitive for any use of 'mahout rowsimilarity' in a given directory to create a temp/ right there, blocking any re-runs of that same-named job unless the intermediate files are wiped; perhaps I might want to run the job twice over the same data, using different similarity measures? For the reasons you give above, keeping the files around makes sense, but I'd suggest having them live beneath the output dir by default.
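
        (As an aside, a hypothetical workaround under the current behaviour would be to point the existing --tempDir option somewhere beneath the output dir; the paths below are illustrative only:)

        mahout rowsimilarity --input matrixified/matrix --output sims_foo/ --tempDir sims_foo/temp --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity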

        Sean Owen added a comment -

        On the one hand I'm reluctant to mix output and intermediate output. You make a good point that cluster iterations before the final one could be considered intermediate results, but are in the output folder. I don't know how much it changes things to move it all under output; there's still the question of whether it's overwritten or not. And then you're more strongly tying together whether the output and temp both stay or are deleted.

        I'd rather avoid change unless there's a clear deficiency. I'm also trying to avoid surprise; in Hadoop-land, which is historically based on HDFS, a write-once sort of storage system, things tend not to be deleted. For example, Hadoop will refuse by default to overwrite anything, as you see. I think not-overwriting is a fine default to follow.

        Well, you proposed a flag. And there's apparently already an --overwrite flag in circulation in some jobs to control this behavior. How about we port and extend that to cover temp/output; does that substantially answer the issue here? It would avoid changes to the current behavior, which people might be used to.

        Suneel Marthi added a comment -

        1. What was the outcome of this thread? I am having the same issue and had opened a Jira ticket, MAHOUT-964, well before I saw this thread. I do agree that the RowSimilarityJob needs an overwrite option to clean up the output and temp folders from a previous run.

        2. Another concern I have is when the similarity measure specified is not a valid one, for example:

        mahout rowsimilarity --input matrixified/matrix --output sims_foo/ --numberOfColumns 27684 --similarityClassname SIMILARITY_COS --excludeSelfSimilarity

        then RowSimilarityJob should exit immediately instead of going ahead with trying to execute the Normalizer, CooccurrencesMapper and UnsymmetrifyMapper.

        3. The 'excludeSelfSimilarity' option needs to be given an explicit value of 'true' or 'false', otherwise the following always defaults to 'false':

        mahout rowsimilarity --input matrixified/matrix --output sims_foo/ --numberOfColumns 27684 --similarityClassname SIMILARITY_COSINE --excludeSelfSimilarity

        This is inconsistent with the way the --overwrite option works. Merely specifying --excludeSelfSimilarity on the command line does not set it to 'true'.

        Grant Ingersoll made changes -
        Assignee: Suneel Marthi [ smarthi ]
        Suneel Marthi made changes -
        Attachment: Mahout-834.patch [ 12512478 ]
        Suneel Marthi added a comment -

        Added --overwrite to the RowSimilarityJob CLI to be able to overwrite the output folders.

        Added call to HadoopUtil.delete(getConf(), getTempPath()) to always clear the temporary path.

        Suneel Marthi made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Affects Version/s: 0.6 [ 12316364 ]
        Suneel Marthi added a comment -

        This patch only has the fix for the issue that was originally raised, i.e. 'RowSimilarityJob doesn't clean its temp directory, and fails when seeing it again'. I'll open a separate Jira for the other issues I reported.

        Suneel Marthi made changes -
        Assignee: Suneel Marthi [ smarthi ] -> Grant Ingersoll [ gsingers ]
        Suneel Marthi made changes -
        Status: Patch Available [ 10002 ] -> Open [ 1 ]
        Affects Version/s: 0.7 [ 12319261 ]
        Assignee: Grant Ingersoll [ gsingers ] -> Suneel Marthi [ smarthi ]
        Suneel Marthi added a comment -

        Per Sebastian's feedback, fixed the patch to only delete the temp output if the -ow option has been specified.
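
        (For illustration, a rough sketch of the shape such a guarded cleanup takes with Mahout's standard option helpers; this is an assumption about the structure, not the attached patch itself:)

        // sketch only: expose the standard -ow / --overwrite flag
        addOption(DefaultOptionCreator.overwriteOption().create());

        // later in run(), after parseArguments():
        if (hasOption(DefaultOptionCreator.OVERWRITE_OPTION)) {
          // only wipe previous output and temp files when the caller asked to overwrite
          HadoopUtil.delete(getConf(), getOutputPath(), getTempPath());
        }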

        Suneel Marthi made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Fix Version/s: 0.7 [ 12319261 ]
        Suneel Marthi made changes -
        Attachment: Mahout-834.patch [ 12525838 ]
        Sebastian Schelter made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Mahout-Quality #1465 (See https://builds.apache.org/job/Mahout-Quality/1465/)
        MAHOUT-834 rowsimilarityjob doesn't clean it's temp dir, and fails when seeing it again (Revision 1335032)

        Result = SUCCESS
        ssc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1335032
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java
        Sean Owen made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee:
            Suneel Marthi
            Reporter:
            Dan Brickley
          • Votes:
            0
            Watchers:
            0
