MAHOUT-978

spectralkmeans utility fails when input filename begins with leading underscore

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: Clustering
    • Labels: None
    • Environment:

      Tested on a real Linux-based cluster running Hadoop 0.20.2-cdh3u2 and the 0.6 release; also on an OS X pseudo-cluster running Hadoop 0.20.203.0 and the 16 Feb trunk build.

      Description

      The commandline 'bin/mahout spectralkmeans' utility fails with NoSuchElementException after "Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000" when input data in hdfs has filename beginning with a leading underscore.

      This was partially reported in comments for MAHOUT-524 but I believe identified now as a distinct issue (thanks to Shannon for help diagnosing). I have not investigated if there is an equivalent problem for API-based use of this piece of Mahout.

      Steps to reproduce:

      1. Put an affinity file into HDFS, following https://cwiki.apache.org/MAHOUT/spectral-clustering.html (note that node IDs count from zero, etc.). Name your file with a leading underscore. For example, try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it as spectral/input/_topic_skm.csv

      (I'll leave that example input file in place unchanged for others to try. It is built from dbpedia data, encoding associations from Wikipedia pages to categories. Whether it is a good use of spectral clustering I'm not sure, but I'd at least hope the job would run to completion.)

      2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o spectral/output/results1'

      3. Wait for it to fail just after printing "Loading vector from: spectral/output/results1/calculations/diagonal/part-r-00000", with java.util.NoSuchElementException at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).

      4. Rename the file in hdfs to eliminate the leading underscore. Re-run the command (give a different results dir or cleanup from the first run, to avoid mixing the tests). This attempt should succeed and you'll see it proceed deeper into the job, i.e. something like

      12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000
      12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
      12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
      12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
      12/02/19 14:38:45 INFO mapred.JobClient: map 0% reduce 0%
      12/02/19 14:39:31 INFO mapred.JobClient: map 1% reduce 0%

      (5. You might get a memory-based failure some time later; that is a separate problem.)

      I'll attach a more detailed transcript. I've made no attempt to diagnose internals yet, but did make some other tests and can confirm that it does not seem to matter whether the commandline invocation names the file explicitly, or by directory name only. Also trailing slash does not seem to be an issue. Finally, a related 'gotcha': make sure the results directory is not inside the input directory when testing.
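The 'gotcha' about the results directory can be checked before launching a job. The following is a minimal sketch using only java.nio.file, treating the HDFS-style paths as plain slash-separated strings; nothing is read from HDFS, and the names OutputPlacementCheck and isInside are hypothetical, not part of Mahout or Hadoop.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class OutputPlacementCheck {

    // True if 'candidate' is the same directory as 'parent' or lies beneath it.
    // Paths are compared lexically after normalization; no file system is consulted.
    public static boolean isInside(String parent, String candidate) {
        Path p = Paths.get(parent).normalize();
        Path c = Paths.get(candidate).normalize();
        return c.startsWith(p);
    }

    public static void main(String[] args) {
        String input = "spectral/input/";
        // Output beside the input directory: acceptable.
        System.out.println(isInside(input, "spectral/output/results1")); // false
        // Output nested inside the input directory: the case to avoid.
        System.out.println(isInside(input, "spectral/input/results1"));  // true
    }
}
```

Path.startsWith compares whole path elements, so a sibling such as spectral/input2 is not mistaken for a child of spectral/input.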

        Activity

        Dan Brickley created issue -
        Dan Brickley added a comment -

        Log of an unsuccessful run. Then renaming of input, then a successful run.

        Dan Brickley made changes -
        Field: Attachment; Original Value: (none); New Value: jira-underscore-spectral-log.txt [ 12515167 ]
        Dan Brickley added a comment -

        According to https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/404229d8b0ef044b/eb7f5d17823b63f1 files named with a leading '_' (or '.') are considered hidden files, at least in some aspects of Hadoop/HDFS. More discussion here: http://lucene.472066.n3.nabble.com/do-HDFS-files-starting-with-underscore-have-special-properties-td3305238.html

        In this light, I'd recommend treating this as a documentation issue. Not sure which other bits of Mahout use Hadoop APIs that give this same issue. I simply hadn't heard this about '_' in Hadoop, and let my own practice of naming generated files that way leak into my hdfs file naming.
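The rule described in those threads can be sketched in a few lines. The predicate below mirrors the check that Hadoop's input listing is understood to apply (names starting with '_' or '.' are skipped); it is a JDK-only illustration rather than Hadoop code, and the names HiddenNameRule, isHidden and visibleInputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HiddenNameRule {

    // Assumption: a file counts as "hidden" when its final path component
    // starts with '_' or '.'.
    public static boolean isHidden(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith("_") || name.startsWith(".");
    }

    // Returns the paths that remain after hidden files are dropped.
    public static List<String> visibleInputs(List<String> paths) {
        List<String> visible = new ArrayList<>();
        for (String p : paths) {
            if (!isHidden(p)) {
                visible.add(p);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("spectral/input/_topic_skm.csv");
        // With the underscore-prefixed file as the only input, nothing is left to process.
        System.out.println(visibleInputs(listing).isEmpty());          // true
        System.out.println(isHidden("spectral/input/topic_skm.csv"));  // false
    }
}
```

If the only input file is skipped this way, a job reading that directory sees no records. That would be consistent with the NoSuchElementException reported above, although the link was not confirmed in this issue.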

        Suneel Marthi added a comment -

        I have seen this issue in other parts of Mahout that use Hadoop APIs. The workaround is to specify a PathFilter that excludes these hidden files from processing; the fix is to pass PathFilters.logsCRCFilter() at the appropriate place in the code to get past this error.

        Dan Brickley added a comment -

        Thanks for the confirmation, Suneel. For now I've put a note into the spectral clustering wiki page.

        Suneel Marthi added a comment - edited

        Digging further into this, I suspect these lines of code in the constructor of SequenceFileValueIterator could be the cause of the issue:

        FileSystem fs = path.getFileSystem(conf);

        path = path.makeQualified(fs);

        The call path.makeQualified() seems to overwrite the actual filename that's being passed in. I could be wrong though!!

        Shannon Quinn added a comment -

        Since this isn't limited only to spectral clustering, perhaps we should put this somewhere more general in the wiki?

        Sean Owen added a comment -

        It is making the argument into a qualified version of the argument, not overwriting it.

        I am not sure it's not specific to spectral clustering. Code ought to be filtering out these files with PathFilters as Suneel says, and I have tried my best to catch the many places code has been committed that doesn't do this. Maybe this is another example.
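The point about qualification can be illustrated without Hadoop. The sketch below uses java.net.URI as a stand-in for a qualified path: a relative path is resolved against a base made of file system address plus working directory, and the file name survives unchanged. This is an analogy, not the Hadoop implementation; the class QualifyDemo, the method qualify, and the name node address and home directory are all hypothetical.

```java
import java.net.URI;

class QualifyDemo {

    // Resolves a relative path against a base URI (file system plus working directory).
    public static URI qualify(URI base, String relativePath) {
        return base.resolve(relativePath);
    }

    public static void main(String[] args) {
        // Hypothetical name node address and home directory.
        URI base = URI.create("hdfs://namenode:8020/user/dan/");
        URI qualified = qualify(base, "spectral/input/_topic_skm.csv");
        // Scheme, authority and working directory are prepended; the name is untouched.
        System.out.println(qualified);
    }
}
```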

        Suneel Marthi added a comment -

        Question:- Why do you have to rename your file to start with an '_'?

        Dan Brickley added a comment -

        I don't have to. It is/was an arbitrary personal choice. I often use _ as a prefix for script-generated files, and I was unaware of the special treatment this has in Hadoop.

        Suneel Marthi added a comment -

        Hadoop does treat files beginning with an underscore differently. I have been through the code for Spectral KMeans now and it looks good (no need to specify PathFilters, and nothing needs fixing for this issue). This is more a human error than something requiring a code fix. I agree with Dan that we need to update the wiki documentation to advise against naming input files with a leading '_'.

        Grant Ingersoll added a comment -

        I'd say, won't fix, as there is a workaround. Please re-open if there is a specific patch.

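        Grant Ingersoll made changes -
        Field: Status; Original Value: Open [ 1 ]; New Value: Resolved [ 5 ]
        Field: Resolution; Original Value: (none); New Value: Won't Fix [ 2 ]
        Suneel Marthi made changes -
        Field: Status; Original Value: Resolved [ 5 ]; New Value: Closed [ 6 ]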

          People

          • Assignee: Unassigned
          • Reporter: Dan Brickley
          • Votes: 0
          • Watchers: 1
