Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: Integration
    • Labels:
      None

      Description

      In seq2sparse, TFIDFPartialVectorReducer and TFPartialVectorReducer should write NamedVectors. Without labels on the vector input to k-means, the cluster-dumper, at least, breaks: it no longer prints the original document ids for points.

      See: http://lucene.472066.n3.nabble.com/where-are-the-points-in-each-cluster-kmeans-clusterdump-td838683.html#a845600

      I wonder if this is also an issue with the code that generates vectors from Lucene indexes?
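
To illustrate why the missing names matter, here is a minimal, self-contained sketch. The types below (SimpleVector, PlainVector, LabeledVector, ClusterDumpSketch) are hypothetical stand-ins invented for this example; they are not Mahout's Vector or NamedVector classes, and the dumper logic is an assumption about the general idea rather than the actual cluster-dumper code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in types for illustration only; NOT Mahout's classes.
interface SimpleVector {
    Map<Integer, Double> weights();
}

// A bare vector: term index -> weight, with no document id attached.
class PlainVector implements SimpleVector {
    private final Map<Integer, Double> weights = new LinkedHashMap<>();

    PlainVector set(int index, double value) {
        weights.put(index, value);
        return this;
    }

    public Map<Integer, Double> weights() {
        return weights;
    }
}

// Wraps another vector and carries the document id as its name.
class LabeledVector implements SimpleVector {
    private final SimpleVector delegate;
    private final String name;

    LabeledVector(SimpleVector delegate, String name) {
        this.delegate = delegate;
        this.name = name;
    }

    String name() {
        return name;
    }

    public Map<Integer, Double> weights() {
        return delegate.weights();
    }
}

class ClusterDumpSketch {
    // A dumper can only print a document id if the vector carries one.
    static String describePoint(SimpleVector v) {
        if (v instanceof LabeledVector) {
            return ((LabeledVector) v).name() + " = " + v.weights();
        }
        return "(no id) = " + v.weights();
    }
}
```

If a reducer emits only the bare vector, the document id held in the record key is lost by the time points are dumped; wrapping the vector with its key at write time preserves it.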

      1. MAHOUT-401.patch
        30 kB
        Drew Farris
      2. MAHOUT-401.patch
        4 kB
        Drew Farris
      3. pv.patch
        3 kB
        Drew Farris

        Activity

        Drew Farris added a comment -

        Jeff's patch, as posted to the mahout user list. I couldn't get it to apply cleanly to my local copy, but didn't spend much time on it. I'm using this issue as a placeholder to revisit the problem.

        Drew Farris added a comment -

        This also patches PartialVectorMergeReducer – I believe this captures the 3 main cases where non-named vectors are created in the seq2sparse output.

        Drew Farris added a comment -

        Any issue with a commit on this?

        Drew Farris added a comment -

        Actually, most of this was committed as a part of MAHOUT-167 (committed in r952758) - the only thing missing was the fix to PartialVectorMergeReducer, which I've committed.

        Hudson added a comment -

        Integrated in Mahout-Quality #113 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/113/)
        MAHOUT-401: Creates NamedVectors when writing out merged vectors.

        Jake Mannix added a comment -

        So does this mean that seq2sparse will always put out NamedVectors? What about when there is no name desired or needed? Is it set to be optional?

        Drew Farris added a comment -

        "So does this mean that seq2sparse will always put out NamedVectors? What about when there is no name desired or needed? Is it set to be optional?"

        It's not optional at this point, but that's certainly a reasonable thing to do. I'll see what kind of patch I can get together for this.

        Sean Owen added a comment -

        The immediate issue appears resolved?

        Drew Farris added a comment -

        Should I open another issue for the optional NamedVector creation? I might get to this one this week too.

        Drew Farris added a comment -

        Reopening to submit patch that adds options to seq2sparse to control whether named vectors are generated.

        Drew Farris added a comment -

        This patch:

        • Adds the -nv option to SparseVectorFromSequenceFiles.
        • Enhances DictionaryVectorizerTest to assert that the proper vector types are generated.
        • Adds SparseVectorFromSequenceFilesTest to validate the proper command-line option behavior and vector types.
        • Extracts random document generation code to the RandomDocumentGenerator utility class.
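
As a rough sketch of the behavior this option controls: named output is produced only when -nv is passed. Everything below (the class NamedVectorOptionSketch and both of its methods) is hypothetical and written for this example; it does not reproduce the argument parsing or vector types in SparseVectorFromSequenceFiles.

```java
import java.util.Arrays;

// Hypothetical illustration; not code from the patch.
class NamedVectorOptionSketch {
    // Assumed helper: true when the -nv flag appears among the arguments.
    static boolean namedVectorsRequested(String[] args) {
        return Arrays.asList(args).contains("-nv");
    }

    // Renders what would be emitted for one document: the weights alone,
    // or the weights prefixed by the document id when names are requested.
    static String emit(String docId, double[] weights, boolean named) {
        String body = Arrays.toString(weights);
        return named ? docId + ":" + body : body;
    }
}
```

Keeping the unnamed form as the default means callers that have no use for document ids do not carry the extra name with every vector.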

        Sean Owen added a comment -

        I imagine you are welcome to commit this, as you know the most about it. At a glance, it seems fine to me. Go for it so we can close it out for 0.4.

        Drew Farris added a comment -

        Ok, will do asap.

        Drew Farris added a comment -

        Committed. Hudson, how's it look?

        Hudson added a comment -

        Integrated in Mahout-Quality #326 (See https://hudson.apache.org/hudson/job/Mahout-Quality/326/)
        MAHOUT-401: Use NamedVector in seq2sparse
        Adds the -nv option to SparseVectorFromSequenceFiles to create NamedVectors instead of Random or SequentialAccess vectors
        Enhances DictionaryVectorizerTest to assert that the proper vector types are generated
        Adds SparseVectorFromSequenceFilesTest to validate the proper command-line option behavior and vector types.
        Extracts random document generation code to RandomDocumentGenerator utility class

        Drew Farris added a comment -

        Thanks Hudson!


          People

          • Assignee:
            Drew Farris
            Reporter:
            Drew Farris
          • Votes:
            0
            Watchers:
            0
