Mahout / MAHOUT-845

Make cluster top terms code more reusable

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.7
    • Component/s: Clustering
    • Labels:
      None

      Description

      When working with Mahout text clustering I find that I keep writing code similar to the contents of

      public static String getTopFeatures(Cluster cluster, String[] dictionary, int numTerms)

      in ClusterDumper in order to determine cluster labels.

      I think it would be useful if (parts of) this code were added to the cluster or vector API so that you could do something like

      Cluster cluster = ... // get the cluster from seq file iterable
      String clusterLabel = cluster.getTopTerms(1, dictionary); // Do something with the label

      I think this would make it easier to export and post-process clustering results, like indexing or storing them elsewhere.

      Thoughts?

      1. MAHOUT-845.patch
        25 kB
        Frank Scholten
      2. MAHOUT-845.patch
        20 kB
        Frank Scholten
      3. MAHOUT-845.patch
        18 kB
        Frank Scholten

        Activity

        Jeff Eastman added a comment -

        +1 We could easily add a static method to AbstractCluster, or add an operator to Vector as you suggest. If getting the n largest element indices (and values) from a Vector is useful to other applications this would be a good place to add it. It seems to me the method needs to return another sparse vector containing just the n top term indices and values. The same dictionary would still be valid since the indices are identical so it does not need to be involved. Something like:

        Vector:

        Vector getTopElements;

        Frank Scholten added a comment -

        +1 Adding that method to Vector. This way we have both top weights and top terms for any vector. We can remove the TermIndexWeight helper class from ClusterDumper. I'll submit a patch tomorrow.

        Frank Scholten added a comment -

        Here is a patch for retrieving the top k elements on Vector, implemented on AbstractVector.

        It returns an array of type Vector.Element with index and value. It can't return a Vector because then you lose the original index positions and access to the corresponding terms from the dictionary.

        I used the existing TopK class from the taste module and moved it to math commons. The top k elements code was also used by RowSimilarityJob and the Vectors class so I also updated these parts.

        Comments and suggestions welcome!

        Jake Mannix added a comment -

        Ooh, actually, I have code which does this on my github branch, in fact. I keep saying that, I really need to merge it over. So you have to be really careful with Vector.Element instances, as they are often "virtual" - the same object is reused over and over again if you iterate. So you end up getting very strange/wrong results if you try to hang onto them.

        But I'll take a look at this patch, I agree this functionality is totally needed (I needed it badly enough I just hacked it into my code) for vectors in general (have to be careful of negative values though!).
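The two caveats above (reused element objects and negative values) can be illustrated with a minimal, self-contained sketch. It does not use Mahout classes: `ReusedElement` is a hypothetical stand-in for a "virtual" element that an iterator overwrites on each step, and `Entry` and `topK` are illustrative names. The sketch copies index and value out of the shared object before keeping them, and ranks by absolute value so that large negative weights are not dropped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSnapshot {

    /** Immutable copy of one vector entry; safe to keep after iteration moves on. */
    static final class Entry {
        final int index;
        final double value;

        Entry(int index, double value) {
            this.index = index;
            this.value = value;
        }
    }

    /** Hypothetical stand-in for a reused "virtual" element: one object, mutated per step. */
    static final class ReusedElement {
        int index;
        double value;
    }

    /**
     * Returns up to k non-zero entries with the largest absolute value, largest first.
     * A min-heap bounded at k entries keeps the cost at O(n log k).
     */
    static List<Entry> topK(double[] vector, int k) {
        Comparator<Entry> byMagnitude = Comparator.comparingDouble(e -> Math.abs(e.value));
        PriorityQueue<Entry> heap = new PriorityQueue<>(byMagnitude);
        ReusedElement cursor = new ReusedElement(); // the single shared instance
        for (int i = 0; i < vector.length; i++) {
            cursor.index = i;       // the "iterator" overwrites the same object
            cursor.value = vector[i];
            if (cursor.value == 0.0) {
                continue;           // skip empty slots, as a sparse iterator would
            }
            // Copy the fields out; storing `cursor` itself would alias every entry.
            heap.add(new Entry(cursor.index, cursor.value));
            if (heap.size() > k) {
                heap.poll();        // evict the entry with the smallest magnitude
            }
        }
        List<Entry> result = new ArrayList<>(heap);
        result.sort(byMagnitude.reversed());
        return result;
    }
}
```

Whether to rank by absolute or by signed value depends on the application; for term weights that are never negative the two orderings coincide.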

        Frank Scholten added a comment -

        Newer patch that also updates ClusterDumperWriter

        Jake Mannix added a comment -

        So this is good, I like that this puts the queue work and extraction in one place, and we can merge with a dictionary later.

        What I've often found I need to do is have something along the lines of

        Vector.topKformatString(String[] dictionary, int k);

        when dealing with text / labeled stuff.

        Frank Scholten added a comment -

        Yes, you make some good points about the virtual instances and negative values. Btw, it returns a List<Vector.Element>, not an array, because TopK also returns a List.

        Frank Scholten added a comment -

        Yes, and this method would return an array or a List of Strings, right?

        Jake Mannix added a comment -

        Well, that's a good question. I've used it the same way we do Vector.asFormatString(), where the output is a String which is JSON formatted, with key-value pairs of String -> weight, in descending order by the weight.

        Frank Scholten added a comment -

        Or we could add a getTerm(String[] dictionary) method to SparseElement (see patch)

        This way the terms are decoupled from a particular format, like JSON, and you have the freedom to index them, display them or store them somewhere else.
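A rough sketch of that decoupling, under stated assumptions: it does not use Mahout's SparseElement, and `TermLookup`, `WeightedTerm`, `resolve`, and `label` are hypothetical names. The lookup step returns plain term/weight pairs, and a separate step decides how to present them.

```java
import java.util.ArrayList;
import java.util.List;

final class TermLookup {

    /** A term paired with its weight; carries no formatting decisions. */
    static final class WeightedTerm {
        final String term;
        final double weight;

        WeightedTerm(String term, double weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    /**
     * Resolves parallel arrays of indices and weights against a dictionary.
     * An index outside the dictionary falls back to its numeric form.
     */
    static List<WeightedTerm> resolve(int[] indices, double[] weights, String[] dictionary) {
        List<WeightedTerm> terms = new ArrayList<>();
        for (int i = 0; i < indices.length; i++) {
            int index = indices[i];
            boolean known = index >= 0 && index < dictionary.length;
            String term = known ? dictionary[index] : String.valueOf(index);
            terms.add(new WeightedTerm(term, weights[i]));
        }
        return terms;
    }

    /** One possible presentation: a comma-separated label. Other callers might index or store the terms. */
    static String label(List<WeightedTerm> terms) {
        StringBuilder label = new StringBuilder();
        for (WeightedTerm t : terms) {
            if (label.length() > 0) {
                label.append(", ");
            }
            label.append(t.term);
        }
        return label.toString();
    }
}
```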

        Frank Scholten added a comment -

        Updated patch with getTerm(String[] dictionary) method added to SparseElement

        Frank Scholten added a comment -

        Any feedback on the latest patch?

        Lance Norskog added a comment - - edited

        1) Is this feature useful in any other code outside Clustering?
        2) Can it be a static method? Vector has 14+ implementations, and all of them have to make sure they do nothing to screw this up. Some of them cannot support this effectively, and others would need to keep a cache as they are populated.

        Jake Mannix added a comment -

        Ok, so I've thought about this a little, and the implementation that Frank put on here, and I had on my github branch too, essentially, is probably a bad idea, for exactly Lance's points mentioned here.

        So instead, we modify VectorDumper and VectorHelper to add a couple of static methods and options:

        in VectorHelper:
        [code]
        public static String vectorToJson(Vector vector, String[] dictionary, int maxEntries, boolean sort)
        [code]

        where the "sort" option sorts by the values of the Vector entries, and maxEntries describes the maximum number of vector entries to use. If dictionary is supplied and not null, then the vector indexes are replaced with their respective term entries in the dictionary.
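To make the described behaviour concrete, here is a minimal sketch with the same parameter shape. It is not the code committed to Mahout: it takes a plain `double[]` instead of a Mahout Vector, the class name `VectorJsonSketch` is hypothetical, and it assumes that "sort" means descending absolute magnitude and that zero entries are skipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VectorJsonSketch {

    /**
     * Renders up to maxEntries non-zero entries of the vector as a JSON object.
     * Keys are dictionary terms when a dictionary is given, otherwise the indices.
     */
    static String vectorToJson(double[] vector, String[] dictionary, int maxEntries, boolean sort) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != 0.0) {
                indices.add(i);
            }
        }
        if (sort) {
            // Assumption: order by absolute magnitude, largest first; ties keep index order.
            indices.sort(Comparator.comparingDouble((Integer i) -> Math.abs(vector[i])).reversed());
        }
        int limit = Math.min(maxEntries, indices.size());
        StringBuilder json = new StringBuilder("{");
        for (int n = 0; n < limit; n++) {
            int index = indices.get(n);
            boolean hasTerm = dictionary != null && index < dictionary.length;
            String key = hasTerm ? dictionary[index] : String.valueOf(index);
            if (n > 0) {
                json.append(',');
            }
            // Escape backslashes and quotes so the key stays a valid JSON string.
            String escaped = key.replace("\\", "\\\\").replace("\"", "\\\"");
            json.append('"').append(escaped).append("\":").append(vector[index]);
        }
        return json.append('}').toString();
    }
}
```

Keeping this as a static helper means no Vector implementation has to change, which is the point raised in the comment above.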

        This way, VectorDumper is modified with the following options:
        [code]
        Option sortVectorsOpt = obuilder.withLongName("sortVectors").withRequired(false).withDescription(
            "Sort output key/value pairs of the vector entries in abs magnitude descending order")
            .withShortName("sort").create();
        Option numIndexesPerVectorOpt = obuilder.withLongName("vectorSize").withShortName("vs").withRequired(false)
            .withArgument(abuilder.withName("vs").withMinimum(1).withMaximum(1).create())
            .withDescription("Truncate vectors to <vs> length when dumping (most useful when in"
                + " conjunction with -sort)").create();
        [code]

        Then if you have clusters represented as vector centroids (or distributions over terms/features, or anything else which is a collection of Vectors linked to a dictionary of String labels for the vector indexes), then you don't really need a "ClusterDumper", as

        [code]
        $MAHOUT_HOME/bin/mahout vectordump -s "path/to/vectors/part-*" --dictionary "path/to/dictionary.file-0" -dt sequencefile -sort --vectorSize 100 -o local_vectors.json
        [code]

        puts each vector in "path/to/vectors/part-*" one per line in local_vectors.json, in json format, with the keys being the terms with the highest weight for the vector, the values being the vector values, and only the top 100 (by value) per vector are emitted.

        I've found this modification to VectorDumper invaluable in inspecting LDA topic models, but doing it without modifying the Vector interface is even better.

        Frank Scholten added a comment -

        I agree that the original solution is too invasive and the feature you are describing is nice for inspection.

        One suggestion I have is to separate the JSON formatting from retrieving the top terms.

        Jake Mannix added a comment -

        Ok, I added a couple of methods and options to VectorHelper and VectorDumper. Check out the difference between patch version 5 and 6 of: https://reviews.apache.org/r/2944

        Frank Scholten added a comment -

        Cool, looks good. The Google collections stuff is nice also.

        Frank Scholten added a comment -

        I think this would be a useful feature to have in 0.6. Can someone else have a look?

        Grant Ingersoll added a comment -

        I've got some refactorings in this area for MAHOUT-899 too. It would be good to get these two resolved soon.

        Jeff Eastman added a comment -

        Grant, is this patch ready to go in? Seems like there is some overlap with MAHOUT-899?

        Jeff Eastman added a comment -

        I downloaded the latest patch and it no longer applies without errors. Given the late date w.r.t. 0.6 code freeze and the lack of an assignee I'm moving the issue to release 0.7

        Frank Scholten added a comment -

        The latest changes were from Jake and they are already in trunk, revision 1209794, as part of MAHOUT-897

        Frank Scholten added a comment -

        So technically they are in 0.6

        Frank Scholten added a comment -

        I guess this one can be closed?


          People

          • Assignee:
            Jake Mannix
            Reporter:
            Frank Scholten
          • Votes:
            0
            Watchers:
            0
