SOLR-769: Support Document and Search Result clustering

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: contrib - Clustering
    • Labels: None

      Description

      Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

      The patch lays out a contrib module that starts off with an integration of a SearchComponent for doing clustering and an implementation using Carrot2. In search results mode, it will use the DocList as the input for the clustering. While Carrot2 comes with a Solr input component, it is not the same as the SearchComponent that I have, in that the Carrot2 example actually submits a query to Solr, whereas my SearchComponent is just chained into the component list and uses the ResponseBuilder to add in the cluster results.

      While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication?) for the clusters. I may push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.
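      For orientation, a clustering SearchComponent of this kind would be registered in solrconfig.xml and chained into a request handler's component list. The sketch below is illustrative only — the parameter shown and the algorithm class are assumptions, not necessarily what the final patch uses (ClusteringComponent is the class name that ended up in the contrib; LingoClusteringAlgorithm is a real Carrot2 algorithm):

```xml
<!-- Hypothetical configuration sketch; consult the contrib's README/wiki
     for the exact names shipped with the patch. -->
<searchComponent name="clustering"
                 class="org.apache.solr.handler.clustering.ClusteringComponent">
  <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```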

      1. subcluster-flattening.patch
        1 kB
        Stanislaw Osinski
      2. SOLR-769.patch
        13 kB
        Yonik Seeley
      3. SOLR-769.patch
        10 kB
        Yonik Seeley
      4. clustering-componet-shard.patch
        21 kB
        Brad Giaccio
      5. SOLR-769-analyzerClass.patch
        3 kB
        Koji Sekiguchi
      6. SOLR-769.patch
        177 kB
        Grant Ingersoll
      7. SOLR-769.tar
        1.87 MB
        Grant Ingersoll
      8. SOLR-769.patch
        177 kB
        Grant Ingersoll
      9. SOLR-769.patch
        122 kB
        Grant Ingersoll
      10. SOLR-769.zip
        42 kB
        Stanislaw Osinski
      11. SOLR-769-lib.zip
        1.68 MB
        Stanislaw Osinski
      12. SOLR-769.patch
        164 kB
        Grant Ingersoll
      13. SOLR-769.patch
        187 kB
        Grant Ingersoll
      14. SOLR-769.patch
        193 kB
        Grant Ingersoll
      15. SOLR-769.patch
        191 kB
        Grant Ingersoll
      16. SOLR-769.patch
        183 kB
        Grant Ingersoll
      17. clustering-libs.tar
        1.87 MB
        Grant Ingersoll
      18. SOLR-769.patch
        193 kB
        Grant Ingersoll
      19. SOLR-769.patch
        150 kB
        Grant Ingersoll
      20. SOLR-769.patch
        104 kB
        Grant Ingersoll
      21. clustering-libs.tar
        1.54 MB
        Grant Ingersoll

        Issue Links

          Activity

          Koji Sekiguchi added a comment -

          Uh, I needed to read the part of the recursive call. Thanks for explanation!

          Stanislaw Osinski added a comment -

          Hi Koji,

          Actually, the current code seems right: if we don't output subclusters, we need to include all documents of the cluster, including those from its subclusters, otherwise the subclusters' documents may not appear in the response at all. But if we do output subclusters, we add only the documents assigned specifically to the cluster because the subclusters with their documents will be included in the response too.

          S.

          Koji Sekiguchi added a comment -

          Apologies Grant for quoting your comment on 27/Jul/09:

          Also applied Stanislaw's patch.

          I'm confused by this line:

          List<Document> docs = outputSubClusters ? outCluster.getDocuments() : outCluster.getAllDocuments();
          

          According to Carrot2 Javadoc:

          http://download.carrot2.org/stable/javadoc/org/carrot2/core/Cluster.html#getAllDocuments%28%29

          Should it be:

          List<Document> docs = outputSubClusters ? outCluster.getAllDocuments() : outCluster.getDocuments();
          

          ?

          Koji Sekiguchi made changes -
          Component/s contrib - Clustering [ 12313050 ]
          Koji Sekiguchi added a comment -

          add component info

          Stanislaw Osinski made changes -
          Link This issue is related to SOLR-1314 [ SOLR-1314 ]
          Stanislaw Osinski added a comment -

          Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.

          Grant Ingersoll added a comment -

          Would that make sense? Should I create a separate issue for it, or rather reopen this one?

          Yes, I think that makes sense. Separate issue would be good, this one is long enough.

          Stanislaw Osinski added a comment -

          Hi Grant,

          There's one more thing: we're planning to release version 3.1.0 of Carrot2 with certain bug fixes in the clustering algorithms and better support for Chinese (using the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, but before Solr 1.4, so that the latter would have a newer version of Carrot2 on board (should be just a matter of replacing the Carrot2 JAR / upgrading the version of the downloaded dependency). Would that make sense? Should I create a separate issue for it, or rather reopen this one?

          Thanks,

          S.

          Grant Ingersoll made changes -
          Status In Progress [ 3 ] Closed [ 6 ]
          Resolution Fixed [ 1 ]
          Grant Ingersoll added a comment -

          This should be back to working and the example is now contained in contrib/clustering, plus I re-instated the downloads directory.

          Grant Ingersoll added a comment -

          OK, I have committed my changes and believe functionality is restored and is properly working with the SolrResourceLoader. Also applied Stanislaw's patch.

          Still likely need to review how to distribute all of this. My guess is that we should only include the source, including the build and instructions for installing, and not even package jars at all since we can't include the LGPL ones necessary for Carrot2.

          Grant Ingersoll added a comment -

          Note, I believe there is also a classloading issue when trying to load the Carrot2 algorithm, because it does not use the SolrResourceLoader.

          Grant Ingersoll added a comment -

          classloading issues after the handler was removed from solr.war

          I think the issue is that the changes you made don't actually include the clustering code in Solr when running the example. I think we just need to copy over the clustering JAR from the build directory into the lib, but that is a bit weird, IMO.

          To fix, I'm going to make the example target create a proper Solr home under contrib/clustering/example. Which, of course, isn't much different from how it used to be. I am also going to restore the downloads directory for packaging/release functionality.

          Grant Ingersoll made changes -
          Assignee Grant Ingersoll [ gsingers ]
          Yonik Seeley made changes -
          Assignee Yonik Seeley [ yseeley@gmail.com ]
          Yonik Seeley added a comment - edited

          un-assigning myself since I'm not sure when I'll be able to get back to this.
          Issues remaining:

          • classloading issues after the handler was removed from solr.war
          • possible packaging issues that Grant brought up (the downloaded jars shouldn't be shipped)
          • update the Wiki once classloading works and we can generate the new example output
          Stanislaw Osinski made changes -
          Attachment subcluster-flattening.patch [ 12412899 ]
          Stanislaw Osinski added a comment -

          Hi,

          While configuring the clustering component for an algorithm that returns hierarchical clusters, it took me a while to debug why subclusters wouldn't appear on the output. It turned out that the default value for the carrot.outputSubClusters parameter is false, which was the opposite of what I assumed. Would it be a problem to change the default to true, so that other users avoid the same problem?

          Another improvement worth making for the carrot.outputSubClusters = false case is "flattening" the clusters: returning all documents of the 1st level clusters, including those contained in the subclusters the user chose not to output. Without this improvement, many document-cluster assignments may be lost because some Carrot2 algorithms will assign documents only to the "leaf" (deepest in the hierarchy) clusters.

          I'm attaching a patch that implements both changes.
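          To make the flattening behavior concrete, here is a minimal, self-contained sketch. The Cluster class below is a simplified stand-in for org.carrot2.core.Cluster (not the real implementation), showing how getDocuments() and getAllDocuments() differ and how the carrot.outputSubClusters flag would select between them:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Carrot2's Cluster, for illustration only.
class Cluster {
    final List<String> documents = new ArrayList<>();
    final List<Cluster> subclusters = new ArrayList<>();

    // Documents assigned directly to this cluster.
    List<String> getDocuments() { return documents; }

    // This cluster's documents plus those of all subclusters, recursively --
    // what is wanted when subclusters are NOT output ("flattening").
    List<String> getAllDocuments() {
        List<String> all = new ArrayList<>(documents);
        for (Cluster sub : subclusters) {
            all.addAll(sub.getAllDocuments());
        }
        return all;
    }
}

public class FlatteningDemo {
    public static void main(String[] args) {
        Cluster leaf = new Cluster();
        leaf.documents.add("doc2");
        Cluster root = new Cluster();
        root.documents.add("doc1");
        root.subclusters.add(leaf);

        boolean outputSubClusters = false;
        // The line discussed in this thread: when subclusters are suppressed,
        // fold their documents into the parent; otherwise list only direct ones
        // (the subclusters carry their own documents in the response).
        List<String> docs = outputSubClusters
                ? root.getDocuments()     // subclusters rendered separately
                : root.getAllDocuments(); // flattened: doc2 folded into root
        System.out.println(docs);         // [doc1, doc2]
    }
}
```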

          Yonik Seeley added a comment -

          Apologies Brad - I didn't realize there were pending patches or I would have not done the reformat.

          Brad Giaccio added a comment -

          If you could, could my patch to handle shards be applied before you reformat, so I don't have to piece it together again and resubmit?

          Yonik Seeley added a comment -

          Of course, now that I've removed the clustering libs from the solr.war, the example no longer works for some reason... looks like all the jars are in example/clustering/solr/lib, so it's classloading issues I imagine.

          On a related note, I'm not sure how useful it is to have a clustering component with multiple plugins itself... the extra level of plugins seems to just add more complexity. Different plugins could always share utility classes, perhaps even base classes, and could strive for a common output format - all without going to an additional plugin model.

          Mark Miller added a comment -

          Anyone mind if I reformat the source files that currently use tabs?

          +1

          Yonik Seeley added a comment -

          Anyone mind if I reformat the source files that currently use tabs?

          Yonik Seeley made changes -
          Attachment SOLR-769.patch [ 12412312 ]
          Yonik Seeley added a comment -

          This fixes the SolrQueryRequest issue and also stopped the swallowing of an exception that I just happened to see.

          I'll commit shortly.

          Yonik Seeley made changes -
          Attachment SOLR-769.patch [ 12412290 ]
          Yonik Seeley added a comment -

          The attached patch implements the simpler JSON friendly format.

          example:

          [...] 
          "clusters":[
            { "labels":["DDR"],
              "docs":["TWINX2048-3200PRO","VS1GB400C3","VDBDB1A16"]
            },
            { "labels":["Car Power Adapter"],
              "docs":["F8V7067-APL-KIT","IW-02"]
            },
            { "labels":["Display"],
              "docs":["MA147LL/A","VA902B"]
            }
          
          Stanislaw Osinski added a comment -

          Is "labels" needed because there could be multiple labels per cluster in the future? (I assume yes)

          Correct. Currently neither of Carrot2's algorithms creates clusters with multiple labels, but it's quite likely that there are other algorithms that can do that.

          Grant Ingersoll added a comment -

          Makes sense, might need to refactor some of the initialization code and the abstract clustering engine, but no big deal.

          Yonik Seeley added a comment -

          I'm talking about the search results clustering, which is per-request. RequestHandlers should pretty much always use the core/searcher associated with the SolrQueryRequest. newSearcher/firstSearcher hooks set this themselves, hence it's a different searcher than one would get from getSearcher() (and could possibly even cause a deadlock). Architecturally, there could be any number of reasons to use a different searcher in the future... the SolrQueryRequest says which searcher to use.

          Grant Ingersoll added a comment -

          Also, some implementations may need lower level interfaces than Searcher, it just seems easier to have core access.

          Grant Ingersoll added a comment -

          Now that I'm looking at some of the code, is there a reason why clustering doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the core?

          Because the clustering engine gets initialized during core initialization and thus doesn't have a SolrQueryRequest at that time. Is there harm in the way it's being done? I suppose it adds an extra reference, right, meaning it could keep a core open longer?

          In the case of document clustering, I think it could be a long running job. It's not clear yet how that should work, but it is something to keep in mind. I expect to implement that sometime this summer, likely after 1.4.

          Yonik Seeley added a comment -

          Now that I'm looking at some of the code, is there a reason why clustering doesn't use a SolrQueryRequest, but instead grabs a searcher directly from the core?

          Yonik Seeley made changes -
          Assignee Grant Ingersoll [ gsingers ] Yonik Seeley [ yseeley@gmail.com ]
          Grant Ingersoll added a comment -

          Is "labels" needed because there could be multiple labels per cluster in the future? (I assume yes)

          Not sure, but likely so

          Do we need more per-doc information than just the id? (I assume no)

          I think for other algorithms like k-Means, Canopy and others (Mahout) you could reasonably expect to return:
          1. The centroid that the given document belongs to - This can be captured as the label, but it is often represented as a vector and could thus be quite long. For instance, in Mahout, we could return this as a JSON string (we're using GSON over there)
          2. The distance from the centroid used in clustering.

          Could we want other per-cluster information in the future? (I assume yes)

          See #1 in the previous.

          What other possible information could be added in the future?

          Hard to say, but the nature of this implementation is such that people will be able to plug in their own clustering algorithms, which may have different outputs. Until we have at least one other implementation, it will be difficult to "harden" the interfaces. For now, though, your proposed alterations to the format are fine with me.

          Seems like it would be nice if we could handle unknown field types gracefully?

          Yes, that would be good.

          Yonik Seeley added a comment -

          I hit an error trying to cluster some documents I added with solr cell - 400 unknown field "Author".
          Seems like it would be nice if we could handle unknown field types gracefully?

          Yonik Seeley added a comment -

          The response structure is a bit funny (it's like normal XML, which we don't really use in Solr-land), and certainly not optimal for JSON responses:

           "clusters":[
            "cluster",[
          	"labels",[
          	 "label","DDR"],
          	"docs",[
          	 "doc","TWINX2048-3200PRO",
          	 "doc","VS1GB400C3",
          	 "doc","VDBDB1A16"]],
            "cluster",[
          	"labels",[
          	 "label","Car Power Adapter"],
          	"docs",[
          	 "doc","F8V7067-APL-KIT",
          	 "doc","IW-02"]],
          [...]
          

          Is "labels" needed because there could be multiple labels per cluster in the future? (I assume yes)
          Do we need more per-doc information than just the id? (I assume no)
          Could we want other per-cluster information in the future? (I assume yes)
          What other possible information could be added in the future?

          Given the assumptions above, "clusters", "docs", and "labels" should all be arrays instead of NamedLists (the names are just repeated redundant info).
          All of the remaining NamedLists(just each "cluster") should be a SimpleOrderedMap since access by key is more important than order... that will give us something along the lines of:

          "clusters" : [
              { "labels" : ["DDR"],
          	"docs":["TWINX2048-3200PRO","VS1GB400C3","VDBDB1A16"]
              }
              ,
              { "labels" : ["Car Power Adapter"],
          	"docs":["F8V7067-APL-KIT","IW-02"]
              }
          ]
          

          Make sense?
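          As a sketch of the proposed format, the structure above can be mocked up with plain collections — LinkedHashMap standing in here for Solr's SimpleOrderedMap, and the class/method names being illustrative rather than anything from the patch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterResponseDemo {
    // Build one cluster entry: an insertion-ordered map (standing in for
    // Solr's SimpleOrderedMap) with plain arrays for "labels" and "docs",
    // rather than NamedLists with repeated "label"/"doc" keys.
    static Map<String, Object> cluster(List<String> labels, List<String> docs) {
        Map<String, Object> c = new LinkedHashMap<>();
        c.put("labels", labels);
        c.put("docs", docs);
        return c;
    }

    public static void main(String[] args) {
        // "clusters" itself is a plain array of cluster entries.
        List<Map<String, Object>> clusters = new ArrayList<>();
        clusters.add(cluster(List.of("DDR"),
                List.of("TWINX2048-3200PRO", "VS1GB400C3", "VDBDB1A16")));
        clusters.add(cluster(List.of("Car Power Adapter"),
                List.of("F8V7067-APL-KIT", "IW-02")));
        System.out.println(clusters);
    }
}
```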

          Brad Giaccio made changes -
          Attachment clustering-componet-shard.patch [ 12408762 ]
          Brad Giaccio made changes -
          Attachment clustering-componet-shard.patch [ 12409722 ]
          Brad Giaccio added a comment -

          Okay, I've rewritten the patch as I suggested. Now the clustering happens in finishStage for distributed queries and in process for non-distributed queries, both by calling the new method clusterResults. To make this happen I had to convert the interfaces and supporting code to use SolrDocumentList rather than DocList.

          I've added a unit test which extends TestDistributedSearch, I had to modify TestDistributedSearch and make a bunch of things protected. This allowed me to write a very small test case (just had to override doTest) and leave all the logic for creating shards, distributing docs, and comparing responses in TestDistributedSearch. I felt this made for a very clean way to test a single distributed component.

          Koji Sekiguchi added a comment -
          <str name="Tokenizer.analyzer">fully.qualified.class.Name</str>
          

          This works as expected w/o my patch. Thank you, Stanislaw!

          Stanislaw Osinski added a comment -

          Ah, I should have mentioned that up front – Carrot2 will try to convert the string into the type accepted by the attribute. In the case of class-typed attributes, it will try to load the class using the current thread's context classloader. Conversions are also available for numeric, boolean and enum attributes (see: http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html). Please let me know if that way works for you.

          Koji Sekiguchi added a comment -

          In fact, you can set Carrot2 attributes (both init- and request-time) in the solr config file, this should work also without the patch. Just add:

          <str name="Tokenizer.analyzer">fully.qualified.class.Name</str>

          Hmm, I thought I needed to assign a Class<?> value (rather than a String) as the second argument of the attribute. I'll try it.

          Stanislaw Osinski added a comment -

          In fact, you can set Carrot2 attributes (both init- and request-time) in the solr config file, this should work also without the patch. Just add:

          <str name="Tokenizer.analyzer">fully.qualified.class.Name</str>

          to the search component element. See http://wiki.apache.org/solr/ClusteringComponent for some example. You'll find list of Carrot2 attributes, their ids and description at: http://download.carrot2.org/stable/manual/#chapter.components.
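
          For reference, a hedged sketch of what such a configuration might look like in solrconfig.xml (the component name, engine structure, and algorithm class shown here are illustrative and should be checked against the wiki page above):

```xml
<searchComponent name="clustering"
                 class="org.apache.solr.handler.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <!-- Carrot2 attribute passed straight through to the clustering engine -->
    <str name="Tokenizer.analyzer">fully.qualified.class.Name</str>
  </lst>
</searchComponent>
```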

          Koji Sekiguchi made changes -
          Attachment SOLR-769-analyzerClass.patch [ 12408894 ]
          Koji Sekiguchi added a comment -

          patch for "carrot.analyzerClass" feature.

          Koji Sekiguchi added a comment -

          The catch with analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the initAttributes map in the init() method of CarrotClusteringEngine.

          This solves the problem. Thank you!

          Stanislaw Osinski added a comment -

          Pasting the comment I made on the list:

          The catch with analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the initAttributes map in the init() method of CarrotClusteringEngine.

          Please let me know if this solves the problem. If not, I'll investigate further.
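
          The init-time vs. request-time distinction can be illustrated with a self-contained toy engine (all names here are invented for this sketch; this is not the actual CarrotClusteringEngine code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of why an initialization-time attribute set at request time is
// silently ignored: the engine only reads it once, in init().
public class InitVsRequestAttributes {
    private String analyzerClass = "DefaultAnalyzer";

    // Initialization-time attributes are consumed here, once.
    void init(Map<String, Object> initAttributes) {
        Object a = initAttributes.get("Tokenizer.analyzer");
        if (a != null) analyzerClass = a.toString();
    }

    // Request-time attributes are consumed per call; "Tokenizer.analyzer"
    // is never looked up here, so putting it in this map has no effect.
    String cluster(Map<String, Object> requestAttributes) {
        return analyzerClass;
    }

    public static void main(String[] args) {
        // Attribute passed at request time: ignored
        InitVsRequestAttributes engine = new InitVsRequestAttributes();
        engine.init(new HashMap<>());
        Map<String, Object> request = new HashMap<>();
        request.put("Tokenizer.analyzer", "MyJapaneseAnalyzer");
        System.out.println(engine.cluster(request)); // still DefaultAnalyzer

        // Attribute passed at init time: picked up
        InitVsRequestAttributes engine2 = new InitVsRequestAttributes();
        Map<String, Object> init = new HashMap<>();
        init.put("Tokenizer.analyzer", "MyJapaneseAnalyzer");
        engine2.init(init);
        System.out.println(engine2.cluster(new HashMap<>())); // MyJapaneseAnalyzer
    }
}
```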

          Koji Sekiguchi added a comment - - edited

          (snip off from http://www.nabble.com/questions-about-Clustering-tt23681134.html)

          I'd like to use this cool stuff in environments other than English, e.g. Japanese.

          I've implemented Carrot2JapaneseAnalyzer (w/ Payload/ITokenType) for this purpose. It worked well with ClusteringDocumentList example, but didn't work with CarrotClusteringEngine.

          What I did was insert the following lines ('+') into CarrotClusteringEngine:

          attributes.put(AttributeNames.QUERY, query.toString());
          + attributes.put(AttributeUtils.getKey(Tokenizer.class, "analyzer"),
          + Carrot2JapaneseAnalyzer.class);
          

          There are no runtime errors, but Carrot2 didn't use my analyzer; it just ignored it and used ExtendedWhitespaceAnalyzer (confirmed via debugger).

          Is it a classloader problem? I placed my jar in ${solr.solr.home}/lib.

          Grant Ingersoll added a comment -

          A second option would have been to move the body of the process method to finishStage. This would have the benefit of only needing to do the clustering on the final set of responses, after the QueryComponent does its job of creating the final result set. This would also not make finishStage so dependent on what is happening in the engines when they create their cluster response.

          I would say that this is actually the correct way to do this, as opposed to just stitching the results together. For example, it may very well make sense that results from shard 1 belong in cluster A when clustered on the main node, whereas they belong to cluster B when only clustered on the shard.

          If you can make that change and then add some tests, I can commit.

          I'm still trying to wrap my head around TestDistributedSearch to see how I can provide test methods.

          Please add any insight you have to http://wiki.apache.org/solr/WritingDistributedSearchComponents.
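
          The shard-boundary effect described above can be demonstrated with a toy, self-contained example (not Solr or Carrot2 code): documents reduced to 1-D scores and grouped by a simple gap rule cluster differently when each shard is clustered separately and stitched than when the merged result set is clustered once:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration: per-shard clustering + stitching vs. clustering the
// merged result set on the coordinating node.
public class ShardClusteringDemo {
    // Group values into clusters, starting a new cluster whenever the gap
    // to the previous (sorted) value exceeds maxGap.
    static List<List<Double>> cluster(List<Double> values, double maxGap) {
        List<Double> sorted = new ArrayList<>(values);
        sorted.sort(null);
        List<List<Double>> clusters = new ArrayList<>();
        List<Double> current = new ArrayList<>();
        for (double v : sorted) {
            if (!current.isEmpty() && v - current.get(current.size() - 1) > maxGap) {
                clusters.add(current);
                current = new ArrayList<>();
            }
            current.add(v);
        }
        if (!current.isEmpty()) clusters.add(current);
        return clusters;
    }

    public static void main(String[] args) {
        List<Double> shard1 = Arrays.asList(1.0, 2.0, 9.0);
        List<Double> shard2 = Arrays.asList(3.0, 10.0);

        // Stitching per-shard clusters: {1,2} {9} from shard 1, {3} {10}
        // from shard 2 -- four clusters in total.
        List<List<Double>> stitched = new ArrayList<>();
        stitched.addAll(cluster(shard1, 1.5));
        stitched.addAll(cluster(shard2, 1.5));

        // Clustering the merged results: 3.0 bridges to 2.0 and 10.0 joins
        // 9.0, which no single shard could see -- two clusters.
        List<Double> merged = new ArrayList<>(shard1);
        merged.addAll(shard2);
        List<List<Double>> global = cluster(merged, 1.5);

        System.out.println("stitched: " + stitched.size() + " clusters");
        System.out.println("global:   " + global.size() + " clusters");
    }
}
```

          The stitched output has four clusters while the global clustering finds two, which is exactly why clustering in finishStage over the merged responses is preferable to stitching.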

          Brad Giaccio made changes -
          Attachment clustering-componet-shard.patch [ 12408762 ]
          Brad Giaccio added a comment -

          This is a patch to add shard support to the ClusteringComponent.

          Much like the recently posted spell check shard patch it simply implements finishStage and stitches the response together.

          A second option would have been to move the body of the process method to finishStage. This would have the benefit of only needing to do the clustering on the final set of responses, after the QueryComponent does its job of creating the final result set. This would also not make finishStage so dependent on what is happening in the engines when they create their cluster response.

          I'm still trying to wrap my head around TestDistributedSearch to see how I can provide test methods.

          If option 2 that I laid out is preferred I should be able to provide a patch for that as well.

          Grant Ingersoll added a comment - - edited

          Committed revision 776692.

          Thanks to everyone who helped out, especially Carrot2 creators Dawid and Stanislaw.

          Allahbaksh Mohammedali added a comment -

          Hi Grant,
          I am keenly looking forward to this feature and want to see it in action as soon as possible. When will the code be committed to the repo?

          Stanislaw Osinski added a comment -

          Thanks Grant! Looking forward to seeing the code in the repo!

          S.

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12408051 ]
          Grant Ingersoll added a comment -

          OK, I think all the ducks are in a row.

          I intend to commit on Friday.

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12405885 ]
          Attachment SOLR-769.tar [ 12405886 ]
          Grant Ingersoll added a comment -

          OK, I think this is ready to go, except I still need to double check how it works with the release. Since we can't distribute LGPL code, this is going to have to be a source-only release artifact and thus can never be in the WAR, unfortunately.

          The tarball contains the JAR files that one needs, with the exception of the LGPL deps, which are downloaded from the appropriate places.

          Grant Ingersoll made changes -
          Comment [ Where can we download nni.jar from?

          Seems like if you only need two classes it would be easy enough to replace them with your own code. ]
          Stanislaw Osinski added a comment -

          NNI JAR is indeed LGPL, it comes from MTJ: http://ressim.berlios.de/. It's also included in Carrot2 trunk, not in the main lib/ dir, but in /core/carrot2-util-matrix/lib.

          At the time we integrated it with Carrot2 (a few years ago), it used to be distributed as a separate dependency for MTJ; now it's included in the MTJ JAR. As MTJ is quite big and we need literally two classes that are in nni.jar, I'd prefer to make the NNI JAR, as it is, a part of the download, with a reference to the MTJ project. Would that make sense?

          S.

          Grant Ingersoll added a comment -

          Looks like we need to make the NNI JAR be a download, too, right? It appears to be LGPL. Where does that library come from, anyway? I don't see it on Carrot trunk, but it is in the zip. And a search for it doesn't reveal much.

          -Grant

          Stanislaw Osinski added a comment -

          Hi Grant,

          If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, you'll find licenses in the lib/ folder of the distribution. That distribution contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), so you'd need to pick only those that are relevant.

          S.

          Grant Ingersoll added a comment -

          Hi Stanislaw,

          I'm going to commit soon and I was wondering if Carrot2 has a handy place where they keep all the licenses and notices so that I can fill out Solr's NOTICE.txt and LICENSE.txt. If not, I will go collate them.

          Stanislaw Osinski added a comment -

          Also, you say C2 can handle full docs, is it feasible, then, to implement it for the "offline" mode I have in mind, whereby you cluster the whole collection offline and then store the clusters for retrieval? I haven't implemented this yet, but was thinking some people will be interested in full corpus clustering. The nice thing, then, is that as new documents come in, they can be added to existing clusters (and maybe periodically, we re-cluster). Just thinking out loud.

          We have two variables here: the length of docs and the number of docs. Carrot2 is suitable for small numbers of docs (up to say 1000). If the docs are short (a paragraph or so), the clustering should be pretty fast, suitable for on-line processing (see: http://project.carrot2.org/algorithms.html). If the documents get longer, Carrot2 will still handle them, but will require some more time for processing, I'll try to do some measurements. But C2 is not useful for the "whole collection" case – it performs all processing in-memory and here we'd need a totally different class of algorithm, something along the lines of Mahout developments.

          Hmm, that's an interesting thought. We could check to see if highlighting is done first.

          To quickly summarise the pros and cons of relying on highlighting being done outside of the clustering component:

          Pros:

          • we avoid duplication of processing (highlighting being done twice)
          • simpler code of the clustering component, less configuration

          Cons:

          • if someone doesn't want highlighting in the search results, the clustering is likely to take more time (because it operates on full documents, and it's controlled globally)
          • depending on the highlighter, we may get some markup in the summaries, which may affect clustering (I'd need to check how Carrot2 handles that)

          Should the MockClusteringAlgorithm be under the test source tree and not the main one? I moved it in the patch to follow

          Absolutely, it should be in the test source.

          I don't think we need to output the number of clusters, since that will be obvious from the list size. I dropped it in the patch to follow

          Makes sense, I kept it because the original version had it.

          Also, on the response structure, we certainly could make it optional, although it means having to go do a lookup in the real doc list, which could be less than fun.

          By "lookup" you mean the lookup in the XML response? Here again we have a trade off between the length of the response and ease of processing: if we repeat document titles / snippets in the clusters structure, we at least double the response size (at least because the same document may belong to many clusters), but can potentially save some lookups. But if we want to get some other fields of a document (other than we repeat in the clusters list), we'd still need a lookup.

          To sum up, my intuition would be to avoid duplication and stick with document ids in the cluster list (this is what we do in Carrot2 XMLs as well). Optionally, the clustering component could have a list of configurable fields to be repeated in the cluster list if that's really helpful in real-world use cases.

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12403389 ]
          Grant Ingersoll added a comment -

          Should the MockClusteringAlgorithm be under the test source tree and not the main one? I moved it in the patch to follow

          I don't think we need to output the number of clusters, since that will be obvious from the list size. I dropped it in the patch to follow

          Also, on the response structure, we certainly could make it optional, although it means having to go do a lookup in the real doc list, which could be less than fun.

          Patch to follow

          Grant Ingersoll added a comment -

          Highlighting:

          Hmm, that's an interesting thought. We could check to see if highlighting is done first.

          Also, you say C2 can handle full docs, is it feasible, then, to implement it for the "offline" mode I have in mind, whereby you cluster the whole collection offline and then store the clusters for retrieval? I haven't implemented this yet, but was thinking some people will be interested in full corpus clustering. The nice thing, then, is that as new documents come in, they can be added to existing clusters (and maybe periodically, we re-cluster). Just thinking out loud.

          Rest of the stuff in that comment sounds good. I will try out the patch.

          Stanislaw Osinski made changes -
          Attachment SOLR-769.patch [ 12401945 ]
          Stanislaw Osinski made changes -
          Attachment SOLR-769.zip [ 12402688 ]
          Stanislaw Osinski added a comment -

          Further code clean-ups, support for passing initialization-time attributes to Carrot2 algorithms, some comments in the example configuration file.

          Shalin Shekhar Mangar made changes -
          Fix Version/s 1.4 [ 12313351 ]
          Shalin Shekhar Mangar added a comment -

          Marking for 1.4 release

          Stanislaw Osinski made changes -
          Attachment SOLR-769-lib.zip [ 12402482 ]
          Stanislaw Osinski added a comment -

          Libs with Carrot2 v3.0.1, which we've just released.

          Stanislaw Osinski made changes -
          Attachment SOLR-769-lib.zip [ 12401946 ]
          Stanislaw Osinski added a comment - - edited

          Hi All,

          I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:

          1. Response structure.

          I was wondering – do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs like this:

          <lst name="clusters">
           <int name="numClusters">3</int>
           <lst name="cluster">
            <lst name="labels">
              <str name="label">GPU VPU Clocked</str>
            </lst>
            <lst name="docs">
              <str name="doc">EN7800GTX/2DHTV/256M</str>
              <str name="doc">100-435805</str>
            </lst>
           </lst>
           <lst name="cluster">
            <lst name="labels">
              <str name="label">Hard Drive</str>
            </lst>
            <lst name="docs">
              <str name="doc">6H500F0</str>
              <str name="doc">SP2514N</str>
            </lst>
           </lst>
           <lst name="cluster">
            <lst name="labels">
              <str name="label">Other Topics</str>
            </lst>
            <lst name="docs">
              <str name="doc">9885A004</str>
            </lst>
           </lst>
          </lst>

          Actually, this is what I've implemented in the patch.

          Also, in case of hierarchical clusters I've introduced a grouping entity called "clusters" so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.


          2. Build: compile warnings about missing SimpleXML

          SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but generates warnings about missing dependencies during compile time. So the option is either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of the warnings.


          3. Build: copying of protowords.txt etc

          The patch includes lexical files both in the contrib/clustering/src/java/test/resources/.... and in the examples dir. I'm not sure how this is handled though – do you keep copies in the repository or copy those somehow in the build?


          4. Highlighting

          This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to, say, a few hundred kB each); it's just the number of documents that must be on the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters.

          I was wondering, if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content, if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.


          5. Documentation (wiki) updates

          Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.

          Stanislaw Osinski made changes -
          Attachment SOLR-769.patch [ 12401945 ]
          Attachment SOLR-769-lib.zip [ 12401946 ]
          Stanislaw Osinski added a comment -

          Yet another patch, this time with passing unit tests and a working example. Will make some more comments in a sec. Please use the SOLR-769-lib.zip libs with this patch.

          Stanislaw Osinski added a comment -

          Hi Grant,

          I've added a Carrot2 issue referring to point 3 on your TODO list: http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the weekend.

          Staszek

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12399922 ]
          Grant Ingersoll added a comment -

          Here's a patch for Carrot2 3.0 that COMPILES ONLY.
          You will need to download the clustering-libs.tar.gz from http://people.apache.org/~gsingers/clustering-libs.tar.gz as it is too big to upload to JIRA.

          TODO:
          1. Tests passing and more tests
          2. Update NOTICE.txt and LICENSE.txt
          3. Get trimmed down Carrot2 library that doesn't have all the Document Source dependencies, and preferably the web services deps. Solr doesn't need the Google, etc. API deps. Preferably remove the LGPL deps too, but for now, they are downloaded via ANT from the Maven repositories.
          4. Update the Maven template
          5. Hook in the builds
          6. Make sure the example works

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12397778 ]
          Grant Ingersoll added a comment -

          Updated to trunk. See http://wiki.apache.org/solr/ClusteringComponent
          Grant Ingersoll added a comment -

          Hi Bruce,

          I haven't done any perf. testing, as I've been focused on functionality first. However, I'm not sure whether that query was the first one run, or not, so I don't know the status of the searcher, etc. I'm pretty sure I don't have any warming queries, etc.

          Stanislaw Osinski added a comment -

          Bruce,

          For performance of the clustering algorithm alone, please take a look at: http://project.carrot2.org/algorithms.html
          Obviously, you'd need to add the overhead of fetching the snippets / documents from the index. I'm not sure how many are fetched or whether they come from Solr's cache, so I'm not sure whether clustering or fetching time dominates.

          Cheers,

          Staszek

          Bruce Ritchie added a comment -

          Grant,

          This patch looks very promising, I can't wait to give it a try and find a way to incorporate it into a project I'm working on (when it's ready of course ... likely not till after Carrot2 3 is released though)

          Can you give a quick estimate as to the performance impact of enabling clustering in search results mode? In the example @ http://wiki.apache.org/solr/ClusteringFullResultsExample the query time seems pretty high and I was wondering if that was a result of this patch or something else?

          Thanks,

          Bruce Ritchie

          Vaijanath N. Rao added a comment -

          Hi Grant,

          Till now I have worked mostly with full document clustering. Had never thought of search snippet clustering. I will definitely pitch in for clustering library. There are many libraries which have favourable/acceptable licensing terms which can be added to Solr.

          --Thanks and Regards
          Vaijanath

          Grant Ingersoll added a comment -

          So what would be the procedure to add some clustering code beyond carrot or other available libraries.

          Essentially, you need to implement either a SearchClusteringEngine or a DocumentClusteringEngine and then declare it in the SearchComponent configuration, as is done with the Carrot2 example here:

          <lst name="engine">
                <!-- The name, only one can be named "default" -->
                <str name="name">default</str>
                <!-- Carrot2 specific parameters.  See the Carrot2 site for details on setting. -->
                <!-- carrot.algorithm:   Optional.  Currently only
                lingo is supported pending the release of Carrot2 3.0.  
                 -->
                <str name="carrot.algorithm">lingo</str>
                <!-- Lingo specific -->
                <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
                <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
              </lst>
          

          or, in the mock setup:

          <lst name="engine">
                <!-- The name, only one can be named "default" -->
                <str name="name">docEngine</str>
                <str name="classname">org.apache.solr.handler.clustering.MockDocumentClusteringEngine</str>
              </lst>
          

          If you don't declare the classname value, then it assumes the Carrot implementation.

          Naturally, you need to take care of all the libraries being available to Solr, etc. just as you would for any plugin.

          Since you are interested in clustering, Vaijanath, it would be good to get your feedback on the APIs. Are you doing full document clustering or just search snippet clustering? Also, if you are using an open source clustering library that has acceptable licensing terms (i.e. not GPL or similar), perhaps consider contributing an implementation of the engine and then we can make it available to everyone.
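The plugin pattern described above can be sketched as follows. Note that this uses a simplified, hypothetical interface purely for illustration: the actual SearchClusteringEngine / DocumentClusteringEngine contracts in the patch take Solr-specific types (DocList, SolrParams, etc.) rather than plain strings.

```java
import java.util.*;

// Hypothetical stand-in for the engine contract; the real interfaces
// live in org.apache.solr.handler.clustering and differ in signature.
interface ClusteringEngine {
    Map<String, List<String>> cluster(List<String> docIds, List<String> snippets);
}

// Toy engine: groups documents by the first word of their snippet.
// A real engine would delegate to a library such as Carrot2.
class FirstWordClusteringEngine implements ClusteringEngine {
    @Override
    public Map<String, List<String>> cluster(List<String> docIds, List<String> snippets) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (int i = 0; i < docIds.size(); i++) {
            String label = snippets.get(i).trim().split("\\s+")[0];
            clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(docIds.get(i));
        }
        return clusters;
    }
}

public class EngineSketch {
    public static void main(String[] args) {
        ClusteringEngine engine = new FirstWordClusteringEngine();
        Map<String, List<String>> clusters = engine.cluster(
            Arrays.asList("SP2514N", "6H500F0", "100-435805"),
            Arrays.asList("Hard Drive 250GB", "Hard Drive 500GB", "GPU Clocked 550MHz"));
        for (Map.Entry<String, List<String>> e : clusters.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A real implementation would then be wired in via the "classname" entry shown in the mock setup above, with its jars on Solr's plugin classpath.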

          Vaijanath N. Rao added a comment -

          Hi Grant,

          After just some minor copying of the .txt files, I got this working without any problems.

          So what would be the procedure to add some clustering code beyond carrot or other available libraries.

          --Thanks and Regards
          Vaijanath

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12392572 ]
          Grant Ingersoll added a comment -

          How about a patch where the tests pass? Here ya go...

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12392566 ]
          Grant Ingersoll added a comment -

          OK, here's a first scratch at the component side of document clustering. There are no implementations of the DocumentClusteringEngine yet, so I am a bit hesitant to even throw out a proposed API for that yet, but the current one is pretty generic, which is both good and bad. I don't particularly like passing around something as open as SolrParams, but I don't think I can pin down a generic set of explicit parameters either.

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12392515 ]
          Grant Ingersoll added a comment -

          Removed the alternate algorithm implementations, but left in some of the framework for adding them. The Carrot2 maintainers are likely to remove Fuzzy Ants and some of the other implementations in 3.0, which is due out sometime soon. Thus, I'd rather not support something that isn't recommended.

          I'm likely to commit this fairly soon.

          -Grant

          Grant Ingersoll added a comment -

          Note, also, that even though I put in support for some of the other C2 (Carrot2) algorithms, I don't think they quite work yet. I think they require passing in more parameters to set some algorithm properties (for instance, for Fuzzy Ants, I think you need to set a depth) and I haven't figured those out yet. If you have C2 experience, insight would be appreciated.

          For now, stick to Lingo.

          Grant Ingersoll made changes -
          Attachment clustering-libs.tar [ 12392181 ]
          Grant Ingersoll added a comment -

          Untar in contrib/clustering/lib.

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12392180 ]
          Grant Ingersoll added a comment -

          Here's a patch that actually passes the tests.

          Note, there's still a little oddity with the Snowball program that needs to be worked out, so I don't recommend running this patch in production yet. The issue is that both Carrot2 and Solr depend on Snowball, but on different versions; furthermore, Carrot2 goes one further and slightly modifies the Snowball class names.

          I will upload new libs in a minute.

          Grant Ingersoll added a comment -

          Yeah, I probably will include the other jars and make it easy to include them. For now, I wanted to get something basic working for a talk I'm giving on Wednesday night.

          Andrzej Bialecki added a comment -

          FYI, Carrot2 does support a handful of different clustering algorithms (the ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo).

          Grant Ingersoll added a comment -

          Still to do, more testing, get feedback, implement basics of doc. clustering. This last piece will take some more design work. Also need to validate some more that the results make sense for search results clustering, but my first look suggests they do.

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12391950 ]
          Grant Ingersoll added a comment -

          More updates, added example

          Grant Ingersoll made changes -
          Attachment SOLR-769.patch [ 12391945 ]
          Grant Ingersoll added a comment -

          First draft of a patch.

          Notes:

          1. Carrot2 uses the snowball stemmers, but it shouldn't clash, b/c it actually slightly changes the names of them to be like englishStemmer (as opposed to EnglishStemmer). I'm debating whether or not to just re-implement this so that it can use the same snowball stemmers we use in Solr. Probably not a big deal.

          2. I haven't implemented document clustering yet. To do this, I need to set up a background thread that will be spawned to do the clustering, since it is presumably going through some large set of documents and clustering them. It will probably require term vectors, and it will introduce a dep. on Mahout, so I'll need a version of that library too.

          3. It would be really cool for the Carrot2 implementation to support using other clustering algs besides Lingo. Basically, this just needs to be factored into the configuration and the jars included in the distribution. This is not a high priority for me at the moment.

          TODO:
          More tests.
          Decide on output format
          Implement doc. clustering framework part (i.e. spawning of threads, commands)
          ????

          Grant Ingersoll made changes -
          Attachment clustering-libs.tar [ 12391944 ]
          Grant Ingersoll added a comment -

          Clustering libs

          Grant Ingersoll added a comment -

          Patch soon, as a start. I'm going to check in the basic directory structure and libs, and then provide a patch with the source that we can iterate on.

          Grant Ingersoll made changes -
          Field Original Value New Value
          Status Open [ 1 ] In Progress [ 3 ]
          Grant Ingersoll added a comment -

          Starting docs at http://wiki.apache.org/solr/ClusteringComponent
          Grant Ingersoll created issue -

            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll
            • Votes: 6
            • Watchers: 19