Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: contrib - Clustering
    • Labels:
      None

      Description

      http://project.carrot2.org/release-3.2.0-notes.html

      Carrot2 is now LGPL free, which means we should be able to bundle the binary!

      1. carrot2-core-3.4.0-jdk1.5.jar
        640 kB
        Stanislaw Osinski
      2. SOLR-1804.patch
        16 kB
        Grant Ingersoll
      3. SOLR-1804-carrot2-3.4.0-dev.patch
        28 kB
        Stanislaw Osinski
      4. SOLR-1804-carrot2-3.4.0-dev-trunk.patch
        35 kB
        Stanislaw Osinski
      5. SOLR-1804-carrot2-3.4.0-libs.zip
        4.83 MB
        Stanislaw Osinski

        Issue Links

          Activity

          Stanislaw Osinski added a comment -

          Carrot2 3.2.0 uses Lucene 3.x API, so we'd need to wait until Solr upgrades too. A patch upgrading Carrot2 to 3.2.0 in Solr is here: http://issues.carrot2.org/browse/CARROT-623.

          Robert Muir added a comment -

          I wonder if you guys have any insight why the results of this test may have changed from 16 to 15 between Lucene 3.0 and Lucene 3.1-dev: http://svn.apache.org/viewvc?view=revision&revision=923048

          It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why the results would change between 3.0 and 3.1-dev.

          One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT somewhere in its code. Any ideas?
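
          For readers unfamiliar with the mechanism Robert mentions: Lucene's Version constants gate backwards-compatibility behavior, and Version.LUCENE_CURRENT always selects the newest behavior, so a library using it can silently change its output whenever the Lucene jar is upgraded. A minimal toy sketch of the pattern (illustrative only, not Lucene's actual code):

```java
// Toy sketch of Lucene-style Version gating -- illustrative only.
public class VersionDemo {
    enum Version { LUCENE_29, LUCENE_30, LUCENE_31, CURRENT }

    // Pretend the newest release (3.1) changed analysis behavior.
    static final Version NEWEST = Version.LUCENE_31;

    static String behavior(Version requested) {
        // CURRENT always resolves to the newest behavior available.
        Version effective = (requested == Version.CURRENT) ? NEWEST : requested;
        return effective.compareTo(Version.LUCENE_31) >= 0 ? "new-behavior" : "old-behavior";
    }

    public static void main(String[] args) {
        // A pinned version keeps results stable across jar upgrades...
        System.out.println(behavior(Version.LUCENE_29)); // prints "old-behavior"
        // ...while CURRENT silently tracks whatever jar is on the classpath.
        System.out.println(behavior(Version.CURRENT));   // prints "new-behavior"
    }
}
```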

          Stanislaw Osinski added a comment -

          Hi Robert,

          The Lucene dependency is the only change, right? Or did you also upgrade Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have changed, e.g. because we tuned stop words or other algorithm attributes.

          S.

          Robert Muir added a comment -

          Hi Stanislaw:

          Correct, I did not upgrade anything else, just lucene.

          I'm sorry it's not exactly related to this issue
          (although if we need to upgrade Carrot2 to be compatible with Lucene 3.x, then that's OK).

          My concern is more that we did something in Lucene between 3.0
          and now that caused the results to be different... though again
          this could be explained if somewhere in its code Carrot2 uses some
          Lucene analysis component, but doesn't hardwire Version to LUCENE_29.

          If all else fails I can try to seek out the svn rev # of Lucene that causes this change by brute-force binary search.

          Grant Ingersoll added a comment -

          Robert, instead of tracking it down by brute force, you might just dump out the clusters and see if they are still reasonable. If they are, I wouldn't worry too much about it, as it is likely due to the issues Staszek mentioned.

          Robert Muir added a comment -

          Grant, I am concerned about a possible backwards-compatibility break in Lucene trunk, that is all.
          I think it's strange that the 3.0 and 3.1 jars give different results.

          Can you tell me if the clusters are reasonable? Here is the output.

          junit.framework.AssertionFailedError: number of clusters: [
          {labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]}, 
          {labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]}, 
          {labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]}, 
          {labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]}, 
          {labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]}, 
          {labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]}, 
          {labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]}, 
          {labels=[Text Mining],docs=[6, 9, 29],clusters=[]}, 
          {labels=[Dedicated],docs=[10, 11],clusters=[]}, 
          {labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]}, 
          {labels=[Information from Large],docs=[3, 7],clusters=[]}, 
          {labels=[Neural Networks],docs=[12, 1],clusters=[]}, 
          {labels=[Open],docs=[15, 20],clusters=[]}, 
          {labels=[Research],docs=[26, 8],clusters=[]}, 
          {labels=[Other Topics],docs=[16],clusters=[]}
          ] expected:<16> but was:<15>
          
          Stanislaw Osinski added a comment -

          I was about to offer advice similar to Grant's, but wanted to wait to confirm the scope of changes.

          If it was only a Lucene dependency update, then assuming the update didn't change the documents fed to Carrot2 in tests, the results shouldn't change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the standard Lucene one, so there should be no Version.LUCENE_* issues as far as I can tell.

          I haven't got the Solr code handy, but maybe the test performs clustering on summaries generated from the original test documents, and Lucene 3.x introduces some changes in the way summaries are generated?

          If the clusters look reasonable, the problem is probably not critical, but still worth investigation to make sure it's not a bug of some kind.

          S.

          Stanislaw Osinski added a comment -

          Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be distributed together with Solr.

          S.

          Robert Muir added a comment -

          Thanks for the confirmation the clusters are ok.

          Well, this is embarrassing, it turns out it is a backwards break,
          though documented, and the culprit is yours truly.

          This is the reason it gets different results:

          * LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
            This means that terms with a position increment gap of zero do not
            affect the norms calculation by default.  (Robert Muir)
          

          I'll change the test to expect 15 clusters with Lucene 3.1. Thanks!
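
          The change quoted above can be sketched numerically. DefaultSimilarity's length norm is roughly 1/sqrt(numTerms); with discountOverlaps enabled, tokens with a position increment of zero are subtracted from the count first. A simplified illustration (ignoring Lucene's byte quantization of norms):

```java
public class NormDemo {
    // Simplified form of DefaultSimilarity's length norm: 1/sqrt(numTerms).
    // numOverlap counts tokens with position increment 0 (stacked tokens).
    static float lengthNorm(int numTokens, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? numTokens - numOverlap : numTokens;
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A field of 10 tokens, 2 of them stacked at the same position:
        float lucene30 = lengthNorm(10, 2, false); // old default
        float lucene31 = lengthNorm(10, 2, true);  // new default after LUCENE-2286
        System.out.println(lucene30 + " vs " + lucene31);
        // The norms differ for identical input, which shifts scores and can
        // change anything downstream that is sensitive to scoring.
    }
}
```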

          Grant Ingersoll added a comment -

          We should be able to go through with this now, right?

          Stanislaw Osinski added a comment -

          Hi,

          As we're near the 3.4.0 release of Carrot2, I'm including a patch that upgrades the clustering plugin. The most notable changes are:

          • [3.4.0] Carrot2 core no longer depends on Lucene APIs, so the build.xml can be enabled again. The only class that makes use of Lucene API, LuceneLanguageModelFactory, is now included in the plugin's code, so there shouldn't be any problems with refactoring. In fact, I've already updated LuceneLanguageModelFactory to remove the use of deprecated APIs.
          • [3.3.0] The STC algorithm has seen some significant scalability improvements
          • [3.2.0] Carrot2 core no longer depends on LGPL libraries, so all the JARs can now be included in Solr SVN and SOLR-2007 won't need fixing.

          Included is a patch against r966211. A ZIP with JARs will follow in a sec.

          A couple of notes:

          • The upgrade requires upgrading Google collections to Guava. This is a drop-in replacement, all tests pass for me after the upgrade, plus the upgrade is recommended on the original Google Collections site.
          • The patch includes Carrot2 3.4.0-dev JAR, but I guess it's worth committing already to avoid the library downloads hassle (SOLR-2007).
          • Out of the box, Carrot2 supports clustering of Chinese content based on the Smart Chinese Tokenizer. This tokenizer would have to be referenced from the LuceneLanguageModelFactory class in Solr. However, when compiling the code with Ant, smartcn doesn't seem to be available on the classpath. Is it a matter of modifying the build files, or is it a policy on dependencies between plugins?

          Let me know if you have any problems applying the patch.

          Thanks!

          S.

          Stanislaw Osinski added a comment -

          Libs required for the Carrot2 3.4.0 update.

          Robert Muir added a comment -

          Hi Stanislaw: this looks cool! So, carrot2 jars don't depend directly on Lucene, and we can re-enable this component in trunk, and simply maintain the LuceneLanguageModelFactory?

          As far as the smart chinese goes, it's currently not included with Solr, so I think this is why you have trouble. But could we enable a carrot2 factory for it that loads it via reflection, in case the user puts the jar in the classpath?

          Stanislaw Osinski added a comment -

          Hi Stanislaw: this looks cool! So, carrot2 jars don't depend directly on Lucene, and we can re-enable this component in trunk, and simply maintain the LuceneLanguageModelFactory?

          Correct. The only dependency on Lucene is LuceneLanguageModelFactory, which is now part of Solr code base. In fact, we could also try bringing back the clustering plugin to Solr trunk, though I haven't tried that yet.

          As far as the smart chinese goes, it's currently not included with Solr, so I think this is why you have trouble. But could we enable a carrot2 factory for it that loads it via reflection, in case the user puts the jar in the classpath?

          Essentially, the dependency on the smart chinese is optional in the sense that the lack of it will degrade the quality of clustering of Chinese content, but will not break it. Let me see if I can make it optionally loadable in LuceneLanguageModelFactory too. If not, we'll have to live with degraded clustering quality for Chinese.

          Stanislaw Osinski added a comment -

          Essentially, the dependency on the smart chinese is optional in the sense that the lack of it will degrade the quality of clustering of Chinese content, but will not break it. Let me see if I can make it optionally loadable in LuceneLanguageModelFactory too.

          I think we could handle this in a similar way as in Carrot2: attempt to load the Chinese tokenizer and fall back to the default one in case of class loading exceptions. The easiest implementation route would be to include smart chinese as a compile-time dependency of the clustering plugin, with the understanding that the library may or may not be available at runtime. Is that possible with the current Solr compilation scripts?
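
          The load-or-fall-back idea can be sketched in plain Java. The class name below is hypothetical (the real smartcn tokenizer lives in the lucene-smartcn jar, which may be absent at runtime); the point is the reflective lookup with a class-loading fallback:

```java
public class TokenizerLoader {
    // Hypothetical class name standing in for the smartcn tokenizer;
    // the jar providing it may or may not be on the runtime classpath.
    static final String SMART_CHINESE_CLASS =
            "org.apache.lucene.analysis.cn.smart.SmartChineseExampleTokenizer";

    static Object newTokenizer() {
        try {
            // Attempt to load the Chinese tokenizer reflectively...
            Class<?> clazz = Class.forName(SMART_CHINESE_CLASS);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // ...and fall back to the default tokenizer if the jar is missing.
            return new DefaultTokenizer();
        }
    }

    static class DefaultTokenizer { }

    public static void main(String[] args) {
        // Without the smartcn jar on the classpath, the fallback is used.
        System.out.println(newTokenizer().getClass().getSimpleName()); // prints "DefaultTokenizer"
    }
}
```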

          Stanislaw Osinski added a comment -

          Ok, here's another shot. This time, the language model factory includes support for Chinese. To avoid compilation issues, the classes are loaded through reflection. Not pretty, but it works. If there's a way to get access to smart chinese at compilation time, let me know and I'll remove the reflection stuff, so that refactoring is more reliable.

          Stanislaw Osinski added a comment -

          Updated dependencies.

          Stanislaw Osinski added a comment -

          A patch against solr trunk, the libs are the same as for the branch_3x patch.

          Grant Ingersoll added a comment -

          What's the status on this? I'd like to commit, but is there a new C2 version out that we should go to?

          Grant Ingersoll added a comment -

          Also, why is the Guava lib in solr/lib as opposed to contrib/clustering/lib?

          Grant Ingersoll added a comment -

          Also, is the libs zip file the complete set of libraries or do I still need ehcache and some of the others?

          Stanislaw Osinski added a comment -

          Hi Grant,

          Good timing, Carrot2 3.4.0 stable release is being built right at this moment. When the stable JAR is available, I'll re-upload the ZIP with JARs. The ZIP contains all the dependencies you need, ehcache and others are now optional.

          As I explained in an earlier comment, Google now recommends upgrading from Google Collections to Guava. The upgrade is a drop-in replacement; all Solr tests passed for me locally after the upgrade. We already switched Carrot2 to Guava, and the reason it's in solr/lib rather than contrib/clustering/lib is to avoid class path conflicts and duplication. If it's not possible to upgrade Solr's Google Collections to Guava, please let me know and I'll come up with something.

          Thanks!

          S.

          Robert Muir added a comment -

          I think this is good to see, so we can re-enable the plugin in trunk.

          For a future issue/discussion, I wonder if it would be possible to improve the language/analysis integration so that carrot2 gets the analysis defined in a Solr schema.xml instead of its own factory directly linked to Lucene stemmers, etc.? This might be a dumb question, but it just seems like it might be an easier way for a user to configure it, and it would provide more flexibility.

          Grant Ingersoll added a comment -

          Brings this up to date, fixing some build issues. The Carrot test does not pass. Staszek, perhaps you can take this, upgrade to the latest C2, and get the test to work?

          Then, I can commit.

          Grant Ingersoll added a comment -

          Also, Staszek, please check yes for donating the libraries w/ the ASL.

          Stanislaw Osinski added a comment -

          Here are the libs with Carrot2 3.4.0 JAR.

          1. Apply the patch (the patch hasn't changed)
          2. Copy the libs from the ZIP overwriting the old ones
          3. Remove Google collections from solr/lib (it's replaced by Guava from the ZIP). If you don't do that, tests will fail due to class path conflicts.

          I've just tested this on my machine with the latest branch_3x (r966551) and all tests pass. If some tests fail for you, let me know and I'll investigate.

          S.

          Stanislaw Osinski added a comment -

          Hi Robert,

          Some initial work on tighter integration with Solr should be possible after applying the patch from this issue. The patch contains a Solr-specific implementation of Carrot2's ILanguageModel interface. My rough guess is that the implementation of that interface could be further tweaked to create IStemmer and ITokenizer implementations based on the schema.xml settings. It could also implement the isCommonWord() method based on Solr's resources. A few notes though:

          • Carrot2 is slightly different from typical IR in the sense that it doesn't completely discard stop words – the tokenizer does not remove them from the token stream. The reason for this is that cluster labels are taken literally from the input text, and if we discarded stop words, the labels wouldn't be as readable.
          • The ILanguageModel#isStopLabel() method is another Carrot2-specific thing. It's a more fine-grained method of removing useless labels, especially useful for domain-specific content. Carrot2's default implementation is based on regular expressions similar to this. I'm not sure if there's a corresponding resource in Solr though.

          We're thinking of restructuring Carrot2's language model a bit in one of the next releases, so it's a good chance to include some Solr-inspired improvements as well.

          S.

          Robert Muir added a comment -

          Stanislaw: I looked at the ILanguageModel interface along with your other comments, and had these thoughts:

          For these carrot2-specific things (isCommonWord, isStopLabel), the carrot2 integration could have TokenFilters that set Attributes. So for example a Carrot2CommonWordFilter would not remove tokens from the stream, it would simply mark a boolean attribute such as Carrot2CommonWordAttribute. Stop label processing could be done the same way.

          So, carrot2 processing could be based upon a schema.xml declaration, but the configuration would use these special carrot2 filters to mark attributes that carrot2 needs.
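
          The mark-don't-remove idea can be sketched without the Lucene APIs. Below is a plain-Java analogue (not Lucene's actual TokenFilter/Attribute machinery, and Carrot2CommonWordAttribute is a hypothetical name from the comment above): a "filter" that flags common words instead of dropping them, so downstream clustering still sees every token and labels stay readable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MarkingFilterDemo {
    // Analogue of a token carrying a boolean attribute
    // (e.g. the hypothetical Carrot2CommonWordAttribute).
    static class Token {
        final String term;
        boolean commonWord;
        Token(String term) { this.term = term; }
    }

    // Unlike a stop filter, this "filter" keeps every token in the stream
    // and only sets the attribute on common words.
    static List<Token> markCommonWords(List<String> terms, Set<String> common) {
        List<Token> out = new ArrayList<>();
        for (String t : terms) {
            Token tok = new Token(t);
            tok.commonWord = common.contains(t);
            out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = markCommonWords(
                List.of("extraction", "of", "hidden", "data"),
                Set.of("of", "the"));
        for (Token t : tokens)
            System.out.println(t.term + " common=" + t.commonWord);
        // All four tokens survive; only "of" is marked as a common word.
    }
}
```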

          Grant Ingersoll added a comment -

          Trunk: Committed revision 988129.
          3.x: Committed revision 988137.

          Stanislaw Osinski added a comment -

          Hi Grant,

          Thanks for committing the patches! I noticed that the 3.x branch build failed because Carrot2 JAR had classes in Java 1.6 format. I'm attaching a Java 1.5-compliant JAR. After replacing the original JAR with the attached one, all Solr tests passed on Java 1.5 on my machine. Apologies for not checking this earlier.

          Also, I believe the last paragraph of contrib/clustering/README.txt no longer holds, as all JARs are now distributed with Solr.

          Staszek

          Stanislaw Osinski added a comment -

          One more thing: contrib/clustering in trunk seems to contain some leftovers from the time clustering was disabled: build.xml.disabled, DISABLED-README.txt and the LGPL-related paragraph in README.txt. I guess we could remove them too to avoid confusion.

          S.

          Grant Ingersoll added a comment -

Applied. Stupid OS X and its refusal to properly support 1.5.

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          David Smiley added a comment -

          Quick question: Why is guava in solr/lib instead of solr/contrib/clustering/lib ? I did a search on trunk for use of Guava and it is strictly limited to this contrib module. Does placement of this lib here signal that use of Guava in other parts of Solr is okay? (Guava is pretty cool so that would be nice)

          Yonik Seeley added a comment -

IIRC, Noble intended to use it and moved it to lib (see SOLR-1707 for one potential use).
IMO, its use in other parts of Solr is fine (just don't automatically assume it's faster/better).

          Dawid Weiss added a comment -

          I agree with Yonik – Guava is compact and neat to use and we use it all the time, but I'd be careful with automatic replacement of certain constructs in performance-critical loops. It's well written, but certain methods sacrifice some performance for code beauty.

          Steve Rowe added a comment -

          FYI, Dawid's SOLR-2378 introduced use of Guava outside of the clustering contrib.

          David Smiley added a comment -

Steve, the patch for SOLR-2378 did use Guava, but whoever committed it to trunk must have backed that out: I looked at the same class file, FSTLookup.java, on an updated trunk, and there is no applicable import statement, though it was there in the patch.

          Again, if you do a search for "com.google.common" on any .java file in trunk, you'll only find it in the clustering contrib module.

          Dawid Weiss added a comment -

I honestly don't remember what it was I used Guava for in FSTLookup, but it's most likely the generic-less constructors for Lists or Maps... nothing fancy. And I think Robert might have removed the dependency when he moved FSTLookup to a module.

          Steve Rowe added a comment -

          if you do a search for "com.google.common" on any .java file in trunk, you'll only find it in the clustering contrib module.

          Right: grep 'com\.google' $(find . -name '*.java') only returns files under solr/contrib/clustering/.

          I think Robert might have removed the dependency when he moved FSTLookup to a module.

          Yup, e.g. Lists.newArrayList() -> new ArrayList<Entry>() : <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTLookup.java?r1=1097216&r2=1126642&diff_format=h#l154>.

So from April 14th through May 23rd, Solr did have a Guava dependency.

          Robert Muir added a comment -

          Yup, e.g. Lists.newArrayList() -> new ArrayList<Entry>() : <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTLookup.java?r1=1097216&r2=1126642&diff_format=h#l154>.

          Well not just that, e.g.:

          final List<String> input = Lists.newArrayList(Iterables.transform(benchmarkInput, new Function<TermFreq, String>() {
            public String apply(TermFreq tf) {
              return tf.term.substring(0, Math.min(tf.term.length(),
                  minPrefixLen + random.nextInt(maxPrefixLen - minPrefixLen + 1)));
            }
          }));
          

          versus

          final List<String> input = new ArrayList<String>(benchmarkInput.size());
          for (TermFreq tf : benchmarkInput) {
            input.add(tf.term.substring(0, Math.min(tf.term.length(),
                minPrefixLen + random.nextInt(maxPrefixLen - minPrefixLen + 1))));
          }
          

One of these any of us can look at instantly and know what's going on. The other is more like the C++ STL;
sure, if the developer is really cautious and it doesn't get out of hand, it might stay understandable.

I suppose in all this refactoring, I didn't see what Guava bought us at all, except extra method calls. Guess
I'm just not a fan.

          http://svn.apache.org/viewvc/lucene/dev/trunk/modules/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java?r1=1097216&r2=1126642&diff_format=h
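The readability contrast above can be reproduced in a self-contained form. This is an illustrative sketch only: the TermFreq/prefix logic is replaced with a trivial upper-casing step, and a minimal Function interface stands in for Guava's, since the original snippets depend on Lucene test code and the Guava classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TransformStyles {
    // Minimal stand-in for Guava's Function interface.
    interface Function<F, T> { T apply(F input); }

    // Functional style: the transformation is wrapped in an anonymous class,
    // so the reader has to unpack a layer of indirection to see the work.
    static List<String> functional(List<String> in) {
        Function<String, String> fn = new Function<String, String>() {
            public String apply(String s) { return s.toUpperCase(); }
        };
        List<String> out = new ArrayList<String>(in.size());
        for (String s : in) {
            out.add(fn.apply(s));
        }
        return out;
    }

    // Plain loop: the same work, visible at a glance.
    static List<String> plainLoop(List<String> in) {
        List<String> out = new ArrayList<String>(in.size());
        for (String s : in) {
            out.add(s.toUpperCase());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("solr", "lucene");
        System.out.println(functional(input));
        System.out.println(plainLoop(input));
    }
}
```

Both methods produce the same result; the argument in the thread is purely about which form is easier to read and maintain, not about behavior.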

          David Smiley added a comment -

          Good point Rob. If any use of Guava in a patch to Solr core is going to get reverted, then we might as well recognize that now and move Guava from Solr's lib to clustering's lib directory.

          Robert Muir added a comment -

Well, it wasn't 'reverted', but in general I think if we can simplify code, we should.

NOTE: my complaints aren't with Dawid's code; it was easy to see what was going on (though I did check the Guava code to make sure it wasn't doing anything really magic behind the scenes, aka we weren't losing anything)... but again my concern is the more general case: using external libraries like this makes it harder for others to work on the code, and makes us more prone to bugs (e.g. now we have to worry about bugs in Guava too, and we have a hard enough time with JRE bugs).

If it's truly necessary, e.g. if Guava has a HuperDuper O(1)-for-all-operations data structure, then go for it and use it, but if it's just a thin veil around what Java already provides, I think we should try to avoid it.

          Dawid Weiss added a comment -

I don't consider changes to my code a personal attack, so no worries. And, since I'm a former assembly guy, straightforward code flow is what I understand and relate to – I use Guava from time to time to catch up with the young and hippie buzzwords (closures, abstraction, you know the like).

And seriously, 99% of our Guava use is for instantiating lists and maps without providing generics (something the new Java release will be able to infer from the code anyway – at least if I'm reading the commit logs to the javac compiler right). The remaining 1% is for cascading list filters and sorting orders (which, after you get used to them a little, read pretty nicely).

I'm by no means saying we should switch to Guava; I used it because I saw it was in the global lib/ directory (and this happened after the Carrot2 patch, I believe).


  People

    • Assignee: Grant Ingersoll
    • Reporter: Grant Ingersoll
    • Votes: 0
    • Watchers: 1