SOLR-1692

CarrotClusteringEngine produce summary does nothing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1
    • Component/s: contrib - Clustering
    • Labels:
      None

      Description

      In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored.

      1. SOLR-1692.patch
        8 kB
        Grant Ingersoll

        Activity

        Grant Ingersoll added a comment -

        The relevant lines are:

        String snippet = getValue(doc, snippetField);
        if (produceSummary == true) {
          docsHolder[0] = id.intValue();
          DocList docAsList = new DocSlice(0, 1, docsHolder, scores, 1, 1.0f);
          highligher.doHighlighting(docAsList, theQuery, req, snippetFieldAry);
        }
        

        It seems like we do the highlighting but then don't use the result. If I recall, we should use the result to then set the snippet value.
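
        As a rough illustration of the intended behaviour (not the contents of SOLR-1692.patch), the sketch below captures the highlighter's output and assigns it to the snippet. SummaryHighlighter, its nested-map return shape, and resolveSnippet are hypothetical stand-ins invented for this example; the real return type of doHighlighting is not shown in this thread.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the highlighter call; this is NOT Solr's real API.
interface SummaryHighlighter {
    // Assumed shape: document key -> (field name -> highlighted fragments).
    // Keys may be missing when nothing could be highlighted.
    Map<String, Map<String, List<String>>> highlight(String docKey, String field);
}

final class SnippetFromHighlights {
    // Returns the text that should be handed to clustering for one document.
    static String resolveSnippet(String docKey, String field, String storedValue,
                                 boolean produceSummary, SummaryHighlighter highlighter) {
        String snippet = storedValue; // default: the stored field value
        if (produceSummary) {
            // The step the quoted code skips: keep the result instead of dropping it.
            Map<String, Map<String, List<String>>> result = highlighter.highlight(docKey, field);
            Map<String, List<String>> perDoc = (result == null) ? null : result.get(docKey);
            List<String> fragments = (perDoc == null) ? null : perDoc.get(field);
            if (fragments != null && !fragments.isEmpty()) {
                // Join all fragments into one summary string (separator is an arbitrary choice).
                snippet = String.join(" ... ", fragments);
            }
        }
        return snippet;
    }
}
```

        When the highlighter returns nothing for a document, the stored value is left untouched, so enabling produceSummary never yields an empty snippet in this sketch.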

        Stanislaw Osinski added a comment -

        I've had a quick look into this issue and have two questions to consider:

        • Where should the configuration of the highlighter we use for clustering come from? Should it be the same as for the regular Solr highlighting or should we allow a clustering-specific configuration? My intuition is that we should go with the former. Otherwise, we may lose the clear relationship between cluster labels and documents on the output, because the clusters will be generated based on a text that is different from what the user is going to see.
        • What should we do if the highlighter is not able to generate a summary? One option is to use the full contents of the field. Alternatively, we could use the first N (configurable) characters of the field. The answer to this really depends on the characteristics of the data we may get. If the total number of documents fed to Carrot2 doesn't exceed about 1000, longer documents shouldn't be too much of a problem, so I'd suggest the former option (use full field text).
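
        The two fallback options in the second question can be sketched as follows. This is only an illustration of the choice being discussed; the class, the Policy enum, and the maxChars parameter are hypothetical names, not existing Solr or Carrot2 settings.

```java
final class SummaryFallback {
    // Hypothetical policy switch for the two options under discussion.
    enum Policy { FULL_FIELD, FIRST_N_CHARS }

    // Text to cluster on when the highlighter produced no summary.
    static String fallback(String fieldText, Policy policy, int maxChars) {
        if (fieldText == null) {
            return "";
        }
        if (policy == Policy.FIRST_N_CHARS && maxChars >= 0 && fieldText.length() > maxChars) {
            // Counts UTF-16 chars, so a surrogate pair at the cut point could be split.
            return fieldText.substring(0, maxChars);
        }
        // FULL_FIELD, or the text is already short enough.
        return fieldText;
    }
}
```

        For example, fallback("clustering search results", Policy.FIRST_N_CHARS, 10) keeps only the first ten characters, while FULL_FIELD returns the text unchanged.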
        Grant Ingersoll added a comment -

        Where should the configuration of the highlighter we use for clustering come from?

        We have all the code hooked in for it already, we're just ignoring the output.

        What should we do if the highlighter is not able to generate a summary?

        I think we can default to the full contents, which is what would be used if you don't specify produceSummary. We can handle the char thing separately, I suppose.

        It would be great if Carrot2 could also just use the analysis that Lucene/Solr produces; that way it would be much easier to configure stopwords, HTML stripping, etc.

        Grant Ingersoll added a comment -

        Fixes the bug, adds new parameter to specify the frag size when using the highlighter.

        Stanislaw Osinski added a comment -

        Where should the configuration of the highlighter we use for clustering come from?

        We have all the code hooked in for it already, we're just ignoring the output.

        To avoid confusion and questions along the lines of "why clusters don't match the (highlighted) documents I'm seeing", I'd suggest a slightly more elaborate scenario for the clustering highlighter configuration:

        1. If main Solr highlighting is disabled, use the clustering component's highlighter settings.
        2. If main Solr highlighting is enabled, use the main highlighter's configuration as the defaults and let the clustering-specific highlighter configuration override the defaults.

        If we do it this way, we'll minimize the chances of users accidentally performing clustering on documents different (differently highlighted) than those they will see.
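
        A minimal sketch of the two-rule precedence proposed above, assuming both configurations are available as plain string maps. The class name, method name, and the parameter keys used in the example are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ClusteringHighlightConfig {
    // Effective highlighter settings for clustering under the proposed rules.
    static Map<String, String> effective(boolean mainHighlightingEnabled,
                                         Map<String, String> mainParams,
                                         Map<String, String> clusteringParams) {
        Map<String, String> merged = new HashMap<>();
        if (mainHighlightingEnabled) {
            // Rule 2: the main highlighter's configuration supplies the defaults.
            merged.putAll(mainParams);
        }
        // Rule 1 and rule 2: clustering-specific settings always apply
        // and override any default with the same key.
        merged.putAll(clusteringParams);
        return merged;
    }
}
```

        With main settings {fragsize=100, snippets=3} and clustering settings {fragsize=200}, the result is {fragsize=200, snippets=3} when main highlighting is enabled and {fragsize=200} when it is disabled.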

        It would be great if Carrot2 could also just use the analysis that Lucene/Solr produces; that way it would be much easier to configure stopwords, HTML stripping, etc.

        This one would require some larger changes to Carrot2 internals. We do use Lucene infrastructure for preprocessing (currently for tokenization), but I can investigate if we can extend that further. A potential problem here is that very often the set of stopwords you use for document retrieval may not work equally well for clustering. I've filed a Carrot2-specific issue for it and will try to come up with something.

        Hoss Man added a comment -

        Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

        A unique token for finding these 240 issues in the future: hossversioncleanup20100527

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Dawid Weiss added a comment -

        Grant, what remains to be done with this issue? Can I help?

        Stanislaw Osinski added a comment -

        Looking at the code, the issue is resolved: summaries (from the highlighter) are used for clustering when configured. I see there's no unit test for the feature though, so I can write one and resolve the issue.

        Stanislaw Osinski added a comment -

        This issue was really fixed for 3.1.0 and documented in CHANGES under that release. It doesn't make sense to complicate things further as I suggested in the discussion above, so resolving.

        Uwe Schindler added a comment -

        Bulk close after release of 3.1


          People

          • Assignee:
            Stanislaw Osinski
            Reporter:
            Grant Ingersoll