Solr / SOLR-2917

Support for field-specific tokenizers, token- and character filters in search results clustering

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib - Clustering
    • Labels:
      None

      Description

      Currently, the Carrot2 search results clustering component creates clusters based on the raw text of a field. The reason is that Carrot2 aims to create meaningful cluster labels by using sequences of words taken directly from the documents' text, including stop words: "Development of Lucene and Solr" is more readable than "Development Lucene Solr". The easiest way to provide input for such a process was to feed Carrot2 the raw (stored) document content.

      It is, however, possible to take into account some of the field's filters during clustering. Because Carrot2 does not currently expose an API for feeding pre-tokenized input, the clustering component would need to:

      1. get raw text of the field,
      2. run it through the field's char filters, tokenizers and selected token filters (omitting e.g. the stop word filter and stemmers, because Carrot2 needs the original words to produce readable cluster labels),
      3. glue the output back into a string and feed it to Carrot2 for clustering.

      In the future, to eliminate step 3, we could modify Carrot2 to accept pre-tokenized content.
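
      The three steps above can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (ClusteringInputSketch, stripMarkup, tokenize, stripPossessive, prepareForClustering) are hypothetical stand-ins and are not part of the Solr, Lucene or Carrot2 APIs. A real implementation would take the char filters, tokenizer and token filters from the field's analyzer chain in the schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names are Solr, Lucene or Carrot2 API.
final class ClusteringInputSketch {

    // Step 2a: stand-in for a char filter; replaces HTML-like tags with spaces.
    static String stripMarkup(String raw) {
        return raw.replaceAll("<[^>]*>", " ");
    }

    // Step 2b: stand-in for a tokenizer; splits on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 2c: example of a "selected" token filter; drops a trailing possessive 's.
    // Stop word removal and stemming are deliberately NOT applied.
    static String stripPossessive(String token) {
        return token.endsWith("'s") ? token.substring(0, token.length() - 2) : token;
    }

    // Steps 1-3: take the raw field text, analyze it, and glue the tokens back together.
    static String prepareForClustering(String rawFieldText,
                                       List<UnaryOperator<String>> selectedFilters) {
        List<String> kept = new ArrayList<>();
        for (String token : tokenize(stripMarkup(rawFieldText))) {
            for (UnaryOperator<String> filter : selectedFilters) {
                token = filter.apply(token);
            }
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        // Step 3: the string that would be handed to the clustering algorithm.
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters = List.of(ClusteringInputSketch::stripPossessive);
        String raw = "<p>Development of <b>Lucene</b> and Solr</p>";
        System.out.println(prepareForClustering(raw, filters));
    }
}
```

      Running main prints "Development of Lucene and Solr": the markup is gone, but the stop words "of" and "and" survive, which is what keeps the cluster labels readable.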

        Activity

        Hoss Man added a comment -

        There is no indication that anyone is actively working on this issue, and it has no current patch, so removing 4.0 from the fixVersion.

        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Hoss Man added a comment -

        Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

        Stanislaw Osinski added a comment -

        That was the idea behind the suggestion. The Highlighter works a little differently, so it does not need this: it uses the TermVectors only to find the highlighting offsets, but marks the highlights in the original text (from a stored field). This just avoids re-running the analysis, which can be expensive if you use, e.g., BASIS or some other heavy analysis.

        Yeah, it's a bit different indeed because clustering would need the original text of the tokens instead of just the start offset and length. Ultimately, the choice between storing two different token streams and doing the analysis at runtime is a trade-off between storage size (doubled?) and slower runtime performance. Once we get Carrot2 to support pre-tokenized input (not hard conceptually, but tricky in terms of the API), both solutions would be possible.

        Uwe Schindler added a comment -

        On the other hand, the schema could define a parallel field with certain filters disabled; clustering should work nicely with such a stream.

        That was the idea behind the suggestion. The Highlighter works a little differently, so it does not need this: it uses the TermVectors only to find the highlighting offsets, but marks the highlights in the original text (from a stored field). This just avoids re-running the analysis, which can be expensive if you use, e.g., BASIS or some other heavy analysis.

        Dawid Weiss added a comment -

        Step 3 is necessary because we have a different tokenization pipeline in C2... but it would be a step forward to more compact integration for sure.

        Stanislaw Osinski added a comment -

        Would a typical TVTokenStream contain stop words, original (unstemmed) forms and sentence separators? If not, the human-readability of cluster labels would suffer quite a bit. On the other hand, the schema could define a parallel field with certain filters disabled; clustering should work nicely with such a stream. Is there any other solution to this?

        Uwe Schindler added a comment -

        By eliminating step 3, Carrot2 could also be fed from term vectors, using the Highlighter's crazy TVTokenStream?


          People

          • Assignee:
            Stanislaw Osinski
            Reporter:
            Stanislaw Osinski
          • Votes:
            0
            Watchers:
            0
