SOLR-2917

Support for field-specific tokenizers, token- and character filters in search results clustering

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib - Clustering
    • Labels:
      None

      Description

      Currently, the Carrot2 search results clustering component creates clusters based on the raw text of a field. The reason is that Carrot2 aims to create meaningful cluster labels by using sequences of words taken directly from the documents' text, including stop words: "Development of Lucene and Solr" is more readable than "Development Lucene Solr". The easiest way to provide input for such a process was to feed Carrot2 the raw (stored) document content.

      It is, however, possible to take into account some of the field's filters during clustering. Because Carrot2 does not currently expose an API for feeding pre-tokenized input, the clustering component would need to:

      1. get raw text of the field,
      2. run it through the field's char filters, tokenizer and selected token filters (omitting e.g. the stop word filter and stemmers, because Carrot2 needs the original words to produce readable cluster labels),
      3. glue the output back into a string and feed it to Carrot2 for clustering.

      In the future, to eliminate step 3, we could modify Carrot2 to accept pre-tokenized content.
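
      The three steps listed above could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses plain Java stand-ins rather than the real Lucene/Solr analysis classes or any Carrot2 API, and every name in it (ClusteringInputSketch, FilterStep, safeForClustering, prepareForClustering) is hypothetical. Which filters count as safe for clustering is also an assumption; lowercasing is kept here purely as an example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class ClusteringInputSketch {

    /** Hypothetical stand-in for one token filter in a field's analysis chain. */
    static final class FilterStep {
        final String name;
        final boolean safeForClustering; // false for filters that would hurt label readability
        final UnaryOperator<List<String>> filter;

        FilterStep(String name, boolean safeForClustering, UnaryOperator<List<String>> filter) {
            this.name = name;
            this.safeForClustering = safeForClustering;
            this.filter = filter;
        }
    }

    static final Set<String> STOP_WORDS = Set.of("of", "and", "the", "is");

    // Assumed filter chain of the field: lowercase, stop words, naive stemmer.
    static final List<FilterStep> FIELD_FILTERS = List.of(
        new FilterStep("lowercase", true, tokens -> tokens.stream()
            .map(t -> t.toLowerCase(Locale.ROOT)).collect(Collectors.toList())),
        new FilterStep("stop", false, tokens -> tokens.stream()
            .filter(t -> !STOP_WORDS.contains(t)).collect(Collectors.toList())),
        new FilterStep("stem", false, tokens -> tokens.stream()
            .map(t -> t.endsWith("s") ? t.substring(0, t.length() - 1) : t)
            .collect(Collectors.toList())));

    /** Stand-in char filter: replaces HTML-like tags with a space. */
    public static String applyCharFilters(String raw) {
        return raw.replaceAll("<[^>]*>", " ");
    }

    /** Stand-in tokenizer: splits on whitespace and drops empty tokens. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Steps 1-3: take raw text, run the selected part of the chain, glue tokens back together. */
    public static String prepareForClustering(String rawFieldText) {
        List<String> tokens = tokenize(applyCharFilters(rawFieldText)); // step 2: char filters + tokenizer
        for (FilterStep step : FIELD_FILTERS) {
            if (step.safeForClustering) {                               // step 2: skip stop words and stemming
                tokens = step.filter.apply(tokens);
            }
        }
        return String.join(" ", tokens);                                // step 3: back to a single string
    }

    public static void main(String[] args) {
        // prints "development of lucene and solr"
        System.out.println(prepareForClustering("<b>Development</b> of Lucene and Solr"));
    }
}
```

      In this sketch the stop words "of" and "and" survive because the stop word filter is skipped, so the string handed to the clustering algorithm still reads naturally while markup is removed.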

        Activity

        Stanislaw Osinski created issue -
        Robert Muir made changes -
        Field Original Value New Value
        Fix Version/s 4.0 [ 12314992 ]
        Fix Version/s 3.6 [ 12319065 ]
        Hoss Man made changes -
        Fix Version/s 4.0 [ 12322455 ]
        Fix Version/s 4.0-ALPHA [ 12314992 ]
        Robert Muir made changes -
        Fix Version/s 4.0 [ 12322551 ]
        Fix Version/s 4.0-BETA [ 12322455 ]
        Hoss Man made changes -
        Fix Version/s 4.0 [ 12322551 ]

          People

          • Assignee:
            Stanislaw Osinski
            Reporter:
            Stanislaw Osinski
          • Votes:
            0
            Watchers:
            0