Lucene - Core
LUCENE-4286

Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA, 3.6.1
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Add an optional flag to the CJKBigramFilter that tells it to also output unigrams. This would allow indexing of both bigrams and unigrams; at query time, the analyzer could analyze queries as bigrams unless the query contained a single Han unigram.

      As an example, here is a Solr fieldType configuration in which the index analyzer has the "indexUnigrams" flag set and the query analyzer does not.

      <fieldType name="CJK" autoGeneratePhraseQueries="false">

      <analyzer type="index">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      </analyzer>

      <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true"/>
      </analyzer>
      </fieldType>

      Use case: About 10% of our queries that contain Han characters are single-character queries. The CJKBigramFilter only outputs single characters when there are no adjacent bigrammable characters in the input. This means we have to create a separate field to index Han unigrams in order to handle single-character queries, and then write application code to search that separate field whenever we detect a single-character Han query. This is rather kludgey. With the optional flag, we could configure Solr as above.

      This is somewhat analogous to the flags added in LUCENE-1370 for the ShingleFilter to allow single-word queries (although that uses word n-grams rather than character n-grams).
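The behavior the flag proposes can be sketched in plain Java. This is a toy illustration only, not the actual CJKBigramFilter implementation: for a run of bigrammable Han characters, the filter would emit every adjacent-pair bigram, plus each single character when unigrams are enabled.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (not Lucene code): what a Han run would produce with and
// without the proposed unigram flag. "A", "B", "C" stand in for three
// adjacent Han characters.
public class UnigramBigramSketch {
    static List<String> tokenize(String hanRun, boolean outputUnigrams) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < hanRun.length(); i++) {
            if (outputUnigrams) {
                tokens.add(hanRun.substring(i, i + 1)); // single character
            }
            if (i + 1 < hanRun.length()) {
                tokens.add(hanRun.substring(i, i + 2)); // adjacent-pair bigram
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ABC", true));  // [A, AB, B, BC, C]
        System.out.println(tokenize("ABC", false)); // [AB, BC]
    }
}
```

With the flag off, a single-character query like "A" finds nothing, because only "AB" and "BC" were indexed; with the flag on at index time, the unigram "A" is present as a term.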

      1. LUCENE-4286.patch
        17 kB
        Robert Muir
      2. LUCENE-4286.patch
        7 kB
        Robert Muir
      3. LUCENE-4286.patch_3.x
        20 kB
        Tom Burton-West

        Activity

        Robert Muir added a comment -

        First stab at a patch. I think it's OK, but it needs more tests just to be sure.

        Robert Muir added a comment -

        Updated patch with additional docs and tests.

        This is ready to commit.

        Lance Norskog added a comment -

        Is this a request by Han language readers?

        Tom Burton-West added a comment -

        We haven't had a request for this specific feature from readers; we are just assuming that the 10% of Han queries in our logs that consist of a single character represent real use cases, and we don't want such queries to produce zero results or misleading results.

        Tom

        Robert Muir added a comment -

        The combined unigram+bigram technique is a general technique, which I think is useful to support.

        For examples see:
        http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.6782
        http://members.unine.ch/jacques.savoy/Papers/NTCIR6.pdf

        There are more references and studies linked from those.

        Tom would have to do tests for his "index-time-only" approach: I can't speak for that.

        Lance Norskog added a comment -

        If you do unigrams and bigrams in separate fields, you can bias bigrams over unigrams. We did that with one customer and it really helped. Our text was technical and tended towards "long" words: lots of bigrams & trigrams. Have you tried the Smart Chinese toolkit? It produces far fewer bigrams. Our project worked well with it. I would try that, with misfires further broken into bigrams, over general bigramming. Cf. SOLR-3653 about the "misfires" part.

        In general we found Chinese-language search a really hard problem, and doubly so when nobody on the team speaks Chinese.

        Tom Burton-West added a comment -

        Thanks Robert for all your work on non-English searching and for your quick response on this issue.

        >>If you do unigrams and bigrams in separate fields, you can bias bigrams over unigrams.
        That was our original intention.

        >>The combined unigram+bigram technique is a general technique, which I think is useful to support. ...Tom would have to do tests for his "index-time-only" approach: I can't speak for that.

        Originally I was going to use the combined unigram+bigram technique (with a boost for the bigram fields) and wrote some custom code to implement it. However, I started thinking about the size of our documents. With one exception, all the literature I found that got better results with a combination of bigrams and unigrams used newswire-sized documents (somewhere in the range of a few hundred words). Our documents are several orders of magnitude larger (around 100,000 words).

        My understanding is that the main reason adding unigrams to bigrams increases relevance is that often the unigram will have a related meaning to the larger word. So using unigrams is somewhat analogous to decompounding or stemming. I haven't done any tests, but my guess is that with our very large documents the additional recall added by unigrams will be offset by a decrease in precision.

        After I get a test suite set up for relevance ranking in English, I'll take a look at testing CJK.

        Tom

        Robert Muir added a comment -

        >>My understanding is that the main reason adding unigrams to bigrams increases relevance is that often the unigram will have a related meaning to the larger word. So using unigrams is somewhat analogous to decompounding or stemming. I haven't done any tests, but my guess is that with our very large documents the additional recall added by unigrams will be offset by a decrease in precision.

        >>After I get a test suite set up for relevance ranking in English, I'll take a look at testing CJK.

        Well, for your use case (doing this only at indexing time), I think it should work well. The 10% of your queries that are single-character will get reasonable results instead of no results or garbage results. The rest of your queries are essentially unchanged: IDF remains the same for the bigrams, and document length will be roughly the same, since the "additional" bigrams do not count towards length normalization by default (they are synonyms stacked over the unigrams).

        But you will pay an indexing performance/disk penalty to some extent since you are indexing more tokens for these documents.

        However, if you decide to also turn on unigrams at query time, this might be prohibitively expensive: you would have to test. If you do that, I don't think you need any special boosting: I'd first just let IDF take care of it.
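Robert's length-normalization point can be made concrete with a toy model (not Lucene code; the Token record and fieldLength helper here are invented for illustration): if the extra bigrams are stacked at position increment 0 over the unigrams, the default similarity's overlap discounting means they do not add to the field length.

```java
import java.util.List;

// Toy model of overlap-discounted length normalization: tokens with
// position increment 0 are "stacked" synonyms and are not counted,
// so field length is the number of tokens with increment >= 1.
public class LengthNormSketch {
    record Token(String text, int positionIncrement) {}

    static int fieldLength(List<Token> tokens) {
        return (int) tokens.stream()
                           .filter(t -> t.positionIncrement() >= 1)
                           .count();
    }

    public static void main(String[] args) {
        // Unigrams advance the position; extra bigrams sit on top of them.
        List<Token> withUnigrams = List.of(
            new Token("A", 1), new Token("AB", 0),
            new Token("B", 1), new Token("BC", 0),
            new Token("C", 1));
        List<Token> bigramsOnly = List.of(
            new Token("AB", 1), new Token("BC", 1));
        System.out.println(fieldLength(withUnigrams)); // 3
        System.out.println(fieldLength(bigramsOnly));  // 2
    }
}
```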

        Tom Burton-West added a comment -

        We are still using Solr 3.6 in production, so I backported the patch to Lucene/Solr 3.6. Attached as LUCENE-4286.patch_3.x.

        Shawn Heisey added a comment -

        I have just tried indexUnigrams="true" on branch_4x (checked out 2012/11/28) and it doesn't seem to be working. The analysis page (indexing) shows the bigrams but no unigrams. Am I doing something wrong?

        my fieldType:

            <fieldType name="genText" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">
              <analyzer type="index">
                <tokenizer class="solr.ICUTokenizerFactory"/>
                <filter class="solr.PatternReplaceFilterFactory"
                  pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
                  replacement="$2"
                  allowempty="false"
                />
                <filter class="solr.WordDelimiterFilterFactory"
                  splitOnCaseChange="1"
                  splitOnNumerics="1"
                  stemEnglishPossessive="1"
                  generateWordParts="1"
                  generateNumberParts="1"
                  catenateWords="1"
                  catenateNumbers="1"
                  catenateAll="0"
                  preserveOriginal="1"
                />
                <filter class="solr.ICUFoldingFilterFactory"/>
                <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true"/>
                <filter class="solr.LengthFilterFactory" min="1" max="512"/>
              </analyzer>
              <analyzer type="query">
                <tokenizer class="solr.ICUTokenizerFactory"/>
                <filter class="solr.PatternReplaceFilterFactory"
                  pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
                  replacement="$2"
                  allowempty="false"
                />
                <filter class="solr.WordDelimiterFilterFactory"
                  splitOnCaseChange="1"
                  splitOnNumerics="1"
                  stemEnglishPossessive="1"
                  generateWordParts="1"
                  generateNumberParts="1"
                  catenateWords="0"
                  catenateNumbers="0"
                  catenateAll="0"
                  preserveOriginal="1"
                />
                <filter class="solr.ICUFoldingFilterFactory"/>
                <filter class="solr.CJKBigramFilterFactory" indexUnigrams="false"/>
                <filter class="solr.LengthFilterFactory" min="1" max="512"/>
              </analyzer>
            </fieldType>
        
        Robert Muir added a comment -

        There is no such option as "indexUnigrams".

        I think you mean outputUnigrams. See the documentation at http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
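Applying that correction to Shawn's configuration, the filter line would read (attribute names per the factory documentation linked above):

```xml
<filter class="solr.CJKBigramFilterFactory" outputUnigrams="true" han="true"/>
```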


          People

          • Assignee: Unassigned
          • Reporter: Tom Burton-West
          • Votes: 0
          • Watchers: 3
