  1. Lucene - Core
  2. LUCENE-9123

JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (9.0), 8.5
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      JapaneseTokenizer with `mode=search` or `mode=extended` works with neither SynonymGraphFilter nor SynonymFilter when the tokenizer emits multiple tokens for a single term. With `mode=normal` it works fine. However, we would like to use the decomposed tokens, which maximize the chance of increasing recall.

      Snippet of schema:

          <fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
            <analyzer>
              <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
              <filter class="solr.SynonymGraphFilterFactory"
                          synonyms="lang/synonyms_ja.txt"
                          tokenizerFactory="solr.JapaneseTokenizerFactory"/>
      
              <filter class="solr.JapaneseBaseFormFilterFactory"/>
              <!-- Removes tokens with certain part-of-speech tags -->
              <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
              <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
              <filter class="solr.CJKWidthFilterFactory"/>
              <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
              <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
              <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
              <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
              <!-- Lower-cases romaji characters -->
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
          </fieldType>
      

      A synonym entry that triggers the error:

      株式会社,コーポレーション
      

      The following is the console output:

      $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
      
      ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
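
      A possible configuration to try on versions that include this fix (8.5 and later), offered as a sketch rather than a confirmed recipe. To my understanding, the fix adds a `discardCompoundToken` option to JapaneseTokenizerFactory that drops the original compound token (here 株式会社) and keeps only the decomposed parts, so that no token is emitted with position increment 0. The `tokenizerFactory.` prefix passes parameters to the tokenizer that SynonymGraphFilterFactory uses to parse the synonyms file. The option name and its effect are assumptions; verify them against the JapaneseTokenizerFactory documentation for your version.

          <analyzer>
            <!-- Assumption: discardCompoundToken is supported by this version (8.5+) -->
            <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" discardCompoundToken="true"/>
            <!-- Use the same tokenizer settings when parsing the synonym entries -->
            <filter class="solr.SynonymGraphFilterFactory"
                    synonyms="lang/synonyms_ja.txt"
                    tokenizerFactory="solr.JapaneseTokenizerFactory"
                    tokenizerFactory.mode="search"
                    tokenizerFactory.discardCompoundToken="true"/>
            <!-- Remaining filters unchanged from the schema above -->
          </analyzer>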
      

        Attachments

        1. LUCENE-9123_8x.patch
          13 kB
          Kazuaki Hiraga
        2. LUCENE-9123.patch
          25 kB
          Kazuaki Hiraga

              People

              • Assignee:
                Tomoko Uchida (tomoko)
              • Reporter:
                Kazuaki Hiraga (h.kazuaki)
              • Votes:
                0
              • Watchers:
                9