[LUCENE-9123] JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0, 8.5
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

JapaneseTokenizer with `mode=search` or `mode=extended` does not work with either SynonymGraphFilter or SynonymFilter when the tokenizer generates multiple tokens as output for a single term. `mode=normal` works, but we would like to use the decomposed tokens, which give the best chance of increasing recall.

Snippet of schema:

    <fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.SynonymGraphFilterFactory"
                    synonyms="lang/synonyms_ja.txt"
                    tokenizerFactory="solr.JapaneseTokenizerFactory"/>

        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <!-- Removes tokens with certain part-of-speech tags -->
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
        <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
        <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
        <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <!-- Lower-cases romaji characters -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

A synonym entry that triggers the error:

    株式会社,コーポレーション

Console output when creating the core:

$ ./bin/solr create_core -c jp_test -d ../config/solrconfs

ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
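The error message suggests that the synonym parser requires every token of an analyzed synonym term to advance the position by exactly 1. The following is a minimal, self-contained sketch of that rule, not Lucene's actual code: the `SynonymTermCheck` class, the `Token` type, and the `requireLinearChain` method are hypothetical names, and the token sequences are assumptions about what the tokenizer emits for 株式会社 (decompounded parts, with the original compound stacked at position increment 0 in search mode).

```java
import java.util.Arrays;
import java.util.List;

public class SynonymTermCheck {

    /** Hypothetical stand-in for one analyzed token: its text and position increment. */
    public static final class Token {
        final String term;
        final int positionIncrement;

        public Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Models the rule implied by the reported error: each token produced for one
     * side of a synonym rule must have a position increment of exactly 1.
     */
    public static void requireLinearChain(String text, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException(
                    "term: " + text + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: "
                        + t.positionIncrement + ")");
            }
        }
    }

    public static void main(String[] args) {
        // Assumed search-mode output: the compound is kept alongside its parts,
        // stacked on the first part with position increment 0.
        List<Token> searchMode = Arrays.asList(
            new Token("株式", 1),
            new Token("株式会社", 0),
            new Token("会社", 1));

        // Assumed output when only the decompounded parts are emitted.
        List<Token> partsOnly = Arrays.asList(
            new Token("株式", 1),
            new Token("会社", 1));

        requireLinearChain("株式会社", partsOnly); // accepted: a simple chain

        try {
            requireLinearChain("株式会社", searchMode);
        } catch (IllegalArgumentException e) {
            // Rejected with a message of the same shape as the one in the report.
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, the stacked compound token is what the check rejects, while a stream containing only the decompounded parts passes.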

Attachments

- LUCENE-9123.patch, 25 kB, 22/Jan/20 03:44, Kazuaki Hiraga
- LUCENE-9123_8x.patch, 13 kB, 21/Jan/20 15:52, Kazuaki Hiraga

Issue Links

is related to

- LUCENE-9173 SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph) (Open)
- SOLR-14295 Add the parameter description about "discardCompoundToken" for JapaneseTokenizer (Closed)

People

Assignee: Tomoko Uchida
Reporter: Kazuaki Hiraga
Votes: 0
Watchers: 9

Dates

Created: 09/Jan/20 06:41
Updated: 28/Aug/22 15:55
Resolved: 01/Feb/20 06:30