[LUCENE-9173] SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph) - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

This issue is derived from LUCENE-9123 (resolved).

When the tokenizer given to SynonymGraphFilter decompounds tokens or emits multiple tokens at the same position, SynonymGraphFilter cannot handle them correctly (an exception is thrown).

For example, JapaneseTokenizer (mode=SEARCH) emits the compound token and two decompounded tokens for the text "株式会社":

株式会社 (positionIncrement=0, positionLength=2)
株式 (positionIncrement=1, positionLength=1)
会社 (positionIncrement=1, positionLength=1)
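
As a rough illustration of why these attributes describe a branched graph, the sketch below accumulates positionIncrement into a start position and adds positionLength to get each token's end position. `Tok` and `arcs` are hypothetical names, not Lucene classes. The emission order used in `main` (株式 first, then 株式会社 with positionIncrement=0) is an assumption, chosen so that the compound shares its start position with its first part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of a token stream; none of these types are Lucene classes.
class TokenGraphSketch {

    // Hypothetical stand-in for a token with its position attributes.
    static final class Tok {
        final String term;
        final int posInc; // positionIncrement
        final int posLen; // positionLength

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Turns each token into an arc "term: from -> to" in the token graph.
    static List<String> arcs(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // the first increment of 1 moves this to position 0
        for (Tok t : tokens) {
            pos += t.posInc;                 // start node of this token
            int end = pos + t.posLen;        // end node of this token
            out.add(t.term + ": " + pos + " -> " + end);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed emission order; attribute values are the ones listed above.
        List<Tok> tokens = Arrays.asList(
            new Tok("株式", 1, 1),
            new Tok("株式会社", 0, 2),
            new Tok("会社", 1, 1));
        for (String arc : arcs(tokens)) {
            System.out.println(arc);
        }
    }
}
```

Under these assumptions, two arcs leave position 0 (株式 and 株式会社), so a consumer that expects exactly one token per position sees a branch.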

If we then register the synonym rule "株式会社,コーポレーション" through SynonymGraphFilterFactory (with tokenizerFactory=JapaneseTokenizerFactory), the following exception is thrown:

Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
	at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
	at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
	at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
	at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
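
The message suggests that the synonym parser accepts only a linear chain of tokens for each synonym term. The following is a minimal model of such a check, not the Lucene source: `Tok` and `countLinearTokens` are hypothetical names, and only the message format is taken from the trace above.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of a "linear tokens only" check; not Lucene code.
class PosIncCheckSketch {

    // Hypothetical stand-in for an analyzed token.
    static final class Tok {
        final String term;
        final int posInc; // positionIncrement

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    // Counts the tokens of one synonym term, rejecting any token whose
    // position increment is not 1 (stacked tokens or position gaps).
    static int countLinearTokens(String text, List<Tok> analyzed) {
        int count = 0;
        for (Tok t : analyzed) {
            if (t.posInc != 1) {
                throw new IllegalArgumentException(
                    "term: " + text + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: " + t.posInc + ")");
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed analysis result for the term: the compound is stacked on
        // the first decompounded part with positionIncrement=0.
        List<Tok> analyzed = Arrays.asList(
            new Tok("株式", 1),
            new Tok("株式会社", 0),
            new Tok("会社", 1));
        try {
            countLinearTokens("株式会社", analyzed);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the first token passes and the stacked compound token is the one that triggers the exception, which matches the term and increment reported in the trace.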

This is not limited to JapaneseTokenizer; it is a more general issue with handling a branched token graph (decompounded tokens in the middle of the stream).

Issue Links

relates to: LUCENE-9123 JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter (Resolved)

People

Assignee: Unassigned

Reporter: Tomoko Uchida

Votes: 0

Watchers: 1

Dates

Created: 26/Jan/20 09:59

Updated: 15/Sep/24 22:23