Lucene - Core / LUCENE-9173

SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a derived issue from LUCENE-9123.

      When the tokenizer given to SynonymGraphFilter decompounds tokens or emits multiple tokens at the same position, SynonymGraphFilter cannot handle them correctly (an exception is thrown).

      For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two decompounded tokens for the text "株式会社":

      株式会社 (positionIncrement=0, positionLength=2)
      株式 (positionIncrement=1, positionLength=1)
      会社 (positionIncrement=1, positionLength=1)
      

      If we then supply the synonym rule "株式会社,コーポレーション" to SynonymGraphFilterFactory (with tokenizerFactory=JapaneseTokenizerFactory), the following exception is thrown:

      Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
      	at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
      	at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
      	at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
      	at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
      	at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
      

      This is not limited to JapaneseTokenizer; it is a more general issue with handling branched token graphs (decompounded tokens in the middle of the stream).
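
      To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of check that produces the message above. It is not Lucene code: the class PosIncCheckSketch, the Token holder, and the method analyzeRuleTerm are hypothetical names, and the logic is a simplified model inferred from the exception text (each token analyzed from a synonym rule term must have positionIncrement == 1). The token values are taken from the listing in this description, in the order listed there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosIncCheckSketch {

    /** Hypothetical stand-in for a token's term text and position attributes. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final int positionLength;

        public Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /**
     * Simplified model of the validation: a rule term is accepted only if every
     * token it analyzes to advances the position by exactly one, i.e. the
     * analyzed output is a linear chain rather than a branched graph.
     */
    public static List<String> analyzeRuleTerm(String ruleTerm, List<Token> tokens) {
        List<String> words = new ArrayList<>();
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException(
                    "term: " + ruleTerm + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: " + t.positionIncrement + ")");
            }
            words.add(t.term);
        }
        return words;
    }

    public static void main(String[] args) {
        // Branched output for "株式会社", using the values listed in the description.
        List<Token> branched = Arrays.asList(
            new Token("株式会社", 0, 2),
            new Token("株式", 1, 1),
            new Token("会社", 1, 1));
        try {
            analyzeRuleTerm("株式会社", branched);
        } catch (IllegalArgumentException e) {
            // The compound token shares a position with a decompounded part, so it is rejected.
            System.out.println(e.getMessage());
        }

        // A term that analyzes to a single token at a new position passes the check.
        List<Token> linear = Arrays.asList(new Token("コーポレーション", 1, 1));
        System.out.println(analyzeRuleTerm("コーポレーション", linear));
    }
}
```

      The sketch illustrates why any tokenizer that stacks a compound token on top of its parts trips the check, regardless of language: the rejection depends only on the position increment, not on the term text.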


              People

              • Assignee:
                Unassigned
              • Reporter:
                Tomoko Uchida (tomoko)