  1. Lucene - Core
  2. LUCENE-9088

JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      According to the JapaneseNumberFilter javadocs, the filter uses the attribute values of the last token used to compose the normalized number, which can be wrong. While this is documented, it leads to a number of incompatibilities with other Japanese token filters.

      For example, for an input text of "2008 2009", taking the PartOfSpeechAttribute from the last token used leads to the following output (some attributes left out):

      ```

      { "token" : "2008", "start_offset" : 0, "end_offset" : 4, "type" : "word", [...] "partOfSpeech" : "記号-空白", "partOfSpeech (en)" : "symbol-space" [...] },
      { "token" : " ", "start_offset" : 4, "end_offset" : 5, "type" : "word", [...] "partOfSpeech" : "記号-空白", "partOfSpeech (en)" : "symbol-space", [...] },
      { "token" : "2009", "start_offset" : 5, "end_offset" : 9, "type" : "word", ... "partOfSpeech" : "名詞-数", "partOfSpeech (en)" : "noun-numeric", }

      ```

      As a result, a subsequent `kuromoji_part_of_speech` filter, for example, will eliminate the "2008" token because it is erroneously tagged as "symbol-space".
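
To make the consequence concrete, here is a minimal, self-contained sketch of what a part-of-speech stop filter does to the three tokens shown above. It is a plain-Java model, not Lucene or Elasticsearch code: the `Token` class, `termsAfterStopTags`, and `reportedTokens` are hypothetical names introduced only for illustration, and the English tag names are taken from the "partOfSpeech (en)" values in the output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PosStopFilterSketch {

    // Hypothetical stand-in for an analyzed token: just a term and its POS tag.
    static final class Token {
        final String term;
        final String partOfSpeech;

        Token(String term, String partOfSpeech) {
            this.term = term;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Keeps only the terms whose POS tag is not listed in stopTags.
    static List<String> termsAfterStopTags(List<Token> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (!stopTags.contains(token.partOfSpeech)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    // The three tokens from the reported output, with the tags as reported.
    static List<Token> reportedTokens() {
        return List.of(
            new Token("2008", "symbol-space"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));
    }

    public static void main(String[] args) {
        // Stopping "symbol-space" is meant to drop the whitespace token only.
        System.out.println(termsAfterStopTags(reportedTokens(), Set.of("symbol-space")));
        // prints [2009]
    }
}
```

Only "2009" survives: the stop tag intended for the whitespace token also removes "2008", because the two carry the same tag.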

      Even without fixing the other token attributes, the POS attributes should IMHO be set to "noun-numeric", since that's what the filter is supposed to detect.
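
The effect of the proposed change can be sketched in the same simplified way. This is a plain-Java model under stated assumptions, not a patch for JapaneseNumberFilter: `Token`, `retagNumbers`, and `looksLikeNormalizedNumber` are hypothetical, and the "term consists only of ASCII digits" check merely stands in for "this token was produced by the number filter".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class NumberPosRetagSketch {

    static final String NOUN_NUMERIC = "noun-numeric";

    // Hypothetical stand-in for an analyzed token: just a term and its POS tag.
    static final class Token {
        final String term;
        final String partOfSpeech;

        Token(String term, String partOfSpeech) {
            this.term = term;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Assumption: an all-digit term stands in for a token the number filter normalized.
    static boolean looksLikeNormalizedNumber(String term) {
        return !term.isEmpty() && term.chars().allMatch(c -> c >= '0' && c <= '9');
    }

    // Models the proposal: normalized numbers are always tagged "noun-numeric".
    static List<Token> retagNumbers(List<Token> tokens) {
        List<Token> retagged = new ArrayList<>();
        for (Token token : tokens) {
            retagged.add(looksLikeNormalizedNumber(token.term)
                ? new Token(token.term, NOUN_NUMERIC)
                : token);
        }
        return retagged;
    }

    // Keeps only the terms whose POS tag is not listed in stopTags.
    static List<String> termsAfterStopTags(List<Token> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (!stopTags.contains(token.partOfSpeech)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Tokens as reported in this issue; "2008" starts with the wrong tag.
        List<Token> reported = List.of(
            new Token("2008", "symbol-space"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));
        System.out.println(termsAfterStopTags(retagNumbers(reported), Set.of("symbol-space")));
        // prints [2008, 2009]
    }
}
```

With the number tokens retagged, the same stop tag removes only the whitespace token and both numbers are kept.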

              People

              • Assignee:
                Unassigned
                Reporter:
                cbuescher Christoph Büscher

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m