  1. Lucene - Core
  2. LUCENE-10290

analysis-stempel incorrect tokens generation for numbers

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 8.7
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Environment:
      Elasticsearch version: 7.11.2
      Plugins installed: [analysis-stempel]
      OS version: CentOS
    • Lucene Fields: New

    Description

      Actual:
      The stemmer modifies some numeric tokens, which leads to wrong search results.
      For example, "2021" becomes "20ć".

      Expected:
      Numeric strings should pass through the stemmer unchanged.

      Steps to reproduce:

      The issue can be reproduced with Elasticsearch:

      request:

      POST _analyze
      {
        "tokenizer": "standard",
        "filter": ["polish_stem"],
        "text": "2021"
      }
      

      response:

      {
        "tokens": [
          {
            "token": "20ć",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<NUM>",
            "position": 0
          }
        ]
      }
      

      I suspect newer versions are also affected, but I have not been able to verify this.
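
      To make it easier to check other versions, here is a minimal Python sketch that flags numeric tokens altered by the filter. The helper names (build_analyze_request, altered_numeric_tokens) are hypothetical and not part of Elasticsearch or Lucene. The sketch only inspects an _analyze response that has already been obtained and assumes the offsets index directly into the input text, which holds for plain digit strings.

```python
import json


def build_analyze_request(text):
    """Build a body for POST _analyze, mirroring the request in this report."""
    return {"tokenizer": "standard", "filter": ["polish_stem"], "text": text}


def altered_numeric_tokens(text, response):
    """Return (original, emitted) pairs for <NUM> tokens that the filter changed.

    Assumption: start_offset/end_offset can be used as Python string indices.
    """
    altered = []
    for tok in response.get("tokens", []):
        original = text[tok["start_offset"]:tok["end_offset"]]
        if tok.get("type") == "<NUM>" and tok["token"] != original:
            altered.append((original, tok["token"]))
    return altered


# Response copied from this report.
reported = json.loads("""
{
  "tokens": [
    {
      "token": "20ć",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
""")

print(altered_numeric_tokens("2021", reported))  # [('2021', '20ć')]
```

      An empty list would mean that no numeric token was changed for the given input.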

          People

            Assignee: Unassigned
            Reporter: Dominik (domsew)
            Votes: 0
            Watchers: 1