Lucene.Net / LUCENENET-354

The StandardAnalyzer tokenizer doesn't split on all delimiters when numbers are present in the original string

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Lucene.Net 2.9.1

    Description

      The StandardAnalyzer tokenizer doesn't split on every underscore when a number is present in the original string.

      I think there is a bug in the tokenizer in Lucene 2.9.1, and it was probably present in earlier versions as well. When indexing "BB_HHH_FFFF5_SSSS", which contains a number, the following tokens are returned:

      "bb hhh_ffff5_ssss"

      After some testing, I've found that this is because of the number. If I input

      "BB_HHH_FFFF_SSSS", I get

      "bb hhh ffff ssss"

      At this point, I'm leaning towards a tokenizer bug, unless the presence of a number is supposed to cause this behavior, but I fail to see why it would.
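
For anyone trying to reason about the two results above, here is a rough Python sketch of one possible explanation. It assumes the StandardTokenizer grammar in this version contains a "NUM" rule (intended for serial numbers, model numbers, IP addresses and similar) that joins alphanumeric segments across "_", "-", "/", "." or "," as long as every other segment contains a digit. This is a simplified ASCII-only model written for illustration, not the Lucene.Net code; the names `tokenize`, `NUM_ALTERNATIVES` and `RULES` are hypothetical, and the other grammar rules (e-mail, host, acronym, stop words, and so on) are left out.

```python
import re

# Simplified character classes (ASCII only; the real grammar is Unicode-aware).
ALPHANUM = r"[A-Za-z0-9]+"
HAS_DIGIT = r"[A-Za-z]*[0-9][A-Za-z0-9]*"  # an alphanumeric run with at least one digit
P = r"[_\-/.,]"                            # punctuation that may join segments

# Alternatives of the ASSUMED NUM rule: segments joined by P, where every
# other segment must contain a digit.
NUM_ALTERNATIVES = [
    f"{ALPHANUM}{P}{HAS_DIGIT}",
    f"{HAS_DIGIT}{P}{ALPHANUM}",
    f"{ALPHANUM}(?:{P}{HAS_DIGIT}{P}{ALPHANUM})+",
    f"{HAS_DIGIT}(?:{P}{ALPHANUM}{P}{HAS_DIGIT})+",
    f"{ALPHANUM}{P}{HAS_DIGIT}(?:{P}{ALPHANUM}{P}{HAS_DIGIT})+",
    f"{HAS_DIGIT}{P}{ALPHANUM}(?:{P}{HAS_DIGIT}{P}{ALPHANUM})+",
]

# NUM alternatives plus the plain-word rule.
RULES = [re.compile(p) for p in NUM_ALTERNATIVES + [ALPHANUM]]


def tokenize(text):
    """Longest-match scan over RULES, lowercasing each token."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Length of the longest match among all rules at this position.
        best = 0
        for rule in RULES:
            match = rule.match(text, pos)
            if match and match.end() - pos > best:
                best = match.end() - pos
        if best == 0:
            pos += 1  # no rule matches here (e.g. a lone "_"): skip the character
            continue
        tokens.append(text[pos:pos + best].lower())
        pos += best
    return tokens


print(" ".join(tokenize("BB_HHH_FFFF5_SSSS")))  # bb hhh_ffff5_ssss
print(" ".join(tokenize("BB_HHH_FFFF_SSSS")))   # bb hhh ffff ssss
```

Under this model, "BB" cannot start a joined token because the next segment, "HHH", has no digit, so it is emitted on its own. "HHH_FFFF5_SSSS" then matches the pattern word, punctuation, digit-bearing segment, punctuation, word, and comes out as a single token. If the assumption about the grammar holds, the reported output would be the tokenizer's intended handling of digit-bearing identifiers rather than a defect.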

          People

            Assignee: Unassigned
            Reporter: Matt Dufrasne (mdufrasne)
            Votes: 0
            Watchers: 1