[LUCENENET-354] The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- standardanalyzer
- tokenizer
Environment: Lucene.Net 2.9.1

Description

The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string.

I think there is a bug in the tokenizer in Lucene.Net 2.9.1, and it was probably present in earlier versions as well. When indexing "BB_HHH_FFFF5_SSSS", which contains a digit, the following tokens are returned:

"bb hhh_ffff5_ssss"

After some testing, I've found that this is because of the number. If I input

"BB_HHH_FFFF_SSSS", I get

"bb hhh ffff ssss"

At this point, I'm leaning towards a tokenizer bug, unless the presence of a number is supposed to cause this behavior, but I fail to see why it would.
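The reported output can be reproduced with a small sketch. It assumes, and this issue does not confirm, that the tokenizer follows the classic StandardTokenizer grammar, in which a "NUM"-style rule keeps punctuation-joined segments together when every other segment contains a digit. The regular expressions below are simplified, ASCII-only approximations of that grammar, and `tokenize` is a hypothetical helper, not a Lucene.Net API.

```python
import re

# Simplified, ASCII-only approximations of the grammar fragments (assumption).
ALPHANUM = r"[A-Za-z0-9]+"
HAS_DIGIT = r"[A-Za-z0-9]*[0-9][A-Za-z0-9]*"
P = r"[_\-/.,]"  # punctuation allowed inside a "number-like" token

# Segments joined by punctuation, where every other segment must contain
# at least one digit, plus the plain ALPHANUM rule as the fallback.
RULES = [re.compile(p) for p in (
    rf"{ALPHANUM}{P}{HAS_DIGIT}",
    rf"{HAS_DIGIT}{P}{ALPHANUM}",
    rf"{ALPHANUM}(?:{P}{HAS_DIGIT}{P}{ALPHANUM})+",
    rf"{HAS_DIGIT}(?:{P}{ALPHANUM}{P}{HAS_DIGIT})+",
    rf"{ALPHANUM}{P}{HAS_DIGIT}(?:{P}{ALPHANUM}{P}{HAS_DIGIT})+",
    rf"{HAS_DIGIT}{P}{ALPHANUM}(?:{P}{HAS_DIGIT}{P}{ALPHANUM})+",
    ALPHANUM,
)]


def tokenize(text):
    """Hypothetical longest-match tokenizer followed by lowercasing."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Pick the longest match among all rules at the current position.
        best = 0
        for rule in RULES:
            match = rule.match(text, pos)
            if match and match.end() - pos > best:
                best = match.end() - pos
        if best == 0:
            pos += 1  # no rule starts here (e.g. a lone "_"), so skip it
        else:
            tokens.append(text[pos:pos + best].lower())
            pos += best
    return tokens


print(tokenize("BB_HHH_FFFF5_SSSS"))  # ['bb', 'hhh_ffff5_ssss']
print(tokenize("BB_HHH_FFFF_SSSS"))   # ['bb', 'hhh', 'ffff', 'ssss']
```

Under these assumptions, "BB" splits off because "HHH" has no digit, while "HHH_FFFF5_SSSS" stays whole because its middle segment does. If that is how the grammar behaves, it would be by design rather than a defect.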

People

Assignee: Unassigned

Reporter: Matt Dufrasne

Votes: 0

Watchers: 1

Dates

Created: 01/Apr/10 18:44

Updated: 11/Apr/10 17:51

Resolved: 11/Apr/10 17:51