[LUCENE-417] StandardTokenizer has problems with comma-separated values - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 1.4
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Environment:

Operating System: other
Platform: Other

Bugzilla Id: 35971

Description

The StandardTokenizer assumes that if a phrase contains a comma and at least one
digit, the phrase has to be a number. We are trying to index comma-separated
values of SAP R/3 transaction codes along with standard text. Many of these codes
contain digits, e.g. "VA01" or "SE80". While tokenizing text containing these
codes, Lucene treats a comma-separated list of them, e.g. "VA01,VA02,VA03", as a
single number token instead of three separate codes. The grammar should be
modified to recognize numbers correctly (e.g. by requiring that they contain
only digits).
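
To illustrate the difference, here is a minimal, self-contained sketch in plain Java. It does not use Lucene and is not the actual StandardTokenizer grammar. The class name `CommaTokenSketch`, the `tokenize` method, and both patterns are hypothetical and only approximate the behaviour described above: the loose rule treats any whitespace-delimited chunk that contains a comma and a digit as one number token, while the strict rule accepts a chunk as a number only when it consists solely of digit groups joined by `,` or `.`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CommaTokenSketch {
    // Hypothetical strict rule: digit groups joined by ',' or '.', e.g. "1,000" or "3.14".
    static final Pattern STRICT_NUM = Pattern.compile("\\d+(?:[.,]\\d+)*");
    // Used by the loose rule: any digit anywhere in the chunk.
    static final Pattern HAS_DIGIT = Pattern.compile("\\d");

    // Splits text on whitespace, keeps "number" chunks whole,
    // and splits every other chunk on commas.
    static List<String> tokenize(String text, boolean strict) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            boolean isNumber = strict
                    ? STRICT_NUM.matcher(chunk).matches()
                    : chunk.indexOf(',') >= 0 && HAS_DIGIT.matcher(chunk).find();
            if (isNumber) {
                tokens.add(chunk); // kept as a single token
            } else {
                for (String part : chunk.split(",")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "see VA01,VA02,VA03 for 1,000 orders";
        System.out.println("loose:  " + tokenize(text, false));
        System.out.println("strict: " + tokenize(text, true));
    }
}
```

Under the loose rule the three codes come back as the single token "VA01,VA02,VA03"; under the strict rule they are separate tokens, while "1,000" still stays whole.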

People

Assignee: Unassigned

Votes: 0

Watchers: 0

Dates

Created: 02/Aug/05 21:17

Updated: 28/Aug/22 11:22

Resolved: 12/Jan/08 23:04