[LUCENE-1373] Most of the contributed Analyzers suffer from invalid recognition of acronyms. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." (note the trailing dot) is incorrectly tokenized as an acronym.

Unfortunately, preserving a bug for the sake of "backward compatibility" turns out to harm us.
StandardTokenizer offers a couple of ways to opt in to the fix, but the default behaviour is still the buggy one.
Most of the non-English analyzers provided in lucene-analyzers use StandardTokenizer, and in v2.3.2 not one of them provides a way to get the non-buggy behaviour.

I refer to:

BrazilianAnalyzer
CzechAnalyzer
DutchAnalyzer
FrenchAnalyzer
GermanAnalyzer
GreekAnalyzer
ThaiAnalyzer
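
To illustrate the misclassification described above, here is a minimal, self-contained sketch. It does not use Lucene; the class `AcronymSketch`, its three regular expressions, and the `classify` method are hypothetical stand-ins that only approximate the idea of the tokenizer grammar. The boolean parameter plays the role of the opt-in fix: when it is false, a host name with a trailing dot is reported as an acronym, which is the behaviour the analyzers listed above are stuck with.

```java
import java.util.regex.Pattern;

public class AcronymSketch {
    // Assumption: a genuine acronym is two or more letter-dot pairs, e.g. "U.S.A."
    private static final Pattern STRICT_ACRONYM =
            Pattern.compile("\\p{L}\\.(?:\\p{L}\\.)+");

    // Any dot-terminated run of alphanumeric segments, e.g. "www.apache.org."
    private static final Pattern DOTTED_WITH_TRAILING_DOT =
            Pattern.compile("[\\p{L}\\p{N}]+\\.(?:[\\p{L}\\p{N}]+\\.)+");

    // Dot-separated segments without a trailing dot, e.g. "www.apache.org"
    private static final Pattern HOST =
            Pattern.compile("[\\p{L}\\p{N}]+(?:\\.[\\p{L}\\p{N}]+)+");

    /**
     * Returns a token type label for the given token.
     * replaceInvalidAcronym == false mimics the buggy default;
     * replaceInvalidAcronym == true mimics the opt-in fix.
     */
    static String classify(String token, boolean replaceInvalidAcronym) {
        if (STRICT_ACRONYM.matcher(token).matches()) {
            return "ACRONYM";
        }
        if (DOTTED_WITH_TRAILING_DOT.matcher(token).matches()) {
            // The loose match is where a host with a trailing dot is mislabelled.
            return replaceInvalidAcronym ? "HOST" : "ACRONYM";
        }
        if (HOST.matcher(token).matches()) {
            return "HOST";
        }
        return "ALPHANUM";
    }

    public static void main(String[] args) {
        String host = "www.apache.org.";
        System.out.println(classify(host, false));    // ACRONYM (buggy default)
        System.out.println(classify(host, true));     // HOST (fix enabled)
        System.out.println(classify("U.S.A.", true)); // ACRONYM (unaffected)
    }
}
```

An analyzer that builds its tokenizer internally and never exposes such a flag leaves its callers on the first branch, which is the problem this issue reports.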

Attachments

LUCENE-1373.patch (8 kB), 05/Sep/08 08:28, Mark Lassau

Issue Links

is part of

- LUCENE-2002 Add oal.util.Version ctor to QueryParser (Closed)

relates to

- LUCENE-1403 StandardTokenizer - Improper Hostname Recognition (Closed)
- LUCENE-1068 Invalid behavior of StandardTokenizerImpl (Closed)
- LUCENE-1151 Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default (Closed)

People

Assignee: Unassigned

Reporter: Mark Lassau

Votes: 0

Watchers: 1

Dates

Created: 03/Sep/08 00:56

Updated: 28/Aug/22 11:52

Resolved: 22/Oct/09 19:54