[LUCENE-2167] Implement StandardTokenizer with the UAX#29 Standard - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.1, 4.0-ALPHA
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/analysis
Labels: None
Lucene Fields: Patch Available

Description

It would be really nice for StandardTokenizer to adhere to the UAX#29 standard as closely as we can with JFlex. Then its name would actually make sense.

Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

This should be a good tokenizer for most European-language documents

The new StandardTokenizer could then say:

This should be a good tokenizer for most languages.

All the English/Euro-centric handling, such as the acronym, company-name, and apostrophe rules, can stay with that EuropeanTokenizer, which the European-language analyzers could then use.
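To make the idea concrete, here is a rough sketch of language-neutral word-boundary tokenization. It is not the proposed JFlex grammar: it uses the JDK's `java.text.BreakIterator` as a stand-in, which segments on word boundaries in the spirit of UAX#29 but is not guaranteed to match the standard rule for rule. The class name `WordBreakSketch` and the "keep segments containing a letter or digit" filter are assumptions made for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
class WordBreakSketch {

    /**
     * Splits text at the word boundaries reported by the JDK BreakIterator
     * and keeps only segments that contain at least one letter or digit,
     * so whitespace and punctuation runs are dropped.
     */
    static List<String> tokenize(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption: a "word" is any segment with a letter or digit in it.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world"));
    }
}
```

Nothing in this sketch knows about acronyms, company names, or e-mail addresses. Those are the English/Euro-centric rules that would stay behind in the renamed tokenizer.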

Attachments

File | Date | Size | Author
StandardTokenizerImpl.jflex | 19/Jul/10 05:18 | 14 kB | Steven Rowe
standard.zip | 29/Jun/10 16:03 | 162 kB | Robert Muir
LUCENE-2167-lucene-buildhelper-maven-plugin.patch | 27/May/10 06:23 | 39 kB | Steven Rowe
LUCENE-2167-jflex-tld-macro-gen.patch | 27/May/10 15:19 | 14 kB | Uwe Schindler
LUCENE-2167-jflex-tld-macro-gen.patch | 27/May/10 15:47 | 14 kB | Uwe Schindler
LUCENE-2167-jflex-tld-macro-gen.patch | 01/Jun/10 08:10 | 14 kB | Uwe Schindler
LUCENE-2167.patch | 16/Dec/09 18:55 | 3 kB | Shyamal Prasad
LUCENE-2167.patch | 24/Feb/10 02:13 | 2 kB | Shyamal Prasad
LUCENE-2167.patch | 06/May/10 05:01 | 56 kB | Steven Rowe
LUCENE-2167.patch | 08/May/10 16:51 | 56 kB | Steven Rowe
LUCENE-2167.patch | 09/May/10 05:07 | 46 kB | Steven Rowe
LUCENE-2167.patch | 15/May/10 16:12 | 47 kB | Steven Rowe
LUCENE-2167.patch | 15/May/10 22:52 | 49 kB | Steven Rowe
LUCENE-2167.patch | 17/May/10 00:22 | 50 kB | Steven Rowe
LUCENE-2167.patch | 17/May/10 05:49 | 50 kB | Steven Rowe
LUCENE-2167.patch | 27/May/10 06:01 | 53 kB | Steven Rowe
LUCENE-2167.patch | 07/Jun/10 08:00 | 859 kB | Steven Rowe
LUCENE-2167.patch | 10/Jun/10 04:27 | 746 kB | Steven Rowe
LUCENE-2167.patch | 30/Jun/10 14:21 | 812 kB | Robert Muir
LUCENE-2167.patch | 01/Jul/10 00:02 | 529 kB | Robert Muir
LUCENE-2167.patch | 19/Jul/10 06:44 | 588 kB | Steven Rowe
LUCENE-2167.patch | 26/Jul/10 06:18 | 887 kB | Steven Rowe
LUCENE-2167.patch | 26/Jul/10 06:27 | 874 kB | Steven Rowe
LUCENE-2167.patch | 15/Sep/10 08:07 | 831 kB | Steven Rowe
LUCENE-2167.patch | 28/Sep/10 06:06 | 885 kB | Steven Rowe
LUCENE-2167.benchmark.patch | 16/May/10 01:13 | 31 kB | Steven Rowe
LUCENE-2167.benchmark.patch | 13/Jun/10 04:08 | 33 kB | Steven Rowe
LUCENE-2167.benchmark.patch | 11/Jul/10 20:41 | 34 kB | Steven Rowe

Issue Links

incorporates

LUCENE-1545 Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E

Closed

LUCENE-1702 Thai token type() bug

Closed

LUCENE-1556 some valid email address characters not correctly recognized

Closed

is related to

LUCENE-2763 Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Closed

relates to

LUCENE-2244 Improve StandardTokenizer's understanding of non ASCII punctuation and quotes

Closed

People

Assignee: Steven Rowe

Reporter: Shyamal Prasad

Votes: 0

Watchers: 2

Dates

Created: 16/Dec/09 18:48

Updated: 28/Aug/22 12:17

Resolved: 15/Nov/10 18:27

Time Tracking

Estimated: 0.5h

Remaining: 0.5h

Logged: Not Specified