Jackrabbit Oak / OAK-3648

Use StandardTokenizer instead of ClassicTokenizer in OakAnalyzer


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.34, 1.2.19, 1.3.11, 1.4
    • Component/s: lucene
    • Labels: None

      Description

      This is related to OAK-3276, where the intent was to use StandardAnalyzer by default (instead of OakAnalyzer). As discussed there, we need a specific word-delimiter configuration which isn't possible with StandardAnalyzer, so we should instead switch over to StandardTokenizer in OakAnalyzer itself.

      A few motivations to do that:

      • Better Unicode support
      • ClassicTokenizer is the old (~Lucene 3.1) implementation of the standard tokenizer

      One of the key differences between the classic and standard tokenizers is the way they delimit words (StandardTokenizer follows the Unicode text segmentation rules), but that difference gets nullified as we apply our own WordDelimiterFilter.
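      StandardTokenizer's word delimiting follows Unicode text segmentation (UAX #29). As an illustration of what those rules do (this is a standalone sketch using the JDK's java.text.BreakIterator, which implements the same word-boundary rules, not Oak or Lucene code), note how a hyphenated, accented string is split:

      ```java
      import java.text.BreakIterator;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Locale;

      public class Uax29Demo {
          // Extract word tokens using the JDK's BreakIterator, which applies the
          // UAX #29 word-boundary rules that StandardTokenizer also follows.
          static List<String> words(String text) {
              BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
              it.setText(text);
              List<String> out = new ArrayList<>();
              int start = it.first();
              for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                  String candidate = text.substring(start, end);
                  // Keep only tokens containing a letter or digit; the iterator
                  // also yields the spaces and punctuation between words.
                  if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                      out.add(candidate);
                  }
              }
              return out;
          }

          public static void main(String[] args) {
              // Hyphens are word boundaries under UAX #29, and accented letters
              // stay inside their word.
              System.out.println(words("OakAnalyzer splits café-style text"));
              // → [OakAnalyzer, splits, café, style, text]
          }
      }
      ```

      A compound like "café-style" is split at the hyphen by these rules alone; in OakAnalyzer, further splitting (e.g. on case changes or digits) is delegated to the WordDelimiterFilter mentioned above, which is why the classic-vs-standard delimiting difference largely washes out.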

              People

              • Assignee:
                catholicon Vikas Saurabh
                Reporter:
                catholicon Vikas Saurabh
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue
