Lucene - Core
LUCENE-6879

Allow defining custom CharTokenizers using Java 8 lambdas/method references

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 6.0
    • Component/s: modules/analysis
    • Lucene Fields: New

      Description

      As a follow-up to LUCENE-6874, I thought about how to create custom CharTokenizers without subclassing. I have needed this quite often and was a bit annoyed that you had to create a subclass every time.

      This issue uses the same pattern as ThreadLocal and many collection methods in Java 8: you have the (abstract) base class and define a factory method named fromXxxPredicate (like ThreadLocal.withInitial(() -> value)):

      public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate predicate)
      

      This would allow defining a new CharTokenizer with a single-line statement using any predicate:

      // long variant with lambda:
      Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> !UCharacter.isUWhiteSpace(c));
      
      // method reference for separator char predicate + normalization by uppercasing:
      Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, Character::toUpperCase);
      
      // method reference to custom function:
      private boolean myTestFunction(int c) {
       return (crazy condition);
      }
      Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
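
      For illustration, here is a minimal sketch of how such a factory might look internally (an assumed shape, not the exact patch code): it simply returns an anonymous CharTokenizer subclass that delegates isTokenChar(int) to the given predicate.

      // hypothetical sketch of the factory, relying only on CharTokenizer's
      // abstract isTokenChar(int) contract:
      public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate predicate) {
        java.util.Objects.requireNonNull(predicate, "predicate must not be null");
        return new CharTokenizer() {
          @Override
          protected boolean isTokenChar(int c) {
            return predicate.test(c); // per-char decision delegated to the predicate
          }
        };
      }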
      

      I know this would not help Solr users who want to define the Tokenizer in a config file, but for plain Lucene users this Java 8 way would be easy and elegant to use. It is fast as hell, as it is just a reference to a method, and Java 8 is optimized for that.

      The inverted factories fromSeparatorCharPredicate() are provided to allow quick definitions using method references without negated lambdas. In many cases, like WhitespaceTokenizer, the predicate is on the separator chars (isWhitespace(int)), so with the second set of factories you can define them without the counter-intuitive negation. Internally it just uses IntPredicate#negate(), as sketched below.
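
      A minimal sketch of that delegation (again an assumed shape, not the exact patch code):

      // hypothetical sketch: negate the separator predicate and reuse the
      // token-char factory
      public static CharTokenizer fromSeparatorCharPredicate(java.util.function.IntPredicate separatorCharPredicate) {
        return fromTokenCharPredicate(separatorCharPredicate.negate());
      }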

      The factories also allow passing a normalization function; e.g., to lowercase, you can just pass Character::toLowerCase as an IntUnaryOperator reference.
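
      For example, the following one-liner behaves like the classic LowerCaseTokenizer (letters form tokens, output is lowercased); the same combination is used in the benchmark further down:

      // letters-only tokens, lowercased on output:
      Tokenizer tok = CharTokenizer.fromTokenCharPredicate(Character::isLetter, Character::toLowerCase);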


      Attachments

      1. LUCENE-6879.patch
        11 kB
        Uwe Schindler
      2. LUCENE-6879.patch
        7 kB
        Uwe Schindler


          Activity

          thetaphi Uwe Schindler added a comment -

          Patch using Java 8's new functional APIs. Very cool and simple to define a new Tokenizer.

          The only thing I don't like is that CharTokenizer is in the oal.analysis.util package. Maybe we should move the factories to a separate class in the oal.analysis.core package.

          The patch also has some tests showing how you would use them.

          rcmuir Robert Muir added a comment -

          I think the tests are nice examples, and I like the separator vs. token-char methods (it can be hard to think in opposites).

          Good improvement for Java 8 on trunk.

          thetaphi Uwe Schindler added a comment -

          We can improve the Javadocs by adding the examples; I just wanted to write the patch quickly to demonstrate what it could look like. We can also discuss the method names: the pattern follows the naming convention used for all functional interfaces in Java 8, but we can make it more readable. I am open to suggestions.

          In Lucene trunk we can also remove all the separate implementations like LetterTokenizer and just allow them to be produced by factories. This would be a slight break, but we could still provide the Solr/CustomAnalyzer factories as usual. The Tokenizer for ICU in LUCENE-6874 could then also be a one-liner provided just by the Solr factory, without an actual Tokenizer class.

          We could also provide a one-for-all Solr/CustomAnalyzer factory using an enum of predicate/normalizer functions, chosen by a string parameter; a rough idea is sketched below.
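
          A purely hypothetical sketch of that idea (names invented here, not from the patch), assuming the two-argument fromTokenCharPredicate factory from this issue:

          // hypothetical sketch, not part of the patch: predicate/normalizer
          // pairs selectable by name from a single factory
          enum PredefinedTokenizer {
            LETTER(Character::isLetter, c -> c),
            LOWERCASE(Character::isLetter, Character::toLowerCase),
            WHITESPACE(c -> !Character.isWhitespace(c), c -> c);

            private final java.util.function.IntPredicate tokenCharPredicate;
            private final java.util.function.IntUnaryOperator normalizer;

            PredefinedTokenizer(java.util.function.IntPredicate p, java.util.function.IntUnaryOperator n) {
              this.tokenCharPredicate = p;
              this.normalizer = n;
            }

            Tokenizer create() {
              return CharTokenizer.fromTokenCharPredicate(tokenCharPredicate, normalizer);
            }
          }

          // a Solr factory could then do, e.g.:
          // return PredefinedTokenizer.valueOf(nameParam.toUpperCase(Locale.ROOT)).create();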

          dweiss Dawid Weiss added a comment -

          Pretty cool, Uwe!

          "It is fast as hell"

          I always thought hell was about slow and endless suffering?

          thetaphi Uwe Schindler added a comment -

          "I always thought hell was about slow and endless suffering?"

          Um, yes.

          But this video tells you otherwise: https://www.youtube.com/watch?v=Uqa8MFSXZHM
          If you need to burn fat, fast as hell: http://www.amazon.com/ULTIMATE-CUTS-SECRETS-English-Edition-ebook/dp/B00HMQS8TA

          dsmiley David Smiley added a comment -

          +1 Nice Uwe.

          thetaphi Uwe Schindler added a comment -

          New patch with improved Javadocs. Will commit this soon.

          jira-bot ASF subversion and git services added a comment -

          Commit 1712682 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1712682 ]

          LUCENE-6879: Allow to define custom CharTokenizer instances without subclassing using Java 8 lambdas or method references

          thetaphi Uwe Schindler added a comment -

          Thanks for the review!

          thetaphi Uwe Schindler added a comment -

          Just FYI: I did a quick microbenchmark like this:

          // init & warmup
          String text = "Tokenizer(Test)FooBar";
          String[] result = new String[] { "tokenizer", "test", "foobar" };
          final Tokenizer tokenizer1 = CharTokenizer.fromTokenCharPredicate(Character::isLetter, Character::toLowerCase);
          for (int i = 0; i < 10000; i++) {
            tokenizer1.setReader(new StringReader(text));
            assertTokenStreamContents(tokenizer1, result);
          }
          final Tokenizer tokenizer2 = new LowerCaseTokenizer();
          for (int i = 0; i < 10000; i++) {
            tokenizer2.setReader(new StringReader(text));
            assertTokenStreamContents(tokenizer2, result);
          }
          
          // speed test
          long [] lens1 = new long[100], lens2 = new long[100]; 
          for (int j = 0; j < 100; j++) {
            System.out.println("Run: " + j);
            long start1 = System.currentTimeMillis();
            for (int i = 0; i < 1000000; i++) {
              tokenizer1.setReader(new StringReader(text));
              assertTokenStreamContents(tokenizer1, result);
            }
            lens1[j] = System.currentTimeMillis() - start1;
            
            long start2 = System.currentTimeMillis();
            for (int i = 0; i < 1000000; i++) {
              tokenizer2.setReader(new StringReader(text));
              assertTokenStreamContents(tokenizer2, result);
            }
            lens2[j] = System.currentTimeMillis() - start2;
          }
          
          System.out.println("Time Lambda: " + Arrays.stream(lens1).summaryStatistics());
          System.out.println("Time Old: " + Arrays.stream(lens2).summaryStatistics());
          

          I was not able to find any speed difference after warmup:

          • Time Lambda: LongSummaryStatistics {count=100, sum=58267, min=562, average=582.670000, max=871}
          • Time Old: LongSummaryStatistics {count=100, sum=61489, min=600, average=614.890000, max=721}
          jira-bot ASF subversion and git services added a comment -

          Commit 1713098 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1713098 ]

          LUCENE-6879: Add missing null checks for parameters

          jira-bot ASF subversion and git services added a comment -

          Commit 1713099 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1713099 ]

          Merge additional null check from LUCENE-6879


            People

            • Assignee: thetaphi Uwe Schindler
            • Reporter: thetaphi Uwe Schindler
            • Votes: 0
            • Watchers: 3
