[LUCENE-4766] Pattern token filter which emits a token for every capturing group

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.1
Fix Version/s: 4.4, 6.0
Component/s: modules/analysis
Labels:
- analysis
- feature
- lucene

Lucene Fields: New, Patch Available

Description

The PatternTokenizer either splits on matches or lets you specify a single capture group. This is insufficient for my needs: quite often I want to capture multiple overlapping tokens at the same position.

I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern.
Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

For instance, a pattern like:

    "(([a-z]+)(\d*))"

when matched against:

    "abc123def456"

would produce the tokens:

    abc123, abc, 123, def456, def, 456
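The single-pattern example above can be reproduced with plain `java.util.regex`. This is only a minimal sketch, not the code from the attached patch: the class name `CaptureGroupSketch` and the method `captureGroups` are hypothetical, and skipping unmatched or empty groups is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupSketch {
    // Returns every non-empty capturing group of every match, in match order.
    static List<String> captureGroups(String input, Pattern pattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        // find() is unanchored, so one pattern can match several times.
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);
                // Assumption: groups that did not participate or are empty emit nothing.
                if (text != null && !text.isEmpty()) {
                    tokens.add(text);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(([a-z]+)(\\d*))");
        System.out.println(captureGroups("abc123def456", p));
        // prints [abc123, abc, 123, def456, def, 456]
    }
}
```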

Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:

    "([A-Z]{2,})",
    "(?<![A-Z])([A-Z][a-z]+)",
    "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
    "([0-9]+)"

When matched against the string "letsPartyLIKEits1999_dude", they would produce the tokens:

    lets, Party, LIKE, its, 1999, dude
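To check the camelCase patterns outside of Lucene, the following sketch runs all four against the input and orders the captures by their start offset. The ordering step is an assumption made here so that the result matches the token order listed above. `MultiPatternSketch`, `CAMEL_CASE` and `captureAll` are hypothetical names, and the code is not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternSketch {
    // The four camelCase patterns from the description, as Java string literals.
    static final List<Pattern> CAMEL_CASE = List.of(
        Pattern.compile("([A-Z]{2,})"),
        Pattern.compile("(?<![A-Z])([A-Z][a-z]+)"),
        Pattern.compile("(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)"),
        Pattern.compile("([0-9]+)"));

    static List<String> captureAll(String input, List<Pattern> patterns) {
        List<int[]> spans = new ArrayList<>();   // each entry is {start, end}
        for (Pattern pattern : patterns) {
            Matcher m = pattern.matcher(input);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    // Unmatched groups report -1/-1 and empty groups have start == end.
                    if (m.start(g) < m.end(g)) {
                        spans.add(new int[] {m.start(g), m.end(g)});
                    }
                }
            }
        }
        // List.sort is stable, so captures with equal offsets keep pattern order.
        spans.sort(Comparator.comparingInt(span -> span[0]));
        List<String> tokens = new ArrayList<>();
        for (int[] span : spans) {
            tokens.add(input.substring(span[0], span[1]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(captureAll("letsPartyLIKEits1999_dude", CAMEL_CASE));
        // prints [lets, Party, LIKE, its, 1999, dude]
    }
}
```

Without the sort, the captures would come out grouped by pattern (LIKE first, then Party, and so on) rather than in reading order.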

If the patterns produce no tokens, the original token is preserved.
If the preserveOriginal flag is true, the filter outputs the full original token (i.e. "letsPartyLIKEits1999_dude") in addition to any matching tokens. In that case, if a matching token is identical to the original, only one copy of the full token is emitted.
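The fallback and preserveOriginal rules can be sketched as follows. This is an illustration under stated assumptions rather than the patch's implementation: `PreserveOriginalSketch` and `expand` are hypothetical names, and placing the original token first in the output is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreserveOriginalSketch {
    static List<String> expand(String token, boolean preserveOriginal, Pattern... patterns) {
        List<String> out = new ArrayList<>();
        if (preserveOriginal) {
            out.add(token);   // assumption: the original token comes first
        }
        for (Pattern pattern : patterns) {
            Matcher m = pattern.matcher(token);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text == null || text.isEmpty()) {
                        continue;   // group did not capture anything
                    }
                    if (preserveOriginal && text.equals(token)) {
                        continue;   // the full token was already emitted once
                    }
                    out.add(text);
                }
            }
        }
        if (out.isEmpty()) {
            out.add(token);   // nothing captured: keep the original token
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern letters = Pattern.compile("([a-z]+)");
        System.out.println(expand("abc123", false, letters));  // prints [abc]
        System.out.println(expand("abc123", true, letters));   // prints [abc123, abc]
        System.out.println(expand("123", false, letters));     // prints [123]
        System.out.println(expand("abc", true, letters));      // prints [abc]
    }
}
```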

Multiple patterns are required to allow overlapping captures, but they also mean that each pattern is less dense and easier to understand.

This is my first Java code, so apologies if I'm doing something stupid.

Attachments

- LUCENE-4766.patch, 24/Apr/13 09:05, 31 kB, Simon Willnauer
- LUCENE-4766.patch, 13/Feb/13 13:31, 29 kB, Clinton Gormley
- LUCENE-4766.patch, 13/Feb/13 11:59, 29 kB, Clinton Gormley
- LUCENE-4766.patch, 11/Feb/13 11:15, 29 kB, Simon Willnauer
- LUCENE-4766.patch, 10/Feb/13 12:35, 23 kB, Clinton Gormley

People

Assignee: Simon Willnauer

Reporter: Clinton Gormley

Votes: 0

Watchers: 5

Dates

Created: 10/Feb/13 12:31

Updated: 28/Aug/22 13:38

Resolved: 24/Apr/13 10:33