[LUCENE-9575] Add PatternTypingFilter - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 8.9
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

One of the key asks when the Library of Congress asked me to develop the Advanced Query Parser was the ability to recognize arbitrary patterns that include punctuation, such as POW/MIA, 401(k), or C++. Additionally, they wanted 401k and 401(k) to match documents with either style of reference, and NOT match documents that happen to contain isolated 401 or k tokens (i.e., not documents about the HTTP status code). And of course we wanted to give up as few as possible of the text analysis features they were already using.

This filter, in conjunction with the filters from LUCENE-9572 and LUCENE-9574 and one Solr-specific filter in SOLR-14597 that re-analyzes tokens with an arbitrary analyzer defined for a type in the Solr schema, achieves this.

This filter has the job of spotting the patterns and adding the intended synonym as a type on the token (from which minimal punctuation has been removed). It also sets flags on the token, which are retained through the analysis chain. At the very end of the chain, the type is converted to a synonym and the original token(s) for that type are dropped, avoiding the match on 401 (for example).

The pattern matching is specified in a file that looks like:

2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
2 C\+\+ ::: c_plus_plus

That file would match legal reference patterns such as 401(k), 401k, 501(c)3 and C++. The format is:

<flags> <pattern> ::: <replacement>

Groups in the pattern are substituted into the replacement, so the first line above would create synonyms such as:

401k   --> legal2_401_k
401(k) --> legal2_401_k
503(c) --> legal2_503_c
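
The substitution described above can be sketched with plain java.util.regex. This is only an illustration, not the filter's actual code: the class name PatternTypingSketch and its methods are hypothetical, and it assumes that each rule line is an integer flags value, a space, a regex, the separator " ::: ", and a replacement template, that a rule applies only when the pattern matches the whole token, and that the first matching rule wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Lucene PatternTypingFilter implementation. */
public class PatternTypingSketch {

  /** One parsed rule line: integer flags, a regex, and a type template. */
  public static final class Rule {
    public final int flags;
    public final Pattern pattern;
    public final String template;

    Rule(int flags, Pattern pattern, String template) {
      this.flags = flags;
      this.pattern = pattern;
      this.template = template;
    }
  }

  /** Parses a line assumed to be: flags, space, regex, " ::: ", template. */
  public static Rule parseRule(String line) {
    int firstSpace = line.indexOf(' ');
    int separator = line.indexOf(" ::: ");
    if (firstSpace < 0 || separator <= firstSpace) {
      throw new IllegalArgumentException("Malformed rule: " + line);
    }
    int flags = Integer.parseInt(line.substring(0, firstSpace));
    Pattern pattern = Pattern.compile(line.substring(firstSpace + 1, separator));
    String template = line.substring(separator + " ::: ".length());
    return new Rule(flags, pattern, template);
  }

  /** Returns the type for a token, or null if no rule matches the whole token. */
  public static String typeFor(List<Rule> rules, String token) {
    for (Rule rule : rules) {
      Matcher matcher = rule.pattern.matcher(token);
      if (matcher.matches()) {
        // $1, $2, ... in the template are replaced by the captured groups.
        return matcher.replaceFirst(rule.template);
      }
    }
    return null;
  }

  /** The three rule lines from the example file above. */
  public static List<Rule> exampleRules() {
    List<Rule> rules = new ArrayList<>();
    rules.add(parseRule("2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2"));
    rules.add(parseRule("2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3"));
    rules.add(parseRule("2 C\\+\\+ ::: c_plus_plus"));
    return rules;
  }

  public static void main(String[] args) {
    List<Rule> rules = exampleRules();
    for (String token : new String[] {"401k", "401(k)", "503(c)", "501(c)3", "C++", "401"}) {
      System.out.println(token + " --> " + typeFor(rules, token));
    }
  }
}
```

Under these assumptions an isolated 401 matches no rule and gets no type, which is the behavior the description asks for. The flags value is parsed but not used here, since propagating flags through the analysis chain is outside what a standalone sketch can show.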

Issue Links

blocks

SOLR-14597 Advanced Query Parser

Patch Available

is blocked by

LUCENE-9572 Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types

Closed

links to

GitHub Pull Request #1995

GitHub Pull Request #2240

GitHub Pull Request #2241

GitHub Pull Request #2493

People

Assignee:: Gus Heck

Reporter:: Gus Heck

Votes:: 0

Watchers:: 6

Dates

Created:: 08/Oct/20 17:37

Updated:: 28/Aug/22 16:09

Resolved:: 13/May/21 00:49

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 5h 40m