One of the key asks when the Library of Congress was asking me to develop the Advanced Query Parser was to be able to recognize arbitrary patterns that included punctuation such as POW/MIA or 401(k) or C++ etc. Additionally they wanted 401k and 401(k) to match documents with either style reference, and NOT match documents that happen to have isolated 401 or k tokens (i.e. not documents about the http status code) And of course we wanted to give up as little of the text analysis features they were already using.
This filter in conjunction with the filters from
LUCENE-9572, LUCENE-9574 and one solr specific filter in SOLR-14597 that re-analyzes tokens with an arbitrary analyzer defined for a type in the solr schema, combine to achieve this.
This filter has the job of spotting the patterns, and adding the intended synonym as at type to the token (from which minimal punctuation has been removed). It also sets flags on the token which are retained through the analysis chain, and at the very end the type is converted to a synonym and the original token(s) for that type are dropped avoiding the match on 401 (for example)
The pattern matching is specified in a file that looks like:
That file would match match legal reference patterns such as 401(k), 401k, 501(c)3 and C++ The format is:
<flagsInt> <pattern> ::: <replacement>
and groups in the pattern are substituted into the replacement so the first line above would create synonyms such as: