Lucene - Core / LUCENE-9575

Add PatternTypingFilter

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 8.9
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      One of the key requests from the Library of Congress, when they asked me to develop the Advanced Query Parser, was the ability to recognize arbitrary patterns that include punctuation, such as POW/MIA, 401(k), or C++. Additionally, they wanted 401k and 401(k) to match documents using either style of reference, and NOT to match documents that happen to contain isolated 401 or k tokens (i.e. not documents about the HTTP status code). And of course we wanted to give up as few as possible of the text analysis features they were already using.

      This filter, in conjunction with the filters from LUCENE-9572 and LUCENE-9574 and one Solr-specific filter in SOLR-14597 (which re-analyzes tokens with an arbitrary analyzer defined for a type in the Solr schema), achieves this.

      This filter has the job of spotting the patterns and adding the intended synonym as a type on the token (from which minimal punctuation has been removed). It also sets flags on the token, which are retained through the analysis chain. At the very end of the chain, the type is converted to a synonym and the original token(s) for that type are dropped, avoiding the match on 401 (for example).

      The pattern matching is specified in a file that looks like: 

      2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
      2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
      2 C\+\+ ::: c_plus_plus
      

      That file would match legal reference patterns such as 401(k), 401k and 501(c)3, as well as C++. The format is:

      <flagsInt> <pattern> ::: <replacement>

      and groups in the pattern are substituted into the replacement so the first line above would create synonyms such as:

      401k   --> legal2_401_k
      401(k) --> legal2_401_k
      503(c) --> legal2_503_c
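
The rule format and group substitution described above can be illustrated with a minimal, standalone sketch that uses only java.util.regex. It is not the filter's actual code: the class PatternTypingSketch, its Rule holder and the methods parseRule, exampleRules and typeFor are hypothetical names introduced for illustration. The sketch also assumes that a pattern must match the whole token and that the first matching rule wins, which may differ from what the real filter does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: parses "<flagsInt> <pattern> ::: <replacement>" lines and applies them. */
final class PatternTypingSketch {

    private static final String SEPARATOR = " ::: ";

    /** One parsed line of the rules file (hypothetical holder type). */
    static final class Rule {
        final int flags;
        final Pattern pattern;
        final String template;

        Rule(int flags, Pattern pattern, String template) {
            this.flags = flags;
            this.pattern = pattern;
            this.template = template;
        }
    }

    /** Splits a line into flags (before the first space), pattern, and replacement template. */
    static Rule parseRule(String line) {
        int firstSpace = line.indexOf(' ');
        int sep = line.indexOf(SEPARATOR, firstSpace + 1);
        if (firstSpace < 0 || sep < 0) {
            throw new IllegalArgumentException("Bad rule line: " + line);
        }
        int flags = Integer.parseInt(line.substring(0, firstSpace));
        String regex = line.substring(firstSpace + 1, sep);
        String template = line.substring(sep + SEPARATOR.length());
        return new Rule(flags, Pattern.compile(regex), template);
    }

    /** The three example lines from the description, in file order. */
    static List<Rule> exampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(parseRule("2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2"));
        rules.add(parseRule("2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3"));
        rules.add(parseRule("2 C\\+\\+ ::: c_plus_plus"));
        return rules;
    }

    /**
     * Returns the type for the first rule whose pattern matches the whole token,
     * or null if no rule matches. Whole-token matching is an assumption of this sketch.
     */
    static String typeFor(List<Rule> rules, String token) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(token);
            if (m.matches()) {
                // Expand $1, $2, ... in the template using the groups of this match.
                StringBuffer type = new StringBuffer();
                m.appendReplacement(type, rule.template);
                return type.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        for (String token : new String[] {"401k", "401(k)", "503(c)", "501(c)3", "C++", "401"}) {
            System.out.println(token + " --> " + typeFor(rules, token));
        }
    }
}
```

Under these assumptions the sketch reproduces the three mappings listed above, maps 501(c)3 to legal3_501_c_3 and C++ to c_plus_plus, and assigns no type to a bare 401.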
      

            People

              Assignee: Gus Heck
              Reporter: Gus Heck
              Votes: 0
              Watchers: 6

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 5h 40m