Lucene - Core / LUCENE-7857

CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.7, 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Assigning to myself to not lose track of it.

      LUCENE-7705 introduced the ability to define the allowable token length for these tokenizers rather than hard-coding it to 255. Whenever the hard-coded limit was exceeded, multiple tokens have always been emitted. However, the tests for LUCENE-7705 exposed a problem with that behavior.

      Suppose the max length is 3 and the doc contains "letter". Two tokens are emitted and indexed: "let" and "ter".

      Now suppose the search is for "lett". If the default operator is AND, or if phrase queries are constructed, the query fails because the query tokens are "let" and "t", and "t" does not match the indexed "ter". Only if the operator is OR is the document found, and even then the result is not correct: a search for "lett" would also match a document indexed with "bett", because it would match on the bare "t".
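      The mismatch described above can be reproduced with a small stand-alone sketch. This is not Lucene code: `ChunkingTokenizerDemo`, `tokenize`, `matchesAll`, and `matchesAny` are hypothetical names that only imitate the "emit another token once the limit is reached" behavior for runs of letters, and AND/OR matching is modeled as simple set containment (phrase queries are not modeled).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ChunkingTokenizerDemo {

    // Hypothetical stand-in for the behavior in this issue: a run of letters
    // longer than maxTokenLen is emitted as several consecutive tokens.
    static List<String> tokenize(String text, int maxTokenLen) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
                if (current.length() == maxTokenLen) {
                    // Limit reached: emit the chunk and start a new token.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // AND semantics (simplified): every query token must be in the document.
    static boolean matchesAll(List<String> docTokens, List<String> queryTokens) {
        return new HashSet<>(docTokens).containsAll(queryTokens);
    }

    // OR semantics (simplified): at least one query token is in the document.
    static boolean matchesAny(List<String> docTokens, List<String> queryTokens) {
        for (String q : queryTokens) {
            if (docTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> letter = tokenize("letter", 3);
        List<String> bett = tokenize("bett", 3);
        List<String> query = tokenize("lett", 3);
        System.out.println("doc 'letter' tokens: " + letter);
        System.out.println("doc 'bett' tokens:   " + bett);
        System.out.println("query 'lett' tokens: " + query);
        System.out.println("AND matches 'letter': " + matchesAll(letter, query));
        System.out.println("OR matches 'letter':  " + matchesAny(letter, query));
        System.out.println("OR matches 'bett':    " + matchesAny(bett, query));
    }
}
```

      In this sketch the AND query misses the "letter" document, while the OR query finds both "letter" and the unrelated "bett" document.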

      Proposal:

      The remainder of the token should be ignored when maxTokenLen is exceeded.
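      A minimal sketch of what the proposal would mean, under the same simplifying assumptions (letters only, hypothetical names `TruncatingTokenizerDemo`, `tokenize`, and `matchesAll`; this is not the actual Lucene change): once a token reaches the maximum length, the remaining characters of that run are consumed and discarded instead of starting a new token.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class TruncatingTokenizerDemo {

    // Hypothetical illustration of the proposal: keep the first maxTokenLen
    // characters of each run of letters and ignore the remainder of the run.
    static List<String> tokenize(String text, int maxTokenLen) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                if (current.length() < maxTokenLen) {
                    current.append(c);
                }
                // Otherwise the character belongs to an over-long token and is dropped.
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // AND semantics (simplified): every query token must be in the document.
    static boolean matchesAll(List<String> docTokens, List<String> queryTokens) {
        return new HashSet<>(docTokens).containsAll(queryTokens);
    }

    public static void main(String[] args) {
        List<String> letter = tokenize("letter", 3);
        List<String> bett = tokenize("bett", 3);
        List<String> query = tokenize("lett", 3);
        System.out.println("doc 'letter' tokens: " + letter);
        System.out.println("doc 'bett' tokens:   " + bett);
        System.out.println("query 'lett' tokens: " + query);
        System.out.println("AND matches 'letter': " + matchesAll(letter, query));
        System.out.println("AND matches 'bett':   " + matchesAll(bett, query));
    }
}
```

      With truncation, "letter" and "lett" both reduce to the single token "let", so the query matches under any operator, and the "bett" document is no longer matched through a stray "t".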

      Robert Muir, [~steve_rowe], [~tomasflobbe]: comments? Again, this behavior was not introduced by LUCENE-7705; it is just very hard to notice with the default 255-character limit.

      I'm not quite sure why master generates a parsed query of:
      field:let field:t
      and 6x generates
      field:"let t"
      so the tests succeeded on master but not on 6x....

              People

              • Assignee: Erick Erickson (erickerickson)
              • Reporter: Erick Erickson (erickerickson)