LUCENE-7857

CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.7, 7.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Assigning to myself so I don't lose track of it.

      LUCENE-7705 introduced the ability to configure the maximum token length for these tokenizers rather than hard-coding it to 255. It has always been the case that when the hard-coded limit was exceeded, multiple tokens would be emitted. However, the tests for LUCENE-7705 exposed a problem.

      Suppose the max length is 3 and the doc contains "letter". Two tokens are emitted and indexed: "let" and "ter".

      Now suppose the search is for "lett". The query analyzer splits it into "let" and "t", so if the default operator is AND, or if phrase queries are constructed, the query fails to match. Only with OR as the operator is the document found, and even then the results are wrong: a search for "lett" would also match a document indexed with "bett", because both produce the bare token "t".
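
      To make the failure concrete, here is a minimal sketch (assuming the maxTokenLen constructor added by LUCENE-7705, e.g. LetterTokenizer(AttributeFactory, int)) that prints the tokens produced at index time for "letter" and at query time for "lett":

      {code:java}
      import java.io.StringReader;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.core.LetterTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.util.AttributeFactory;

      public class MaxTokenLenDemo {
        static void dump(String text) throws Exception {
          // maxTokenLen = 3; the two-arg constructor is assumed to be the one added by LUCENE-7705
          Tokenizer tok = new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3);
          tok.setReader(new StringReader(text));
          CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
          tok.reset();
          while (tok.incrementToken()) {
            System.out.println(text + " -> " + term.toString());
          }
          tok.end();
          tok.close();
        }

        public static void main(String[] args) throws Exception {
          dump("letter"); // prints "let" then "ter" -- both get indexed
          dump("lett");   // prints "let" then "t"  -- what the query analyzer sees
        }
      }
      {code}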

      Proposal:

      The remainder of the token should be ignored when maxTokenLen is exceeded.
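
      A hedged test sketch of the proposed behavior (class and method names are illustrative; assertTokenStreamContents is the helper from Lucene's BaseTokenStreamTestCase):

      {code:java}
      import java.io.StringReader;
      import org.apache.lucene.analysis.BaseTokenStreamTestCase;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.core.LetterTokenizer;
      import org.apache.lucene.util.AttributeFactory;

      public class TestMaxTokenLenTruncation extends BaseTokenStreamTestCase {
        public void testOverlongTokenIsTruncatedNotSplit() throws Exception {
          // Assumes the (AttributeFactory, maxTokenLen) constructor from LUCENE-7705.
          Tokenizer tok = new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3);
          tok.setReader(new StringReader("letter"));
          // Current behavior: {"let", "ter"}.  Proposed behavior: the trailing "ter" is dropped.
          assertTokenStreamContents(tok, new String[] { "let" });
        }
      }
      {code}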

      rcmuir, steve_rowe, tomasflobbe: comments? Again, this behavior was not introduced by LUCENE-7705; it's just that it would be very hard to notice with the default 255-character limit.

      I'm not quite sure why master generates a parsed query of:
      field:let field:t
      while 6.x generates:
      field:"let t"
      which is why the tests succeeded on master but not on 6.x.
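
      For anyone reproducing this, a hedged sketch (analyzer wiring is illustrative) of how to inspect what the classic QueryParser produces for "lett" against a field analyzed with maxTokenLen=3; whether the two analyzed terms come out as a boolean query or a phrase query depends on the parser settings and the branch:

      {code:java}
      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.core.LetterTokenizer;
      import org.apache.lucene.queryparser.classic.QueryParser;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.util.AttributeFactory;

      public class InspectParsedQuery {
        public static void main(String[] args) throws Exception {
          Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
              // Assumes the (AttributeFactory, maxTokenLen) constructor from LUCENE-7705.
              return new TokenStreamComponents(
                  new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3));
            }
          };
          QueryParser qp = new QueryParser("field", analyzer);
          // "lett" analyzes to "let" + "t"; the printed query shows how they are combined,
          // e.g. field:let field:t vs. field:"let t" as reported above.
          Query q = qp.parse("lett");
          System.out.println(q);
        }
      }
      {code}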

    People

      Assignee: Erick Erickson (erickerickson)
      Reporter: Erick Erickson (erickerickson)
