  1. Lucene - Core
  2. LUCENE-7468

ASCIIFoldingFilter should not emit duplicated tokens when preserve original is on

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: trunk, 4.7
    • Fix Version/s: 6.3, 7.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      ASCIIFoldingFilter assumes that any token containing a character outside the ASCII range will be changed by folding.
      When preserve original is true, the filter therefore captures and restores the state even if folding leaves the token unmodified, so the same token is emitted twice.
      This doubles the term frequency for such words and probably wastes space when positions/offsets are stored in the postings.
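
The behaviour described above can be illustrated with a minimal, library-free sketch. It does not use Lucene and is not the committed patch: the class `FoldingSketch`, its methods, and the three-entry folding table are hypothetical stand-ins chosen only to show why a token with non-ASCII characters that has no ASCII folding ends up emitted twice, and how comparing the folded output with the original avoids that.

```java
import java.util.ArrayList;
import java.util.List;

public class FoldingSketch {

    // Toy folding table (hypothetical): only three accented letters are mapped.
    static String fold(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e9': // e with acute accent
                case '\u00e8': // e with grave accent
                    out.append('e');
                    break;
                case '\u00e0': // a with grave accent
                    out.append('a');
                    break;
                default:
                    out.append(c); // no mapping: character is kept as-is
            }
        }
        return out.toString();
    }

    static boolean hasNonAscii(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (term.charAt(i) >= '\u0080') {
                return true;
            }
        }
        return false;
    }

    // Behaviour described in the report: the original is preserved whenever
    // a non-ASCII character is seen, whether or not folding changed anything.
    static List<String> emitNaive(String term) {
        List<String> tokens = new ArrayList<>();
        if (hasNonAscii(term)) {
            tokens.add(fold(term));
            tokens.add(term); // may be identical to the folded token
        } else {
            tokens.add(term);
        }
        return tokens;
    }

    // One possible guard: emit the original only if folding modified the term.
    static List<String> emitDeduplicated(String term) {
        List<String> tokens = new ArrayList<>();
        String folded = fold(term);
        tokens.add(folded);
        if (!folded.equals(term)) {
            tokens.add(term);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String unfoldable = "\u4e2d\u6587"; // non-ASCII, not in the folding table
        String foldable = "caf\u00e9";      // folds to "cafe"
        System.out.println(emitNaive(unfoldable).size());        // 2
        System.out.println(emitDeduplicated(unfoldable).size()); // 1
        System.out.println(emitDeduplicated(foldable).size());   // 2
    }
}
```

In the sketch, a term that really is folded still yields two tokens (folded form plus original), while a term that folding leaves untouched yields only one.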

            People

            • Assignee: Unassigned
            • Reporter: David Causse (dcausse)
            • Votes: 0
            • Watchers: 5
