Lucene - Core / LUCENE-7468

ASCIIFoldingFilter should not emit duplicated tokens when preserve original is on

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.7, 5.x, trunk, 6.x
    • Fix Version/s: 7.0, 6.3
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The ASCIIFoldingFilter seems to make the bold assumption that any token containing a char outside the ASCII range will be folded.
      The problem is that when preserve original is true, we capture and restore the state even if the token remains unmodified.
      This doubles term frequencies for such words and probably wastes extra space when positions/offsets are stored in the postings.
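
      The behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the Lucene API: `FoldingSketch`, `fold`, `emitBuggy`, `emitFixed` and the three-entry folding table are all hypothetical names invented for illustration, and the emission order is illustrative only. The idea shown is that the original token should be emitted only when folding actually changed the term.

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Map;

      // Hypothetical stand-in for the filter logic; not the Lucene implementation.
      final class FoldingSketch {
          // Tiny illustrative folding table (assumption); the real filter covers far more characters.
          private static final Map<Character, String> FOLDS =
                  Map.of('\u00e9', "e", '\u00fc', "u", '\u00df', "ss");

          // Replace every character that has a mapping; leave all others untouched.
          static String fold(String term) {
              StringBuilder out = new StringBuilder();
              for (char c : term.toCharArray()) {
                  String replacement = FOLDS.get(c);
                  out.append(replacement != null ? replacement : String.valueOf(c));
              }
              return out.toString();
          }

          static boolean hasNonAscii(String term) {
              for (char c : term.toCharArray()) {
                  if (c >= 0x80) {
                      return true;
                  }
              }
              return false;
          }

          // Reported behavior: any non-ASCII token is assumed to have been folded,
          // so the original is always emitted as well.
          static List<String> emitBuggy(String term) {
              List<String> out = new ArrayList<>();
              if (!hasNonAscii(term)) {
                  out.add(term);
                  return out;
              }
              out.add(fold(term));
              out.add(term); // emitted even when fold(term) equals term
              return out;
          }

          // Intended behavior: emit the original only when folding changed the term.
          static List<String> emitFixed(String term) {
              List<String> out = new ArrayList<>();
              String folded = fold(term);
              out.add(folded);
              if (!folded.equals(term)) {
                  out.add(term);
              }
              return out;
          }
      }
      ```

      With this sketch, a token made of characters that have no ASCII mapping (Greek letters, for example) comes out twice from `emitBuggy` and once from `emitFixed`, while a token such as "café" still yields both the folded and the original form.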

        Activity

        steve_rowe Steve Rowe added a comment -

        David Causse, couldn't you just use RemoveDuplicatesTokenFilter?

        erickerickson Erick Erickson added a comment -

        I think it's still a good point that emitting the same token twice isn't desired behavior here. Having to add the RemoveDuplicatesTokenFilter seems unnecessarily trappy, although it is a fine workaround to avoid waiting for a new release.

        FWIW

        dcausse David Causse added a comment -

        Yes, I plan to use a filter that removes duplicates to work around the issue. As for the patch itself, which fixes ASCIIFoldingFilter, I agree with Erick: it seems to me (after reading the test) that this behavior is not expected and that the preserve_original option was only meant to keep original tokens when they are actually modified.
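
        The workaround discussed in this thread, dropping a token that repeats a term already seen at the same position, can be sketched without Lucene as follows. `DedupSketch`, `Token` and `removeDuplicates` are hypothetical names used only for illustration; the sketch assumes a token stacked on the previous position carries a position increment of 0.

        ```java
        import java.util.ArrayList;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        // Hypothetical illustration of same-position de-duplication; not the Lucene class.
        final class DedupSketch {
            // Minimal token model (assumption): a term plus its position increment.
            static final class Token {
                final String term;
                final int posInc;

                Token(String term, int posInc) {
                    this.term = term;
                    this.posInc = posInc;
                }
            }

            static List<Token> removeDuplicates(List<Token> input) {
                List<Token> out = new ArrayList<>();
                Set<String> seenAtPosition = new HashSet<>();
                for (Token t : input) {
                    if (t.posInc > 0) {
                        // A new position starts: forget the terms seen at the previous one.
                        seenAtPosition.clear();
                    }
                    boolean duplicate = t.posInc == 0 && seenAtPosition.contains(t.term);
                    seenAtPosition.add(t.term);
                    if (!duplicate) {
                        out.add(t);
                    }
                }
                return out;
            }
        }
        ```

        This removes the stacked duplicate after the fact, whereas the patch avoids producing it in the first place.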

        jpountz Adrien Grand added a comment -

        David and Erick's comments make sense to me. I'll test the patch tomorrow and merge it if there are no objections by then.

        jira-bot ASF subversion and git services added a comment -

        Commit 739c0a7bf2c911e25ed40fb6717d9aed641a0a2f in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=739c0a7 ]

        LUCENE-7468: ASCIIFoldingFilter should not emit duplicated tokens when preserve original is on.

        jira-bot ASF subversion and git services added a comment -

        Commit 28d187acd1e391723eb6e1b5445f22abf5580a80 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=28d187a ]

        LUCENE-7468: ASCIIFoldingFilter should not emit duplicated tokens when preserve original is on.

        jpountz Adrien Grand added a comment -

        Merged. This change will be available in Lucene 6.3. Thanks David!

        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.3.0 release.


          People

          • Assignee:
            Unassigned
            Reporter:
            dcausse David Causse
          • Votes:
            0
            Watchers:
            5