Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6914

DecimalDigitFilter skips characters in some cases (supplemental?)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.4
    • Fix Version/s: 7.0, 6.3, 5.5.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Found this while writing up the solr ref guide for DecimalDigitFilter.

      With input like "𝟙𝟡𝟠𝟜" ("Double Struck" 1984) the filter produces "1𝟡8𝟜" (1, double struck 9, 8, double struck 4) add some non-decimal characters in between the digits (ie: "𝟙x𝟡x𝟠x𝟜") and you get the expected output ("1x9x8x4"). This doesn't affect all decimal characters though, as evident by the existing test cases.

      Perhaps this is an off by one bug in the "if the original was supplementary, shrink the string" code path?

      1. LUCENE-6914.patch
        8 kB
        Hoss Man
      2. LUCENE-6914.patch
        4 kB
        Hoss Man
      3. LUCENE-6914.patch
        0.9 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          hossman Hoss Man added a comment -

          failure produced by attached test patch...

             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestDecimalDigitFilter -Dtests.method=testDoubleStruck -Dtests.seed=3126DECB8CE805E -Dtests.slow=true -Dtests.locale=ga -Dtests.timezone=Africa/Juba -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
             [junit4] FAILURE 0.03s | TestDecimalDigitFilter.testDoubleStruck <<<
             [junit4]    > Throwable #1: org.junit.ComparisonFailure: term 0 expected:<1[984]> but was:<1[𝟡8𝟜]>
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([3126DECB8CE805E:92961DD9D4C68E38]:0)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:186)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:301)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:305)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:309)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo(BaseTokenStreamTestCase.java:359)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo(BaseTokenStreamTestCase.java:368)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkOneTerm(BaseTokenStreamTestCase.java:429)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestDecimalDigitFilter.testDoubleStruck(TestDecimalDigitFilter.java:74)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          
          Show
          hossman Hoss Man added a comment - failure produced by attached test patch... [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestDecimalDigitFilter -Dtests.method=testDoubleStruck -Dtests.seed=3126DECB8CE805E -Dtests.slow=true -Dtests.locale=ga -Dtests.timezone=Africa/Juba -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] FAILURE 0.03s | TestDecimalDigitFilter.testDoubleStruck <<< [junit4] > Throwable #1: org.junit.ComparisonFailure: term 0 expected:<1[984]> but was:<1[𝟡8𝟜]> [junit4] > at __randomizedtesting.SeedInfo.seed([3126DECB8CE805E:92961DD9D4C68E38]:0) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:186) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:301) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:305) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:309) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo(BaseTokenStreamTestCase.java:359) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo(BaseTokenStreamTestCase.java:368) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkOneTerm(BaseTokenStreamTestCase.java:429) [junit4] > at org.apache.lucene.analysis.core.TestDecimalDigitFilter.testDoubleStruck(TestDecimalDigitFilter.java:74) [junit4] > at java.lang.Thread.run(Thread.java:745)
          Hide
          hossman Hoss Man added a comment -

          updated patch with beefed up randomized testing to reproduce the problem that way (either that or some other problem that looks similar to my naked eye)

          Then I took a shot in the dark at a fix to the call to StemmerUtil.delete and that seems to make the beast happy.

          Since i'm way out of my depth here i don't intend on commiting w/o explicit feedback from someone who understands this code. (i'm mainly worried i may have introduced some other equally bad bug w/o realizing it).

          Anybody who understands this and thinks my patch looks good is welcome to run with it for 5.4, no need to wait for me.

          Show
          hossman Hoss Man added a comment - updated patch with beefed up randomized testing to reproduce the problem that way (either that or some other problem that looks similar to my naked eye) Then I took a shot in the dark at a fix to the call to StemmerUtil.delete and that seems to make the beast happy. Since i'm way out of my depth here i don't intend on commiting w/o explicit feedback from someone who understands this code. (i'm mainly worried i may have introduced some other equally bad bug w/o realizing it). Anybody who understands this and thinks my patch looks good is welcome to run with it for 5.4, no need to wait for me.
          Hide
          hossman Hoss Man added a comment -

          the redundancy in the two randomized tests (and wasteful int[] I introduced) was bugging me, so I refactored some logic into determininghte set of all decimal digits for reuse in both tests.

          Show
          hossman Hoss Man added a comment - the redundancy in the two randomized tests (and wasteful int[] I introduced) was bugging me, so I refactored some logic into determininghte set of all decimal digits for reuse in both tests.
          Hide
          rcmuir Robert Muir added a comment -

          looks buggy indeed. Can we also add a simple test (𝟙x𝟡x𝟠x𝟜 -> 1x9x8x4) ?

          Show
          rcmuir Robert Muir added a comment - looks buggy indeed. Can we also add a simple test (𝟙x𝟡x𝟠x𝟜 -> 1x9x8x4) ?
          Hide
          rcmuir Robert Muir added a comment -

          or maybe the doubleStruck test is good enough? i guess the only oddity is, it actually has whitespace in between the characters, but at a glance its hard to tell its not e.g. full-width digits. Maybe we should just replace the spaces with x's.

          Show
          rcmuir Robert Muir added a comment - or maybe the doubleStruck test is good enough? i guess the only oddity is, it actually has whitespace in between the characters, but at a glance its hard to tell its not e.g. full-width digits. Maybe we should just replace the spaces with x's.
          Hide
          hossman Hoss Man added a comment - - edited

          i wrote it thinking it would be helpful to first demonstrate what worked (x, or space, or some non candidate char in between digits as a "gap"), but then when you replace those "gaps" when empty space strings it breaks. now that we know the problem and can trivially reproduce with randomized tests, i'm not sure it's really relevant – but you could always just clone those asserts and do a bunch of different varieties for the "gap" (single space, x, some wide supplemental character that isn't a digit, etc...)

          Show
          hossman Hoss Man added a comment - - edited i wrote it thinking it would be helpful to first demonstrate what worked (x, or space, or some non candidate char in between digits as a "gap"), but then when you replace those "gaps" when empty space strings it breaks. now that we know the problem and can trivially reproduce with randomized tests, i'm not sure it's really relevant – but you could always just clone those asserts and do a bunch of different varieties for the "gap" (single space, x, some wide supplemental character that isn't a digit, etc...)
          Hide
          jpountz Adrien Grand added a comment -

          +1 to the patch

          Show
          jpountz Adrien Grand added a comment - +1 to the patch
          Hide
          jpountz Adrien Grand added a comment -

          now that we know the problem and can trivially reproduce with randomized tests, i'm not sure it's really relevant

          I find explicit tests helpful because they are easier to dig into if they fail, so obvious problems can be more easily identified/fixed.

          I think we should commit this patch, this filter is used in several of our language analyzers so this bug might be hitting quite a lot of users?

          Show
          jpountz Adrien Grand added a comment - now that we know the problem and can trivially reproduce with randomized tests, i'm not sure it's really relevant I find explicit tests helpful because they are easier to dig into if they fail, so obvious problems can be more easily identified/fixed. I think we should commit this patch, this filter is used in several of our language analyzers so this bug might be hitting quite a lot of users?
          Hide
          jpountz Adrien Grand added a comment -

          Hoss Man Let's commit this patch?

          Show
          jpountz Adrien Grand added a comment - Hoss Man Let's commit this patch?
          Hide
          hossman Hoss Man added a comment -

          go for it Adrien Grand ... like i said before...

          Since i'm way out of my depth here i don't intend on commiting ... Anybody who understands this and thinks my patch looks good is welcome to run with it ... no need to wait for me.

          Show
          hossman Hoss Man added a comment - go for it Adrien Grand ... like i said before... Since i'm way out of my depth here i don't intend on commiting ... Anybody who understands this and thinks my patch looks good is welcome to run with it ... no need to wait for me.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 6df4b4d7284e649bc92fc49cb92e0b2efd0fdc2c in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6df4b4d ]

          LUCENE-6914: Fix issue ID in the change log.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 6df4b4d7284e649bc92fc49cb92e0b2efd0fdc2c in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6df4b4d ] LUCENE-6914 : Fix issue ID in the change log.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4498cc727f5a2f2db5ea8683b36a821b9b529ebb in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4498cc7 ]

          LUCENE-6914: Fix issue ID in the change log.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4498cc727f5a2f2db5ea8683b36a821b9b529ebb in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4498cc7 ] LUCENE-6914 : Fix issue ID in the change log.
          Hide
          jpountz Adrien Grand added a comment -

          Thanks Hoss Man.

          Show
          jpountz Adrien Grand added a comment - Thanks Hoss Man .
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.
          Hide
          jpountz Adrien Grand added a comment -

          Reopen for backport to 5.5.4.

          Show
          jpountz Adrien Grand added a comment - Reopen for backport to 5.5.4.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4355335153bacdcf3d81052e07164f2158d5e587 in lucene-solr's branch refs/heads/branch_5_5 from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4355335 ]

          LUCENE-6914: Fix ID in the change log.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4355335153bacdcf3d81052e07164f2158d5e587 in lucene-solr's branch refs/heads/branch_5_5 from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4355335 ] LUCENE-6914 : Fix ID in the change log.

            People

            • Assignee:
              Unassigned
              Reporter:
              hossman Hoss Man
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development