Lucene - Core
  1. Lucene - Core
  2. LUCENE-3913

HTMLStripCharFilter produces invalid final offset

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

      1. LUCENE-3913.patch
        18 kB
        Michael McCandless
      2. LUCENE-3913.patch
        60 kB
        Steve Rowe

        Activity

        Hide
        Robert Muir added a comment -

        thanks! let's just hope mike doesn't notice

        -reproducibility policeman

        Show
        Robert Muir added a comment - thanks! let's just hope mike doesn't notice -reproducibility policeman
        Hide
        Steve Rowe added a comment -

        I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales

        Hmm, I didn't think of the reproducibility angle. I'll fix.

        Committed to trunk and branch_3x.

        Thanks again Robert!

        Show
        Steve Rowe added a comment - I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales Hmm, I didn't think of the reproducibility angle. I'll fix. Committed to trunk and branch_3x. Thanks again Robert!
        Hide
        Steve Rowe added a comment -

        I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales

        Hmm, I didn't think of the reproducibility angle. I'll fix.

        Show
        Steve Rowe added a comment - I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales Hmm, I didn't think of the reproducibility angle. I'll fix.
        Hide
        Robert Muir added a comment -

        I don't think it will be interesting,
        instead just make seeds less reproducible
        across java 6 and 7 or other jre impls
        with different # of locales

        Show
        Robert Muir added a comment - I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales
        Hide
        Steve Rowe added a comment -

        Committed to branch_3x and trunk.

        Thanks Mike and Robert!

        Show
        Steve Rowe added a comment - Committed to branch_3x and trunk. Thanks Mike and Robert!
        Hide
        Steve Rowe added a comment -

        I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH?

        Yeah, I had that in at first, but then I thought it might be useful to use the randomized locale to trigger "interesting things". That's what we want, right?

        Show
        Steve Rowe added a comment - I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH? Yeah, I had that in at first, but then I thought it might be useful to use the randomized locale to trigger "interesting things". That's what we want, right?
        Hide
        Robert Muir added a comment -

        Thank you!

        I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH?

        Otherwise you will find this gives interesting things for <script> in the Turkish Locale

        Show
        Robert Muir added a comment - Thank you! I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH? Otherwise you will find this gives interesting things for <script> in the Turkish Locale
        Hide
        Michael McCandless added a comment -

        Awesome, thanks Steve!

        Show
        Michael McCandless added a comment - Awesome, thanks Steve!
        Hide
        Steve Rowe added a comment -

        ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null ...

        Actually, I ran it without -Dtestmethod=null.

        Show
        Steve Rowe added a comment - ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null ... Actually, I ran it without -Dtestmethod=null .
        Hide
        Steve Rowe added a comment - - edited

        Patch, a superset of Mike's:

        • fixes the identified problem: </br> offset was improperly calculated. (Added comments describing the offset calculations everywhere they're performed in the .jflex source.)
        • adds a new case emitting <\s*(/\s*)?(br|script|style)>? to _TestUtil.randomHtmlishString(), because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter.
        • adds a new method _TestUtil.randomlyRecaseCodePoints(), used by the above-mentioned new randomHtmlishString() case, to produce things like <Br>, </sCriPT>, etc.
        • switches HTMLStripCharFilterTest.testRandomBrokenHTML() to use Mike's new BaseTokenStreamTestCase.checkAnalysisConsistency().
        • fixes the Jenkins test failure of HTMLStripCharFilterTest.testRandomHugeStrings() at https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/:
        ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1"
        

        Committing shortly.

        Show
        Steve Rowe added a comment - - edited Patch, a superset of Mike's: fixes the identified problem: </br> offset was improperly calculated. (Added comments describing the offset calculations everywhere they're performed in the .jflex source.) adds a new case emitting <\s*(/\s*)?(br|script|style)>? to _TestUtil.randomHtmlishString() , because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter. adds a new method _TestUtil.randomlyRecaseCodePoints() , used by the above-mentioned new randomHtmlishString() case, to produce things like <Br> , </sCriPT> , etc. switches HTMLStripCharFilterTest.testRandomBrokenHTML() to use Mike's new BaseTokenStreamTestCase.checkAnalysisConsistency() . fixes the Jenkins test failure of HTMLStripCharFilterTest.testRandomHugeStrings() at https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/ : ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1" Committing shortly.
        Hide
        Steve Rowe added a comment -

        Since we know those are special, a good idea for the future could be to add both those elements to randomHtmlishString

        +1

        Show
        Steve Rowe added a comment - Since we know those are special, a good idea for the future could be to add both those elements to randomHtmlishString +1
        Hide
        Robert Muir added a comment -

        Since we know those are special, a good idea for the future could be to add both
        those elements to randomHtmlishString

        Show
        Robert Muir added a comment - Since we know those are special, a good idea for the future could be to add both those elements to randomHtmlishString
        Hide
        Steve Rowe added a comment -

        Minimal test failure trigger appears to be "x</br>", where "x" can be any non-whitespace character, and "</br>" must be "</br>". (No problems with "x<br>", "x</a>", etc.)

        <br> and </br> are handled specially, so this should narrow it down.

        Show
        Steve Rowe added a comment - Minimal test failure trigger appears to be "x</br>", where "x" can be any non-whitespace character, and "</br>" must be "</br>". (No problems with "x<br>", "x</a>", etc.) <br> and </br> are handled specially, so this should narrow it down.
        Hide
        Steve Rowe added a comment -

        I can reproduce - I'm digging.

        Show
        Steve Rowe added a comment - I can reproduce - I'm digging.
        Hide
        Michael McCandless added a comment -

        Good idea! I'll fix that test case.

        Here's the failure output:

            [junit] ------------- Standard Error -----------------
            [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=testOddHTMLString -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8"
            [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8"
            [junit] NOTE: test params are: locale=zh_SG, timezone=Europe/Minsk
            [junit] NOTE: all tests run in this JVM:
            [junit] [HTMLStripCharFilterTest]
            [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=163214064,total=189988864
            [junit] ------------- ---------------- ---------------
            [junit] Testcase: testOddHTMLString(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest):	FAILED
            [junit] finalOffset  expected:<20> but was:<19>
            [junit] junit.framework.AssertionFailedError: finalOffset  expected:<20> but was:<19>
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$3.addError(JUnitTestRunner.java:975)
            [junit] 	at junit.framework.TestResult.addError(TestResult.java:38)
            [junit] 	at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
            [junit] 	at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:100)
            [junit] 	at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:41)
            [junit] 	at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:97)
            [junit] 	at org.junit.internal.runners.model.EachTestNotifier.addFailure(EachTestNotifier.java:26)
            [junit] 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:267)
            [junit] 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
            [junit] 	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:146)
            [junit] 	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
            [junit] 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
            [junit] 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
            [junit] 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
            [junit] 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
            [junit] 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
            [junit] 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
            [junit] 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
            [junit] 	at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74)
            [junit] 	at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:36)
            [junit] 	at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67)
            [junit] 	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
            [junit] 	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
            [junit] 	at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
            [junit] Caused by: java.lang.AssertionError: finalOffset  expected:<20> but was:<19>
            [junit] 	at org.junit.Assert.fail(Assert.java:93)
            [junit] 	at org.junit.Assert.failNotEquals(Assert.java:647)
            [junit] 	at org.junit.Assert.assertEquals(Assert.java:128)
            [junit] 	at org.junit.Assert.assertEquals(Assert.java:472)
            [junit] 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:182)
            [junit] 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:574)
            [junit] 	at org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testOddHTMLString(HTMLStripCharFilterTest.java:550)
            [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            [junit] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            [junit] 	at java.lang.reflect.Method.invoke(Method.java:597)
            [junit] 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
            [junit] 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
            [junit] 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
            [junit] 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
            [junit] 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
            [junit] 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
            [junit] 	at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:636)
            [junit] 	at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:542)
            [junit] 	at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67)
            [junit] 	at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:458)
            [junit] 	at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74)
            [junit] 	at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:516)
            [junit] 	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
            [junit] 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
            [junit] 	... 19 more
            [junit] 
        

        Note that instead of -Dtests.testmethod=null you should pass -Dtests.testmethod=testOddHTMLString.

        Show
        Michael McCandless added a comment - Good idea! I'll fix that test case. Here's the failure output: [junit] ------------- Standard Error ----------------- [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=testOddHTMLString -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8" [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8" [junit] NOTE: test params are: locale=zh_SG, timezone=Europe/Minsk [junit] NOTE: all tests run in this JVM: [junit] [HTMLStripCharFilterTest] [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=163214064,total=189988864 [junit] ------------- ---------------- --------------- [junit] Testcase: testOddHTMLString(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest): FAILED [junit] finalOffset expected:<20> but was:<19> [junit] junit.framework.AssertionFailedError: finalOffset expected:<20> but was:<19> [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$3.addError(JUnitTestRunner.java:975) [junit] at junit.framework.TestResult.addError(TestResult.java:38) [junit] at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51) [junit] at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:100) [junit] at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:41) [junit] at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:97) [junit] at org.junit.internal.runners.model.EachTestNotifier.addFailure(EachTestNotifier.java:26) [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:267) [junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:146) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50) [junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) [junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) [junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) [junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) [junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74) [junit] at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:36) [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67) [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18) [junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:300) [junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768) [junit] Caused by: java.lang.AssertionError: finalOffset expected:<20> but was:<19> [junit] at org.junit.Assert.fail(Assert.java:93) [junit] at org.junit.Assert.failNotEquals(Assert.java:647) [junit] at org.junit.Assert.assertEquals(Assert.java:128) [junit] at org.junit.Assert.assertEquals(Assert.java:472) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:182) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:574) [junit] at org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testOddHTMLString(HTMLStripCharFilterTest.java:550) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] at java.lang.reflect.Method.invoke(Method.java:597) [junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) [junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) [junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) [junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) [junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:636) [junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:542) [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67) [junit] at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:458) [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74) [junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:516) [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18) [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) [junit] ... 19 more [junit] Note that instead of -Dtests.testmethod=null you should pass -Dtests.testmethod=testOddHTMLString.
        Hide
        Robert Muir added a comment -

        I like the refactored method: though I think we should also call it from
        HTMLStripCharFilterTest.testRandomBrokenHTML?
        Currently this only does:

        while (reader.read() != -1);
        
        Show
        Robert Muir added a comment - I like the refactored method: though I think we should also call it from HTMLStripCharFilterTest.testRandomBrokenHTML? Currently this only does: while (reader.read() != -1);
        Hide
        Michael McCandless added a comment -

        I forgot to say: patch is against 3.x.

        Show
        Michael McCandless added a comment - I forgot to say: patch is against 3.x.
        Hide
        Michael McCandless added a comment -

        Patch showing the failure....

        It happens on input " Secretary)</br> [[M", in case anyone can see something obviously interesting

        I have no idea where the bug is...

        Show
        Michael McCandless added a comment - Patch showing the failure.... It happens on input " Secretary)</br> [[M", in case anyone can see something obviously interesting I have no idea where the bug is...

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development