Lucene - Core
  1. Lucene - Core
  2. LUCENE-3820

Wrong trailing index calculation in PatternReplaceCharFilter

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Reimplementation of PatternReplaceCharFilter to pass randomized tests (used to throw exceptions previously). Simplified code, dropped boundary characters, full input buffered for pattern matching.

      1. LUCENE-3820_test.patch
        9 kB
        Robert Muir
      2. LUCENE-3820_test.patch
        7 kB
        Robert Muir
      3. LUCENE-3820.patch
        23 kB
        Dawid Weiss
      4. LUCENE-3820.patch
        7 kB
        Dawid Weiss

        Activity

        Hide
        Michael McCandless added a comment -

        Thanks Dawid!

        Show
        Michael McCandless added a comment - Thanks Dawid!
        Hide
        Dawid Weiss added a comment -

        Yep, nice catch. That test case causes a beautiful exponential time pattern to be generated (I've added it as an @Ignored test . I limited the input size to 40 characters. With such input it should be possible to traverse the entire search space, even if it's exponential. I don't see a way to easily verify if a pattern is exponential or not (without resigning from certain types of patterns).

        Show
        Dawid Weiss added a comment - Yep, nice catch. That test case causes a beautiful exponential time pattern to be generated (I've added it as an @Ignored test . I limited the input size to 40 characters. With such input it should be possible to traverse the entire search space, even if it's exponential. I don't see a way to easily verify if a pattern is exponential or not (without resigning from certain types of patterns).
        Hide
        Michael McCandless added a comment -

        I've committed that thread-name-contains-seed thing.

        Awesome! Using this, I re-beasted and found this seed (was still going after 75 minutes when I killed it...):

        ant test -Dtestcase=TestPatternReplaceCharFilter -Dtestmethod=testRandomStrings -Dtests.seed=-27d641cb49b46a8e:-4b59e6886f1953b6:7d2fb14a457a628
        
        Show
        Michael McCandless added a comment - I've committed that thread-name-contains-seed thing. Awesome! Using this, I re-beasted and found this seed (was still going after 75 minutes when I killed it...): ant test -Dtestcase=TestPatternReplaceCharFilter -Dtestmethod=testRandomStrings -Dtests.seed=-27d641cb49b46a8e:-4b59e6886f1953b6:7d2fb14a457a628
        Hide
        Dawid Weiss added a comment -

        I've committed that thread-name-contains-seed thing. I've also tried to reproduce the long pattern but it's been running on my machine for a few minutes in a tight loop and all of them end below one second. Can you try to reproduce it again? I'm curious what's causing this. I'll add @Ignore once we know what the problem is.

        Show
        Dawid Weiss added a comment - I've committed that thread-name-contains-seed thing. I've also tried to reproduce the long pattern but it's been running on my machine for a few minutes in a tight loop and all of them end below one second. Can you try to reproduce it again? I'm curious what's causing this. I'll add @Ignore once we know what the problem is.
        Hide
        Michael McCandless added a comment -

        Yes, put the seed in your test running thread's name.

        Hmm maybe LTC can somehow do this for us...? EG maybe we make a utility method to launch a new thread... and it sets the name?

        Mike, resolve if this fix works for you.

        Well... I beasted w/ this change, but after 156 iterations it got a bad regexp again.

        Would be ashame to just @Ignore it though... can we somehow make the regexp generation less "evil"?

        Show
        Michael McCandless added a comment - Yes, put the seed in your test running thread's name. Hmm maybe LTC can somehow do this for us...? EG maybe we make a utility method to launch a new thread... and it sets the name? Mike, resolve if this fix works for you. Well... I beasted w/ this change, but after 156 iterations it got a bad regexp again. Would be ashame to just @Ignore it though... can we somehow make the regexp generation less "evil"?
        Hide
        Dawid Weiss added a comment -

        Mike, resolve if this fix works for you.

        Show
        Dawid Weiss added a comment - Mike, resolve if this fix works for you.
        Hide
        Dawid Weiss added a comment -

        I've added a time limit of 2 seconds. Seems to do the trick for me.

        can we somehow tap into kill -QUIT so that in addition to the JRE printing stack traces for all threads, it also prints our seed?)

        Yes, put the seed in your test running thread's name.

        Show
        Dawid Weiss added a comment - I've added a time limit of 2 seconds. Seems to do the trick for me. can we somehow tap into kill -QUIT so that in addition to the JRE printing stack traces for all threads, it also prints our seed?) Yes, put the seed in your test running thread's name.
        Hide
        Dawid Weiss added a comment -

        That's Robert's evil piece of random pattern generation... I told him this would stress the pattern engine more than our code and this seems to be the case The above seems like a heavy backtracking/ recursion pattern, something Russ Cox was writing about (http://swtch.com/~rsc/regexp/regexp1.html).

        I'd just add @Ignore to this test? Alternatively, I'd make it absolute-time bound and simply terminate after, say, 3 seconds.

        Show
        Dawid Weiss added a comment - That's Robert's evil piece of random pattern generation... I told him this would stress the pattern engine more than our code and this seems to be the case The above seems like a heavy backtracking/ recursion pattern, something Russ Cox was writing about ( http://swtch.com/~rsc/regexp/regexp1.html ). I'd just add @Ignore to this test? Alternatively, I'd make it absolute-time bound and simply terminate after, say, 3 seconds.
        Hide
        Michael McCandless added a comment -

        Reopening to somehow fix the too-evil regexp...

        Show
        Michael McCandless added a comment - Reopening to somehow fix the too-evil regexp...
        Hide
        Michael McCandless added a comment -

        I'm seeing TestPatternReplaceCharFilter.testRandomData sometimes take a REALLY long time (longer than I can wait ).

        The test is not hung (CPU is pegged).

        I don't have a seed (can we somehow tap into kill -QUIT so that in addition to the JRE printing stack traces for all threads, it also prints our seed?)... but here's the stack trace:

        "main" prio=10 tid=0x00007fbc64008000 nid=0x4095 runnable [0x00007fbc69b32000]
           java.lang.Thread.State: RUNNABLE
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4304)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
        	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
        	at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        	at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
        	at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4311)
        	at java.util.regex.Pattern$Prolog.match(Pattern.java:4251)
        	at java.util.regex.Pattern$Start.match(Pattern.java:3055)
        	at java.util.regex.Matcher.search(Matcher.java:1105)
        	at java.util.regex.Matcher.find(Matcher.java:535)
        	at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.processPattern(PatternReplaceCharFilter.java:100)
        	at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:80)
        	at java.io.Reader.read(Reader.java:104)
        	at org.apache.lucene.analysis.MockTokenizer.readCodePoint(MockTokenizer.java:138)
        	at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:105)
        	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:346)
        	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:304)
        	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:259)
        	at org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings(TestPatternReplaceCharFilter.java:297)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
        	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
        	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
        	at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:707)
        	at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:606)
        	at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:511)
        	at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:569)
        	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
        	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
        	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
        	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
        	at org.junit.runners.Suite.runChild(Suite.java:128)
        	at org.junit.runners.Suite.runChild(Suite.java:24)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
        	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98)
        	at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53)
        	at org.junit.runner.JUnitCore.main(JUnitCore.java:45)
        

        I guess the random regexp (_TestUtil.randomRegexpishString(random)) is sometimes too evil...?

        Show
        Michael McCandless added a comment - I'm seeing TestPatternReplaceCharFilter.testRandomData sometimes take a REALLY long time (longer than I can wait ). The test is not hung (CPU is pegged). I don't have a seed (can we somehow tap into kill -QUIT so that in addition to the JRE printing stack traces for all threads, it also prints our seed?)... but here's the stack trace: "main" prio=10 tid=0x00007fbc64008000 nid=0x4095 runnable [0x00007fbc69b32000] java.lang.Thread.State: RUNNABLE at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$Loop.match(Pattern.java:4304) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$Curly.match0(Pattern.java:3782) at java.util.regex.Pattern$Curly.match(Pattern.java:3744) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4311) at java.util.regex.Pattern$Prolog.match(Pattern.java:4251) at java.util.regex.Pattern$Start.match(Pattern.java:3055) at java.util.regex.Matcher.search(Matcher.java:1105) at java.util.regex.Matcher.find(Matcher.java:535) at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.processPattern(PatternReplaceCharFilter.java:100) at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:80) at java.io.Reader.read(Reader.java:104) at org.apache.lucene.analysis.MockTokenizer.readCodePoint(MockTokenizer.java:138) at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:105) at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:346) at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:304) at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:259) at org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings(TestPatternReplaceCharFilter.java:297) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:707) at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:606) at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:511) at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:569) at org.junit.rules.RunRules.evaluate(RunRules.java:18) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:24) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at org.junit.runner.JUnitCore.run(JUnitCore.java:136) at org.junit.runner.JUnitCore.run(JUnitCore.java:117) at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98) at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53) at org.junit.runner.JUnitCore.main(JUnitCore.java:45) I guess the random regexp ( _TestUtil.randomRegexpishString(random) ) is sometimes too evil...?
        Hide
        Dawid Weiss added a comment -

        I'll put this on hold and commit in a few days if nobody has anything against.

        Show
        Dawid Weiss added a comment - I'll put this on hold and commit in a few days if nobody has anything against.
        Hide
        Robert Muir added a comment -

        OK, after reviewing in detail...

        +1 I like the cleanup here!

        I don't have an opinion / don't know the use case behind block delimiters, but this is a
        small, clean, elegant implementation that doesnt cause offsets problems (from my futzing
        around there seemed to be at least 2 bugs, one at the beginning and one at the end?)

        Show
        Robert Muir added a comment - OK, after reviewing in detail... +1 I like the cleanup here! I don't have an opinion / don't know the use case behind block delimiters, but this is a small, clean, elegant implementation that doesnt cause offsets problems (from my futzing around there seemed to be at least 2 bugs, one at the beginning and one at the end?)
        Hide
        Dawid Weiss added a comment -

        The patch I attached works around this by pushing negative offsets to zero. Works for me.

        Show
        Dawid Weiss added a comment - The patch I attached works around this by pushing negative offsets to zero. Works for me.
        Hide
        Robert Muir added a comment -

        I'll go back to this later today, but I can tell you right now that from my paper considerations negative indexes make logical sense

        We cannot do this... this is the offset (character position in the reader).

        Offsets can never be negative.

        Show
        Robert Muir added a comment - I'll go back to this later today, but I can tell you right now that from my paper considerations negative indexes make logical sense We cannot do this... this is the offset (character position in the reader). Offsets can never be negative.
        Hide
        Dawid Weiss added a comment -

        A simplifying patch that includes Robert's random tests (and passes).

        I've made a deliberate decision to deprecate and not use block delimiters and block processing. If you think this is a backwards no-no then feel free to correct this patch... I think block processing may be worth dropping given the code clarity without it.

        Show
        Dawid Weiss added a comment - A simplifying patch that includes Robert's random tests (and passes). I've made a deliberate decision to deprecate and not use block delimiters and block processing. If you think this is a backwards no-no then feel free to correct this patch... I think block processing may be worth dropping given the code clarity without it.
        Hide
        Dawid Weiss added a comment -

        Unrelated to this change: we should be using StringBuilder here.

        if you mean the pattern matching bit then Matcher simply doesn't accept a StringBuilder...

        Show
        Dawid Weiss added a comment - Unrelated to this change: we should be using StringBuilder here. if you mean the pattern matching bit then Matcher simply doesn't accept a StringBuilder...
        Hide
        Dawid Weiss added a comment -

        Thanks for looking at this, Robert. I'll go back to this later today, but I can tell you right now that from my paper considerations negative indexes make logical sense in case of "prepended" characters. So:

        PATTERN: A
        INPUT: ABCDEF
        REPLACEMENT: XYZ
        OUTPUT:XYZBCDEF

        then (in my patch) X and Y would have negative offsets. It's a matter of agreement I guess. Negative indexes are consistent with something like this:

        PATTERN: ^
        INPUT: ABC
        REPLACEMENT: XYZ
        OUTPUT:XYZABC

        then all three characters (XYZ) have a negative index to indicate they're prepended. Thoughts?

        Show
        Dawid Weiss added a comment - Thanks for looking at this, Robert. I'll go back to this later today, but I can tell you right now that from my paper considerations negative indexes make logical sense in case of "prepended" characters. So: PATTERN: A INPUT: ABCDEF REPLACEMENT: XYZ OUTPUT:XYZBCDEF then (in my patch) X and Y would have negative offsets. It's a matter of agreement I guess. Negative indexes are consistent with something like this: PATTERN: ^ INPUT: ABC REPLACEMENT: XYZ OUTPUT:XYZABC then all three characters (XYZ) have a negative index to indicate they're prepended. Thoughts?
        Hide
        Robert Muir added a comment -

        updated patch, this tests only ascii (to avoid stupid problems in outdated regex support).

        But there are a lot of offset problems (perhaps this corresponds to the warning in the class's javadocs?), including things like offsets being corrected to negative numbers...

        Show
        Robert Muir added a comment - updated patch, this tests only ascii (to avoid stupid problems in outdated regex support). But there are a lot of offset problems (perhaps this corresponds to the warning in the class's javadocs?), including things like offsets being corrected to negative numbers...
        Hide
        Robert Muir added a comment -

        to fix dawid's problem we can probably modify this test only for ascii, i suspect the unicode "problems"
        are going to be impossible to fix given java's regex library (i think it does not treat "." as codepoint
        but code unit). I'll take another stab at that just to tackle the offsets issue he is seeing.

        Show
        Robert Muir added a comment - to fix dawid's problem we can probably modify this test only for ascii, i suspect the unicode "problems" are going to be impossible to fix given java's regex library (i think it does not treat "." as codepoint but code unit). I'll take another stab at that just to tackle the offsets issue he is seeing.
        Hide
        Robert Muir added a comment -

        Here's a simple random test showing some existing bugs in the filter...

        • there are offsets problems as dawid notices...
        • blockbuffer should always oversize by 1 character, if a block ends on a high surrogate (rare) it should do one additional read() so it doesnt create invalid unicode
        Show
        Robert Muir added a comment - Here's a simple random test showing some existing bugs in the filter... there are offsets problems as dawid notices... blockbuffer should always oversize by 1 character, if a block ends on a high surrogate (rare) it should do one additional read() so it doesnt create invalid unicode
        Hide
        Robert Muir added a comment -

        Unrelated to this change: we should be using StringBuilder here.

        Show
        Robert Muir added a comment - Unrelated to this change: we should be using StringBuilder here.
        Hide
        Dawid Weiss added a comment -

        A patch with reimplementation of getReplaceBlock and a test case that is failing with AIOOB (apply test changes without modifying PatternReplaceCharFilter to get the error).

        Show
        Dawid Weiss added a comment - A patch with reimplementation of getReplaceBlock and a test case that is failing with AIOOB (apply test changes without modifying PatternReplaceCharFilter to get the error).

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development