Lucene - Core
  1. Lucene - Core
  2. LUCENE-4078

PatternReplaceCharFilter assertion error

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Build: https://builds.apache.org/job/Lucene-trunk/1942/

      1 tests failed.
      REGRESSION: org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings

      Error Message:

      Stack Trace:
      java.lang.AssertionError
      at __randomizedtesting.SeedInfo.seed([8E91A6AC395FEED9:618A6129A5BB9EC]:0)
      at org.apache.lucene.analysis.MockTokenizer.readCodePoint(MockTokenizer.java:153)
      at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:123)
      at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:558)
      at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:488)
      at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:430)
      at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:424)
      at org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings(TestPatternReplaceCharFilter.java:323)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:616)
      at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1969)
      at com.carrotsearch.randomizedtesting.RandomizedRunner.access$1100(RandomizedRunner.java:132)
      at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:814)
      at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:875)
      at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:889)
      at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
      at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:32)
      at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
      at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
      at org.apache.lucene.util.TestRuleReportUncaughtExceptions$1.evaluate(TestRuleReportUncaughtExceptions.java:68)
      at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
      at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
      at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(Randomized

        Activity

        Hide
        Dawid Weiss added a comment -

        This bug is caused by Robert's evil random Pattern instances. I think it qualifies as a bug in the JDK's implementation... but then, I'm not sure. Here it is – the Pattern:

        Pattern p = Pattern.compile("]|");
        

        For those of you who wondered, this also is a valid pattern:

        Pattern p = Pattern.compile("|");
        

        What should these match? I have no clue. Are they even valid? I have no clue. Anyway, what happens is that surrogate pairs (and everything else) is divided, so:

            String s1 = "AB\uD840\uDC00C";
            String s2 = s1.replaceAll("]|", "xyz");
            System.out.println(s1);
            System.out.println(s2);
        

        results in:

        AB𠀀C
        xyzAxyzBxyz?xyz?xyzCxyz
        

        and an assertion is righteously thrown saying a surrogate pair is broken.

        Don't know what to do with this. Robert?

        Show
        Dawid Weiss added a comment - This bug is caused by Robert's evil random Pattern instances. I think it qualifies as a bug in the JDK's implementation... but then, I'm not sure. Here it is – the Pattern: Pattern p = Pattern.compile( "]|" ); For those of you who wondered, this also is a valid pattern: Pattern p = Pattern.compile( "|" ); What should these match? I have no clue. Are they even valid? I have no clue. Anyway, what happens is that surrogate pairs (and everything else) is divided, so: String s1 = "AB\uD840\uDC00C" ; String s2 = s1.replaceAll( "]|" , "xyz" ); System .out.println(s1); System .out.println(s2); results in: AB𠀀C xyzAxyzBxyz?xyz?xyzCxyz and an assertion is righteously thrown saying a surrogate pair is broken. Don't know what to do with this. Robert?
        Hide
        Dawid Weiss added a comment -

        I just checked perl... these do seem to be valid regexps! (they work in perl...) I don't know how to interpret them though.

        Show
        Dawid Weiss added a comment - I just checked perl... these do seem to be valid regexps! (they work in perl...) I don't know how to interpret them though.
        Hide
        Dawid Weiss added a comment -

        More insanity:

            String s1 = "AB\uD840\uDC00C";
            String s2 = s1.replaceAll("", "xyz");
        
            String s1 = "AB\uD840\uDC00C";
            String s2 = s1.replaceAll("|||||||||||||||||||||||||||||||", "xyz");
        

        These are equivalent.

        When you think of it '|' is not an operator, it is an alternative that simply splits (the automaton) into two conditional paths. Since preconditions and post-conditions are empty, these paths converge into an empty state. Which matches on every character. Which seems to be a bug because it matches on every char[] instead of codepoint ('.' matches each codepoint).

        Don't know what to make of it.

        Show
        Dawid Weiss added a comment - More insanity: String s1 = "AB\uD840\uDC00C" ; String s2 = s1.replaceAll( "", " xyz"); String s1 = "AB\uD840\uDC00C" ; String s2 = s1.replaceAll( "|||||||||||||||||||||||||||||||" , "xyz" ); These are equivalent. When you think of it '|' is not an operator, it is an alternative that simply splits (the automaton) into two conditional paths. Since preconditions and post-conditions are empty, these paths converge into an empty state. Which matches on every character. Which seems to be a bug because it matches on every char[] instead of codepoint ('.' matches each codepoint). Don't know what to make of it.
        Hide
        Hoss Man added a comment - - edited

        When you think of it '|' is not an operator ...

        I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string".

        Consider in particular when the regex is used for a replace or split type operation: "z|" means "match z or the empty string"...

        $ perl -MData::Dumper -le 'print Dumper split /z|/, "ABzCD";'
        $VAR1 = 'A';
        $VAR2 = 'B';
        $VAR3 = 'C';
        $VAR4 = 'D';
        

        You should see similar results in java with if you use Matcher.find() as an iterator.

        As for the original pattern: "]|" – that's just a convince form of (imposible to write this in jira markup w/o code tags)...

        \]|

        .. an unescaped close bracket (that has no matching open bracket) is treated as a literal...

        $ perl -MData::Dumper -le 'print Dumper split /]|/, "AB]CD";'
        $VAR1 = 'A';
        $VAR2 = 'B';
        $VAR3 = 'C';
        $VAR4 = 'D';
        

        What i can't explain, is why java treats "empty string" as something that matches in the middle of a code point. that certainly sounds like bug, unless there is some subtlety in Unicode TR#18 that i'm not seeing...

        http://www.unicode.org/reports/tr18/

        Show
        Hoss Man added a comment - - edited When you think of it '|' is not an operator ... I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string". Consider in particular when the regex is used for a replace or split type operation: "z|" means "match z or the empty string"... $ perl -MData::Dumper -le 'print Dumper split /z|/, "ABzCD" ;' $VAR1 = 'A'; $VAR2 = 'B'; $VAR3 = 'C'; $VAR4 = 'D'; You should see similar results in java with if you use Matcher.find() as an iterator. As for the original pattern: "]|" – that's just a convince form of (imposible to write this in jira markup w/o code tags)... \]| .. an unescaped close bracket (that has no matching open bracket) is treated as a literal... $ perl -MData::Dumper -le 'print Dumper split /]|/, "AB]CD" ;' $VAR1 = 'A'; $VAR2 = 'B'; $VAR3 = 'C'; $VAR4 = 'D'; What i can't explain, is why java treats "empty string" as something that matches in the middle of a code point. that certainly sounds like bug, unless there is some subtlety in Unicode TR#18 that i'm not seeing... http://www.unicode.org/reports/tr18/
        Hide
        Dawid Weiss added a comment -

        I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string".

        Yeah, I am a bit surprised at what "" matches. It doesn't match an empty string. It matches an empty string in between characters... or in other words, it matches what's not there. Makes sense when you think of it.

        As for '|', I looked at it from automata theory point of view – '|' doesn't need any arguments or post-arguments (or states), unlike '', '*' or the like which need a state to reference. I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if "" pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here.

        I don't know much about regexp theory to argue if I'm right or wrong though. I don't even think there is one "right" way to do things if this is a true quote:

        I define UNIX as “30 definitions of regular expressions living under one roof.” —Don Knuth

        Dawid

        Show
        Dawid Weiss added a comment - I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "" which is "the empty pattern" or a way of saying "match the empty string". Yeah, I am a bit surprised at what "" matches. It doesn't match an empty string. It matches an empty string in between characters... or in other words, it matches what's not there. Makes sense when you think of it. As for '|', I looked at it from automata theory point of view – '|' doesn't need any arguments or post-arguments (or states), unlike ' ', '*' or the like which need a state to reference. I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if " " pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here. I don't know much about regexp theory to argue if I'm right or wrong though. I don't even think there is one "right" way to do things if this is a true quote: I define UNIX as “30 definitions of regular expressions living under one roof.” —Don Knuth Dawid
        Hide
        Hoss Man added a comment -

        It doesn't match an empty string. It matches an empty string in between characters...

        Well, it's more complicated then that. it does match the empty string (in the sense of "does this regex match this entire string which happens to be empty) but in the context of "find" or "replace" on a larger string you are correct that it matches nothing, which means it matches the emptiness between characters.

        I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if "+" pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here.

        I think that's just a fluke of syntax/precedence ... if you use parens (capturing or otherwise) you can say "match the empty pattern 1 or more times)...

        $ perl -MData::Dumper -le 'print Dumper split /(?:)+/, "ABCD";'
        $VAR1 = 'A';
        $VAR2 = 'B';
        $VAR3 = 'C';
        $VAR4 = 'D';
        

        Bottom Line: these patterns are all valid and meaningful, and everything we've discussed is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug.

        Show
        Hoss Man added a comment - It doesn't match an empty string. It matches an empty string in between characters... Well, it's more complicated then that. it does match the empty string (in the sense of "does this regex match this entire string which happens to be empty) but in the context of "find" or "replace" on a larger string you are correct that it matches nothing, which means it matches the emptiness between characters. I'd be convinced '|' is a consistent way of saying 'match empty string or empty string' if "+" pattern worked ("match empty string one or more times"), but it doesn't – this will fail with an error. So '|' is kind of special here. I think that's just a fluke of syntax/precedence ... if you use parens (capturing or otherwise) you can say "match the empty pattern 1 or more times)... $ perl -MData::Dumper -le 'print Dumper split /(?:)+/, "ABCD" ;' $VAR1 = 'A'; $VAR2 = 'B'; $VAR3 = 'C'; $VAR4 = 'D'; Bottom Line: these patterns are all valid and meaningful, and everything we've discussed is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug.
        Hide
        Dawid Weiss added a comment -

        Thanks for the insight – interesting.

        is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug.

        Absolutely. This seems like a bug to me too.

        Show
        Dawid Weiss added a comment - Thanks for the insight – interesting. is tangential to the problem – which seems to be that the JVM lets the empty pattern split in between chars instead of codepoints, which seems like a bug. Absolutely. This seems like a bug to me too.
        Hide
        Dawid Weiss added a comment -

        I checked python3 and it works according to the intuition (works on codepoints). I also asked on openjdk's i18n mailing list about where to file it.

        In the mean time I'll fix it by looping the random pattern generator until it processes a non-bmp character so that the returned string is valid utf16.

        Show
        Dawid Weiss added a comment - I checked python3 and it works according to the intuition (works on codepoints). I also asked on openjdk's i18n mailing list about where to file it. In the mean time I'll fix it by looping the random pattern generator until it processes a non-bmp character so that the returned string is valid utf16.
        Hide
        Dawid Weiss added a comment -

        This doesn't filter out all invalid possibilities but should protect against some (including the one that caused this issue)?

        Show
        Dawid Weiss added a comment - This doesn't filter out all invalid possibilities but should protect against some (including the one that caused this issue)?
        Hide
        Dawid Weiss added a comment -

        This is very likely a JDK bug. I committed a patch that works around some of the trivial random patterns but it's definitely not a full resolution of the problem.

        Show
        Dawid Weiss added a comment - This is very likely a JDK bug. I committed a patch that works around some of the trivial random patterns but it's definitely not a full resolution of the problem.
        Hide
        Dawid Weiss added a comment -

        Follow-up discussion on core-libs-dev. The bottom line: this is the expected behavior...

        http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010455.html

        Show
        Dawid Weiss added a comment - Follow-up discussion on core-libs-dev. The bottom line: this is the expected behavior... http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010455.html

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development