Lucene - Core / LUCENE-3327

TestFSTs.testRandomWords throws AIOBE when "verbose"=true

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Seems like invalid UTF-8 sometimes gets passed to BytesRef.utf8ToString() in the verbose "println" calls.

      1. LUCENE-3327.patch
        13 kB
        James Dyer
      2. LUCENE-3327.patch
        3 kB
        James Dyer

        Activity

        Uwe Schindler made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Michael McCandless made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Fix Version/s: (none) -> 3.4 [ 12316675 ]
        Fix Version/s: (none) -> 4.0 [ 12314025 ]
        Resolution: (none) -> Fixed [ 1 ]
        Michael McCandless added a comment -

        Thanks James!

        Michael McCandless committed 1148968 (2 files)
        Michael McCandless added a comment -

        Looks great James, thanks! I confirmed this fixes the above exception when I run with verbose. I'll commit shortly – I just changed the new arg's name to "isValidUnicode", and fixed up the whitespace.

        James Dyer made changes -
        Attachment: (none) -> LUCENE-3327.patch [ 12487230 ]
        James Dyer added a comment -

        Wherever the test is printing out a term prefix, it just calls IntsRef.toString() rather than try to convert this to something more human-readable.
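
A minimal sketch of that idea, assuming a hypothetical helper class (PrefixPrinter is not part of the patch): when the caller cannot promise that the values are valid Unicode, print them raw instead of decoding them. The bracketed hex format below is an assumption made for illustration; the thread does not show what IntsRef.toString() actually prints. The flag name isValidUnicode is the one mentioned in the review of the final patch, but its use here is only a guess at the intent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the code from LUCENE-3327.patch.
class PrefixPrinter {

  /** Renders raw values as bracketed hex, e.g. "[63 61 c3]", without decoding them. */
  static String rawToString(int[] ints) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < ints.length; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(Integer.toHexString(ints[i]));
    }
    return sb.append(']').toString();
  }

  /**
   * Describes a sequence of UTF-8 byte values. Decoding is attempted only when
   * the caller states the bytes are a complete, valid sequence; a prefix cut at
   * an arbitrary byte position should be passed with isValidUnicode = false.
   */
  static String describe(int[] utf8Bytes, boolean isValidUnicode) {
    if (!isValidUnicode) {
      return rawToString(utf8Bytes);
    }
    byte[] bytes = new byte[utf8Bytes.length];
    for (int i = 0; i < utf8Bytes.length; i++) {
      bytes[i] = (byte) utf8Bytes[i];
    }
    return new String(bytes, StandardCharsets.UTF_8) + " " + rawToString(utf8Bytes);
  }

  public static void main(String[] args) {
    // A whole term: safe to decode.
    System.out.println(describe(new int[] {0x63, 0x61}, true));
    // A prefix that ends in the middle of a multi-byte character: print raw.
    System.out.println(describe(new int[] {0x63, 0x61, 0xc3}, false));
  }
}
```

The raw form is less readable, but it cannot fail on a prefix that ends partway through a multi-byte character.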

        Michael McCandless added a comment -

        Ahhh that's what makes the invalid UTF8. OK. Can we just change that one place (that cuts a potentially invalid UTF8 prefix) to just use BytesRef.toString?

        James Dyer added a comment -

        Spooky because this test supposedly creates random valid unicode strings (_TestUtil.randomRealisticUnicodeString)... hmmm.

        but then it breaks them down into prefixes and those aren't always valid utf-8...
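
The effect is easy to reproduce outside Lucene. The sketch below uses only the JDK (no Lucene classes; Utf8PrefixDemo and isValidUtf8 are names made up for this example) to show that cutting a valid UTF-8 byte sequence at an arbitrary byte offset can leave a dangling lead byte:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: JDK classes, hypothetical names.
class Utf8PrefixDemo {

  /** Returns true if the bytes form a complete, well-formed UTF-8 sequence. */
  static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      decoder.decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // U+00E9 encodes to two bytes in UTF-8 (0xC3 0xA9), so the string is 5 bytes long.
    byte[] full = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
    // Dropping the last byte leaves the lead byte 0xC3 without its continuation byte.
    byte[] prefix = Arrays.copyOf(full, full.length - 1);

    System.out.println(isValidUtf8(full));   // true
    System.out.println(isValidUtf8(prefix)); // false
  }
}
```

A decoder that assumes well-formed input, rather than checking it, may read past the end of such a prefix, which would be consistent with the ArrayIndexOutOfBoundsException reported in this issue.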

        Michael McCandless added a comment -

        Hmm, I don't think this is quite right: in the BYTE1 case, these are the bytes from the term, and we shouldn't pretend they are unicode code points (which is what UnicodeUtil.newString is given).

        I.e., we really do need the inputMode to be passed to inputToString.

        Really, this test pretends a term is always a UTF-8 byte sequence, which in general is not the case (terms are arbitrary byte[]); it's just that this test only ever operates on terms that are in fact UTF-8 byte sequences (I think?).

        Indeed I'm also hitting AIOOBE (ant test-core -Dtestcase=TestFSTs -Dtestmethod=testRandomWords -Dtests.seed=-3451527662631579719:-3355372777860187201):

        There was 1 failure:
        1) testRandomWords(org.apache.lucene.util.fst.TestFSTs)
        java.lang.ArrayIndexOutOfBoundsException: 44
        	at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:586)
        	at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:203)
        	at org.apache.lucene.util.fst.TestFSTs.inputToString(TestFSTs.java:989)
        	at org.apache.lucene.util.fst.TestFSTs.access$000(TestFSTs.java:53)
        	at org.apache.lucene.util.fst.TestFSTs$FSTTester.verifyPruned(TestFSTs.java:833)
        	at org.apache.lucene.util.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:507)
        	at org.apache.lucene.util.fst.TestFSTs$FSTTester.doTest(TestFSTs.java:366)
        	at org.apache.lucene.util.fst.TestFSTs.doTest(TestFSTs.java:214)
        	at org.apache.lucene.util.fst.TestFSTs.testRandomWords(TestFSTs.java:963)
        	at org.apache.lucene.util.fst.TestFSTs.testRandomWords(TestFSTs.java:938)
        

        Spooky because this test supposedly creates random valid unicode strings (_TestUtil.randomRealisticUnicodeString)... hmmm.
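
To make the inputMode point concrete, here is a rough sketch using plain JDK calls rather than BytesRef or UnicodeUtil. The class name, the two mode constants, and the method signature are all assumptions for illustration; the test's real constants and helper are not shown in this thread. The same int values mean different things depending on the mode, so the conversion has to branch on it:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and constants are not taken from TestFSTs.
class InputToStringSketch {

  static final int MODE_BYTES = 0;       // assumption: each int holds one UTF-8 byte (0..255)
  static final int MODE_CODE_POINTS = 1; // assumption: each int holds one Unicode code point

  static String inputToString(int inputMode, int[] ints) {
    if (inputMode == MODE_BYTES) {
      byte[] bytes = new byte[ints.length];
      for (int i = 0; i < ints.length; i++) {
        bytes[i] = (byte) ints[i];
      }
      // The JDK decoder substitutes U+FFFD for malformed input instead of throwing.
      return new String(bytes, StandardCharsets.UTF_8);
    }
    // Code-point mode: build the string directly from the code points.
    return new String(ints, 0, ints.length);
  }

  public static void main(String[] args) {
    int[] utf8Bytes = {0xC3, 0xA9}; // UTF-8 encoding of U+00E9
    System.out.println(inputToString(MODE_BYTES, utf8Bytes));
    // Treating the same two bytes as code points yields two characters (U+00C3, U+00A9).
    System.out.println(inputToString(MODE_CODE_POINTS, utf8Bytes));
  }
}
```

Decoding byte values as if they were code points does not throw, but it silently produces the wrong text, which is why the mode needs to reach the conversion routine.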

        James Dyer made changes -
        Attachment: (none) -> LUCENE-3327.patch [ 12486882 ]
        James Dyer added a comment -

        This just calls UnicodeUtil.newString(..) instead of BytesRef.utf8ToString() in all cases.

        James Dyer created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            James Dyer
          • Votes:
            0
            Watchers:
            1
