Lucene - Core / LUCENE-3937

Workaround the XERCES-J bug in Benchmark

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      In benchmark we have a patched version of Xerces which is hard to compile from source. Looking at the patched part of the code and at the source of EnwikiContentSource, the fix is to simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If it is not, parsing will no longer work (because the XML parser cannot switch encoding if it only has a Reader).
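
For illustration, a minimal sketch of handing the parser a Reader instead of an InputStream. This is not the attached patch: the class name `ReaderInputSourceSketch`, the helper `parseText`, and the sample XML are made up for the example, it uses whatever SAX parser the JDK supplies rather than a specific Xerces jar, and `StandardCharsets` assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not the code from the LUCENE-3937 patch.
class ReaderInputSourceSketch {

  /** Parses UTF-8 XML from a byte stream and returns the concatenated character data. */
  static String parseText(InputStream in)
      throws IOException, SAXException, ParserConfigurationException {
    final StringBuilder text = new StringBuilder();
    XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    xmlReader.setContentHandler(new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    });
    // The JVM decodes the bytes here; the parser only ever sees chars,
    // so the parser's own UTF-8 byte decoder is never involved.
    Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
    xmlReader.parse(new InputSource(reader));
    return text.toString();
  }

  public static void main(String[] args) throws Exception {
    // U+1D11E (musical G clef) is encoded as a 4-byte UTF-8 sequence.
    String xml = "<page><title>clef \uD834\uDD1E</title></page>";
    InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    System.out.println(parseText(in));
  }
}
```

Because the parser receives chars, an encoding named in an XML declaration is not used for decoding, which is why the input must really be UTF-8 for this approach to work.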

      1. LUCENE-3937-remaining-references.patch
        6 kB
        Steve Rowe
      2. LUCENE-3937-remaining-references.patch
        6 kB
        Steve Rowe
      3. LUCENE-3937.patch
        2 kB
        Robert Muir
      4. LUCENE-3937.patch
        1 kB
        Uwe Schindler

        Activity

        Uwe Schindler added a comment -

        Simple patch. Mike can you test this (by first replacing with stock released XERCES)?

        Michael McCandless added a comment -

        LUCENE-1591 is when we first tripped on the XERCESJ-1257 bug... and the bug also happens on enwiki-20110115-pages-articles.xml.bz2 export.

        Great idea to work around the Xerces-J bug by using the JVM to decode UTF-8, instead of Xerces-J...

        I'll test this patch now!

        Michael McCandless added a comment -

        Note: I just ran benchmark's conf/extractWikipedia.alg task on the XML export... when XERCESJ-1257 strikes you get this:

             ...
             [java]  936.83 sec --> main Wrote 2801000 line docs
             [java]  937.04 sec --> main Wrote 2802000 line docs
             [java]  937.27 sec --> main Wrote 2803000 line docs
             [java]  937.53 sec --> main Wrote 2804000 line docs
             [java]  937.79 sec --> main Wrote 2805000 line docs
             [java]  938.04 sec --> main Wrote 2806000 line docs
             [java]  938.35 sec --> main Wrote 2807000 line docs
             [java]  938.65 sec --> main Wrote 2808000 line docs
             [java]  938.88 sec --> main Wrote 2809000 line docs
             [java]  939.09 sec --> main Wrote 2810000 line docs
             [java]  939.09 sec --> main Wrote 2810000 line docs
             [java] Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
             [java] 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:198)
             [java] 	at java.lang.Thread.run(Thread.java:619)
             [java] ####################
             [java] ###  D O N E !!! ###
             [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
             [java] ####################
             [java] 	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
             [java] 	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
             [java] 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
             [java] 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
             [java] 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
             [java] 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
             [java] 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:175)
             [java] 	... 1 more
             [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
             [java] 	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
             [java] 	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
             [java] 	... 8 more
        
        Robert Muir added a comment -

        I agree this is an awesome idea... maybe the reader should not be passed a charset though,
        but a CharsetDecoder with REPORT set for the onXXXX() methods? This way if the XML is corrupt
        or maybe not actually UTF-8 (aren't all of Wikipedia's XMLs UTF-8?), then you know about it.
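
A minimal sketch of what that suggestion could look like, not the committed patch. The class `StrictUtf8ReaderSketch` and its helpers are hypothetical names for illustration, and `StandardCharsets` assumes Java 7 or later. A Reader built from a plain charset silently substitutes U+FFFD for bad bytes, while a decoder configured with `CodingErrorAction.REPORT` throws a `CharacterCodingException` instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class StrictUtf8ReaderSketch {

  /** Wraps the stream in a Reader whose decoder reports errors instead of replacing them. */
  static Reader strictUtf8Reader(InputStream in) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return new InputStreamReader(in, decoder);
  }

  /** Returns true if the bytes decode cleanly as UTF-8, false if the decoder reports an error. */
  static boolean decodesCleanly(byte[] bytes) {
    try (Reader reader = strictUtf8Reader(new ByteArrayInputStream(bytes))) {
      char[] buf = new char[1024];
      while (reader.read(buf) != -1) {
        // Drain the reader; only the decoding outcome matters here.
      }
      return true;
    } catch (CharacterCodingException e) {
      // Thrown because both error actions are set to REPORT.
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] valid = "clef \uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
    // 0xF0 starts a 4-byte sequence, but 0x28 is not a valid continuation byte.
    byte[] broken = {(byte) 0xF0, 0x28, (byte) 0x8C, 0x28};
    System.out.println("valid input decodes cleanly: " + decodesCleanly(valid));
    System.out.println("broken input decodes cleanly: " + decodesCleanly(broken));
  }
}
```

Passing the Reader from `strictUtf8Reader` to the XML parser would surface corrupt or non-UTF-8 input as an exception rather than as silently altered text.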

        Michael McCandless added a comment -

        OK with this patch the decode of enwiki-20110115 finished!

        I agree we should tell the decoder to throw an exception on any problems...

        Uwe Schindler added a comment -

        Robert, you know better how to do the problem reporting... I have no idea, I only know it's a nice builder-API

        Robert Muir added a comment -

        I can do it... give me a sec

        Robert Muir added a comment -

        updated (untested) patch with issue # added to the comments, and throwing exception on broken encoding.

        Uwe Schindler added a comment -

        Committed trunk 1307141, 3.x 1307144

        Steve Rowe added a comment -

        Patch against branch_3x removing remaining references to the patched xercesImpl jar. Also adds benchmark CHANGES entry.

        Committing shortly, and then forward porting to trunk.

        Robert Muir added a comment -

        wait: i don't think we should remove the licensing information totally?

        we still rely on xerces. it should just say 2.9.1 (not patched-hacked version)

        Uwe Schindler added a comment -

        I added a changes entry?

        Steve Rowe added a comment -

        > I added a changes entry?

        Benchmark has its own CHANGES.txt, and there is mention in there of this patched jar, so I thought it appropriate to add an entry there. I didn't think to check for your CHANGES entry. I'll go do that now.

        Steve Rowe added a comment -

        > wait: i don't think we should remove the licensing information totally?
        >
        > we still rely on xerces. it should just say 2.9.1 (not patched-hacked version)

        Right, thanks, I'll put it back and adjust the version.

        Steve Rowe added a comment -

        > wait: i don't think we should remove the licensing information totally?
        >
        > we still rely on xerces. it should just say 2.9.1 (not patched-hacked version)

        Right, thanks, I'll put it back and adjust the version.

        So, I'll put it back and adjust the version in lucene/NOTICE.txt, but I think it should be removed from solr/NOTICE.txt because it's not actually included in Solr? Here's what's in solr/NOTICE.txt now:

        Includes software from other Apache Software Foundation projects,
        including, but not limited to:
        [...]
         - Xerces (lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar)
        

        No xercesImpl jar exists under solr/lib/.

        Steve Rowe added a comment -

        Updated branch_3x patch putting back the xercesImpl mention in lucene/NOTICE.txt.

        Uwe, I looked at your CHANGES entry, and I think the entry I wrote in benchmark CHANGES.txt should still be included there. Can you take a look and tell me if you disagree?

        Steve Rowe added a comment -

        Committed the remaining references patch to branch_3x and trunk. Uwe, you can kill the benchmark/CHANGES.txt entry I added if you don't like it.


          People

          • Assignee: Uwe Schindler
          • Reporter: Uwe Schindler