Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • None
    • 3.6, 4.0-ALPHA
    • None
    • None
    • New

    Description

      In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched part of the code and at the source of EnwikiContentSource, the fix is simple: provide the XML parser with a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If it is not, parsing will no longer work (because the XML parser cannot switch encoding if it only has a Reader).
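      The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the committed patch: the class name ReaderParseSketch and the method parseText are hypothetical, and it uses whatever SAX parser the JVM provides rather than the benchmark's Xerces jar. The bytes are decoded by the JVM's InputStreamReader, so the parser's own UTF-8 reader is never involved.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: names are illustrative and not taken from the patch.
public class ReaderParseSketch {

  /** Parses the stream as UTF-8 XML and returns all character data. */
  static String parseText(InputStream in) {
    // The JVM decodes the bytes; the parser only ever sees chars.
    Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
    StringBuilder text = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    };
    try {
      // An InputSource built from a Reader carries no byte stream,
      // so the parser cannot (and need not) detect the encoding itself.
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(reader), handler);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return text.toString();
  }

  public static void main(String[] args) {
    // U+1D11E needs four bytes in UTF-8 and a surrogate pair in Java.
    byte[] xml = "<doc>\uD834\uDD1E</doc>".getBytes(StandardCharsets.UTF_8);
    String text = parseText(new ByteArrayInputStream(xml));
    System.out.println(Integer.toHexString(text.codePointAt(0))); // prints 1d11e
  }
}
```

      The trade-off is the one the description names: with only a Reader, the encoding is fixed by the caller, so a file that is not UTF-8 would be decoded incorrectly.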

      Attachments

        1. LUCENE-3937.patch
          2 kB
          Robert Muir
        2. LUCENE-3937.patch
          1 kB
          Uwe Schindler
        3. LUCENE-3937-remaining-references.patch
          6 kB
          Steven Rowe
        4. LUCENE-3937-remaining-references.patch
          6 kB
          Steven Rowe

        Activity

          uschindler Uwe Schindler added a comment -

          Simple patch. Mike can you test this (by first replacing with stock released XERCES)?

          mikemccand Michael McCandless added a comment -

          LUCENE-1591 is when we first tripped on the XERCESJ-1257 bug... and the bug also happens on the enwiki-20110115-pages-articles.xml.bz2 export.

          Great idea to work around Xercesj's bug by using the JVM to decode UTF-8, instead of Xercesj...

          I'll test this patch now!

          mikemccand Michael McCandless added a comment -

          Note: I just ran benchmark's conf/extractWikipedia.alg task on the XML export... when XERCESJ-1257 strikes you get this:

               ...
               [java]  936.83 sec --> main Wrote 2801000 line docs
               [java]  937.04 sec --> main Wrote 2802000 line docs
               [java]  937.27 sec --> main Wrote 2803000 line docs
               [java]  937.53 sec --> main Wrote 2804000 line docs
               [java]  937.79 sec --> main Wrote 2805000 line docs
               [java]  938.04 sec --> main Wrote 2806000 line docs
               [java]  938.35 sec --> main Wrote 2807000 line docs
               [java]  938.65 sec --> main Wrote 2808000 line docs
               [java]  938.88 sec --> main Wrote 2809000 line docs
               [java]  939.09 sec --> main Wrote 2810000 line docs
               [java]  939.09 sec --> main Wrote 2810000 line docs
               [java] Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
               [java] 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:198)
               [java] 	at java.lang.Thread.run(Thread.java:619)
               [java] ####################
               [java] ###  D O N E !!! ###
               [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
               [java] ####################
               [java] 	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
               [java] 	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
               [java] 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
               [java] 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
               [java] 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
               [java] 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
               [java] 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:175)
               [java] 	... 1 more
               [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
               [java] 	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
               [java] 	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
               [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
               [java] 	... 8 more
          
          rcmuir Robert Muir added a comment -

          I agree this is an awesome idea... maybe the reader should not be passed a charset though,
          but a CharsetDecoder with REPORT set for the onXXXX() methods? This way if the XML is corrupt
          or maybe not actually UTF-8 (aren't all Wikipedia XML dumps UTF-8?), then you know about it.
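
          A minimal sketch of that suggestion, assuming the standard java.nio.charset API. The class and method names (StrictUtf8Sketch, strictUtf8Reader, decodesCleanly) are hypothetical and not taken from the patch. A Reader built from a Charset silently substitutes U+FFFD for bad input, while a decoder configured with CodingErrorAction.REPORT makes read() throw a CharacterCodingException instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: names are illustrative and not taken from the patch.
public class StrictUtf8Sketch {

  /** Wraps the stream in a Reader that fails on invalid UTF-8. */
  static Reader strictUtf8Reader(InputStream in) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return new InputStreamReader(in, decoder);
  }

  /** Returns true if the bytes decode as UTF-8 without any error. */
  static boolean decodesCleanly(byte[] bytes) {
    try (Reader r = strictUtf8Reader(new ByteArrayInputStream(bytes))) {
      char[] buf = new char[1024];
      while (r.read(buf) != -1) {
        // Drain the reader; only decoding errors matter here.
      }
      return true;
    } catch (CharacterCodingException e) {
      return false; // malformed or unmappable input was reported
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] good = "\uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
    // 0xF0 starts a 4-byte sequence, but 0x28 is not a continuation byte.
    byte[] bad = {(byte) 0xF0, (byte) 0x28, (byte) 0x8C, (byte) 0x28};
    System.out.println(decodesCleanly(good)); // prints true
    System.out.println(decodesCleanly(bad));  // prints false
  }
}
```

          A Reader built this way can be handed to the XML parser in place of the InputStream, so corrupt or non-UTF-8 input surfaces as an exception rather than as silently replaced characters.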

          mikemccand Michael McCandless added a comment -

          OK with this patch the decode of enwiki-20110115 finished!

          I agree we should tell the decoder to throw an exception on any problems...

          uschindler Uwe Schindler added a comment -

          Robert, you know better how to do the problem reporting... I have no idea, I only know it's a nice builder-API

          rcmuir Robert Muir added a comment -

          I can do it... give me a sec

          rcmuir Robert Muir added a comment -

          updated (untested) patch with issue # added to the comments, and throwing exception on broken encoding.

          uschindler Uwe Schindler added a comment -

          Committed trunk 1307141, 3.x 1307144

          sarowe Steven Rowe added a comment -

          Patch against branch_3x removing remaining references to the patched xercesImpl jar. Also adds benchmark CHANGES entry.

          Committing shortly, and then forward porting to trunk.

          rcmuir Robert Muir added a comment -

          wait: i don't think we should remove the licensing information totally?

          we still rely on xerces. it should just say 2.9.1 (not patched-hacked version)

          uschindler Uwe Schindler added a comment -

          I added a changes entry?

          sarowe Steven Rowe added a comment -

          > I added a changes entry?

          Benchmark has its own CHANGES.txt, and there is mention in there of this patched jar, so I thought it appropriate to add an entry there. I didn't think to check for your CHANGES entry. I'll go do that now.

          sarowe Steven Rowe added a comment -

          > wait: i don't think we should remove the licensing information totally?
          >
          > we still rely on xerces. it should just say 2.9.1 (not patched-hacked version)

          Right, thanks, I'll put it back and adjust the version.

          sarowe Steven Rowe added a comment -

          > wait: i don't think we should remove the licensing information totally?
          >
          > we still rely on xerces. it should just say 2.9.1 (not patched-hacked version)

          Right, thanks, I'll put it back and adjust the version.

          So, I'll put it back and adjust the version in lucene/NOTICE.txt, but I think it should be removed from solr/NOTICE.txt because it's not actually included in Solr? Here's what's in solr/NOTICE.txt now:

          Includes software from other Apache Software Foundation projects,
          including, but not limited to:
          [...]
           - Xerces (lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar)
          

          No xercesImpl jar exists under solr/lib/.

          sarowe Steven Rowe added a comment -

          Updated branch_3x patch putting back the xercesImpl mention in lucene/NOTICE.txt.

          Uwe, I looked at your CHANGES entry, and I think the entry I wrote in benchmark CHANGES.txt should still be included there. Can you take a look and tell me if you disagree?

          sarowe Steven Rowe added a comment -

          Committed the remaining references patch to branch_3x and trunk. Uwe, you can kill the benchmark/CHANGES.txt entry I added if you don't like it.

          tomoko Tomoko Uchida added a comment -

          This issue was moved to GitHub issue: #5010.


          People

            uschindler Uwe Schindler
            uschindler Uwe Schindler