Xerces2-J / XERCESJ-1614

ArrayIndexOutOfBoundsException: 2048 and Invalid byte 2 of 4-byte UTF-8 sequence.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.7.0, 2.7.1, 2.8.0, 2.8.1, 2.9.0, 2.9.1, 2.10.0, 2.11.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Environment:
      Ubuntu 10.04, openjdk6

      Description

      When importing Wikipedia dump files with mwdumper, the import fails on several files. This happens across multiple dumps (I tried the May and January dumps). One file that reproduces the problem (about 200 MB) is enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z, available at http://dumps.wikimedia.org/enwiki/20130503/

      With the mwdumper build that uses Xerces 2.7.1, the error is the following:
      7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z |java -server -jar mwdumper-1.16.jar --format=sql:1.5 | gzip -vc > temp.sql.gz

      7-Zip (A) 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
      p7zip Version 9.04 (locale=en_US.ISO-8859-15,Utf16=on,HugeFiles=on,8 CPUs)

      Processing archive: enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z

      Extracting enwiki-20130503-pages-meta-history1.xml-p000006887p000009316
      3 pages (1.165/sec), 1,000 revs (388.35/sec)
      3 pages (0.356/sec), 2,000 revs (237.164/sec)
      8 pages (0.677/sec), 3,000 revs (253.807/sec)
      13 pages (1.058/sec), 4,000 revs (325.627/sec)
      13 pages (0.992/sec), 5,000 revs (381.505/sec)
      16 pages (1.169/sec), 6,000 revs (438.436/sec)
      16 pages (1.016/sec), 7,000 revs (444.501/sec)
      17 pages (0.854/sec), 8,000 revs (401.849/sec)
      17 pages (0.695/sec), 9,000 revs (367.752/sec)
      18 pages (0.675/sec), 10,000 revs (374.967/sec)
      18 pages (0.653/sec), 11,000 revs (399.332/sec)
      18 pages (0.626/sec), 12,000 revs (417.043/sec)
      18 pages (0.6/sec), 13,000 revs (433.117/sec)
      18 pages (0.555/sec), 14,000 revs (431.766/sec)
      18 pages (0.499/sec), 15,000 revs (416.17/sec)
      19 pages (0.509/sec), 16,000 revs (428.483/sec)
      22 pages (0.58/sec), 17,000 revs (448.43/sec)
      22 pages (0.571/sec), 18,000 revs (467.302/sec)
      23 pages (0.546/sec), 19,000 revs (450.835/sec)
      24 pages (0.564/sec), 20,000 revs (469.649/sec)
      26 pages (0.587/sec), 21,000 revs (473.912/sec)
      28 pages (0.623/sec), 22,000 revs (489.182/sec)
      31 pages (0.684/sec), 23,000 revs (507.469/sec)
      31 pages (0.647/sec), 24,000 revs (500.584/sec)
      33 pages (0.655/sec), 25,000 revs (495.835/sec)
      Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
      at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
      at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
      at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
      77.4%

      With an mwdumper build that uses Xerces 2.11.0, run against a different dump, the error is the following (only the final lines are pasted):

      $ cat enwiki-20130102-pages-meta-history1.xml-p000004284p000005735 | java -server -jar mwdumper-1.16-2.11.0.jar --format=sql:1.5 > temp.sql
      289 pages (0.233/sec), 360,000 revs (290.012/sec)
      289 pages (0.229/sec), 361,000 revs (286.432/sec)
      289 pages (0.226/sec), 362,000 revs (283.608/sec)
      289 pages (0.225/sec), 363,000 revs (282.209/sec)
      289 pages (0.222/sec), 364,000 revs (280.006/sec)
      289 pages (0.22/sec), 365,000 revs (277.282/sec)
      Exception in thread "main" java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
      at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
      at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
      Caused by: org.xml.sax.SAXParseException; lineNumber: 128484149; columnNumber: 94; Invalid byte 2 of 4-byte UTF-8 sequence.
      at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
      at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
      at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
      ... 1 more
      Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
      at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
      at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
      ... 11 more
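
      One way to narrow down the second failure is to check whether the dump itself contains malformed UTF-8, independently of Xerces. The sketch below decodes a byte stream with the JDK's strict UTF-8 decoder and reports the byte offset of the first malformed sequence. The class name Utf8Check and the method firstMalformedOffset are hypothetical names chosen for illustration; they are not part of mwdumper or Xerces, and this check has not been run against the dumps named above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: scans a stream with the JDK's strict UTF-8 decoder.
final class Utf8Check {

    // Returns the byte offset of the first malformed UTF-8 sequence, or -1 if none.
    static long firstMalformedOffset(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(8192);
        CharBuffer chars = CharBuffer.allocate(8192);
        long consumed = 0;   // bytes already decoded and dropped from the buffer
        boolean eof = false;

        while (true) {
            if (!eof) {
                // Append new data after any incomplete sequence carried over.
                int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
                if (n < 0) {
                    eof = true;
                } else {
                    bytes.position(bytes.position() + n);
                }
            }
            bytes.flip();
            CoderResult result;
            do {
                chars.clear();   // decoded characters are discarded; only validity matters
                result = decoder.decode(bytes, chars, eof);
                if (result.isError()) {
                    // On error the buffer position marks the start of the bad sequence.
                    return consumed + bytes.position();
                }
            } while (result.isOverflow());
            consumed += bytes.position();
            bytes.compact();     // keep a trailing incomplete sequence for the next read
            if (eof) {
                return -1;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long offset = firstMalformedOffset(System.in);
        System.out.println(offset < 0
                ? "No malformed UTF-8 found"
                : "First malformed UTF-8 sequence at byte offset " + offset);
    }
}
```

      The decompressed dump could be piped into it in the same way as into mwdumper, for example: 7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z | java Utf8Check. If no malformed sequence is reported for a file that Xerces rejects, that would point at the parser's reader rather than at the data.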

              People

              • Assignee:
                Unassigned
              • Reporter:
                tsikerdekis Michael Tsikerdekis
              • Votes:
                0
              • Watchers:
                2

                  Time Tracking

                  Estimated: 24h
                  Remaining: 24h
                  Logged: Not Specified