Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Duplicate
- Affects Version/s: 2.7.0, 2.7.1, 2.8.0, 2.8.1, 2.9.0, 2.9.1, 2.10.0, 2.11.0
- None
- None
- Environment: Ubuntu 10.04, openjdk6
Description
When importing Wikipedia dump files with mwdumper, the import fails on several files. This happens with multiple dumps (I tried the May and January dumps). One file that reproduces the problem, about 200 MB in size, is enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z, found at http://dumps.wikimedia.org/enwiki/20130503/
With the mwdumper build that uses Xerces 2.7.1, the error is the following:
7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z |java -server -jar mwdumper-1.16.jar --format=sql:1.5 | gzip -vc > temp.sql.gz
7-Zip (A) 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=en_US.ISO-8859-15,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z
Extracting enwiki-20130503-pages-meta-history1.xml-p000006887p000009316
3 pages (1.165/sec), 1,000 revs (388.35/sec)
3 pages (0.356/sec), 2,000 revs (237.164/sec)
8 pages (0.677/sec), 3,000 revs (253.807/sec)
13 pages (1.058/sec), 4,000 revs (325.627/sec)
13 pages (0.992/sec), 5,000 revs (381.505/sec)
16 pages (1.169/sec), 6,000 revs (438.436/sec)
16 pages (1.016/sec), 7,000 revs (444.501/sec)
17 pages (0.854/sec), 8,000 revs (401.849/sec)
17 pages (0.695/sec), 9,000 revs (367.752/sec)
18 pages (0.675/sec), 10,000 revs (374.967/sec)
18 pages (0.653/sec), 11,000 revs (399.332/sec)
18 pages (0.626/sec), 12,000 revs (417.043/sec)
18 pages (0.6/sec), 13,000 revs (433.117/sec)
18 pages (0.555/sec), 14,000 revs (431.766/sec)
18 pages (0.499/sec), 15,000 revs (416.17/sec)
19 pages (0.509/sec), 16,000 revs (428.483/sec)
22 pages (0.58/sec), 17,000 revs (448.43/sec)
22 pages (0.571/sec), 18,000 revs (467.302/sec)
23 pages (0.546/sec), 19,000 revs (450.835/sec)
24 pages (0.564/sec), 20,000 revs (469.649/sec)
26 pages (0.587/sec), 21,000 revs (473.912/sec)
28 pages (0.623/sec), 22,000 revs (489.182/sec)
31 pages (0.684/sec), 23,000 revs (507.469/sec)
31 pages (0.647/sec), 24,000 revs (500.584/sec)
33 pages (0.655/sec), 25,000 revs (495.835/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
77.4%
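The index 2048 in the exception above suggests a write one slot past the end of a fixed-size character buffer. As a hedged illustration (this is not Xerces code, and the class name `SupplementaryDemo` is invented for this sketch), the following shows why characters outside the BMP are a plausible trigger: each one takes four bytes in UTF-8 but two `char` slots (a surrogate pair) in Java, so a decoder that has only one free slot left at the end of its buffer cannot store the whole character.

```java
import java.nio.charset.Charset;

public class SupplementaryDemo {
    /** Number of bytes needed to encode the given string as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(Charset.forName("UTF-8")).length;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));

        // One code point, but two Java chars and four UTF-8 bytes.
        System.out.println("code points: " + clef.codePointCount(0, clef.length()));
        System.out.println("Java chars:  " + clef.length());
        System.out.println("UTF-8 bytes: " + utf8Length(clef));

        // The two chars form a high/low surrogate pair.
        System.out.println("high surrogate first: " + Character.isHighSurrogate(clef.charAt(0)));
        System.out.println("low surrogate second: " + Character.isLowSurrogate(clef.charAt(1)));
    }
}
```

Wikipedia page histories contain such characters (for example in some scripts and symbols), which would fit with the failure appearing only in certain dump files.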
With an mwdumper build that uses Xerces 2.11.0, run on another dump, the error is the following (only the final lines are pasted):
$ cat enwiki-20130102-pages-meta-history1.xml-p000004284p000005735 | java -server -jar mwdumper-1.16-2.11.0.jar --format=sql:1.5 > temp.sql
289 pages (0.233/sec), 360,000 revs (290.012/sec)
289 pages (0.229/sec), 361,000 revs (286.432/sec)
289 pages (0.226/sec), 362,000 revs (283.608/sec)
289 pages (0.225/sec), 363,000 revs (282.209/sec)
289 pages (0.222/sec), 364,000 revs (280.006/sec)
289 pages (0.22/sec), 365,000 revs (277.282/sec)
Exception in thread "main" java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Caused by: org.xml.sax.SAXParseException; lineNumber: 128484149; columnNumber: 94; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
... 11 more
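To tell a parser bug apart from a genuinely corrupt dump, the input can be checked independently of Xerces. The sketch below is one possible way to do that, assuming the JDK's own strict UTF-8 decoder is an acceptable reference; the class name `Utf8Check` and the method `isValidUtf8` are hypothetical names chosen for this example. It avoids Java 7 APIs so that it should also compile on the openjdk6 environment listed above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8Check {
    /** Returns true if the whole stream decodes as strict UTF-8. */
    static boolean isValidUtf8(InputStream in) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        Reader reader = new InputStreamReader(in, decoder);
        char[] buf = new char[8192];
        try {
            while (reader.read(buf) != -1) {
                // Discard the decoded text; only validity matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isValidUtf8(System.in) ? "valid UTF-8" : "malformed UTF-8");
    }
}
```

It could be fed the same way as mwdumper, for example by piping the extracted dump (`7za e -so` or `cat`, as in the commands above) into `java Utf8Check`. If the dump decodes cleanly here, the "Invalid byte 2 of 4-byte UTF-8 sequence" message more likely comes from the parser's reader than from the data.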
Issue Links
- duplicates XERCESJ-1257 buffer overflow in UTF8Reader for characters out of BMP (Reopened)
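Since both stack traces fail inside `org.apache.xerces.impl.io.UTF8Reader`, one possible workaround (an assumption, not something confirmed in this report or in XERCESJ-1257) is to decode the bytes in the JDK and hand the parser a character stream. Per the SAX `InputSource` contract, a parser given a character stream reads it directly instead of decoding bytes itself. The class name `ReaderParseDemo` and the method `parseText` are hypothetical; this is a minimal sketch, not a patch to mwdumper.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderParseDemo {
    /** Parses XML from a byte stream, decoding UTF-8 in the JDK, and returns all text content. */
    static String parseText(InputStream in) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Wrapping the bytes in a Reader gives the parser characters, not bytes.
        Reader reader = new InputStreamReader(in, "UTF-8");
        parser.parse(new InputSource(reader), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A tiny document containing one character outside the BMP (U+1D11E).
        String clef = new String(Character.toChars(0x1D11E));
        byte[] xml = ("<t>" + clef + "</t>").getBytes("UTF-8");
        String parsed = parseText(new ByteArrayInputStream(xml));
        System.out.println("round-trip ok: " + parsed.equals(clef));
    }
}
```

Whether this avoids the failure on the full dumps would still need to be tested against the files named in the description.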