Using the Xmark benchmark (found at http://monetdb.cwi.nl/xml/index.html) I tried to parse a really big file using SAX (doing nothing but parsing). When piping the output of <xmarkbinary> -f 20 (approx. 2GB) through SAX I got the following:

java.lang.RuntimeException: Internal Error: fPreviousChunk == NULL
    at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1094)
    at niagara.search_engine.xmark.DummyParser.main(DummyParser.java:22)

For values of -f such as 10, 15, 18 there is no problem. The binary can be built from the file at http://monetdb.cwi.nl/xml/Assets/unix.c
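For anyone who wants to retry this, here is a minimal sketch of a "parse and do nothing" SAX driver that reads the document from standard input. It is not the original DummyParser (whose source is not shown here); the class name DummySaxParse and the method countElements are made up for illustration. It goes through the standard JAXP API, so it runs against whichever SAX parser is on the classpath rather than the Xerces 1 classes directly. The only work it does is count start tags, to confirm the parse reached the end.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the reporter's DummyParser: parse and discard.
public class DummySaxParse {

    // Parses the stream with the default JAXP SAX parser and returns
    // the number of start tags seen. No other work is done per event.
    static long countElements(InputStream in) throws Exception {
        final long[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Reads from stdin so the generator output can be piped in.
        System.out.println(countElements(System.in) + " elements parsed");
    }
}
```

It would be invoked the same way as in the report, i.e. <xmarkbinary> -f 20 | java DummySaxParse.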
I reproduced this. The problem is that the input file is more than 2^31 bytes long, so the offset (XMLEntityReader.fCurrentOffset) wraps around to a negative number. Shortly afterwards Xerces falls over in org.apache.xerces.utils.UTF8DataChunk.addSymbol.

I don't know what should be done. I would guess this is a WONTFIX, but the error messages could be improved. It is difficult to choose the best place to catch it, though; I would assume that a minor change in the file would cause the symptom (i.e. the exact place things go wrong) to be very different.

When it crashes, the value of the offset argument to UTF8DataChunk.addSymbol is -2147483551; before that there were numerous calls to addSymbol with very large offset values near Integer.MAX_VALUE.
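To illustrate the wrap-around, here is a small standalone sketch of what happens to a Java int counter that is advanced past Integer.MAX_VALUE. It does not use any Xerces code: OffsetWrapDemo, advance and advanceChecked are hypothetical names, and the starting offset and increment are chosen only so that the result matches the value reported above. The checked variant shows one possible way an overflow could be turned into a clear error, not a proposed patch.

```java
// Hypothetical demo; not Xerces code.
public class OffsetWrapDemo {

    // Mimics an int offset counter (such as fCurrentOffset) being advanced.
    // Plain int addition wraps silently on overflow.
    static int advance(int offset, int bytesRead) {
        return offset + bytesRead;
    }

    // Same step, but Math.addExact throws ArithmeticException on overflow,
    // which would surface the problem at the point where it happens.
    static int advanceChecked(int offset, int bytesRead) {
        return Math.addExact(offset, bytesRead);
    }

    public static void main(String[] args) {
        // Illustrative values: 2 bytes short of the limit, then 100 more bytes.
        int offset = Integer.MAX_VALUE - 2;

        System.out.println(advance(offset, 100)); // prints -2147483551

        try {
            advanceChecked(offset, 100);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

The reported offset -2147483551 corresponds to a true position of 2^31 + 97 bytes, which is consistent with the counter having just passed the 2^31 boundary.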
This is a show-stopper for many applications. Other Java parsers do not have this problem...
While this is true, Xerces 1 is not where the focus of Apache parser development currently lies. Has anyone tried this with Xerces 2? If it is not a problem there, then the answer would be for you to switch to the new version. If the problem still exists, then the version of this defect should be changed to reflect that. There are a great many things that could be done to improve Xerces 1, but with limited resources the main development effort is on Xerces 2 now. Considering that Xerces 1 has never been able to parse documents that large, this is not a regression but a limitation of the old architecture that Xerces 1 was based upon.