3013 – Large File Parsing

Summary: Large File Parsing

Status:	NEW

Alias:	None

Product:	Xerces-J
Classification:	Unclassified
Component:	SAX
Version:	1.4.2
Hardware:	PC Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Xerces-J Developers Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2001-08-06 20:47 UTC by lgalanis
Modified:	2004-11-16 19:05 UTC
CC List:	0 users

Description lgalanis 2001-08-06 20:47:58 UTC

Using the Xmark benchmark (found at http://monetdb.cwi.nl/xml/index.html) I
tried to parse a really big file using SAX (doing nothing but parsing). When
piping the output of

<xmarkbinary> -f 20 through SAX (approx. 2GB) I got the following:

java.lang.RuntimeException: Internal Error: fPreviousChunk == NULL
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1094)
        at niagara.search_engine.xmark.DummyParser.main(DummyParser.java:22)

For values of -f such as 10, 15, and 18 there is no problem. The binary can be
built from the file at http://monetdb.cwi.nl/xml/Assets/unix.c
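
For anyone trying to reproduce this, a do-nothing SAX driver of the kind
described above might look like the sketch below. The original DummyParser
source is not attached to this report, so this is only an assumption about its
shape: the class name NoOpSaxParse is hypothetical, and it uses the standard
JAXP API (javax.xml.parsers) rather than the Xerces 1 classes shown in the
stack trace.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the reporter's DummyParser (not the original code).
public class NoOpSaxParse {
    // Parses the stream with a handler that ignores every SAX event.
    static void parse(InputStream in) throws Exception {
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(in, new DefaultHandler());
    }

    public static void main(String[] args) throws Exception {
        // Reads the document from standard input so generator output can be piped in.
        parse(System.in);
    }
}
```

It would be run as: <xmarkbinary> -f 20 | java NoOpSaxParse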

Comment 1 jjc 2001-08-07 11:53:23 UTC

I reproduced this.

The problem is the input file is more than 2^31 bytes long.

The offset (XMLEntityReader.fCurrentOffset) is an int, so it wraps around to a
negative number.
Shortly afterwards Xerces falls over in
org.apache.xerces.utils.UTF8DataChunk.addSymbol

I don't know what should be done. I would guess this is a WONTFIX, but the 
error messages could be improved. It is difficult to choose the best place to
catch it, though; I would assume that a minor change in the file would cause the
symptom (i.e. the exact place things go wrong) to be very different.

The value of the argument offset to UTF8DataChunk.addSymbol when it crashes is
-2147483551; before that there were numerous calls to addSymbol with very large
values of offset near Integer.MAX_VALUE.
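
The wrap-around can be illustrated in isolation. The sketch below is not
Xerces code; the class and method names are hypothetical, and it only assumes
that the offset is held in a Java int. Advancing an int from just below
Integer.MAX_VALUE to 2^31 + 97 gives exactly the negative value reported above,
whereas a long keeps the true position and Math.addExact reports the overflow
instead of wrapping.

```java
public class OffsetWrapDemo {
    // Hypothetical int offset counter: plain addition wraps silently on overflow.
    static int advance(int offset, int bytesRead) {
        return offset + bytesRead;
    }

    // The same counter held in a long keeps the true byte position.
    static long advanceLong(long offset, int bytesRead) {
        return offset + bytesRead;
    }

    public static void main(String[] args) {
        int offset = Integer.MAX_VALUE - 1;            // 2147483646
        System.out.println(advance(offset, 99));       // prints -2147483551
        System.out.println(advanceLong(offset, 99));   // prints 2147483745

        try {
            // Math.addExact throws instead of wrapping, so the overflow is detectable.
            Math.addExact(offset, 99);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```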

Comment 2 robw 2002-01-30 21:16:25 UTC

This is a show-stopper for many applications. Other Java parsers do not have
this problem...

Comment 3 Glenn Marcy 2002-01-30 23:29:20 UTC

While this is true, Xerces 1 is not really where the focus of Apache parser
development currently lies.  Has anyone tried this with Xerces 2?  If it is not
a problem there, then the answer would be for you to switch to the new
version.  If the problem still exists, then the version of this defect
should be changed to reflect that.  There are a great many things that could be 
done to improve Xerces 1, but with limited resources the main development
effort is on Xerces 2 now.  Considering that Xerces 1 has never been able to
parse documents that large, it is not a regression but a limitation of the old
architecture that Xerces 1 was based upon.