Description
Loading documents into PDFBox (PDDocument) is too slow.
One culprit is the method org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out).
The current implementation of this method uses a very slow test for end-of-stream conditions. A profile of readUntilEndStream() shows that a huge chunk of the method's processing time is consumed in the cmpCircularBuffer() call, which is purely part of the test for the end-of-stream marker. In other words, readUntilEndStream() is spending twice as much time testing for the end-of-stream marker as it is reading bytes from the stream.
A better solution is a simpler, direct, fail-fast conditional structure that compares primitive byte values. I strongly recommend that the current method be removed and replaced with the code below. This speeds up readUntilEndStream() by a little over a factor of 3 (a ratio of 113/37 = 3.05, to be more precise). This in turn improves the overall performance of PDDocument.parse() by about a factor of 2.7.
Note the addition of some constants (the ASCII codes of the marker letters) used to make the code readable.
-----------------------------------------------------------------
private static final int E = 101;
private static final int N = 110;
private static final int D = 100;
private static final int S = 115;
private static final int T = 116;
private static final int R = 114;
private static final int A = 97;
private static final int M = 109;
private static final int O = 111;
private static final int B = 98;
private static final int J = 106;
/**
 * This method will read through the current stream object until
 * we find the keyword "endstream" meaning we're at the end of this
 * object. Some pdf files, however, forget to write some endstream tags
 * and just close off objects with an "endobj" tag so we have to handle
 * this case as well.
 * @param out The stream we write out to.
 * @throws IOException
 */
private void readUntilEndStream( OutputStream out ) throws IOException{
    int byteRead;
    do{ //use a fail fast test for end of stream markers
        byteRead = pdfSource.read();
        if(byteRead==E){//only branch if "e"
            byteRead = pdfSource.read();
            if(byteRead==N){ //only continue branch if "en"
                byteRead = pdfSource.read();
                if(byteRead==D){//up to "end" now
                    byteRead = pdfSource.read();
                    if(byteRead==S){
                        byteRead = pdfSource.read();
                        if(byteRead==T){
                            byteRead = pdfSource.read();
                            if(byteRead==R){
                                byteRead = pdfSource.read();
                                if(byteRead==E){
                                    byteRead = pdfSource.read();
                                    if(byteRead==A){
                                        byteRead = pdfSource.read();
                                        if(byteRead==M){
                                            //found the whole marker
                                            pdfSource.unread( ENDSTREAM );
                                            return;
                                        }
                                        //"endstrea" without the "m": flush the consumed prefix
                                        out.write(ENDSTREAM, 0, 8);
                                    }else{ out.write(ENDSTREAM, 0, 7); }
                                }else{ out.write(ENDSTREAM, 0, 6); }
                            }else{ out.write(ENDSTREAM, 0, 5); }
                        }else{ out.write(ENDSTREAM, 0, 4); }
                    }else if(byteRead==O){
                        byteRead = pdfSource.read();
                        if(byteRead==B){
                            byteRead = pdfSource.read();
                            if(byteRead==J){
                                //found whole marker
                                pdfSource.unread( ENDOBJ );
                                return;
                            }
                            //"endob" without the "j": flush the consumed prefix
                            out.write(ENDOBJ, 0, 5);
                        }else{ out.write(ENDOBJ, 0, 4); }
                    }else{ out.write(E); out.write(N); out.write(D); }
                }else{ out.write(E); out.write(N); }
            }else{ out.write(E); }
        }
        if(byteRead==E){
            //a mismatching "e" may itself start a marker: push it back and rescan
            pdfSource.unread(byteRead);
        }else if(byteRead!=-1){
            out.write(byteRead);
        }
    }while(byteRead!=-1);
}
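For comparison, the same fail-fast idea can be written as a loop instead of nested ifs. The following is only a sketch under stated assumptions, not PDFBox code: the class name EndMarkerScanner, the method readUntilEndMarker() and the helper scan() are hypothetical, and a plain java.io.PushbackInputStream stands in for pdfSource. Bytes other than "e" take the fast path and are written immediately; a candidate is buffered only while it is still a prefix of "endstream" or "endobj".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone class; not part of PDFBox.
final class EndMarkerScanner {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    /** True if the first len bytes of buf are a prefix of marker. */
    private static boolean isPrefix(byte[] buf, int len, byte[] marker) {
        if (len > marker.length) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (buf[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Copies bytes from in to out until "endstream" or "endobj" is found.
     * The marker is pushed back so the caller can read it next.
     * The pushback buffer of in must hold at least 9 bytes.
     */
    static void readUntilEndMarker(PushbackInputStream in, OutputStream out) throws IOException {
        byte[] pending = new byte[ENDSTREAM.length]; // candidate marker bytes
        int len = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (len == 0 && b != 'e') {
                out.write(b); // fast path: cannot start a marker
                continue;
            }
            pending[len++] = (byte) b;
            boolean stream = isPrefix(pending, len, ENDSTREAM);
            boolean obj = isPrefix(pending, len, ENDOBJ);
            if ((stream && len == ENDSTREAM.length) || (obj && len == ENDOBJ.length)) {
                in.unread(pending, 0, len); // whole marker found
                return;
            }
            if (!stream && !obj) {
                // Not a marker: emit the first byte, rescan the rest.
                out.write(pending[0]);
                in.unread(pending, 1, len - 1);
                len = 0;
            }
        }
        out.write(pending, 0, len); // EOF in the middle of a candidate
    }

    /** Hypothetical demo helper: returns { bytes written, bytes left in the stream }. */
    static String[] scan(String input) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.ISO_8859_1)), 16);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            readUntilEndMarker(in, out);
            String rest = new String(in.readAllBytes(), StandardCharsets.ISO_8859_1);
            return new String[] { new String(out.toByteArray(), StandardCharsets.ISO_8859_1), rest };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, EndMarkerScanner.scan("abc eendstream xyz") writes "abc e" and leaves "endstream xyz" in the stream. The loop form does a little more work per candidate byte than the unrolled version, so any speed comparison between the two would have to be measured rather than assumed.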
Issue Links
- is duplicated by PDFBOX-556 Performance regression from 0.7.3 to 0.8.0 (Closed)