[PDFBOX-4973] Parsing truncated files no longer throws IOException: Error reading stream, expected='endstream' actual='' at offset ... - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.0.7, 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13, 2.0.14, 2.0.15, 2.0.16, 2.0.17, 2.0.18, 2.0.19, 2.0.20, 2.0.21
Fix Version/s: None
Component/s: Parsing
Labels:
None
Environment:
Ubuntu 16.04

Description

Issue:

Since 2.0.7, no exception is thrown when a stream in a truncated PDF file is parsed.

In 2.0.6, COSParser's parseCOSStream throws "java.io.IOException: Error reading stream, expected='endstream' actual='' at offset ...", whereas in versions >= 2.0.7 parsing succeeds. If an EOF marker is added to the truncated file, however, the expected exception is thrown once again.
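
For anyone who wants to reproduce the EOF-marker variant, the following is a minimal sketch of one way to produce such a file. It is an assumption about the procedure, since the exact way the marker was added is not recorded here: the class name AppendEofMarker and the method copyWithEofMarker are hypothetical, and the marker is simply appended on its own line to a copy of the truncated file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of PDFBox or Tika.
final class AppendEofMarker {

    /** Copies source to target and appends a line containing only %%EOF. */
    static Path copyWithEofMarker(Path source, Path target) throws IOException {
        // Work on a copy so the original truncated file stays untouched.
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        // Assumption: the marker is added as its own final line.
        Files.write(target,
                "\n%%EOF\n".getBytes(StandardCharsets.US_ASCII),
                StandardOpenOption.APPEND);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the truncated input file, args[1]: the file to write
        Path target = copyWithEofMarker(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + target);
    }
}
```

The resulting file can then be passed to the same parsing setup as the original truncated file to compare the two behaviors.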

The code below is the minimum setup for reproducing the behavior (in conjunction with the respective file attached):

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.IOException;

public class Main {

        public static void main(String[] args) {

                File inputFile = new File("/path/to/parent/folder", "truncated.pdf");

                try {
                        // metadata will be extracted by Tika
                        Metadata meta = new Metadata();
                        meta.set(Metadata.CONTENT_TYPE, "application/pdf");

                        BodyContentHandler ch = new BodyContentHandler(-1);

                        AutoDetectParser parser = new AutoDetectParser();

                        PDFParserConfig pdfParserConfig = new PDFParserConfig();
                        pdfParserConfig.setOcrStrategy("no_ocr");
                        pdfParserConfig.setMaxMainMemoryBytes(209715200);

                        ParseContext parseContext = new ParseContext();
                        parseContext.set(PDFParserConfig.class, pdfParserConfig);

                        try (TikaInputStream is = TikaInputStream.get(inputFile.toPath())) {
                                // try to parse the document
                                parser.parse(is, ch, meta, parseContext);
                        }

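                } catch (TikaException | SAXException | IOException ex) {

                        // expected: with PDFBox 2.0.6 the parse failure ends up here
                } finally {

                        // observed with PDFBox >= 2.0.7: the catch block is skipped
                        // and execution arrives here without any exception
                }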
        }
}

The stack looks like this:

parseCOSStream	COSParser	(pdfbox)
parseFileObject	COSParser	(pdfbox)
parseObjectDynamically	COSParser	(pdfbox)
parseDictObjects	COSParser	(pdfbox)
initialParse	PDFParser	(pdfbox)
parse	PDFParser	(pdfbox)
load	PDDocument	(pdfbox)
parse	PDFParser	(tika-parsers)
parse	CompositeParser	(tika-parsers)

In 2.0.6, the IOException thrown in parseCOSStream is caught in Tika's CompositeParser parse method and rethrown as a TikaException, which our code expects and handles, as in the sample code provided.

Why I believe this is a regression:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf:

In this specification, Adobe describes the structure of PDF 1.7, the basis for the ISO 32000 standard.

Under clause 7 (Syntax), sub-clause 7.5 (File Structure) describes the valid PDF file structure.

This excerpt is from sub-sub-clause 7.5.5 (File Trailer):

------
The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
------

Additionally, the document in question cannot be previewed, as PDF previewers consider it broken.
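
To make the quoted trailer requirement concrete, here is a rough sketch of a check for the two markers near the end of a file. It is a heuristic only, not a conforming implementation of 7.5.5 and not how PDFBox locates the trailer: the class name TrailerCheck, the method names, and the 1024-byte window are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or Tika.
final class TrailerCheck {

    // Number of trailing bytes to inspect; an arbitrary choice for this sketch.
    private static final int TAIL_WINDOW = 1024;

    /** Returns the last TAIL_WINDOW bytes of the file (or the whole file if shorter). */
    static String readTail(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            int toRead = (int) Math.min(TAIL_WINDOW, length);
            byte[] buffer = new byte[toRead];
            raf.seek(length - toRead);
            raf.readFully(buffer);
            // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    /** True if the tail contains "startxref" followed later by "%%EOF". */
    static boolean hasTrailerMarkers(Path file) throws IOException {
        String tail = readTail(file);
        int startxref = tail.lastIndexOf("startxref");
        int eof = tail.lastIndexOf("%%EOF");
        return startxref >= 0 && eof > startxref;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + " has trailer markers: " + hasTrailerMarkers(file));
    }
}
```

A file cut off in the middle of a stream, like the one attached, would be expected to fail this check because neither keyword appears near its end. The check is looser than the specification, which requires %%EOF to be the only content of the last line.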

What introduced this change in parsing:

I investigated and tested what introduced this change in behavior.

The change in behavior stems from the resolution of PDFBOX-3798: https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?r1=1795704&r2=1795703&pathrev=1795704

I have tested rebuilding both 2.0.7 and 2.0.19 from their source code after reverting the change introduced by the commit above. This brings the behavior back to throwing "java.io.IOException: Error reading stream, expected='endstream' actual='' at offset ..." again.
