[TIKA-4098] Detection fails on PDF with garbage before header - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.8.0
Fix Version/s: None
Component/s: core
Labels: None

Description

PDF detection fails on files that contain too much garbage before the '%PDF-' header.

Those PDFs do not respect the specification, but are nonetheless correctly handled by PDF viewers.

The attached PDF is an example of the garbage found in a real-life PDF (it looks like email headers that 'leaked' into the PDF file). The PDF itself is one that I generated so that the example is small.

The current magic for PDFs limits the search for the '%PDF-' header to 512 bytes, and in the attached PDF the header is located after 702 garbage bytes.
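To make the numbers concrete, here is a minimal sketch of a bounded header search. It is not Tika's magic-matching code; `BoundedHeaderSearch`, `findHeader`, and `sample` are hypothetical names, and the synthetic file (702 filler bytes followed by a header line) only imitates the attachment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BoundedHeaderSearch {
    static final byte[] PDF_HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the first offset <= maxOffset at which "%PDF-" starts, or -1 if none. */
    static int findHeader(byte[] data, int maxOffset) {
        // Never read past the end of the buffer.
        int last = Math.min(maxOffset, data.length - PDF_HEADER.length);
        for (int i = 0; i <= last; i++) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < PDF_HEADER.length; j++) {
            if (data[pos + j] != PDF_HEADER[j]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a synthetic file: garbageLength filler bytes, then a minimal header line. */
    static byte[] sample(int garbageLength) {
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[garbageLength + header.length];
        Arrays.fill(out, 0, garbageLength, (byte) 'x');
        System.arraycopy(header, 0, out, garbageLength, header.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] file = sample(702);
        System.out.println(findHeader(file, 512)); // -1: header lies beyond the window
        System.out.println(findHeader(file, 768)); // 702: header found
    }
}
```

With a 512-byte window the header at offset 702 is never reached, which is the failure described in this issue.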

I looked at the sources of PDFBox and Ghostscript to see how they handle this case:

- Ghostscript searches through the entire file (see https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c lines 1323-1339)
- PDFBox reads the file line by line, and stops looking for the header when it encounters a line that starts with a digit (see https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java lines 1561-)
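The two strategies can be contrasted with a rough sketch. This is only an illustration of the behaviour described above, not code from either project: `HeaderScanStrategies`, `scanWholeFile`, and `scanLines` are hypothetical names, the input is held in memory for brevity, and I assume a line "has the header" when it contains '%PDF-' anywhere.

```java
import java.nio.charset.StandardCharsets;

class HeaderScanStrategies {
    static final String MARKER = "%PDF-";

    /** Whole-file scan: true if the marker appears anywhere in the data. */
    static boolean scanWholeFile(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives decoding.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.contains(MARKER);
    }

    /**
     * Line scan: walk the lines in order, succeed on the first line containing
     * the marker, and give up at the first line that starts with a digit.
     */
    static boolean scanLines(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        for (String line : text.split("\r\n|\r|\n")) {
            if (line.contains(MARKER)) {
                return true;
            }
            if (!line.isEmpty() && Character.isDigit(line.charAt(0))) {
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] leakedHeaders = "Received: from mail.example.org\nSubject: report\n%PDF-1.4\n1 0 obj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] digitFirst = "1 0 obj\n%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(scanWholeFile(leakedHeaders) + " " + scanLines(leakedHeaders)); // true true
        System.out.println(scanWholeFile(digitFirst) + " " + scanLines(digitFirst));       // true false
    }
}
```

Neither approach depends on a fixed byte offset, so both would accept leaked email headers of any length as long as no earlier line starts with a digit (for the line scan).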

From the documentation in tika-mimetypes.xml for the application/pdf MIME type, I understand that increasing the maximum offset can trigger false positives. I increased it to 768 and the unit tests pass, but I didn't find any PDF that tests this particular case, so either no such test exists or there are integration tests that aren't part of this repo.

How can I go about testing for regressions? I can provide a pull request for this change, but where should I put the test PDF and a unit test?
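As a rough illustration of the false-positive concern, the sketch below runs two synthetic inputs against both window sizes. It is a stand-in for a magic rule, not Tika's detector; `WindowTradeoff`, `looksLikePdf`, and `padded` are hypothetical names, and both inputs are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class WindowTradeoff {
    /** True if "%PDF-" first appears at an offset in [0, maxOffset]. */
    static boolean looksLikePdf(byte[] data, int maxOffset) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int idx = text.indexOf("%PDF-");
        return idx >= 0 && idx <= maxOffset;
    }

    /** Builds prefixLength filler characters followed by tail. */
    static byte[] padded(int prefixLength, String tail) {
        return ("x".repeat(prefixLength) + tail).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A PDF-like file with 702 leaked bytes before the header.
        byte[] garbagePdf = padded(702, "%PDF-1.4\n");
        // Plain text that merely mentions the marker, at offset 622.
        byte[] plainText = padded(600, " the file starts with %PDF-1.4 as usual\n");

        for (int window : new int[] {512, 768}) {
            System.out.println(window
                    + " garbagePdf=" + looksLikePdf(garbagePdf, window)
                    + " plainText=" + looksLikePdf(plainText, window));
        }
    }
}
```

In this toy setup, widening the window to 768 detects the garbage-prefixed PDF but also flags the plain-text input, so a regression suite would probably want negative cases of this kind alongside the new test PDF.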

Attachments

garbageBeforeHeader.pdf
10/Jul/23 09:07
2 kB
Thierry Guérin

Issue Links

links to GitHub Pull Request #1231

People

Assignee: Unassigned

Reporter: Thierry Guérin

Votes: 0

Watchers: 4

Dates

Created: 10/Jul/23 09:34

Updated: 11/Jul/23 14:29