Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
2.8.0
-
None
-
None
Description
PDF detection fails on files that contain too much garbage before the '%PDF-' header.
Those PDFs do not respect the specification, but are nonetheless correctly handled by PDF viewers.
The attached PDF is an example of the garbage found in a real-life PDF (it looks like email headers that 'leaked' into the PDF file). The PDF itself is one that I generated so that the example is small.
The current magic for PDFs limits the search for the '%PDF-' header to 512 bytes, and in the attached PDF the header is located after 702 garbage bytes.
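To make the numbers concrete, here is a minimal sketch of a bounded header search. It is not Tika's implementation: the class `PdfHeaderSearch`, its methods, and the synthetic sample are hypothetical, and it assumes the limit means "the header may start at any offset up to the maximum".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not Tika code.
class PdfHeaderSearch {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" that starts at an offset
     * in [0, maxOffset], or -1 if there is none in that window.
     */
    static int indexOfHeader(byte[] data, int maxOffset) {
        int last = Math.min(maxOffset, data.length - HEADER.length);
        for (int i = 0; i <= last; i++) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < HEADER.length; j++) {
            if (data[pos + j] != HEADER[j]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a synthetic file: `garbage` filler bytes followed by a PDF header. */
    static byte[] sample(int garbage) {
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[garbage + head.length];
        Arrays.fill(data, 0, garbage, (byte) 'X');
        System.arraycopy(head, 0, data, garbage, head.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sample(702);
        System.out.println(indexOfHeader(data, 512)); // header is outside the window
        System.out.println(indexOfHeader(data, 768)); // header is inside the window
    }
}
```

With 702 bytes of filler, the 512-byte window misses the header while a 768-byte window finds it, which matches the behaviour described above.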
I looked at the sources of PdfBox and Ghostscript to see how they handle this case and:
- Ghostscript searches through the entire file (see https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c lines 1323-1339)
- PdfBox reads the file line by line, and stops looking for the header when it encounters a line that starts with a digit (see https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java lines 1561-)
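The line-based strategy described for PdfBox can be sketched as follows. This is only my reading of the behaviour summarized above, not PdfBox's actual code: the class `LineHeaderScan` and its method are hypothetical, and matching the header with `contains` rather than `startsWith` is an assumption.

```java
// Hypothetical illustration of a line-by-line header scan, not PdfBox code.
class LineHeaderScan {
    /**
     * Returns the index of the first line containing "%PDF-", or -1 if a line
     * starting with a digit (or the end of input) is reached first.
     */
    static int findHeaderLine(String... lines) {
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.contains("%PDF-")) {
                return i;
            }
            // A leading digit suggests the body has begun (e.g. "1 0 obj"),
            // so stop looking for a header.
            if (!line.isEmpty() && Character.isDigit(line.charAt(0))) {
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Leaked email headers followed by the real PDF header.
        System.out.println(findHeaderLine("Received: by mail", "Subject: report", "%PDF-1.4"));
        // A digit-led line appears first, so the scan gives up.
        System.out.println(findHeaderLine("garbage", "1 0 obj", "%PDF-1.4"));
    }
}
```

Unlike a fixed byte limit, this heuristic tolerates any amount of leading text as long as no line begins with a digit, at the cost of reading further into files that are not PDFs.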
From the documentation in tika-mimetypes.xml for the application/pdf MIME type, I understand that increasing the maximum offset can trigger false positives. I increased it to 768 and the unit tests pass, but I didn't find any test PDF that covers this particular case, so either no such test exists or there are integration tests that aren't part of this repo?
How can I go about testing for regressions? I can provide a pull request for this change, but where should I put the test PDF and the unit test?
Attachments