  1. Tika
  2. TIKA-4098

Detection fails on PDF with garbage before header

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      PDF detection fails on files that contain too much garbage before the '%PDF-' header.

      Those PDFs do not respect the specification, but are nonetheless correctly handled by PDF viewers.

      The attached PDF is an example of the garbage found in a real-life PDF (it looks like email headers that 'leaked' into the PDF file). I generated the PDF itself so that the example is small.

      The current magic for PDFs limits the search for the '%PDF-' header to the first 512 bytes, and in the attached PDF the header is located after 702 garbage bytes.
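      To illustrate the offset limit described above, here is a minimal, self-contained sketch of a bounded header search. It is not Tika's actual magic-matching code: the class name PdfHeaderScan, the helper methods, and the inclusive treatment of the maximum offset are assumptions made for illustration, and the sample file is synthetic filler rather than the attached PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: scans for "%PDF-" up to a maximum start offset.
class PdfHeaderScan {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" that starts at or before maxOffset,
     * or -1 if none is found in that window (inclusive bound is an assumption).
     */
    static int findHeader(byte[] data, int maxOffset) {
        int last = Math.min(maxOffset, data.length - HEADER.length);
        for (int i = 0; i <= last; i++) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    // True if the header bytes appear in data starting exactly at pos.
    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < HEADER.length; j++) {
            if (data[pos + j] != HEADER[j]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a synthetic file: garbageLength filler bytes followed by a PDF header line. */
    static byte[] sample(int garbageLength) {
        byte[] tail = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[garbageLength + tail.length];
        Arrays.fill(out, 0, garbageLength, (byte) 'X');
        System.arraycopy(tail, 0, out, garbageLength, tail.length);
        return out;
    }
}
```

      With these assumptions, findHeader(sample(702), 512) finds nothing because the header starts beyond the window, while a limit of 768 reaches it at offset 702. A wider window also gives unrelated files more room to contain the byte sequence by chance, which is the false-positive risk mentioned in the report.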

      I looked at the sources of PdfBox and Ghostscript to see how they handle this case and:

      From the doc in tika-mimetypes.xml for the application/pdf MIME type, I understand that increasing the maximum offset can trigger false positives. I increased it to 768, and the unit tests pass, but I didn't find any PDF that tests this particular case, so either no such test exists or there are integration tests that aren't part of this repo?

      How can I go about testing for regressions? I can provide a pull request for this change, but where should I put the test PDF and a unit test?

      Attachments

        1. garbageBeforeHeader.pdf
          2 kB
          Thierry Guérin

            People

              Assignee: Unassigned
              Reporter: Thierry Guérin (tguerin)