TIKA-2845

Override processPages() in PDFTextStripper

Details

    • Task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.21
    • None
    • None

    Description

      On the PDFBox users list, lehmi confirmed (and tilman clarified) that PDFTextStripper's processPages() skips pages that lack a "Contents" element [1]. Inline images are part of the "Contents" element, so they would still be processed (e.g. for OCR).

      However, a page without a "Contents" element can still carry other elements, such as an annotation with an embedded file. Those elements are never reached when the page is skipped.

      We should override processPages() so that all pages are processed.
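
The skip-versus-process-all behavior described above can be modeled with a small, self-contained sketch. It does not use PDFBox or Tika: SketchPage, SkippingStripper and AllPagesStripper are hypothetical stand-ins invented for illustration, and the real override would have to work against PDFTextStripper's actual page handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a PDF page; not a PDFBox class.
final class SketchPage {
    final int number;
    final boolean hasContents;      // models the presence of a "Contents" element
    final boolean hasEmbeddedFile;  // models an annotation with an embedded file

    SketchPage(int number, boolean hasContents, boolean hasEmbeddedFile) {
        this.number = number;
        this.hasContents = hasContents;
        this.hasEmbeddedFile = hasEmbeddedFile;
    }
}

// Models the behavior described in the ticket: pages without contents are skipped.
class SkippingStripper {
    final List<Integer> visited = new ArrayList<>();

    void processPages(List<SketchPage> pages) {
        for (SketchPage page : pages) {
            if (page.hasContents) {
                processPage(page);
            }
        }
    }

    void processPage(SketchPage page) {
        // A real implementation would extract text, annotations and attachments here.
        visited.add(page.number);
    }
}

// Models the proposed fix: override processPages() so that every page is visited.
class AllPagesStripper extends SkippingStripper {
    @Override
    void processPages(List<SketchPage> pages) {
        for (SketchPage page : pages) {
            processPage(page);
        }
    }
}

class ProcessPagesSketch {
    public static void main(String[] args) {
        List<SketchPage> pages = Arrays.asList(
                new SketchPage(1, true, false),
                new SketchPage(2, false, true)); // no contents, but has an attachment

        SkippingStripper skipping = new SkippingStripper();
        skipping.processPages(pages);
        System.out.println("skipping: " + skipping.visited);

        AllPagesStripper all = new AllPagesStripper();
        all.processPages(pages);
        System.out.println("all pages: " + all.visited);
    }
}
```

In this model, page 2 (no contents, one embedded file) is reached only by the overriding class, which is the case the ticket is concerned with.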

      [1] Start of thread: https://lists.apache.org/thread.html/9f34f71f764ef2ac48bb2fe3d19aa0496fd989040a6df0c1d899a885@%3Cusers.pdfbox.apache.org%3E

          People

            tallison Tim Allison
            tallison Tim Allison