Tika / TIKA-3026

Consider extracting structure/tags where possible in PDFs with the PDFMarkedContentExtractor

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.24
    • Component/s: None
    • Labels: None

    Description

      Some PDFs contain tags that may be useful in understanding the structure of the elements within a PDF, e.g., table markup, paragraph breaks, and headers.

      The quality of the tags depends entirely on the software and human generating the PDF.  There are no guarantees.  Nevertheless, it might be useful in some cases for users to be able to extract content with structure tags.
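
      To illustrate what "content with structure tags" could look like, here is a minimal sketch. It does not call PDFBox or Tika: Node is a hypothetical stand-in for a structure element of a tagged PDF, and the mapping from PDF standard structure types (Document, H1, P, Table, TR, TH, TD) to XHTML element names is an assumption made for this example, not the behavior implemented for this issue. Unknown or custom types fall back to a generic container, since the tags carry no guarantees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TagWalkSketch {

    /** Hypothetical stand-in for one structure element of a tagged PDF. */
    static final class Node {
        final String type;                       // structure type, e.g. "P", "H1", "Table"
        final String text;                       // text of the element, or null for containers
        final List<Node> kids = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    // Assumed mapping from PDF standard structure types to XHTML element names.
    static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("Document", "div");
        TAGS.put("H1", "h1");
        TAGS.put("P", "p");
        TAGS.put("Table", "table");
        TAGS.put("TR", "tr");
        TAGS.put("TH", "th");
        TAGS.put("TD", "td");
    }

    /** Renders the structure tree depth-first as nested XHTML-like markup. */
    static String render(Node node) {
        StringBuilder sb = new StringBuilder();
        render(node, sb);
        return sb.toString();
    }

    private static void render(Node node, StringBuilder sb) {
        // Unknown or custom structure types become a generic container.
        String tag = TAGS.getOrDefault(node.type, "div");
        sb.append('<').append(tag).append('>');
        if (node.text != null) {
            sb.append(escape(node.text));
        }
        for (Node kid : node.kids) {
            render(kid, sb);
        }
        sb.append("</").append(tag).append('>');
    }

    private static String escape(String s) {
        // Escape '&' first so already-escaped output is not double-escaped.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Node doc = new Node("Document", null)
            .add(new Node("H1", "Results"))
            .add(new Node("P", "a < b"))
            .add(new Node("Table", null)
                .add(new Node("TR", null)
                    .add(new Node("TD", "1"))
                    .add(new Node("TD", "2"))));
        System.out.println(render(doc));
    }
}
```

      In a real extractor the tree would come from the PDF's structure tree and marked-content sequences rather than being built by hand; the walk itself would likely stay this simple.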

      Some references:

      https://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care/

      https://www.adobe.com/accessibility/products/acrobat/pdf-repair-add-tags.html

      https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/

          People

            Assignee: Tim Allison (tallison)
            Reporter: Tim Allison (tallison)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: