Tika / TIKA-3026

Consider extracting structure/tags where possible in PDFs with the PDFMarkedContentExtractor


    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.24
    • Component/s: None
    • Labels:
      None

      Description

      Some PDFs contain tags that describe the structure of the elements within the document, e.g. table markup, paragraph breaks, headers, etc.

      The quality of the tags depends entirely on the software and the person generating the PDF, so there are no guarantees.  Nevertheless, it may be useful in some cases for users to be able to extract content together with its structure tags.
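
To make the idea concrete, here is a minimal sketch of how structure tags might be turned into XHTML-style markup. It is not Tika's or PDFBox's implementation: the `Node` class is a hypothetical, simplified stand-in for a node in a PDF structure tree, and the tag table covers only a handful of the standard structure types (P, H1, Table, TR, ...). Because tag quality is not guaranteed, anything unrecognized falls back to a generic `div` instead of failing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; class and method names are assumptions, not Tika API.
final class StructureTagSketch {

    // Hypothetical, simplified stand-in for a node in a PDF structure tree.
    static final class Node {
        final String type;                        // structure type, e.g. "P", "H1", "Table"
        final String text;                        // text directly under this node, may be empty
        final List<Node> kids = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    // A few standard structure types mapped to XHTML element names (not exhaustive).
    private static final Map<String, String> TAGS = Map.of(
            "P", "p",
            "H1", "h1",
            "H2", "h2",
            "Table", "table",
            "TR", "tr",
            "TH", "th",
            "TD", "td",
            "L", "ul",
            "LI", "li");

    // Unknown or producer-specific types fall back to a neutral container.
    static String toXhtmlElement(String structureType) {
        return TAGS.getOrDefault(structureType, "div");
    }

    // Depth-first rendering: open element, own text, children, close element.
    static String render(Node node) {
        String element = toXhtmlElement(node.type);
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(element).append('>').append(escape(node.text));
        for (Node kid : node.kids) {
            sb.append(render(kid));
        }
        sb.append("</").append(element).append('>');
        return sb.toString();
    }

    // Minimal escaping so extracted text cannot break the markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A caller would build a `Node` tree from whatever the extractor reports and pass the root to `render`; a real implementation would also need to handle untagged content and malformed nesting.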

      Some references:

      https://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care/

      https://www.adobe.com/accessibility/products/acrobat/pdf-repair-add-tags.html

      https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/

            People

            • Assignee: Tim Allison (tallison)
            • Reporter: Tim Allison (tallison)
            • Votes: 0
            • Watchers: 3
