[TIKA-3026] Consider extracting structure/tags where possible in PDFs with the PDFMarkedContentExtractor - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.24
Component/s: None
Labels: None

Description

Some PDFs contain structure tags that can help in understanding how the elements within the PDF are organized, e.g. table markup, paragraph breaks, and headers.

The quality of the tags depends entirely on the software and the person generating the PDF; there are no guarantees. Nevertheless, it might be useful in some cases for users to be able to extract content together with its structure tags.

Some references:

https://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care/

https://www.adobe.com/accessibility/products/acrobat/pdf-repair-add-tags.html

https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/