[TIKA-973] PDF form data isn't included in extracted content. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2
Fix Version/s: 1.5
Component/s: general
Labels: None

Description

When Tika extracts content from a PDF, the PDF's form data (AcroForm field values) is not included in the extracted text.

The following code extracts this data via PDFBox, but it seems like something Tika should be doing itself.

// Assumes `load` is an open PDDocument and `documentContent` is a StringBuilder.
PDDocumentCatalog docCatalog = load.getDocumentCatalog();
if (docCatalog != null) {
    PDAcroForm acroForm = docCatalog.getAcroForm();
    if (acroForm != null) {
        @SuppressWarnings("unchecked")
        List<PDField> fields = acroForm.getFields();
        if (fields != null && fields.size() > 0) {
            documentContent.append(" ");
            // Append each non-null field value, separated by spaces.
            for (PDField field : fields) {
                if (field.getValue() != null) {
                    documentContent.append(field.getValue());
                    documentContent.append(" ");
                }
            }
        }
    }
}
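
The loop above visits only the fields returned directly by the form. PDF form fields can be nested, so a fuller extraction would probably need to walk child fields as well. The following is a minimal sketch of that traversal in plain Java. `FormTextSketch` and its nested `Field` class are hypothetical stand-ins invented for illustration; they are not PDFBox or Tika classes, and this is not necessarily how the eventual fix works.

```java
import java.util.ArrayList;
import java.util.List;

final class FormTextSketch {

    // Hypothetical stand-in for a form field: a name, an optional value, and child fields.
    static final class Field {
        final String name;
        final String value; // null when the field carries no value
        final List<Field> kids = new ArrayList<>();

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }

        Field add(Field kid) {
            kids.add(kid);
            return this;
        }
    }

    // Depth-first walk: append every non-null value followed by a space, then recurse into children.
    static void appendValues(List<Field> fields, StringBuilder out) {
        if (fields == null) {
            return;
        }
        for (Field field : fields) {
            if (field.value != null) {
                out.append(field.value).append(' ');
            }
            appendValues(field.kids, out);
        }
    }

    // Convenience wrapper that returns the collected values as one trimmed string.
    static String extract(List<Field> fields) {
        StringBuilder out = new StringBuilder();
        appendValues(fields, out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Field name = new Field("name", null)
                .add(new Field("first", "Jane"))
                .add(new Field("last", "Doe"));
        List<Field> top = new ArrayList<>();
        top.add(name);
        top.add(new Field("city", "Austin"));
        System.out.println(extract(top)); // prints "Jane Doe Austin"
    }
}
```

With a flat loop like the one in the description, only "Austin" would be collected from this example, because the parent field `name` has no value of its own.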

Attachments

- TIKA-973-patch.tar.gz, 27/Jun/13 01:30, 469 kB, Tim Allison
- TIKA-973.patch.tar.gz, 02/Jul/13 13:16, 524 kB, Tim Allison
- i-9_screenshot.png, 27/Jun/13 15:37, 227 kB, Tim Allison

People

Assignee: Tim Allison

Reporter: Michael Graessle

Votes: 0

Watchers: 3

Dates

Created: 09/Aug/12 19:53

Updated: 25/Mar/14 16:21

Resolved: 04/Feb/14 23:16