[TIKA-973] PDF form data isn't included in extracted content. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2
Fix Version/s: 1.5
Component/s: general
Labels: None

Description

When extracting content from PDFs, PDF form data isn't extracted.

The following code extracts this data via PDFBox, but it seems like something Tika should be doing itself.

PDDocumentCatalog docCatalog = load.getDocumentCatalog();
if (docCatalog != null) {
    PDAcroForm acroForm = docCatalog.getAcroForm();
    if (acroForm != null) {
        @SuppressWarnings("unchecked")
        List<PDField> fields = acroForm.getFields();
        if (fields != null && fields.size() > 0) {
            documentContent.append(" ");
            for (PDField field : fields) {
                if (field.getValue() != null) {
                    documentContent.append(field.getValue());
                    documentContent.append(" ");
                }
            }
        }
    }
}
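
The loop above appears to visit only the fields returned directly by the form, so values held by nested child fields could be missed when a PDF arranges its form fields as a tree. A minimal sketch of a recursive walk follows. It is plain Java with no PDFBox dependency: `FormField`, `FormFieldWalk`, and `collectValues` are hypothetical stand-ins invented for illustration, not PDFBox or Tika types, and this is not necessarily how the eventual fix was implemented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FormFieldWalk {

    /** Hypothetical stand-in for a PDF form field; NOT a PDFBox type. */
    static final class FormField {
        final String name;
        final String value;         // may be null, e.g. for a pure container field
        final List<FormField> kids; // empty for terminal fields

        FormField(String name, String value, List<FormField> kids) {
            this.name = name;
            this.value = value;
            this.kids = kids;
        }
    }

    /** Appends every non-null value in the field tree, depth first, each followed by a space. */
    static void collectValues(List<FormField> fields, StringBuilder out) {
        if (fields == null) {
            return;
        }
        for (FormField field : fields) {
            if (field.value != null) {
                out.append(field.value).append(' ');
            }
            // Descend into child fields so nested values are not skipped.
            collectValues(field.kids, out);
        }
    }

    public static void main(String[] args) {
        List<FormField> none = Collections.emptyList();
        FormField first = new FormField("first", "Ada", none);
        FormField last = new FormField("last", "Lovelace", none);
        // Container field with no value of its own, only children.
        FormField name = new FormField("name", null, Arrays.asList(first, last));
        FormField city = new FormField("city", "London", none);

        StringBuilder documentContent = new StringBuilder();
        collectValues(Arrays.asList(name, city), documentContent);
        System.out.println(documentContent.toString().trim());
    }
}
```

With real PDFBox objects, the same shape would need the library's own accessor for child fields and handling for any checked exceptions that reading a value may throw.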

Attachments

- i-9_screenshot.png, 27/Jun/13 15:37, 227 kB, Tim Allison
- TIKA-973.patch.tar.gz, 02/Jul/13 13:16, 524 kB, Tim Allison
- TIKA-973-patch.tar.gz, 27/Jun/13 01:30, 469 kB, Tim Allison

People

Assignee: Tim Allison

Reporter: Michael Graessle

Votes: 0

Watchers: 3

Dates

Created: 09/Aug/12 19:53

Updated: 25/Mar/14 16:21

Resolved: 04/Feb/14 23:16
