[TIKA-861] Parse links in PDF - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0
Fix Version/s: 1.2
Component/s: parser
Labels:
- links
- pdfbox

Description

Currently, the XHTML output that Tika produces for PDFs doesn't contain links, even though PDFBox parses them. I'm new to Tika and haven't written Java in six years, but someone more experienced could probably add this in a few hours.

The PDF2XHTML class already loops through each page's annotations.

See line 136:

for (Object o : page.getAnnotations()) {

I found some code for dealing with links in annotations:
http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

It involves checking whether each annotation is an instance of PDAnnotationLink:

if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;

I hope this helps someone.
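
The idea above can be sketched roughly as follows. This is only an illustration of the instanceof check and of turning each link into an XHTML anchor, not the attached patch and not Tika's actual code. To keep it runnable without PDFBox on the classpath, `Annotation` and `LinkAnnotation` are hypothetical stand-ins for PDFBox's annotation classes, and `linksToXhtml` and `escape` are made-up helper names. With the real library, the URI would presumably be read from the link's action (for example a PDActionURI) rather than from a plain field.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: stand-in types model the PDFBox annotation hierarchy.
final class LinkAnnotationSketch {

    // Hypothetical stand-in for a generic PDF annotation.
    static class Annotation {}

    // Hypothetical stand-in for PDAnnotationLink; the real class exposes
    // its target through an action object, not a plain field.
    static class LinkAnnotation extends Annotation {
        final String uri;
        LinkAnnotation(String uri) { this.uri = uri; }
    }

    // Escape the characters that are unsafe in XHTML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Loop over a page's annotations, keep only the links that carry a URI,
    // and render each one as an XHTML anchor element.
    static List<String> linksToXhtml(List<?> annotations) {
        List<String> anchors = new ArrayList<>();
        for (Object o : annotations) {
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                if (link.uri != null) {
                    String safe = escape(link.uri);
                    anchors.add("<a href=\"" + safe + "\">" + safe + "</a>");
                }
            }
        }
        return anchors;
    }

    public static void main(String[] args) {
        List<Object> annotations = new ArrayList<>();
        annotations.add(new Annotation());                       // skipped: not a link
        annotations.add(new LinkAnnotation("http://example.com/")); // rendered as an anchor
        for (String anchor : linksToXhtml(annotations)) {
            System.out.println(anchor);
        }
    }
}
```

In a real parser the anchors would more likely be emitted through the SAX content handler that builds the XHTML than assembled as strings; the string form is used here only to keep the sketch short.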

Attachments

- TIKA-861.patch, 2 kB, 23/Apr/12 21:54, Ryan Quam
- TIKA-861-test.patch, 0.8 kB, 24/Apr/12 13:37, Ryan Quam

People

Assignee: Unassigned

Reporter: Sasha Goodman

Votes: 2

Watchers: 2

Dates

Created: 14/Feb/12 01:17

Updated: 27/Apr/12 14:01

Resolved: 27/Apr/12 14:01

Time Tracking

Estimated, Remaining, Logged: Not Specified