[TIKA-2702] Different behavior between TIKA and pdfbox - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.18
Fix Version/s: None
Component/s: app
Labels: None

Description

As far as I understand, Tika uses PDFBox to extract text from PDF files.

In a side benchmark I'm running, the text I get from PDFBox 2.0.9 and the text I get from Tika are not 100% the same. In most cases, when there is a hyperlink inside the PDF file, PDFBox ignores the link itself, while Tika extracts its text, for example:

https://www.linkedin.com/in/jhonDo
[jhondo@yahoo.com |mailto:jhondo@yahoo.com]

This is a deal breaker for me: I use PDFBox in another process and need the text to be identical, so I can't use Tika at the moment.
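
One way to pin down exactly which lines differ between the two outputs is a small line-level comparison. The following is a minimal sketch using only the Java standard library. The class name `ExtractionDiff`, the method `extraLines`, and the sample strings are hypothetical and are not part of Tika or PDFBox. It assumes each tool's text has already been extracted into a string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtractionDiff {

    /**
     * Returns the non-blank, trimmed lines that appear in {@code candidate}
     * but nowhere in {@code baseline}. Order follows {@code candidate}.
     */
    public static List<String> extraLines(String baseline, String candidate) {
        // Collect every trimmed baseline line for fast membership checks.
        Set<String> known = new HashSet<>();
        for (String line : baseline.split("\\R")) {
            known.add(line.trim());
        }
        // Keep candidate lines that the baseline does not contain.
        List<String> extra = new ArrayList<>();
        for (String line : candidate.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !known.contains(trimmed)) {
                extra.add(trimmed);
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Hypothetical sample strings standing in for the two extractors' output.
        String fromPdfBox = "John Doe\nSoftware Engineer\n";
        String fromTika = "John Doe\nSoftware Engineer\nhttps://www.linkedin.com/in/jhonDo\n";
        for (String line : extraLines(fromPdfBox, fromTika)) {
            System.out.println("only in Tika: " + line);
        }
    }
}
```

Running `main` with these sample strings reports the hyperlink line as the only one present in the second text. The comparison is set-based, so it ignores line order and repeated lines; a proper diff would be needed to catch those.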

People

Assignee: Unassigned
Reporter: Lior
Votes: 0
Watchers: 2

Dates

Created: 01/Aug/18 12:05
Updated: 03/Aug/18 17:06
Resolved: 03/Aug/18 17:06