Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.18
None
None
Description
As far as I understand, Tika uses PDFBox to extract text from PDF files.
During a side benchmark, I noticed that the text I get from PDFBox 2.0.9 and the text I get from Tika are not 100% the same. In most cases, when there is a hyperlink inside the PDF file, PDFBox ignores the link itself, while Tika extracts its text, for example:
https://www.linkedin.com/in/jhonDo
[jhondo@yahoo.com |mailto:jhondo@yahoo.com]
This is a real deal breaker for me: I use PDFBox in another process and need the text to be identical, so I can't use Tika at the moment.
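For anyone trying to reproduce the comparison, here is a minimal sketch of how the two outputs could be diffed line by line to isolate what Tika adds. It uses plain Java only and does not call PDFBox or Tika; the class name `ExtractionDiff`, the method `extraLines`, and the sample strings in `main` (apart from the LinkedIn URL quoted above) are hypothetical placeholders, and the real input would be the text each extractor returns for the same PDF.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtractionDiff {

    // Returns the trimmed, non-empty lines of `candidate` that never occur in `baseline`.
    // Hypothetical helper for illustration; not part of PDFBox or Tika.
    static List<String> extraLines(String baseline, String candidate) {
        Set<String> known = new HashSet<>();
        for (String line : baseline.split("\\R")) {
            known.add(line.trim());
        }
        List<String> extra = new ArrayList<>();
        for (String line : candidate.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !known.contains(trimmed)) {
                extra.add(trimmed);
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the two extractors' output.
        String pdfboxText = "John Doe\nSoftware Engineer";
        String tikaText = "John Doe\nhttps://www.linkedin.com/in/jhonDo\nSoftware Engineer";

        // Print every line that only shows up in the Tika output.
        for (String line : extraLines(pdfboxText, tikaText)) {
            System.out.println(line);
        }
    }
}
```

A set-based comparison like this ignores line order and duplicates, which is enough to spot extra hyperlink lines but is not a full diff.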