Tika / TIKA-2702

Different behavior between TIKA and pdfbox


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.18
    • Fix Version/s: None
    • Component/s: app
    • Labels: None

    Description

      As far as I understand, Tika uses PDFBox to extract text from PDF files.

      In a side benchmark I'm running, the text I get from PDFBox 2.0.9 directly and the text I get from Tika are not 100% identical. In most cases, when the PDF contains a hyperlink, PDFBox ignores the link itself, while Tika extracts it as text. For example:

      https://www.linkedin.com/in/jhonDo
      [jhondo@yahoo.com |mailto:jhondo@yahoo.com]

       

      This is a real deal breaker for me: I use PDFBox in another process and need the extracted text to be identical, so I can't use Tika at the moment.
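
One way to pin down where the two outputs diverge is to diff them line by line. The following is a minimal sketch, assuming both extractions have already been saved as plain strings. The function name `lines_only_in_second` and the sample strings are hypothetical and are not part of Tika or PDFBox; the sample text only mimics the kind of difference described above.

```python
import difflib
from typing import List


def lines_only_in_second(first: str, second: str) -> List[str]:
    """Return the lines of `second` that the diff marks as inserted or
    replaced relative to `first`, in their original order."""
    # Normalize: strip surrounding whitespace and drop empty lines so that
    # pure layout differences do not show up as content differences.
    a = [line.strip() for line in first.splitlines() if line.strip()]
    b = [line.strip() for line in second.splitlines() if line.strip()]

    extra: List[str] = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extra.extend(b[j1:j2])
    return extra


# Hypothetical sample outputs, standing in for real extraction results.
pdfbox_text = "John Doe\nSoftware Engineer\n"
tika_text = (
    "John Doe\n"
    "Software Engineer\n"
    "https://www.linkedin.com/in/jhonDo\n"
    "mailto:jhondo@yahoo.com\n"
)

extra_lines = lines_only_in_second(pdfbox_text, tika_text)
for line in extra_lines:
    print(line)
```

Running this over a batch of files would show whether the extra lines are consistently link targets or whether other differences are mixed in.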


          People

            Assignee: Unassigned
            Reporter: Yaffe Lior
            Votes: 0
            Watchers: 2
