Tika / TIKA-911

Converted PDF document contains question marks in place of spaces and inconsistent case

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with Tika 1.1 using

      $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
      

      produces substantially worse output than xpdf's pdftotext program.

      Specifically, we see...

      Some 'spaces' replaced with question marks

      ...
      <body><div class="page"><p/>
      <p>How can I help?
      When you're overseas:
      • ?wherever?possible,?don't?visit?crops?—?contact?with?
      </p>
      <p>growing?crops?greatly?increases?the?risk?of?contaminating?
      footwear?or?clothing;?
      ...
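
To quantify this symptom across a larger output file, a rough heuristic can flag lines where question marks appear to stand in for spaces. This is only a sketch: the function name and the threshold are hypothetical, and the sample lines are copied from the output above. It does not identify which character in the PDF is being turned into a question mark.

```python
def looks_space_mangled(line: str, min_marks: int = 3) -> bool:
    """Heuristic: many '?' characters and fewer real spaces than '?'."""
    marks = line.count("?")
    spaces = line.strip().count(" ")
    return marks >= min_marks and marks > spaces


# Lines taken from the Tika output quoted in this report.
sample = [
    "How can I help?",
    "When you're overseas:",
    "• ?wherever?possible,?don't?visit?crops?—?contact?with?",
    "growing?crops?greatly?increases?the?risk?of?contaminating?",
    "footwear?or?clothing;?",
]

flagged = [line for line in sample if looks_space_mangled(line)]
for line in flagged:
    print(line)
```

A genuine question such as "How can I help?" is not flagged because it has more spaces than question marks.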
      

      and some odd case conversions

      <p>stem rust in wheat.  
       (soURce: BRAd collIs)</p>
      <p/>
      </div>
      

      (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
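
The case problem can be spotted mechanically with a small check like the one below. It is a sketch under a simple assumption: a word is treated as "regular" if it is all lower case, all upper case, or capitalised, and anything else is reported. The function name is hypothetical, and legitimate mixed-case names (camelCase identifiers, for example) would also be reported.

```python
import re


def irregular_case_words(text: str) -> list[str]:
    """Return ASCII words whose casing is neither lower, UPPER, nor Title."""
    words = re.findall(r"[A-Za-z]+", text)
    return [
        w for w in words
        if not (w.islower() or w.isupper() or w.istitle())
    ]


# Tika output from this report versus the pdftotext rendering.
print(irregular_case_words("(soURce: BRAd collIs)"))
print(irregular_case_words("Source: BRAD COLLIS"))
```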

      For comparison, running pdftotext on the same file:

      $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
      

      This does not output the question marks, and produces "Source: BRAD COLLIS" at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters, which are not desirable.
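
As a workaround for the pdftotext side of the comparison, the stray characters can be counted and removed after extraction. The sketch below assumes the ^G characters are literal BEL (U+0007) control characters in the output text. The function name is hypothetical, and common whitespace controls (newline, carriage return, tab, form feed) are deliberately kept.

```python
import unicodedata

# Control characters that are normal in plain-text output and should survive.
KEEP = {"\n", "\r", "\t", "\f"}


def strip_control_chars(text: str) -> tuple[str, int]:
    """Remove control characters (Unicode category Cc) except those in KEEP.

    Returns the cleaned text and the number of characters removed.
    """
    kept = []
    removed = 0
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in KEEP:
            removed += 1
        else:
            kept.append(ch)
    return "".join(kept), removed


cleaned, count = strip_control_chars("stem rust\x07 in wheat.\n")
print(repr(cleaned), count)
```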

      1. Rust Biosecurity Brochure.pdf.html
        6 kB
        Matt Sheppard
      2. Rust Biosecurity Brochure.pdf
        738 kB
        Matt Sheppard

          People

          • Assignee:
            Unassigned
            Reporter:
            Matt Sheppard
          • Votes:
            0
            Watchers:
            0
