TIKA-911

Converted PDF document contains question marks in place of spaces and inconsistent case

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with Tika v1.1 using

      $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
      

      produces substantially worse output than xpdf's pdftotext program.

      Specifically, we see...

      Some 'spaces' replaced with question marks

      ...
      <body><div class="page"><p/>
      <p>How can I help?
      When you're overseas:
      • ?wherever?possible,?don't?visit?crops?—?contact?with?
      </p>
      <p>growing?crops?greatly?increases?the?risk?of?contaminating?
      footwear?or?clothing;?
      ...
      

      and some odd case conversions

      <p>stem rust in wheat.  
       (soURce: BRAd collIs)</p>
      <p/>
      </div>
      

      (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
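Both symptoms can be spotted mechanically in the extracted text. The following is a minimal sketch, not part of Tika; the function names `question_marks_replace_spaces` and `odd_case_words` and the `min_marks` threshold are hypothetical, chosen only for illustration. It flags lines in which question marks outnumber whitespace, and words whose letter case is neither all lower, all upper, nor capitalized.

```python
import re


def question_marks_replace_spaces(line, min_marks=3):
    """Heuristic (assumption): a line is suspicious when it has at least
    `min_marks` question marks and more question marks than whitespace."""
    marks = line.count("?")
    spaces = sum(1 for ch in line if ch.isspace())
    return marks >= min_marks and marks > spaces


def odd_case_words(text):
    """Return alphabetic words that are neither lower, UPPER, nor Capitalized."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words
            if not (w.islower() or w.isupper() or w.istitle())]


if __name__ == "__main__":
    # Sample lines taken from the Tika output quoted in this report.
    print(question_marks_replace_spaces(
        "growing?crops?greatly?increases?the?risk?of?contaminating?"))  # True
    print(question_marks_replace_spaces("How can I help?"))  # False
    print(odd_case_words("(soURce: BRAd collIs)"))  # ['soURce', 'BRAd', 'collIs']
```

Both checks are heuristics: legitimate text with several questions on one line, or intentionally mixed-case names, would also be flagged.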

      To compare that with pdftotext:

      $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
      

      This output contains no question marks and ends with "Source: BRAD COLLIS", both of which seem to be improvements. Note, however, that it also contains a number of ^G characters, which are not desirable.
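The comparison between the two tools can also be sketched in code. The example below is an illustration under stated assumptions, not part of Tika or xpdf: the helper names `case_only_differences` and `count_control_chars` are hypothetical, and both functions take text that has already been extracted by each tool. The first lists words that appear in the reference output only with different letter case; the second counts control characters such as ^G (BEL, 0x07).

```python
import re


def case_only_differences(ours, reference):
    """List (word, reference_forms) for words in `ours` that occur in
    `reference` only with a different letter case."""
    ref_forms = {}
    for word in re.findall(r"[A-Za-z]+", reference):
        ref_forms.setdefault(word.lower(), set()).add(word)
    diffs = []
    for word in re.findall(r"[A-Za-z]+", ours):
        forms = ref_forms.get(word.lower())
        if forms and word not in forms:
            diffs.append((word, sorted(forms)))
    return diffs


def count_control_chars(text, allowed="\n\r\t"):
    """Count characters below U+0020, excluding the `allowed` ones
    (newline, carriage return, and tab by default; an assumption)."""
    return sum(1 for ch in text if ord(ch) < 32 and ch not in allowed)


if __name__ == "__main__":
    tika_text = "(soURce: BRAd collIs)"       # from the Tika output above
    pdftotext_text = "Source: BRAD COLLIS"    # from the pdftotext output above
    print(case_only_differences(tika_text, pdftotext_text))
    print(count_control_chars("page one\x07page two\x07\n"))  # 2
```

Matching words by lowercased form, rather than by position, keeps the comparison usable when the two tools emit text in a different order.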

      Attachments

        1. Rust Biosecurity Brochure.pdf
          738 kB
          Matt Sheppard
        2. Rust Biosecurity Brochure.pdf.html
          6 kB
          Matt Sheppard

          People

            Assignee: Unassigned
            Reporter: Matt Sheppard (mattsheppard)