TIKA-911

Converted PDF document contains question marks in place of spaces and inconsistent case

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with tika v1.1 using

      $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
      

      produces substantially worse output than xpdf's pdftotext program.

      Specifically, we see...

      Some 'spaces' replaced with question marks

      ...
      <body><div class="page"><p/>
      <p>How can I help?
      When you're overseas:
      • ?wherever?possible,?don't?visit?crops?—?contact?with?
      </p>
      <p>growing?crops?greatly?increases?the?risk?of?contaminating?
      footwear?or?clothing;?
      ...
      

      and some odd case conversions

      <p>stem rust in wheat.  
       (soURce: BRAd collIs)</p>
      <p/>
      </div>
      

      (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)

      To compare that with pdftotext

      $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
      

      This does not output the question marks, and it produces "Source: BRAD COLLIS" at the end, both of which seem to be improvements. Note, however, that it also produces a number of ^G characters, which are not desirable.

      1. Rust Biosecurity Brochure.pdf.html
        6 kB
        Matt Sheppard
      2. Rust Biosecurity Brochure.pdf
        738 kB
        Matt Sheppard

        Activity

        Matt Sheppard added a comment -

        Attached the PDF document in case it is removed from the source site.

        Michael McCandless added a comment -

        Hmm, I can't reproduce these issues.

        I downloaded the PDF from the URL, downloaded tika-app-1.1.jar, ran java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf, and I don't see the ? for spaces nor the mixed casing. I'm using Java 1.7.0_04 on Ubuntu 12.04.

        Matt Sheppard added a comment -

        Interesting - I was running Mac OS 10.7.3. Will confirm the version of java when I'm back in the office.

        Matt Sheppard added a comment -

        Confirmed that it still occurs for me on a different mac (with freshly downloaded PDF and tika-app-1.1.jar).

        mercury:Downloads matt$ system_profiler SPSoftwareDataType
        Software:
        
            System Software Overview:
        
              System Version: Mac OS X 10.7.3 (11D50d)
              Kernel Version: Darwin 11.3.0
              Boot Volume: Macintosh HD
              Boot Mode: Normal
              Computer Name: Mercury
              User Name: Matthew Sheppard (matt)
              Secure Virtual Memory: Enabled
              64-bit Kernel and Extensions: Yes
              Time since boot: 3 days 1:10
        
        mercury:Downloads matt$ java -version
        java version "1.6.0_31"
        Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635)
        Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)
        mercury:Downloads matt$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf 
        <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
        <head>
        <meta name="xmpTPg:NPages" content="2"/>
        <meta name="Creation-Date" content="2008-06-06T02:53:07Z"/>
        <meta name="trapped" content="False"/>
        <meta name="created" content="Fri Jun 06 12:53:07 EST 2008"/>
        <meta name="Content-Length" content="755665"/>
        <meta name="Last-Modified" content="2008-06-06T02:53:23Z"/>
        <meta name="producer" content="Adobe PDF Library 7.0"/>
        <meta name="Content-Type" content="application/pdf"/>
        <meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
        <meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
        <title/>
        </head>
        <body><div class="page"><p/>
        <p>How can I help?
        When you’re overseas:
        • �wherever�possible,�don’t�visit�crops�—�contact�with�
        </p>
        <p>growing�crops�greatly�increases�the�risk�of�contaminating�
        footwear�or�clothing;�
        ...[snip]...
        <p>Initial detection  
        points of exotic wheat 
        rust incursions
        </p>
        <p>stem rust in wheat.  
         (soURce: BRAd collIs)</p>
        <p/>
        </div>
        </body></html>
        

        Note that the ?s reported appear to display differently on this machine.

        Will attach a copy of the output as a file for reference.

        Michael McCandless added a comment -

        So strange ... I tested on a Mac (10.6.8) with Java 1.6.0_31, and I don't see the ? for spaces nor the mixed case.

        Hmm, my header has a different Content-Length than yours:

        <meta name="xmpTPg:NPages" content="2"/>
        <meta name="Creation-Date" content="2012-05-02T10:25:00Z"/>
        <meta name="created" content="Wed May 02 06:25:00 EDT 2012"/>
        <meta name="Content-Length" content="639985"/>
        <meta name="Last-Modified" content="2012-05-02T10:25:00Z"/>
        <meta name="producer" content="Mac OS X 10.6.8 Quartz PDFContext"/>
        <meta name="Content-Type" content="application/pdf"/>
        <meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
        <meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
        

        OK! If I used the PDF attached to the issue, I indeed see these problems (I had downloaded from the web site). Maybe the web site has since changed/fixed the PDF? Hmm.

        So, the extra characters (where there should be spaces) are U+FFFD (the Unicode replacement character); Tika outputs this whenever there is a character it can't safely output into the XHTML (this is done in SafeContentHandler.java). Tika used to (before 0.10) simply replace such characters with a space (ASCII 32), so to get back to the pre-0.10 behaviour you can replace U+FFFD with a space.

        Not sure about the mixed case issue...
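
        A minimal sketch of that workaround, assuming the extracted text is already available as a String. The class and method names below are hypothetical and are not part of Tika; this only post-processes the output and does not change how Tika itself handles such characters.

```java
public class ReplacementCharCleaner {
    // U+FFFD is the Unicode replacement character that shows up where spaces were expected.
    private static final char REPLACEMENT_CHAR = '\uFFFD';

    /** Returns the text with every U+FFFD replaced by an ASCII space (code point 32). */
    public static String replaceWithSpaces(String text) {
        return text.replace(REPLACEMENT_CHAR, ' ');
    }

    public static void main(String[] args) {
        // Sample taken from the output reported in this issue.
        String extracted = "growing\uFFFDcrops\uFFFDgreatly";
        System.out.println(replaceWithSpaces(extracted));
    }
}
```

        Note that this replaces every U+FFFD, including any that stand in for characters other than spaces, so it may hide other extraction problems.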


          People

          • Assignee:
            Unassigned
            Reporter:
            Matt Sheppard
          • Votes:
            0
            Watchers:
            0
